Monday, March 20, 2017

A Consumer’s Response to Amazon S3 Service Disruption of 2017

Only a handful of events across the Internet are impactful enough to become a topic that every news agency, blogger, and technology professional talks about. One of those events happens to be an interruption to Amazon Web Services (AWS). Chances are you remember where you were when one of these events happened, either as a consumer of a service built on AWS or as a consumer of the affected AWS service itself. In late winter of 2017, Amazon had an incident with their S3 service that ended up impacting most of their services in the us-east-1 region. Here are some thoughts on Amazon’s public response to that outage.

Background

First, I encourage you to read through Amazon’s summary of the incident, especially if you are unaware of it. It is a great summary of the event and what led up to it. I want to pick out a few values in the response that those of us in the industry should take to heart.

Observations of Values

When reading the response from Amazon, I could not help but notice how transparent the tone of the correspondence was. The summary starts off by clearly stating that an associate at the organization performed an action that directly triggered the event. There was no sugar coating, diversion, or deflection. They did not blame computers, blame some third party, or throw their associate under the proverbial bus. As an organization, they owned the event and stated that a qualified associate simply made an error. As an error-prone human who has worked on production systems for several decades, I could not help but empathize with that associate. The open admission of a misstep, and the focus on moving past it to what can be learned, was forward thinking.
Throughout the summary the focus was on what the assumptions were and why the result did not match the assumption. While reading, it was hard not to pick up on the blameless language that was used. For example, take this excerpt from their summary:
“While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.”
Amazon built in some resiliency and regularly practiced small destructive events to ensure resiliency, recovery, availability, and stability. They went on to suggest that the system failed the people. Rather than blaming the associate, the process, or some outdated documentation, AWS instead highlighted their mission to blamelessly make their associates successful. How? They indicated they modified some practices to “remove capacity more slowly and added safeguards to prevent capacity from being removed…”. Further on, AWS admitted they eat their own dog food, which ironically impacted their ability to post status updates for their services: “…we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3.” These are very important observations, and so is what they indicated they learned from them.
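To make that kind of safeguard concrete, here is a minimal sketch of what a capacity-removal guardrail might look like. It is purely illustrative; the names (Fleet, plan_removal) and thresholds are assumptions of mine, not Amazon’s actual tooling. The idea is simply to rate-limit removals and enforce a minimum-capacity floor so a mistyped command cannot take out too much of a fleet at once.

    # Hypothetical sketch of a capacity-removal safeguard; not AWS's actual tooling.
    # A removal request is checked against a rate limit and a minimum-capacity
    # floor before any hosts are taken out of service.

    from dataclasses import dataclass

    @dataclass
    class Fleet:
        name: str
        active_hosts: int
        min_hosts: int             # floor the subsystem must never drop below
        max_removal_per_step: int  # rate limit: remove capacity slowly

    def plan_removal(fleet: Fleet, requested: int) -> int:
        """Return how many hosts may actually be removed right now."""
        if requested <= 0:
            return 0
        # Never remove more than the per-step rate limit allows.
        allowed = min(requested, fleet.max_removal_per_step)
        # Never let the fleet fall below its minimum required capacity.
        allowed = min(allowed, fleet.active_hosts - fleet.min_hosts)
        return max(allowed, 0)

    # Example: an operator mistypes and asks to remove 80 hosts instead of 8.
    index_fleet = Fleet(name="index", active_hosts=100, min_hosts=90, max_removal_per_step=5)
    print(plan_removal(index_fleet, 80))  # -> 5, not 80

The point of a guardrail like this is that the human still makes the decision, but the system refuses to let a single slip become a regional outage.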
Numerous times throughout the summary, Amazon articulated where an assumption broke down, and then identified an actionable improvement to empower their associates to make better-informed decisions. For example:
“By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.”
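To illustrate the cell idea in that excerpt, here is a hedged sketch of how requests might be mapped to independent cells so that a failure is contained to a fraction of traffic. The cell count and hashing scheme below are assumptions for illustration, not a description of S3’s internals.

    # Hypothetical illustration of cell-based partitioning; not S3's actual design.
    # Requests are deterministically mapped to one of N independent cells, so a
    # failed or restarting cell only affects the fraction of keys it owns.

    import hashlib

    NUM_CELLS = 8  # assumed cell count for the example

    def cell_for_key(key: str, num_cells: int = NUM_CELLS) -> int:
        """Map a key to a cell via a stable hash."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_cells

    # If cell 3 is down for a restart, only keys that hash to cell 3 are affected;
    # the blast radius is roughly 1/NUM_CELLS of traffic instead of everything.
    print(cell_for_key("customer-bucket/object-42"))

Smaller cells also mean each cell can be fully restarted and its recovery rehearsed, which is exactly the gap the summary says the index subsystem had grown past.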

Thoughts

No matter how well you prioritize your work queue, there is always an opportunity cost. Sometimes we choose wisely, and sometimes, even when the choice was wise, the result has visible impact. I was comforted in knowing that some of the most talented and forward-thinking engineers and leaders in the industry are just as human as I am and make mistakes. It is not the avoidance of mistakes that separates you, but rather how you handle the mistakes and move forward.
As humans we all make decisions, some easier than others. At Amazon, they appear to set their associates up to be successful by allowing them to make educated choices and by planning for possible human error. They achieve that by transparently owning each incident and blamelessly evaluating it to identify areas where they can continuously improve.
Face it, this kind of incident could have easily happened to you. Like you, the engineers at AWS juggle many items at the same time and show up to work to do a good job and make a difference. Just like AWS, you too will make a mistake that impacts your customers or patrons. Questions you should ask yourself include: have you set up your team, colleagues, and partners for success? Are you transparently admitting your weak points, owning them, and taking the opportunity to keep improving? Are you fostering a blameless culture to help empower future success? The organization I work for is venturing to answer these questions; how empowering!