Wednesday, December 6, 2017

Adding Google Cloud Package to your apt sources via cloud-init

Install kubectl from Google via cloud-init

Quick-answer:

You need to add this to your cloud-init:
 apt:
   sources:
     google.list:
       source: deb http://apt.kubernetes.io/ kubernetes-xenial main
       keyid: BA07F4FB
       keyserver: pgp.mit.edu

The TL;DR Story

Like you, I am a fan of cloud-init. It is a very straightforward way to handle sending metadata to cloud provider instances. A lot of changes have been made to cloud-init over the past few years, so I took some time to look into a few of them. I needed to install the Kubernetes (k8s) tools, and I wanted to use the Google Cloud Package deb repository as the source.

I could have used one of the many curl methods to install k8s, or some other manual bash method, but I wanted to do it the clean cloud-init way. I also tried installing the GCE tools and using gcloud to install kubectl, but I am an AWS user and that did not work well on my EC2 instance (it hung in dpkg and did not do anything).

Here is what I found to be a clean approach to my problem:

Set up the cloud-init apt: config

My cloud-init YAML for apt looks like the image below. I'll try to explain each of the major pieces needed for adding the Google repo. Note: I stopped using the older apt_sources: format and switched to the apt: format that is in cloud-init v17.x+.

Image of apt config
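Since the screenshot does not copy well as text, here is a rough sketch of what that apt: block looks like, with a comment on each field; the values simply mirror the quick answer above:

 # Sketch of the apt: section (values mirror the quick answer above)
 apt:
   sources:
     # Written out as /etc/apt/sources.list.d/google.list on the instance
     google.list:
       # The deb repository line for the Kubernetes packages
       source: deb http://apt.kubernetes.io/ kubernetes-xenial main
       # Short ID of the Google Cloud Packages Automatic Signing Key
       keyid: BA07F4FB
       # Optional: the keyserver cloud-init should fetch that key from
       keyserver: pgp.mit.edu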


  • google.list:  This is the name of the source-list file that will get created under /etc/apt/sources.list.d on your Ubuntu instance.
  • source:  This is the deb repo path.  I obtained this path from this guy.
  • keyid:  This was the tricky part. I used my gpg-keychain app on my Mac to search for the Google Cloud Packages Automatic Signing Key. I knew I had to find this key because of these documents. Once I found Google's entry in gpg-keychain, I got the Key ID as shown below and stuffed it into this field in my cloud-init.
    gpg-keychain showing the entry for Google Cloud Package key
  • keyserver:  I added this for good measure to make sure that cloud-init could find the key, since that is where my gpg-keychain app had found it. I probably did not need it (an alternative that skips the keyserver entirely is sketched just after this list).
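If you would rather not depend on a public keyserver at all, cloud-init also accepts the ASCII-armored key inline via a key: field on the source entry. A rough sketch, with a placeholder where the real key body would go:

 apt:
   sources:
     google.list:
       source: deb http://apt.kubernetes.io/ kubernetes-xenial main
       # Paste the ASCII-armored public key here instead of keyid/keyserver.
       # The block below is a placeholder, not the real Google key.
       key: |
         -----BEGIN PGP PUBLIC KEY BLOCK-----
         ...
         -----END PGP PUBLIC KEY BLOCK-----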

Making sure kubectl (Kubernetes) was installed

Simply adding the package to the cloud-init packages: list made sure it was installed. The list of packages below covers more than just k8s; I shared my whole list for reference.
cloud-init package: config example
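The screenshot of my full list does not copy as text, so here is a rough sketch of the packages: section; kubectl is the one entry that actually needs the Google repo above, and the other entries are illustrative stand-ins rather than my real list:

 packages:
   - kubectl      # the Kubernetes CLI, installed from the Google repo added above
   - docker.io    # illustrative extra, not necessarily from my real list
   - awscli       # illustrative extra, not necessarily from my real list
 package_update: true   # refresh the apt indexes so the new repo is seen before installing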

Logs to prove it

Here you can see that my repo was found and my packages were installed.
Logs showing that my Google apt repo was found and used

Monday, March 20, 2017

A Consumer’s Response to Amazon S3 Service Disruption of 2017


Only a handful of events across the Internet are impactful enough to become a topic that every news agency, blogger, and technology professional talks about. One of those events happens to be an interruption to Amazon’s Web Services platform, AWS. Chances are you remember where you were when one of these events happened, either as a consumer of a service that was impacted or as a direct consumer of the impacted AWS service. In late winter of 2017, Amazon had an incident with their S3 service that ended up impacting most of their services in the us-east-1 region. Here are some thoughts on Amazon’s public response to that outage.

Background

First, I encourage you to read through Amazon’s response to the incident, especially if you are unaware of it. It is a great summary of the event and what led up to it. I want to pick out a few values in the response that those of us in the industry should take to heart.

Observations of Values

When reading the response from Amazon, I could not help but notice that the tone of the correspondence was very transparent. The summary starts off by clearly stating that an associate at the organization performed an action that directly triggered the event. There was no sugar coating, diversion, or deflection. They did not blame computers, blame some third party, or throw their associate under the proverbial bus. As an organization, they owned the event and stated that a qualified associate simply made an error. As an error-prone human who has worked on production systems for several decades, I could not help but empathize with that associate. The open admission of a misstep, and the focus on moving past it to what could be learned, was forward thinking.
Throughout the summary, the focus was on what the assumptions were and why the result did not match the assumption. While reading, it was hard not to pick up on the blameless language that was used. For example, take this excerpt from their summary:
“While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.”
Amazon built in safeguards and regularly practiced small destructive events to ensure resiliency, recovery, availability, and stability. They continued on to suggest that the system failed the people. Rather than blaming the associate, the process, or some outdated documentation, AWS instead highlighted their mission to blamelessly make their associates successful. How? They indicated they modified some practices to “remove capacity more slowly and added safeguards to prevent capacity from being removed…”. Further on, AWS admitted they eat their own dog food, and that ironically impacted their ability to post status updates for their services: “…we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3.” These are very important observations, and so is what they indicated they learned from them.
Numerous times throughout the summary, Amazon articulated where an assumption broke down, but then identified an actionable improvement to empower their associates to be more successful in making educated decisions. Here is one such passage:
“By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.”

Thoughts

No matter how well you prioritize your work queue, there is always an opportunity cost. Sometimes we choose wisely, and sometimes, even if the choice was wise, the result has visible impact. I was comforted in knowing that some of the most talented and forward-thinking engineers and leaders in the industry are just as human as I am and make mistakes. It is not the avoidance of mistakes that separates you, but rather how you handle the mistakes and move forward.
As humans, we all make decisions, some easier than others. At Amazon, they appear to set up their associates to be successful by allowing them to make educated choices and by planning for possible human error. They achieve this by transparently owning each incident and blamelessly evaluating it to identify areas where they can continuously improve.
Face it, this kind of incident could have easily happened to you. Like you, the engineers at AWS juggle many items at the same time and show up to work to do a good job and make a difference. Just like AWS, you too will make a mistake that impacts your customers or patrons. Questions you should ask yourself include: have you set up your team, colleagues, and partners for success? Are you transparently admitting your weak points, owning them, and taking the opportunity to continue improving? Are you fostering a blameless culture to help empower future success? The organization I work for is venturing to answer these questions; how empowering!