No doubt we’ve all been affected by the recent Amazon Web Services outage in one form or probably many. It’s had coverage the Internet over… now that the Internet is again running. We’re also starting to get a picture of what went wrong.
This article isn’t a big I told you so about public cloud, nor is it a jab at Amazon or the poor Engineer who must have had a rough couple of days this week.
I’m hoping that this article serves as a bit of a public service announcement for the way that we all operate within our Enterprise and Private clouds so that the same doesn’t happen to those of us running the infrastructure within our organisations.
Amazon has posted a summary of what went down. I want to take a look at a specific excerpt:
At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
The details are thin, but our immediate reaction might be to jump on that poor S3 team member. I hope his or her week is looking up already and I’m not sure that they’re individually to blame as much as the automation.
The industry is talking about automation and orchestration being a necessity due to the scale and complexity of the in-house systems we have. It’s unreasonable to expect that the complexity of one of the biggest pieces of technical infrastructure can be maintained in the heads of the engineers running it — especially being as fluid as it is.
Any automation or orchestration needs to put bounds around what the end user can do. Whether it be a strong, technical engineer, an end user, or someone in between.
What Can We Learn From This?
We’re all talking a lot about orchestration and automation. We’ve got self-service portals for end users and we’re putting new tools in front of helpdesk staff. We’re having to automate big chunks of what used to be standard IT stuff because the scale is too big to do these things manually anymore.
Whether the automation or orchestration is designed for the most technical of folks, or whether it’s for someone without the technical chops, it’s important to make sure that the tooling only allows stuff that should be allowed and prevents everything else. It sounds simpler than it really is, but it’s worth considering and limiting the potential use cases to avoid tragedy in-house.
A Simple, Contrived Example
To illustrate the point, let’s assume that we have some orchestration to allow an operator to control RAID arrays in servers in our data centre. One of their responsibilities is to monitor the number of write cycles to the SSDs in the RAID array. When a drive crosses a threshold, the operator marks it failed and organises a replacement drive.
[I told you that this illustrative example would be contrived…]
These fictional arrays are made up of RAID 6 stripe sets, which as we know supports two drive failures.
Now let’s say that the operator sees three drives cross the threshold at the same time and pushes the big red button on each of them. We now have an incomplete RAID set and an outage.
Sure, in the real world, we wouldn’t remove non-failed disks from an array before a replacement was there and would only ever do one at a time. However, if we expand this type of cluster-like redundancy to hosts, clusters and geographic regions, we’re starting to see how quickly we’re going to lose visibility of all of the moving parts.
The point is that if you have a piece of automation that accepts input from someone (including yourself late, late at night), try to preempt the valid use cases and and restrict or limit the rest.
[Image by William Warby and used unmodified under CC 2.0]