A Very Particular Set Of Skills

Has anybody not heard of the recent ransomware attack known as WannaCry? No? Good. Hopefully you’re only aware of it through news articles, but for far too many folks, this is not the case.

We all keep our patches up to date and we all use various levels of protection to limit the attack surface and potential spread of these kinds of attacks.

Unfortunately, and for various reasons, these kinds of attacks can still wreak havoc.

When this does happen, it doesn’t have to ruin your year.

For an individual virtual machine affected by this, simply:

  1. Revert your affected VM back to a previous snapshot using SyncVM
  2. Start the VM disconnected from the network
  3. Apply any updates to close the exploited security hole
  4. Reconnect to the network
  5. Don’t pay the ransom

In cases where there are a very large number of affected VMs, a lot of this process can be automated.
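The per-VM steps above are just a loop waiting to happen. As a minimal sketch of what that automation might look like, here is some Python modelling the workflow with hypothetical in-memory stand-ins — real tooling would drive the hypervisor and SyncVM via PowerShell or a REST API rather than these placeholder objects.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a VM; real automation would talk to the
# hypervisor and SyncVM APIs instead of mutating an object like this.
@dataclass
class VM:
    name: str
    infected: bool = True
    network_connected: bool = True
    patched: bool = False
    running: bool = False

def revert_to_snapshot(vm: VM) -> None:
    vm.infected = False   # the snapshot predates the infection
    vm.running = False

def recover(vm: VM) -> VM:
    """Apply the manual recovery steps to a single VM."""
    revert_to_snapshot(vm)        # 1. revert to a clean snapshot
    vm.network_connected = False  # 2. start disconnected from the network
    vm.running = True
    vm.patched = True             # 3. apply the missing security updates
    vm.network_connected = True   # 4. reconnect to the network
    return vm                     # 5. and, of course, don't pay the ransom

# At scale, recovery is just iteration over the affected inventory.
affected = [VM(f"vm-{i:02d}") for i in range(3)]
recovered = [recover(vm) for vm in affected]
```

The same loop could fan out across hundreds of VMs with a thread pool or job scheduler once the per-VM logic is trustworthy.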

To misappropriate and misquote the famous speech from the movie Taken, our customers have a very particular set of skills. Skills we’ve been assisting them with over a long career.

[Ransom Note image by Sheila Sund and used unmodified under CC BY 2.0]

Could The AWS Outage Happen To You?

No doubt we’ve all been affected by the recent Amazon Web Services outage in one form or, more likely, many. It’s had coverage the Internet over… now that the Internet is running again. We’re also starting to get a picture of what went wrong.

This article isn’t a big “I told you so” about public cloud, nor is it a jab at Amazon or the poor engineer who must have had a rough couple of days this week.

I’m hoping that this article serves as a bit of a public service announcement about the way we all operate within our enterprise and private clouds, so that the same doesn’t happen to those of us running the infrastructure within our organisations.

What Happened?

Amazon has posted a summary of what went down. I want to take a look at a specific excerpt:

At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

The details are thin, but our immediate reaction might be to jump on that poor S3 team member. I hope his or her week is looking up already, and I’m not sure that they’re individually to blame as much as the automation is.

The industry is talking about automation and orchestration being a necessity due to the scale and complexity of our in-house systems. It’s unreasonable to expect the complexity of one of the biggest pieces of technical infrastructure to be maintained in the heads of the engineers running it, especially when it’s as fluid as it is.

Any automation or orchestration needs to put bounds around what its user can do, whether that user is a strong technical engineer, a non-technical end user, or someone in between.

What Can We Learn From This?

We’re all talking a lot about orchestration and automation. We’ve got self-service portals for end users and we’re putting new tools in front of helpdesk staff. We’re having to automate big chunks of what used to be standard IT stuff because the scale is too big to do these things manually anymore.

Whether the automation or orchestration is designed for the most technical of folks, or whether it’s for someone without the technical chops, it’s important to make sure that the tooling only allows stuff that should be allowed and prevents everything else. It sounds simpler than it really is, but it’s worth considering and limiting the potential use cases to avoid tragedy in-house.

A Simple, Contrived Example

To illustrate the point, let’s assume that we have some orchestration to allow an operator to control RAID arrays in servers in our data centre. One of their responsibilities is to monitor the number of write cycles to the SSDs in the RAID array. When a drive crosses a threshold, the operator marks it as failed and organises a replacement drive.

[I told you that this illustrative example would be contrived…]

These fictional arrays are made up of RAID 6 stripe sets, which, as we know, can tolerate two drive failures.

Now let’s say that the operator sees three drives cross the threshold at the same time and pushes the big red button on each of them. We now have an incomplete RAID set and an outage.

Sure, in the real world, we wouldn’t remove non-failed disks from an array before a replacement was there and would only ever do one at a time. However, if we expand this type of cluster-like redundancy to hosts, clusters and geographic regions, we’re starting to see how quickly we’re going to lose visibility of all of the moving parts.

The point is that if you have a piece of automation that accepts input from someone (including yourself late, late at night), try to preempt the valid use cases and restrict or limit the rest.
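To make the guardrail concrete, here is a minimal Python sketch of what that bound might look like for the contrived RAID example. The array structure and function names are hypothetical, purely for illustration: the tooling allows up to the RAID 6 fault tolerance of two failed drives and refuses the third, rather than trusting the operator not to push the big red button one time too many.

```python
class RaidGuardError(Exception):
    """Raised when an operation would take the RAID set offline."""

def mark_drive_failed(array, drive, max_failures=2):
    """Mark a drive as failed, but refuse to break the RAID 6 set.

    `array` is a hypothetical dict: {"drives": set_of_ids, "failed": set_of_ids}.
    RAID 6 tolerates two failed drives; a third makes the set incomplete,
    so the automation rejects it instead of causing an outage.
    """
    if drive not in array["drives"]:
        raise RaidGuardError(f"unknown drive {drive!r}")
    if drive in array["failed"]:
        raise RaidGuardError(f"{drive!r} is already marked failed")
    if len(array["failed"]) >= max_failures:
        raise RaidGuardError(
            f"refusing: {len(array['failed'])} drive(s) already failed; "
            "replace one before failing another"
        )
    array["failed"].add(drive)

array = {"drives": {"d0", "d1", "d2", "d3"}, "failed": set()}
mark_drive_failed(array, "d0")  # fine
mark_drive_failed(array, "d1")  # fine, but redundancy is now exhausted
try:
    mark_drive_failed(array, "d2")  # the guard prevents the outage
except RaidGuardError as err:
    print(f"blocked: {err}")
```

A stricter version might set `max_failures=1` to enforce the real-world “one at a time, after the replacement arrives” rule; the point is that the bound lives in the tooling, not in the operator’s head.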

[Image by William Warby and used unmodified under CC 2.0]

Hyper-V 2016 & Changed Block Tracking

Changed Block Tracking, or CBT, is a mechanism that allows applications (such as backup software) to find out which blocks of a file have changed since a previous point in time.

If you consider a virtual hard disk that’s a terabyte in size, a differential backup without CBT involves reading the entire file to work out what has changed, then backing up only the changed blocks. Whether using direct-attached or network-attached storage, this full-file scan can take a long time, which eats into already-tight backup windows.

CBT allows the backup application to seek to the parts of the virtual hard disk file that have changed and only read the changed blocks.
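The difference between the two approaches can be sketched in a few lines of Python. This is a toy model, not how any hypervisor actually implements CBT: the block size is tiny for readability, and `TrackedDisk` is a hypothetical stand-in for the hypervisor recording dirty blocks as writes land.

```python
BLOCK_SIZE = 4  # tiny blocks so the example is readable; real disks use far larger ones

def blocks(data: bytes):
    """Split a disk image into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def full_scan_diff(old: bytes, new: bytes):
    """Without CBT: read *every* block of both images and compare."""
    return {i: b for i, (a, b) in enumerate(zip(blocks(old), blocks(new))) if a != b}

class TrackedDisk:
    """With CBT: the hypervisor notes which blocks change as writes happen."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.changed = set()

    def write(self, offset: int, payload: bytes):
        self.data[offset:offset + len(payload)] = payload
        first = offset // BLOCK_SIZE
        last = (offset + len(payload) - 1) // BLOCK_SIZE
        self.changed.update(range(first, last + 1))  # mark dirty blocks

    def differential_backup(self):
        # Seek straight to the changed blocks; untouched blocks are never read.
        return {i: bytes(self.data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
                for i in sorted(self.changed)}

disk = TrackedDisk(b"AAAABBBBCCCCDDDD")
disk.write(4, b"XXXX")               # dirties block 1 only
backup = disk.differential_backup()  # reads one block instead of four
```

Both approaches produce the same differential backup; the CBT version simply skips reading the unchanged three-quarters of the disk, which is where the backup-window savings come from at terabyte scale.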

Unfortunately, Windows Server 2012 R2 (and earlier) had no in-built CBT support for Hyper-V, so backup vendors (such as our friends over at Veeam) had to implement their own custom CBT infrastructure, which may or may not be compatible with the storage at the backend.

With Windows Server 2016, CBT (referred to by Microsoft as Resilient Change Tracking, or RCT) has been standardised within Windows itself. Vendors such as Veeam can make use of it, and it is more likely to work with a wide variety of storage arrays. Veeam 9.5 brings support for RCT with Windows Server 2016, and with Tintri support for Windows Server 2016 announced and due to reach GA imminently, the combination of Tintri, Veeam and Windows Server 2016 will let you enjoy standardised Changed Block Tracking.

[Above image created and distributed by David Pacey under Creative Commons 2.0 License]