Orchestration for Enterprise Cloud part 5

We’ve spent the past four installments in this series putting together a System Center Orchestrator runbook workflow that calls into PowerShell and uses the Tintri Automation Toolkit cmdlets to do the real work.

That work solves a real business need for us — we want our developers to be able to test their code against a current copy of production data. Dump and restore operations are expensive and error-prone, so we’re taking advantage of Tintri’s SyncVM functionality to handle the data synchronisation for us. As we’ll see, the whole thing takes less than a minute!

In this article, we’ll walk through executing this runbook and show how easy it makes the task. This simplicity makes it a great candidate for a task that can be delegated to someone with less in-depth knowledge of (or access to) the cloud infrastructure. This is a big step toward self-service.

Orchestrator Web Console

If we now point a web browser at port 82 of our Orchestrator server (for example, http://scorch-2016.vmlevel.com:82/), we should be presented with the Orchestrator Web Console and should see our new runbook.

[Screenshot: scorch-runbooks]

Select the runbook and click the Start Runbook button.

[Screenshot: scorch-start]

It will prompt you for the required input — simply the name of the developer’s virtual machine. It doesn’t request any information about the VMstore that the VM is stored on, it doesn’t ask for the production VM name, it doesn’t ask which snapshot to sync from and it doesn’t ask which virtual disks to synchronise. All of that is taken care of inside the runbook. This drastically reduces the number of places we could accidentally mess something up when we’re in a hurry or if we delegate this task to someone else.

Should something go wrong with the destination VM as part of this process, the SyncVM process we’re using takes a safety snapshot automatically, so at worst, we can easily roll it back.

We’ll enter our VM name (vmlevel-devel) and kick off the runbook job.

[Screenshot: scorch-vmname]

Next we’ll click on the Jobs tab and should see a running job.

[Screenshot: scorch-jobs]

If the job doesn’t show an hourglass (indicating it’s running) or a green tick (indicating success), it’s worth checking that the Orchestrator Runbook service is started on your runbook servers (check your Services applet). I’ve noticed that it sometimes doesn’t start correctly by itself despite being set to start automatically:

[Screenshot: scorch-services]

After a little while (it takes about 45 seconds in my lab), hit refresh and the job should have succeeded. At that point, click View Instances and then View Details to inspect the job.

[Screenshot: scorch-jobdetails]

If we click on the Activity Details tab and scroll down, we can see the parameters of the Run .Net Script activity that calls our PowerShell code. If you look closely, you’ll see the variables we have defined. This especially includes our TraceLog variable, which, as you can see in the output above, gives us a very detailed run-down of the process executed.

Given that this has succeeded, we’ve achieved our goal. Our developer VM has our developer code and OS on it, but has a copy of the latest production data snapshot. The whole process took less than 60 seconds and the developer is now up and running with recent production data — all without costly dumps and restores.

Try it for yourself and see.

[Ovation image by Joi Ito and used unmodified under CC2.0]

Orchestration for Enterprise Cloud part 4

Leading up to this point in this series, we’ve spoken a little about System Center Orchestrator and why we might want to deploy runbooks within it (or another orchestration tool). We also looked at how to create a runbook and pass parameters between runbook activities. We then looked at a Microsoft template for calling sophisticated pieces of PowerShell as part of that runbook workflow. As we covered both here and in our automation series, we’re generally doing all of this to solve a real business problem.

In this article, we’ll look at the portion of the sample code that we haven’t looked at yet. This is the code that actually calls into the Tintri Automation Toolkit for PowerShell and performs the magic that is data-copy management through SyncVM.

The Code

The use-case specific code that solves our business need runs from line 119 to line 194. It uses the Tintri Automation Toolkit for PowerShell (a free download from the Support Portal) and SyncVM to handle our zero-copy data synchronisation.

You’ll notice that each logical section of code is surrounded by a try { …. } catch { …. } block. The Tintri cmdlets throw exceptions when an operation fails, and using try and catch allows us to handle those cases correctly, collect any information needed for our trace log, and pass a useful message back to the user.

The rest is all pretty straightforward and just calls the following cmdlets to get the job done (a condensed sketch of the whole sequence follows the list):

  1. Import-Module to import the Tintri PowerShell modules. In PowerShell 3.0 and later, this should happen automatically, but explicitly trying to import the module makes it easy to tell when it isn’t available. This module needs to be installed on each of our Runbook Servers.
  2. Connect-TintriServer creates a session with our Tintri VMstore. Note the use of the -UseCurrentUserCredentials option. This code is run as the Orchestrator service account (specified at install time) on the runbook servers. The -UseCurrentUserCredentials option allows the use of Kerberos Single Sign On (SSO) to authenticate against the VMstore. This means no hard-coded passwords and also means that if/when we change those service account credentials, we don’t need to track down all of the scripts that use the credentials and change those too. REST and PowerShell SSO is something that we covered in detail in a previous post.
  3. Get-TintriVM on line 148 retrieves an object representing our developer VM. We’ll use that further down.
  4. Get-TintriVM (line 161), Get-TintriVMSnapshot (line 163) and Get-TintriVDisk (line 165) get objects to represent the production VM, its most recent snapshot and the set of virtual disks within that snapshot respectively.
  5. Sync-TintriVDisk on line 178 is where the magic happens. We take the development VM object and a subset of the vDisks attached to the latest production snapshot (data disks 1 and 2, skipping system disk 0), and perform the zero-copy data synchronisation. At the completion of this cmdlet, the development VM will have been booted with the production data from that latest snapshot.
  6. Disconnect-TintriServer on line 192 just closes our session to the Tintri VMstore. It’s always good practice to do so.
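
Pulled out of the full example and stripped of error handling, tracing and the Orchestrator plumbing, the sequence looks roughly like the sketch below. It’s only a sketch: the variable names are placeholders, and the cmdlet parameter names (other than -UseCurrentUserCredentials) are assumptions based on the description above, so check Get-Help for each cmdlet against your version of the toolkit.

# Condensed sketch only -- error handling, tracing and Orchestrator plumbing omitted.
# Parameter names are assumptions; verify with Get-Help in your toolkit version.
Import-Module TintriPSToolkit                       # module name/path may differ in your install

# Kerberos SSO as the Orchestrator service account -- no hard-coded credentials
Connect-TintriServer -Server $tintriServer -UseCurrentUserCredentials

$devVM     = Get-TintriVM -Name $developerVMName    # destination developer VM
$prodVM    = Get-TintriVM -Name $productionVMName   # source production VM
$snapshots = Get-TintriVMSnapshot -VM $prodVM
$latest    = $snapshots | Select-Object -Last 1     # assumes snapshots are returned oldest-first
$vdisks    = Get-TintriVDisk -Snapshot $latest      # the vDisks within that snapshot

# Sync data disks 1 and 2 onto the developer VM; system disk 0 is left untouched
Sync-TintriVDisk -VM $devVM -SourceVDisk $vdisks[1..2]

Disconnect-TintriServer                             # close the VMstore session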

Tracing

Note the types of things that we’re logging to our trace log too. In the case of the Connect-TintriServer cmdlet call, this creates a connection and authenticates us. It will fail if either of those things goes wrong. As a result, we’re logging the VMstore we’re connecting to and the username we’re connecting as. On failure, we log the exception message so that we know why it failed.

In the case of the per-VM operations, we log the VM we’re operating on and the exception message.

What we’re trying to do is to leave a very clear trail of what happened leading up to a failure.
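
As a rough illustration of that pattern (not the exact code from the example, and the -Server parameter name is an assumption), the connection step with its tracing and error handling looks something like this:

# Illustrative tracing pattern only -- the real example in the repo differs in detail.
try {
    $TraceLog += "Connecting to VMstore $tintriServer as $env:USERNAME`r`n"
    Connect-TintriServer -Server $tintriServer -UseCurrentUserCredentials
    $TraceLog += "Connected to $tintriServer successfully`r`n"
}
catch {
    $TraceLog     += "Connection to $tintriServer failed: $($_.Exception.Message)`r`n"
    $ErrorMessage  = "Could not connect to VMstore $tintriServer"
    $ResultStatus  = 1
}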

Ready To Roll

At this point, we’re ready to execute this whole workflow. We’ll demonstrate that in the next article with a bunch of screenshots just to break things up a little.

[Clones image by HJ Media Studios and used unmodified under SA2.0]

Automation for Private Cloud recap

Earlier this year we looked at how to turn a business need into a piece of code to be used as part of an automation or orchestration workflow.

This (very) brief article summarises the whole series in one easy-to-find place.

  1. Defining the high level business problem and breaking it down into smaller subtasks was the focus of the first article.
  2. The focus of the second article was to find the right tool for the job for our use case of efficient data-copy management. This gave us a few lines of PowerShell that made use of some Tintri REST APIs and Tintri’s space-efficient SyncVM feature.
  3. Our third installment looked at the concept of modularity. This allowed us to write one piece of code that could then be reused many times over without modification.
  4. We looked at separating the code from the configuration in our fourth part in the series. This allows us to start to safely delegate management of automated tasks to folks that may not need access to the code itself.
  5. Article number 5 finished up by spending time looking at the importance of error handling in our automation. We wouldn’t normally leave this to the end when developing something, but it’s a topic that deserved an article of its own.

Extending upon that, we’re moving into articles on using our automation within Orchestration frameworks for delegation and self-service. Keep an eye out for those.

[Clouds image by Tom Hall used unmodified under CC2.0]

Orchestration for Enterprise Cloud part 3

In the first article in this series, we gave an overview of orchestration as an extension to our previous series around automation. We also used the second article to build a runbook in System Center Orchestrator to call some PowerShell code. In this article and the next, we’ll look at the PowerShell code.

One of our API gurus, Rick Ehrhart, was kind enough to rent me some space in the Tintri Github repo of API examples to publish a complete worked example of the code we’ll be working with. A syntax-highlighted version can be found here for your reading pleasure, with a raw version also available if you wish to download and modify it.

This article will cover the Orchestrator plumbing part of the script, and the next article will cover the Tintri Automation Toolkit part that actually interacts with the VMs and the storage.

Adding the PowerShell code

Previously, we included a Run .NET Script activity that had some inputs and some outputs. We set the script language to PowerShell, but didn’t add any PowerShell code there. We’ll do that shortly. The text input box is very limited in size and functionality. There’s no syntax highlighting or online help or any of the other things we’ve become accustomed to, and it isn’t easy to run code from it iteratively.

What I suggest is to develop the code in something like PowerShell ISE and, whenever it changes, update the copy in the runbook activity. The two environments have a few different conventions, but we’ll take those into consideration.

The Code

There are two main sections to the code example. The first is the section in the middle that does all of the use-case specific work around Tintri SyncVM; this can be modified to do anything you like. The second is the code at the top and bottom of the script, which allows it to run effectively under Orchestrator. That second part is what we’ll cover in this article, and it borrows heavily from the Microsoft Best Practices example.

Standalone Mode

Line #48 sets a boolean (true/false) variable called $standalone. You’ll notice that this variable is used in a number of if statements throughout the script that look something like this:

if($standalone) {
   ....
} else {
   ....
}

Setting this variable to $true at the top of the script makes the code take the paths specific to running under PowerShell ISE or plain PowerShell. Setting it to $false tells the rest of the script that we’re running under Orchestrator. There aren’t many differences, but the way Orchestrator passes in parameters and handles script output does differ. When running under PowerShell ISE, we don’t have the Orchestrator Return Data activity to subscribe to our output data, so we write it to stdout instead so that we can read it.

You’ll notice that when not running in standalone mode, the script sets a variable called $param to the string "{VMname from Initialize Data}". This is Orchestrator-specific syntax that sets the variable to the VM name passed in from the Initialize Data activity. You can set this without having to type it by right-clicking between the double quotes and clicking Subscribe:

[Screenshot: scorch-psvar]

When we copy/paste the code into Orchestrator, we need to remember to set $standalone to $false.
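
To make the pattern concrete, here’s a minimal sketch of how the two paths fit together (the standalone VM name is just a placeholder):

$standalone = $true   # remember: set this to $false when pasting into the Run .NET Script activity

if ($standalone) {
    # Interactive run under PowerShell or PowerShell ISE: supply the VM name directly
    $param = "vmlevel-devel"
}
else {
    # Under Orchestrator: this string is filled in by subscribing to the VMname
    # parameter published by the Initialize Data activity
    $param = "{VMname from Initialize Data}"
}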

Script-wide variables and constants

From about line 51 through to 69, we’re setting a number of variables that control parts of our script execution and provide us with values for things like the name of the production VM we’re dealing with.

We said earlier that we wanted to limit how much control end users had over this workflow as a way to prevent accidents. Here, because we’re dealing with a single production VM on a single Tintri VMstore, we’ve just set these as constants. If we had a selection of production VMs across a number of VMstores, we might instead have another Orchestrator activity before this one, that uses whatever means is appropriate in your environment to select the correct production VM. Perhaps it looks up some tags or attributes on the named developer VM. Perhaps this is something that we can have the developer provide with some amount of validation.

We also have variables called $TraceLog, $ErrorMessage and $ResultStatus. Normal input/output mechanisms don’t apply in the Orchestrator case, so we’ll use these variables to store the final result ($ResultStatus), a user-friendly error message ($ErrorMessage) and a log of tracing information ($TraceLog) so that if anything goes wrong, we can take a look at it afterwards. These names must match the variable names we added on the Published Data tab when defining the Run .NET Script activity in the last article.
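
A rough sketch of that block might look like the following; the VMstore and production VM names are placeholders rather than the values used in the original script:

# Placeholders -- substitute values for your own environment.
$tintriServer     = "vmstore01.example.com"   # the single VMstore we connect to
$productionVMName = "vmlevel-prod"            # the one known production VM (assumed name)

# Returned to Orchestrator via the Published Data tab of the Run .NET Script activity.
$ResultStatus = 0      # final result of the run
$ErrorMessage = ""     # user-friendly error message on failure
$TraceLog     = ""     # detailed trace of what happened, for troubleshooting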

A Script Inside A Script

System Center Orchestrator runs PowerShell scripts using a built-in PowerShell 2.0 interpreter. This means that a lot of the comforts we’ve become accustomed to with PowerShell 3.0 and later just don’t exist.

Lines 85, 96 and 211 call New-PSSession, Invoke-Command and Remove-PSSession to use PowerShell remoting to start a new PowerShell session on the same host, but using the default PowerShell instance, which will be 3.0 or later on Windows Server 2012 R2 and later. The script that runs in this child session is the code within the -ScriptBlock { … } option.

Because this script is run as a separate process, variables outside the ScriptBlock aren’t accessible inside the ScriptBlock. As a result, we wrap up our input parameters inside an array called ArgsList and pass that into the ScriptBlock. We also wrap up the results (tracing, error message, result) in another array and return that back to the outer script.

If we wanted to pass more data in, we could add more to the $ArgsList array and use it inside the ScriptBlock. If we wanted to return more stuff from the ScriptBlock, we could add it to the $resultsArray array and use it outside the ScriptBlock.
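
Stripped of the use-case specific code, the wrapper pattern looks roughly like the sketch below (the unary comma in -ArgumentList keeps the array intact as a single argument):

# Sketch of the child-session pattern; the real script passes more values in $ArgsList.
$ArgsList = @($param, $tintriServer, $productionVMName)

$session = New-PSSession     # new session on this host, using the default (3.0+) PowerShell

$resultsArray = Invoke-Command -Session $session -ArgumentList (,$ArgsList) -ScriptBlock {
    param($ArgsList)
    $vmName = $ArgsList[0]
    # ... the Tintri Automation Toolkit work happens here (covered in the next article) ...
    # Return tracing, error message and result status to the outer script
    return @("trace output goes here", "", 0)
}

Remove-PSSession -Session $session

$TraceLog     = $resultsArray[0]
$ErrorMessage = $resultsArray[1]
$ResultStatus = $resultsArray[2]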

Summary

In this brief article, we’ve looked at the plumbing needed to have some PowerShell code execute as part of a System Center Orchestrator activity. To summarise, here are the steps our code had to perform:

  1. If we’re being called from Orchestrator, collect the provided VM name as an input parameter.
  2. Set some script-wide variables and constants.
  3. Create an array of parameters and variables to pass into a child PowerShell session.
  4. Start the child PowerShell session, which takes the provided variables, does a bunch of Tintri stuff (that we’ll cover next) and returns some results in another array.
  5. Close down the child PowerShell session and place the results in a set of variables that we configured Orchestrator to pass on to the next runbook activity.

In the next article, we’ll cover the lines of code that use Tintri’s SyncVM functionality to serve the use case we defined at the start of this series — to instantly make a copy of our production data available to our developer VMs. At that point, we’ll have a working Orchestrator runbook and can look at expanding it.

[Matryoshka image by fletcherjcm and used unmodified under SA2.0]

Orchestration for Enterprise Cloud part 2

In our previous post, we set the scene for using System Center Orchestrator to bundle up automated functionality in a way that’s more consumable by others. We also defined the business problem we’re trying to solve and worked out which inputs we would rely on the user to provide and which things we will decide on their behalf.

In this article, we’ll start to look at the Orchestrator workflow and how to include some PowerShell automation as part of that. We’ll then follow up with some specifics around example PowerShell code, which has its share of peculiarities when being run under Orchestrator.

Orchestrator Workflows

System Center Orchestrator, like many orchestration tools, allows us to take automation modules, called Activities, and glue them together in a workflow called a Runbook. There are a bunch of pre-defined activities that ship with Orchestrator, and Microsoft provides extra activity bundles in a set of Integration Packs that are available for download. We’ll be keeping things quite simple in our example and will rely primarily on the Run .NET Script activity that comes standard.

Open Orchestrator’s Runbook Designer tool, select your Orchestrator runbook server and we’ll create a new runbook. In my example, I’ve called mine Prod-Dev Data Sync. Using the activities on the left, drag and drop the following activities into the runbook and connect them:

  • Initialize Data — this is where our runbook will start
  • Run .NET Script — this activity is where we’ll call out to PowerShell to do Tintri SyncVM magic
  • Return Data — at the end of the runbook, any returned data will be handled here and passed back to Orchestrator

It will end up looking like this:

[Screenshot: scorch16]

We can build far more complex runbooks than this, and indeed ours could be improved, but it will work as a starting point.

Next, right-click the Initialize Data activity, select Properties and click the Details tab. This is where we set the inputs that this runbook will take from the end user. We’ll add one string input parameter and we’ll call it VMname. It should look as follows:

[Screenshot: scorch-init]

We’re going to blindly pass this along to the next activities in the runbook, but it would be better to perform some kind of basic checks here to make sure that we’re operating on a developer VM and not a production VM. We’ll come back to that.

Next, right-click the Run .NET Script activity, hit Properties and again select the Details tab. This is where we’ll add our PowerShell script shortly. For now, set the language to PowerShell and leave the script pane blank. Click on the Published Data tab. This is where we take any values returned by our script and pass them to the next activity in the runbook.

We’re going to add three variables to be returned:

  1. ResultStatus as an integer
  2. ErrorMessage as a string
  3. TraceLog as a string

The names don’t matter too much, but the name you use in the Variable field will need to match the variable names we use in our script, and the Name field will be plumbed into the Return Data activity shortly. The result should look like this:

[Screenshot: scorch-scriptret]

The plumbing for our runbook is almost done. We just need to take care of the data that we want to return and we’re then ready to start adding our PowerShell code.

First, right-click the runbook’s title in the tab along the top, click Properties and select the Returned Data tab. This is where we’ll declare which parameters we plan to return. To begin with, these will match what we returned from the Run .NET Script activity earlier.

[Screenshot: scorch-retdata]

This defines the data we plan to return, but we still need to link that up with the previous activities in the runbook.

Right-click the Return Data activity, click Properties and select the Details tab. You’ll notice that it has already been populated with the list of return parameters we defined a moment ago.

For each parameter to be returned, we need to subscribe to a piece of data published by earlier activities. In our case, we’ll right-click the text field, select Subscribe and Published Data. You should be able to find each of the variables we defined as output from the Run .NET Script activity earlier. Orchestrator will then fill in the text fields with template text that looks like ‘{ResultStatus from “Run .Net Script”}’ as seen below:

[Screenshot: scorch-retsubscribe]

At this point, our runbook is pretty much complete. To summarise what we’ve done:

  1. Declared the VM name as an input parameter and subscribed to it
  2. Passed that VM name to a PowerShell script (still to come)
  3. Took the returned data from the script and returned that at the end of our runbook.

We’ll cover the PowerShell script in two separate articles — our next will cover the plumbing needed for PowerShell scripts in Orchestrator, and then the one after will look at the use-case specific code.

Ideas for improvement

The runbook we have will be pretty functional once we’re done with the PowerShell component. But it is about as simple as it could be. Here are a couple of ways that we could insert some more activities into the runbook to make things more robust:

  1. Have an activity after the Initialize Data activity that checks that the virtual machine name is a developer VM, rather than production or some other application.
  2. Extend that to check that the user requesting the activity owns the VM being operated on. Most hypervisors don’t have a concept of an owner of a VM in this context, so we’d need to think of ways to map that ourselves.
  3. Check the output of the script and if the operation failed, automatically email us with the trace log data and anything else helpful.

Time permitting, we’ll try to tackle one or more of these once we have the initial project up and running.
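
As a taste of the first idea, a naive naming-convention check could sit in another Run .NET Script activity between Initialize Data and the main script. The "-devel" suffix below is purely an assumption about how VMs might be named in your environment:

# Hypothetical guard: only VM names ending in "-devel" are allowed through.
$VMname = "{VMname from Initialize Data}"   # subscribed from the Initialize Data activity

if ($VMname -notmatch '-devel$') {
    $ResultStatus = 1
    $ErrorMessage = "Refusing to operate on '$VMname': it does not look like a developer VM."
}
else {
    $ResultStatus = 0
    $ErrorMessage = ""
}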

[Orchestra image by aldern82 and used without modification under SA2.0]

Orchestration For Enterprise Cloud

In our last series, we looked at taking a business problem or need and turning it into a piece of automated code. We started out by breaking the problem down into smaller pieces, we then put together a piece of automation to demonstrate the overall solution, and then made it more modular and added some error handling.

In this series, we’re going to extend upon this and integrate our automation into an orchestration framework. This approach will apply to any orchestration framework, but we’ll use System Center Orchestrator 2016 in our examples.

Why Orchestration?

Primarily for delegation of tasks. We may have written a script that we can run to perform some mundane task, but for us to be able to successfully scale, we need to start putting some of these tasks into the hands of others.

As a dependency for delegation, we also want to use automation and orchestration as a way to guarantee us consistent results and general damage prevention. Sure, we could just allow everyone access to SCVMM or vCenter to manage their virtual machines, but that’s a recipe for disaster. Orchestration gives us a way to safely grant controlled access to limited sets of functionality to make other groups or teams more self-sufficient.

The Process

Much like in our earlier automation series, we want to start by defining a business problem and breaking it down into smaller tasks from there. We also want to extend this to include the safe delegation of the task, which means thinking carefully about the input we’ll accept from the user, the information we’ll give back to the user, and the diagnostic information we’ll need to collect so that if something goes wrong, we can take a look later.

The fictitious, but relevant, business problem that we’re going to solve in this series is a common DevOps problem:

Developers want to be able to test their code against real, live production data to ensure realistic test results.

The old approach would be to have someone dump a copy of the production application database and restore the dump to each of the developers’ own database instances. This is expensive in time and capacity and can adversely impact performance. It’s also error-prone.

Instead, we’ll make use of Tintri’s SyncVM technology and its space-efficient clones to make production data available to all developers almost instantly. We’ll do this with some PowerShell and a runbook in System Center Orchestrator.

We can then either schedule the runbook to be executed nightly, or we can make the runbook available to the helpdesk folks, who can safely execute the runbook from the Orchestrator Web Console. [Later, we’ll look at another series that shows us how to make this available to the developers themselves — probably through a Service Manager self-service portal — but let’s not get too far ahead of ourselves]

Core Functionality

Our production VM and developer VMs all have three virtual disks:

  1. A system drive that contains the operating system, any tools or applications that the developer needs, and the application code either in production or in development.
  2. A disk to contain database transaction logs.
  3. A disk to contain the database data files.

In our workflow, we’ll want to use SyncVM to take vDisks #2 and #3 from a snapshot of the production VM, and attach them to the corresponding disk slots on the developer’s VM. We want this process to not touch vDisk #1, which contains the developer’s code and development tools.

Inputs

For us to use SyncVM (as we saw previously), we need to pass in some information to the Tintri Automation Toolkit about what to sync. Looking at previous similar code, we probably need to know the following:

  • The Tintri VMstore to connect to.
  • The production VM and the snapshot of it that we wish to sync
  • The set of disks to sync
  • The destination developer VM

In order to limit potential accidents, we probably want to limit how much of this comes from the person we’re delegating this to. In our example, we’ll assume a single VMstore and a single known production VM. We’ve also established earlier that there is a consistent and common pattern for the sets of disks to sync (always vDisk #2 and #3). The only parameter where there should be any user input is the destination VM. The rest can all be handled consistently and reliably within our automation.

Output

After this has been executed, we’ll need to let the user know if it succeeded or failed, along with any helpful information they’ll need. We’ll also want to track a lot of detailed information about our automation so that if an issue arises, we have some hope of resolving it.

Summary

So far, we have again defined a business need or business problem (synchronisation of production data to developer VMs), defined the set of inputs we’ll need and where those will come from, and we’ve defined the outputs.

In the next installment, we’ll start to get our hands dirty with System Center Orchestrator, followed by PowerShell and the Tintri Automation Toolkit for PowerShell.

[Orchestra image by Sean MacEntee and used unmodified under CC2.0]

Could The AWS Outage Happen To You?

No doubt we’ve all been affected by the recent Amazon Web Services outage in one form or probably many. It’s had coverage the Internet over… now that the Internet is again running. We’re also starting to get a picture of what went wrong.

This article isn’t a big "I told you so" about public cloud, nor is it a jab at Amazon or the poor engineer who must have had a rough couple of days this week.

I’m hoping that this article serves as a bit of a public service announcement for the way that we all operate within our Enterprise and Private clouds so that the same doesn’t happen to those of us running the infrastructure within our organisations.

What Happened?

Amazon has posted a summary of what went down. I want to take a look at a specific excerpt:

At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

The details are thin, but our immediate reaction might be to jump on that poor S3 team member. I hope his or her week is looking up already and I’m not sure that they’re individually to blame as much as the automation.

The industry is talking about automation and orchestration being a necessity due to the scale and complexity of the in-house systems we have. It’s unreasonable to expect that the complexity of one of the biggest pieces of technical infrastructure can be maintained in the heads of the engineers running it — especially when it’s as fluid as it is.

Any automation or orchestration needs to put bounds around what the person running it can do, whether that’s a strong technical engineer, an end user, or someone in between.

What Can We Learn From This?

We’re all talking a lot about orchestration and automation. We’ve got self-service portals for end users and we’re putting new tools in front of helpdesk staff. We’re having to automate big chunks of what used to be standard IT stuff because the scale is too big to do these things manually anymore.

Whether the automation or orchestration is designed for the most technical of folks, or whether it’s for someone without the technical chops, it’s important to make sure that the tooling only allows stuff that should be allowed and prevents everything else. It sounds simpler than it really is, but it’s worth considering and limiting the potential use cases to avoid tragedy in-house.

A Simple, Contrived Example

To illustrate the point, let’s assume that we have some orchestration to allow an operator to control RAID arrays in servers in our data centre. One of their responsibilities is to monitor the number of write cycles to the SSDs in the RAID array. When a drive crosses a threshold, the operator marks it failed and organises a replacement drive.

[I told you that this illustrative example would be contrived…]

These fictional arrays are made up of RAID 6 stripe sets, which, as we know, can tolerate two drive failures.

Now let’s say that the operator sees three drives cross the threshold at the same time and pushes the big red button on each of them. We now have an incomplete RAID set and an outage.

Sure, in the real world, we wouldn’t remove non-failed disks from an array before a replacement was there and would only ever do one at a time. However, if we expand this type of cluster-like redundancy to hosts, clusters and geographic regions, we’re starting to see how quickly we’re going to lose visibility of all of the moving parts.

The point is that if you have a piece of automation that accepts input from someone (including yourself late, late at night), try to preempt the valid use cases and restrict or limit the rest.

[Image by William Warby and used unmodified under CC 2.0]