Keeping in Check: Fire Drills and Disaster Readiness

By Flux7 Labs
May 12, 2014

This past weekend, we solved two problems for two customers. They both had working configuration management solutions. One used Puppet; the other used Chef. One was Red Hat-based; the other was Debian-based. But, both of them had the same problem.

They had recipes that had been used to create environments. The recipes worked when they were written, but in the interim the dependencies of the packages had changed, and the recipes began to fail. Fortunately, this didn’t happen during an actual disaster for either client, but it is a reminder of just how necessary regular fire drills are for testing the disaster plan and keeping it in sync.

There is an old mantra in software development: if you haven’t tested it, it’s broken. The same applies to your disaster readiness.

Very few organizations regularly test their disaster plans. Some don’t have an automated, or even a semi-automated, plan to rebuild an environment. Those that do often have stable running environments, so the plan is left untested for long periods of time, and when it is finally needed it may or may not work. Only a handful of organizations, like Netflix, actually test their systems for failure on a regular basis.

So, why is continuous testing of your disaster plan so important? Here are a few reasons.

External Dependencies

You don’t continue to test a version of software already in production. If you’re not making any changes, there is no reason for bugs to appear.

However, the setup of the environment is completely driven by external tools, and those change over time. The dependencies of packages change, new incompatibilities arise, and packages get deprecated and replaced. There is no guarantee that a recipe that works today will still work a year or two later. It needs to be continuously tested, and the current production environment needs to be kept in sync with the latest version of the recipe.
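One way to notice this kind of drift is to compare the package list of a known-good server with that of a freshly built one. The following is only a minimal sketch: it assumes each list is plain text with one "name version" pair per line (on a Debian-based system, something like `dpkg-query -W -f='${Package} ${Version}\n'` can produce that), and the package names in the sample data are made up for illustration.

```python
def parse_package_list(text):
    """Parse lines of '<name> <version>' into a dict, skipping malformed lines."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            packages[parts[0]] = parts[1]
    return packages


def diff_packages(known_good, fresh):
    """Report packages missing from, added to, or changed in the fresh build."""
    missing = sorted(set(known_good) - set(fresh))
    added = sorted(set(fresh) - set(known_good))
    changed = sorted(
        name
        for name in set(known_good) & set(fresh)
        if known_good[name] != fresh[name]
    )
    return {"missing": missing, "added": added, "changed": changed}


# Hypothetical sample data, not taken from a real server.
old_server = parse_package_list("libfoo 1.0\nlibfoo-dev 1.0\nbuildtool 2013.1\n")
new_server = parse_package_list("libfoo 1.0\nbuildtool 2014.1\n")

report = diff_packages(old_server, new_server)
print(report)
```

A package that shows up under "missing" is a hint that something used to pull it in as a dependency and no longer does, which is worth checking before the rebuilt server is trusted.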

One of our clients, referenced above, tried their Puppet recipe on their new AWS environment.

Originally, when they ran Puppet, Ruby was built with SSL support and things worked. When they ran the recipe again a year later to create a new server, the recipe passed, but the Ruby that got installed didn’t have SSL support. The Ruby version was the same. A lot of investigation revealed that Ruby is compiled without SSL support when libssl and libssl-dev are absent.

Upon further inspection, we realized that the old server from over a year ago had libssl-dev, but the newer one didn’t. An “aptitude why” command revealed that libssl-dev existed on the old server because the package ruby-build depended on it. The newer version of ruby-build, released in January 2014, no longer specifies libssl-dev as a dependency.

Thus, the issue was that a change in ruby-build, which wasn’t even explicitly installed, caused Ruby to compile without SSL, thereby making it unusable for DR.
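Because the recipe itself passed, a failure like this only surfaces if the drill also checks what the build produced. Below is a minimal sketch of such a smoke test, assuming a `ruby` binary on the PATH. The function name `ruby_has_openssl` and the injectable `run` parameter are our own illustration, not part of Puppet or ruby-build.

```python
import subprocess


def ruby_has_openssl(ruby="ruby", run=subprocess.run):
    """Return True if the given Ruby binary can load its openssl library.

    `run` defaults to subprocess.run; it is a parameter only so the check
    can be exercised without a real Ruby installation.
    """
    try:
        result = run(
            # -ropenssl requires the library before the one-liner executes,
            # so Ruby exits non-zero if it was built without SSL support.
            [ruby, "-ropenssl", "-e", "puts OpenSSL::OPENSSL_VERSION"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Ruby is missing or hung; treat that as a failed check.
        return False
    return result.returncode == 0
```

A drill script could call `ruby_has_openssl()` right after the configuration run and mark the drill as failed when it returns False, rather than relying on the recipe's own exit status.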

The Human Element

Your cloud disaster recovery plan inevitably needs to be executed by humans. This is especially true if your systems are not fully automated. But even if they are, the plan still has to be maintained by people.

Of course, people move and find new roles. They forget the work they did many months ago. Some people leave the organization. So, we need to ensure that the people responsible for setting up and executing the disaster recovery plan have both the skills and the institutional knowledge to execute it. The only way to make sure of this is to regularly exercise the plan so the team is prepared to execute when the organization needs them to do so.

The worst and most depressing example we recently came across was when the operations lead at a company passed away while on vacation. Anything that was manually done, and only in his head, went with him. The new operations team was extremely challenged to reverse engineer everything.

Process Improvements

As a process is repeated, its issues become apparent, which leads to an investment in resolving them. Some of these fixes are required for the process to work at all. But the value goes beyond fixing functionality: it leads to optimizing the process, which not only reduces the overhead of regularly conducting fire drills but also enables new functionality.

For example, if the process of bringing up a new environment is optimized enough, then the operations team can empower the developers by creating new ad hoc environments. This greatly improves the development process, because developers get clean environments and don’t spend their time fixing environment issues.

By running the same process over and over again, we kept finding steps where human error crept in. Knowing where those errors occurred helped us decide which parts needed to be automated first.
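Keeping a simple tally across drills is one way to make that decision concrete. The sketch below uses a hypothetical drill log; the step names and counts are invented purely for illustration.

```python
from collections import Counter

# Hypothetical drill log: the steps that failed in each fire drill.
drill_failures = [
    ["configure dns", "restore database"],
    ["restore database"],
    ["restore database", "rotate credentials"],
    ["configure dns"],
]

# Count how often each step failed across all drills.
failure_counts = Counter(step for drill in drill_failures for step in drill)

# Steps that fail most often are the first candidates for automation.
for step, count in failure_counts.most_common():
    print(f"{count} failures: {step}")
```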