Improve IT Operational Excellence with Game Days
Our DevOps consulting team at Flux7 works with dozens of enterprises to help mature their IT programs and improve their operational excellence. As enterprises move from traditional IT to starting and scaling DevOps, we help them adopt “everything as code,” including infrastructure, configuration, pipeline, and security as code. Whether this approach is applied to modern apps designed as microservices or to legacy monolithic apps, failures and incidents will happen. There should be a plan to handle them, and that is where Game Days come in.
Moreover, according to Gartner, “Organizational learning practices are key to the success of scaling DevOps. Examples of practices that enable the creation, sharing, and retention of knowledge are:
- Communities of practice
- DevOps dojos
- Game days
- Job rotations
I&O leaders should pursue and nurture these practices to drive knowledge sharing around their organizations.” (Gartner, How to Navigate Your DevOps Journey, Daniel Betts & Christopher Little, 22 October 2018; subscription may be required.)
What’s a Game Day? Do I need one?
Much like firefighters regularly practice everything from hauling hose to Mayday scenarios, the goal of Game Days is to rehearse so that your team is prepared to take the most effective action when an emergency strikes. More concretely, a Game Day is a planned, scheduled event where IT teams actively practice recovering from an incident. The incident can be anything from a spike in network traffic to the detection of malware or a component failure. Any common unplanned incident that your IT team would respond to is a good Game Day candidate.
When Flux7 plans a Game Day, our DevOps consulting team will plan the incident, failure or micro-outage ahead of time and manually trigger it at the start of the Game Day. However, to effectively simulate a real-life scenario, the Game Day incident should come as a surprise and only those triggering the incident should be aware of the details surrounding it.
Once triggered, relevant teams should be notified of the incident and asked to implement the necessary procedure to remediate it. If the remediation is already automated, the customer may simply be notified of the event and gain confidence in the system in place. Game Days are usually scheduled weekly or biweekly and they should only impact non-production environments. (If you’re asking yourself, ‘what about Chaos Monkey?’ read on…)
Acceptance Criteria for Game Day User Stories
Acceptance criteria can be generic and needn’t be too detailed, depending on the nature and desired outcome of the Game Day. For example, for a recent Game Day we held with a large hospitality organization, the criteria included the following:
- Flux7 designed a couple of failure scenarios and prepared a Game Day document prior to the scheduled event.
- Customer evaluated their ability to investigate and understand the environment. Flux7 provided guidance as needed.
- Customer referred to the runbook and updated it if needed.
- Improvement ideas were discussed.
Create A Game Day Template Document
Before you get started, we recommend that you create a standard template document for Game Days. We suggest that it contain the following sections, so that the same details are recorded consistently for each Game Day.
- Date & Time
- Engineers Involved
- Environments Affected
- Scenario Trigger Schedule
- Expected Alerts
- Actual Alerts
- Investigation Steps
- Potential Improvements
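As one way to keep the template consistent from one Game Day to the next, the sections above can be captured in a small data structure and rendered to text. The following is a minimal sketch, not a Flux7 tool: the class name `GameDayRecord`, its field names, and the sample values are all hypothetical and chosen only to mirror the section list.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class GameDayRecord:
    """One Game Day, with one field per section of the template (names are illustrative)."""
    date_time: datetime
    engineers_involved: List[str]
    environments_affected: List[str]
    scenario_trigger_schedule: List[str]
    expected_alerts: List[str]
    # These sections are filled in during or after the exercise.
    actual_alerts: List[str] = field(default_factory=list)
    investigation_steps: List[str] = field(default_factory=list)
    potential_improvements: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Render every section as a heading followed by '- ' bullets."""
        sections = [
            ("Date & Time", [self.date_time.strftime("%Y-%m-%d %H:%M")]),
            ("Engineers Involved", self.engineers_involved),
            ("Environments Affected", self.environments_affected),
            ("Scenario Trigger Schedule", self.scenario_trigger_schedule),
            ("Expected Alerts", self.expected_alerts),
            ("Actual Alerts", self.actual_alerts),
            ("Investigation Steps", self.investigation_steps),
            ("Potential Improvements", self.potential_improvements),
        ]
        lines = []
        for title, items in sections:
            lines.append(title)
            # Empty sections stay visible so nobody forgets to complete them.
            lines.extend(f"- {item}" for item in (items or ["(to be filled in)"]))
        return "\n".join(lines)


# Hypothetical example values, for illustration only.
record = GameDayRecord(
    date_time=datetime(2024, 1, 15, 10, 0),
    engineers_involved=["engineer-a", "engineer-b"],
    environments_affected=["staging"],
    scenario_trigger_schedule=["10:00 stop the build server"],
    expected_alerts=["build-server-down"],
)
print(record.to_text())
```

Keeping empty sections in the rendered output is a deliberate choice here: a blank "Actual Alerts" or "Potential Improvements" heading is a visible reminder to finish the write-up after the exercise.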
Game Day Examples
For the customer we mentioned earlier, we have held several Game Days with a wide variety of manufactured incidents. Recent Game Day incidents we have triggered for their team to investigate and remediate include Jenkins server failures; simulated AWS GuardDuty findings in the form of compromised EC2 instances and compromised IAM credentials; and a networking outage with deleted VPN connections. For each of these, the team was given two hours to identify the root cause of the incident, log their investigatory steps and flag areas for improvement. Note that AWS DevOps improvements can be technology, process, culture and/or teamwork related.
Game Days have healthy organizational outcomes. Teams deploying this DevOps best practice strategy can expect to see the following results:
- Clarity of roles and responsibilities. To be effectively resolved, an outage may require the involvement of several teams. As the introduction of DevOps may have shifted roles and responsibilities, Game Days help to clarify the roles and responsibilities for both teams and individual team members.
- Faster issue resolution. While it is important to have a runbook as a reference, Game Days provide an opportunity to practice and prepare. IT teams should be able to remediate outages faster, and with less stress as a result of Game Days.
- Ideas for improvement. Game Days have proven to generate healthy discussion and many ideas for improvement. If we can identify a common failure, we should be able to automate the remediation procedure. Game Days may also identify flaws in your architecture and provide ideas for AWS architecture best practices.
- Fun team building. It’s just fun to break stuff. DevOps allows us to build and tear down infrastructure fast and at low cost. Game Days would not be feasible without DevOps.
- Measure system resilience. Game Days can shine a light on how you are measuring up to SLA, RTO, RPO and other measures of system resilience.
- Break the “firefighter hero” cycle. In many organizations, successful firefighters are celebrated for saving the day in difficult situations. This creates an environment where people are rewarded for being reactive rather than proactive. Game Days help to share knowledge, avoid siloed or tribal knowledge and anticipate issues.
- Provide validation. Game Days can also serve to validate your existing alerts and monitoring and help identify any that are missing.
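Two of the outcomes above, measuring resilience and validating alerts, lend themselves to a simple check after each Game Day. The sketch below assumes you have recorded the expected and actual alert names, plus the trigger and recovery times. The function names, alert names, and the 60-minute RTO are hypothetical and not taken from any particular monitoring product.

```python
from datetime import datetime, timedelta


def compare_alerts(expected, actual):
    """Return (missing, unexpected) alert names, each as a sorted list."""
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)      # should have fired but did not
    unexpected = sorted(actual_set - expected_set)   # fired but was not anticipated
    return missing, unexpected


def met_rto(triggered_at, recovered_at, rto):
    """True if recovery took no longer than the recovery time objective."""
    return (recovered_at - triggered_at) <= rto


# Hypothetical Game Day results, for illustration only.
missing, unexpected = compare_alerts(
    expected=["build-server-down", "vpn-tunnel-down"],
    actual=["build-server-down", "disk-usage-high"],
)
within_rto = met_rto(
    triggered_at=datetime(2024, 1, 15, 10, 0),
    recovered_at=datetime(2024, 1, 15, 10, 45),
    rto=timedelta(minutes=60),  # assumed objective
)
print("Missing alerts:", missing)
print("Unexpected alerts:", unexpected)
print("Recovered within RTO:", within_rto)
```

A missing alert points to a monitoring gap, while an unexpected one may be noise or a side effect worth adding to the runbook.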
Chaos Monkey and Game Days
Some enterprises take Game Days one step further, using Chaos Monkey to automate and randomize them. In case you aren’t familiar with Chaos Monkey, it is an open source project created by Netflix and available on GitHub. While some companies, like Netflix, even implement Game Days in production, we highly discourage companies in the very early stages of DevOps maturity from doing so. Flux7 best practice, based on our experience working with hundreds of customers, is to take small steps rather than big leaps when it comes to Game Days.
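A small step toward randomization that stops well short of running Chaos Monkey is to pick the scenario and target environment at random while refusing to touch production. The sketch below illustrates that idea only. It does not use or imitate Chaos Monkey's API, and the function name `pick_game_day` and the environment names are assumptions.

```python
import random

# Assumed names for non-production environments; adjust to your own naming.
NON_PRODUCTION = {"dev", "test", "staging"}


def pick_game_day(scenarios, environments, rng=None):
    """Randomly choose a (scenario, environment) pair, never targeting production."""
    rng = rng or random.Random()
    allowed = sorted(env for env in environments if env in NON_PRODUCTION)
    if not allowed:
        raise ValueError("no non-production environment available")
    return rng.choice(list(scenarios)), rng.choice(allowed)


# Scenario names loosely follow the examples earlier in this post.
scenarios = [
    "Jenkins server failure",
    "simulated GuardDuty finding",
    "deleted VPN connection",
]
scenario, environment = pick_game_day(scenarios, ["dev", "staging", "prod"])
print(f"Next Game Day: {scenario} in {environment}")
```

Passing a seeded `random.Random` as `rng` makes the choice reproducible for whoever triggers the incident, while the rest of the team still experiences it as a surprise.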
Analyzing the outcomes of your Game Days, especially your logs and metrics, can help predict future failures, creating a virtuous cycle of IT operational excellence. From understanding processes to learning how to improve and, just as importantly, how to work as a team, Game Days have proven to be a very valuable step towards IT modernization and maturity. Interested in how Game Days can benefit your organization? Reach out to us today.