A Few Weeks Ago….
Flux7.com is built on WordPress and runs inside a Docker container on a c3.large instance in the AWS us-east-1 region. Until recently we ran the website on a VPS host, and we switched to AWS to avoid slow load times. When we made that move, in my enthusiasm for Docker I took what I think was a clever step. I put the entire site in a Docker container using jbfink's WordPress container (https://github.com/jbfink/docker-wordpress), and then created a private Docker registry on another AWS instance so that I could push the entire container there.
On February 11th….
I ran a simple script to commit the container and push it to our registry, just as I do every night:
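# Find the running container whose "docker ps" line matches $FULLNAME
CONTAINER_ID=$(sudo docker ps | grep "$FULLNAME" | cut -f1 -d' ')
# Commit the container's current state to a new image
IMAGE_ID=$(sudo docker commit "$CONTAINER_ID")
# Tag the image, then push it to the private registry
sudo docker tag "$IMAGE_ID" "$FULLNAME"
sudo docker push "$REGISTRY/$NAME"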
That created a snapshot that we’ll call I0.
On February 15th….
I woke up that morning to the following message reporting that flux7.com was down.
“flux7.com (flux7.com) is down since 02/15/2014 09:39:13AM”
I immediately got onto Skype and contacted Anuj Sharma, who was optimizing the website for SEO. Anuj told me that the crash resulted from a code change he’d pushed from the UI to the functions.php file, so I ssh’ed into the server and got a backup of the website running in about 60 seconds. That resulted in this message:
“flux7.com (flux7.com) is UP again at 02/15/2014 9:41:13AM, after 2m of downtime.”
Within 60 seconds I was able to create a dev environment enabling Anuj to debug in a sandbox. When he gave me the green light a short time later, I was able to switch back to the latest version of the website, thanks to Docker and Linux Containers. I’m sure the issue could have been resolved in other ways, but it was Docker that saved us that morning.
Anuj’s change that brought the site down was made inside a container that we’ll call container X. It was running using Docker image I0, but several changes had been made during the previous four days due to site activity. So when I saw the “site down” message, I took the following steps:
1. Took a snapshot of X and named the image I2, which took about 15 seconds.
2. Stopped X, which took ~5 seconds.
3. Started a new container Y, using I0, the snapshot taken 4 days prior, at port 80. Flux7.com was then up and running, albeit with 4-day-old data. That took only 20 seconds.
4. Started another container Z using I2, the image of the broken container, at port 8080 on the same host and then created a sandbox so that Anuj could fix the problem that brought the site down. Taking a snapshot of X in step #1 was crucial as it allowed me to create a sandbox in no time. It was an exact replica of the production environment at the time it failed. Bug re-creation took 0 seconds.
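The four steps above can be sketched as a small script. This is only an illustration, not the exact commands I typed: the container name container_x and the image tags are hypothetical placeholders, and it assumes the WordPress container listens on port 80 internally. The commands are printed rather than executed unless APPLY=1 is set.

```shell
# Dry-run helper: print the command unless APPLY=1 is set, then execute it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# recover BROKEN_CONTAINER GOOD_IMAGE SNAPSHOT_IMAGE
# All three names are placeholders chosen for this sketch.
recover() {
  local broken="$1" good_image="$2" snapshot="$3"
  run sudo docker commit "$broken" "$snapshot"    # 1. snapshot the broken container (I2)
  run sudo docker stop "$broken"                  # 2. stop it, which frees port 80
  run sudo docker run -d -p 80:80 "$good_image"   # 3. serve the last good image (I0)
  run sudo docker run -d -p 8080:80 "$snapshot"   # 4. sandbox of the broken state on 8080
}

recover container_x flux7/wordpress:i0 flux7/wordpress:i2
```

Taking the snapshot before stopping the container is what makes the sandbox in step 4 an exact copy of the failed state.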
Thanks to the zero bug re-creation time, Anuj was able to fix the problem with container Z running at port 8080 in only 20 minutes. I tested his change in container Z, ran some basic QA and then ran a “docker diff” on Z to make sure that only the functions.php file had changed.
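The “only functions.php changed” check can also be automated. The sketch below is an assumption about how one might do it, not what I ran that morning: a hypothetical helper, unexpected_changes, reads docker diff output (one change per line: a type letter A, C or D, a space, then the path) and prints every changed path other than the expected file. Because docker diff also lists the parent directories of a changed file, those are ignored too.

```shell
# unexpected_changes EXPECTED_PATH
# Reads `docker diff` output on stdin and prints any changed path that is
# neither EXPECTED_PATH itself nor one of its parent directories.
unexpected_changes() {
  local expected="$1"
  awk -v want="$expected" '{
    path = substr($0, 3)                  # drop the type letter and the space
    if (path != want && index(want, path "/") != 1) print path
  }'
}
```

It could be used as sudo docker diff container_z | unexpected_changes /path/to/functions.php, where both the container name and the path are placeholders. Empty output means nothing else was touched.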
Next, I followed these steps:
1. Ran docker diff Y to confirm that no significant changes had been made to flux7.com during the downtime. Fortunately, there were none. Being able to check this with a single command is one of Docker’s strengths.
2. Took a snapshot of container Z and named it image I3.
3. Stopped container Y, which was running flux7.com.
4. Started container F (which stands for “final”) at port 80 using image I3 from step #2 above.
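As a rough sketch, the switch back to the fixed site looks like the following. The container names and image tag are hypothetical placeholders, and the container is assumed to listen on port 80 internally. As a safety measure in this sketch, commands are only printed unless APPLY=1 is set.

```shell
# Dry-run helper: print the command unless APPLY=1 is set, then execute it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# promote_fix SERVING_CONTAINER SANDBOX_CONTAINER FIXED_IMAGE
# All three names are placeholders chosen for this sketch.
promote_fix() {
  local serving="$1" sandbox="$2" fixed_image="$3"
  run sudo docker diff "$serving"                  # 1. review changes made on Y
  run sudo docker commit "$sandbox" "$fixed_image" # 2. snapshot the fixed sandbox Z as I3
  run sudo docker stop "$serving"                  # 3. stop Y, which frees port 80
  run sudo docker run -d -p 80:80 "$fixed_image"   # 4. start F from I3 on port 80
}

promote_fix container_y container_z flux7/wordpress:i3
```

In a real run you would read the docker diff output before continuing, since anything written to Y during the outage is lost once Y is stopped and replaced.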
Bingo! Flux7.com was restored to its actual state without bugs, and also included Anuj’s changes.
Other Possible Solutions…
Let’s look at how this problem could have been addressed in other ways.
1. I could have used AWS AMIs. We use AWS-based cold disaster recovery (DR) to snapshot our instance every night to both the us-east-1 and us-west-1 regions. But if we had used this approach, starting the DR instance would have taken 1.5 minutes or more. Next, I’d have had to test it and update the Route 53 settings, and DNS propagation would have taken another 10 minutes. That would have resulted in a much longer downtime than the mere 2 minutes required for our Docker solution.
2. Since all of the PHP code and assets are in GitHub, I could have pulled the “known to work” code and hoped the bug was in the code, not the DB. But I felt strongly that this would be a risky solution, even though it would have taken only roughly 1–2 minutes.
3. I could have used a known good state of the code and DB, if it existed. However, repairing all that plumbing with MySQL and wp-config would have been time consuming and more prone to error.
Where Docker Really Demonstrated Its Strengths….
1. Extremely fast snapshots allowed sandbox creation for hot fixing without delaying restoration into production.
2. Extremely fast spinup minimized flux7.com’s downtime, as well as the time needed to run a backup. It took only two commands, and neither took over 10 seconds to run.
3. The ability to diff took care of an age-old DR problem: tracking the changes made while running on the DR copy and then porting them back in. Docker diff made it super easy to recognize that no changes had been made to flux7.com.
The upshot is that here you are on flux7.com reading about our experience!
Where Docker Could Have Been Better….
1. The Docker container push kept failing with an internal server error, which is a known bug (https://github.com/dotcloud/docker/issues/4115). The error message was “The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.”
That was the price paid for using software that isn’t production-ready, but it’s a price I’m willing to continue paying because I know that Docker will fix that bug soon. Until then, I’ll simply create backups using docker export.
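A minimal sketch of such an export-based backup follows. The function names and the archive naming scheme are my own placeholders, not part of Docker. Note that docker export writes out the container’s filesystem as a tar stream, so the result is a flat archive without the image history; it can be turned back into an image with docker import.

```shell
# backup_file NAME [DATE]
# Build a dated archive name, e.g. flux7-20140215.tar.gz.
# DATE defaults to today; the naming scheme is just an example.
backup_file() {
  printf '%s-%s.tar.gz\n' "$1" "${2:-$(date +%Y%m%d)}"
}

# export_container CONTAINER_ID NAME
# Stream the container's filesystem into a gzipped tarball in the
# current directory.
export_container() {
  sudo docker export "$1" | gzip > "$(backup_file "$2")"
}
```

For example, export_container "$CONTAINER_ID" flux7 would write today’s archive, reusing the CONTAINER_ID lookup from the nightly script.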
2. While using Docker, I really wished that it had bash autocompletion. I frequently found myself running docker ps and docker images in order to look up information. You can read more about this at https://github.com/dotcloud/docker/issues/2180. What I didn’t realize at the time was that the issue had already been fixed, which shows how rapid Docker’s pace of development is.
Overall, it was an amazing experience! Kudos to the Docker guys! I believe I owe you lunch the next time we’re in the Bay Area.