Last week, we nearly pushed a bad configuration into production, which would have broken some things and made some code changes live that were not ready. Nearly, but not quite: while we were relieved that we’d caught it in time, it was still demoralising to find out how close we had come to trouble, and a few brave souls had to work into the evening to roll back the change and make it right.
Rather than shouting and pointing fingers, the team came together, cracked open the Post-Its and Sharpies and set to engineering. The problem to be solved: what one thing could we change to make this problem less likely, or less damaging?
The first step was for the team to build a cohesive view of what happened. We did that by using Post-Its on the wall to construct a timeline: everybody knew what they individually had done and had seen, and now we could put all of that together to describe the sequence of events in context. Importantly, we described the events that occurred not the people or feelings: “the tests passed in staging” not “QA told me there wouldn’t be a problem”.
Yes, the tests passed, but was that before or after code changes were accepted? Did the database migration start after the tests had passed? What happened between a problem being introduced, and being discovered?
Why was that bad?
Now that we know the timeline, we can start to look for correlation and insight. So the tests passed in staging, is that because the system was OK in staging, because the tests missed a case, because the wrong version of the system ran in testing, or because of a false negative in the test run? Is it expected that this code change would have been incorporated into that migration?
The timeline showed us how events met our expectations (“we waited for a green test run before starting the deployment”) or didn’t (“the tests passed despite this component being broken”, “these two components were at incompatible versions”). Where expectations were not met, we had a problem, and used the Five Whys to ask what the most…problemiest…problem was that led to the observed effect.
What do we need to solve?
We came out of this process with nine different things that contributed to our deployment issue. Nine problems are a lot to think about, so which is the most important or urgent to solve? Which one problem, if left unaddressed, is most likely to go wrong again or will do most damage if it does?
More sticky things were deployed as we dot-voted on the issues we’d raised. Each member of the team was given three stickers to distribute to the one-three issues that seemed highest priority to solve: if one’s a stand-out catastrophe, you can put all three dots on that issue.
This focused us a great deal. After the dots were counted, one problem (gaps in our understanding of what changes went into the deployment) stood out above the rest. A couple of other problems had received a few votes, but weren’t as (un)popular: the remaining six issues had zero or one dot each.
I got one less problem without ya
Having identified the one issue we wanted to address, the remaining question was how? What shall we do about it? The team opted to create a light-weight release checklist that could be used in deployment to help build the consistent view we need of what is about to be deployed. We found that we already have the information we need, so bringing it all into one place when we push a change will not slow us down much while increasing our confidence that the deployment will go smoothly.
A++++ omnishambles; would calamity again
The team agreed that going through this process was a useful activity. It uncovered some process problems, and helped us to choose the important one to solve next. More importantly, it led us to focus on what we as a team did to get to that point and what we could do to get out of it, not on what any one person “did wrong” and on finding someone to blame.
Everyone agreed that we should be doing more of these root cause analyses. Which I suppose, weirdly, means that everybody’s looking forward to the next big problem.