Don’t panic! A playbook for managing any production incident
Knowing how to handle it when things break is more important than preventing things from ever breaking.
No matter how strong your organization is, how detailed their planning, how deliberate their deployments…things will break and emergencies will send your teams scrambling. Systems will go down, key functionality will stop working, and at some point in everyone’s career as a developer, an issue will call for all hands on deck.
The nature of these challenges evolves as time goes on, but some things stay consistent in how you view the challenges and how folks can work to make sure that you get back to good reliably. And to be clear, we aren’t talking about run-of-the-mill production bugs; we are talking about issues that are large and sweeping but at the same time delicate and brittle.
Having done more than my share of organizing and solving some of the big challenges organizations face when these events happen, I have a high-level playbook that I try to make sure my team follows when things fall apart. A lot of these expectations started to take shape during my first big outage as a developer. This helped me understand what people should do as developers, SREs, managers, and everything in between. And the cause of this first big outage: a brand new checkout process on an ecommerce site. All of these takeaways are applicable for folks at all levels, and hopefully can give some insights into what folks in other roles go through.
# Step 1. Don’t panic and identify your problem.
My very first “outage” came when I was a developer working on an application that had a new checkout process. It was nothing fancy, but like all applications at one point or another, this key piece of functionality stopped working with our newest launch. As if people not being able to check out and complete sales wasn’t bad enough, shopping carts lost items and product descriptions were showing up blank. Pieces that weren’t in scope, and that had never crossed our minds to test, stopped working. We immediately grabbed folks into a room to get to work and figure it out.
Our first instinct was “Quick, roll it back!” This was an understandable feeling to have: we had introduced problems, and naturally you want to take the problems away. But with quick actions come quick mistakes, and a seasoned senior developer stopped everyone from scrambling to ask the pertinent question: “Well, why isn’t it working?” In my mind I was screaming “Who cares! Our embarrassing mistake is out there for the whole world to see!” But the calm nature and analytical demeanor of this senior developer settled us down and assured us that what we were doing right now in that room was the right thing to do: ask questions and investigate.
# Step 2. Diagnose and understand the source(s) of your problem
This sounds like an obvious thing, but with concern and panic overtaking the team, not enough folks asked us why things were breaking. The senior engineer left the problem out in the wild for a full 30 minutes after we found it to make sure we knew why it wasn’t working. We checked and double-checked exception logs, we ran a few different checks with separate workflows, and we even checked whether there was anything odd at a systems level. After all, we had good development environments set up to replicate production, and things were breaking anyway, so double- and triple-checking ourselves became important. Retracing our steps with new context from the errors we were seeing let us look at each of them in a new light. Once we knew what we had done wrong, and had gathered enough confidence for the next time we release, we started our rollback. It is a delicate balance, but I learned to take every opportunity you can to investigate before a rollback, because the rollback takes away your best source of information for finding the root of your problem: the actual problem in the wild.
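One way to keep some of that information after the rollback is to copy the evidence somewhere safe first. The snippet below is only a minimal sketch of that idea, not what our team actually ran: the function name `snapshot_evidence`, the directory arguments, and the `label` parameter are all hypothetical, and it assumes your exception logs live in a directory on disk that you are allowed to read.

```python
import tarfile
import time
from pathlib import Path


def snapshot_evidence(log_dir: str, out_dir: str, label: str = "incident") -> Path:
    """Bundle everything under log_dir into a timestamped .tar.gz in out_dir.

    Hypothetical helper: the names and layout are assumptions for illustration.
    out_dir should live outside log_dir so the archive does not include itself.
    """
    source = Path(log_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"log directory not found: {source}")

    destination = Path(out_dir)
    destination.mkdir(parents=True, exist_ok=True)

    # UTC timestamp so snapshots taken during the incident sort in order.
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    archive_path = destination / f"{label}-{stamp}.tar.gz"

    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps paths inside the archive relative to the log directory.
        archive.add(source, arcname=source.name)

    return archive_path
```

Calling something like `snapshot_evidence("/path/to/logs", "/path/to/evidence", label="checkout")` right before the rollback (both paths are placeholders) leaves you with a copy of the logs to study in the retrospective. It does not replace investigating the live problem, but it softens what you lose when the broken release goes away.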
The same senior dev who was tempering our poorer instincts was the one who took “point” or tech lead during this time, while relying on our director to be the incident leader. You will hear many names for these roles, but they come down to someone who is technical and can help coordinate those efforts (usually a more senior developer) and someone who is responsible for communicating around it and giving air cover against those who may want to take time away from the fixes (usually a director or engineering manager). This is to protect the most valuable resource during a crisis: the time and focus of those who can actually implement the plan to fix.
The more technical person will be there to help set milestones and delegate or divvy up the work that needs to be done. The incident leader, as the role is often ironically named, is there to facilitate and not to dictate. I remember hearing from my mentor at the time that the best incident leaders asked two questions: “Where are we at?” and “What do you need?” The first so they could keep people off our backs, and the second so the last thing our engineers had to worry about was resources, including time.
# Step 3. Remediation: let’s start working on the problem.
We know we have a problem, we know the source of the problem, now let’s make a plan and fix it. We all love to jump to this step and go right into fixing it. Sometimes we have that luxury: for simple issues, the problem is so apparent that confirming and understanding its source, or sources, are very quick steps. But most times, if the problem has made it this far and is this impactful, we need to be more deliberate. Much like how we were potentially shooting ourselves in the foot by instinctually rolling things back too quickly, the same instinct to just fix it can come up.
This point person is going to help prioritize the work to do, find out where the biggest mitigation steps are, and make sure that other stakeholders have clear expectations of the impact. As a developer working on an issue, you also have a responsibility to hold this person accountable: make sure they give you the resources you need to help figure out the issue. This can be time, access, or other people who have answers you don’t. And this is an important theme throughout this phase: Give the engineers what they need to fix things. Arguably this should be a theme for all of engineering leadership, but it is never more pronounced than when things have gone down and vital workflows have gone silent.
When we were working on the checkout bug, the biggest piece missing was not information or other developers to help, but focus. This may sound odd, but I am willing to bet it is a familiar feeling to anyone who has been in the boat with leaders who are panicky or never understood the fallacies of the mythical man-month. The leaders were eager for progress updates, and what better way to get those updates than to get everyone in a meeting together four times a day to tell them how things had progressed. That means every two hours we lost 30 minutes, had to context switch, and had to update tracking sheets. When I told my tech lead about this, he immediately had the meeting moved down to once a day for developers, and optional at that. The speed gains from this alone were huge; being able to focus and remove distractions was the biggest factor in remediating the problem.
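To put rough numbers on that, here is a back-of-the-envelope sketch. The four meetings a day and the 30 minutes each come from the story above; the 8-hour workday and the `refocus_minutes` parameter (time spent getting back into the problem after each interruption) are assumptions added for illustration, and the function name `meeting_overhead` is hypothetical.

```python
def meeting_overhead(
    meetings_per_day: int,
    minutes_per_meeting: float,
    workday_hours: float = 8.0,    # assumption: an 8-hour working day
    refocus_minutes: float = 0.0,  # assumption: context-switch cost per meeting
) -> float:
    """Return the fraction of the workday lost to status meetings."""
    if workday_hours <= 0:
        raise ValueError("workday_hours must be positive")
    lost_minutes = meetings_per_day * (minutes_per_meeting + refocus_minutes)
    return lost_minutes / (workday_hours * 60)


# Four 30-minute meetings a day versus a single one.
before = meeting_overhead(meetings_per_day=4, minutes_per_meeting=30)
after = meeting_overhead(meetings_per_day=1, minutes_per_meeting=30)

print(f"before: {before:.2%} of the day")  # before: 25.00% of the day
print(f"after:  {after:.2%} of the day")   # after:  6.25% of the day
```

Even with no context-switch cost at all, four updates a day eat a quarter of an 8-hour day. Setting `refocus_minutes` to whatever you think an interruption really costs your team only widens the gap between the two schedules.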
# Step 4. Verification and learnings
If all goes well, tests are confirmed, and all the valuable information you got from steps 1 and 2 has led to confidence in your new test plans, you can move the fix out to production. Once it is out live, lean on your teammates in all departments to confirm and explore. Interestingly, I have found time and again that if patience and freedom are given to the engineers at the beginning of these incidents, there is a correlated confidence and calmness to the subsequent release and fix.
However, once the fix is out live and everyone feels good about the current state, your work is only half done. Now you need to make sure that the expensive, hard-earned lessons from this problem grow your whole organization. People often measure a good retrospective after big events like this by whether problems ever happen again, but that is plainly an unreasonable standard. Often I have found the best learning is how we can DEAL with problems better, not pretend like we can make them go away.
In the end for our checkout issue, it all came down to a missed release step by our deployment team. An honest mistake that can happen to anyone. This doesn’t mean we ignored the issue: we thought of adding redundancy or perhaps trying to automate certain bits more, but that wasn’t the best thing we learned. Our tech lead was focused less on preventing errors than on sharpening our ability to deal with them. Though they wanted to prevent future errors, they saw a lot more room to improve in how we respond to the error. What did we learn about engineer focus time? Where were we able to investigate quickly? Slowly? And even good questions outside of engineering, such as who was best at handling comms and what info did they need?
There have been shades of this outage throughout almost two decades of my career, and I have no doubt the days of having to deal with things like it are coming to a comfortable middle. But the themes of how to approach it, process it, and most importantly enable my team to tackle it tend to be the same.