What to do when shit hits the fan

One day I found myself looking at a giant spike of crash logs in our monitoring tool. A quarter of all users hit a crash on launch when they tried to start the new version of our app. We had made a mistake in migrating users to a new data model. The one-star reviews came in, angry tweets started appearing all over the place and customer support calls shot up. I found myself in a classic “shit hits the fan” moment.

Everybody encounters moments like this every once in a while. They can be quite stressful and might seriously piss off the client you work for. Going through mistakes like this, I learned that you should never jump to a “no mistakes allowed” style of working. You can’t avoid all mistakes; what matters is how you handle them. If you do that right, you might even end up with a compliment instead of an angry person on the other end of the line.

It’s not that hard. Three basic steps could do the trick:

  1. Communicate
  2. Fix
  3. Learn

When everything explodes, the most important thing is to start communicating and keep communicating. Explain to the client what you’re doing, and help everybody understand the cause and impact of the issue. That way all parties know what they can do in the meantime, such as what to tell users, how to work around the problem for now and when to expect a fix.

After informing your customers, fix the issue as soon as possible. Scramble your team and give them the physical and mental space to solve the root cause of the issue. Don’t ask for an update every minute or linger around their desks, but keep communicating with each other and check whether you can do anything to smooth the process. Focus on creating a good fix, not a fast fix. You don’t want to accidentally break more.

Once everything is up and running again, learn what caused the situation in the first place. The “5 Whys” method is a nice way to figure out what the root cause was. Learning why things broke down helps you understand how they can be prevented. Implementing a safeguard, or even just being aware of the cause, really helps avoid making the same mistake twice.

After this session I like to summarize the issue in a single-page document describing what happened, why it happened and how we resolved it. I share it with the team, the client and the rest of the company so everybody can learn from it. Being open about it helps other teams avoid the same pitfalls.

By communicating, fixing and learning, you get things fixed as fast as possible, keep everybody in the loop and turn mistakes into learning experiences. One day you may even receive a compliment from a client instead of anger for screwing things up. In the end, clients and users know that everybody makes mistakes sometimes.