On January 24, 2014, the world ended. You may remember that fateful day if you were among the 42 million people affected. That was the day Google crashed.
Gmail, Google Drive, Google Calendar, and Google+ services went down around 11 a.m. PST. The world panicked, awash in a tidal wave of tweets, all calling for the end of times—for half an hour, anyway, until order was restored and the web giant was back on its feet.
The unfortunate reality about running a business is that serious outages like these are always a looming possibility.
Keeping a website running involves so many moving parts that something is bound to slip once in a while.
As with many things, more important than the fumble is how you recover. Given a serious enough outage, writing a status follow-up is one of the better ways to get things back on track.
The What and Why of Outage Updates
Commonly called “postmortems,” outage updates are in-depth, publicly available technical reports about a downtime event, most often posted to a status page. They provide a timeline of when the website started having problems, as well as what was done to fix each issue.
There are a number of reasons why they are well worth your time.
Better understanding for the DevOps team. Conducting a thorough postmortem will give your DevOps team more information on how your infrastructure performs under certain conditions. The reasons for a website going down are rarely surface level; often a combination of several issues is responsible for the outage.
A thorough investigation will give you the knowledge to better prepare for—and ultimately prevent—the same issue happening again in the future.
Re-instill trust in your customers. When your site or app is down for an extended period of time, your customers will lose some amount of faith that your team has the technical skills needed to keep everything available and working. At some point, if customers think you will continue to have downtime issues, they will consider switching to a competitor.
A well-written update can help you rekindle trust lost with customers during an outage.
Writing is challenging, and writing an outage update is no different, but there are a few guidelines you can follow to make yours clear, coherent, and useful.
Writing a Great Outage Update
A good update does three things: apologizes for your screw-up, demonstrates knowledge of exactly what happened and why, and presents a plan to help ensure that the same issue doesn't happen again.
Your downtime has probably thrown a wrench in your customers’ day, and they’re rightfully frustrated. Apologize directly and mean it. Don’t say, “We’re sorry for any inconvenience you may be having.” That phrasing deflects blame onto the customer, and insincere talk will be spotted from a mile away. These are people who want to continue supporting your company, and they deserve personal, authentic, and sincere language.
The next thing your outage update needs to do is demonstrate that you know exactly what happened and why it happened. A confusing explanation hurts your reputation in two ways: it makes you seem incompetent, and it leaves the customer with the impression that you just don’t care. If you did, you would have found a way to communicate clearly.
The amount of detail to include in your update depends on your audience. If you’re an e-commerce company, you’ll want to be extra careful not to bog customers down with confusing jargon. If, however, your customers are developers, they’ll appreciate hearing the nitty-gritty of what happened: details like graphs, a timeline of events, and specific steps you took to remedy the problem.
Side note: if the reason you were down is an upstream provider that your infrastructure is built on, this is not an opportunity to condemn them or absolve yourself of any responsibility. It was your choice to use their service and to build your stack in such a way that you had this point of failure.
Owning up to your mistake never includes blaming other people for your problems.
Plan to Avoid Future Issues
Your customers want to know what you're going to do to ensure that the same problem doesn't happen again in the future. This is an opportunity for you to detail the fixes, changes, or updates you’re going to make and your timeline for doing so.
A secondary benefit of putting this in writing is that it keeps you accountable for actually fixing the things you said you would.
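The three parts described above (an apology, an explanation of what happened, and a plan) can be treated as a checklist for a draft. The following is only a rough sketch of that idea, not a prescribed format: the `OutageUpdate` class, its field names, and the plain-text layout produced by `render` are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class OutageUpdate:
    """Hypothetical container for the three parts of an outage update."""
    apology: str
    what_happened: str
    # Optional (time, event) pairs; mostly useful for technical audiences.
    timeline: list[tuple[str, str]] = field(default_factory=list)
    # Concrete fixes, ideally each with a target date.
    prevention_plan: list[str] = field(default_factory=list)

    def missing_parts(self) -> list[str]:
        """Return the names of required parts that are still empty."""
        parts = {
            "apology": self.apology.strip(),
            "what_happened": self.what_happened.strip(),
            "prevention_plan": self.prevention_plan,
        }
        return [name for name, value in parts.items() if not value]

    def render(self) -> str:
        """Lay the update out as plain text (layout is an assumption)."""
        lines = [self.apology, "", "What happened", self.what_happened]
        lines += [f"  {time}  {event}" for time, event in self.timeline]
        lines += ["", "What we are doing about it"]
        lines += [f"  - {step}" for step in self.prevention_plan]
        return "\n".join(lines)
```

Calling `missing_parts()` on a draft before publishing is a cheap way to catch an update that apologizes but never says what will change.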
When Is an Outage Update Required?
One question I hear a lot is, "How serious does an incident need to be to require an update to customers?" The answer is: it depends. Both the situation and the service are factors.
When and why you issue a postmortem depends on the service you provide and the number of people who rely on you. Services such as Google and Amazon Web Services are expected to function 100% of the time. If AWS has any downtime at all, the companies built on it can become completely unavailable to their own customers. In these cases, even a minor incident or a couple of minutes of downtime warrants a postmortem.
Returning to the Google example, their services were interrupted for 30 minutes, prompting Ben Treynor, Google's vice president of engineering, to write an elegant follow-up for everyone affected. The apology was well received and the public was generally grateful for the detailed report.
Important, But Non-Critical Services
We use Wistia to host all of our marketing and knowledge-base videos. The availability of those videos to help our users is important, but the impact of an outage for Wistia isn’t the same as the crippling situation you’re put in when AWS is down.
Services such as Wistia should write an outage update after a downtime of 30 minutes or more.
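The rule of thumb above can be written down as a small policy so that nobody has to debate it during an incident. This is a minimal sketch under stated assumptions: the `Tier` names and the `needs_outage_update` function are hypothetical, the 30-minute threshold is the one given above, and the 2-minute threshold is just one reading of "a couple minutes." Adjust both to fit your own service.

```python
from datetime import timedelta
from enum import Enum


class Tier(Enum):
    # Customers' own products stop working when you are down (AWS-style).
    CRITICAL = "critical"
    # Valuable, but customers can keep operating without you (Wistia-style).
    IMPORTANT = "important"


# Minimum downtime that triggers a public outage update, per tier.
# The CRITICAL value is an assumed interpretation of "a couple minutes".
THRESHOLDS = {
    Tier.CRITICAL: timedelta(minutes=2),
    Tier.IMPORTANT: timedelta(minutes=30),
}


def needs_outage_update(tier: Tier, downtime: timedelta) -> bool:
    """Return True when the downtime meets or exceeds the tier's threshold."""
    return downtime >= THRESHOLDS[tier]
```

Under these thresholds, a 10-minute outage calls for an update from a critical service but not from an important, non-critical one.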
An Outage Update That Nailed It
In December 2014, domain management service DNSimple experienced an extended outage due to a sizable DDoS attack. Their website—and subsequently their customers’ websites—became unavailable for several hours at a time.
This was a huge deal, as many of their customers were completely unable to function that day. Those customers expected a postmortem from DNSimple that detailed exactly what happened, why it happened, and what they were going to do about it.
The apology. The first step to making amends is apologizing. You have to fall on your sword and accept responsibility for the problem.
Anthony, their CEO, does a great job of this. He doesn’t deflect blame, and he accepts responsibility himself early on in the postmortem. It feels authentic—it isn’t overly dramatic, but it acknowledges that many customers were let down by an event DNSimple needed to own up to.
What happened and why? DNSimple makes a technical product. A technical product has technical customers, and technical customers expect technical postmortems. There’s always a risk of including excessive details, but most people underestimate what they need to explain.
This postmortem does a great job of cutting right to what happened and why. They talk about the scale of the DDoS attack and how and why the equipment they had on hand was not able to handle the traffic that followed.
We’re working hard to avoid another incident. You can never (and shouldn’t ever) guarantee that downtime isn’t going to happen again in the future. What you can do, however, is commit to taking steps that will help you avoid making the same mistake twice.
Anthony communicates this by talking about a third-party service they’re going to start using. He also details a few features they plan on adding that will provide a workaround for their customers should they be affected by a DDoS attack again in the future.
The Right Thing to Do
Downtime is a sad reality of running a web service. Even the best of us experience it, and it can put a dent in our otherwise favorable reputations. A well-written outage update helps mitigate the negative repercussions that come with an extended outage.
Your customers are rooting for you. Don’t leave them hanging.