Incident Response: 5 Best Practices for the Modern Enterprise


By David Hayes

After our talks at Atlassian Summit Europe, ServiceNow’s Knowledge17, and the Gartner ITOS Summit, we’ve had a number of requests for this information, so we’ve assembled a PDF that’s ready to send to your boss. It’s the same information, but I find a printed PDF carries a little extra credibility. (You’ll also hear some of these points if you purchase our on-premises training.) You can find more about how we do response internally at response.pagerduty.com.


Incident Response vs ITSM
PagerDuty manages your unplanned work and ties it back to the planned work in your ITSM tool, like Jira, ServiceNow, or Remedy. You can’t add minutes back during an outage, so the key is prioritizing your planned work effectively:

  • Information flows from ITSM into PD so that responders know what has changed and who is reporting an impact.
  • Follow-up items from PD are sent back into ITSM: outcomes of the postmortem to be prioritized.

A given employee may have dozens of prioritized tickets in an ITSM tool, but they should only ever have one (ideally zero) assigned to them in PagerDuty at any given time, so they can focus on customer-impacting issues requiring an immediate response. Similarly, the concept of an unassigned incident doesn’t exist in PD: if there’s a problem, someone is responsible for it.
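As an illustration of the round trip described above, here’s a minimal sketch of turning postmortem follow-up items into ITSM backlog tickets. The ticket fields, labels, and incident ID are hypothetical, not a real PagerDuty or Jira schema:

```python
# Sketch only: push follow-up items from an incident's postmortem back
# into the ITSM backlog as planned work. Field names are illustrative.

def tickets_from_postmortem(incident_id, action_items):
    """Turn each postmortem action item into a prioritizable ITSM ticket."""
    tickets = []
    for item in action_items:
        tickets.append({
            "summary": item,
            "labels": ["postmortem-follow-up"],
            # Linking back to the incident keeps the planned/unplanned
            # work connection described above.
            "source_incident": incident_id,
        })
    return tickets

backlog = tickets_from_postmortem(
    "PD-1234", ["Add cache TTL alarm", "Purge logs weekly"])
```

In a real integration the ticket creation would go through your ITSM tool’s API; the point is that each follow-up item becomes ordinary planned work, traceable to the outage that produced it.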

Calculating your cost of downtime
Every business is different, but here’s a starter for calculating the financial damage per hour of outage:

  • “your annual revenue / 2040” — assuming that most of your revenue occurs during the workweek; “revenue / 8760” if you’re truly a 24/7 international shop, or somewhere in between. We use revenue instead of profit, as you’re still paying rent and salaries even when you’re down.
  • Double it, because your competitors took the sale. You may want to more than double it if you have a product where someone visits your site multiple times before buying, since you’ll have more chances to lose the sale.
  • Add in the cost of one person-day for each hour of response time, to account for randomization and follow-up. Double that number if you’re in a time crunch or the team is on the critical path for a release, since you’ve now added uncertainty that will ripple through the org.
  • Some incident types such as security breaches can have a far greater cost down the line.
  • Add 30% to account for service degradations, not just downtime (over 53% of mobile users abandon a site that takes more than 3 seconds to load).

In the Fortune 1000, the cost of outages can easily exceed $500,000/hour (source)
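The arithmetic above can be sketched as a small calculator. The revenue figure and per-person-day cost below are placeholder assumptions, not benchmarks:

```python
# Rough downtime-cost estimator following the steps above.
# All input figures are illustrative assumptions.

def downtime_cost_per_hour(annual_revenue, business_hours=2040,
                           competitor_multiplier=2.0,
                           person_day_cost=800,
                           degradation_uplift=0.30):
    """Estimate the financial damage of one hour of outage."""
    # Revenue lost per business-week hour (use 8760 for a true 24/7 shop).
    lost_revenue = annual_revenue / business_hours
    # Double it: assume competitors capture the missed sales.
    lost_revenue *= competitor_multiplier
    # Add one person-day of response/follow-up cost per hour of outage.
    cost = lost_revenue + person_day_cost
    # Add 30% to cover degradations, not just hard downtime.
    return cost * (1 + degradation_uplift)

cost = downtime_cost_per_hour(50_000_000)  # hypothetical $50M/year business
```

For a hypothetical $50M/year business this lands in the mid-five-figures per hour, which is consistent with the Fortune 1000 figure cited above once you scale revenue up.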

1: Proactive Mobilization

  • The easiest way to speed up your response is to start it sooner.
  • Track what affects your customers, not what affects your machines. Use Real User Monitoring to validate that users can successfully load, download, or buy your tool.
  • Monitor the underlying infrastructure too. You’re primarily looking to detect problems before they affect users (at the cost of some false positives) and to identify the cause of a customer-facing problem.
  • Automate your awareness: every problem should have an owner, and they should find out from your monitoring tools automatically. Display the status in your NOC, but if an issue is affecting your revenue, don’t wait for a person to notice before anyone acts. Automatically assign and notify someone about all issues above a certain priority, via their preferred communication methods (phone call, SMS, etc.). To make this easier, we integrate with hundreds of monitoring tools.
  • Automatically escalate from Sev4 to Sev3 to Sev2. If your tooling detects that the shopping cart has gone from slow to non-responsive, automatically raise the priority so the responder has all the information.
  • Automate response for Sev2 and Sev1. If you need more senior engineers for a Sev1, configure your tooling to do that automatically.
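The escalation ladder above can be sketched as a simple mapping. The latency thresholds here are hypothetical; in practice you’d configure this declaratively in your monitoring and alerting tooling rather than in application code:

```python
# Sketch of automatic severity escalation for a checkout flow.
# Thresholds are illustrative assumptions, not recommendations.

def severity_for(latency_seconds):
    """Map an observed checkout latency to a severity, or None if healthy."""
    if latency_seconds <= 1.0:
        return None  # within normal bounds, nothing to raise
    for max_latency, sev in [(5.0, "Sev4"),    # slightly slow: track it
                             (15.0, "Sev3"),   # degraded: notify on-call
                             (60.0, "Sev2")]:  # barely usable: widen response
        if latency_seconds <= max_latency:
            return sev
    return "Sev1"  # non-responsive: page senior engineers automatically

current_severity = severity_for(22.0)
```

As the same signal worsens, the incident’s priority rises without a human re-triaging it, which is exactly the “slow to non-responsive” escalation described above.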

2: Have a defined process
Define your process and clarify the different roles to remove ambiguity, confusion, and wasted time during a response. We recommend the following roles:
Incident Commander (plus a Deputy), Scribe, Customer Liaison, and of course Subject Matter Experts. You can find more detail at: https://response.pagerduty.com/before/different_roles/

Key things to remember:

  • Poll for strong objections: ask for objections, not consensus. “Sounds like it might be the size of the log file; any strong objections to purging the logs?” This ensures you don’t get stuck in inaction, waiting on non-urgent discussion and consensus-building.
  • Time-box and assign tasks to individuals: “Ok, we’re going to try and purge the logs. Eric will do it and report back if it takes longer than 5 minutes.”
  • Standardize lingo & etiquette: Ensure everyone knows when and how they can speak up.

Along with roles, the actual lingo and etiquette of the meeting are important. Keeping the tone and the discussion practical and focused on the issue, without emotion, is key to an effective response.

In an outage, the organizational hierarchy takes a backseat to the response roles. When executives start to randomize the defined process, we refer to that as “Swooping and Pooping” — the terminology is intentionally derisive and serves to highlight that you need communication that’s clear, concise and actionable.

3: Build your communication strategy
You need a process around communication, outside of the core response team, that is defined as well. Depending on the type of incident, you could be dealing with internal customers (we often call them stakeholders), external customers, and even the market at large. You could be responding to a security incident, which may require looping in legal and other executives.

All of these groups need to be kept up to speed (on an as-needed basis), but the wrong place to do that is where the responders are working. The last thing you want is someone joining the call and asking for status; this disrupts the people trying to fix the problem. And to the “Swooping and Pooping” point earlier, you don’t want an executive getting on a call and saying “fix this in 10 minutes.” That implies the team isn’t already working as quickly as it can; it’s demotivating and contributes nothing helpful to the response. Shielding responders from these interruptions is the job of the customer liaison.

  • Have a Conference Bridge for internal discussion. Humans are social animals and this seems to be the most natural format. Use the conference call tool that your users are already familiar with, as an outage is not the time to learn a new tool. Automatically attach the conference call information to major incidents.
  • Have a chat room for logging actions. This gives those jumping into a response an ability to get up to speed without asking repetitive questions, and provides a timestamped record of the response. Additionally, many companies are starting to trigger response actions directly from bots in the chat room.
  • Provide proactive, scheduled updates for your stakeholders. Give them an incident status page to stay up-to-date on relevant, real-time information. This is essential to prevent stakeholders’ urges to swoop and poop.
  • Decide ahead of time what criteria and timeline stakeholders should use to notify your customers or downstream users.
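The chat-ops pattern mentioned above, where bots trigger response actions directly from the chat room, can be sketched as a toy bot. The command names and action mappings are hypothetical:

```python
# Toy chat-ops sketch: the bot logs every message with a timestamp
# (the record responders and postmortems rely on) and maps recognized
# commands to actions. Command names are made up for illustration.
from datetime import datetime, timezone

ACTION_LOG = []  # timestamped record of everything said in the room

def handle_command(text):
    """Log the message; return the action it maps to, if any."""
    ACTION_LOG.append((datetime.now(timezone.utc).isoformat(), text))
    if text.startswith("!purge-logs"):
        return "purge_logs"   # would invoke the runbook automation
    if text.startswith("!status"):
        return "post_status"  # would post a stakeholder update
    return None               # plain discussion: logged, no action

action = handle_command("!purge-logs app-server-3")
```

The useful property is that the log and the action share one channel, so anyone joining mid-incident can replay exactly what was done and when.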

4: Postmortems
Postmortems are how you fix the long-term problem. They give people closure after a particularly stressful event, and ensure you take well-thought-out, productive follow-up action on the immediate patches you applied in the heat of the moment.

  • Focused on prevention and learning, you’re looking to understand what can be changed to avoid this issue in the future.
  • Transparent, blameless and apolitical. The goal is to get all the relevant information, and the last thing you want to do is foment grudges. Blame impedes information flow. The only acceptable blame is if you’ve uncovered an intentionally malicious employee, which is an exceedingly rare outcome.
  • Oriented around improvement of both the system’s resiliency and of the response process, with a goal of always getting better.
  • Targeting a root cause. We find the “five whys” helpful here.
  • Required for major incidents, and streamlined to save time. No one wants to do postmortems, but they are an essential tool to maximize the impact of your planned work. To make them easier, we’ve built a postmortem tool modeled on our customers’ existing processes. It can save you hours toggling between tools to collate information, as it automatically creates a timeline with relevant PagerDuty and chat activity.

We post all of our postmortems internally using our postmortem tool. We translate any customer-facing issues and post them on status.pagerduty.com, as we deeply value 100% transparency and our customers’ trust. Be completely open with your company and customers. We view this not only as learning for our team; if our customers can learn from our experiences, even better.

For more postmortem tips, visit our detailed ebook.

5: Training and Practice

Do the training and practice. You can’t expect your incident response process to be fantastic if you only use it once in a while. Not every service fails often, and some people get more practice than others. But everyone should be practiced, so that when something does happen, you are ready.

  • Shadowing comes in a few flavors, but traditionally new responders shadow an experienced responder for a few weeks, then the experienced responder shadows them for a few more. A tool like PagerDuty makes it easy for overwhelmed responders to pull in help. One of our braver customers starts everyone on-call solo — if a new hire can’t figure it out from the runbook, they can add their mentor as a responder, and over time the percentage of incidents they need help with drops.
  • Record your outages to use for training. These recordings are a goldmine, helping teams understand what actually happens in real failure scenarios. They are also useful for your postmortems.
  • Pre-mortems (“if this broke, what would I look for?”) are valuable not only as a training exercise, but also for identifying places where you can add monitoring for root causes or pre-emptive warning. For instance, if you’d check network latency first because it might cause an outage, monitor it and trigger a Sev3 if it goes out of bounds — even if the app isn’t affected.
  • Chaos engineering is probably beyond most organizations at this time, but we get a lot of mileage out of our Failure Fridays.
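The pre-mortem example above can be sketched as a check on the suspected root cause. The latency budget, percentile choice, and Sev3 mapping are illustrative assumptions:

```python
# Sketch: watch a suspected root cause (network latency) and raise a
# low-severity incident when it drifts out of bounds, before users are
# affected. Budget and field names are hypothetical.

LATENCY_BUDGET_MS = 250  # assumed acceptable round-trip time

def check_latency(samples_ms, budget_ms=LATENCY_BUDGET_MS):
    """Return a Sev3 alert dict if p95 latency exceeds the budget."""
    ordered = sorted(samples_ms)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]  # nearest-rank p95
    if p95 > budget_ms:
        return {"severity": "Sev3",
                "summary": f"p95 latency {p95}ms exceeds {budget_ms}ms budget"}
    return None  # within bounds, nothing to raise

alert = check_latency([120, 140, 150, 400, 900])
```

The alert fires on the leading indicator your pre-mortem identified, so a responder investigates while the customer-facing symptom is still a Sev3 rather than an outage.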

What is the result?
The less time you spend on unplanned outages, the happier your customers are thanks to better-performing services, and the more time you spend innovating and out-competing.

Customer-impacting incidents are arguably the worst thing that can happen to a business. They damage brand reputation, cause huge losses in customers and revenue, inhibit employee productivity, and hurt morale, among other things. If you can get to the point where you are as efficient as possible, responding to major incidents without chaos and stress, and with the attitude that you’ll learn and improve from each and every one, you will have achieved a winning and empowering culture that delights both your customers and employees.