Response Cycle - Step 5: Preventing problems from recurring

August 6, 2024

PagerDuty Community Team
Community Manager 💚

There are 5 steps in the Incident Response Cycle:

Response Cycle - Step 1: Optimizing your Alerting ↗️
Response Cycle - Step 2: Notifying the right people at the right time ↗️
Response Cycle - Step 3: Assessing the impact of an incident ↗️
Response Cycle - Step 4: Resolving Incidents Through Collaboration ↗️
Response Cycle - Step 5: Preventing problems from recurring ↗️

Addressing each step in the incident response cycle will ultimately help your team drive down incident resolution times by (1) focusing on actionable alerts, (2) getting the right people to review those alerts, and (3) understanding the full impact of the problem so that (4) the appropriate remediation steps can be taken and (5) reviewed to make sure the same problem does not happen again.

In this article, we will discuss Step 5: Prevent.

What does “Prevent” mean?
Preventing means reviewing metrics, uncovering trends and patterns, and making improvements to your infrastructure, monitoring thresholds, or team performance. When you make proactive changes to your incident response process and identify ways to stop technical problems from resurfacing, your team can spend less time firefighting and more time focusing on the projects that will help your business grow and stay competitive.

What kinds of common business or operational challenges come up when we do not focus on prevention?

We need to understand how we should be adjusting our monitoring/alerting thresholds.
We need to understand how teams/individuals are performing so they can improve.
We need to understand which events need to be re-routed to different teams.
We need a process for summarizing and learning from past incidents.

We need to understand how we should be adjusting our monitoring/alerting thresholds.

When a major incident happens with limited warning signals, it may be time to review how your monitoring and alerting thresholds are configured to detect issues before they become a major problem. On the flip side, if you have too many alerts triggering that are contributing to unnecessary noise during an incident or are not contextual enough to help your team predict a problem, then you will also want to review how your alerts are configured.

Feature: Spreadsheet template
PagerDuty’s Google Spreadsheet Template allows you to visualize and aggregate your incident data to help answer questions like:

What are my most common incidents?
Which service is the noisiest?
What kinds of alerts are flapping?
At what times are we triggering the most incidents?

With this information, you can start to paint a picture of your monitoring environment by identifying alerting gaps before major incidents occur or noise that can be controlled or enriched to empower your team with the information they need to stop a small problem from becoming a major incident.
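The same questions can also be asked of exported incident data outside a spreadsheet. The following is only a minimal sketch, not the template's actual logic: it assumes each incident is a record with hypothetical `title`, `service`, and `created_at` (ISO 8601) fields, which may not match the column names in your own export.

```python
from collections import Counter
from datetime import datetime


def summarize_incidents(incidents):
    """Count incidents by title, by service, and by hour of day.

    The field names used here are hypothetical; rename them to match
    whatever your exported incident data actually contains.
    """
    by_title = Counter(inc["title"] for inc in incidents)
    by_service = Counter(inc["service"] for inc in incidents)
    by_hour = Counter(
        datetime.fromisoformat(inc["created_at"]).hour for inc in incidents
    )
    return by_title, by_service, by_hour


# Made-up sample rows, for illustration only.
SAMPLE_INCIDENTS = [
    {"title": "High CPU", "service": "api", "created_at": "2024-08-01T02:15:00"},
    {"title": "High CPU", "service": "api", "created_at": "2024-08-01T02:40:00"},
    {"title": "Disk full", "service": "db", "created_at": "2024-08-02T14:05:00"},
    {"title": "High CPU", "service": "api", "created_at": "2024-08-03T02:55:00"},
]

by_title, by_service, by_hour = summarize_incidents(SAMPLE_INCIDENTS)
print("Most common incident:", by_title.most_common(1))
print("Noisiest service:", by_service.most_common(1))
print("Busiest hour of day:", by_hour.most_common(1))
```

An incident title that repeats many times on one service within a short window is a reasonable first hint of a flapping alert, though confirming that requires looking at the individual alerts.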

Read more about how to use this spreadsheet template

We need to understand how teams/individuals are performing so they can improve.

The tooling that you have is only as powerful as the people who use it to improve on their processes. Even when you’ve got the right thresholds configured on your monitoring stack, your efforts to maintain and manage your services are futile if you lack the right team response. This is why many teams incorporate response and resolution times as important operational metrics to track their performance.

Feature: System, Team, and User reports
With PagerDuty’s System, Team, and User reports, you can view response (mean time to acknowledge) and resolution (mean time to resolve) metrics per individual or team to get a holistic view of how incidents are being managed and where resources need to be allocated or trained.

System report: Visualizes incident load across escalation policies (teams) and services to highlight trends in your incident load and prioritize engineering work.
Team report: Visualizes incident response and resolution against your incident load to identify where teams need support.

User report: Understand how individuals are responding to incidents to see if you have a resource problem (too many incidents escalated), a training problem (responders manually escalating/reassigning incidents or not acknowledging), or an alerting problem (several incidents being reassigned because they don’t belong to the right team).

With easy access to individual and team metrics, you can work to implement the appropriate changes within your team’s organization or response workflow to prevent major incidents from becoming a resource or performance problem.
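To make the two metrics concrete, here is a rough sketch of how mean time to acknowledge and mean time to resolve could be computed per team from raw incident timestamps. It is not how the reports themselves are implemented; the `team`, `created_at`, `acknowledged_at`, and `resolved_at` field names are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def _seconds_between(start, end):
    """Return the elapsed seconds between two ISO 8601 timestamps."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()


def mean_times_by_team(incidents):
    """Compute mean time to acknowledge/resolve (in seconds) per team.

    Field names are hypothetical. Incidents with an empty acknowledged_at
    or resolved_at value are skipped for that metric.
    """
    ack_times = defaultdict(list)
    resolve_times = defaultdict(list)
    for inc in incidents:
        if inc.get("acknowledged_at"):
            ack_times[inc["team"]].append(
                _seconds_between(inc["created_at"], inc["acknowledged_at"])
            )
        if inc.get("resolved_at"):
            resolve_times[inc["team"]].append(
                _seconds_between(inc["created_at"], inc["resolved_at"])
            )
    teams = set(ack_times) | set(resolve_times)
    return {
        team: {
            "mtta_s": mean(ack_times[team]) if ack_times[team] else None,
            "mttr_s": mean(resolve_times[team]) if resolve_times[team] else None,
        }
        for team in teams
    }


# Made-up sample data, for illustration only.
SAMPLE_TEAM_INCIDENTS = [
    {"team": "payments", "created_at": "2024-08-01T09:00:00",
     "acknowledged_at": "2024-08-01T09:01:00", "resolved_at": "2024-08-01T09:10:00"},
    {"team": "payments", "created_at": "2024-08-01T11:00:00",
     "acknowledged_at": "2024-08-01T11:03:00", "resolved_at": "2024-08-01T11:20:00"},
    {"team": "search", "created_at": "2024-08-01T12:00:00",
     "acknowledged_at": "", "resolved_at": "2024-08-01T12:05:00"},
]

for team, metrics in sorted(mean_times_by_team(SAMPLE_TEAM_INCIDENTS).items()):
    print(team, metrics)
```

A team with a missing or very high acknowledge time but a normal resolve time may point to a notification or training issue rather than a capacity issue.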

Read more about the system and team reports

We need to understand which events need to be re-routed to different teams.

If the wrong people are being notified about an event, then the time to resolve that problem is delayed as teams try to figure out who is supposed to be the owner of the problem. Not only that, but it impacts the team that was wrongly notified as time is taken away from them completing projects to focus on re-routing an incident to the appropriate team.

Feature: User report and Google spreadsheet template
Using a combination of the User report and the Google spreadsheet template, you can:

Download a list of incidents that were reassigned (meaning an incident was reassigned from a user to somebody else). You can do this by navigating to the User report and clicking on the number/percentage of incidents for a user under the “Reassignments” column.

From here, you can view the list of incidents from the web app and click into each incident to view the full incident details and timeline, or you can download a CSV file of those incidents and import them into the Google spreadsheet template for further analysis to identify trends or patterns in the incidents that are being reassigned away from users to somebody else.

By analyzing these reassigned incidents, you can start to identify events that may need to be re-routed to different teams or to different users on your team with the skill set to own and resolve those problems immediately.
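If you prefer to look for reassignment patterns in a script rather than a spreadsheet, the idea can be sketched as follows. The CSV column names (`service`, `reassigned_from`, `reassigned_to`) and the `min_count` threshold are hypothetical; the actual export may use different headers.

```python
import csv
import io
from collections import Counter


def rerouting_candidates(csv_text, min_count=2):
    """List (service, new_owner, count) for frequent reassignments.

    A service whose incidents keep landing with the same new owner is a
    candidate for routing there directly. Column names are assumptions.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    moves = Counter((row["service"], row["reassigned_to"]) for row in reader)
    return [
        (service, target, count)
        for (service, target), count in moves.most_common()
        if count >= min_count
    ]


# Made-up export contents, for illustration only.
SAMPLE_CSV = """incident_id,service,reassigned_from,reassigned_to
1,checkout,alice,db-team
2,checkout,alice,db-team
3,checkout,bob,db-team
4,search,alice,bob
"""

for service, target, count in rerouting_candidates(SAMPLE_CSV):
    print(f"{service}: reassigned to {target} {count} times")
```

One-off reassignments fall below the threshold and are ignored, so the output highlights only the recurring patterns worth a routing change.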

We need a process for summarizing and learning from past incidents.

Many operationally mature teams will write up a postmortem report after a major incident to describe what went wrong, how the problem was solved, and what they’re going to do to make sure the problem doesn’t happen again. These reports are not only a great way to reflect and learn from a past incident, but they are also used in some organizations to internally publicize the full impact of an issue.

However, this process is time-consuming, and even the most operationally mature teams end up running out of time to finish these reports or don’t bother writing them at all.

Feature: Postmortem builder
PagerDuty’s postmortem builder enables teams to create postmortem reports directly in PagerDuty by automatically bringing together PagerDuty incident information and chat messages (from Slack and HipChat) to create a timeline of events for analysis and documentation. No copy/pasting is involved, and teams are able to standardize their postmortem reports with customizable templates. Learn more about the postmortem builder.
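For teams assembling a timeline by hand, the underlying idea of interleaving incident events and chat messages by timestamp can be sketched as below. This is an illustration only and says nothing about how the postmortem builder works internally; the `at`, `summary`, `author`, and `text` fields are hypothetical.

```python
from datetime import datetime


def build_timeline(incident_events, chat_messages):
    """Merge incident events and chat messages into one ordered timeline.

    Both inputs are lists of dicts with a hypothetical ISO 8601 "at" field.
    Returns human-readable lines sorted by timestamp.
    """
    merged = [
        (datetime.fromisoformat(e["at"]), "incident", e["summary"])
        for e in incident_events
    ]
    merged += [
        (datetime.fromisoformat(m["at"]), "chat", f'{m["author"]}: {m["text"]}')
        for m in chat_messages
    ]
    merged.sort(key=lambda item: item[0])
    return [f"{ts:%H:%M} [{source}] {text}" for ts, source, text in merged]


# Made-up sample data, for illustration only.
SAMPLE_EVENTS = [
    {"at": "2024-08-06T10:00:00", "summary": "Incident triggered"},
    {"at": "2024-08-06T10:12:00", "summary": "Incident resolved"},
]
SAMPLE_CHAT = [
    {"at": "2024-08-06T10:05:00", "author": "sam", "text": "Rolling back deploy"},
]

for line in build_timeline(SAMPLE_EVENTS, SAMPLE_CHAT):
    print(line)
```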
