Response Cycle - Step 3: Assessing the impact of an incident

August 6, 2024

PagerDuty Community Team
Community Manager 💚

There are 5 steps in the Incident Response Cycle:

Response Cycle - Step 1: Optimizing your Alerting
Response Cycle - Step 2: Notifying the right people at the right time
Response Cycle - Step 3: Assessing the impact of an incident
Response Cycle - Step 4: Resolving Incidents Through Collaboration
Response Cycle - Step 5: Preventing problems from recurring

Addressing each step in the incident response cycle will ultimately help your team drive down incident resolution times by (1) focusing on actionable alerts, (2) getting the right people to review those alerts, and (3) understanding the full impact of the problem so that (4) the appropriate remediation steps can be taken and (5) reviewed to make sure the same problem does not happen again.

In this article, we will discuss Step 3: Assess.

What does "Assess" mean?
Assessing means understanding the full scale and impact of an alert or widespread incident. It answers the question “How bad is it?” and determines how the alert or incident should be resolved. Will you need to coordinate a response and collaborate with other teams? Are there stakeholders that you will need to notify? Or is this a one-off issue that you can remediate without causing ripple effects throughout the business and other systems? Being able to answer these questions quickly using contextual information from PagerDuty will help guide your next steps toward incident resolution.

What kinds of common business or operational challenges come up when we cannot assess the impact of an incident?

We have a hard time assessing the blast radius of an incident.
We don’t understand which business critical services are impacted.
We don’t understand which alerts are related to the same incident.

We have a hard time assessing the blast radius of an incident.

IT Operations professionals today require infrastructure-wide context to effectively remediate incidents, decrease non-actionable alerting, and continuously improve incident management capabilities. With the proliferation of microservice architectures, applications are rapidly growing in complexity and are generating ever more telemetry. These trends compound the difficulty in gaining broad operations awareness and understanding business impact. As a result, incident responders often lack visibility into the blast radius of incidents.

Feature: Infrastructure health application
PagerDuty’s Infrastructure Health Application provides a visual overview of all of the alert clusters across the services and hosts in your IT infrastructure. These visualizations not only aid in incident response, but also help you improve the overall health and performance of your applications. During an incident, you can use them to quickly assess the scale of the issues at hand. For example, is a single service down? Or are you facing a multi-service cascading incident that will require additional teams and resources to be marshaled?
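The "single service or multi-service cascade?" question can be illustrated with a small sketch. This is not how the Infrastructure Health Application works internally; it only shows the kind of summary a responder is looking for. The alert shape (`service` and `host` keys) and the function name `summarize_blast_radius` are assumptions made up for this example.

```python
from collections import Counter


def summarize_blast_radius(alerts):
    """Count open alerts per service and flag incidents that span services.

    `alerts` is a list of dicts with hypothetical "service" and "host" keys.
    """
    per_service = Counter(alert["service"] for alert in alerts)
    hosts = {alert["host"] for alert in alerts}
    return {
        "services": dict(per_service),
        "host_count": len(hosts),
        "multi_service": len(per_service) > 1,
    }


# Hypothetical open alerts during an incident.
alerts = [
    {"service": "checkout", "host": "web-1"},
    {"service": "checkout", "host": "web-2"},
    {"service": "payments", "host": "pay-1"},
]
summary = summarize_blast_radius(alerts)
print(summary)
```

If `multi_service` comes back true, that is a hint that more than one team may need to be pulled in.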

We don't understand which business critical services are impacted.

As IT teams transition from a monolithic structure to an environment with several microservices that are being monitored by a diverse set of specialized tools, it becomes more difficult to correlate events and translate data streaming from multiple sources. Because of this, teams spend more time decoding their alerts to understand exactly which part of their infrastructure and services is being impacted.

Feature: Service groups
PagerDuty lets you group multiple integrations under a single service, so that your services in PagerDuty mirror the way they exist in your environment. When your PagerDuty services are representative of the actual services and systems that a team is responsible for, it becomes easier to see what parts of your infrastructure are directly impacted by an incident. This reduces the time it takes to look at an alert and assess what part of your business is at stake. Armed with this knowledge, you’ll be able to focus on remediating the service that is being impacted or take steps to collaborate with other teams to orchestrate the appropriate response.

We don't understand which alerts are related to the same incident.

If you have redundant monitoring systems configured, or a single point of failure or degradation that sets off a domino effect across multiple tools, then you have probably experienced an alert storm: a flood of alerts caused by a single problem. When several disparate alerts arrive at the same time during an incident, it becomes increasingly difficult to sift through the noise and figure out which alerts are related to the same incident, which in turn makes it harder to assess the root cause of the problem.
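To make the alert-storm problem concrete, here is a minimal sketch that clusters alerts by arrival time. It is a simple time-window heuristic chosen for illustration, not PagerDuty's grouping logic. The `ts` (seconds) and `summary` fields, the function name `cluster_by_time`, and the 120-second window are all assumptions.

```python
def cluster_by_time(alerts, window_seconds=120):
    """Group alerts that arrive within window_seconds of the previous alert."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        # Extend the current cluster if this alert is close enough to the last one.
        if clusters and alert["ts"] - clusters[-1][-1]["ts"] <= window_seconds:
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters


# Hypothetical storm: three related alerts, then an unrelated one much later.
storm = [
    {"ts": 0, "summary": "db latency high"},
    {"ts": 30, "summary": "api 5xx rate"},
    {"ts": 75, "summary": "checkout errors"},
    {"ts": 900, "summary": "disk usage warning"},
]
clusters = cluster_by_time(storm)
for cluster in clusters:
    print([alert["summary"] for alert in cluster])
```

Time proximity alone will misgroup unrelated alerts that happen to fire together, so treat the clusters as candidates for a human to confirm.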

Feature: Merging incidents
PagerDuty allows you to merge incidents together, grouping alerts under a single incident object to enable true end-to-end incident management.

Responders no longer get paged on individual, siloed symptoms. Instead, resolution workflows are now centered around an incident object that is truly representative of a real, service-impacting problem or outage. This capability redefines how customers can intelligently triage and interact with the data from their systems to reduce noise, improve cross-functional collaboration, and drive down resolution times.
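Merging can also be done programmatically. The sketch below only builds the HTTP request and does not send it. It assumes the PagerDuty REST API v2 merge endpoint (`PUT /incidents/{id}/merge` with a `source_incidents` list) and the usual token, `Accept`, and `From` headers; please verify these against the current API reference before relying on them. The incident IDs, token, and email are placeholders, and `build_merge_request` is a name invented for this example.

```python
import json
import urllib.request


def build_merge_request(target_id, source_ids, api_token, from_email):
    """Build (but do not send) a request that merges source incidents into target_id."""
    body = {
        "source_incidents": [
            {"id": source_id, "type": "incident_reference"}
            for source_id in source_ids
        ]
    }
    return urllib.request.Request(
        url=f"https://api.pagerduty.com/incidents/{target_id}/merge",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            # Assumed header shapes; check the API reference.
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
            "From": from_email,
        },
    )


# Placeholder values for illustration only.
request = build_merge_request(
    "PTARGET", ["PSRC1", "PSRC2"], "YOUR_API_TOKEN", "responder@example.com"
)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen(request)` would actually send it, so only do that with real IDs and a valid token.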

Read more about merging incidents
