There are 5 steps in the Incident Response Cycle:
Addressing each step in the incident response cycle will ultimately help your team drive down incident resolution times by (1) focusing on actionable alerts, (2) getting the right people to review those alerts, and (3) understanding the full impact of the problem so that (4) the appropriate remediation steps can be taken and (5) reviewed to make sure the same problem does not happen again. In this article, we will discuss Step 1: Optimize.
What does "Optimize" mean?
Optimizing means configuring your monitoring and alerting so that only actionable events receive immediate attention and non-actionable events are categorized differently or not surfaced at all. Optimizing the kinds of alerts your team is notified about translates into less time spent working out how to prioritize alerts and more time focused on the business-critical events that directly affect your uptime and performance.
What kinds of common business or operational challenges come up when we do not optimize our alerting?
- We get too many non-actionable alerts that we don't care about.
- We have a hard time separating high-urgency alerts from low-urgency alerts.
- We have multiple redundant alerts that are triggered for the same issue.
We get too many non-actionable alerts that we don't care about.
When there are too many non-actionable alerts flooding your team’s inboxes and phones, more time is spent reviewing non-urgent alerts, delaying the detection of a more critical one or even letting the critical alerts completely fall through the cracks.
As a best practice, only actionable alerts and events should notify your team, which means removing non-actionable events from your team’s view. This will allow your team to focus on the alerts and events which are mission critical to your business.
PagerDuty has 2 key features that help teams focus on actionable alerts:
- Email filters
- Suppression event rules
Feature: Email filters
If you have set up an email integration on your PagerDuty service, then use email filters to filter out emails that are non-actionable alerts. With regular expressions, you can control what kinds of emails should trigger actionable incidents and notifications based on the from address, the subject line of the email, or the body of the email.
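As a rough illustration of the idea above, the sketch below accepts an email only when every configured regular expression matches its field. The `Email` and `EmailFilter` classes, the addresses, and the patterns are all hypothetical and exist only for this example; they are not PagerDuty's implementation, and the real filters are configured in the service's integration settings.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Email:
    from_address: str
    subject: str
    body: str


@dataclass
class EmailFilter:
    """Hypothetical filter: a field left as None places no restriction."""
    from_regex: Optional[str] = None
    subject_regex: Optional[str] = None
    body_regex: Optional[str] = None

    def accepts(self, email: Email) -> bool:
        """Return True only if every configured pattern matches its field."""
        checks = [
            (self.from_regex, email.from_address),
            (self.subject_regex, email.subject),
            (self.body_regex, email.body),
        ]
        return all(
            re.search(pattern, text) is not None
            for pattern, text in checks
            if pattern is not None
        )


# Assumed policy: only emails from the monitoring host whose subject mentions
# CRITICAL or DOWN should open an incident; the rest are non-actionable.
actionable = EmailFilter(
    from_regex=r"@monitoring\.example\.com$",
    subject_regex=r"\b(CRITICAL|DOWN)\b",
)

alert = Email("nagios@monitoring.example.com", "CRITICAL: db01 disk full", "...")
digest = Email("nagios@monitoring.example.com", "Weekly report", "...")

for message in (alert, digest):
    verdict = "trigger" if actionable.accepts(message) else "ignore"
    print(f"{message.subject!r}: {verdict}")
```

Requiring all configured patterns to match keeps the filter strict; a looser policy could accept an email when any one pattern matches instead.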
Read more about email filters
Feature: Suppression event rules
Suppression event rules are the equivalent of email filters for API integrations. With suppression event rules, you can suppress alerts sent via the API from triggering an incident and notifying the on-call. This means more control over events that are sent via the API so that only the most important events are surfaced as relevant issues that need to be fixed by your team.
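The behaviour described above can be pictured with a small sketch: an incoming event is checked against a list of rules, and a match means it is suppressed rather than turned into an incident. The `SuppressionRule` class, the `route` function, and the event field names are assumptions made for illustration, not PagerDuty's rule syntax or API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SuppressionRule:
    """Hypothetical rule: match when a field contains a substring."""
    field: str
    contains: str

    def matches(self, event: Dict[str, Any]) -> bool:
        # Case-insensitive substring check; a missing field never matches.
        value = str(event.get(self.field, ""))
        return self.contains.lower() in value.lower()


def route(event: Dict[str, Any], rules: List[SuppressionRule]) -> str:
    """Return 'suppress' if any rule matches, otherwise 'trigger'."""
    return "suppress" if any(rule.matches(event) for rule in rules) else "trigger"


rules = [
    SuppressionRule(field="severity", contains="info"),
    SuppressionRule(field="description", contains="test alert"),
]

events = [
    {"description": "CPU at 55% on web01", "severity": "info"},
    {"description": "Database unreachable on db01", "severity": "critical"},
]

for event in events:
    print(event["description"], "->", route(event, rules))
```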
Read more about suppression event rules.
NOTE: Suppression event rules can only be used by enabling a setting on your service to create alerts and incidents. Turning on this setting at the service level is not recommended if your service is set up with any one of these bi-directional integrations. If you are not sure whether to turn on this setting, please contact email@example.com
We have a hard time separating high-urgency alerts from low-urgency alerts.
When it’s hard to distinguish alerts that need to be addressed now versus alerts that can be looked at later, it becomes difficult for users to prioritize their work. This may result in low-level issues being worked on first while a more urgent issue exponentially grows into a larger problem.
With PagerDuty’s urgencies, you can clearly identify which alerts are high-urgency and which are low-urgency, allowing users to prioritize which events are worked on immediately and which should be monitored and reviewed on the side. Ultimately, this means critical events can be surfaced and identified more quickly, resulting in faster response and resolution times for high-urgency events.
To bring in both high- and low-urgency alerts for one of your services (no matter the time of day), you will need to create 2 separate services.
The urgency levels of incidents are clearly displayed in the web and mobile apps, with high-urgency incidents taking priority over low-urgency incidents in the list views.
Notifications for high- and low-urgency incidents are also customizable under each user’s profile, which means that users don’t need to be bothered in the middle of the night when a low-urgency event is detected.
Read more about urgencies
We have multiple redundant alerts that are triggered for the same issue.
Some monitoring solutions are configured to repeatedly notify you about an event, either as part of their built-in settings or as part of a configurable setting that somebody set up for your team. Instead of having to reconfigure those settings, you can use PagerDuty’s features to bundle redundant alerts into one PagerDuty incident. This means fewer alerts in your inbox and more time to focus on resolving the source of the problem. Fortunately, one of these features is built into PagerDuty and requires no configuration.
Feature: Incident keys
PagerDuty uses “incident keys” to de-duplicate multiple alerts and bundle them under one incident. Many of PagerDuty’s built-in integrations are pre-configured with a defined incident key based on the event details passed by tools such as NewRelic or AppDynamics. This means that by default, PagerDuty will de-duplicate your redundant alerts and immediately help reduce the alert noise for your team without requiring any manual reconfiguration efforts on your monitoring systems.
If you are building a custom integration using PagerDuty’s Events API, then you will also be able to use incident keys to de-duplicate incidents. Simply pass the same incident key with each event that you want to de-duplicate.
If multiple events are sent with the same incident key, they will be bundled under the same incident and will be viewable in the Timeline for that incident.
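To make the de-duplication behaviour concrete, here is a toy in-memory model of it: events that share an incident key are appended to one incident's timeline, and only the first one opens a new incident. The `IncidentStore` class and the key format are hypothetical and do not reflect how PagerDuty stores incidents; a real integration would send the key as part of each Events API request.

```python
from typing import Dict, List


class IncidentStore:
    """Toy in-memory model of de-duplication by incident key."""

    def __init__(self) -> None:
        # Maps incident key -> timeline of event descriptions.
        self.incidents: Dict[str, List[str]] = {}

    def receive(self, incident_key: str, description: str) -> bool:
        """Record an event; return True if it opened a new incident."""
        is_new = incident_key not in self.incidents
        self.incidents.setdefault(incident_key, []).append(description)
        return is_new


store = IncidentStore()
opened = [
    store.receive("db01/disk", "Disk usage 91% on db01"),
    store.receive("db01/disk", "Disk usage 93% on db01"),  # same key: bundled
    store.receive("web01/cpu", "CPU saturated on web01"),
]

for key, timeline in store.incidents.items():
    print(f"{key}: {len(timeline)} event(s)")
```

In this sketch the key combines a host and a check name, so repeated notifications about the same disk land in one incident while an unrelated CPU problem opens its own.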
Feature: Email management rules
You can also apply the same incident key feature for incidents that trigger via email.
By default, all email integrations de-duplicate based on the subject line of the email (the 2nd option below). However, you can also configure your integration to trigger an incident on every new email (1), only if there are no open incidents (3), or based on custom rules (4).
If triggering based on custom rules, you will be able to define your own custom incident key based on the subject or body of the email.
These rules give you more control over what types of emails trigger incidents and how additional emails should be de-duplicated under the same PagerDuty incident to prevent redundant alerts from triggering and distracting your on-call person as they work to resolve the problem.
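One way to picture a custom incident key is as a regular expression that pulls a stable identifier out of the email. In the sketch below, a PROBLEM email and a later RECOVERY email have different subjects, so subject-based de-duplication would treat them as separate, while a key extracted from the host name groups them together. The function name, the `source` parameter, and the subject format are assumptions for illustration only.

```python
import re
from typing import Optional


def incident_key_from_email(
    subject: str, body: str, pattern: str, source: str = "subject"
) -> Optional[str]:
    """Extract a key from the subject or body using a regular expression.

    Returns the first capture group if the pattern defines one, otherwise
    the whole match, or None when the pattern does not match.
    """
    text = subject if source == "subject" else body
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)


# Hypothetical monitoring emails about the same host.
pattern = r"Host: (\S+)"
key_problem = incident_key_from_email("PROBLEM Host: db01 is DOWN", "", pattern)
key_recovery = incident_key_from_email("RECOVERY Host: db01 is UP", "", pattern)

print(key_problem, key_recovery, key_problem == key_recovery)
```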
Read more about triggering incidents using email management rules.
Read more about resolving incidents using email management rules.