There are 5 steps in the Incident Response Cycle:
Response Cycle - Step 1: Optimizing your Alerting
Response Cycle - Step 2: Notifying the right people at the right time
Response Cycle - Step 3: Assessing the impact of an incident
Response Cycle - Step 4: Resolving Incidents Through Collaboration
Response Cycle - Step 5: Preventing problems from recurring
Addressing each step in the incident response cycle will ultimately help your team drive down incident resolution times by (1) focusing on actionable alerts, (2) getting the right people to review those alerts, and (3) understanding the full impact of the problem so that (4) the appropriate remediation steps can be taken and (5) reviewed to make sure the same problem does not happen again.
In this article, we will discuss Step 4: Resolve.
What does “Resolve” mean?
Resolving means coordinating a response, collaborating with others, and surfacing as much information as possible to try to resolve an incident quickly. When you maximize the efficiency of your team to resolve problems faster in a controlled and meaningful way, you decrease downtime in your organization and minimize overall customer impact so that your team can focus on building features and maintaining the infrastructure that keeps your business running.
What kinds of common business or operational challenges come up when we do not optimize how we resolve incidents?
- Key stakeholders don’t have visibility into critical issues that impact them or their customers.
- We don’t have enough information in our alerts to fix the problem.
- We don’t have a standardized way of coordinating with other teams to fix an issue.
- We have trouble getting people up to speed during an incident.
Key stakeholders don't have visibility into critical issues that impact them or their customers.
Oftentimes, the teams responsible for resolving an incident are also responsible for notifying key stakeholders about the status of an ongoing incident. Without any specific tooling in place to make this happen, some may resort to manually emailing distribution lists or publishing updates in Slack. However, these processes are manual and can take precious time away from focusing on actually resolving the incident. By building in automated processes to keep stakeholders up to speed, teams can dedicate more of their time to incident resolution rather than incident “public relations”.
Feature: Webhooks (Chat and Status Pages)
PagerDuty allows you to configure webhooks on your services to post automated updates about an incident’s status. Details surrounding the incident will be sent to a URL that you specify, such as Slack, HipChat, a Status Page, or your own custom PagerDuty webhook processor.
For example, let’s say you’ve integrated one of your business-critical services with the customer support team’s Slack channel. Every time an incident is triggered on your service, a PagerDuty notification is pushed immediately into that Slack channel, telling support that a problem has been detected and that they need to go all hands on deck to manage angry customer tickets. Once somebody on your team acknowledges the incident, support will know that the incident is being actively investigated. When either you or your monitoring system resolves the PagerDuty incident, support will get another update via Slack letting them know that the issue was resolved, so that they can tell their customers that the problem has been remediated.
This kind of automated workflow requires no human effort to push incident status updates to teams, meaning more time can be dedicated to resolving an incident.
Tip: You can add multiple webhooks to a single PagerDuty service, automatically pushing PagerDuty incident status updates to multiple chat channels or status pages at the same time.
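If you point a webhook at your own custom processor rather than a built-in chat integration, the trigger / acknowledge / resolve flow described above could be sketched roughly as follows. This is a minimal illustration, not PagerDuty's implementation: it assumes the v3 webhook payload shape (an "event" object with an "event_type" and a "data" object carrying "title" and "html_url"), and the function name and message wording are hypothetical. Check the payload format for the webhook version your account uses before relying on any field name.

```python
from typing import Optional

# Assumed event-type names (v3 webhook format); verify against your account.
STATUS_MESSAGES = {
    "incident.triggered": "A problem has been detected",
    "incident.acknowledged": "The incident is being actively investigated",
    "incident.resolved": "The issue has been resolved",
}


def format_status_update(payload: dict) -> Optional[str]:
    """Turn a webhook payload into a one-line chat update, or None to skip it."""
    event = payload.get("event", {})
    prefix = STATUS_MESSAGES.get(event.get("event_type"))
    if prefix is None:
        # Event types that stakeholders do not need to see are ignored.
        return None
    data = event.get("data", {})
    title = data.get("title", "Untitled incident")
    url = data.get("html_url", "")
    return f"{prefix}: {title} {url}".strip()
```

The returned string could then be posted to whichever chat channel or status page your processor manages; unknown event types return None so that the channel only sees the three status changes support cares about.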
Read more about webhooks
Feature: Stakeholder notifications
PagerDuty stakeholder notifications allow you to subscribe individuals and teams to a PagerDuty incident and publish updates to them directly from the PagerDuty incident.
Subscribers will receive a notification from PagerDuty every time an update is published, and they can also open the status page of the incident in PagerDuty to see the full incident details.
This is a great way to keep stakeholders informed about the finer details of an incident without having to navigate away from the PagerDuty web app, which you might be using to track incident notes and coordinate a response (add responders) to resolve the incident quickly.
Read more about adding and notifying subscribers
We don't have enough information in our alerts to fix the problem.
It becomes difficult for an incident responder to remediate a problem when the alert they are looking at lacks the contextual information needed to understand the full scale of the problem or the necessary troubleshooting steps. This kind of delay increases the overall time it takes to resolve an incident, putting a strain on your systems, team, and your customers.
Feature: Custom incident actions
Custom incident actions give a user assigned to an incident a quick way to run custom logic housed outside of the PagerDuty system. Teams can host their own scripts that run when these actions are triggered, which provides users with a nearly limitless number of possible actions.
A “Restart Server” action is a good example of how you can use custom incident actions to run immediate tasks that remediate an issue or roll back code deployments. If additional information is needed to understand the full scale of the problem, though, you can create a custom action called “Add Diagnostics” that fetches diagnostic data on the affected infrastructure and appends the data as a note on your PagerDuty incident. This gives your team visibility into key infrastructure data right in their PagerDuty incidents.
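As a rough sketch of what the script behind an “Add Diagnostics” action might do, the example below formats some diagnostic values as a note and prepares a request to the REST API endpoint for incident notes. Everything here is illustrative: the function names, the host name, and the metrics are hypothetical, the way your script collects diagnostics is up to you, and the request is only built, not sent. It assumes the REST API v2 notes endpoint, token authentication, and a From header containing a valid user's email address.

```python
import json
import urllib.request

API_BASE = "https://api.pagerduty.com"


def format_diagnostics_note(host: str, metrics: dict) -> str:
    """Render diagnostic values as a short plain-text note, sorted by metric name."""
    lines = [f"Diagnostics for {host}:"]
    for name in sorted(metrics):
        lines.append(f"- {name}: {metrics[name]}")
    return "\n".join(lines)


def build_note_request(
    incident_id: str, note: str, api_token: str, from_email: str
) -> urllib.request.Request:
    """Build (but do not send) a request that appends a note to an incident."""
    body = json.dumps({"note": {"content": note}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/incidents/{incident_id}/notes",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token token={api_token}",
            "Content-Type": "application/json",
            "From": from_email,  # email address of the user adding the note
        },
    )
```

To actually post the note, the prepared request would be passed to urllib.request.urlopen with a real API token; keeping the formatting and request-building steps separate makes the script easy to test without touching a live account.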
Read more about custom incident actions
Feature: Add-ons
Add-ons can introduce functionality to your account that is outside of the core product, but is still hosted within PagerDuty. Add-ons can be used to embed additional information into an incident (the content must be able to display in an iFrame), which can help your team resolve incidents faster.
For example, PagerDuty’s status page can be used as an add-on. Add-ons can be added as a separate page within the PagerDuty web app, or they can be embedded within your PagerDuty incidents.
By being able to access additional, contextual information in the web app or within the incident details, an incident responder can pull the data they need to quickly resolve an incident.
Read more about add-ons
Feature: Rich incidents - Contexts (for custom API integrations)
If you have set up a custom API integration with PagerDuty, you can use the “Contexts” parameter to embed additional information in your incidents such as graphs, images, or links to wikis to help your team access the information that they need to fix problems faster.
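A trigger event that carries contexts might be assembled along these lines. This is a sketch that assumes the older generic Events API (v1) format, where "contexts" is a list of "link" and "image" objects; the function name, link text, and argument names are hypothetical, and newer Events API versions structure links and images differently, so confirm the format against the integration you are using.

```python
def build_trigger_event(
    service_key: str, description: str, runbook_url: str, graph_url: str
) -> dict:
    """Build a trigger event whose contexts attach a runbook link and a graph image."""
    return {
        "service_key": service_key,  # integration key of the target service
        "event_type": "trigger",
        "description": description,
        "contexts": [
            # A link context appears as a clickable link on the incident.
            {"type": "link", "href": runbook_url, "text": "Runbook"},
            # An image context embeds a graph or screenshot in the incident.
            {"type": "image", "src": graph_url, "alt": "Latency graph"},
        ],
    }
```

The resulting dictionary would be serialized as JSON and sent to the Events API endpoint by your integration; responders then see the runbook link and graph alongside the incident details.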
Feature: Adding descriptions to services
Use the “Description” field on your services to add links to wikis or runbooks. These service descriptions are visible in the web app, the mobile app, and in PagerDuty email notifications for quick and easy access during an incident.
We don't have a standardized way of coordinating with other teams to fix an issue.
In some cases, an incident cannot be resolved without bringing in reinforcements; that is, other teams that may also be impacted by a single incident or development leads who have the skill set to help resolve an issue. Looking up and dialing a phone number or messaging someone in Slack not only cuts into your time, but also may prove ineffective if a phone number is out of date or you can’t reach that person and need to figure out who to manually reach out to next.
PagerDuty has functionality that helps automate this process by using its reliable notifications and built-in escalations to bring the right people together during an incident. The following are some of the features that you can use in PagerDuty to help your team create a standardized process for coordinating with other teams during an incident.
Feature: Adding responders
With just a few clicks, PagerDuty allows you to add individuals and escalation policies to incidents, with notifications sent to each individual via email, phone call, text message, and push notification. Users can include a personalized message that provides the right contextual information about the problem, giving the added responders the relevant details to join the incident and begin collaborating on the issue.
By being able to trigger mass notifications to individuals and team escalation policies, PagerDuty can help you orchestrate a response around your incident faster than before, automating manual processes that directly impact resolution times.
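Teams that want to script this step rather than use the web app could prepare a responder request body along the following lines. Treat this as a sketch under assumptions: it follows the body shape of the REST API's responder request endpoint for an incident as best understood here (a requester ID, a message, and a list of user or escalation policy targets), the function name and IDs are hypothetical, and the exact field names should be verified against the current API reference before use.

```python
from typing import Iterable


def build_responder_request(
    requester_id: str,
    message: str,
    user_ids: Iterable[str] = (),
    escalation_policy_ids: Iterable[str] = (),
) -> dict:
    """Build a request body asking users and escalation policies to join an incident."""
    targets = []
    for user_id in user_ids:
        targets.append(
            {"responder_request_target": {"id": user_id, "type": "user_reference"}}
        )
    for policy_id in escalation_policy_ids:
        targets.append(
            {
                "responder_request_target": {
                    "id": policy_id,
                    "type": "escalation_policy_reference",
                }
            }
        )
    return {
        "requester_id": requester_id,
        "message": message,  # personalized context shown to the added responders
        "responder_request_targets": targets,
    }
```

The message field plays the same role as the personalized message in the web app: it tells added responders why they are being pulled in before they join.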
Read more about adding responders
Feature: Response bridge
PagerDuty can also help bring people together around an incident by disseminating conference bridge details within incidents. Getting this information out to multiple responders ASAP is critical to getting the right people in a virtual room to talk through an incident and collaborate on a faster resolution. Not only that, but including conference bridge information within an incident allows responders to view the full context of the incident and all of its notes to understand what’s going on before joining the discussion on the bridge.
Tip: When conference bridge details are included in an incident, they are also displayed in Slack.
We have trouble getting people up to speed during an incident.
Live chat and conference bridges are great for interactive conversations during an incident, but when you are trying to get new responders up to speed, it’s time-consuming to repeat everything that has already been discussed or to have each new responder scroll through a chat conversation.
Feature: Response notes
With response notes in PagerDuty, you can document all actions taken and decisions being made so that new people involved in incidents can quickly get up to speed and contribute to your investigation to resolve your incidents faster.