Responding to Incidents


(Hailey Hickman) #1

This article walks through how to respond to an incident when you’re on call. First, make sure your user profile is set up to receive notifications. You’ll define your contact methods (where you want to be contacted) and your notification rules (how you want to be contacted).

Next, let’s review how incidents are created. Typically, a monitoring tool detects an issue and sends an event to PagerDuty. PagerDuty receives this event and creates an incident against a PagerDuty service. Tools can integrate with PagerDuty via API or email, or incidents can be triggered manually in the web app. If you are on call for the service with the triggered incident, you will be notified according to your notification rules.
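For tools that integrate via API, the event sent to PagerDuty is a small JSON document. The sketch below builds a trigger event in the shape used by the PagerDuty Events API v2. The routing key, summary, and host name are placeholders, and `build_trigger_event` is a hypothetical helper rather than part of any PagerDuty library. It only constructs the payload and does not send it; check the field names against the current Events API documentation before relying on them.

```python
import json

# Events API v2 endpoint that a monitoring tool would POST the event to.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical", dedup_key=None):
    """Build (but do not send) a trigger event for a PagerDuty service.

    routing_key is the integration key of the target service (placeholder here).
    """
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    # An optional dedup_key lets later events refer to the same incident.
    if dedup_key is not None:
        event["dedup_key"] = dedup_key
    return event


if __name__ == "__main__":
    event = build_trigger_event(
        routing_key="YOUR_INTEGRATION_KEY",  # placeholder
        summary="Disk usage above 95% on db-01",
        source="db-01.example.com",
    )
    print(json.dumps(event, indent=2))
```

A monitoring tool would send this JSON in an HTTP POST to `EVENTS_API_URL`. PagerDuty then creates the incident against the service that owns the integration key.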

Acknowledge or Reassign

  • When you receive the notification, the first step is to acknowledge the incident if you are going to be taking ownership. This will assign the incident to you and stop it from escalating to the next person on call.
  • If the incident belongs to another person or team, you can reassign it to a different escalation policy (or user), which restarts the notification process for that team.
  • Note: If you do not acknowledge the incident within the escalation timeout period, the incident will automatically escalate to the next person on call.
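The interaction between acknowledging, reassigning, and escalating can be sketched as a small state model. This is an illustrative toy, not PagerDuty’s implementation: the `Incident` class, its method names, and the user names are all hypothetical, and the escalation policy is simplified to an ordered list of on-call users.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    title: str
    escalation_levels: List[str]  # ordered on-call users (simplified policy)
    level: int = 0
    status: str = "triggered"

    @property
    def assignee(self) -> str:
        return self.escalation_levels[self.level]

    def acknowledge(self) -> None:
        # Acknowledging takes ownership and stops further escalation.
        self.status = "acknowledged"

    def escalate_on_timeout(self) -> None:
        # Called when the escalation timeout elapses. Only an incident that
        # is still unacknowledged moves to the next person on call.
        if self.status == "triggered" and self.level < len(self.escalation_levels) - 1:
            self.level += 1

    def reassign(self, new_levels: List[str]) -> None:
        # Reassigning restarts the notification process for the new team.
        self.escalation_levels = list(new_levels)
        self.level = 0
        self.status = "triggered"


if __name__ == "__main__":
    incident = Incident("API latency high", ["alice", "bob"])
    incident.escalate_on_timeout()   # alice did not acknowledge in time
    print(incident.assignee)         # bob
    incident.acknowledge()
    incident.escalate_on_timeout()   # no effect once acknowledged
    print(incident.assignee, incident.status)  # bob acknowledged
```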

Assess Impact and Assign a Priority

  • Once the incident has been acknowledged, determine actionable next steps as quickly as possible. You can assign a priority to the incident to indicate its impact. Priorities also increase visibility: incidents are color coded by priority, and the most critical ones move to the top of the incident dashboard view.
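The effect of priority on the dashboard ordering can be pictured with a short sketch. The labels `P1` through `P5` and the dictionary-based incidents are assumptions for illustration; the actual priority names depend on how your account is configured.

```python
# Hypothetical priority labels, most critical first.
PRIORITY_ORDER = {"P1": 0, "P2": 1, "P3": 2, "P4": 3, "P5": 4}


def sort_for_dashboard(incidents):
    """Return incidents with the most critical priority first.

    Incidents without a recognized priority sort after all prioritized ones.
    """
    return sorted(
        incidents,
        key=lambda inc: PRIORITY_ORDER.get(inc.get("priority"), len(PRIORITY_ORDER)),
    )


if __name__ == "__main__":
    incidents = [
        {"title": "Slow report export", "priority": "P3"},
        {"title": "Unlabeled alert", "priority": None},
        {"title": "Checkout is down", "priority": "P1"},
    ]
    for inc in sort_for_dashboard(incidents):
        print(inc["priority"], inc["title"])
```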

Coordinate a Response

  • Now that you’ve determined the priority, you may find you need assistance from other users or teams. To loop in additional help, you can Add Responders. A Responder is any individual who is directly involved with responding to and resolving an incident. When adding a responder, you will have the option to include a conference bridge number or video conferencing link.
  • You may also need to inform stakeholders. Stakeholders are often C-level executives concerned about the health of the company or a Support team interacting with customers during an outage. To notify these users or teams, you can subscribe them to an incident. Once subscribed to an incident, stakeholders can expect to receive all subsequent manual status updates on that incident.

Use an Automated Response Play for Common Issues or SEV-1 Outages

  • If you involve the same response team frequently, or if you have a standard response process for all SEV-1s, a manager can create a Response Play. Response Plays let you mobilize a multi-team response and subscribe affected stakeholders to incident updates, all with a single click.

Resolve the Incident

  • Once you are confident that service has been fully restored, resolve the PagerDuty incident and leave optional resolution notes.
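Resolution can also be scripted rather than done in the web app. The sketch below assembles, without sending, a request in the shape of the PagerDuty REST API’s incident update call. Treat the details as assumptions to verify against the current API reference: the incident ID, API token, and email address are placeholders, and `build_resolve_request` is a hypothetical helper.

```python
import json

REST_API_BASE = "https://api.pagerduty.com"


def build_resolve_request(incident_id, api_token, from_email, resolution_note=None):
    """Describe the HTTP request that would mark an incident as resolved.

    Returns a plain dict (method, url, headers, body); nothing is sent.
    """
    incident = {"type": "incident_reference", "status": "resolved"}
    # Resolution notes are optional, matching the step described above.
    if resolution_note:
        incident["resolution"] = resolution_note
    return {
        "method": "PUT",
        "url": f"{REST_API_BASE}/incidents/{incident_id}",
        "headers": {
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
            "From": from_email,  # the user performing the action
        },
        "body": json.dumps({"incident": incident}),
    }


if __name__ == "__main__":
    request = build_resolve_request(
        incident_id="PABC123",            # placeholder
        api_token="YOUR_API_TOKEN",       # placeholder
        from_email="oncall@example.com",  # placeholder
        resolution_note="Restarted the worker pool; latency back to normal.",
    )
    print(request["method"], request["url"])
    print(request["body"])
```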

Video

Watch a demo of this here!

In Review

When an incident is assigned to you, acknowledge it if you are taking ownership, or reassign it if it belongs to another team. Once it is acknowledged, you can add responders if the incident requires a multi-team response and notify business stakeholders by adding them as subscribers. Finally, resolve the incident once service has been fully restored.

Mobile App

All of these actions can be taken directly in the mobile app without ever needing to go to the web app. Check out this video: Responding to Incidents in the Mobile App