Help needed using PagerDuty as an alerts hub.

data
rest-api
questions
howto

#1

Hi all,

I am planning a shift in the alert-management process in my organization, and I would like to share my plan for using PagerDuty and get some feedback.

Please let me know if you identify holes in the plan, or anything that might be difficult to accomplish, or if you have any suggestions.

Of course, I’m happy to answer any questions.


My current solution for managing alerts from various sources

We have monitoring set up in NewRelic, Pingdom, Anodot, Grafana, Sensu, and Sentry.
Most alerts in these systems are configured to notify a specific Slack channel.
For example, a NewRelic alert “Slow Response Time” is configured to notify the #penguins-team channel in Slack.

Our internal Slack bot detects when these alert messages arrive in Slack, writes them to a database “as is”, and adds response buttons (‘Acknowledge’, ‘Resolve’) to the Slack messages. When somebody clicks one of the buttons, we update the corresponding entry in the database.
Our internal status page queries this database to reflect the current state of production.

We have ~800 potential responders to alerts.

Problems with the current solution
  • Alert data coming from the monitoring systems is not processed; it is saved “as is” (in Slack message format), so analyzing these alerts is very difficult.
  • The solution relies on the availability of Slack.
  • There is no standard way to set up alerts.
  • Other issues…

Goals:

  • Stop using the solution described above and replace it with a more robust, extensible, and resilient one.
  • Route all the alerts directly to PagerDuty, and from there to the appropriate Slack channels.
  • No escalation at this point (due to a bad signal-to-noise ratio).
  • Be able to query and analyze the alerts in real time to reflect production in a Status Page.
  • Every Slack user in the organization can respond to alerts (by clicking the “Ack” or “Ignore” buttons).

The Plan

For each service, use the monitoring system’s API to:

  1. Find all the alerts that are routed directly to a specific Slack channel.
  2. Re-route these alerts to PagerDuty using the integration key created for the service.
  3. Create a polling mechanism to query PagerDuty about active alerts (a sketch follows below).
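
For step 3, a polling sketch against the PagerDuty REST API might look something like the following (the API token is a placeholder, and the polling interval, persistence, and error handling are left out):

```python
# Hedged sketch of step 3: poll PagerDuty for incidents that are still open.
# YOUR_REST_API_KEY is a placeholder, not a real token.
import requests

PAGERDUTY_API_KEY = "YOUR_REST_API_KEY"
BASE_URL = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": f"Token token={PAGERDUTY_API_KEY}",
    "Accept": "application/vnd.pagerduty+json;version=2",
}

def fetch_active_incidents(service_ids=None):
    """Return all incidents that are still triggered or acknowledged."""
    params = {"statuses[]": ["triggered", "acknowledged"], "limit": 100, "offset": 0}
    if service_ids:
        params["service_ids[]"] = service_ids
    incidents = []
    while True:
        resp = requests.get(f"{BASE_URL}/incidents", headers=HEADERS, params=params)
        resp.raise_for_status()
        body = resp.json()
        incidents.extend(body["incidents"])
        if not body.get("more"):
            return incidents
        params["offset"] += len(body["incidents"])

if __name__ == "__main__":
    for incident in fetch_active_incidents():
        print(incident["id"], incident["status"], incident["title"])
```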

Waiting to read your thoughts and ideas.

Thanks,
Dima


(Thomas Roach) #2

After you configure PagerDuty to integrate with your monitoring tools (and Slack), you can use our REST API to request information about open incidents.

Of the tools you mentioned, the only one we don’t have an out-of-the-box integration with is Grafana, but as long as Grafana can send emails or make API calls, it can integrate with PagerDuty. You can read more about integrating with PagerDuty, and view specific integration guides, here: https://www.pagerduty.com/integrations/
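
For example, any tool that can make an HTTP call can create incidents through the Events API v2, using the integration key of the target Service as the routing key. A minimal sketch (the routing key is a placeholder):

```python
# Hedged sketch: send a trigger event to the PagerDuty Events API v2.
# YOUR_SERVICE_INTEGRATION_KEY is a placeholder for the Service's integration key.
import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_SERVICE_INTEGRATION_KEY"

def trigger_alert(summary, source, severity="warning", dedup_key=None):
    """Trigger a PagerDuty event; returns the dedup_key used to acknowledge/resolve it later."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,  # one of: critical, error, warning, info
        },
    }
    if dedup_key:
        event["dedup_key"] = dedup_key
    resp = requests.post(EVENTS_URL, json=event)
    resp.raise_for_status()
    return resp.json().get("dedup_key")

# e.g. relay an alert from a tool without a native integration:
# trigger_alert("Slow Response Time on checkout", source="grafana", severity="error")
```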

Information about our REST API can be found here: https://v2.developer.pagerduty.com/v2/docs/getting-started, and you can view our API Reference documentation to see how API calls are made: https://v2.developer.pagerduty.com/v2/page/api-reference#!/API_Reference/get_api_reference

If you have any more specific inquiries as you’re attempting to set up your integrations, you can always direct them to support@pagerduty.com!


#3

Thanks, @tom. I will definitely be in touch :hugs:


(Simon Fiddaman) #4
Of the tools you mentioned, the only tool we don’t have an out-of-the-box integration with is Grafana

Luckily, Grafana implements PagerDuty event creation directly!

I’d recommend you set up separate PagerDuty Services for each kind of thing, wherever it makes sense to draw divisions, rather than lumping everything together in one Service (which is what we did originally, due to laziness and/or misunderstandings about the policy and/or because that’s what I set up for people; see point 1).

This gives you a bit more flexibility and control, but takes more time to set up (it is possible to programmatically create your Services and Integrations, but not the Extensions, i.e. the outbound Slack integration). You can do things like put a Service into Scheduled Maintenance and prevent alerts from being generated.
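
As a rough illustration of the programmatic part, creating a Service and an Events API v2 Integration through the REST API could look something like this (the API token and escalation policy ID are placeholders, and error handling is omitted):

```python
# Hedged sketch: create a PagerDuty Service and attach an Events API v2 integration.
# YOUR_REST_API_KEY and EP_ID are placeholders.
import requests

HEADERS = {
    "Authorization": "Token token=YOUR_REST_API_KEY",
    "Accept": "application/vnd.pagerduty+json;version=2",
}

def create_service(name, escalation_policy_id):
    """Create a Service and return its ID."""
    body = {
        "service": {
            "name": name,
            "escalation_policy": {
                "id": escalation_policy_id,
                "type": "escalation_policy_reference",
            },
        }
    }
    resp = requests.post("https://api.pagerduty.com/services", headers=HEADERS, json=body)
    resp.raise_for_status()
    return resp.json()["service"]["id"]

def add_events_integration(service_id, name="Events API v2"):
    """Attach an Events API v2 integration and return its integration key."""
    body = {"integration": {"type": "events_api_v2_inbound_integration", "name": name}}
    resp = requests.post(
        f"https://api.pagerduty.com/services/{service_id}/integrations",
        headers=HEADERS,
        json=body,
    )
    resp.raise_for_status()
    return resp.json()["integration"]["integration_key"]

# e.g. one Service + integration key per team/tool pairing:
# key = add_events_integration(create_service("Checkout - NewRelic", "EP_ID"))
```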

If you make these Services very specific, you can actually bind PagerDuty Services (note: not a specific alert/incident in the Service, but any of them) to StatusPage. We use this for a couple of 3rd-party services that are provided to us and monitored by us, but not supported by us, where we want to notify stakeholders (including the 3rd-party support team) about any issues but not actively resolve them ourselves. We have a separate PagerDuty Service (Low Urgency) for each individual service, mapped to a single (sub)component in StatusPage, with generic/shared StatusPage templates. It’s working well for us for this use case, but because the rest of our Services are full of everything, we don’t go any further than this.

Each Service only really makes sense with a single Extension (e.g. one Slack integration), so you’ll at least need to divide Services by the responsible teams.

NOTE: You’ll get better reporting out of PagerDuty if all of the possible acknowledgers/resolvers are PagerDuty users, but it is possible to leave the Slack Extension setting on “allow ALL Slack users to …”, in which case the Incident will show up as acknowledged by Slack user @xxxxxx. If configured, this will also be posted back to the Slack channel, along with the Resolved message (these are optional, as are reassignments).

For us, the responsibility for the on-call lies with the PagerDuty user, so we limit the Slack Extension to only allow PagerDuty users to ack/resolve via Slack (it prompts for an OAuth binding the first time each user tries this). Where we have 3rd parties who are responsible for some aspect, we could leave this open and set it up as you’ve described, allowing or encouraging them to perform the Acknowledgement (but hopefully not the manual Resolve!).

I was thinking about a Slack bot for listing active Incidents but never got around to it (generally, if you need to know about our alerts, you’re already in PagerDuty). One of our engineers did build his own PagerDuty interface with desktop and overhead-screen views, which is pretty cool. It’s unauthenticated and read-only, so it can be used and displayed anywhere internally.

Good luck!
@sfiddaman


#5

Thanks, @simonfiddaman for this very comprehensive reply. Much appreciated!

I would like to address a few points:

I’d recommend you setup separate PagerDuty Services for each kind of thing

Advice taken, thank you for that! My intention was to do the same, driven, likewise, by laziness. I believe that setting it up properly up front shouldn’t take much more effort, and the benefits of more granular control are worth it.

You’ll get better reporting out of PagerDuty if you have all of the possible acknowledger/resolvers in PagerDuty

Eventually, I do intend to have all responders in PagerDuty. The reason for not doing it now is that many teams in the organization currently have uncalibrated, false-positive, and otherwise non-actionable alerts that mostly generate clutter and noise, so it makes no sense to attach escalation policies to these teams at the moment.

My team is leading an effort to align these teams with good monitoring standards and DevOps practices. We decided to start by measuring all the alerts and responses, and providing the relevant teams with data about the quality of their production monitoring. We found that this is the best way to get their cooperation and avoid resistance. (This is the most difficult part of the whole project.)

I was thinking about a Slack bot for listing active Incidents

We have a Slack bot that does just that, but it is not currently integrated with PagerDuty. Once the alerts move to PagerDuty, I will rewrite it to support PagerDuty and release it as open source. Will keep you posted. :wink:
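
A rough sketch of the PagerDuty-facing part might look something like this (the Slack token, channel name, and the fetch_active_incidents() helper from the polling sketch above are assumptions, not our actual bot):

```python
# Hedged sketch: post a summary of active PagerDuty incidents to a Slack channel.
# SLACK_BOT_TOKEN and the channel name are placeholders; fetch_active_incidents()
# is the polling helper sketched earlier in this thread.
import os
import requests

SLACK_TOKEN = os.environ["SLACK_BOT_TOKEN"]
SLACK_CHANNEL = "#production-alerts"

def post_active_incidents_to_slack(incidents):
    """Post one line per active incident via Slack's chat.postMessage."""
    if not incidents:
        text = "No active incidents :tada:"
    else:
        text = "\n".join(
            f"*{i['status'].upper()}* {i['title']} ({i['html_url']})" for i in incidents
        )
    resp = requests.post(
        "https://slack.com/api/chat.postMessage",
        headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
        json={"channel": SLACK_CHANNEL, "text": text},
    )
    resp.raise_for_status()
    if not resp.json().get("ok"):
        raise RuntimeError(f"Slack API error: {resp.json().get('error')}")
```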

Thanks,
Dima

