I am planning a change to the alert management process in my organization, and I would like to share my plan for using PagerDuty and get some feedback.
Please let me know if you spot holes in the plan, anything that might be difficult to accomplish, or if you have any suggestions.
Of course, I’m happy to answer any questions.
My current solution for managing alerts from various sources
We have monitoring set up in NewRelic, Pingdom, Anodot, Grafana, Sensu and Sentry.
Most alerts in these systems are configured to notify a specific Slack channel.
For example, a NewRelic alert “Slow Response Time” is configured to notify the #pengiuns-team channel in Slack.
Our internal Slack bot detects when these alert messages are posted to Slack, writes them to a database “as is”, and adds response buttons (‘Acknowledge’, ‘Resolve’) to the Slack messages. Once somebody clicks one of the buttons, we update the entry in our database.
Our internal status page queries this database to reflect the state of production.
We have ~800 potential responders to alerts.
Problems with the current solution
- Alert data coming from the monitoring systems is not normalized; it is saved “as is” (in Slack message format), which makes analyzing these alerts very difficult.
- The solution depends on Slack being available.
- There is no standard way to set up alerts.
- Other issues…
Goals for the new solution
- Stop using the solution described above and replace it with a more robust, extensible, and resilient one.
- Route all the alerts directly to PagerDuty, and from there to the appropriate Slack channels.
- No escalation at this point (due to a bad signal-to-noise ratio).
- Be able to query and analyze the alerts in real time to reflect production in a Status Page.
- Every Slack user in the organization can respond to the alerts (by clicking the “Ack” or “Ignore” buttons).
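To make the routing goal more concrete, here is a rough sketch of what a structured alert sent to PagerDuty could look like, as opposed to a raw Slack message. It assumes the PagerDuty Events API v2 and a per-service integration (routing) key; the function names and the example values are hypothetical, and in practice most of the monitoring tools listed above would send these events through their own PagerDuty integrations rather than through custom code.

```python
import json
import urllib.request

# PagerDuty Events API v2 endpoint.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Severity values accepted by the Events API v2.
SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source,
                        severity="warning", dedup_key=None, details=None):
    """Build a structured 'trigger' event (hypothetical helper)."""
    if severity not in SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    event = {
        "routing_key": routing_key,   # integration key of the PagerDuty service
        "event_action": "trigger",
        "payload": {
            "summary": summary,       # e.g. the alert name
            "source": source,         # e.g. the monitoring system or host
            "severity": severity,
        },
    }
    if dedup_key:
        # Events sharing a dedup_key are grouped into the same alert.
        event["dedup_key"] = dedup_key
    if details:
        # Free-form structured data, available later for analysis.
        event["payload"]["custom_details"] = details
    return event


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the decoded JSON response."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `build_trigger_event("<integration key>", "Slow Response Time", "newrelic", details={"team": "pengiuns-team"})` yields a payload whose fields can be queried later, which is the main difference from storing the Slack message text.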
Migration plan
For each service, use the monitoring system’s API to:
- Find all the alerts that are routed directly to a specific Slack channel.
- Re-route these alerts to PagerDuty using the integration key created for the service.
- Create a polling mechanism to query PagerDuty about active alerts.
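The polling step could be sketched roughly as follows. This assumes the PagerDuty REST API v2, which exposes open work as incidents with the statuses `triggered` and `acknowledged`, and a read-only API token; the function names, the 60-second interval, and the per-service count are illustrative assumptions, not a finished design.

```python
import json
import time
import urllib.parse
import urllib.request
from collections import Counter

# PagerDuty REST API v2 incidents endpoint.
INCIDENTS_URL = "https://api.pagerduty.com/incidents"


def build_request(api_token, offset=0, limit=100):
    """Build a GET request for one page of open incidents."""
    query = urllib.parse.urlencode([
        ("statuses[]", "triggered"),
        ("statuses[]", "acknowledged"),
        ("limit", limit),
        ("offset", offset),
    ])
    return urllib.request.Request(
        f"{INCIDENTS_URL}?{query}",
        headers={
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
    )


def fetch_active_incidents(api_token):
    """Collect all open incidents, following offset-based pagination."""
    incidents, offset = [], 0
    while True:
        request = build_request(api_token, offset)
        with urllib.request.urlopen(request, timeout=10) as response:
            page = json.load(response)
        incidents.extend(page["incidents"])
        # Stop when PagerDuty reports no further pages (or a page is empty).
        if not page.get("more") or not page["incidents"]:
            return incidents
        offset += len(page["incidents"])


def count_by_service(incidents):
    """Count open incidents per service name, e.g. for a status page."""
    return Counter(incident["service"]["summary"] for incident in incidents)


def poll(api_token, interval_seconds=60):
    """Hypothetical polling loop: print per-service counts at each interval."""
    while True:
        print(dict(count_by_service(fetch_active_incidents(api_token))))
        time.sleep(interval_seconds)
```

With many services, the polling interval has to be balanced against the API's rate limits, so the status page would lag real time by up to one interval.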
Looking forward to reading your thoughts and ideas.