How to use Dead Man's Snitch to enable heartbeat monitoring


(Dan Birck) #1

This solution grew out of a number of requests for a way to detect when a monitoring tool is no longer communicating with PagerDuty.

  1. The purpose of Dead Man’s Snitch (DMS) is to generate an alarm if a job/system/app does not “check in” with DMS.

  2. DMS sends an alert if your job goes missing, i.e. fails to check in within the interval you configure: 15 min, 30 min, 1 hour, 1 day, 1 week or 1 month. In our case, DMS can notify our customers when a given PagerDuty Service has not received an event from a given monitoring tool within that interval.

  3. To enable this, create a DMS “snitch” for each monitoring tool. Set the snitch interval to the longest period you expect to pass without PagerDuty hearing from that tool under normal operating conditions. For IT monitoring, 1 month, 1 week or even 1 day is likely too long, so choose an interval close to how often the tool normally sends events.

  4. Create a webhook for each Service within PagerDuty. Use a unique DMS snitch URL per Service as the endpoint URL in the webhook.

  5. When an incident is Triggered, Reassigned or Resolved, the webhook hits the snitch URL, telling DMS that all is well. When no incident is Triggered, Reassigned or Resolved, the webhook does not fire. If this “silence” lasts longer than the snitch interval, DMS generates an alarm to the user directly (outside/out of band of PagerDuty).

  6. Optional step: Integrate DMS with PagerDuty as a Service itself, so customers can use PagerDuty to tell them that a monitoring tool hasn’t checked in! One caveat: if PagerDuty itself is the reason the snitch has not checked in, the user may not get notified by PagerDuty…
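
To make the timing in steps 2, 3 and 5 concrete, here is a minimal Python sketch of the idea. It is an in-memory model only and does not call DMS or PagerDuty. The names `INTERVALS`, `pick_interval` and `SnitchModel` are made up for illustration, “1 month” is approximated as 30 days, and the model measures silence from the last check-in, which may be simpler than how DMS actually schedules its alerts.

```python
from datetime import datetime, timedelta

# Interval choices listed in step 2. "1 month" is approximated as 30 days
# here; that is an assumption of this sketch, not a DMS definition.
INTERVALS = {
    "15 min": timedelta(minutes=15),
    "30 min": timedelta(minutes=30),
    "1 hour": timedelta(hours=1),
    "1 day": timedelta(days=1),
    "1 week": timedelta(weeks=1),
    "1 month": timedelta(days=30),
}


def pick_interval(expected_gap):
    """Return the name of the shortest interval that still covers expected_gap."""
    for name, length in sorted(INTERVALS.items(), key=lambda item: item[1]):
        if length >= expected_gap:
            return name
    raise ValueError("expected gap is longer than the longest interval")


class SnitchModel:
    """Hypothetical in-memory stand-in for one snitch (not the DMS API)."""

    def __init__(self, interval, created_at):
        self.interval = interval
        self.last_check_in = created_at

    def check_in(self, now):
        # Stands in for the PagerDuty webhook hitting the snitch URL (step 5).
        self.last_check_in = now

    def is_overdue(self, now):
        # True once the silence since the last check-in exceeds the interval.
        return now - self.last_check_in > self.interval


# Example: a tool that normally produces an incident event about every 20 minutes.
start = datetime(2024, 1, 1, 9, 0)
name = pick_interval(timedelta(minutes=20))
snitch = SnitchModel(INTERVALS[name], start)
snitch.check_in(start + timedelta(minutes=10))  # webhook fired at 09:10
quiet_25 = snitch.is_overdue(start + timedelta(minutes=35))  # 25 min of silence
quiet_35 = snitch.is_overdue(start + timedelta(minutes=45))  # 35 min of silence
print(name, quiet_25, quiet_35)
```

In this model the 20-minute tool lands on the 30 min interval, so a quiet spell only raises an alarm once it runs past half an hour. A tool that rarely generates incidents would need a longer interval, or it will alarm during perfectly normal quiet periods.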
