How to know your PD integration is up and running without a periodic check-in


#1

Hi all

I have recently joined a team which relies heavily on PD. We are reviewing our notifications/services as part of our continuous improvement process.

I have noticed that our most offending notification is a PagerDuty Test Check: it is triggered twice a day to confirm that the integration with PD and our phone are both up and running. The person on call expects it, acknowledges it, and moves on. But it feels like there must be a better way to tackle this.

Does everyone use this type of check-in? What alternatives do we have?

I have seen from the FAQ that PD recommends this:
“We advise using an external ping service such as BasicState or Wormly to monitor your network connection and mail server. Of course, you can forward the error messages from these monitoring services to one of your PagerDuty services. This way, if your site loses network connectivity or your mail server crashes, the on-call engineer will be immediately notified by PagerDuty.”

Thanks
Enrique


(Demitri Morgan) #2

Hi Enrique,

Something one could do (though it is an ugly solution) is to set up Support Hours on a service. Then, run two processes as cron tasks:

  1. Trigger a PagerDuty incident from a different host, e.g. an email system, during the time that it’s a low-urgency incident (“check monitoring site connectivity”).
  2. Resolve the same incident from the host that runs the monitoring service, by sending a resolve event with the same deduplication key as the original event that triggered the incident. Have this happen before the support hours time at which urgency turns to high.

If the resolve event never gets through, the incident turns into a high-urgency incident, and auto-escalation will become enabled for it.
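
As a rough illustration of the trigger/resolve pair, here is a minimal Python sketch that assumes the service uses an Events API v2 integration. The dedup key, summary, source name, and severity are placeholders I made up, and the routing key would come from your own service's integration settings. Each cron task would call `send_event` with the matching event.

```python
import json
import urllib.request

# Public Events API v2 endpoint.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Hypothetical key; any stable string works as long as both cron tasks share it.
DEDUP_KEY = "monitoring-heartbeat"


def build_event(routing_key, action, dedup_key=DEDUP_KEY):
    """Build an Events API v2 body for a 'trigger' or 'resolve' action."""
    event = {
        "routing_key": routing_key,
        "event_action": action,
        "dedup_key": dedup_key,
    }
    if action == "trigger":
        # Only trigger events need a payload; these values are assumptions.
        event["payload"] = {
            "summary": "Check monitoring site connectivity",
            "source": "heartbeat-cron",
            "severity": "info",
        }
    return event


def send_event(event):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

The first cron task (on the separate host) would run something like `send_event(build_event(ROUTING_KEY, "trigger"))`, and the second (on the monitoring host) `send_event(build_event(ROUTING_KEY, "resolve"))`. Because both events carry the same dedup key, the resolve closes the incident opened by the trigger.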

At the very least, it will eliminate the alert fatigue of having to resolve the same incident every day, because the monitoring system is expected to resolve it, and so there’s only a problem if it stays open.

Another way is to set up a secondary site that uses a different network provider, similar to using a third-party monitoring service. The secondary site has monitoring of the primary monitoring systems, and vice versa.

While this may seem silly, since it raises the question of “who monitors the monitors”, the essential fact remains that your monitoring infrastructure gains higher fault tolerance against network failures, because both sites would have to go down simultaneously for your organization to experience a catastrophic outage in which nobody gets notified.
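
For the cross-site idea, the check each site runs against the other could be as small as the sketch below. It assumes each monitoring system exposes some HTTP health URL, which is not something stated above, and the function name and `fetch` parameter are purely illustrative.

```python
import urllib.request


def peer_is_healthy(url, fetch=urllib.request.urlopen, timeout=5):
    """Return True if the peer site's health URL answers with a 2xx status.

    `fetch` is injectable so the check can be exercised without a network.
    """
    try:
        with fetch(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError are OSError subclasses, as are socket timeouts.
        return False
```

Each site would run this on a schedule against the other and raise a PagerDuty incident through its own network provider when the result is `False`, ideally only after a few consecutive failures so that a single dropped request does not page anyone.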