How to know your PD integration is up and running without a periodic check-in


#1

Hi all

I have recently joined a team which relies heavily on PD. We are reviewing our notifications/services as part of our continuous improvement process.

I have noticed that our most frequent notification is a PagerDuty Test Check: it is triggered twice a day to confirm that the integration between PD and our phone is up and running. The person on call expects it, acknowledges it and moves on. But it feels like there must be a better way to tackle this.

Does everyone use this type of check-in? What alternatives do we have?

I have seen from the FAQ that PD recommends this:
“We advise using an external ping service such as BasicState or Wormly to monitor your network connection and mail server. Of course, you can forward the error messages from these monitoring services to one of your PagerDuty services. This way, if your site loses network connectivity or your mail server crashes, the on-call engineer will be immediately notified by PagerDuty.”

Thanks
Enrique


(Demitri Morgan) #2

Hi Enrique,

Something one could do (an ugly solution, admittedly) is to set up Support Hours on a service. Then, run two processes as cron tasks:

  1. Trigger a PagerDuty incident from a different host, e.g. an email system, during the time that it’s a low-urgency incident (“check monitoring site connectivity”).
  2. Resolve the same incident from the same host that runs the monitoring service, by sending a resolve event with the same deduplication key as the original event that triggered the incident. Have this happen before the support hours time at which urgency turns to high.

If the resolve event never gets through, the incident turns into a high-urgency incident when support hours begin, and auto-escalation will become enabled for it.

At the very least, it will eliminate the alert fatigue of having to resolve the same incident every day, because the monitoring system is expected to resolve it, and so there’s only a problem if it stays open.
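As a rough illustration of the two cron tasks, here is a minimal Python sketch. It assumes both events are sent through the PagerDuty Events API v2 (the trigger could just as well come from an email integration), and the dedup key, source name and routing key are made-up placeholders, not anything PagerDuty prescribes.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"  # Events API v2 endpoint
DEDUP_KEY = "monitoring-connectivity-check"  # hypothetical; any stable string works


def build_event(routing_key, action, dedup_key=DEDUP_KEY):
    """Build an Events API v2 body for a 'trigger' or 'resolve' action."""
    if action not in ("trigger", "resolve"):
        raise ValueError("action must be 'trigger' or 'resolve'")
    event = {
        "routing_key": routing_key,  # integration key of the service
        "event_action": action,
        "dedup_key": dedup_key,      # same key ties the resolve to the trigger
    }
    if action == "trigger":
        # Only trigger events need a payload describing the incident.
        event["payload"] = {
            "summary": "check monitoring site connectivity",
            "source": "connectivity-check-cron",  # hypothetical source name
            "severity": "warning",
        }
    return event


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

The cron job on the separate host would call `send_event(build_event(key, "trigger"))` during the low-urgency window, and the cron job on the monitoring host would call `send_event(build_event(key, "resolve"))` shortly before support hours start. If the second call fails for any reason, the incident simply stays open.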

Another way is to set up a secondary site that uses a different network provider, similar to using a third-party monitoring service. The secondary site has monitoring of the primary monitoring systems, and vice versa.

While this may seem silly, i.e. raising the question of “who monitors the monitors”, the essential fact remains that your monitoring infrastructure has a higher fault tolerance against network failures, because both sites would have to go down simultaneously for your organization to experience a catastrophic outage in which nobody gets notified.


(Arpanbhagat5) #3

Hello @demitri
The first solution looks great for checking PD integration on a service, assuming PD is up all the time.
But for a service that needs to be up all the time, setting up support hours wouldn’t work, right?
Or have I missed something here?

Thank you
Arpan


(Demitri Morgan) #4

Hi @Arpanbhagat5,

It really all depends on the desired frequency of running the connectivity check. The important thing to keep in mind about the hacky solution above is that, in this case, support hours don’t actually represent the time during which something needs to be up or supported. Rather, they are just there to act as part of a countdown mechanism of sorts that runs once for each uptime check.

There will need to be one transition from off-hours-low-urgency to on-hours-high-urgency for each check that happens per day. To get more frequent checks, if the desired schedule of checks isn’t possible to implement using support hours and the above technique, another possible workaround would be setting up multiple services, each with its own urgency schedule.
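To make the "one transition per check" idea concrete, here is a small planning sketch in Python. It only computes times; the service names, the 30-minute lead and the 5-minute margin are arbitrary assumptions for illustration, and the support hours themselves would still be configured on each PagerDuty service.

```python
from datetime import datetime, timedelta


def plan_checks(high_urgency_starts, lead_minutes=30, resolve_margin_minutes=5):
    """For each service's support-hours start ("HH:MM"), work out when to
    trigger the check (urgency still low) and when to resolve it
    (just before urgency turns high)."""
    plan = []
    for index, start in enumerate(high_urgency_starts, 1):
        start_time = datetime.strptime(start, "%H:%M")
        trigger_at = start_time - timedelta(minutes=lead_minutes)
        resolve_at = start_time - timedelta(minutes=resolve_margin_minutes)
        plan.append({
            "service": f"connectivity-check-{index}",  # hypothetical service name
            "trigger_at": trigger_at.strftime("%H:%M"),
            "resolve_at": resolve_at.strftime("%H:%M"),
            "high_urgency_at": start,
        })
    return plan


# Two checks per day need two services, each with its own transition.
for entry in plan_checks(["06:00", "18:00"]):
    print(entry)
```

Each entry maps to one service plus a pair of cron jobs, so the number of checks per day equals the number of services under this approach.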