Filter a specific alert and stop triggering a PD call


(yugandhar ) #1

Hi All,
We have the below Nagios alert, which is triggering a PagerDuty alert:
PROBLEM Service Alert: – ABC Time Monitoring is CRITICAL *
***** Nagios *****

Notification Type: PROBLEM

Service: ABC Time Monitoring
Address: x.x.x.x
Private IP:x.x.x.x

ServiceCommand: check_nrpe!check_fpa_update
Date/Time: Mon Aug 28 16:29:25 UTC 2017

Check Result:
CHECK_NRPE: Socket timeout after 60 seconds.

Additional Info:

Perf Data:

From the above alert, we need to check the Check Result and stop triggering the alert if it is as below:
CHECK_NRPE: Socket timeout after 60 seconds.

Please advise: how do we configure this?

(Simon Fiddaman) #2

I think there are three ways to fix this:

  1. Use Nagios contacts with pdagent to determine where your alerts end up. Create multiple Services + Escalation Policies to match what you want to do - e.g. one Service for “High Urgency”, another for “Low Urgency”. Send this alert to the “Low Urgency” Service. You probably don’t even need a different Escalation Policy if you still want to alert the same on-call.
  2. Use PagerDuty Event Rules to match that alert when it comes in and Suppress it (if you don’t want to be alerted for the alert at all, but still want it to be present in PagerDuty in the Alerts tab).
  3. Don’t send that alert to PagerDuty.

PDAgent is quite flexible, and creating new channels to pass notifications into PagerDuty is as simple as configuring a new Nagios contact with the integration key from your target Service, and assigning that contact to your Nagios Hosts and Services.
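
As a sketch of that last step, a pdagent-style Nagios contact for a separate "Low Urgency" Service might look like this (the contact name and the key are placeholders; the notification commands follow the PagerDuty Nagios integration's convention):

```
define contact {
    contact_name                    pagerduty-low-urgency
    alias                           PagerDuty (Low Urgency Service)
    service_notification_period     24x7
    host_notification_period       24x7
    service_notification_options    w,c,r
    host_notification_options       d,r
    service_notification_commands   notify-service-by-pagerduty
    host_notification_commands      notify-host-by-pagerduty
    pager                           LOW_URGENCY_INTEGRATION_KEY
}
```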

NOTE: Nagios contact and contact group assignment is cumulative. Any contacts or contact groups you specify in any included template will be present for all Hosts/Services which use those templates. It’s often safer to have no defaults set in the base layer and apply another template layer a little further up which adds a contact or contact group. This can make moving the target PagerDuty Service as easy as changing the template fragment in the chain of imports (making it the last and only adding the contact field is probably the easiest way to do this).
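
As an illustration of that layering (all names invented), the base template carries no contacts and the last template in the chain only adds the contact:

```
define service {
    name        base-service        ; no contacts or contact_groups here
    register    0
}
define service {
    name        base-service-pd     ; last layer: adds the contact, nothing else
    use         base-service
    contacts    pagerduty-low-urgency
    register    0
}
```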

Good luck!

(Daryl Monge) #3

On a slightly different strategy, we changed our Nagios plugins to return status “unknown” rather than “critical” when we had socket timeouts to avoid triggering alarms, including PagerDuty alerts. Needless to say, there still has to be a monitoring service somewhere that will detect a real network problem!
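
Daryl's change can be sketched as a small wrapper around the real check command; everything here (the function name, the matched string) is illustrative rather than his actual plugin modification:

```shell
# Run an arbitrary plugin command; if it exits CRITICAL (2) and the output
# mentions a socket timeout, re-emit the result as UNKNOWN (exit 3) so it
# doesn't page. Any other result passes through untouched.
downgrade_timeout() {
    out=$("$@")
    status=$?
    if [ "$status" -eq 2 ] && printf '%s\n' "$out" | grep -q "Socket timeout"; then
        printf 'UNKNOWN: %s\n' "$out"
        return 3
    fi
    printf '%s\n' "$out"
    return "$status"
}
```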

(Simon Fiddaman) #4

Of course you need to exclude notifications for Unknown from the PagerDuty contact for that one! Some of our teams use this to quiet cascade issues (usually specifically where we have checks which rely on graphite being up to obtain data). As you said, as long as you have alerting for graphite itself, it’s all good.
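
For reference, excluding Unknown at the Nagios contact level is just a matter of leaving `u` out of the notification options (contact name illustrative):

```
define contact {
    contact_name                    pagerduty
    service_notification_options    w,c,r    ; no "u", so Unknown never pages
}
```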

In our org, people are unconscionably used to using Nagios Downtime as a synonym for “make it go away”, so we added support for Nagios Downtime => PagerDuty Resolve in pdagent.

(yugandhar ) #5

Hi Daryl,

Can you please let me know that how do we configure to make unknown pagerduty for socket timeout issues.
Can you please give me if you have any link which describes me in a better way.


(Daryl Monge) #6

We are using check_by_ssh for remote checks. Many of the plugins in that suite read a configuration file that lets you set options globally, and we use it to set a timeout:

$ cat plugins.ini

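A Monitoring Plugins extra-opts file sets per-plugin defaults in INI sections; a minimal sketch (the timeout value is illustrative):

```ini
[check_by_ssh]
timeout=45
```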
and invocation example:

services_commands.cfg: command_line $USER1$/check_by_ssh --extra-opts -H $HOSTADDRESS$ -l $_SERVICEUSEROWNER$

Make sure the service check timeout in your system-side nagios.cfg (service_check_timeout, together with service_check_timeout_state) is longer than the plugin's own timeout. That way you will still get an alarm if the plugin itself is faulty and never times out at all.
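
In nagios.cfg terms, that means something like this (values illustrative):

```
# Longer than the plugin's own timeout, so a hung plugin still alarms:
service_check_timeout=75
# State to report when Nagios has to kill a hung check:
service_check_timeout_state=u
```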

We use event handlers to notify PagerDuty, and simply don't fire off a PagerDuty alert on Unknown. Using an event handler allows us to disable PagerDuty notifications without disabling the service check itself. Here is a simplistic example of an event handler shell script.
$ cat services_pagerduty_service

#!/bin/sh
# Service event handler for a service-down event, sending a PagerDuty alert.
# Template here points to a test service.
# Nagios passes: $1 = state, $2 = state type (SOFT/HARD), $4 = host.

case "$1" in
CRITICAL)
    # Is this a "soft" or a "hard" state?
    case "$2" in
    SOFT)
        # We're in a "soft" state, meaning that Nagios is in the middle of
        # retrying the check before it turns into a "hard" state and
        # contacts get notified...
        # (the mail recipient here is a placeholder)
        echo "Nagios Service $4 is $1" | mailx -s "Host $4 is down or unreachable" oncall@example.com
        ;;
    HARD)
        # "Hard" state: this is where the actual PagerDuty trigger would go.
        ;;
    esac
    ;;
esac
exit 0
