trouble with Perl Nagios integration


(cori schlegel) #1

I’m trying to add a second PagerDuty service to an already-working self-hosted Nagios 4.x installation. We’re using configuration that’s more or less identical to that described in Integrating Nagios with Multiple PagerDuty Services, although we want to duplicate notifications to both PD instances during a transition period instead of routing different notifications to each PD instance. We have also connected each PD instance to a different Slack domain (although I don’t think that’s relevant to the problem we’re seeing).

The issue is that the new PD instance is not receiving alerts even though Nagios appears to be sending them. This is based on the fact that we’re not seeing messages in the new Slack domain, even though a corresponding message does show up in Slack when I manually create an incident in the new PD service. Additionally, when I look at the incident history on the new service, only my manually-created incident appears, even though we had a Nagios notification this morning that showed up in our existing PD and Slack instances.

Additionally, the logs from pagerduty_nagios on our Nagios server seem to indicate that notifications were sent to both PD instances:

Aug 23 10:39:20 nag nagios: SERVICE NOTIFICATION: pagerduty-dcny;hbny4.fogcreek.local;heartbeat_http;OK;notify-service-by-pagerduty;HTTP OK: Status line output matched "HTTP/1." - 595 bytes in 2.768 second response time
Aug 23 10:39:20 nag nagios: SERVICE NOTIFICATION: aurea-pagerduty-dcny;hbny4.fogcreek.local;heartbeat_http;OK;notify-service-by-pagerduty;HTTP OK: Status line output matched "HTTP/1." - 595 bytes in 2.768 second response time
Aug 23 10:39:22 nag pagerduty_nagios[20727]: Nagios event in file /tmp/pagerduty_nagios/pd_1535035161_20727.txt ACCEPTED by the PagerDuty server.
Aug 23 10:39:22 nag pagerduty_nagios[20729]: Nagios event in file /tmp/pagerduty_nagios/pd_1535035161_20729.txt ACCEPTED by the PagerDuty server.
Aug 23 10:39:30 nag nagios: SERVICE NOTIFICATION: pagerduty-dcny;hbny4.fogcreek.local;fogbugz_web_gens;OK;notify-service-by-pagerduty;OK: Gen 821853000 Port 85: OK
Aug 23 10:39:30 nag nagios: SERVICE NOTIFICATION: aurea-pagerduty-dcny;hbny4.fogcreek.local;fogbugz_web_gens;OK;notify-service-by-pagerduty;OK: Gen 821853000 Port 85: OK
Aug 23 10:39:31 nag pagerduty_nagios[20940]: Nagios event in file /tmp/pagerduty_nagios/pd_1535035170_20940.txt ACCEPTED by the PagerDuty server.
Aug 23 10:39:32 nag pagerduty_nagios[20939]: Nagios event in file /tmp/pagerduty_nagios/pd_1535035170_20939.txt ACCEPTED by the PagerDuty server.

I’ve double-checked the pager integration keys for the new contact and they’re correct. I’m sure I’m missing something else obvious, but I can’t figure out what it might be. Has anyone else run into this problem?

(Demitri Morgan) #2

Hi @cori ,

This arrangement, as far as I can tell, requires the following configuration to be in place:

  • There are two Nagios contacts, with distinct values for their pager field.
  • Both contacts are referenced as notification targets of the Nagios service or hosts (in the Nagios config, not in PagerDuty):
    • Directly, via contacts, i.e. naming both contacts to notify in the service or host definition, or:
    • Indirectly, via contact_groups, i.e. notifying a group defined as having both contacts as members
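A minimal sketch of that arrangement in Nagios object-definition syntax (all object names and key placeholders here are illustrative, not taken from any actual config):

```
define contact {
    contact_name                    pagerduty-primary
    pager                           INTEGRATION_KEY_FOR_FIRST_PD_SERVICE
    service_notification_commands   notify-service-by-pagerduty
    host_notification_commands      notify-host-by-pagerduty
}

define contact {
    contact_name                    pagerduty-secondary
    pager                           INTEGRATION_KEY_FOR_SECOND_PD_SERVICE
    service_notification_commands   notify-service-by-pagerduty
    host_notification_commands      notify-host-by-pagerduty
}

define contactgroup {
    contactgroup_name   pagerduty-all
    alias               All PagerDuty contacts
    members             pagerduty-primary,pagerduty-secondary
}

define service {
    ; ...usual service directives...
    contact_groups      pagerduty-all
}
```

The key point is that each contact carries its own pager value, and the service (or host) references both contacts, here indirectly via the contact group.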

It sounds like you have set up two contacts already as notification targets, but can you please confirm they both have distinct values for pager?

If they have the same value for pager, then the first notification that reaches PagerDuty will trigger an incident, but the second will be deduplicated/merged into the existing one rather than triggering a duplicate incident. Both events end up with the same incident key, so PagerDuty treats them as the exact same issue in order to reduce notification noise.
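As an illustration only (this is not PagerDuty’s actual implementation), the dedup behavior can be sketched as keying incidents on the incident key, so two events carrying the same key collapse into a single incident:

```python
# Hypothetical sketch of incident-key deduplication, not PagerDuty's real code.
incidents = {}  # incident_key -> list of events merged into that incident

def receive_event(incident_key, payload):
    """Trigger a new incident, or merge into the existing one for this key."""
    if incident_key in incidents:
        incidents[incident_key].append(payload)   # merged; no new incident
        return "merged"
    incidents[incident_key] = [payload]           # first event triggers the incident
    return "triggered"

# Two contacts sharing one pager value produce events with the same key:
print(receive_event("hbny4/heartbeat_http", "event from contact A"))  # triggered
print(receive_event("hbny4/heartbeat_http", "event from contact B"))  # merged
```

With distinct pager values, the events would carry distinct keys and each would trigger its own incident in its own service.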

I suspect this is what’s happening, given that you see the events being transmitted to PagerDuty for both contacts but only one incident being triggered.

If you go to the Timeline view of the incident (or, if Alerts are enabled, the Alert Log on the individual alert’s page), the duplicate event that got merged in should be visible there.

The relevant Nagios 4 documentation I’ve been using:

  • Object Definitions - the available directives and properties of each object that one can set (i.e. services, hosts, contacts and contact groups)
  • Notifications: according to this, all members of a contact group are notified, so if both contacts are in the same group, that should suffice for both to be notified.

(cori schlegel) #3

Hi @demitri, thanks. Yes I can confirm that both contacts are assigned to the same contact_group in Nagios and each one has its own pager value which matches the integration key for the corresponding service’s Nagios integration. There do not appear to be duplicates in the timeline for the incidents in the PD instance that’s working properly.

(Demitri Morgan) #4

Hi @cori,

Further troubleshooting here will require discussing details that are rather sensitive, being particular to your PagerDuty account. For that reason, I am reaching out to you privately to further troubleshoot this and will follow up here (with your consent) if there are any non-sensitive learnings we obtain from investigating it that would be worth sharing with the community.

(cori schlegel) #5

that sounds perfect, thanks

(cori schlegel) #6

This is resolved.

For anyone else encountering this problem in the future: our Nagios command definitions that fed PagerDuty were missing the -f CONTACTPAGER="$CONTACTPAGER$" flag. This worked fine when there was only one PagerDuty contact, but as soon as more than one PagerDuty integration key was in play, Nagios was only writing notifications to one of the keys.
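For reference, a sketch of the corrected command definitions. The script path and the pd_nagios_object field follow the standard agentless integration; only the CONTACTPAGER macro is the actual fix described above, and your command_line may carry additional -f fields:

```
define command {
    command_name    notify-service-by-pagerduty
    command_line    /usr/local/bin/pagerduty_nagios.pl enqueue -f pd_nagios_object=service -f CONTACTPAGER="$CONTACTPAGER$"
}

define command {
    command_name    notify-host-by-pagerduty
    command_line    /usr/local/bin/pagerduty_nagios.pl enqueue -f pd_nagios_object=host -f CONTACTPAGER="$CONTACTPAGER$"
}
```

Passing the macro explicitly on the command line means each execution of the command gets the pager value of the contact currently being notified, instead of relying on the environment.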

(Demitri Morgan) #7

Also, regarding how it works without this flag: the integration obtains the various fields, including the integration key, from environment variables named NAGIOS_* (among other fields, * includes CONTACTPAGER). This is implicit in how environment variables are accessed in the source code, and in how the option enable_environment_macros=1 must be set in the Nagios configuration for the Perl-based integration (see: Nagios Integration Guide – Agentless).
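Concretely, that environment-variable fallback only exists when environment macros are enabled in nagios.cfg (the variable names below are illustrative of the NAGIOS_* naming scheme, not an exhaustive list):

```
# nagios.cfg -- required for the Perl integration's env-var fallback
enable_environment_macros=1

# With this set, Nagios exports its macros into the notification command's
# environment as NAGIOS_<MACRONAME>, e.g. NAGIOS_CONTACTPAGER,
# NAGIOS_NOTIFICATIONTYPE, NAGIOS_HOSTNAME, NAGIOS_SERVICEDESC.
```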

It seems that the root cause of this issue (and therefore the reason adding the macro to the command line fixed it) is that if more than one contact is sent a notification, the environment variable macros are not updated or reevaluated for the second execution of the notify-host-by-pagerduty and/or notify-service-by-pagerduty commands.

Rather, it appears the environment variable is set from the first pager value encountered and reused for both executions of the command. That would explain why both events were sent with the same integration key even though each contact object had its own distinct value of pager.

This almost seems like a subtle bug in Nagios, if this is indeed the case. At any rate, I’m really glad we found a solution!
