Dealing With Nagios Flapping and Downtime Events

nagios

(Aaron Goulet) #1

Summary:
I’ve been exploring a good way to deal with incoming FLAPPINGSTART, FLAPPINGSTOP, DOWNTIMESTART, and DOWNTIMEEND events; specifically, I’d like to have it so that if a DOWNTIMESTART event comes in, it closes any open alerts that match the dedup_key/incident_key. Likewise, with flap detection, I’d like to close an alert if a FLAPPINGSTOP notification event comes through and the service/host status = OK.
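
To make the desired behavior concrete, here is a rough sketch of the decision table in Python. The function name `desired_action` is hypothetical, the action strings follow the PagerDuty Events API v2 `event_action` values, and treating UP as the host equivalent of OK is my assumption. Returning None for FLAPPINGSTART, DOWNTIMEEND, and an unhealthy FLAPPINGSTOP is also an assumption, since I haven't settled what those should do.

```python
from typing import Optional

# Basic Nagios notification types and the event action each one maps to.
BASIC_ACTIONS = {
    "PROBLEM": "trigger",
    "ACKNOWLEDGEMENT": "acknowledge",
    "RECOVERY": "resolve",
}

# Assumption: "OK" applies to services and "UP" is the host equivalent.
HEALTHY_STATES = {"OK", "UP"}


def desired_action(notification_type: str, state: str) -> Optional[str]:
    """Return the event action wanted for a Nagios notification, or None."""
    if notification_type in BASIC_ACTIONS:
        return BASIC_ACTIONS[notification_type]
    # Entering downtime should close any open alert for the same key.
    if notification_type == "DOWNTIMESTART":
        return "resolve"
    # Flapping has stopped and the check is healthy: close the alert.
    if notification_type == "FLAPPINGSTOP" and state in HEALTHY_STATES:
        return "resolve"
    # Assumption: everything else is left alone in this sketch.
    return None
```

For example, `desired_action("FLAPPINGSTOP", "OK")` gives "resolve", while `desired_action("FLAPPINGSTOP", "CRITICAL")` gives None.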

Attempted Approach:
I figured one method might be to route events through Global Event Rules instead of the Nagios integration. There are benefits and downsides to this approach; the main downside is that it’s unwieldy to configure and looks kind of hack-ish:

define contact {
	   contact_name                             pagerduty
	   alias                                    PagerDuty Pseudo-Contact
	   service_notification_period              24x7
	   host_notification_period                 24x7
	   service_notification_options             w,u,c,r
	   host_notification_options                d,r
	   service_notification_commands            notify-service-by-pagerduty
	   host_notification_commands               notify-host-by-pagerduty
	   pager                                    <Global Event Rule Key Here>
}

# Used to send flaps and downtimes.
define contact {
	   contact_name                             pagerduty_fs
	   alias                                    PagerDuty Secondary Actions
	   service_notification_period              24x7
	   host_notification_period                 24x7
	   service_notification_options             f,s
	   host_notification_options                f,s
	   service_notification_commands            notify-service-by-pagerduty-always-trigger
	   host_notification_commands               notify-host-by-pagerduty-always-trigger
	   pager                                    <Global Event Rule Key Here>
}

define command {
	   command_name     notify-service-by-pagerduty
	   command_line     /usr/share/pdagent-integrations/bin/pd-nagios -n service -k $CONTACTPAGER$ -i $HOSTNAME$_$SERVICEDESC$ -t "$NOTIFICATIONTYPE$" -f HOSTNAME="$HOSTNAME$" -f HOSTDISPLAYNAME="$HOSTDISPLAYNAME$" -f HOSTALIAS="$HOSTALIAS$" -f HOSTADDRESS="$HOSTADDRESS$" -f HOSTSTATE="$HOSTSTATE$" -f HOSTDOWNTIME="$HOSTDOWNTIME$" -f HOSTGROUPNAMES="$HOSTGROUPNAMES$" -f HOSTACTIONURL="$HOSTACTIONURL$" -f HOSTNOTESURL="$HOSTNOTESURL$" -f HOSTNOTES="$HOSTNOTES$" -f SERVICEDESC="$SERVICEDESC$" -f SERVICEDISPLAYNAME="$SERVICEDISPLAYNAME$" -f SERVICESTATE="$SERVICESTATE$" -f SERVICESTATETYPE="$SERVICESTATETYPE$" -f SERVICEATTEMPT="$SERVICEATTEMPT$" -f MAXSERVICEATTEMPTS="$MAXSERVICEATTEMPTS$" -f SERVICEEVENTID="$SERVICEEVENTID$" -f SERVICEPROBLEMID="$SERVICEPROBLEMID$" -f SERVICEDURATION="$SERVICEDURATION$" -f SERVICEDURATIONSEC="$SERVICEDURATIONSEC$" -f SERVICEDOWNTIME="$SERVICEDOWNTIME$" -f SERVICEGROUPNAMES="$SERVICEGROUPNAMES$" -f SERVICEOUTPUT="$SERVICEOUTPUT$" -f LONGSERVICEOUTPUT="$LONGSERVICEOUTPUT$" -f SERVICEACTIONURL="$SERVICEACTIONURL$" -f SERVICENOTESURL="$SERVICENOTESURL$" -f SERVICENOTES="$SERVICENOTES$" -f LONGDATETIME="$LONGDATETIME$" -f SHORTDATETIME="$SHORTDATETIME$" -f DATE="$DATE$" -f TIME="$TIME$" -f TIMET="$TIMET$" -f LOGFILE="$LOGFILE$" -f NOTIFICATIONTYPE="$NOTIFICATIONTYPE$"
}

define command {
	   command_name     notify-service-by-pagerduty-always-trigger
	   command_line     /usr/share/pdagent-integrations/bin/pd-nagios -n service -k $CONTACTPAGER$ -i $HOSTNAME$_$SERVICEDESC$ -t "PROBLEM" -f HOSTNAME="$HOSTNAME$" -f HOSTDISPLAYNAME="$HOSTDISPLAYNAME$" -f HOSTALIAS="$HOSTALIAS$" -f HOSTADDRESS="$HOSTADDRESS$" -f HOSTSTATE="$HOSTSTATE$" -f HOSTDOWNTIME="$HOSTDOWNTIME$" -f HOSTGROUPNAMES="$HOSTGROUPNAMES$" -f HOSTACTIONURL="$HOSTACTIONURL$" -f HOSTNOTESURL="$HOSTNOTESURL$" -f HOSTNOTES="$HOSTNOTES$" -f SERVICEDESC="$SERVICEDESC$" -f SERVICEDISPLAYNAME="$SERVICEDISPLAYNAME$" -f SERVICESTATE="$SERVICESTATE$" -f SERVICESTATETYPE="$SERVICESTATETYPE$" -f SERVICEATTEMPT="$SERVICEATTEMPT$" -f MAXSERVICEATTEMPTS="$MAXSERVICEATTEMPTS$" -f SERVICEEVENTID="$SERVICEEVENTID$" -f SERVICEPROBLEMID="$SERVICEPROBLEMID$" -f SERVICEDURATION="$SERVICEDURATION$" -f SERVICEDURATIONSEC="$SERVICEDURATIONSEC$" -f SERVICEDOWNTIME="$SERVICEDOWNTIME$" -f SERVICEGROUPNAMES="$SERVICEGROUPNAMES$" -f SERVICEOUTPUT="$SERVICEOUTPUT$" -f LONGSERVICEOUTPUT="$LONGSERVICEOUTPUT$" -f SERVICEACTIONURL="$SERVICEACTIONURL$" -f SERVICENOTESURL="$SERVICENOTESURL$" -f SERVICENOTES="$SERVICENOTES$" -f LONGDATETIME="$LONGDATETIME$" -f SHORTDATETIME="$SHORTDATETIME$" -f DATE="$DATE$" -f TIME="$TIME$" -f TIMET="$TIMET$" -f LOGFILE="$LOGFILE$" -f NOTIFICATIONTYPE="$NOTIFICATIONTYPE$"
}

define command {
	   command_name     notify-host-by-pagerduty
	   command_line     /usr/share/pdagent-integrations/bin/pd-nagios -n host -k $CONTACTPAGER$ -i $HOSTNAME$ -t "$NOTIFICATIONTYPE$" -f HOSTNAME="$HOSTNAME$" -f HOSTDISPLAYNAME="$HOSTDISPLAYNAME$" -f HOSTALIAS="$HOSTALIAS$" -f HOSTADDRESS="$HOSTADDRESS$" -f HOSTSTATE="$HOSTSTATE$" -f HOSTSTATETYPE="$HOSTSTATETYPE$" -f HOSTATTEMPT="$HOSTATTEMPT$" -f MAXHOSTATTEMPTS="$MAXHOSTATTEMPTS$" -f HOSTEVENTID="$HOSTEVENTID$" -f HOSTPROBLEMID="$HOSTPROBLEMID$" -f HOSTDURATION="$HOSTDURATION$" -f HOSTDURATIONSEC="$HOSTDURATIONSEC$" -f HOSTDOWNTIME="$HOSTDOWNTIME$" -f HOSTGROUPNAMES="$HOSTGROUPNAMES$" -f HOSTOUTPUT="$HOSTOUTPUT$" -f LONGHOSTOUTPUT="$LONGHOSTOUTPUT$" -f HOSTPERFDATA="$HOSTPERFDATA$" -f HOSTCHECKCOMMAND="$HOSTCHECKCOMMAND$" -f HOSTACTIONURL="$HOSTACTIONURL$" -f HOSTNOTESURL="$HOSTNOTESURL$" -f HOSTNOTES="$HOSTNOTES$" -f LONGDATETIME="$LONGDATETIME$" -f SHORTDATETIME="$SHORTDATETIME$" -f DATE="$DATE$" -f TIME="$TIME$" -f TIMET="$TIMET$" -f NOTIFICATIONTYPE="$NOTIFICATIONTYPE$"
}

define command {
	   command_name     notify-host-by-pagerduty-always-trigger
	   command_line     /usr/share/pdagent-integrations/bin/pd-nagios -n host -k $CONTACTPAGER$ -i $HOSTNAME$ -t "PROBLEM" -f HOSTNAME="$HOSTNAME$" -f HOSTDISPLAYNAME="$HOSTDISPLAYNAME$" -f HOSTALIAS="$HOSTALIAS$" -f HOSTADDRESS="$HOSTADDRESS$" -f HOSTSTATE="$HOSTSTATE$" -f HOSTSTATETYPE="$HOSTSTATETYPE$" -f HOSTATTEMPT="$HOSTATTEMPT$" -f MAXHOSTATTEMPTS="$MAXHOSTATTEMPTS$" -f HOSTEVENTID="$HOSTEVENTID$" -f HOSTPROBLEMID="$HOSTPROBLEMID$" -f HOSTDURATION="$HOSTDURATION$" -f HOSTDURATIONSEC="$HOSTDURATIONSEC$" -f HOSTDOWNTIME="$HOSTDOWNTIME$" -f HOSTGROUPNAMES="$HOSTGROUPNAMES$" -f HOSTOUTPUT="$HOSTOUTPUT$" -f LONGHOSTOUTPUT="$LONGHOSTOUTPUT$" -f HOSTPERFDATA="$HOSTPERFDATA$" -f HOSTCHECKCOMMAND="$HOSTCHECKCOMMAND$" -f HOSTACTIONURL="$HOSTACTIONURL$" -f HOSTNOTESURL="$HOSTNOTESURL$" -f HOSTNOTES="$HOSTNOTES$" -f LONGDATETIME="$LONGDATETIME$" -f SHORTDATETIME="$SHORTDATETIME$" -f DATE="$DATE$" -f TIME="$TIME$" -f TIMET="$TIMET$" -f NOTIFICATIONTYPE="$NOTIFICATIONTYPE$"
}

You then need to set several inbound rules in the Global Event Rules section of the UI. Of course, this means losing a lot of the functionality that the Nagios integration gives you out-of-the-box.

Next Step:
I plan on forking the Nagios integration and adding handling for these event types, as it’s ultimately cleaner, reduces complexity of use, and still lets us leverage the default Nagios integration. The downside is needing to curate a separate fork of the integration and keep it current with upstream changes.

Thoughts, and Open Question to the Community:
Has anyone else come up with a way of elegantly handling flap detection and downtime events from Nagios? What does your process look like?


(Joe Calcada) #2

Thanks for posting, Aaron! Hopefully you get some good feedback.


(Demitri Morgan) #3

Hi @AaronGoulet ,

There are two other approaches I can think of that one could take.

The first, which I recommend, would be creating a wrapper script. This would require far less long-term maintenance than an outright fork of pd-nagios (the script interface to PagerDuty Agent), though it could still need an update in the future if the CLI ever changes.

The other approach you could probably take, which would be tricky albeit compact, would be some command substitution in the command_line directive using backticks. You could thus inject some logic into the Nagios configuration itself, e.g. for the service notifier command (disclaimer: the following has not been tested):

command_line   /usr/share/pdagent-integrations/bin/pd-nagios -n service -k $CONTACTPAGER$ \
    -t `[ "$NOTIFICATIONTYPE$" = "FLAPPINGSTOP" ] && [ "$SERVICESTATE$" = "OK" ] && echo RECOVERY || echo "$NOTIFICATIONTYPE$"` \
    -f SERVICEDESC="$SERVICEDESC$" -f SERVICESTATE="$SERVICESTATE$" \
    -f HOSTNAME="$HOSTNAME$" -f HOSTDISPLAYNAME="$HOSTDISPLAYNAME$" \
    -f SERVICEDISPLAYNAME="$SERVICEDISPLAYNAME$" \
    -f SERVICEPROBLEMID="$SERVICEPROBLEMID$" \
    -f SERVICEOUTPUT="$SERVICEOUTPUT$"

Either way, one could put logic in between Nagios and the script interface to translate the extra notification types into the basic notification types that the script interface recognizes. Hopefully that helps!
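
As a rough illustration of the wrapper idea, here is a minimal, untested sketch in Python. It assumes the wrapper is called with exactly the same arguments as pd-nagios, that the basic types are PROBLEM, ACKNOWLEDGEMENT and RECOVERY, and that DOWNTIMESTART and a healthy FLAPPINGSTOP should both become RECOVERY. The names `translate` and `field_value` are hypothetical, and silently dropping the other extra types is just one possible policy.

```python
#!/usr/bin/env python3
import os
import sys

PD_NAGIOS = "/usr/share/pdagent-integrations/bin/pd-nagios"
# Assumption: the notification types the script interface recognizes.
BASIC_TYPES = {"PROBLEM", "ACKNOWLEDGEMENT", "RECOVERY"}
# Assumption: OK for services, UP for hosts.
HEALTHY_STATES = {"OK", "UP"}


def field_value(argv, name):
    """Return the value of a `-f NAME=value` pair in argv, or None."""
    prefix = name + "="
    for flag, value in zip(argv, argv[1:]):
        if flag == "-f" and value.startswith(prefix):
            return value[len(prefix):]
    return None


def translate(argv):
    """Return argv with the -t value rewritten, or None to send nothing."""
    args = list(argv)
    if "-t" not in args or args.index("-t") + 1 >= len(args):
        return args  # no notification type given; pass through unchanged
    pos = args.index("-t") + 1
    ntype = args[pos]
    # Prefer the service state when present, otherwise use the host state.
    state = field_value(args, "SERVICESTATE") or field_value(args, "HOSTSTATE")
    if ntype == "DOWNTIMESTART":
        args[pos] = "RECOVERY"
    elif ntype == "FLAPPINGSTOP" and state in HEALTHY_STATES:
        args[pos] = "RECOVERY"
    elif ntype not in BASIC_TYPES:
        return None  # assumption: drop the remaining extra types
    return args


def main(argv):
    new_args = translate(argv)
    if new_args is None:
        sys.exit(0)
    # Replace this process with pd-nagios, forwarding the rewritten arguments.
    os.execv(PD_NAGIOS, [PD_NAGIOS] + new_args)


# Only forward when Nagios actually passed arguments.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

The command_line directives would then point at wherever the wrapper is installed instead of at pd-nagios directly, with the rest of each line left unchanged.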

Also, since this would be great to have out-of-box, I’m going to convey this need to our product team.

Cheers


(Aaron Goulet) #4

Thank you @demitri! I don’t know why I didn’t think of this myself! I’m going to implement it as a wrapper script, I think, as it feels a little cleaner to me. I greatly appreciate the assistance.


(Aaron Goulet) #5

I spent a little time on a wrapper yesterday and today:

It’s not super pretty and if I’d had more time I would’ve done some things differently, but I wanted to get it done before our trial was over and figured I’d post it in case it helps someone else down the line.


(Demitri Morgan) #6

@AaronGoulet fantastic! Thank you so much for sharing!

