Delay escalation for set time period


#1

Hi,

We’re looking at integrating various services with PagerDuty. However, we have a requirement that certain services (Nagios, for example) are only escalated after an alert has been active for 30 minutes or longer, as a good percentage of our alerts resolve within this time frame and we have a fair few false positives (we monitor 600+ hosts with 3000+ service checks).

Is there a way to accommodate this requirement without amending the Nagios environment?
Looking at the escalation policies it seems you can only set ‘Immediately after an incident is triggered’.

I was thinking about setting up a dummy contact which is notified first, then setting a 30-minute delay before escalating. However, I think with this, if a team reassigns an alert to another team, the 30-minute window would kick in again?

Thanks.


How can I delay notification while I wait to see if it resolves?
(Simon Fiddaman) #2

Hi!

I know you said “without modifying the Nagios environment” but honestly, if all of your checks are so volatile, adding a first_notification_delay to your base template is probably best. Unfortunately you can’t set first_notification_delay on a contact, which would be perfect. Either that or fix the checks so they aren’t so trigger-happy with false positives.

Creating a dummy contact is the unofficial next best practice I’ve seen suggested and what will probably work best for you.

I also think that you’ll end up in a much better place if you leave it as-is and fix the flappy checks as they occur. This is exactly what I tell my on-calls.

Cheers,
Simon


#3

Thanks Simon.

I saw the ‘first_notification_delay’ option yesterday and had a play around with it. However, with this setting active, it appears to prevent the calls from being sent to PagerDuty.

I monitored the PagerDuty queue using the pd-queue command on Linux, and submitted a passive check from the XI web interface to make the host appear down, but the queue did not move.
Removing the setting and submitting the same passive check works fine though.

Is this a known issue? If we could get this setting to work then it would be perfect.

Thanks.

Edit:
Just tried this again, this time using an active check (just checking the calculator program is running on Windows) and it still does not work, so possibly a bug.

The notification delay was set to 1 minute. After the alert goes critical, the pd-queue does not increase.
Removing the notification delay and triggering the same critical, the issue is sent to PagerDuty.


(Simon Fiddaman) #4

So first_notification_delay should just delay the first notification, which is what PagerDuty is using to trigger the Incident in the first place. It’s possible that you need to wait for the next check interval to pass before it’ll trigger the notification.

If it didn’t reach pdagent then it’s a Nagios thing. You could check the Notifications (either Nagios global or from the Host/Service itself) to see if it attempted to send one.

If you really want to have the Incident created in PagerDuty you must send the notification (hence anything which delays the sending of the notification into PagerDuty will be counter to what you’re trying to achieve).


#5

Thanks, it does appear that you need to wait for the next check interval. I thought it would queue the first notification and just send it after x minutes.

I’m not quite sure this method would work now. For example, we have some checks with a check interval of 4 hours, which would mean PagerDuty would not know about the issue until the 2nd check is performed, so 8 hours after.
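
To put rough numbers on this, here is a minimal Python sketch of the behaviour described in this thread. It assumes a simplified model, not Nagios's actual scheduling code: a notification is only considered when a check result arrives, and it is held back until first_notification_delay has passed since the first failing check. Retry intervals and max_check_attempts are ignored, and the function name is made up for illustration.

```python
from datetime import timedelta


def first_notification_offset(check_interval, first_notification_delay):
    """Offset from the first failing check to the first notification.

    Simplified model (an assumption, not Nagios source code): a notification
    is only considered when a check result arrives, and it is suppressed
    until first_notification_delay has elapsed since the first failing check.
    Retry intervals and max_check_attempts are ignored.
    """
    if check_interval <= timedelta(0):
        raise ValueError("check_interval must be positive")
    elapsed = timedelta(0)
    # Step forward one check at a time until the delay has been satisfied.
    while elapsed < first_notification_delay:
        elapsed += check_interval
    return elapsed


if __name__ == "__main__":
    # 4-hour check interval with a 30-minute delay
    print(first_notification_offset(timedelta(hours=4), timedelta(minutes=30)))
    # 5-minute check interval with a 1-minute delay
    print(first_notification_offset(timedelta(minutes=5), timedelta(minutes=1)))
```

Under these assumptions the 4-hour check notifies a full 4 hours after the check that first detected the problem, not 30 minutes after it. Counting from when the fault actually began can add up to one more interval on top of that.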


(Simon Fiddaman) #6

Given your specific scenario, you’ll probably want to set it up with the dummy user in Level 1 and set your escalation level timeout to 30 minutes.

I’d still suggest that fixing the thresholds in your checks is better, as you’ll reduce your false positives to begin with, but the dummy user will do what you’re looking for - delay notifications to a real person for the timeout setting.
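
As a rough illustration of that setup, the sketch below builds the kind of JSON body that could be used to create such a policy through the PagerDuty REST API. The field names (escalation_rules, escalation_delay_in_minutes, targets, user_reference, schedule_reference) follow my understanding of that API and should be checked against the current reference. The function name, the IDs and the policy name are placeholders, and nothing is actually sent.

```python
import json


def build_delayed_policy(name, dummy_user_id, on_call_schedule_id,
                         delay_minutes=30, next_level_timeout=30):
    """Build an escalation policy body with a dummy user in Level 1.

    The IDs passed in are placeholders (assumptions); this only builds the
    dictionary and does not call the PagerDuty API.
    """
    return {
        "escalation_policy": {
            "type": "escalation_policy",
            "name": name,
            "escalation_rules": [
                {
                    # Level 1: the dummy user absorbs the first delay_minutes.
                    "escalation_delay_in_minutes": delay_minutes,
                    "targets": [{"id": dummy_user_id, "type": "user_reference"}],
                },
                {
                    # Level 2: the real on-call, reached only if the incident
                    # is still open when Level 1 times out.
                    "escalation_delay_in_minutes": next_level_timeout,
                    "targets": [{"id": on_call_schedule_id,
                                 "type": "schedule_reference"}],
                },
            ],
        }
    }


if __name__ == "__main__":
    body = build_delayed_policy("Nagios (delayed)", "PDUMMY1", "PSCHED1")
    print(json.dumps(body, indent=2))
```

If Nagios sends a resolve within the Level 1 timeout, the incident never reaches Level 2, so nobody real gets paged.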


#7

I may look into that option, but I believe it would prevent an on-call from assigning the issue to another team without it also hitting the initial 30-minute delay.


(Simon Fiddaman) #8

You could have separate Escalation Policies - one which contains the dummy user in Level 1, and one which does not, and use Reassign... to Escalation Policy to route to the next Team.

We implement something similar - High Priority Escalation Policies will target the on-call first, falling back to our NOC, and Normal Priority will target the NOC first, with an escalation to Level 2 hitting the on-call. If they need to cross-escalate to another on-call team, they’ll reassign to the team’s High Priority EP.
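
The effect of keeping two policies can be sketched with a small Python model. This is not PagerDuty code; it just assumes an incident reassigned to an escalation policy starts at that policy's Level 1 and that each level waits its full timeout before escalating. The policy contents and names are made up for illustration.

```python
DUMMY = "dummy-user"  # placeholder name for the dummy Level 1 user


def minutes_until_real_person(levels, placeholders=frozenset({DUMMY})):
    """Return minutes from (re)assignment until a real person is notified.

    levels is an ordered list of (target, timeout_minutes) tuples. Returns
    None if the policy only contains placeholder targets.
    """
    elapsed = 0
    for target, timeout_minutes in levels:
        if target not in placeholders:
            return elapsed
        # A placeholder level just burns its timeout before escalating.
        elapsed += timeout_minutes
    return None


# Hypothetical policies: one with the dummy user in Level 1, one without.
delayed_policy = [(DUMMY, 30), ("team-a-on-call", 30)]
direct_policy = [("team-b-on-call", 30)]

if __name__ == "__main__":
    print(minutes_until_real_person(delayed_policy))  # 30
    print(minutes_until_real_person(direct_policy))   # 0
```

So new Nagios incidents land on the delayed policy, while a reassignment to another team targets that team's policy without the dummy level and pages them straight away.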

