Incidents auto-resolution custom logic?

questions

(Tom Avitan) #1

Hey guys,
I’m having the following issue, and I’m not sure how to resolve it according to PagerDuty best practices.

We’re getting alerts from our monitoring service that are not self-resolving, meaning the service only ever triggers. The auto-resolution configuration is great for us - we set the alert to keep sending every 20 minutes and the auto-resolution to 20 minutes, so if the issue is still relevant there will always be an open PagerDuty incident.

But that makes it hard to follow incidents. A responder might ack an incident and it’ll be closed while they’re working on it, the PagerDuty analytics become hard to follow, etc. Having the incident auto-resolve x minutes after the last appended alert (or something like that) would solve it all. We’d like to preserve the beauty of auto-resolution: if an alert is no longer triggering, the incident is automatically resolved; otherwise, there’s an open incident.

What should we do?

Thanks!


(Simon Fiddaman) #2

Hi @TomAvitan,

What sort of alerts are you receiving that aren’t self-resolving?

We have some systems which trigger on a certain number of failures in a particular system, e.g. restarts of an auto-provisioned instance – something which indicates the deployment process has failed, or the configured resource constraints are breaking, but not the entire service (we have proper state checks for the services globally).

Each of these requires investigation because there’s additional tuning to be done to clean it up, but as they are event-driven (restart of an instance), they never have an accompanying resolve. This works fine, because the engineer will pick it up, investigate it, perform the tuning and resolve the incident.

I’m concerned that you have what seem to be state-based alerts (i.e. you have the ability to determine an ok/not-ok state of your service) but your alerting isn’t sending resolves along with that. What system is generating your alerts? Is it capable of sending resolve notices, but it’s not working?

I’d be happy to chat further about it - I think your alerting can be improved at the source. :slight_smile:

Cheers,
@simonfiddaman


(Tom Avitan) #3

Thanks @simonfiddaman!

We use Azure. The classic alerts (for metrics) are self-resolving, but the log search alerts aren’t. We have a few log search alerts that are sent every x minutes while they are still relevant.
We thought we could use the auto-resolution feature so that it would close the alert if it is not triggering anymore, but we would like to keep the incident open while it still does.

If we don’t use the auto-resolution feature, we end up manually checking whether the incident is still relevant and closing it ourselves, a process we wanted to avoid.

Now, as I said, our current PagerDuty setup does close those alerts automatically, but it opens a new incident every x minutes while the alert keeps triggering. It’s quite hard to follow and work with.

Does that make sense?


(Simon Fiddaman) #4

Hi @TomAvitan,

That makes perfect sense - it’s the same use case as we have (just a different provider). Looks like you’re implementing https://docs.microsoft.com/en-us/azure/monitoring-and-diagnostics/monitor-alerts-unified-log ?

I think the first issue is that your alerts are being generated by a system which is probably not sending a dedup_key, so every alert will always be considered unique by your PagerDuty Service (a random identifier is assigned to each alert).

Given you’re also resolving Incidents after 20 minutes, you may not be able to answer this, but: in any given Incident, does the Incident Log/Alert Log show multiple entries / alert messages? If it does, Azure is sending a consistent dedup_key (or incident_key if using the old v1 Events API); if not, they’ll always be unique Incidents regardless of the auto-resolve. You can check in the Alert or Alert Log View Message whether the field is being set/sent and whether it’s consistent across Alerts.
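
For reference, here’s a minimal sketch of what sending an event with an explicit, consistent dedup_key to the Events API v2 looks like (Python with requests; the routing key, dedup_key value and payload fields are placeholders I’ve made up, not something Azure sends for you):

```python
import requests

ROUTING_KEY = "YOUR_INTEGRATION_KEY"  # placeholder - your Service's Events API v2 integration key

def send_event(event_action, dedup_key, summary):
    # Reusing the same dedup_key means PagerDuty appends to (or resolves) the
    # existing alert instead of creating a brand-new one each time.
    body = {
        "routing_key": ROUTING_KEY,
        "event_action": event_action,  # "trigger" or "resolve"
        "dedup_key": dedup_key,
        "payload": {
            "summary": summary,            # required for trigger events
            "source": "azure-log-search",  # placeholder
            "severity": "error",
        },
    }
    resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=body)
    resp.raise_for_status()
    return resp.json()

# Every firing of the same log search alert reuses the same key:
send_event("trigger", "log-search:payment-errors", "Payment error count over threshold")
```

Repeated triggers with the same dedup_key get appended to the same alert rather than opening a new Incident each time.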

You could try moving those alerts to a separate PagerDuty Service (if they aren’t already) and enabling Intelligent Alert Grouping to overcome the unique-id-per-alert problem, but you’ll still be auto-resolving them after 20 minutes, so you’d have to increase or disable that to reduce noise and prevent the still-active-and-being-merged Incident from being resolved too quickly. (I appreciate your initial question is actually a feature request to change the auto-resolve behaviour; this, or consistent dedup_keys, would be a prerequisite.)

Would it be possible (assuming you can send consistent dedup_keys, especially across multiple alert configurations) to create another, inverted alert which sends a resolve when you’re under the threshold? Answer: probably not. If Azure events don’t understand state and can only trigger alerts, they can’t send a different event kind (trigger vs. resolve).

I still think this is a deficiency of the alerting mechanism, and Azure should send a resolve when the configured alert returns to a non-alerting state - there’s already a configured threshold, and the other side of it is clearly “OK”! Assuming they don’t, what if the query results were handled by e.g. a small lambda-style function which set a consistent dedup_key and flipped the event to a resolve when the result is under the threshold? Probably not cost effective.
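
If you did go down that path, the relay might look something like the sketch below - purely illustrative, assuming you can feed it the query hit count somehow; the function name, dedup_key scheme and threshold handling are my assumptions, not an existing Azure or PagerDuty feature:

```python
import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_KEY"  # placeholder

def relay_log_alert(alert_name, hit_count, threshold):
    # Hypothetical: called with one Azure log query's hit count for one alert rule.
    over = hit_count > threshold
    event = {
        "routing_key": ROUTING_KEY,
        # Flip to a resolve once the query drops back under the threshold.
        "event_action": "trigger" if over else "resolve",
        # One consistent key per alert rule keeps everything on one incident.
        "dedup_key": f"azure-log-search:{alert_name}",
    }
    if over:
        # The payload is only required on trigger events.
        event["payload"] = {
            "summary": f"{alert_name}: {hit_count} hits (threshold {threshold})",
            "source": "azure-log-search",
            "severity": "error",
        }
    requests.post(EVENTS_URL, json=event).raise_for_status()
```

Whether running something like that is worth it compared to just having the On-call resolve the Incident manually is, as above, questionable.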

What about an inverse check which triggered against Dead Man’s Snitch? Hmm, again probably not, because you’d only end up receiving the “did not check in” notification, not the actual alert message, which isn’t very useful.

Are your On-calls not already investigating the logs to determine the issue? i.e. wouldn’t they already be in the Azure Log Search area validating that their work is having the desired effect, rather than just waiting for the PagerDuty alert to resolve? This is how we handle this kind of event-driven alert without knowledge of state: the PagerDuty Incident drives the On-call to the reporting (e.g. a Kibana dashboard), which they use to identify the type, volume, activeness and hopefully the cause of the issue, and they refer back to that to validate it’s been solved before resolving the PagerDuty Incident (manually).

Personally, I don’t ever use or recommend the auto-resolve or auto-unacknowledge features, because I think they generate more noise from alerts which are already being handled, but I recognise my opinion differs from others on this matter. :slight_smile:

In summary:

  1. Check if you’re receiving consistent dedup_key values in your Alert message entries - this determines if each new alert will update an existing Incident or always create a new Incident.
  2. If you aren’t, and it cannot be configured, enable Incident+Alerts and Intelligent Alert Grouping and train it to merge the new Alerts into the existing Incident.
  3. Reduce the time between alerts, and increase the auto-resolve time to reduce noise from the auto-resolve.
  4. Have your On-calls resolve the PagerDuty Incident once it’s solved (it’ll re-alert if it’s still firing); or
  5. Submit your feature request for “auto-resolve Incident after X minutes without new alert”.

Cheers,
@simonfiddaman