That makes perfect sense - it’s the same use case as we have (just a different provider). Looks like you’re implementing https://docs.microsoft.com/en-us/azure/monitoring-and-diagnostics/monitor-alerts-unified-log ?
I think the first issue is that your alerts are being generated by a system which is probably not sending a `dedup_key`, so every alert will always be considered unique by your PagerDuty Service (a random identifier is assigned to each alert).
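To illustrate why a missing `dedup_key` matters, here is a toy model of the grouping (not PagerDuty's actual implementation; the function name and the sample summaries are made up for the example). Events sharing a key land in one Incident, and events without a key get a random one, so each becomes its own Incident:

```python
import uuid


def group_into_incidents(events):
    """Toy model: group events by dedup_key, one entry per Incident."""
    incidents = {}
    for event in events:
        # A missing key is replaced by a random identifier,
        # so the event can never match an existing Incident.
        key = event.get("dedup_key") or uuid.uuid4().hex
        incidents.setdefault(key, []).append(event["summary"])
    return incidents


# Hypothetical sample data: the same alert firing three times.
without_key = [{"summary": "errors > 10"} for _ in range(3)]
with_key = [
    {"summary": "errors > 10", "dedup_key": "azure-log-errors"}
    for _ in range(3)
]

print(len(group_into_incidents(without_key)))  # 3
print(len(group_into_incidents(with_key)))  # 1
```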
Given you’re also resolving Incidents after 20 minutes, you may not be able to validate this, but: in any given Incident, does the Incident Log/Alert Log show multiple entries / alert messages? If it does, it’s sending a consistent `dedup_key` (or `incident_key` if using the old v1 Events API); if not, they’ll always be unique Incidents regardless of the auto-resolve. You can check in the Alert or Alert Log View Message whether the field is being set/sent and whether it’s consistent across Alerts.
You could try moving those alerts to a separate PagerDuty Service (if they aren’t already) and enabling Intelligent Alert Grouping to overcome the unique-id-per-alert problem. You’ll still be auto-resolving them after 20 minutes, though, so you’d have to increase or disable that to reduce noise and prevent the still-active-and-being-merged Incident from being resolved too quickly. (I appreciate your initial question is actually a feature request to change the auto-resolve behaviour; this or consistent `dedup_key`s would be a prerequisite.)
Would it be possible (assuming you can send consistent `dedup_key`s, especially across multiple alert configurations) to create another alert which is inverted and sends a `resolve` if it’s under threshold? Answer: probably not. If Azure events don’t understand state and only trigger alerts, they can’t send a different alert kind.
I still think this is a deficiency of the alerting mechanism, and Azure should add a `resolve` when the configured alert returns to a non-alerting state, because there’s already a configured threshold and the other side is clearly “OK”! Assuming they don’t, what if the results were handled by e.g. a lambda function which set the consistent `dedup_key` and flipped to `resolve` if the result is under threshold? Probably not cost effective.
What about an inverse check which triggered against Dead Man’s Snitch? Hmm, again probably not, because you’d only end up receiving the “did not check in” notification rather than the actual alert message, which isn’t very useful.
Are your On-calls not already investigating the logs to determine the issue? i.e. wouldn’t they already be in the Azure Log Search area validating their work is having the desired effect, rather than just waiting for the PagerDuty alert to resolve? This is how we handled these kinds of event-driven alerts without knowledge of state: the PagerDuty Incident drives the On-call to the Reporting (e.g. a Kibana dashboard), which they use to identify the type, volume, activeness and hopefully cause of the issue, and they refer back to it to validate it’s been solved before resolving the PagerDuty Incident (manually).
Personally, I don’t ever use or recommend the auto-resolve, or auto-unacknowledge features because I think they generate more noise from alerts which are already being handled, but I recognise my opinion differs from others in this matter.
- Check if you’re receiving consistent `dedup_key` values in your Alert message entries - this determines if each new alert will update an existing Incident or always create a new Incident.
- If you aren’t, and it cannot be configured, enable Incident+Alerts and Intelligent Alert Grouping and train it to merge the new Alerts into the existing Incident.
- Reduce the time between alerts, and increase the auto-resolve time to reduce noise from the auto-resolve.
- Have your On-calls resolve the PagerDuty Incident once it’s solved (it’ll re-alert if it’s still firing); or
- Submit your feature request for “auto-resolve Incident after X minutes without new alert”.
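To make the difference in that last point concrete, here is a small sketch comparing the two timers. It assumes the current auto-resolve counts from when the Incident was created, and that alerts with a consistent key arrive every 5 minutes; the function names and timestamps are invented for the example:

```python
from datetime import datetime, timedelta


def resolve_time_fixed(alert_times, timeout):
    """Assumed current behaviour: the timer starts when the Incident is created."""
    return min(alert_times) + timeout


def resolve_time_idle(alert_times, timeout):
    """Requested behaviour: the timer restarts on every new alert."""
    return max(alert_times) + timeout


# Hypothetical Incident: alerts at 12:00, 12:05, 12:10 and 12:15.
start = datetime(2024, 1, 1, 12, 0)
alerts = [start + timedelta(minutes=m) for m in (0, 5, 10, 15)]
timeout = timedelta(minutes=20)

print(resolve_time_fixed(alerts, timeout))  # 2024-01-01 12:20:00
print(resolve_time_idle(alerts, timeout))  # 2024-01-01 12:35:00
```

With the fixed timer the Incident is resolved 5 minutes after the last alert even though the issue was still firing; with the idle timer it stays open until 20 quiet minutes have passed.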