Hey folks! We have the following use case, and I’m wondering what the known best practices are, or how you’d suggest setting up service response for it.
We have a PagerDuty-Datadog integration, with Datadog monitors triggering PagerDuty alerts whenever a monitor goes over its threshold. We want to make sure that:
- A triggered but unacknowledged page pages again after some time (30m)
- Acknowledged but unresolved pages page again after some time (10m)
I was hoping to set it up through a combination of the “Open incidents resolve after 10 minutes” and “Re-trigger acknowledged incidents after 30 minutes and re-notify assigned responders” settings in PagerDuty, coupled with Datadog monitors escalating every 10 minutes. The logic was that Datadog would report to PagerDuty every 10 minutes while the monitor is still red, and PagerDuty would auto-resolve incidents to make way for these new notifications. I didn’t initially realize that “open incident” != “triggered incident”: the setup was auto-resolving not just triggered incidents but also acknowledged ones after 10 minutes, creating unnecessary spam for the on-calls.
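To make the failure mode concrete, here is a rough sketch of how I now understand the two timers to interact. It is a toy model, not PagerDuty's API: the `Incident` class, the `step` function, and the parameter names are all made up for illustration, and measuring the auto-resolve timer from incident creation is my assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model for illustration only; none of these names come from
# PagerDuty. "Open" covers both triggered and acknowledged incidents.
OPEN_STATUSES = {"triggered", "acknowledged"}


@dataclass
class Incident:
    status: str                      # "triggered", "acknowledged", or "resolved"
    opened_at: int                   # minute the incident was created
    acked_at: Optional[int] = None   # minute it was acknowledged, if ever


def step(incident: Incident, now: int,
         auto_resolve_min: int = 10, ack_timeout_min: int = 30) -> str:
    """Apply both service timers to one incident at minute `now`."""
    age = now - incident.opened_at  # assumption: timer runs from creation
    if incident.status in OPEN_STATUSES and age >= auto_resolve_min:
        # Auto-resolve only checks "open", so acknowledged incidents go too.
        incident.status = "resolved"
    elif (incident.status == "acknowledged"
          and incident.acked_at is not None
          and now - incident.acked_at >= ack_timeout_min):
        # Acknowledgement timeout: the incident goes back to triggered.
        incident.status = "triggered"
    return incident.status


if __name__ == "__main__":
    acked = Incident(status="acknowledged", opened_at=0, acked_at=2)
    print(step(acked, now=10))
```

With the 10-minute auto-resolve in place, an acknowledged incident in this model never reaches the 30-minute acknowledgement timeout, which matches the spam we saw.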
This feels like a reasonable enough setup that there should be some established practices around it. Care to share? Thanks!