
Our use case seems to be the opposite of what PagerDuty currently supports. I know we can use the dedup key to merge alerts into an already open incident and help with grouping alerts.

However, our use case is the reverse: don’t send a notification unless we have received X number of alerts in Y time. What we see is that a run will fail, but the next run 15 minutes later succeeds, so our on-call person gets paged and then resolves the ticket because the next run was successful. It would be good not to page the on-call person unless the next run also fails, but we haven’t seen how to handle this in PD. Auto-resolve doesn’t work in our case because these are individual alerts from separate runs of a pipeline; the dedup key is what links them.
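For context, a rough sketch of how we key our events today. The endpoint and payload shape are standard Events API v2; the routing key and pipeline names are placeholders:

```python
# Minimal sketch of our event submission (Events API v2). The routing key
# and pipeline names are placeholders.
import requests

ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # placeholder

def report_failed_run(pipeline: str, run_id: str) -> None:
    # Every run of the same pipeline shares one dedup_key, so repeated
    # failures merge into a single open incident instead of opening new ones.
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "dedup_key": f"pipeline-failure/{pipeline}",
        "payload": {
            "summary": f"Pipeline {pipeline} run {run_id} failed",
            "source": pipeline,
            "severity": "error",
        },
    }
    requests.post(
        "https://events.pagerduty.com/v2/enqueue", json=event, timeout=10
    ).raise_for_status()
```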

Has anyone dealt with a use case like this? Have you been able to solve it?

Hi @bschiff!

 

This feature is included in AIOps → Event Orchestration, specifically at the Service Orchestrations level. Once an event is routed to a service, the orchestration rules for that service can include thresholds that must be met before the event creates a notification. You don’t need a Global Orchestration to use this Service Orchestration feature; you can create these orchestrations for the service and send your events to its integration endpoint directly.
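To sketch the shape of this: the `event_orchestrations/services` REST API endpoint below is real, but treat the rule body, and especially the threshold piece, as illustrative pseudo-config. The exact condition and threshold fields come from the Event Orchestration docs, and they’re easiest to set in the UI:

```python
# Illustrative sketch only. The endpoint exists in the PagerDuty REST API,
# but the rule fields below (especially the threshold) are stand-ins for
# what you'd configure in the Service Orchestration UI.
import requests

API_TOKEN = "YOUR_REST_API_TOKEN"   # placeholder
SERVICE_ID = "PXXXXXX"              # placeholder PagerDuty service ID

rule_sketch = {
    "orchestration_path": {
        "sets": [{
            "id": "start",
            "rules": [{
                "label": "Only page on repeated pipeline failures",
                # PCL condition; the "seen N times in Y minutes" threshold
                # is layered on top of a condition like this in the UI.
                "conditions": [{"expression": "event.summary matches part 'failed'"}],
                "actions": {},  # e.g. suppress until the threshold is met
            }],
        }],
        "catch_all": {"actions": {}},
    }
}

resp = requests.put(
    f"https://api.pagerduty.com/event_orchestrations/services/{SERVICE_ID}",
    headers={
        "Authorization": f"Token token={API_TOKEN}",
        "Content-Type": "application/json",
    },
    json=rule_sketch,
    timeout=10,
)
resp.raise_for_status()
```

In practice it’s simplest to build the rule in the Service Orchestration UI and use the API only to review it or replicate it across services.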

If you’d like a longer intro to Event Orchestration, reach out to your account team for a full demo. We also have some videos on our YouTube channel:

Pausing Incidents: https://www.youtube.com/watch?v=rkb_ut95Irw&ab_channel=PagerDutyInc.

EO tips and tricks: https://www.youtube.com/watch?v=MeQsfrhgD2k&t=21s&ab_channel=PagerDutyInc

 

HTH,

--mandi


Is that logic based on the dedup key, or is it events to a specific service? We may have multiple events going to the same service, but the type of event (based on dedup key) is where we would want this logic.

Sounds like we would need to make our service setup more granular.


Yes, it would be events to a specific service after they’ve potentially cascaded through other rules.

There’s effectively no limit to how many services you can have in your account, so making things more granular, based on the actions your team expects to take, the SDLC step represented, the environment, or whatever other characteristic, can help you manage alerts and information better.

So if you were trying to capture, say, failed software builds, you’d definitely separate those out from the production alerts for the service they build, and possibly use change events to surface those failures on the production services for context.
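If you go the change-events route, here’s a minimal sketch, assuming a Change Events integration key on the production service (the key and names are placeholders):

```python
# Minimal sketch: record a build failure as a change event on the
# production service's timeline instead of paging anyone. The routing key
# and names are placeholders.
from datetime import datetime, timezone

import requests

CHANGE_ROUTING_KEY = "PROD_SERVICE_CHANGE_INTEGRATION_KEY"  # placeholder

def record_build_failure(app: str, build_id: str) -> None:
    event = {
        "routing_key": CHANGE_ROUTING_KEY,
        "payload": {
            "summary": f"Build {build_id} for {app} failed",
            "source": "ci",
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "custom_details": {"build_id": build_id},
        },
    }
    requests.post(
        "https://events.pagerduty.com/v2/change/enqueue", json=event, timeout=10
    ).raise_for_status()
```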

A couple of common examples:

  • You can have a service per environment, so “Production App A” and “Staging App A” get the same types of alerts for testing and verification, but staging is configured to notify at a lower urgency and to not notify overnight (see the sketch after this list).
  • For things like ETL jobs, where the run is effectively the production workload, folks might divide those up with a different service for each step in the pipeline, especially if one part is particularly troublesome.
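For the environment split, the only decision the sending code makes is which service’s integration key to use; urgency and notification hours live on the service itself. A minimal sketch, with placeholder keys and environment names:

```python
# Minimal sketch: route the same alert to "Production App A" or
# "Staging App A" by picking that service's integration key. The keys,
# environment variable, and names are placeholders.
import os

import requests

ROUTING_KEYS = {
    "production": "PROD_APP_A_INTEGRATION_KEY",   # placeholder
    "staging": "STAGING_APP_A_INTEGRATION_KEY",   # placeholder
}

def trigger_alert(summary: str) -> None:
    env = os.environ.get("APP_ENV", "staging")
    event = {
        "routing_key": ROUTING_KEYS[env],
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": f"app-a-{env}",
            "severity": "error",
        },
    }
    requests.post(
        "https://events.pagerduty.com/v2/enqueue", json=event, timeout=10
    ).raise_for_status()
```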

What you definitely don’t want is a single “Team A” service where you dump all kinds of random stuff and try to do more sophisticated things via rules. That makes the long-term management of the environment much more difficult, especially adding new services, decommissioning old ones, or changing ownership of services, and it gives you no way of using the Service Graph feature in a way that makes sense.

Happy to chat in real time or set up a call if you want to talk through what you’re up to. Our team is community-team@pagerduty.com. 

