Email Integration - resolve all open incidents

(Marcus Schorow) #1

We have a job that scans our systems every 10 minutes and reports error codes; if it finds any errors, it creates an email and sends it to PD. We want to be alerted when a new error comes up, but not when the error code matches an already open incident, so we use an alert key regex on the email that grabs the error codes from the email body.
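
For readers following along, here is a minimal sketch of the alert key idea. The body format ("Error codes: E12,E47") and the pattern are made up for illustration, since the thread does not show the real emails or the real regex. The first capture group stands in for the alert key.

```python
import re

# Hypothetical pattern: the actual email format and regex are not given in the thread.
ALERT_KEY_RE = re.compile(r"Error codes:\s*([A-Z0-9,]+)")

def extract_alert_key(body):
    """Return the captured error-code string, or None when the body has none."""
    match = ALERT_KEY_RE.search(body)
    return match.group(1) if match else None

error_key = extract_alert_key("Scan finished.\nError codes: E12,E47\n")
repeat_key = extract_alert_key("Scan finished.\nError codes: E12,E47\n")
all_clear_key = extract_alert_key("All clear: no errors found.")
```

Two emails with the same codes yield the same key, so the second one would not open a new incident. An all-clear body yields no key at all, which is exactly why it cannot be matched to the open incidents.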

Our question is how best to automatically resolve incidents once the errors are fixed. If our job detects no errors, it sends an all-clear email that we would like to use to resolve all open incidents. We don’t track the error codes that have been sent, so this all-clear email does not contain any string that can easily be mapped to the alert keys of the open incidents. What’s the best way to set up a resolve rule that will resolve all open incidents (or all incidents whose email matches on the subject, or on something other than the alert key)?

(David Shackelford) #2

Hi Marcus!

Thanks for your question. I’m the product management lead for our monitoring data ingestion. I checked around with our support and engineering teams and looked for creative workarounds, but unfortunately there’s not a way right now to accomplish what you’re looking for. You can resolve email alerts using custom email rules, but only if you’re able to find a common key to match the triggers and resolves. You can also use our REST APIs to grab open incidents on a service and resolve them, but that won’t work via email.

One thing that occurs to me: you mentioned that your job scans for errors every 10 minutes. That means you could set an auto-resolve timer of 10 minutes, so that any incident still open after 10 minutes resolves itself. If the job runs continually, an error code that is still present on the next scan will presumably create a new alert.

I’m happy to bring the use case back to the engineering team. It’s not currently on our roadmap, but we’re always listening to customer feedback and we can reach out if and when we prioritize it for work. If there’s anything more about the use case and current pain you can share, that would also be really helpful to bring to the team.

Best,
David

(Marcus Schorow) #3

We considered auto-resolving after 10 minutes, but I think that would trigger a new page every 10 minutes for as long as the error remains unresolved.
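
A rough simulation of that concern, under simplifying assumptions that are not from the thread: one error persists the whole time, every scan email carries the same alert key, and the auto-resolve timer fires just before the scan email that arrives in the same minute. The function name and parameters are hypothetical.

```python
def count_pages(minutes, scan_every, auto_resolve_after=None):
    """Count pages raised for a single error that never goes away."""
    pages = 0
    opened_at = None  # minute the current incident opened; None if nothing is open
    for t in range(0, minutes, scan_every):
        # Assumption: the timer resolves the incident before this scan's email lands.
        timer_expired = (
            opened_at is not None
            and auto_resolve_after is not None
            and t - opened_at >= auto_resolve_after
        )
        if timer_expired:
            opened_at = None
        if opened_at is None:
            # No open incident with this alert key, so the email opens one and pages.
            opened_at = t
            pages += 1
    return pages

pages_without_timer = count_pages(60, 10)
pages_with_timer = count_pages(60, 10, auto_resolve_after=10)
```

Over one hour, the run without a timer pages once and the run with a 10-minute timer pages on every scan, which is the behaviour we wanted to avoid.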

We ended up just using the API to query for all triggered or acknowledged incidents under the service and resolve them all. Thanks.
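
In case it helps anyone else, here is a sketch of that approach, not our exact script. It assumes the v2 REST API as I understand it: `GET /incidents` filtered by `service_ids[]` and `statuses[]`, then a bulk `PUT /incidents` that sets `status` to `resolved`, with a `From` header naming a valid user. The function names, service ID, and key are placeholders, and the bulk endpoint limits how many incidents one request may update, so check the API reference for larger batches.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.pagerduty.com"

def _call(method, url, api_key, from_email, body=None):
    """Send one request to the REST API and return the decoded JSON response."""
    headers = {
        "Authorization": f"Token token={api_key}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
        "From": from_email,
    }
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def open_incidents_url(service_id, offset=0, limit=100):
    """Build the list URL for triggered or acknowledged incidents on one service."""
    params = [
        ("service_ids[]", service_id),
        ("statuses[]", "triggered"),
        ("statuses[]", "acknowledged"),
        ("limit", limit),
        ("offset", offset),
    ]
    return f"{API}/incidents?{urllib.parse.urlencode(params)}"

def resolve_payload(incident_ids):
    """Build the bulk-update body that marks each incident as resolved."""
    return {
        "incidents": [
            {"id": i, "type": "incident_reference", "status": "resolved"}
            for i in incident_ids
        ]
    }

def resolve_all_open(service_id, api_key, from_email, call=_call):
    """Collect every open incident ID on the service, then resolve them in one request."""
    ids, offset = [], 0
    while True:
        page = call("GET", open_incidents_url(service_id, offset), api_key, from_email)
        ids += [inc["id"] for inc in page["incidents"]]
        if not page.get("more") or not page["incidents"]:
            break
        offset += len(page["incidents"])
    if ids:
        call("PUT", f"{API}/incidents", api_key, from_email, resolve_payload(ids))
    return ids
```

The job would call something like `resolve_all_open("PXXXXXX", api_key, "oncall@example.com")` whenever a scan finds no errors, in place of sending the all-clear email. The `call` parameter exists only so the HTTP layer can be swapped out for testing.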

(system) closed #4

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.
