How to track and resolve a set of alerts raised by cluster instances?


(John Christofolakos) #1

Hi, I’m working out how to use the Events V2 API to address the following use case:

  • clustered service running in AWS
  • instances can trigger alerts which can be specific to a sub-function and tenant. I would use these to construct a dedup key for the alert, e.g. PERSISTENCE-TENANT1.
  • other instances are likely to trigger the same alert, would like them all to be aggregated under a common incident as separate alerts. I thought the dedup key would do that for me, but it appears not.
  • someone is assigned the incident, and restarts the DB.
  • as each instance detects that the DB is now fine, I would like them to be able to clear the particular alert that each one raised.
  • once all the alerts have been resolved, then the incident resolves.

The problem is that I don’t see any way to use the API to get the alerts grouped under a common incident based on their sub-function and tenant. If I use a dedup key constructed as above, I get separate events in the log for a single alert - but those events can’t be resolved individually.
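For concreteness, here is a minimal sketch of the Events V2 trigger payload I’m describing (the routing key is a placeholder, and the function/field names are my own - this just builds the payload shape, it doesn’t send anything):

```python
# Sketch of an Events V2 trigger event whose dedup key combines
# sub-function and tenant, e.g. PERSISTENCE-TENANT1.
# The routing key below is a placeholder, not a real integration key.

EVENTS_V2_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key, sub_function, tenant, source):
    """Build a trigger event; the dedup key is shared by all instances
    that detect the same sub-function/tenant problem."""
    dedup_key = f"{sub_function}-{tenant}"
    return {
        "routing_key": routing_key,   # integration key (placeholder here)
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {
            "summary": f"{sub_function} failure for {tenant}",
            "source": source,         # the instance raising the alert
            "severity": "critical",
        },
    }

# Two instances detecting the same problem send events with the same
# dedup key, so PagerDuty folds them into a single alert rather than
# keeping one alert per instance.
e1 = build_trigger_event("ROUTING_KEY_PLACEHOLDER", "PERSISTENCE", "TENANT1", "instance-a")
e2 = build_trigger_event("ROUTING_KEY_PLACEHOLDER", "PERSISTENCE", "TENANT1", "instance-b")
print(e1["dedup_key"], e1["dedup_key"] == e2["dedup_key"])
```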

Any suggestions on how to achieve this use case, which after all seems pretty vanilla for a highly-available service running in the cloud?

(Demitri Morgan) #2

Hi John,

This will require merging incidents; that is (as yet) the only way to have alerts with different deduplication keys grouped under a single incident. Fortunately, you can do this programmatically through the REST API, with a PUT request to the resource /incidents/{id}/merge, where {id} is the ID of the incident into which the other incidents (specified in the request body) are merged.

Hopefully that helps!

(John Christofolakos) #3

Hi Demitri, thanks for the quick response.

Sorry, it seems I wasn’t clear - I would like alerts that share a common dedup key to be grouped under one incident. They represent a single problem that is being detected by multiple instances. But I would like them to be separate alerts that can be individually resolved by the instance that raised them. Whereas they actually just get grouped as events under a common alert. Alerts with a different dedup key should create a new incident and I would not want to merge those incidents as they would represent a different problem.

I could probably achieve this, as you say, by using different dedup keys for each instance and merging incidents. But this would likely have a lot of ‘update conflict’ problems, as the expectation is that when a situation arises, all the instances would be detecting it at roughly the same time. This would also be complex because the creation of the incidents is asynchronous and seems to take a few seconds.
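For what it’s worth, the per-instance key scheme I have in mind would look something like this (the instance-ID suffix is my own convention, not anything PagerDuty prescribes) - each instance gets its own alert and incident, at the cost of the merge step and the races described above:

```python
def per_instance_dedup_key(sub_function, tenant, instance_id):
    """One dedup key per instance, e.g. PERSISTENCE-TENANT1-i-0a1b2c,
    so each instance's trigger creates a separate alert (and incident),
    which would then need to be merged with the others."""
    return f"{sub_function}-{tenant}-{instance_id}"

print(per_instance_dedup_key("PERSISTENCE", "TENANT1", "i-0a1b2c"))
```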

Also the intent is to avoid excessive notifications - i.e. a specific failure should just trigger one notification to the escalation policy subscribers even though several instances have detected it. Triggering a notification per-instance, and then later merging those incidents would be a second-best solution.

I suspect you are right, that this is not currently supported. If there is any discussion on how PagerDuty users handle alarming by clustered services, I would welcome a pointer to it.

Thanks and regards,

(Demitri Morgan) #4

The nature of event handling and deduplication makes this a problem one cannot solve without multiple deduplication keys and incident merging, unfortunately. That’s because automatic resolution by sending a resolve event uses the same deduplication key as the trigger event to identify the existing alert to act upon.

Thus, there cannot be multiple alerts open with the same deduplication key; otherwise the key could not uniquely identify any given one of them, and there would be no way to tell a priori which alert (among the multiple alerts with the same key) to resolve for any given resolve event.
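To illustrate, here is a toy model of the deduplication semantics (my own simplification for discussion, not PagerDuty’s actual implementation): one key maps to at most one open alert, so a resolve event can only ever address that single alert.

```python
# Toy model of Events V2 deduplication -- an illustration only.
# At most one open alert exists per dedup key, so a resolve event
# identifies exactly one alert; there is nothing per-instance to resolve.
open_alerts = {}  # dedup_key -> sources folded into the one open alert

def handle_event(event_action, dedup_key, source):
    if event_action == "trigger":
        # A repeated trigger with the same key deduplicates into the
        # same open alert instead of creating a new one.
        open_alerts.setdefault(dedup_key, []).append(source)
    elif event_action == "resolve":
        # The key names at most one open alert, so a single resolve
        # closes it for every instance that triggered it.
        open_alerts.pop(dedup_key, None)

handle_event("trigger", "PERSISTENCE-TENANT1", "instance-a")
handle_event("trigger", "PERSISTENCE-TENANT1", "instance-b")
print(len(open_alerts))   # one alert, despite two instances triggering

handle_event("resolve", "PERSISTENCE-TENANT1", "instance-a")
print(len(open_alerts))   # that one resolve closed it for both instances
```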

However, you have contributed some interesting concepts, and I appreciate you sharing information about your use case. Hence, I’m going to ask our product team to review this.

(John Christofolakos) #5

Yes, that’s basically the conclusion I came to as well. Was hoping for something I missed, or some bit of deviousness perhaps 🙂

Thanks for the time taken analysing this.
