Alert grouping under incidents from Datadog


(Matěj šusta) #1

We currently have a fairly simple setup in which a PagerDuty service for our application server is integrated with some Datadog monitors. However, while investigating one of our incidents, I noticed that PagerDuty processes the events coming from Datadog in a way I did not expect, and this is causing us some problems.

The PD service is configured with the modern Incident Behavior setting "Create alerts and incidents": "Will create an alert and then add it to a new incident. These incidents can be merged."
The Datadog monitor is the usual one for disk space monitoring: it sends a warning when less than 20% is free and an alert when less than 10% is free.

My expectation was that a single incident would be created, with each state change listed as a separate alert. Instead there is just a single incident with a single alert, and all the updates coming from Datadog are only listed in the alert timeline.
This is compounded by the fact that Datadog sends the events in a format that PagerDuty does not understand by default, so the relevant Severity is not set. The warning/alert information only appears in the monitor_state field, as either Warn or Triggered.

I have gone through both the Datadog and the PagerDuty documentation, and there is very little information on how grouping based on Incident Behavior works.

Some questions:

  • Can I get more information on the inner workings of the Incident Behavior service setting?
  • Why are the events grouped together under a single alert? Doesn't that reduce the responder's visibility in the main incident?
  • Is this caused by the PagerDuty configuration or by how the Datadog integration uses the PD API?
  • Is it only our setup, or would anybody else expect the Datadog integration to understand the severity by default?

Thanks, Matthew

(Paul) #2

If you are noticing events getting grouped into the same alert in PagerDuty, Datadog must be sending the same incident_key / dedup_key value in the payload. When an event comes in with the same key, it will be grouped into the same alert, as you noted. If you would prefer to have it trigger a new incident, you can configure it to send a different key. Here are our docs on deduplication.
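To illustrate the deduplication behavior described above, here is a minimal Python sketch. It builds event bodies shaped like Events API v2 trigger events and groups them by dedup_key. The grouping function is only a toy model of the behavior, not PagerDuty's implementation, and the routing key, the key names, and the idea of appending the monitor state to the key are all hypothetical.

```python
from collections import defaultdict


def build_event(routing_key, dedup_key, summary, severity, source="datadog"):
    """Build an Events API v2 style trigger event body (nothing is sent)."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {"summary": summary, "source": source, "severity": severity},
    }


def group_into_alerts(events):
    """Toy model: events sharing a dedup_key land in the same alert."""
    alerts = defaultdict(list)
    for event in events:
        alerts[event["dedup_key"]].append(event)
    return dict(alerts)


ROUTING_KEY = "HYPOTHETICAL_ROUTING_KEY"  # placeholder, not a real key

# One key per monitor: the warning and the alert collapse into one alert.
same_key = [
    build_event(ROUTING_KEY, "disk-monitor", "Disk below 20% free", "warning"),
    build_event(ROUTING_KEY, "disk-monitor", "Disk below 10% free", "critical"),
]

# A distinct key per monitor state (an assumed scheme): two separate alerts.
per_state_key = [
    build_event(ROUTING_KEY, "disk-monitor-warn", "Disk below 20% free", "warning"),
    build_event(ROUTING_KEY, "disk-monitor-triggered", "Disk below 10% free", "critical"),
]

grouped_same = group_into_alerts(same_key)
grouped_per_state = group_into_alerts(per_state_key)
```

With a shared key, grouped_same holds one alert containing both events; with per-state keys, grouped_per_state holds two alerts with one event each.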

If you are looking to have the payload of your Datadog alert determine the incident severity, you can do this using event rules. The first step is to edit the service and choose the option to use alert severity to determine how responders are notified for each incident. You can then set up a service-level event rule under the Event rules tab of the service, where you can configure rules that set the severity of the incident based on the payload. Here is our Knowledge Base section on PD-CEF fields.
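The effect of such an event rule can be sketched in Python as a simple lookup. This is only an illustration of the logic, not how event rules are configured in the UI. The monitor_state values Warn and Triggered come from the Datadog payload mentioned earlier in the thread; the mapping to the PD-CEF severities warning and critical, and the info fallback, are assumptions.

```python
# Assumed mapping from Datadog's monitor_state to a PD-CEF severity.
STATE_TO_SEVERITY = {
    "Warn": "warning",
    "Triggered": "critical",
}


def severity_for(event_details, default="info"):
    """Return the severity an event rule might set for this payload.

    event_details is a dict of custom fields from the incoming event;
    unknown or missing states fall back to the (assumed) default.
    """
    state = event_details.get("monitor_state")
    return STATE_TO_SEVERITY.get(state, default)


warn_severity = severity_for({"monitor_state": "Warn"})
triggered_severity = severity_for({"monitor_state": "Triggered"})
unknown_severity = severity_for({})
```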

(Matěj šusta) #3

I looked into the API docs and reviewed the payloads, and Datadog is indeed using the same incident_key value for all events coming from a single monitor.

As for the severity, I also asked Datadog engineering, and this is not a misconfiguration, so, as you suggest, event rules are the only option right now.

One follow-up then: I still like the ability to aggregate, split, or merge incidents, which requires the new Incident Behavior setting. If I configure the service with event rules to categorize the severity properly, will a new event that updates the alert trigger a change on the whole incident?

(Paul) #4

Using event rules to set the severity will only change the behavior of the incident if your service is set to use alert severity to determine how responders are notified. When editing the service, this option is under Incident Settings > How should responders be notified?

If you have your service set up this way, events sent with a severity of critical or error will trigger high urgency incidents, and events with severity of warning or info will trigger low urgency incidents.
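As a quick illustration, the severity-to-urgency mapping just described can be written as a small Python function. The function name is hypothetical, and rejecting unrecognized severities with an error is an assumption made for the sketch.

```python
HIGH_URGENCY_SEVERITIES = {"critical", "error"}
LOW_URGENCY_SEVERITIES = {"warning", "info"}


def urgency_for(severity):
    """Map an event severity to the incident urgency described above."""
    if severity in HIGH_URGENCY_SEVERITIES:
        return "high"
    if severity in LOW_URGENCY_SEVERITIES:
        return "low"
    # Assumption: anything else is treated as invalid input in this sketch.
    raise ValueError(f"unknown severity: {severity!r}")


urgencies = {s: urgency_for(s) for s in ("critical", "error", "warning", "info")}
```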

Each user can configure how they are notified for both low and high urgency incidents in the Notification Rules tab of their user profile. Most users choose to only receive an email for low urgency incidents, since they are not as urgent. Note also that low urgency incidents only notify once and do not escalate.
