Should Incidents Be Re-Opened?

questions

(Yusuf Ozturk) #1

Hi everyone,

I searched for this topic on Google and found that incidents should not be re-opened unless they were closed incorrectly. Before starting, I want to let you know that our monitoring tool uses API v2.

Okay, let’s move on to an example. Assume that we have a failing hardware disk in one of our HP servers. Our monitoring indicated high latency and a high write error rate on the disk, so the monitoring tool automatically opens an incident on PagerDuty:

{
  "payload": {
    "summary": "Disk is in warning state",
    "timestamp": "2015-07-17T08:42:58.315+0000",
    "source": "hp003",
    "severity": "warning",
    "custom_details": {
      "latency": "1500ms",
      "write_errors": 159
    }
  },
  "event_action": "trigger",
  "routing_key": "servicekey",
  "dedup_key": "hp003-disk-01"
}

After the incident was created, the monitoring tool no longer detected high latency or write errors because the disk was idle, so it resolved the incident on PagerDuty:

{
  "payload": {
    "summary": "Disk is in warning state",
    "timestamp": "2015-07-17T08:42:58.315+0000",
    "source": "hp003",
    "severity": "warning",
    "custom_details": {
      "latency": "10ms",
      "write_errors": 9
    }
  },
  "event_action": "resolve",
  "routing_key": "servicekey",
  "dedup_key": "hp003-disk-01"
}

First of all, the monitoring tool resolved the incident but, in the same event, pushed new latency and write error values that are well below the initial ones. I see that PagerDuty never updates the original incident’s custom details. Should incidents be treated as read-only for their lifetime, or should we add new values some other way? Because of API v2, I see the updated values in the Alerts section, but the incident itself is not updated.

Okay, let’s continue. After a while, the monitoring tool detected high latency and a high error rate again and wants to create an incident on PagerDuty. ITIL states that we should open a new incident for this issue. I assume this is the right approach? So the monitoring tool created a new incident with an updated dedup_key, “hp003-disk-02”:

{
  "payload": {
    "summary": "Disk is in warning state",
    "timestamp": "2015-07-17T08:42:58.315+0000",
    "source": "hp003",
    "severity": "warning",
    "custom_details": {
      "latency": "1500ms",
      "write_errors": 159
    }
  },
  "event_action": "trigger",
  "routing_key": "servicekey",
  "dedup_key": "hp003-disk-02"
}
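
For illustration, here is a minimal Python sketch of how a monitoring tool might hand out a fresh dedup_key for each new occurrence of the same problem, as in the step from "hp003-disk-01" to "hp003-disk-02". The class name `DedupKeyGenerator`, the `next_key` method, and the in-memory counter are all hypothetical; a real tool would probably need to persist the counter somewhere.

```python
from collections import defaultdict


class DedupKeyGenerator:
    """Hypothetical helper: returns a new dedup_key per occurrence of a problem."""

    def __init__(self):
        # Occurrence counter per "<source>-<component>" pair, kept in memory only.
        self._counters = defaultdict(int)

    def next_key(self, source, component):
        base = f"{source}-{component}"
        self._counters[base] += 1
        # Zero-pad to two digits to match the keys used in this post.
        return f"{base}-{self._counters[base]:02d}"


keys = DedupKeyGenerator()
first = keys.next_key("hp003", "disk")   # key for the first incident
second = keys.next_key("hp003", "disk")  # key for the recurrence
print(first, second)  # prints "hp003-disk-01 hp003-disk-02"
```

Reusing the same key while an incident is still open, and only asking for a new one after a resolve, would keep repeated triggers grouped under one incident.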

Okay, now we come to the critical part. The monitoring tool also connected to the HP mainboard and detected a hardware error on the disk, so we should replace the disk very soon. The monitoring tool therefore wants to change the incident state from Warning to Critical. In this case, what is the correct approach? Should we resolve the Warning incident first and create a new incident for the Critical state, or should we simply update the previous incident’s severity?

Also, why are we not able to update custom_details? Putting numeric values under custom_details is not a good approach in this case, because users will continue to see the old values. How do you solve this issue? Do you suggest modifying the payload summary instead?

Thanks for your help!

Thanks for help!

Yusuf


(Demitri Morgan) #2

Hi Yusuf,

It should be possible to upgrade an alert’s severity to a different level by sending an event through the Events API (of type trigger with the same deduplication key as the original alert). There should not be any need to send a resolve (which should only happen when the alert is no longer applicable) and then send a new event with different severity.
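
As a rough sketch of the above, the escalation could look like this in Python: a second trigger event that reuses the dedup key of the open alert and carries a higher severity. The helper names `build_trigger_event` and `send_event` are made up for this example, "servicekey" is the placeholder from the original post, and the new summary text is an assumption. The URL is the Events API v2 enqueue endpoint.

```python
import json
import urllib.request

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, dedup_key, summary, source, severity):
    """Hypothetical helper: build an Events API v2 trigger event as a dict."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Same dedup_key as the open alert, so no new incident is created.
        "dedup_key": dedup_key,
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


def send_event(event):
    """Hypothetical helper: POST the event and return the decoded JSON reply."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Escalate the open "hp003-disk-02" alert from warning to critical.
escalation = build_trigger_event(
    routing_key="servicekey",
    dedup_key="hp003-disk-02",
    summary="Disk hardware error detected",  # assumed text, may not be applied
    source="hp003",
    severity="critical",
)
# send_event(escalation)  # needs a real routing key, so left commented out
```

No resolve event is sent in between; the existing alert stays open and only its severity is expected to change.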

At this time, severity is the only field in an existing alert that can be updated via the Events API (apart from status). This allows you to upgrade the urgency of the incident through the Events API if you are using dynamic notifications.

If you are interested in seeing custom details and other fields updated in alerts, or in similar functionality, we would love to hear more feedback from you about how you would use it in your operations and which problems it would solve for you. That would be very useful to our product team, with whom I can share this post.

