Hey PagerDuty community! 👋

We experienced service disruptions on August 28 that affected some customers in our US Service Regions, and we owe you complete transparency about what happened.

The overview: A Kafka system failure at 03:53 UTC triggered cascading issues that disrupted event processing for several hours. While no data was lost, some customers experienced 502 errors, delayed notifications, and degraded platform functions.

Here's a sneak peek of what you'll find in our full incident report:

  • Detailed technical breakdown of the Kafka failure and cascading effects
  • Complete timeline from initial incident through full resolution
  • How a second, smaller incident actually helped us identify the root cause
  • Why our external communication was delayed (and how we're fixing that)
  • Specific improvements we're implementing to prevent recurrence
  • Lessons learned from our response and recovery process

We know reliable alerting is mission-critical for your operations, and we're committed to earning back your trust through both transparency and concrete action.

Read the complete technical analysis:

 

Questions about the incident? Drop them below. Our team is here and listening. 💚

#IncidentResponse #Transparency #ServiceReliability #PagerDuty
