Hey PagerDuty community!
We experienced service disruptions on August 28 that affected some customers in our US Service Regions, and we owe you complete transparency about what happened.
The overview: A Kafka system failure at 3:53 UTC triggered cascading issues that disrupted event processing for several hours. No data was lost, but some customers experienced 502 errors, delayed notifications, and degraded platform functionality.
Here's a sneak peek of what you'll find in our full incident report:
- Detailed technical breakdown of the Kafka failure and cascading effects
- Complete timeline from initial incident through full resolution
- How a second, smaller incident actually helped us identify the root cause
- Why our external communication was delayed (and how we're fixing that)
- Specific improvements we're implementing to prevent recurrence
- Lessons learned from our response and recovery process
We know reliable alerting is mission-critical for your operations, and we're committed to earning back your trust through both transparency and concrete action.
Read the complete technical analysis:
Questions about the incident? Drop them below. Our team is here and listening.
#IncidentResponse #Transparency #ServiceReliability #PagerDuty