[Data Makers Fest] Teaching AI Systems to Fix Their Own Failures
Prompt engineering has traditionally been manual trial-and-error. When AI agents degrade in production, teams make gut-feel changes and hope for the best. How do you know if a prompt change actually improves things, or just shifts failures elsewhere?
We built an automated system that transforms production failures into quantifiable fixes. It starts by fetching examples of suboptimal sessions from our observability platform. The system analyzes these failures and clusters them into actionable patterns. For each pattern, it proposes specific prompt changes with before-and-after text. Then comes the critical step: validation. Only improvements that maintain or boost performance against our golden dataset make it into an automated pull request. Finally, the golden dataset itself is updated with representative samples of the assessed failures.
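To make the loop concrete, here is a minimal sketch of that weekly pipeline, assuming duck-typed callables for each stage; every name below (fetch_failures, cluster, propose_fix, evaluate, open_pull_request, extend_golden_dataset) is a hypothetical stand-in, not the actual PagerDuty tooling or APIs.

```python
# Hypothetical sketch of the failure-to-fix loop described above.
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class PromptFix:
    pattern: str  # the failure pattern this fix targets
    before: str   # current prompt text
    after: str    # proposed replacement text


def weekly_prompt_repair(
    fetch_failures: Callable[[], Sequence[dict]],          # suboptimal sessions from observability
    cluster: Callable[[Sequence[dict]], Sequence[dict]],   # group failures into patterns
    propose_fix: Callable[[dict], PromptFix],              # before/after prompt change per pattern
    evaluate: Callable[[Optional[PromptFix]], float],      # score against the golden dataset
    open_pull_request: Callable[[Sequence[PromptFix]], None],
    extend_golden_dataset: Callable[[Sequence[dict]], None],
) -> None:
    failures = fetch_failures()                      # 1. fetch suboptimal sessions
    patterns = cluster(failures)                     # 2. cluster into actionable patterns
    proposals = [propose_fix(p) for p in patterns]   # 3. propose prompt changes

    baseline = evaluate(None)                        # 4. score the current prompt
    accepted = [f for f in proposals if evaluate(f) >= baseline]

    if accepted:
        open_pull_request(accepted)                  # 5. ship only validated fixes

    extend_golden_dataset(patterns)                  # 6. refresh the golden dataset
```

The key design choice is step 4: a proposed change is only shipped if its golden-dataset score is at least the current baseline, which is what turns gut-feel prompt edits into measurable, reversible improvements.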
The result? We analyze all failures weekly and ship validated fixes—all without manual debugging. This approach transforms vague "my AI isn't working" complaints into concrete, measurable improvements backed by production data.
Scott has extensive experience in AI and Analytics. He has been at PagerDuty for 7 years. His journey there began as a Product Analyst, where he nurtured his analytics obsession. He then worked on the DS Core team building AI features, mostly of the classic ML flavor. In his current role on the PD Advance Incubator team, he conducts research and runs experiments in AI to identify big bets worth taking. Based in Lisbon, Scott enjoys running and lounging at kiosks.