[KubeCon Session] Closing the Loop: Leveraging Agentic AI for Real-Time Diagnosis and Secure Remediation
Incidents are inevitable, but identifying their probable causes doesn’t have to be overwhelming. What if you could harness the power of AI Agents to streamline your incident response? In this talk, we will demonstrate how to integrate agent-based AI systems with infrastructure metrics and events to revolutionize your incident management process.
We will explore how to leverage data sources like Prometheus, OpenTelemetry, and Kubernetes state metrics to build a knowledge graph that maps relationships between system components, incidents, and telemetry data. This graph simplifies troubleshooting by narrowing down potential causes, enabling faster root cause analysis. We will then dive into how an agent-based AI system reasons over this graph to provide real-time auto-diagnosis, offering actionable insights when incidents occur. These systems continuously learn from past events, helping to reduce Mean Time to Resolution (MTTR) and improve diagnostic accuracy in distributed environments.
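As a rough illustration of how a graph can narrow down potential causes, here is a minimal Python sketch. It is not the system presented in the session: the `DEPENDS_ON` map, the `ANOMALOUS` set, and the `candidate_causes` helper are all hypothetical, and a real knowledge graph would be populated from telemetry rather than hard-coded.

```python
from collections import deque

# Hypothetical dependency edges: component -> components it depends on.
# In practice these would be derived from telemetry, not hard-coded.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["postgres"],
    "cart": ["redis"],
    "postgres": [],
    "redis": [],
}

# Hypothetical set of components currently showing anomalous signals.
ANOMALOUS = {"checkout", "payments", "postgres"}


def candidate_causes(graph, anomalous, alerting):
    """Walk from the alerting component through its anomalous dependencies.

    Returns the reachable components that have no anomalous dependency
    of their own, i.e. the deepest unhealthy nodes on each path.
    """
    seen = {alerting}
    queue = deque([alerting])
    candidates = []
    while queue:
        node = queue.popleft()
        bad_deps = [d for d in graph.get(node, []) if d in anomalous]
        if not bad_deps:
            # Nothing unhealthy below this node, so it is a candidate cause.
            candidates.append(node)
        for dep in bad_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return candidates


if __name__ == "__main__":
    print(candidate_causes(DEPENDS_ON, ANOMALOUS, "checkout"))
```

With these assumed inputs, the traversal skips the healthy `cart` and `redis` branch entirely and leaves a single candidate for an agent or responder to inspect first.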
Finally, we will demonstrate how to close the loop using automated remediation. We will discuss leveraging secure runners to interface with your infrastructure, executing localized remediation scripts and playbooks. By triggering these runners directly from the AI’s diagnostic output, you can move from "knowing what’s wrong" to "fixing the problem" in seconds, all while maintaining strict security boundaries. You will leave this session with practical insights on integrating Agentic AI with Kubernetes to maintain system reliability with greater confidence.
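One way to picture the security boundary between diagnosis and remediation is an allowlist: the runner accepts only named actions with validated parameters, never free-form commands. The sketch below is an assumption-laden illustration rather than the runner discussed in the session; `ALLOWED_ACTIONS`, `build_command`, and the shape of the `diagnosis` dictionary are hypothetical. It only builds the command and does not execute it.

```python
import re

# Hypothetical allowlist: action name -> fixed argv template.
# Anything the AI proposes outside this map is refused.
ALLOWED_ACTIONS = {
    "restart_deployment": [
        "kubectl", "rollout", "restart", "deployment/{name}", "-n", "{namespace}",
    ],
}

# Conservative pattern for Kubernetes-style names (lowercase, digits, hyphens).
NAME_RE = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def build_command(diagnosis):
    """Turn a diagnostic output into an argv list, or raise if not permitted.

    `diagnosis` is an assumed structure: {"action": str, "params": dict}.
    """
    template = ALLOWED_ACTIONS.get(diagnosis.get("action"))
    if template is None:
        raise PermissionError("action is not on the allowlist")

    params = diagnosis.get("params", {})
    clean = {}
    for key in ("name", "namespace"):
        value = str(params.get(key, ""))
        # Reject anything that is not a plain resource name.
        if not NAME_RE.fullmatch(value):
            raise ValueError(f"invalid value for {key!r}")
        clean[key] = value

    # Build an argv list (no shell), so parameters cannot inject extra commands.
    return [part.format(**clean) for part in template]


if __name__ == "__main__":
    example = {
        "action": "restart_deployment",
        "params": {"name": "payments", "namespace": "prod"},
    }
    print(build_command(example))
```

Keeping the templates fixed and passing an argument list rather than a shell string means the diagnostic output can choose among pre-approved actions but cannot introduce new ones.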