
Hi everyone,

I’m part of an IT team working to unify our incident response with infrastructure automation, and we’re exploring better ways to integrate tools like Ansible, Rundeck, or Attune with PagerDuty for real-time operations.

Here’s what we’re trying to achieve:

  • Automatically trigger predefined scripts or runbooks (via Rundeck or Ansible Playbooks) when certain PagerDuty alerts are fired

  • Create custom PagerDuty services for various infrastructure tiers (e.g., DB layer, web servers, networking) with tailored responses

  • Ensure clean logging, rollback, and audit trails for all automated responses triggered by PagerDuty incidents

  • Tie notifications to team roles or escalation policies based on specific server groups or alert types

Has anyone here successfully set up something similar?


We’d love to hear:

  • What integration method worked best for you (webhooks, APIs, plugins)?

  • Any challenges around permissions or security controls when executing remote tasks post-incident?

  • Advice on keeping automation safe and non-disruptive during active incidents?

Would really appreciate any workflows, tooling stacks, or gotchas to avoid!

Thanks in advance,

Hi EtherealShroudX,

What you’re trying to get working is exactly 😎 what we at PagerDuty provide to our customers.

That is the reason PagerDuty acquired Rundeck 4 years ago (today we call it Runbook Automation, or RBA).

Let me give you a bit more detail based on your questions.

  • Automatically trigger predefined runbooks or Ansible playbooks when PagerDuty incidents are fired. This is typically done with PagerDuty Automation Actions, which is fully integrated to call RBA jobs (no webhooks). You can even run them from the mobile app.

  • Define custom PagerDuty services for each infrastructure tier (DB, Web, Network), each linked to its own automation workflow. For example, a DB alert might trigger a health check or failover job, while a web alert might restart services or check logs.

  • Runbook Automation provides structured logging, execution tracking, and clear audit history. This gives any team full visibility into who ran what, when, and the outcome, ideal for compliance and root-cause analysis.

  • You can scope notifications and automation access using PagerDuty’s escalation policies in combination with Runbook Automation role-based access controls and node-level permissions, ensuring only the right people are notified and empowered to act.

What integration method worked best?

  • PagerDuty triggering the Runbook Automation API is the most flexible and secure method when combining it with Incident Management. We do not really favour webhooks in this case. You can also dynamically extract details from the alert (such as the affected system or hostname).

  • Rundeck’s Ansible plugin can help if you’re working with Ansible Playbooks and inventories. It enables centralized, version-controlled execution without manual SSH. It is powerful and heavily used by RBA customers.

  • Beyond that, the platform comes with many other valuable enterprise-grade plugins: Git plugins, plugins for various vaults and secret managers, plugins for single sign-on software, and so on.
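As a rough illustration of calling the Runbook Automation API with details taken from an alert, here is a minimal Python sketch. It only builds the request and does not send it. The endpoint shape and the `X-Rundeck-Auth-Token` header follow the Rundeck API as I understand it; the API version number, the option names (`hostname`, `service`), and the flat `alert` dictionary are assumptions for illustration and would need to match your own job definition and alert payload.

```python
import json
import urllib.request


def build_job_run_request(base_url, job_id, api_token, alert, api_version=41):
    """Build (but do not send) a request that runs a Runbook Automation job.

    The option names below are hypothetical; they must match the options
    defined on the job you are calling.
    """
    options = {
        "hostname": alert.get("hostname", ""),
        "service": alert.get("service", ""),
    }
    # api_version=41 is an assumption; use the version your server supports.
    url = f"{base_url.rstrip('/')}/api/{api_version}/job/{job_id}/run"
    body = json.dumps({"options": options}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Rundeck-Auth-Token": api_token,  # token with minimal access
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )
```

Sending it would be a matter of passing the returned object to `urllib.request.urlopen`, ideally with a timeout and with the token read from a secret store rather than hard-coded.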

Common challenges and how to address them:

  • Permissions and security controls: Define ACLs (via the UI!) per team, project, or job type in RBA. Use API tokens with the minimum required access, and avoid granting broad server permissions. Use the RBA Enterprise Runners (local executors, not available in OSS) to operate in a zero-trust model.

  • The introduction of the Enterprise Runner added a further layer of security to the platform.

  • Safe automation during incidents: Build safety into your jobs (confirmation steps, rate limits, etc.), or run diagnostic actions (such as pulling logs) before any remediation job is triggered. A best practice we’ve seen is to automatically stop deployments during major incidents using automation triggers.
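To make the safety idea a bit more concrete, here is a small Python sketch of a gate that only allows a remediation job when diagnostics have run, someone has confirmed, and a rate limit has not been exceeded. The class name `RemediationGate` and its parameters are hypothetical and not part of any PagerDuty or RBA API; in practice these checks would more likely live in the job definition itself.

```python
import time
from collections import deque


class RemediationGate:
    """Allow a remediation job only when the safety checks pass (illustrative)."""

    def __init__(self, max_runs, window_seconds, clock=time.monotonic):
        self.max_runs = max_runs
        self.window_seconds = window_seconds
        self.clock = clock
        self._runs = deque()  # timestamps of recently allowed runs

    def allow(self, diagnostics_done, confirmed):
        now = self.clock()
        # Forget runs that have fallen out of the rate-limit window.
        while self._runs and now - self._runs[0] >= self.window_seconds:
            self._runs.popleft()
        # Diagnostics and an explicit confirmation come before any remediation.
        if not diagnostics_done or not confirmed:
            return False
        # Refuse once the allowed number of runs in the window is reached.
        if len(self._runs) >= self.max_runs:
            return False
        self._runs.append(now)
        return True
```

For example, `RemediationGate(max_runs=2, window_seconds=600)` would let at most two confirmed remediation runs through in any ten-minute window.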

To close with a tip from experience: categorise incidents as well-understood, partially understood, or new and unique. Doing that helps teams decide which types of alerts should trigger which kind of automation, and which still require manual judgment. That is always a good exercise.
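One way to write that categorisation down is as a simple lookup table. The sketch below is Python, and the mapping from category to automation level is purely an assumed example; each team would choose its own.

```python
from enum import Enum


class IncidentCategory(Enum):
    WELL_UNDERSTOOD = "well-understood"
    PARTIALLY_UNDERSTOOD = "partially understood"
    NEW_AND_UNIQUE = "new and unique"


# Assumed example policy: the more novel the incident, the less is automated.
AUTOMATION_POLICY = {
    IncidentCategory.WELL_UNDERSTOOD: "auto_remediate",
    IncidentCategory.PARTIALLY_UNDERSTOOD: "diagnostics_only",
    IncidentCategory.NEW_AND_UNIQUE: "manual_judgment",
}


def automation_for(category):
    """Return the automation level assigned to an incident category."""
    return AUTOMATION_POLICY[category]
```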

All in all, in this scenario, I would definitely recommend having a chat with us. We're here to help. 

Just let us/me know, 

Regards, Martin. 
mvanson@pagerduty.com
https://www.linkedin.com/in/martinvanson/

