Here we will answer the questions from the on-call webinar.
Question 1: Do you have any ideas about measuring metrics around incident response quality, such as how well stakeholders were kept informed?
Answer 1: We recommend regularly reviewing the process as part of the postmortem. You should also check whether people are subscribing as stakeholders; you can proactively subscribe stakeholders in your organization as well. Check out this info for more details:
Question 2: We try to limit who has write access to our production resources. How do you handle permissions for on-call teams? Do on-call team members have permissions all the time or do you use a tool like LastPass to add access when a shift starts and remove it when it ends?
Answer 2: You can use transient permissions, but you can also use a tool like PagerDuty Rundeck to provide a more permanent solution for common tasks and controls. For centralized services like logging, monitoring, and observability, the whole team should have access all of the time.
Transient permissions are challenging in that if a primary responder needs to escalate to another team member, that person might not have all the access they need to respond. So if minimizing resolution time is part of your goals, finding a permanent way to provide privileges will work much better.
Question 3: Should an SRE team or NOC be the responder or first responder? Or, instead, should the responder be a developer or someone more familiar with the service?
Answer 3: This really depends on how the organization is set up. At PagerDuty we recommend the Full-Service Ownership model, where the people responsible for the service or application are on call as the first responders, because they are the most knowledgeable. Check out ownership.pagerduty.com for more information.
A3: Yep. For errors and alerts from your application code, the app developers are going to be most useful for the first diagnosis. If they find the issue is related to the platform or another service, Never Hesitate to Escalate. But the app devs will be most knowledgeable about which errors really apply to the application and its runtime. Things like garbage collection, timeouts to dependencies, and errors processing data to/from the data store might need cross-team responders, but the app teams will know best exactly what the issue is versus an SRE or NOC team.
Question 4: Sleep time alerts… is this a feature in PagerDuty or a process?
Answer 4: You can use intelligent dashboards in the analytics tab to measure a variety of things, including total interruptions and interruption volume during off hours and sleep hours. Check out this knowledge base article to see all the options for analytics: https://support.pagerduty.com/docs/intelligent-dashboards
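If you want to sanity-check this kind of breakdown outside the dashboards, the idea can be sketched in a few lines of Python. This is only an illustration, not how PagerDuty computes its analytics: the hour boundaries (sleep 10 pm to 8 am, business 8 am to 6 pm on weekdays, everything else off hours) and the function names `classify_interruption` and `interruption_volume` are assumptions made up for this example, and the timestamps are expected to already be in the responder's local time.

```python
from collections import Counter
from datetime import datetime

# Assumed boundaries in the responder's local time; adjust to your own definitions.
SLEEP_START = 22   # 10 pm
SLEEP_END = 8      # 8 am
BUSINESS_END = 18  # 6 pm


def classify_interruption(ts: datetime) -> str:
    """Return 'sleep', 'business', or 'off' for a local-time timestamp."""
    hour = ts.hour
    # Sleep hours wrap around midnight, so check both sides.
    if hour >= SLEEP_START or hour < SLEEP_END:
        return "sleep"
    is_weekday = ts.weekday() < 5  # Monday=0 ... Friday=4
    if is_weekday and hour < BUSINESS_END:
        return "business"
    # Weekday evenings and weekend daytime.
    return "off"


def interruption_volume(timestamps):
    """Count interruptions per category for an iterable of datetimes."""
    return Counter(classify_interruption(ts) for ts in timestamps)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any real account.
    sample = [
        datetime(2024, 1, 8, 3, 0),    # Monday, early morning
        datetime(2024, 1, 8, 10, 0),   # Monday, mid-morning
        datetime(2024, 1, 8, 19, 0),   # Monday, evening
        datetime(2024, 1, 13, 10, 0),  # Saturday, mid-morning
    ]
    print(interruption_volume(sample))
```

Tracking the sleep-hour count per responder over time is one way to tell whether an on-call rotation is getting healthier or worse.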
I received an email that supposedly had a link to the webinar recording. However, each time I click the link in the email, the resulting PagerDuty website just says:
Thanks for your interest!
I have the same issue.