Video AMA: Postmortems and More with J. Paul Reed

ama
postmortems

(Matt Stratton) #1

For this round of PagerDuty AMA, our guest is our good friend J. Paul Reed!

[Photo: J. Paul Reed on stage]

Paul has over fifteen years of experience in the trenches as a build/release engineer, working with such storied companies as VMware, Mozilla, Postbox, Symantec, Salesforce, and Intuit.

In 2012, he founded Release Engineering Approaches, a consultancy incorporating a host of tools and techniques to help organizations “Simply Ship. Every time.” He’s worked across a number of industries, from financial services to cloud-based infrastructure to health care, with teams ranging from 2 to 2,500, on everything from tooling and operational analysis to cultural transformation and business value optimization.

He speaks internationally on release engineering, DevOps, operational complexity, and human factors, and is currently a Master of Science candidate in Human Factors & Systems Safety at Lund University.

How This Works

Post your questions for Paul in this thread - we’ll collect them and record a video of Paul’s answers, which we’ll post online in late February. Questions should be posted no later than Wednesday, February 14.

In addition to your questions about post-mortems in general, we encourage you to interpret “AMA” as “Ask My Advice”!

Here are some example questions to get your creative juices flowing:

  • Tell Paul about how you do post-mortems now, and get expert advice
  • Here are specific challenges we have with post-mortems - how do we do it better?
  • What are some things in the non-tech world that you think could be post-mortems?

(Matt Stratton) #2

Paul,

Recently on Twitter I posited that “your post-mortem isn’t complete until all the follow-up items are closed”. What do you think about this? (Keep in mind that “closed” could mean “later determined to not be necessary.”)


(Paul Burt) #3

Paul,

How do you keep blameless retrospectives truly blameless?

How do you address managers who violate the trust required for a blameless retro? How can they best make things right after such a mistake?


(Bob Ulrich) #4

Paul,

How do you encourage members of the team, specifically within SRE, who weren’t on shift during an incident to become engaged and interested in helping drive outcomes? How do you quantify rewarding engineers who work on projects and improvements to the service driven by postmortems vs. shiny run-state features? Would love to hear your wisdom here!


(Morgan Drake) #5

Paul,

One of the audiences for postmortems is customer-facing teams (CFTs) of varying degrees of technical involvement. Typically postmortem documents and videos are relevant to engineering audiences and less so to CFTs. I’d love to hear ideas and examples of how engineering teams can give CFTs post-incident info that both gives them some context for what they just had to live through and conveys (what is hopefully) a solid path forward for improving service. Have you seen any folks doing this well, or even just going above and beyond “we posted a video of our postmortem meeting”? Thanks!


(Paul Rechsteiner) #6

Hi Paul - a few separate questions for you:

  1. Some organizations put in place SLAs related to postmortems, either in terms of how long after an incident the postmortem report must be complete, or around a timeframe that all identified follow-up actions must be completed by. Have you seen these types of SLAs have an unintended adverse impact on the quality of postmortems, or on the scope & appropriateness of the included follow-up actions? What can organizations do to avoid this unintended consequence?
  2. In your experience, what are one or two key practices that many organizations don’t do, but would help them drastically improve the value they get from their postmortem process?
  3. What are your thoughts on quantitative/trend analysis of postmortem data? What are the key data points that most organizations could benefit from tracking & analyzing?

Thanks for your answers!

Paul


(Rich Adams) #7

There’s been a lot of discussion around naming things. When it comes to post-mortems, there’s also “After-Action Reviews”, “Retrospectives”, “Learning Reviews”, and sometimes folks even get crazy and use “Postmortem” without the hyphen.

I’ve always taken the view that the most important thing is to review what happened and learn from it, and that the name you give to the process doesn’t really matter too much. However, I can certainly see how some people don’t like the grim connotations behind “post-mortem”. How important are names in your opinion? Can the name you give to the process affect how it’s completed and what information becomes part of it?


(Matt Stratton) #8

Thank you to everyone who posted questions! We are going to be recording Paul’s answers and the video will be made available by the end of the month - we’ll post it here, so watch this space!

This should be a great AMA!


(Matt Stratton) #9

I’m happy to share with you the video responses from Paul!

Thank you to everyone who participated!


(Matt Stratton) #10

J. Paul Reed: [Laughs] It sounds like there’s a certain manager that Paul may have in mind that… you know, for a friend, asking for a friend. Right?

Eric Sigler: Yeah. Yeah.

J. Paul Reed: So, there’s a couple of different conversations that are happening right now about blameless. One of them is this idea of can you have – can you be blameless? And that goes to something that I’ve written a couple blog posts on – and maybe we can link to them – about the distinction between blameless and blame-aware, and the idea that in sort of cognitive psychology we actually use blame to get rid of pain and discomfort. And those are words from a sociologist; her name is Brené Brown.

And so, we’re actually hardwired to blame as a way to sort of dissipate this kind of emotional heat or badness. Sometimes that can actually mean blaming ourselves. How many retros have we all been in where somebody says, “I fat fingered it. It’s all me. Can we just move on?” They’re blaming themselves and they –

Eric Sigler: And it’s a release mechanism that –

J. Paul Reed: Exactly. Exactly. It’s like “Mea culpa. Let’s all move on.” When we say “blameless,” are we talking about that sort of discussion? It’s like, can we do truly blameless? And the answer there is I don’t know that you actually can because it’s kind of hardwired into the way we’re thinking.

But there’s also a separate discussion happening about – when we talk about blameless and the idea of blameless versus sanctionless.

And this is something that – a conversation that John Allspaw actually started with the Stella Report, and this idea that when we say “blameless” we actually technically mean “sanctionless.” And so, the distinction there is if you have a picnic and a storm comes and it rains and it ruins the picnic, we would blame the weather. And that’s a cause – not a root cause, but a cause attribution. The reason why the picnic was ruined, I’m blaming the cloud –

Eric Sigler: The weather.

J. Paul Reed: Right. But I’m not going to sanction the weather. That’s not a thing I can really do.

And so, a lot of times organizations, even though they say they’re doing blameless, they’re actually coming up with a causal chain of some sort. Maybe lots of different factors – I’m not necessarily – again, not talking root cause specifically. But they’re coming up with things to blame. But they’re trying to say, “We’re not going to issue sanctions to the people that did the work.”

And it’s interesting that part of the question is “How do you address managers that violate the trust?” We’re kind of required to do that. So, that is one of the things that’s really kind of actually difficult because it’s so easy to mess up and not get that opportunity back. And it’s not just leaders or managers that do this. I’ll give you an example. I was shadowing somebody that was doing a retrospective, and in the middle of their retro one of the engineers blurted out, “That’s the dumbest idea I’ve ever heard.” And that is a teachable moment where the person facilitating that, whether it’s a manager or not, can step in and say, “This is not about that. And that’s not the kind of language we want to be using.” And then, additionally, actually, I recommend talking to that person after the fact and saying – it’s not a reprimanding thing, but also “Is there a reason that you said that? Is there something you can help me understand why you had that reaction to that?”

And so, I was talking to the facilitator afterwards and I was like, “You do realize that because of that, the message the entire team got is that it’s okay to call somebody else’s ideas stupid in public, in a forum that we kind of said was safe because we opened it that way.”

I hear this a lot where it’s either historical reasons or it’s not the culture and they’re trying to work towards that and there’s a lot of just baggage, so that behavior has been allowed for a long time. So, it can be really hard.

And what I see, the main way to address that is you start with the smallest group of people that feel that they have trust. So, sometimes I’ve actually seen where organizations do multiple levels of retros. And so, maybe there’s a big incident and three teams were involved. All the teams will do retros, which can be a safer space than when they all get together and then do the retro. One thing you can do, and I’ve seen it start with two people, they’ll do – the two people that were involved in an incident will do a retro in a way that they can trust each other and talk about it, and maybe that becomes three, four, then maybe the whole team.

Some managers – this is interesting – will self-select out of the process. So, they’ll be like “The team has the retro but I’m not there” because of that perception.

Eric Sigler: Yeah, the power dynamics of having someone who’s your manager and in control of your salary and your raises also hearing things that may not be entirely positive about you.

J. Paul Reed: Right. Or that they did not know that’s what the work actually looks like. After such a mistake it’s really hard to fix that, because it basically requires going to the team and saying, “Here’s the mistake that I made. Here’s why, as a team, I don’t want us to be doing that. And here’s what I’m going to do differently.” And then being very deliberate about doing that differently, so you don’t just say it and then not follow through on it. Through a combination of factors that’s not always going to be possible.

So, actually, what often happens is you get your standard sort of organizational change dynamic going on. And what I mean by that is that person is just not involved, and either they eventually leave and there’s a new manager that does it differently, or – I’ve seen this too – where there’s low trust, they have someone else facilitate. They hire a facilitator. And then, what happens is the manager that we’re talking about isn’t involved, but you still start to promote the right behaviors, or the behaviors you want, because it’s just someone else.

Eric Sigler: One of the questions I sometimes deal with is when I’m talking with folks who are not the actual subject matter experts doing the work, or the line managers who have a team of subject matter experts, but folks that are a little higher up in the organization – there’s still a lot of disconnect. “What do you mean it’s blameless? Somebody did this thing.” What differences in approach would you take for somebody who’s really, really high up – a director, a CIO, or a CEO – to explain “Here’s what blameless is”?

J. Paul Reed: At that level we talk a lot about accountability. And it’s kind of – in a good way it’s “I need to hold my people accountable;” in a bad way it’s “Heads need to roll for this.” One of the biggest things that I think it’s important to talk about at that level is what business outcomes are you interested in, and is firing someone going to get you that outcome?

One of the other things that is interesting, to help people at that level with kind of examining their own organization, is what the work actually looks like. And you see this in DevOps a lot: walking the gemba, from the Toyota Production System. Walking the gemba is executives and high-level people walking the hallways, seeing people do the work, seeing the messy realities of the situations that decisions they make four or five levels removed actually put those people in – sometimes the tradeoffs they’re forcing those people to make, potentially without a lot of guidance. Senior leaders, it’s not like complexity or complex situations are new to them.

And that’s 30, 40, 50 years of kind of theory on that. So, it’s not like it’s new. But what can often be new is what is the actual reality of – if you go and look at things? I think that’s a good way to kind of have that conversation.

The other thing I’ve noticed sometimes: The pattern to blame is really ingrained. And remember, I said it’s a cognitive thing for ourselves. But also, a lot of our culture, the stories we tell, we tell linear stories. Movies are for the most part linear stories and there’s a protagonist and an antagonist. And so, we’re wired to sort of – that’s how we understand what happened. And the thing about that is if you’re talking about a huge multinational corporation – or, remember when the airlines had all those problems with their IT systems going down?

Eric Sigler: Yep. Yep.

J. Paul Reed: It’s really hard for a CIO to tell a story to the public, “Yeah, the board has been underinvesting in IT for 20 years, and I showed up 3 years ago and they expected me to fix it and they didn’t. They haven’t done that. That’s not been a priority for the company, and this is what happens.” You don’t see CIOs saying that publicly.

We talk a lot about empathy at kind of all the levels of the stack. A lot of times, when you see that it’s like it’s really hard to do it any other way. And that’s kind of the important part when you get someone at that kind of senior leadership level that does start to think about “Well, would I get the better business outcome that I want if I thought about it differently?”


(Matt Stratton) #11

J. Paul Reed: So, I think one of the things it would be interesting to ask Bob a little more about is: if you’re on an SRE team and the engineers aren’t interested in the results because they weren’t on shift, are they ever on call at all? Because you can set up a thing where it’s like “Yeah, I’m never on call so I wouldn’t ever care about this” – and then you get this weird mismatch. There’s an impedance mismatch.

The thing that I would say is if everybody is on call, if that’s a rotating duty and everybody is on call in some form or another doing something, one of the things you can point out is that there are so many stories that get told. You’ll hear things like “Oh, yeah, I remember that. I had to deal with that two on call sessions ago but I did something different.” Those sorts of conversations generate the tribal knowledge and sort of the institutional organizational knowledge.

And the second part of that question, working on projects and improvements to the service driven by postmortems – I was reminded of a post, and maybe we can link to this, that’s actually a couple of years old now, called “All Hail the Maintainers.” And it was talking about Silicon Valley but also the American and world economy, that we love in technology and in Silicon Valley to rip entire things out. It’s like, “Oh, you have a functioning Docker cluster. Let’s Kubernetes all of it and rip everything out.” And the thing is, if we do that every two years, then those features become worthless, because you invested in it and then you ripped all of it out. And it also impacts your institutional knowledge because you have a whole new system.

Does that mean you should never replace anything? No. But there is a question about do we decide to maintain this thing and improve it versus wholesale rip it out?

Eric Sigler: Is your ops team able to quantify the impact of maintenance work?

J. Paul Reed: Right.

Eric Sigler: Right? Versus, okay, we can – new features tend to have dollars associated with them from the customer, whereas maintenance work, well that’s just an FTE doing something that isn’t a new feature.

J. Paul Reed: Right. It’s kind of funny. You want a hat that says, “Make maintenance sexy again” or “great again.” “Make maintenance great again.” There was just talk about the infrastructure proposal, the budget proposal, and one of the metrics they were talking about was everybody’s commute. Nobody wants to pay for roads but they want a better commute. Or pay for BART, but they want a better commute.

I mean, that’s one of those things that’s like – a lot of times teams talk about “We build the paved highways but then we let them fall apart.” It is that sort of we need to change how we think about –

Eric Sigler: How we think about those metrics.

J. Paul Reed: – and how we fund it. And you see a lot of those conversations in DevOps going on too.


(Matt Stratton) #12

J. Paul Reed: We all know this. If you’ve ever done agile or watched it, one of the big debates that comes up is “What’s done?” And then you get teams doing “Well, is it done done?” or –

Eric Sigler: Is it the definition of done?

J. Paul Reed: – “Is it done done done?” All of these kind of –

Eric Sigler: Ready for review. In review.

J. Paul Reed: Right. And so, people tend to play games with “closed.” I pretty much guarantee that if you go and look at a postmortem for any incident that is medium to large – so, I’m talking an incident where there were maybe multiple teams involved, usually more than one team, and the business is actually paying attention. So, oftentimes it comes up in that context. In most – in that type of incident you never close all of them. And by the way, if you measure that, you’ll find people just don’t say in the retro what they think really should happen.

The example that I’m thinking of is if you are really doing kind of comprehensive retros and looking at the entire system, a lot of things you just can’t – they’re not reasonable to fix in that context.

Eric Sigler: “Let’s rewrite everything in this other language.”

J. Paul Reed: Right. Or, I’ll give you an example. There was a major, major, major, major outage – and I can’t tell you who and I can’t tell you what but it was in the papers. One of the action items was "Well, the root cause of this is a setting on all of our servers that were provisioned in this data center five years ago. So, to fix it we would have to turn all of them off, reformat all of them, restore them, and bring them back up.

And by the way, we are in an 18-month plan to get out of all data centers and move to the cloud." So, the answer to that is “Yes, we know this is bad and the solution is literally not necessarily something the business would want us to do or that we’re ever going to do.”

So, that is one of the things: in a healthy organization you can actually point that out and say, “Yes, we’re literally not going to do that.”

Eric Sigler: And that goes back to what Matt was saying in his question, which is “later determined not to be necessary.” Is that an acceptable kind of way to close an action item?

J. Paul Reed: I think that’s sort of organizational. Is it okay to actually say it’s not necessary, or to say “We’re not going to solve it as a retrospective remediation item, but as part of a plan that addresses these five things holistically, and this is one of them”? I think kind of implicit in Matt’s question is, again, this difference between the way that leadership thinks the work gets done and the way that we in the trenches actually do the work.

An experiment you can run with your own teams if you’re curious about this: if you’ve got a sufficiently complicated/complex incident, at the end of that retro get out sticky notes. Have everybody, by themselves and not talking to anyone else, pick the top three items that they think are important, or that they think are going to get fixed, or are actually things that they’re going to do. Then, rank those. Add them all together and rank what the team – what most people on the team – are thinking first, second, third. And I would look at the top five and say that’s what you should really care about. Out of the 15 items you came up with, look at the 5 and drive those to completion. And the rest of them, if the work’s not getting done, the system – by which I mean the team – is effectively telling you one of two things, or maybe both of these things. Thing one is “Literally, none of us think that’s important enough to do in the context of the other stuff we have planned or the other work that we have.”

Or, number two, “The business is putting so much pressure on us to do something else that we’re not going to do the remediation. We literally have no bandwidth to do it.” That’s a feel-good statement. That’s what we would like to say. But I don’t think in practice it’s ever true.

And that framing of it, even though it kind of feels good maybe to be like “All action items must be completed,” it’s also not realistic. And again, if you do that experiment where you stack-rank them, go back and look yourself two weeks, four weeks, six weeks out: Which of those tickets got closed, and when? Did what you all thought was important actually end up getting fixed? And that can be a measure of: the team did say it’s important, but they don’t have the bandwidth to fix it. It wasn’t “The solution was to replace all of our data centers with the cloud and we’re not going to do that, so ho ho, that’s a funny remediation item.” It’s not that.

Eric Sigler: No, it was –

J. Paul Reed: It’s “We thought this was important and we couldn’t do it. We literally couldn’t get it done for some reason.”
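
As a rough illustration of the stack-ranking experiment Paul describes above, here is a minimal Python sketch. The item names, the 3-2-1 scoring, and the prioritize helper are illustrative assumptions for this post, not anything Paul prescribes:

from collections import Counter

def prioritize(ballots, top_n=5):
    # ballots: one list per person, ordered best-first, at most three items each.
    # Scoring is a simple Borda-style tally: 1st choice = 3 points, 2nd = 2, 3rd = 1.
    scores = Counter()
    for ballot in ballots:
        for position, item in enumerate(ballot[:3]):
            scores[item] += 3 - position
    return [item for item, _ in scores.most_common(top_n)]

# Hypothetical example: four engineers privately rank the retro's action items.
ballots = [
    ["add circuit breaker", "update runbook", "tune alerts"],
    ["update runbook", "add circuit breaker", "capacity review"],
    ["add circuit breaker", "tune alerts", "update runbook"],
    ["capacity review", "update runbook", "add circuit breaker"],
]
print(prioritize(ballots))  # the handful of items the team actually believes in

Two, four, six weeks later, compare that top five against which tickets actually closed - the gap between them is the signal Paul is pointing at.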


(Matt Stratton) #13

J. Paul Reed: Yeah, so that’s really kind of the hard thing about SLAs, especially in the context of postmortems. They tend to incentivize behaviors that you probably actually don’t want. So, in the context of remediation items, if you have an SLA on when they have to be completed by, people are just going to be like, “Well, maybe there’s only one or two remediation items,” when in reality there might be a rich set of items that you really want to look at.

So, I did a presentation at Velocity a number of years ago called “The Tyranny of the SLA” – it was talking about SLAs in a more traditional context, but I think a lot of that is applicable here. SLAs are problematic, actually, for lots of reasons, because they’re a measurement and they’re incentivizing behaviors. They’re probably not incentivizing the behaviors you actually want. And this goes back to that conversation we were having, like is that the business outcome you really want?

Specifically, to answer his question, the sort of commitment to do a postmortem within a certain time frame, I actually think that’s super important. In my experience working with teams, what you generally find from a quality of the output perspective, postmortems done more than 72 hours after the incident, you might as well not really do them at all. There’s just too much going on, too much cognitive bias that kind of seeps in after the fact and it just makes the quality of the data that you get out of it not as great.

Eric Sigler: Now, to be very clear, do you mean the gathering of the information or do you mean the understanding of the information?

J. Paul Reed: I’m talking specifically about what most organizations consider a retro. And this takes us down a rabbit hole. We could do a whole AMA actually on that, like how do you actually structure it and what’s the practice. But a lot of times an incident happens and then somebody schedules a retro and nobody thinks about it until they show up. And then, part of that is – or, they might show up and say, “This is the collected timeline. Let’s kind of go through it.” Whatever, right? And I’m saying if a lot of that sort of sensemaking that you do happens 72 hours after the fact, it’s pointless. It’s worthless. And the reason I say that is because there’s a bunch of cognitive bias that we’re going to succumb to that has to do with recency bias and other things. And so, what will happen is, especially in those retros where you’re like, “Oh, we know somebody’s going to bring us the timeline and we’ll still be able to make sense of it,” you will take that data – your brain literally does not know, it doesn’t remember what you were thinking in the moment, but it will fill in kind of what you were recently thinking about. And this is a well-documented memory bias in humans.

You get the most value when you capture that rich, rich context about “Why did I do that?” or “What was my thinking in the moment?” And you lose all of that after 72 hours.

So, it’s not so much you should have an SLA. It’s not you should do retros within 72 hours because it’s an SLA to the business. It’s more like you’re just wasting everybody’s time if you don’t do it. You’re just going to lose all of that rich context, which is really what you’re mining for when you’re trying to figure out tribal knowledge, institutional knowledge, and also a rich understanding of what happened for everyone. You’re just going to lose that. So, you should commit, have a commitment within the teams to do a retro within 72 hours because of that.

Eric Sigler: And that commitment should also reflect into the broader organization, like we were talking about earlier with the alignments, where the broader organization needs to give that team the room to do that commitment.

J. Paul Reed: Yep. Exactly. Exactly. And again, it’s really about – it’s just wasting – it’s a waste of the organization’s time if you don’t do it because it’s kind of – I’ve seen retros with 30 high level people in them retro-ing something 3 weeks after the fact, and it’s – they’re literally wasting millions of dollars.

Eric Sigler: They’re telling a story at that point. They’re not necessarily remembering what actually happened.

J. Paul Reed: Exactly. And that was SVPs in that meeting, so you literally are spending probably half a million for that meeting.

Now, the other question, around the time frame that all identified follow-up actions must be completed by: the best approach that I’ve seen is there’s a sort of centralized team that facilitates the retros and helps track the action items. But it really is this idea of team responsibility – the idea is that if the teams decide “We had these four items and it’s just not important to us this quarter to do that,” it’s them making that decision.

Now, to make that work, they have to be responsible for when it goes down.

They’re getting the PagerDuty page, not somebody else, otherwise that’s just a –

Eric Sigler: You miss that alignment. Right? That alignment of –

J. Paul Reed: Yeah, and that’s a toxic, broken system because it’s like “My behavior has no impact on you getting up at 3:00 AM.” So, Mark Imbriaco or somebody else said something like “SLAs are there to manage broken relationships.” So, the thing is, if you’re putting an SLA in there that says “Thou shalt have all action items done,” (a) that implies there’s a broken relationship somewhere, and (b) people are going to game the hell out of it.


(Matt Stratton) #14

J. Paul Reed: The hard part about that is if you buy what I said about the value coming from tribal knowledge and institutional knowledge, then you’re asking, “Okay, how do you put a quantity on that really kind of qualitative aspect of things?” We all understand metrics and how that drives –

Eric Sigler: It’s not a viable issue…

J. Paul Reed: And that it drives business decisions. And it’s not always bad. It’s like you have to prove to the business: “Maybe we can’t run four data centers with two people because we have all these outages.” I understand the need for us to do that. It’s also something you’ve got to be deliberate about because it’s super simple – to the other questions we’ve had – to incentivize the wrong behavior by measuring the wrong thing. And John Allspaw has been talking about this on Twitter a little bit, about is MTT – are the MTTs – so, MTTD and -R –

Eric Sigler: Mean time to knowledge, resolve, remediate –

J. Paul Reed: Are those worth anything? And I think he is positive on Twitter that they aren’t, or that more specifically I think the quote was “Measuring those numbers to understand your next incident is like measuring the dimensions of food to understand how it’s going to taste.” It’s kind of like saying, well, some of the values that we collect – a lot like MTTR – it’s sort of saying, “Why do that? It’s worthless.”

And I think John’s opinions on things are always very three-dimensional and nuanced, so I don’t think it’s that those numbers don’t matter.

Eric Sigler: Yeah. Yeah.

J. Paul Reed: That said, I agree with him that MTTD is not going to tell you what your next incident is going to look like. But some of those numbers, for doing some kind of trend analysis, do give you sort of feedback from the system on “Are we spreading that institutional, ethereal knowledge more effectively?” For instance, if the problem is detected by the call center and Customer Support then escalates, in those types of incidents are we seeing what we call the MTTD going down, specifically because we worked on that relationship?

So, there’s some trend analysis. Again, you have to be really careful – you shouldn’t tie bonuses and stuff like that to it. And so, I agree that you’re not going to understand your next incident by measuring MTTD. You’re not going to understand what your food tastes like by measuring the dimensions of it. That said, if you eat onion rings and fries and a double cheeseburger with bacon every meal for a week, you intrinsically know that’s not going to feel really good. And maybe when we stand up from the couch and feel dizzy at the end of that week, we might connect “I’ve been making a lot of decisions that maybe aren’t the steamed rice or steamed broccoli and quinoa. Maybe there’s a connection there.” And so, that’s kind of where those numbers can help you. Again, you don’t want to directly use them to incentivize behavior, but they can give you interesting feedback about “Are we as a team getting better at remediating because the on-call person is engaging the right application team?”

I think one of the things that John is trying to point out is that – and I think this is important – looking at those numbers doesn’t account for black swan events.

Eric Sigler: Nope.

J. Paul Reed: And so, you can’t infer that our next incident won’t be a Chernobyl just because our MTTD numbers have been going down. The one thing that you can say is that hopefully you will have the right people and the right gear to respond to that. And the thing that argues – and this is what I think is super important – the thing that argues to the business that you should invest in that is seeing those MTTD, MTTR go down because then you can say, “Oh, we used to be here and you got me a couple new headcount” – or whatever it might be – or “We got time and our budget to do these tooling – this extra tooling.” I can see that number go down, so let’s do that – let’s do more of that. That’s where I think it can be useful.
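
To make that trend-analysis idea concrete, here is a minimal sketch, assuming you keep per-incident records of how each incident was detected and how many minutes detection took. The record layout, field names, and numbers below are made up for illustration, not a real schema:

from collections import defaultdict
from statistics import mean

# (quarter, detection source, minutes until the incident was detected)
incidents = [
    ("Q1", "customer support", 42),
    ("Q1", "monitoring", 6),
    ("Q2", "customer support", 31),
    ("Q2", "monitoring", 5),
    ("Q3", "customer support", 18),
    ("Q3", "monitoring", 7),
]

mttd = defaultdict(list)
for quarter, source, minutes in incidents:
    mttd[(source, quarter)].append(minutes)

# Mean time to detect, per detection source, per quarter - a trend line, not a target.
for (source, quarter), samples in sorted(mttd.items()):
    print(f"{source:16s} {quarter}: {mean(samples):.1f} min across {len(samples)} incident(s)")

If the customer-support-detected MTTD drifts down quarter over quarter after you invest in that relationship, that is the kind of feedback Paul describes: useful for arguing for headcount or tooling, not for setting bonuses.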


#15

I loved your talk at the Sensu conference in Portland! Is that video available yet?