Podcast

Mastering DORA metrics with Google's Nathen Harvey

This episode is a deep-dive on DORA. Nathen Harvey, who leads DORA at Google, explains what DORA is, how it has evolved in recent years, the common challenges companies face as they adopt DORA metrics, and where the program may be heading in the future.

Timestamps

  • (1:48) What DORA is today and how it exists within Google
  • (3:37) The vision for Google and DORA coming together
  • (5:20) How the DORA research program works
  • (7:53) Who participates in the DORA survey
  • (9:28) How the industry benchmarks are identified 
  • (11:05) How the reports have evolved over recent years
  • (13:55) How reliability is measured 
  • (15:19) Why the 2022 report didn’t have an Elite category
  • (17:11) The new Slowing, Flowing, and Retiring clusters
  • (19:25) How to think about applying the benchmarks
  • (20:45) Challenges with how DORA metrics are used
  • (24:02) Why comparing teams’ DORA metrics is an antipattern 
  • (26:18) Why ‘industry’ doesn’t matter when comparing organizations to benchmarks 
  • (29:32) Moving beyond DORA metrics to optimize organizational performance 
  • (30:56) Defining different DORA metrics
  • (36:27) Measuring deployment frequency at the team level, not the organizational level
  • (38:29) The capabilities: there’s more to DORA than the four metrics 
  • (43:09) How DORA and SPACE are related
  • (47:58) DORA’s capabilities assessment tool 
  • (49:26) Where DORA is heading

Listen to this episode on Spotify, Apple Podcasts, Google Podcasts, Overcast, or wherever you listen to podcasts.

Transcript

Abi: Nathen, thanks so much for sitting down with me today. Really excited to chat.

Nathen: Thanks so much for having me, Abi. I’m excited to be here.

Abi: I want to start off by talking about what DORA is today and how it exists within Google. Everyone listening to this is probably familiar with the DORA metrics, but I think far fewer people are actually familiar with what DORA the company was both before and now at Google. So could you give us a little bit of background on what DORA is and how it exists within Google today?

Nathen: Yeah, absolutely. So DORA is of course the DevOps Research and Assessment. It’s a long-running research program that looks into how do teams get better at delivering and operating software.

Now, I would recommend that all of your listeners pause this recording and go back and listen to the interview you had with Nicole Forsgren, because she really is and was the foundation of DORA. It was her research, Dr. Nicole Forsgren’s research, that really led to the founding of a company that was called DORA.

So she has in her episode a lot of the backstory, but I can help pick up the latter half of this story. So at the end of 2018 I believe, Google Cloud acquired Nicole’s company, which was called DORA. This is one of the most computer science appropriate things. We use the word DORA to mean eight different things, right?

DORA is a company or was a company. It’s a research program. It’s a report, it’s an assessment tool, like DORA, DORA, DORA. It’s all over the place. I mean, she’s also an explorer as it turns out.

So Google Cloud acquired DORA at the end of 2018. And interestingly and importantly, Google Cloud has truly maintained its commitment to running a platform and program agnostic research study that seeks to answer that question, how do teams get better at delivering and operating software?

So Nicole has since left Google Cloud, but there are other researchers that have picked up this work. So the lead researcher is another PhD known as Dustin Smith. He’s known as Dustin Smith because his name is Dustin Smith. That’s why we all know him that way. So Dustin together with a couple of other researchers are really continuing the research program moving it forward.

Abi: That’s great to hear. And I’m curious. When DORA joined Google, was Google itself looking at its own DevOps operational performance? Or what was really the inspiration or the joint vision around DORA and Google coming together?

Nathen: I would say that for almost forever, since the inception of Google, Google has been looking at how do we help our teams get better. And in fact, there are deep research studies and teams dedicated to exactly that. How do we help engineers within Google get better at doing engineering for Google?

So I think when Google Cloud acquired DORA… So first I should be very clear in saying I wasn’t part of those discussions. I didn’t make it happen. I wasn’t even in the room when those decisions were made. But I think a big part of it was twofold.

One, I think that potentially, we could use some of the DORA insights within our own internal teams, and we could bring some researchers together and add to our capabilities there.

But then probably more importantly, we can help understand and help our cloud customers get better at delivering and operating software. And at the end of the day, I think that this is really about how do we take the entire industry and lift up our collective capability. And Google Cloud knows that if we help a team get better at delivering and operating software, and that team happens to be delivering and operating their software onto Google Cloud, they’re going to use more Google Cloud. So I think there’s a lot of mutual benefit in that.

Abi: As you explained, I think the synergy is really clear. People listening to this, again, most of them have definitely heard of the DORA metrics. And a lot of them are also familiar with the research program and especially the annual reports that you put out. So I’d love to dive into that a little bit, and give listeners a view into how the sausage is made. How does the research program actually work? How is the survey conducted, analyzed, iterated upon? I’d love to just go through it step by step.

Nathen: It really is sort of a labor of love and a true annual investment that we make. So right now, we’re the beginning of January of 2023. We are sort of kicking off together with the researchers, what is the design of this year’s report going to look like? And of course we don’t actually start with the report. But instead, we start with what are we going to investigate? What sort of questions do we want to ask?

And every year when we do this, and this was true in the past as well, is we kind of sit down and look at the core principles for the DORA research program. So we know that we have four key software delivery metrics. Of course, we’re going to ask questions about those. We have capabilities that we know drive those metrics. But sometimes, we introduce new capabilities that we might want to explore.

So at this time of year, what we’re doing is really thinking about, “All right, what are the outcomes that teams are really looking for? And let’s come up with some hypotheses about which capabilities drive towards those outcomes.” And we’ll whittle down that list because of course, we want to ask about everything. But that’s not really feasible.

So we whittle down that list to a good set or what we believe to be a good set, and then we start building and iterating on the specific questions that we’re going to ask. And this to me is always the thing that I learned a lot from Nicole and continue to learn a lot from the researchers. The way that you ask the questions and the way that you pose them to your survey participants is so, so important.

Simple things like you have to make sure that the question isn’t asking two things at the same time. Abi, do you like the colors orange and blue? Well, if you say yes or no to that, the answer is basically useless. So we need to make sure that we’re asking one thing at a time in each question.

But then also, sometimes a survey respondent, a participant in the survey might misinterpret or have a different understanding of the words that we used. So sometimes we have to ask seemingly the same question a couple of times, but phrase it different ways, so that we can make sure that we’re getting as close to the accurate answer as possible. So yeah, that’s the first phase is designing this survey. I think you were going to say something though.

Abi: Yeah. I was going to ask, what’s the population of people who actually take this survey? Is it primarily Google customers? Is it the general public, AKA, the industry? Who is it that you’re getting this data from?

Nathen: Yeah, that’s another really good question. And it’s super important to us that we have as many diverse participants in the survey as possible. And so it truly is an industry-wide survey. Excuse me. We open it up to the entire industry.

And usually what we do is, the technical term is snowball sampling. So we’ll go out on Twitter, we’ll post on LinkedIn. We’ll write up blog posts encouraging folks from across the industry to participate in this, because we really want a broad collection of insights from industry verticals. Whether it’s financial services, or technical companies, or retail companies, etc. And people from all different roles within those companies, all different levels of experience. So it is definitely not, just so we’re very clear, it is not an internal Google survey. Although Googlers are encouraged to participate in the public survey, we really want everyone to come along.

Abi: One of the things that’s so important about what you were talking about is that DORA is not a one-time snapshot of the industry. It’s an ongoing, iterative investigative process. And we’ll talk more about how DORA’s evolved recently in a little bit.

But I want to go back to you were giving great advice on the importance of well-crafted survey questions and survey design. One of the questions I have is around the benchmarks that are derived from the results. I know that the survey questions you guys use tend to be sort of non-ordinal scales, like multiple choice ranges. For example, for lead time between one day and a week, those types of things. How are you guys using that data to then derive these very fine grain industry benchmarks?

Nathen: Yeah. Essentially what we do is when we think about the software delivery metrics in particular… So just as a quick recap, there are four of those. Two for throughput, which are deployment frequency and your lead time for changes. And two for stability, which are your change failure rate (when you push to production, how frequently does it break something?) and your time to restore service when there is an incident or an outage.

So essentially what we do, we do ask a multi-choice question. “What does your lead time look like? How fast do you recover?” We take all of those answers and we do a cluster analysis with them. And so essentially, I’m not a researcher just so we’re clear. So I’ll describe it in layperson’s terms. We throw down all those answers onto a giant plot. And if you step back and look at it, you could see clusters emerging of where these people or these teams are answering these questions. So that essentially is how we do that cluster analysis.

And then we look at within this cluster, let’s say we’ll take lead time for changes as an example. We’ll look at the lead time for change cluster that emerged, and we’ll kind of find the median answer within that cluster. And that’s what we use. And it is important that it’s not the mean. It’s not the average of the answer, but it is the median of the answer across all of our survey participants there.
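Nathen’s median-over-mean point can be sketched in a few lines. The values below are made up for illustration, standing in for one cluster’s coded survey answers; they are not real DORA data:

```python
from statistics import mean, median

# Hypothetical "lead time for changes" answers within one cluster,
# coded as approximate days. One team reports an extreme value.
cluster_lead_times = [1, 1, 2, 2, 3, 3, 4, 90]

print(mean(cluster_lead_times))    # 13.25 -- dragged upward by the outlier
print(median(cluster_lead_times))  # 2.5  -- a robust summary of the cluster
```

The median describes the typical team in the cluster even when a few answers are extreme, which is why it makes a better benchmark than the average here.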

Abi: Thanks for that explanation. And you brought up an overview of the DORA metrics themselves, and then started talking about clustering, which I think is such a good segue into what I want to talk about next. Which is, it’s been really interesting to observe the ways that DORA has been evolving over the past couple of years. So I want to ask you about a few specific examples of this, and then also let you share anything else that you think is notable.

One of the things I thought was interesting was that in 2021, in your report, you added a fifth metric to those original four you described, which was reliability. So I’d love to ask why was this added to DORA?

Nathen: Yeah, and I don’t mean to be pedantic, but I am going to correct you. Because actually in 2018, we introduced, and by we, I mean the collective we, we introduced another metric called availability. And so really, those first four metrics I described are really measuring software delivery performance.

But we realized that we call this DevOps, and DevOps isn’t just about delivering software. But also there’s the other half of that word ops, operating that software. And so how do we measure whether or not the services that we’re running are working well? And we really wanted to look at that.

And so in 2018, a measure was introduced called availability. And availability, in the research and as we describe it, is not the same as uptime, right? Is the service available? But really availability is asking a more nuanced question. Are you able to meet the promises that you’ve made to your customers with this service? As you can see, that’s a much more nuanced and difficult question to answer than is the service up or down, which is kind of a binary answer.

But over the years, in 2021, we actually changed from availability to reliability as the word that we use. In part to reflect that nuance, that availability isn’t just uptime. But also in part because by this time, the SRE, the site reliability engineering community, had really found its feet and was growing across the world. And there were lots of people that were asking this question, “Which is better? DevOps or SRE? And let’s get them to fight with one another.” Which if you look at the practices and the foundation of DevOps and SRE, neither one of them is going to fight with the other. They want to embrace each other, I think.

But what we didn’t have was any data or real science yet to show that reliability practices as we understand them as a community, are they actually helpful? Let’s dig into that. And so that was really the genesis for why we wanted to bring in availability, which then became reliability. And I’ll be honest. It’s much more difficult and much more contextual to actually specify, is our service reliable?

Abi: How are you measuring that or gathering signals about that?

Nathen: Well, I’ll tell you the thing that we’re not doing first. We’re not going in our survey and saying, “Are you practicing SRE?” Because again, you keep using that word; it doesn’t mean the same thing that I think it means. So we’re not communicating well. Instead, we want to break that down into particular practices or things that you might do.

So for example, we might ask, “Are you monitoring for user facing symptoms?” We don’t really care if you’re monitoring what’s your CPU load, but we do care if you’re monitoring, can your customers check out? That’s more important.

So we look at some of the various practices. So those practices include those user-centric monitoring. Are you using those signals to reprioritize the work that you’re doing? So if you’re not meeting your reliability goals, how does your team react to that? Do they say, “Well too bad, that’s not my job. I’m going to just keep pushing features.” Or do they then focus more on things that’ll make the system more reliable? So we kind of dig into a couple of those.

But then at a very high level, what we really try to get down to is do you have an understanding of what the goals are, for reliability for your service? And if you do have an understanding of that, do you have also an understanding of whether or not your service has met those goals recently?

And it turns out as a developer in the past, maybe I don’t care what the reliability goals are of my application. And if I do, I probably don’t know what their actual values are. So we start to look at some of those things as well to really summarize, are you focused on that? Are you performing well there?
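The high-level question Nathen describes, do you know your reliability goal and whether you’ve met it, boils down to a simple comparison once a team has picked a user-facing signal and a target. Here is a minimal sketch with a hypothetical checkout-success indicator; the function name, numbers, and target are illustrative, not anything from DORA:

```python
# Illustrative sketch: has a service met its reliability goal?
# Assumes a user-facing SLI: the fraction of successful checkout requests.

def met_reliability_goal(successful: int, total: int, target: float) -> bool:
    """Return True if the measured success ratio meets the target."""
    if total == 0:
        return True  # no traffic in the window: treat the goal as trivially met
    return successful / total >= target

# e.g. a 99.9% success target over some measurement window
print(met_reliability_goal(successful=998_900, total=1_000_000, target=0.999))  # False
```

The hard part, as the conversation makes clear, is not this arithmetic but agreeing on which user-facing signals and targets matter in the first place.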

Abi: I think one takeaway for listeners is just how intentionally and thoughtfully you guys think about these different constructs, and concepts, and practices, and what the right signals are for measuring them.

I want to bridge into a related topic, which is I think a pretty big change in 2022, which was that there was no longer an elite category. And for some context for listeners, of course in the annual reports, there’s typically been an elite, high, medium, and low performing cluster. Can you share a little bit about how that came to be?

Nathen: Yeah, and I think it’s so fascinating. And you called out a point earlier where because it’s a multi-year study, we can watch these trends. And so actually in the early days of this study, there were only three categories. Low, medium, and high performers. When you look across those four clusters of software delivery metrics. I think it was 2017 or '18, I’d have to check my notes here. But in one of those years, we saw within the high cluster, there was this breakout group.

In that first year, we called them elite, but we said elite were kind of a subset of the high performers. And then in following years, the breakout was enough that we could see four clusters in the data. And then in 2022, this weird thing happened. When we looked at the clusters, there were three.

And for me, I think it’s just a good reminder to everyone that we don’t set the bar before we run the survey. We let the data drive what we find. And so we’re really trying to report on that. I think that’s really important. And I think that sometimes that does get lost on some of the readers.

Abi: Just to reiterate that point, these categories aren’t Google’s opinion of what good and bad are. It’s really what’s emerging from the data you’re capturing. And on that note, one thing I noticed also on this year’s report was that you actually talked about these other clusters. Starting, slowing, flowing and retiring were their names. So would love to know what the story is behind these. What was the inspiration for even looking into a different type of set of clusters, and what was the methodology behind this?

Nathen: So the first thing is we saw that elite cluster didn’t emerge. We also now have a couple of years of availability/reliability data. And those clusters were truly software delivery performance clusters. So they weren’t taking into consideration any of that reliability data.

So we thought, and by we, I mean Dustin [inaudible 00:18:37], who’s one of the researchers, thought, what if we took that reliability data and mixed it in with the clusters of throughput and stability, our two software delivery measures, with reliability as our operational performance? And so then when he put those together, again, this time four clusters did emerge. And then Abi, we had the hardest problem in computer science: naming.

How do we name these things? What should we call them? And again, what I’ve seen in practice is so many teams say, “Elite performance is the highest bucket. We’ve got to get to elite performance.” And that brings me a ton of anxiety, and that’s not the right way to use this. And so I wanted to make sure, and we wanted to make sure, that as we named these new clusters, we tried to stay away from sort of judgy names. No one wants to be a low performer, and we’re not judging you because you’re a low performer. It’s just where you are.

So we tried to find something that fit better, these kind of archetypes. What I don’t know is did we get it right? I’m not sure, but we did find these four clusters. And when we think about them, they kind of mirror what might be the life cycle of a product. And that’s where we came up with the slowing, the flowing, the retiring, et cetera.

Abi: I’m curious what the takeaway is then for practitioners. There are now these kind of two sets of benchmarks or cluster groups, if you will. How are they different in terms of practical use? Which ones should leaders focus on and in what scenarios?

Nathen: I think the biggest takeaway is context matters. Context matters. Within an organization, you almost invariably have multiple applications, multiple application teams. Not every application needs to be quote unquote an “elite performer”. We don’t need to have the best of the best performance there. And really what we want to say is that look across your fleet of applications. You might even pre-designate, “Look, this one application, we know it’s working super well. We don’t have to change it very frequently. Our customers are very, very happy with it. We are very happy with it. It can act like that retiring or that steady state application. We want to focus on reliability. We want to make sure it stays working, but we don’t have to change it a whole lot.”

And so it really is about thinking a little bit more critically about where do you want your application to be, and giving you this opportunity to bring in your context. It’s not about how do I be the best DORA performer. It’s, how do I help my team get the right outcomes?

Abi: I really like that advice around the importance of context and critical thinking. I think that’s going to be a theme as we shift into the second part of our conversation, which is that internal use of DORA metrics at different technology organizations has become really widespread. And I think this is still something that a lot of leaders have questions about in terms of how they should be doing it. There’s also some controversy around it in terms of examples where it’s gone wrong or been misused. So I want to have a discussion around the ways in which you’ve seen the DORA metrics misused as well as the challenges with implementing them.

I want to start off by asking you, and you had mentioned to me that one problem you’ve seen is leaders, and you kind of already touched on this, but leaders being really focused on being high-performing instead of continuous improvement. Can you elaborate on that a little bit more?

Nathen: I’ve seen organizations where leaders say exactly that. “By the end of this year, 80% of these teams have to be high performers or elite performers.” And frankly, there’s a couple of challenges with that, as everyone that’s listening to this will know. When that becomes the goal to hit that metric, that really encourages people to game that metric. You want us to decrease our lead time from commit to running in production? Great, I’m going to do a whole bunch of work over here on the side that’s invisible. And when I know it’s good to go, then I’ll put it into the repository and commit it, and then it’ll flow through very fast. No, don’t do that. That’s the wrong thing to do.

I also think that it sets these unrealistic goals for teams. And really, the thing that we need to take away from DORA, and frankly from all of these measurement frameworks, is that the measures are really for the practitioners, the people that are doing the work day-to-day. And it is a framework and a way for them to understand and us to understand, how do we take the next step forward? How do we build in that practice of continuous improvement?

But I’ve also seen leaders start pitting teams against one another. How do I know which are my good teams and which teams should I fire? Of course, I’ve never heard a leader say those words particularly. But it does happen.

I’ve also seen frankly, some challenges where folks, our research is survey-based. We go out and we ask questions. And sometimes, internal teams say, “That’s cool that that’s what you do for the industry, but I have all the data. So let me pull all of the data from my systems.”

Unfortunately, I’ve seen too many organizations spend a lot of engineering effort pulling that system data together to build beautiful dashboards that nobody looks at. And they spend all of that capacity building these dashboards instead of spending some of that capacity actually improving their capabilities. And so there is a balance and a trade-off here.

And I think the signal to watch out for is this. The teams that are doing the work, are they using the metrics to help guide their own improvement? Or are they using the metrics to shine when they look up the organizational chart? The former is what you want. The latter is the signal that you want to watch out for.

Abi: I think that was so well described, and you touched on so many important points. I want to double click on one thing in particular though. We talked about the practice of not just setting goals based on these metrics which you described the misincentives that can create. But the practice of using them to stack teams against each other. For the leader out there who’s still wondering if that is a good practice or not, can you just state why that is not a good thing to do?

Nathen: Yeah, for sure. Well, I think I’ll just go back to the research. One of the most foundational findings that we’ve seen year in and year out with the research is that the culture of your teams and the culture of your organization is one of the biggest predictors of performance. In fact, this year, and by this year, I mean the 2022 survey and report. We did a deeper dive into supply chain security. The leading indicator for whether or not a team is embracing supply chain security practices is culture.

And so when you start to pit teams against each other, you start to encourage behavior like hoarding information and hoarding lessons. “I’m not in it with you. I’m in it against you.” And instead, we find that those generative cultures where we’re really thinking about what is the overall performance, how do we share information and bring each other along? Those are the organizations and the teams that perform best. So trying to pit those teams against each other is not going to work well.

The advice I would give for that leader though is you can still turn this into a contest. You can still gamify it if you want. But instead of saying who’s going to have the best performance, my question is who’s going to improve the most? And in fact, when you ask that question, your teams that are the lowest performing today, they have the best incentive to ramp up. Because if you’re shipping once a year, going from once a year to twice a year, it’s obviously very different than going from once a day to twice a day. These are completely different approaches that you have to take. And so I would instead look at who’s the most improved. But even better than that, who’s learned the most over the past year? That one gets a little bit more hard to quantify. So maybe back to the most improved.

Abi: Maybe perhaps a little bit aspirational. But I love that example of the quote unquote “low-performing” team shipping once per year, going to two. And that example, that probably delivers a lot more of a stepwise benefit to the organization than just having a team go from once a day to twice a day, where there’s questionable actual business value in that.

We’ve been talking about how leaders can get a little maybe overzealous with their use of the DORA metrics. But there’s also I think some examples of the other end of the spectrum, which is, you mentioned an example about working with some organizations that are maybe hesitant to compare their performance against the overall industry, and instead favor just comparing against their industry sector. Can you talk a little bit about what you’ve seen there?

Nathen: Yeah, I have seen that. Because of course not only do we get participation in the survey across a variety of industries, we also share that information back. And honestly, what’s been very difficult to uncover are big statistical differences from one industry to the next.

I think it was the 2019 report that showed the retail industry was different than all of the other industries. And it was different in that it was performing statistically better than the rest of the industries.

So I think that’s interesting. Of course, you talk to financial services, healthcare companies, highly regulated industries. And they look at this idea of speed or velocity, and they get very nervous about that. We’re working in a well-regulated industry, and that is going to put some roadblocks in place and intentionally slow us down. And of course we do see that. But don’t use that as an excuse not to improve. Yes, I’m not saying that you should throw away those regulations. You should please do protect our data. Please protect our data. But, it’s not an excuse not to improve.

Abi: And another excuse you’ve brought up before is companies feeling like they’re already world class at the top of the industry. Not to accuse Google, but companies like Google that are already so prestigious and have very modernized practices, sometimes maybe don’t feel like the DORA metrics are worthwhile for them.

Nathen: Right. And I think that’s fascinating as well. And it goes back to the multi years that we’ve had across this study. As we don’t set those bars in advance, each year, the definition of what is a medium performer changes slightly, just based on what we’ve seen. What I’ve definitely seen is that sometimes, teams hit that peak performance and then they rest on their laurels.

And in part, maybe that’s being a little bit unfair to those teams though. Because maybe, software delivery is no longer their constraint. So perhaps they’ve turned their focus elsewhere. And to that, I would applaud. You’ve got the software delivery on lock. Now it’s something else that’s holding you back, that’s preventing you from making good progress. Go focus on that for a little bit. Maybe it’s okay if your performance on software delivery falls off a bit.

Abi: Yeah, I think that’s such an important point. Because if leaders are looking at DORA measures as the end-all signal for success, like you just said, there are actually a lot of other aspects of running an effective software organization besides delivery and operational performance. And so it’s important that leaders look also beyond DORA at other constraints and challenges within their organization.

Nathen: Speaking of looking at other challenges and constraints, this is the other pitfall that we see all the time with DORA. And I’m going to quote a good friend of mine, Bryan Finster. In the organization he was working in, he went out and bought a bunch of copies of the book Accelerate, and he gave them to a bunch of executives across his team. And all of those executives picked it up and they got to about page 19, where the book describes the software delivery performance metrics. And then they put down the book and they’re like, “Yes, let’s all go get better performance.”

And he realizes now that what he should have done is put the book down and say, “Skip everything, jump to Appendix A.” Appendix A is where we talk about the capabilities that drive that improvement. I think that we all too often get fixated on those four key metrics, forgetting that those are outcomes. And we don’t get better at the outcome by focusing on the outcome. We have to focus on the capabilities that drive those outcomes. And that’s where appendix A and the capabilities model really is so powerful with DORA.

Abi: Yeah, that’s a great story. And I also had a chance to speak to Bryan a few months ago, and I remember he used the term vanity measures or vanity scorecards for leaders. That’s what DORA metrics became when leaders were just focused on page 19 as you described.

I want to talk, and I don’t want this to get too pedantic, but one of the challenges I’ve personally seen at multiple organizations with the use of DORA metrics is actually not even how to measure them, but agreeing on the definitions of what it is we’re trying to measure. At GitHub, while I worked there, we worked on a DORA dashboard and unfortunately we did do it the hard way as you described. And I don’t think anyone really looked at it. But one challenge in putting together that dashboard was actually the definition of lead time. And I think part of the problem is that lead time is a term that exists outside of DORA and outside of even just engineering.

So I want to take a step back. We’ve talked about this over email, but what is the intent of the measure lead time within DORA? Is it intended to include the human components such as planning and development, or is it really more about the CI/CD and release, automation and tooling?

Nathen: Yeah, and I think there’s actually a pretty good explanation of that in the book Accelerate, but I’ll paraphrase it here for you as well. Essentially, in the book Accelerate, it might be page 21, although I’ve probably just made that number up, there’s a nice table that illustrates the difference between product development and software delivery.

And so when you talk about the time that an engineer puts in building a feature, when we talk about which features should we build, how do we prioritize work, all of that is very much product development. And it’s not that it’s not important, because that is important, and we should have an understanding of that. But the DORA research, essentially it has to pick a place to focus. Where do we focus the investigation that we have? We can’t look at everything or else we’ll just water everything down.

So they intentionally chose delivery lead time, and that starts when the code is committed to the repository. And it ends when that code lands in production. Now, as I just described it, it’s super easy. But you know what? There’s a ton of nuance behind that. When it gets committed to which part of the version control system: is it to my local branch, is it to a feature branch, is it only to trunk? When does it get to production? Is it when we roll it out to a beta user, to 5% of our users, to 100% of our users? So these four metrics, on the surface, are beautiful, simple, easy to understand. Once you take the surface off, there can be a ton of nuance in them.

My recommendation to teams always is at least at the team level, have those conversations. Agree to something. Set those boundaries exactly what you’re going to measure.

Now unfortunately, I’ve also seen organizations want to define that organization-wide from the jump, from the start. And I think the better approach is to have a couple of teams define it for themselves, do those experiments, make those improvements over time, and make sure it’s having the impact that you want. And then maybe start sharing those boundaries, if you will, across the organization. But I think it all comes back to, again, context matters. And we need to be clear and transparent about exactly what we’re measuring.
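To make the boundary conversation concrete, here is a minimal sketch of computing delivery lead time once a team has agreed on its boundaries. The timestamps and the boundary choices (commit means merged to trunk, deployed means rolled out to all users) are made up for illustration, not prescribed by DORA:

```python
from datetime import datetime
from statistics import median

# Hypothetical change records, using boundaries this team agreed on:
# "committed" = merged to trunk, "deployed" = rolled out to 100% of users.
changes = [
    (datetime(2023, 3, 1, 9, 0),  datetime(2023, 3, 1, 15, 30)),
    (datetime(2023, 3, 2, 11, 0), datetime(2023, 3, 3, 10, 0)),
    (datetime(2023, 3, 3, 14, 0), datetime(2023, 3, 3, 16, 45)),
]

# Delivery lead time per change: time from commit-to-trunk until in production.
lead_times = [deployed - committed for committed, deployed in changes]

# The median is less distorted by one unusually slow change than the mean.
median_lead_time = median(lead_times)
```

Whatever boundaries a team picks, writing them down this explicitly is what makes the number comparable from week to week.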

Abi: Yeah, the takeaway here is maybe to not so much focus on arriving at a single, universal definition, or a strict definition across an organization or across the industry, but to allow context to surface really what’s right and useful to measure for each context.

While I worked at GitHub, something else funny happened, which was Microsoft was also looking into the DORA metrics. And this is while I was working with Nicole at GitHub. And so a group from Microsoft reached out to us. And I remember they were struggling with deployment frequency. Specifically, they were not sure what the definition of a ‘deployment’ should be. I wanted to ask you, is this something you’ve seen outside of this example as well?

Nathen: Oh yeah, for sure. So on the one hand, you have teams that are doing progressive deployments, as I just described a second ago. When do you cross the threshold? You have teams that are doing things with feature flags. Did we release the feature, or did we deploy the feature? And what’s the difference between a deploy and a release? And again, you get into all of those nuanced conversations. I see that happen all the time. And again, especially when it comes to mobile applications.

Because look, those four key metrics, you can use them for anything from a mainframe, to a SaaS that you’re consuming, to your mobile applications. But mobile applications have some very unique constraints. If I want to release an application, I put it into an app store. I don’t have control over that.

But I think the important thing is that, as a team, we’re going to be very clear and thoughtful about what we’re measuring, and do our best not to constrain the boundaries so that we always look good. Back to Bryan’s vanity metrics. We’re going to help our teams improve by maybe even expanding those boundaries a little further in either direction.

Abi: A tactical question I had, actually that we had at Microsoft, was with regard to deployment frequency: when you’re looking at it at the enterprise or organization-wide level, it seemed like deployment frequency is just going to go up the more developers you have.

Is there some alternative measure or way of looking at deployment frequency that you would recommend where you don’t run into that problem where this number is really just reflective largely of the number of developers we have or the number of applications we have? For example, do you recommend looking at it by team individually, or divided by number of developers? I’m curious what you’ve seen.

Nathen: Yep. I think that, again, the research is pretty clear. Every time we ask the question, “How frequently are you deploying? What’s your time to recover?”, we ask it for the application or service that you are working on. These truly are team metrics. And rolling them up organization-wide, I think there is a lot of peril to be found there.

It feels great to go to Microsoft and roll up all of those four key metrics. And Microsoft, of course, is just going to be knocking it out of the park. They’re doing hundreds, if not thousands or tens of thousands, of deploys a day. Who could beat that?

But when you look team by team, what you will find in an organization like that, or in an organization like Google, is that one team may be deploying multiple times a day, and other teams may be deploying once every six months. And so bringing all that together at an organizational level is really an inappropriate use of the data. The idea behind this is that it’s a team-level or service-level measure.

The other challenge I think is when I say team, it feels like I’m talking about your org chart. Really, it’s the cross-functional team that owns that application or service. Because if it’s not the cross-functional team, we may fall back into our old school pitfall of, “Hey developers, you own those throughput metrics. Hey operators, you own those stability metrics. Now go fight.” We can’t do that.
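A small sketch of the rollup problem Nathen describes, with made-up deploy counts per service, shows how an org-wide average hides exactly the teams that need attention:

```python
# Deploy counts per service over a 28-day window (made-up numbers).
window_days = 28
deploys = {"checkout": 112, "search": 45, "billing": 1, "batch-jobs": 0}

# Rolled up org-wide, this looks like strong performance: several deploys/day.
org_deploys_per_day = sum(deploys.values()) / window_days

# The per-service view surfaces the two services that almost never deploy,
# which the org-wide rollup completely hides.
per_service_per_day = {svc: n / window_days for svc, n in deploys.items()}
```

The team-level numbers, not the rollup, are the ones that point at where a capabilities conversation is needed.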

Abi: That’s so funny. So many lessons here that we’ve talked about around the common pitfalls of using DORA metrics. I want to invite you to share any other common misunderstandings or challenges that you see around the use of the metrics. And I also want to leave you with one more prompt: you mentioned you’d recently had a good discussion in your community, which we’ll talk about later, about MTTR, or maybe there was some similar confusion or discussion around it.

Nathen: Again, I think the most important thing when it comes to those common misperceptions and challenges is to remember that there’s more to DORA than those four keys. The heart and soul of DORA is the capabilities and the predictive model. We can say that practicing continuous integration, and getting better at it, is predictive of being able to do continuous delivery, which is predictive of software delivery performance, which is predictive of organizational performance.

So remember, we’re always looking at inputs, outputs, and outcomes. And I think it’s really important not to lose focus on the inputs to those outcomes. And software delivery performance really is an outcome. Oh wait, but it’s also an input to organizational performance. Wow, this is weird. Everything can be an input and an outcome at the same time. Try not to get too lost in the details. Remember, you’re here to get better.

Now onto MTTR, boy, time to restore service is a really controversial measure. And I think that it’s controversial in part because as a collective community, we’ve learned more about how systems behave. And we’ve evolved our understanding of what metrics matter and how should we be looking at those.

And I think that when it comes to time to restore, some of the big pushback has to do with almost the same thing as our delivery lead time: it has to do with the fuzzy ends. And frankly, it was Courtney Nash at DevOps Enterprise Summit who gave a talk about the fuzzy ends of time to restore service. And it’s on both sides.

So if you want to measure your time to restore service, the first thing you have to know is when did the service go out? Well boy, as you can imagine, we’re going to get in a room together and argue about that. When did this problem start? Was it when we saw it? Was it when a customer saw it? Was there some time before that? How do we know?

And then, just as difficult to define: when did that problem end? When did we actually restore service? Was it when our monitors turned green? Was that a restoration of service? Who knows? So I think there’s a lot of fuzziness in the numbers. And then, more importantly, what do we learn from that? What do we learn from comparing time to restore service between one incident and another?

So I think there’s again, a lot of nuance and context that matters here. How are we using this number? And frankly, when we go back to the four keys, the four keys of software development. Sorry, software delivery. We are really trying to get to a place where we can essentially measure the batch size. Because we know we want a smaller batch size, smaller changes, constantly moving through the pipeline. They’re easier to reason about, easier to recover from.

So maybe, and this is a little bit controversial here. Maybe the time to restore that we’re really asking about within DORA is what’s your time to restore when there is a change failure, not your time to restore across all types of incidents.

But we do know, through our own research at Google, that a large majority of outages or incidents are caused by a change we introduced. Which is different. Of course, there are some where a data center loses power. That wasn’t because of a software delivery, and that will take some time to restore. Should we really bucket those times to restore together? I don’t know. An argument can be made both ways. I think that those who are on the leading edge of thinking about reliability, thinking about learning from our systems, would argue against it.

Abi: I have to admit, this is so much insight in what you just talked about. At both GitHub and Microsoft, we were so just tripped up by lead time and deployment frequency. I don’t think we ever got to the conversations about MTTR, but I’m sure others are as you’ve seen yourself.

Nathen: It’s even that M. Are we talking about the mean time to recover or the median time to recover? We can spend a day debating that as well.
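That mean-versus-median distinction is easy to see with a few made-up incident durations; one long-tail incident dominates the mean but barely moves the median:

```python
from statistics import mean, median

# Restore times in hours for five hypothetical incidents.
# The 40-hour outlier drags the mean up; the median stays representative.
restore_hours = [0.5, 1.0, 1.5, 2.0, 40.0]

mean_ttr = mean(restore_hours)      # dominated by the outlier
median_ttr = median(restore_hours)  # typical incident
```

Which of the two a team reports changes the story it tells, which is exactly why the M is worth a conversation.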

Abi: You mentioned earlier how Nicole Forsgren was really the creator of this research program and the company DORA, which Google acquired. Nicole after leaving Google went on to be the lead author of this paper called SPACE, which was this framework for thinking about and approaching measurement of developer productivity.

And naturally when that came out, I think a lot of people were asking, “What do I do with these two things? How are space and DORA related? How do I use them together?” So I want to get your perspective on that. What is the advice you give to organizations who are looking at both of these frameworks and trying to figure out what to do with them?

Nathen: Yeah. For me, it’s kind of the same thing. Should you do DevOps or SRE? No, I’m sorry. It’s platform engineering now. That’s the thing. No, no, no. Of course that’s not how I look at them. I actually look at them as very complementary to one another.  

So if we take DORA and say that the four key metrics are what we’re driving towards, what are some of the inputs to that? One of the inputs, one of the capabilities might be continuous integration. So are you practicing continuous integration? How do you as a team get better at that?

Well now, we might bring in the SPACE framework and say, “All right, we want to get better at continuous integration.” So we need maybe a measure around the satisfaction of the team, the S of the team. So we could go out and survey our team members. Is continuous integration giving you the feedback that you want in a timely manner? How do you feel about that?

We can talk about the performance of continuous integration, or the activity of continuous integration. How often does a commit actually result in a build? That’s activity. And efficiency and flow might look at how many of those builds actually pass the tests, or how many are successful builds versus failed builds. So there you might bring in the E.

And then of course there’s the communication and collaboration. Think about continuous integration. If you’re a developer, you might not be in charge of or have access to the underlying foundation of the continuous integration platform.

When the build fails, and I have to tell you, I used to be responsible for the Jenkins platform at the place where I worked. And anytime the build failed, it was always, “Nathen, Jenkins is broken.” It doesn’t matter that my test failed. My test failed because Jenkins is broken. So there’s a communication problem and a handoff problem there.

So I think that we as a team decided that we want to improve continuous integration, because we know our continuous integration capabilities are going to lead to better software delivery performance. Now let’s define our goals for continuous integration using the SPACE framework, and let’s pick two or three of those SPACE-type goals and use those as our objectives. And for me, we’re going to take DORA out of SPACE with this. I think it’s perfect.
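The activity and efficiency measures Nathen mentions boil down to simple ratios over build records. A minimal sketch with made-up data (the records and the window are illustrative, not part of SPACE itself):

```python
# Hypothetical CI build records over a window: True = passed, False = failed.
builds = [True, True, False, True, True, True, False, True, True, True]

# Activity: how many builds ran in the window?
build_count = len(builds)

# Efficiency/flow: what fraction of those builds succeeded?
success_rate = sum(builds) / build_count

# A falling success rate is a prompt to look at flaky tests or handoff
# problems before they show up downstream in delivery lead time.
```

Pairing a couple of these with a satisfaction survey question gives the two-or-three-objective mix described above.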

Abi: Yeah, I love that explanation. And when people have asked me this question, I think it’s similar to what you’re describing as far as earlier you talked about how everything’s an input and an output. Or sorry, input and outcome. And there’s just these perpetual layers of outcomes and inputs.

And when we take a step back, the outcome we’re ultimately trying to get at is organizational performance. And then the effectiveness of our technology organizations drives organizational performance. And so many things need to be included in our view of what an effective engineering organization is. And as you mentioned, I think SPACE can help organizations think about the specific areas and maybe deeper ways to drive that overall delivery performance. Does that sound about right?

Nathen: It does. And in fact, as you’re talking through it, the other thing I think is important to recognize with SPACE is that SPACE is applicable well beyond just technology. You could go to a sales team, and look, it’s easy to measure the performance of a sales team. Did they close all the deals? Did they hit their quotas? But that’s not enough. We also need to look at the satisfaction of the team. If we close all of our deals and then those top performers leave, there’s a problem. We’re going to lose out on some of that.

So I think even taking the SPACE framework, I think it’s more broadly applicable as a way to think about what sort of measures can we put against or use with these sociotechnical teams, or these teams of humans period. There doesn’t even necessarily need to be technology involved.

Abi: Well, I just want to leave listeners with a heads-up that we have a conversation with Margaret-Anne Storey, who is one of the co-authors of SPACE, coming out in a couple weeks. So if you’re interested in diving deeper into SPACE a couple years after it was published, and hearing about how that’s gone, look forward to that conversation.

I want to ask you now about how to actually measure the DORA metrics, but really, specifically, what you guys do now at Google to enable that. I know that when Google acquired DORA, DORA was not just a research program but an actual product solution that companies could use to survey their own engineers and organizations. Is that something that Google is still doing or offering? Curious to know what you guys provide in that regard.

Nathen: Yeah, for sure. So DORA built up a capabilities assessment tool that was exactly that. It was a survey that you deployed within your organization. And we would crunch all the numbers, use the data models, and generate a report, give some suggestions. We still use that tool within Google. We still offer that to our customers.

But the other thing that’s really important is we’re trying to bring more of that and make it broadly applicable to the entire world, and give more self-service access to it. So as an example, at cloud.google.com/devops, there’s a DevOps quick check where anyone in the world can go and not only quickly assess how you’re doing against the four keys, but the next step of that helps you identify the three capabilities that typically help teams at your level. So then you can do a deeper investigation and assessment of those particular capabilities for your team, and kind of stack rank.

And I think over the next year, you’re going to see more of this being a little bit more transparent. Bringing as an example, the questions that we ask in the survey, making those a little bit more durable and easy to access, so that you can start as a team leveraging DORA even more.

Abi: I want to talk more about what’s upcoming with DORA. Because as we’ve talked about, it’s constantly evolving and iterating. You just mentioned there’s some potential for some more sharing or openness, and maybe even tools around capturing these types of insights within organizations. I want to ask you about one recent development, which is this community that you guys have launched. Can you share with listeners what is this new community, and why did you guys start it?

Nathen: Yeah, absolutely. So the community, if you want to go to it and join, and of course I would recommend that you do. Go to DORA.community. Who knew community was a top level domain name, but it sure enough is. So DORA.community is where you can go.

The genesis behind this, the reasoning behind this: as we continue to work with various customers, and as we continue to watch what’s happening in the industry, I see more and more people adopting the ideas and the practices and the metrics around DORA. And every time I talk to them, there are questions like, “Well, when does change lead time start? When does it end?” And of course, I have answers and opinions. But if everyone’s asking this, let’s bring everyone together, because we can all learn from each other. So it truly is, or we envision it as, a community where practitioners, leaders, and researchers can all come together and start sharing ideas.

Today, it’s a discussion group, and we hold regular DORA community events that are online, virtual, open conversations. We use a lean coffee format so that we talk about the things that are top of mind for the people that show up for those events.

And I tell you, it’s been really fascinating. I think we’re all learning a lot from each other. I’ve been especially encouraged because not only do we have practitioners and leaders showing up. But we do have researchers showing up. It’s folks that are on the DORA research team. But we’ve also had researchers from other areas of the industry that are looking at different things.

In fact, yesterday we held a community meeting. We had Dave Farley, who was one of the authors of this year’s report, but also wrote Continuous Delivery and Modern Software Engineering. So he came in to share some of his insights, and then we had a conversation; Dr. Nicole Forsgren joined in on that conversation. So I think it’s really great to see this community come together, where we can all learn from and grow with each other.

Abi: I can just provide my own testimonial. I joined the community as soon as I heard about it. And I’ve really enjoyed the often, “pedantic” sounds bad, but really detailed discussions about things like definitions, how to use these metrics, and how to capture insights outside of the DORA framework as well. So I would highly recommend that anyone who’s interested in DORA join this community.

You also mentioned that one of the goals of the community is feedback on the direction of the research program. So I want to ask you, what is the direction of the research program? Naming and branding was something you brought up as an interesting problem you guys may be working on. But would love to know, what should we expect from DORA in the years to come?

Nathen: Yeah. I think that DORA has a well-recognized brand. And DevOps itself has a well-recognized brand. Well-recognized does not always equate to good. Just well recognized. I think DevOps is a good example of this. I think that there are many in the industry who hear the word DevOps and think that we’ve moved beyond that. And whether or not we have is unimportant. I think there is and was a time where it was a great banner for us all to get behind. But we didn’t all have the same understanding of what that word meant.

I think I have a little bit of concern about DORA, because the D and the O stand for DevOps. And I truly believe that this is applicable to any type of technology and any type of technology team. So I think that’s a big question for me. How do we balance that? I don’t have a good answer. Maybe we’ll come up with something.

And then, in terms of where we’re going with the research: as I said, we’re looking at which outcomes we’re going to measure and look at this year. We’re going to have a DORA community discussion where the researchers come and share, “Here’s what we’re thinking for the 2022 survey and report,” and gather some feedback and insights from the community, which is already influencing some of the things that we’re talking about.

Abi: Well, that’s great to hear. And Nathen, thanks so much for sitting down with me today. DORA is top of mind for so many leaders and organizations out there. So I think people are going to get tremendous value out of this conversation. Thanks so much for coming on. Really enjoyed the chat.

Nathen: Thank you so much for having me. It’s been a real pleasure and a lot of fun.