Key findings from Google's 2023 State of DevOps Report

An interview with Nathen Harvey, who leads DORA at Google, about the key findings and emerging trends surfaced in this year's State of DevOps report.

Timestamps

  • (1:10) What DORA focuses on
  • (2:17) Where the DORA metrics fit
  • (4:35) Introduction to user-centric software development
  • (8:05) Impact of user-centricity on software delivery
  • (9:40) Team performance vs. organizational performance
  • (13:50) Importance of internal documentation
  • (15:19) Methodology for designing surveys
  • (19:52) Impact of documentation on software delivery
  • (23:11) Reemergence of the Elite cluster
  • (25:55) Advice for leaders leveraging benchmarks
  • (28:30) Redefining MTTR
  • (33:45) Changing how Change Failure Rate is measured
  • (36:45) Impact of AI on software delivery
  • (41:25) Impact of code review speed

Transcript

Abi: Nathen, so good to have you back on the show. Excited to chat today.

Nathen: It’s really great to be back. Thanks for having me. Really appreciate it.

Abi: Of course. Well, last time you were on the show, we went deep into what DORA is, the research program, how the work lives on at Google. Just to give folks a quick reminder, who are tuning in for the first time, share with listeners what DORA is and what your research focuses on.

Nathen: Yeah, absolutely. DORA’s a research program that started… wow, it’s been almost a decade now. This research program started elsewhere. It’s currently part of Google. But this is a research program into how do teams get better at delivering and operating software. And I really like to talk about it as if… and because this is what it is, it’s about how do we help organizations use technology in furtherance of their business goals, and how do we improve organizational performance through technology? And of course, the way that we have to do that, technology matters, but it’s also, we want to look at the whole system. It’s the people in the system, the culture, the process, all of this. And so DORA is really an investigation into which of those capabilities are driving software delivery and operations performance, and whether or not there’s a connection between software delivery and operations performance and actual organizational outcomes.

Abi: And of course, the DORA metrics, the four keys are everywhere. Everybody’s using the DORA Keys. So remind people, what’s the relationship between the DORA metrics and DORA?

Nathen: That’s a great question. So DORA is… as a research program, it’s full of metrics, but you’re right, it is well known across the industry for the Four Key metrics. And those Four Key metrics are really DORA’s measures of software delivery performance. And so when you think about software delivery performance, of course, that is only a part of your entire life cycle. It’s only one of the ways that you can deliver value using technology. But the idea is that delivering software is a crucial step and often a bottleneck within organizations.

There’s two measures for stability. The first is your change fail rate. When you push a change, how frequently does it do what you expect? It would be a failure, or we would count it as a failure, if you had to immediately roll it back or push forward a hotfix. Of course, you’re going to release things with known bugs, known issues. If those things are going to wait until the next release before you fix them, we wouldn’t count that as a change failure. And then our fourth one is about this idea of restoring service when there is a failed deployment. So we call this the failed deployment recovery time. How long does it take your team and/or your system, if you will, to recover from that failed deployment?

So taken together, really quickly, those four metrics. One of the cool things about them is you can use them for any type of technology that you’re delivering. It doesn’t matter if it’s the latest and greatest AI model that you want to push out, or it’s your mainframe application where you’re shipping some changes. You can use those four metrics across any of those applications.
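
To make the two stability measures described above concrete, here is a minimal Python sketch. It assumes a hypothetical `Deployment` record that flags whether a change needed an immediate rollback or hotfix and notes when service was restored. The record shape, field names, and sample data are illustrative assumptions, not DORA's survey instrument or tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    deployed_at: datetime
    # True only if the change needed an immediate rollback or hotfix.
    # Known bugs deferred to the next release are not counted as failures.
    failed: bool = False
    # When service was restored after a failed deployment, if known.
    recovered_at: Optional[datetime] = None


def change_fail_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that required immediate remediation."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


def recovery_minutes(deployments: list[Deployment]) -> list[float]:
    """Minutes from each failed deployment to restored service."""
    return [
        (d.recovered_at - d.deployed_at).total_seconds() / 60
        for d in deployments
        if d.failed and d.recovered_at is not None
    ]


# Made-up history: four deployments, two of which failed.
t0 = datetime(2023, 10, 2, 9, 0)
history = [
    Deployment(t0),
    Deployment(t0 + timedelta(days=1), failed=True,
               recovered_at=t0 + timedelta(days=1, minutes=30)),
    Deployment(t0 + timedelta(days=2)),
    Deployment(t0 + timedelta(days=3), failed=True,
               recovered_at=t0 + timedelta(days=3, hours=2)),
]

rate = change_fail_rate(history)
recoveries = recovery_minutes(history)
print(f"change fail rate: {rate:.0%}")
print(f"recovery times (minutes): {recoveries}")
```

Nothing in the record depends on what is being deployed, which matches the point that the same measures apply to a mainframe application or an AI model.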

Abi: Well, for listeners who are interested in learning more about these metrics, some of the pitfalls, challenges, calculating that, we had a great conversation a few months ago, I’d point people to that prior episode. One of the things that’s so interesting and enjoyable for me to follow with DORA is the way that the research and the program continues to evolve. And today, I want to focus in on the latest report, the 2023 DORA Report, because there were so many things that were new, things that have changed that were fascinating to me. So I want to start with this idea of user-centricity or user-centric software development. This is a new construct, a new concept I hadn’t personally noticed before in prior reports. So take us through, first of all, how did this new concept end up in your research?

Nathen: Yeah, for sure. That’s a great question. So we’ve seen a real resurgence or push within the industry towards this idea of platform engineering. That’s one of the things that helped us push towards this user-centric measure. Well, platform engineering, unfortunately, a lot of times, it gets marketed as DevOps is dead, platform engineering is the new thing. Well, maybe, I don’t know. I’m not super tied to labels, I guess. But really, when you think about what’s happening with platform engineering, you have engineers that are building internal platforms for other developers. One of the challenges that I’ve seen happen with platform engineering is that you have platform engineers that go and build a platform and then they get really angry when the other developers aren’t using their platform. Why aren’t they coming here? We built it, why aren’t they coming? That can be a challenge. You also look at things like, how do you measure reliability?

Well, you don’t measure reliability by looking at your CPUs. You have to know if your service is reliable because you understand the users. And so there’s a lot of these different things where we’re really focused in on this idea that technology is here, not so that we can play with the new toys, but instead so that we can serve our users. And so it’s this product thinking and this idea that we have to care about our users and that maybe, there’s something to that. And so this year, we did ask about user-centricity. The specific question we asked, we said, “Thinking about the primary application or service you work on, to what extent do you agree with the following statements?” And we had three statements that were around user-centricity.

The first, my team has a clear understanding of what our users want to accomplish with our application or service. The second, my team’s success is evaluated according to the value we provide to our users and our organization. And third, specifications, things like requirements or planning documents are continuously revisited and reprioritized according to user signals. So in other words, are you thinking about your users? Do you know how they’re getting value out of your platform? And when you get feedback from them, how are you acting on that feedback? This is what we measured around user-centricity.

Abi: What’s really interesting, you brought up that part of the rationale for introducing this into the research is the emerging trend of platform engineering. But as I understand it, this concept applies to both users who are internal within your organization, as well as users who are external, if I’m understanding correctly.

Nathen: That is totally correct. Platform engineering was part of the thing… We wanted to go and study platform engineering. But if I ask a team, do you have a platform engineering team within your organization, that’s not actually telling me anything, you may have just changed the label of a team. So we really want to just get to the characteristics of some of the things that might be happening there. So yes, platform engineers have users. Those users happen to be my coworkers that are developers, but every service that we run, my banking site, a retail application, they have users as well. Sometimes we call them customers and we have to really think about who they are.

Abi: So what were the findings? I would speculate that user-centricity matters. What surfaced in the research?

Nathen: Yeah, I mean, two of the big things that popped out: those teams that have a stronger user focus tended to have about 40% higher overall organizational performance. They also had better wellbeing for the employees on the teams that have that user focus. So there’s less burnout, higher job satisfaction, things along those lines.

Abi: And in your view, what’s the… The stats speak for themselves, but what’s the story behind this? What’s the why? What’s the takeaway for organizations as far as platform engineering or customer facing work from this data?

Nathen: I think that the practitioners that we work with every day, whether you’re a software engineer or a product owner or an operations engineer, an SRE, what you really want to do is ship value. You want to understand that people care about the work that you’re doing and that it is valuable to them. And it’s nearly impossible to understand how valuable your work is to your customer if you don’t know who your customer is, if all you’re focused on is pulling that next card off the backlog and getting to done as soon as possible. And so I think that, that idea that as we work, we can see how our work is valued and we have a closer, tighter connection to the users of the work that we’re producing, I think that, that really helps both with job satisfaction and then of course, happier, productive people tend to work in more successful organizations and help drive that success.

Abi: That makes sense. Well, I want to shift gears a little bit, and this one’s really nerdy. As you know, I geek out over all the research you guys put out. One thing that was new was this new construct of team performance. In all the prior studies DORA has put out, there’s been a focus and study of organizational performance, commercial performance, non-commercial performance. This year, I saw this new construct of team performance. So again, I want to ask you, why did this get introduced?

Nathen: That’s a good question. So just going back on the methodology here for a quick moment, the way that we run our research is that we put out an annual survey and we ask people to participate in this survey. We don’t have a good way to audit the answers that people gave us because they’re participating anonymously. So we do ask them about organizational performance. How profitable is your organization? Is customer satisfaction going up? Are you meeting your business goals? Things along those lines. But we know that for some practitioners, they may be pretty disconnected from that. So can we ask something that’s a little bit closer to them but beyond their individual remit? And that’s where this idea of team performance came in. And of course, we hypothesize that a happy, productive individual is going to sit on a happy, productive team, and that those organizations with more productive and happy teams are going to have better overall organizational performance.

So we wanted to test some of that out and dig into this idea of team performance. And then you get to this weird challenge here, Abi, what even is a team? Who’s part of my team? So we had to use some pretty specific language in our survey. And what we said was, we recognize that team has many different connotations and definitions. When we say team, we are talking about the people who work with you on the same primary application or service. Now, think about that, in a large organization, that’s going to be a cross-functional team that comes together on that primary application or service. And then again, we use a Likert scale that asks between strongly disagree and strongly agree, to what extent do you agree or disagree with the following statements about how your team performed over the past year. We want to look at the recent performance as well.

So how? We delivered innovative solutions, we were able to adapt to change. We were able to effectively collaborate with each other. We were able to rely on each other and we worked efficiently. Those were the measures that we used to look at team performance. And by the way, while I’m describing those, it may sound like I’m reading them. And the reason it sounds like that is because I’m reading them. I’m reading them from dora.dev/research/2023, where we publish all of our survey questions. I think this is really important because when we say something like team performance, you might have a very different understanding of what that means than what I do. And that’s fine. That’s going to happen. If you look at the questions, you’ll get a better sense of exactly how we evaluated that.

Abi: I love that you have started openly publishing more of the research and the surveys you all use. And not only differing definitions of performance could be a problem, but differing definitions of team as you described.

Nathen: Absolutely.

Abi: What are some of the interesting findings around where team performance and org performance are in line with one another or maybe instances where they weren’t?

Nathen: Honestly, I think that this is one of the challenges of how we do the survey as well, because of the anonymous nature of how we collect data, we may hear information, we may have data from two different teams that are working within the same organization, but we have no idea. We have no way of knowing that. We cannot actually classify that. So really, I think our methods leave us unable to say, how do those multiple teams interact together to drive organizational performance? But what we can see is that teams that reported higher team performance tended to report higher organizational performance as well. Not always the case but oftentimes.

Abi: And in a minute here, we’ll talk about documentation, which I think there was an interesting discrepancy between the correlation with organizational versus team performance. I want to get into it. So documentation, this is something that was mentioned in the 2022 report and again, surfaced in the 2023 report. Again, maybe give a quick introduction to listeners of what we mean by documentation and how you guys measure it.

Nathen: Yep, absolutely. So first, I think this is the third year actually that we’ve looked at documentation as a specific topic, so 2021, '22, and '23. When we talk about documentation, we are explicitly talking about internal documentation, teams that… A software development or delivery team might use their own internal documentation. And again, we ask on a Likert scale, and I’ll just read off the questions here: I can rely on our technical documentation when I need to use or work with the services or applications I work on. It is easy to find the right technical document when I need to understand something about the services or applications I work on. Technical documentation is updated as changes are made. And then, when an incident or a problem needs troubleshooting, I reach for the documentation. So here, we’re really trying to get to things like the usability, the searchability, the findability of that documentation. Is it actually helping you? As an example, when there is an incident, are you looking to the documentation or do you know that it’s so out of date that reading the docs is actually going to slow down your response and maybe even have detrimental effects on your response? So that’s really what we’re trying to suss out there.

Abi: And I know we talked about this the last time you were on the show and we’re talking about getting someone from your research team on the show in the future, but as you’re talking about these Likert items and how you measure these different constructs, share with listeners, I mean, how do you guys come up with this stuff? How are you figuring out these survey items and design?

Nathen: Oh, yeah, that’s a really good question. And I’ll start by referring all of the listeners to the methodology section of this year’s report. There’s about 10 pages wherein we go over the exact methodology. And really, it starts with us getting together as a team to think about what sort of outcomes matter to organizations, organizational performance, team performance, employee wellbeing, and so forth. But once we have those outcomes that we believe matter to organizations, we then have to hypothesize about how do we think those outcomes are achieved? And so there, we start to model a couple of things, like, documentation we hypothesize is going to help orgs perform better. We don’t know that upfront. We have to figure that out. We then look at potential surprises and try to uncover those. Then we develop the survey and as we develop the survey, this is, again, where it’s really important how we create the survey.

We talked earlier about team performance, and another capability that we look at is continuous integration, but we actually never use the term continuous integration in the survey. Instead, we talk about the characteristics of a continuous integration capability. Code changes are merged… sorry, a build is created every time code changes are merged, automated tests run every time code changes are merged. So we aren’t asking about the label, if you will. We’re asking about those characteristics. And then of course, we also have to do a good survey design. I can’t ask you on a scale of disagree to fully agree, do you like the color red or yellow, because what if you like one but not the other? We have to be careful about that. And that’s relatively basic survey design, but sometimes we have to remind ourselves of that even still. So then we collect all the responses, we analyze the data, we go through and do a factor analysis.

So when we ask four or five questions about a particular topic, we want to make sure that the answers are all grouped well together. This tells us that, actually, these four or five different things are actually one construct, if you will, and then we can treat it as such. And then we go through a bunch of additional evaluation. We write up our report. And this is the part that I love the best, and this is what we’re doing right now. The last step is to synthesize the findings with the community. It’s really a discussion around what we’ve found, and how does that relate to your own lived experience? What information and ideas can you take out of that as you seek to help your team improve?
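
The grouping check described above is a factor analysis. As a simpler, related illustration (not the factor analysis itself, and not confirmed as part of DORA's method), the sketch below computes Cronbach's alpha, a common internal-consistency score for a set of Likert items. The function name and the 1-to-5 response data are made up for the example.

```python
from statistics import pvariance


def cronbach_alpha(responses: list[list[int]]) -> float:
    """Cronbach's alpha for rows of per-respondent item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(row totals))
    """
    k = len(responses[0])            # number of survey items
    items = list(zip(*responses))    # one tuple of scores per item
    item_variance_sum = sum(pvariance(col) for col in items)
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical answers to four items meant to measure one construct.
# Each row is one respondent; answers within a row move together.
consistent = [
    [5, 5, 4, 5],
    [4, 4, 4, 4],
    [2, 2, 1, 2],
    [3, 3, 3, 3],
    [1, 2, 1, 1],
    [4, 5, 4, 4],
]

# Hypothetical answers to four unrelated items; answers within a row
# do not move together, so the items do not behave like one construct.
unrelated = [
    [5, 1, 3, 2],
    [1, 5, 2, 4],
    [4, 2, 5, 1],
    [2, 4, 1, 5],
    [3, 3, 4, 3],
    [5, 2, 1, 4],
]

alpha_consistent = cronbach_alpha(consistent)
alpha_unrelated = cronbach_alpha(unrelated)
print(f"consistent items: alpha = {alpha_consistent:.2f}")
print(f"unrelated items:  alpha = {alpha_unrelated:.2f}")
```

A high alpha suggests the items can reasonably be averaged into one score. A low or negative alpha suggests they are measuring different things and should not be treated as a single construct.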

Abi: Well, I want to plus one your point about the challenge of the survey design. As you know, I’m working on a study with Nicole, some folks at GitHub and Microsoft, we put together this survey, ran it, and of course, as we’re analyzing results, we’re realizing all kinds of potential flaws in, not only the design of the questions, but even what the questions are really getting at and how they were presented. So such a challenging problem, and so hard to iterate because as you know, we typically do this once per year.

Nathen: Exactly. One of the things that we do is we do some sort of pre-testing with our surveys. That’s another thing that I know researchers use all the time, like, let’s take a smaller set of people and just test them out to make sure that we’re getting valid data on the surface. Of course, then you open it up to a much larger audience. So just like if you ship to 1% of your users and then suddenly open it up to 100% of your users, you’re going to find different things. I think that’s one of the challenges and one of the things that is really interesting. Second, boy, I would love to just be surveying people all year round, lots of small surveys all the time, but nobody wants to answer lots of small surveys all the time. And this was a thing that we’ve put a lot of attention into this year, the length of a survey matters.

We have essentially one shot each year to gather this data. How do we get the most value out of that? And so this year, we did focus on making our survey much shorter than last year. It probably took the average person about 15 minutes to go through the survey, which was significantly shorter than last year. And this, combined with a couple of other things, led to a much better data collection. In fact, we saw 3.6X more responses this year than we saw last year. So having more data means more data we can analyze, more pathways we can find. So we’re pretty happy about that.

Abi: Yeah, I saw the numbers. Congratulations on that. Coming back around to documentation, and I’m really interested in documentation, because it’s something that overlaps with some of the research I’ve done, developer experience. So I remember I was going back to last year’s report and there was a note in the surprises section that mentioned that documentation practices negatively correlated with performance. And there was some skepticism mentioned in those notes. It said, we don’t have data to support this or dispute it at this time. So I’m curious, what did you find this year?

Nathen: So first, before we even got to this year, we did further analysis following last year’s report. And there’s a blog post, I’ll make sure that you have a link for the show notes, but the title is something along the lines of, documentation is like sunshine. And what we found as we dug deeper into the results and we replicated that again this year, was that documentation is one of those foundational things that enables the technical practices that we talk about. So with better documentation, you’re more likely to have better technical practices, like continuous integration, like loosely coupled architecture and so forth. But beyond that, it’s like sunshine in that it amplifies the effect of those technical practices. So documentation leads to better technical practices. Technical practices with better documentation lead to even better organizational performance. So those technical practices or capabilities, you are better at them and they matter more when you have good high quality documentation.

Abi: Interesting.

Nathen: Now, that said, what we see is essentially no direct effect on software delivery performance this year, but definitely, that impact on your technical capabilities and then at the far end, your organizational performance. I would expect to see more deep dives like that across the next couple of months. So definitely, be on the lookout for those.

Abi: It’s really interesting. So to recap what you’re saying, as you drilled down into understanding the effect of documentation on org performance, you found that documentation is really an enabler of the practices, which in turn impact performance, but that documentation by itself, at least in the data, doesn’t show that correlation as much.

Nathen: Right, right, yep.

Abi: And so then the other thing that stood out to me then in this year’s report was when you look at team performance, it seems like the data told a very different story. So share with listeners what you found and why this is the case when it’s maybe not the case for org performance.

Nathen: I think this really goes back to the differences in the way that you measure team and org performance. So team performance, as you saw, we talked about, as a team, are you resilient? Are you able to respond to change? Are you able to be innovative? This is tied, but not directly, with what is your customer satisfaction like? What is your profitability like? These two things are related but not the same thing. And so it’s not like team performance is just a smaller org performance, and then we just expand it to org. They’re really two different things. But I do think that when we look at team performance, we definitely see high quality documentation leads to a 25% better team performance.

Abi: Really interesting. And while we’re on the topic of performance, let’s talk about the performance clusters. Last time you were on the show, we were just fresh off the 2022 report in which the elite cluster had not presented itself. There was no elite category last year. This year, elite is back. So would love for you to first of all remind listeners how these clusters are actually determined. And then last year, you had really interesting thoughts on why maybe there wasn’t an elite cluster. So I’d love your similar speculation as to why you think maybe the cluster reemerged this year.

Nathen: Yeah, that’s great. And thank you for using good language here. When it comes to the clustering and setting these performance levels, we don’t set those in advance. We don’t decide before we look at the data what clusters are going to be there. But each year, we do, do at least one cluster analysis. For the past couple of years, we’ve done actually multiple different levels or lenses of cluster analysis. But consistently throughout all of the reporting in the DORA research, we have done a cluster around software delivery performance. So looking at those Four Key metrics that we talked about earlier, we take a look at those as a group and we try to identify, looking at the data, are there clusters of respondents where in a cluster, the respondents in that area are going to be very similar to one another and yet, all of the people in that cluster are going to be distinct from other groups?

I mean, I guess that’s just a basic definition of what a cluster is. But what we found this year was in fact that there were four clusters. And one of the challenges is we have to put a label on those things. And so we went back to, or we continue to use these labels that we’ve used historically, low performers, medium performers, high performers and elite performers. Now, those designations, just to remind you, are about software delivery performance. So you’re a low software delivery performer, a medium, high, or elite software delivery performer.
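
The conversation does not say which clustering algorithm the team uses, so the sketch below swaps in a simple threshold-based single-linkage grouping. It only illustrates the idea that the number of clusters falls out of the data and the labels are attached afterwards. The respondent scores, the `max_gap` threshold, and the function name are all hypothetical.

```python
from math import dist

# Hypothetical respondents: (throughput score, stability score), each 0-10.
respondents = [
    (1.0, 2.0), (1.5, 1.5), (2.0, 2.5),
    (4.5, 5.0), (5.0, 4.5),
    (7.0, 7.5), (7.5, 7.0), (7.2, 7.8),
    (9.5, 9.8), (9.8, 9.4),
]


def single_linkage_clusters(points, max_gap):
    """Group points so that any two points within max_gap of each
    other end up in the same cluster (transitively)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= max_gap:
                parent[find(i)] = find(j)

    clusters = {}
    for i, point in enumerate(points):
        clusters.setdefault(find(i), []).append(point)
    # Order clusters from lowest to highest average combined score.
    return sorted(
        clusters.values(),
        key=lambda c: sum(x + y for x, y in c) / len(c),
    )


def label(clusters):
    """Attach the historical labels only after the clusters are found."""
    names = {3: ["low", "medium", "high"],
             4: ["low", "medium", "high", "elite"]}
    return dict(zip(names[len(clusters)], clusters))


clusters = single_linkage_clusters(respondents, max_gap=1.5)
for name, members in label(clusters).items():
    print(f"{name}: {len(members)} respondents")
```

Dropping the top two respondents from this made-up sample yields three groups instead of four, which mirrors how a year's data can produce low, medium, and high clusters with no elite cluster.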

Now, as you mentioned, in 2022, we only saw three clusters, so we had to label them something. We chose low, medium, and high. So yes, it is true that elite has reemerged this year. I think that, really, what we’re seeing when you compare the two years is that just the delta and the gap between the lowest performers and the highest performers has actually increased. The low performers of last year or what we might’ve considered low performers last year, we’re seeing maybe slightly worse performance this year than we did last year, and at the high end, a similar thing, maybe slightly better performance at the top end.

Abi: So one of the things that’s also interesting about these clusters is how they’ve changed from year over year. What advice do you have for organizations, leaders who are trying to interpret these changes over time and perhaps, compare their organizations against them or compare the clusters themselves against each other over time?

Nathen: Oh, yeah, we see this question a lot and I talk with people all the time about this. And I think, first, the thing that’s maybe most important is to realize that each year, what we’re doing is providing a snapshot of the data across the industry right now. And based on how we collect data, I’ve mentioned a couple of times here that it’s an anonymous collection of data, we actually have no idea who’s in our sample set from one year to the next. So comparing year over year over year, sure, there are maybe some statistical trends that you can trust as we look broad based across the industry. But the reality is that we are very likely to have different participants each year.

And then you go to a year like 2022 where we had about a third of the respondents that we had this year, that’s going to have some impacts on the data that we’re able to find and the things that we’re able to learn from that. So I think, first, you got to be careful with the conclusions that you draw, looking at what did an elite performer do in 2021 versus 2022 versus 2023, as an example. But I think that the thing that is super valuable is how did your team do in 2021 and 2022 and 2023, and what does your team’s trajectory look like? Because that really helps put all of this into context of your team, your applications, your users, and I think that’s the really important and valuable lesson to take away here.

Abi: Right. DORA isn’t a precise longitudinal study. The data samples each year are a little bit different. So there’s some danger in overly latching on to comparisons against the benchmarks year over year that might cause some internal distress for organizations if they’re too focused on that as you’re advising.

Nathen: Absolutely, absolutely. And you don’t want to be changing the goalposts all the time on your users or on teams within your organizations. What does good performance look like? I mean, my own personal perspective is good performance looks like something that is better than it was the last time we measured. That’s good. I’d much rather see elite improvers than elite performers, personally.

Abi: Right. And that was a point you brought up on the last time you were on the show as well, the advice to really focus on the amount of improvement rather than absolute comparison with industry benchmarks, if you will.

Nathen: Yep, absolutely.

Abi: I want to now go into… and this is something I’ve been excited to talk with you about ever since you were on the show last time because you gave this great explanation of the problem with MTTR at the time. What is the M? Is it median? Is it mean? What even is it? And this year, you guys changed, at least, the name of that metric, from what I can just tell. So first of all, remind listeners what the problems were with MTTR and then explain what drove the change and what is the change this year.

Nathen: Yeah, absolutely. So look, each year, we’ve asked about your time to restore service, and then we report on time to restore service. And oftentimes, we put an M in front of that TTTR or TTR, and of course, everyone argues or forgets or sometimes even, we say different things to different people, whether that M stands for median or mean. And maybe that doesn’t matter so much but probably, it really does. All right, so now we’re going to set aside the M. What does the M stand for? It was usually median. But set that aside for now. Now, we come back to this idea of your time to restore service. Look, what does it mean to restore service? Well, in order to restore service, first you have to know that an incident happened. Second, you have to assemble some part of your system, whether that’s actually your systems or the people in your system as well to come together to resolve that issue.

And then you have to know when it’s resolved. So there could be a lot of challenges in that. Think organizationally, how long did it take you to recover? Well, when did the incident start? Did it start when a customer first noticed it, or when we first started taking action on recovering that? And depending on the organization and what you’re incentivized towards, you may want to answer that question very differently. And so there are some challenges there figuring out what’s the front end of this time? Oh, and similar challenges on the backend, when is it actually restored? Because usually, when a service gets fixed, your customers don’t open up another support ticket to say, hooray, it’s working now. Thank you. So the whole, when does it start, when does it stop, gets frustrating. Second, what can you learn from that over time? And I think this is really the more important piece. This question of what can you learn about how your system behaves, and system, of course, includes the people in it, how your system behaves when there is an error.

And frankly, looking at the mean or the median of your time to restore service can not necessarily give you a whole lot of good insights, especially if you’re comparing multiple types or classes of incidents to one another. Imagine this, Abi, you’ve got a data center where all of your stuff runs, and unfortunately, there was an earthquake and that earthquake took out power to your data center and destroyed all of your machines. How long will it take you to recover from that? The answer to that question is likely very different than you had a configuration change that you pushed out to your servers and it caused an incident. How long is it going to take you to recover from that? These two are very different classes of issues and trying to just bunch them all together and say, well, what’s the median of those two, or even the mean, it’s not really informative.

Abi: Too fuzzy.

Nathen: Yeah, it’s way too fuzzy. So we also then reflected on those Four Key metrics. What are they trying to measure? They’re trying to measure software delivery performance. So maybe that earthquake that takes out your data center, you should be able to recover from that but it has nothing to do with software… Well, all right, it does have some things to do with software delivery performance, but not your day-to-day software delivery performance. So how do we focus this question to really look at, how long does it take you to recover during your software delivery process? So we did the sciencey thing, we asked this question twice this year. We asked it our more traditional way. When a change or when an incident happens, how long does it take you to recover?

And then we asked it a second way, for the primary application or service you work on, how long does it generally take you to restore service after a change to production or release to users results in degraded service and subsequently requires remediation?

So it’s that “after a change to production or release to users” that really focuses us in on that change or that deployment failure itself. And so because we focused in there and we’re not looking at the general time to restore service, in addition to changing how we ask the question or introducing a new way to ask the question, we also changed the label on it. So the new label is in fact your failed deployment recovery time. And when we looked at the data, by the way, we asked it both ways, that failed deployment recovery time loaded much better with the other three software delivery metrics. So it seemed much more in line and statistically relevant to the other three.
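To make the difference between a general time to restore service and a failed deployment recovery time concrete, here is a minimal sketch. The incident log, its cause labels, and every duration are invented for illustration; this is not how DORA computes anything, only a way to see why mixing incident classes blurs the picture.

```python
from statistics import mean, median

# Hypothetical incident log: (cause, hours to restore service).
# All values are invented for illustration.
incidents = [
    ("deployment", 0.5),
    ("deployment", 1.0),
    ("deployment", 2.0),
    ("deployment", 0.5),
    ("datacenter_outage", 240.0),  # e.g. the earthquake scenario
]


def summarize(durations):
    """Return (mean, median) of a list of recovery times in hours."""
    return mean(durations), median(durations)


# Every incident lumped together, regardless of cause.
all_durations = [hours for _, hours in incidents]
# Only incidents that followed a change to production.
deploy_durations = [hours for cause, hours in incidents if cause == "deployment"]

overall_mean, overall_median = summarize(all_durations)
deploy_mean, deploy_median = summarize(deploy_durations)

print(f"all incidents:      mean={overall_mean:.2f}h median={overall_median:.2f}h")
print(f"failed deployments: mean={deploy_mean:.2f}h median={deploy_median:.2f}h")
```

With these made-up numbers, the single outage drags the overall mean far from anything a team experiences day to day, while restricting the data to deployment-caused incidents gives figures that describe the delivery process itself.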

Abi: Well, that’s awesome. After all that work, it’s good to hear that the numbers, the stats were in your favor as well, speaking as a researcher.

Nathen: Yeah, we proved out our hypothesis there. It was pretty good.

Abi: And you mentioned earlier when we were chatting that you also made some changes to how you measure change failure rate, share with listeners what those changes were.

Nathen: This is also interesting. I know I just told you not to compare year over year, but now, you know what I’m going to do, Abi, I’m going to compare year over year. So if we look back at 2021, your change failure rate for elite performers was zero to 15%, for high and medium, it was 16 to 30 for both of them, and then also for low, it was 16 to 30. There wasn’t a whole lot of variance there except for those elite performers. And then similarly, in 2022, the numbers, there was more spread but maybe still not so much variance there. And this is a pattern that we’ve seen actually for many, many years. In fact, if you go back to the book, Accelerate, and read what Dr. Nicole Forsgren wrote there, you’ll notice that change failure rate has never really loaded properly with the other three.

Statistically, it doesn’t sit well with them. So we were wondering why is that. Well, we looked at how you were able to answer the question in the past. So in previous years, as an example, in 2022, one of the ways that you could answer that question, it’s not on a Likert scale, but instead, we gave you buckets, buckets wherein you could answer that question, zero to 15%, 16 to 30%, 46 to 60% or whatever the number was. Don’t quote me on those numbers, but there were five distinct buckets of numbers that you could select. So we had a hypothesis this year that if we gave respondents a little bit more control over their answer, one, that number might load better.

And two, if we look back a decade ago when this research started, we maybe didn’t have a good handle on how frequently our changes were failing. But I think today as the industry has matured, we’re learning more about that and we have better access to that data perhaps, and certainly to a much better lived experience around that data. So we added a slider this year. Instead of selecting one of those buckets, you could select any whole percentage between zero and 100%. And we were happily surprised to see that, yes, in fact, the answers now load properly, so that all four metrics are statistically significant and join together into one construct. Hooray for science and for data. And that also helped us get much better fidelity on the differences between low performers, medium performers, high performers and elite performers.
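A small sketch can show why a whole-percent slider gives better fidelity than buckets. The bucket edges below are assumptions (the interview only says there were five buckets and is explicit that the quoted ranges are approximate), and the team names and deployment counts are invented.

```python
def change_failure_rate(failed, total):
    """Percentage of deployments that needed remediation, as a whole number."""
    if total == 0:
        raise ValueError("no deployments recorded")
    return round(100 * failed / total)


# Hypothetical bucket edges; the exact survey ranges are not given in the interview.
BUCKETS = [(0, 15), (16, 30), (31, 45), (46, 60), (61, 100)]


def to_bucket(rate):
    """Map a whole-percent rate to the label of the bucket that contains it."""
    for low, high in BUCKETS:
        if low <= rate <= high:
            return f"{low}-{high}%"
    raise ValueError(f"rate out of range: {rate}")


# (failed deployments, total deployments) per team, invented for illustration.
teams = {"team_a": (4, 25), "team_b": (7, 25)}

for name, (failed, total) in teams.items():
    rate = change_failure_rate(failed, total)
    print(name, f"slider={rate}%", f"bucket={to_bucket(rate)}")
```

In this example the two teams have noticeably different rates, yet both land in the same bucket, so a bucketed answer cannot tell them apart while a slider answer can.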

Abi: Well, for context for listeners, and we talked about this on our show last time you were here, this research has been going on for such a long time and you are still making these improvements and refinements to how these things are defined and measured, just goes to show how tricky it is to measure software delivery, developer productivity, et cetera. So congratulations on the progress, and I think listeners will find this really interesting. I want to now shift over to AI, and I wouldn’t say I was surprised to see AI mentioned in the report, given the times, but of course, AI is the hype right now. So I want to hear from you guys, what did the data say? What were your investigations focused on? What should listeners take away about the impact of AI on development and software delivery?

Nathen: Yep, for sure. And of course, AI is in the survey. It is top of mind for everyone right now, but we did run into a challenge, how do we even assess that capability within a team? And so when we sat down at the beginning of the year to design the survey and get it created, even just going back, I mean, it’s October of 2023 right now, if you go back to January of 2023, the AI conversation was not the same as it is today. It is evolving super fast. So one of the things that we did, and the way that we assessed AI was we said this, for the primary application or service that you work on, how important is the role of artificial intelligence in contributing to each of the following tasks? And then we listed, it looks like about 20 tasks. I’m not going to read them all for you, but things like analyzing logs, monitoring logs, organizing user feedback, identifying bugs, writing code, all of these were things that, potentially, you could be using artificial intelligence for them.

And what we found was that a little over half of our respondents are using AI in some fashion across one or more of those tasks. What we also found in our data was that on the teams that are using more AI, we see a moderate impact on the wellbeing of the people on those teams. So they’re happier, they’re more productive. This makes sense. I’m an engineer, I want to play with the new toys also. AI is a new tool. Let’s put it into my hands. Let me start kicking the tires, playing with it. That’s a lot of fun. What we haven’t seen yet, though, is that improvement flowing all the way through to organizational performance, and frankly, we hypothesize that this makes sense because it’s all just so new. It hasn’t really had a chance to take hold within teams and start driving that organizational performance.

The other thing that’s true about AI, I talked about January 2023, what was the conversation in AI, in July, our survey closed and that even from July to today, October, things just are continuing at this super rapid pace. So again, this speaks to the idea that this is a snapshot, a moment in time of how things are going. So to recap, what we see is, one, lots of people are using it. Two, it’s driving or improving wellbeing for individuals on your team. Three, we’re going to have to wait and see on impacts on organizational performance.

Abi: Definitely such a fast-evolving space. You mentioned you had respondents rate a number of different tasks that they could potentially be using artificial intelligence for. Were there any in particular where you saw, currently at least, a higher level of usage of AI to assist in those tasks than others?

Nathen: Oh, yeah, Abi, that’s a great question, which particular things stood out? Honestly, when we asked, we gave users or participants a scale slider, again, they could select that AI is extremely important, all the way down to, on the low end, not at all important. And so what we’ve reported is that a large number, more than half of our respondents, did not answer “not at all important.” So for over 50% of our respondents, it plays some role. And then in the 20 to 40% range is where we see those users saying it’s extremely important. But the truth is that when you look at the report on page 26, you’ll see this graph, it’s not quite a vertical line in terms of tasks. It’s pretty darn close though. None that really, really, really stand out relative to any of the others, I would say.

Abi: Well, as you said, I think it’ll be really interesting to see where we’re at next year when you guys run the next one. I’m sure the results will look very different.

Nathen: Yeah. I’m excited for us both to find better questions to ask to help assess how AI is important to you and how you’re using it, but also, just to see how… We’ll probably repeat this question necessarily so that we can see some trends, see also our earlier comments about year-over-year comparisons of the data. But it’s still interesting.

Abi: Well, Nathen, you’ve been a great sport letting me go deep into specific questions and changes in regards to this year’s report. To wrap up, I want to invite you… Any other key takeaways or themes that you think would be valuable to listeners to take away from this conversation?

Nathen: There is one thing that I think is really fascinating and I’m intrigued to continue digging into this. One of the questions that we asked about this year was about code review speed. So thinking about, I’m an engineer, I wrote some code, I hand it off to you to get your review on it before I maybe put it into the pipeline. We’ve seen so many teams having this process in place and we’re really interested in what is the impact of that and how might you improve it and what does it do for your overall performance. And one of the things that we found was that faster code reviews really unlock software delivery performance. In fact, we saw that speeding up code reviews led to 50% higher software delivery performance, 50% higher software delivery performance. That is impressive. There are a lot of interesting takeaways from that though.

So the first, we have to remember what DORA is really here to help you do, is find your constraint. If your code reviews are already fast or maybe even non-existent, don’t try to make them faster. That’s not your bottleneck. But where your code reviews are slow, I think you have a big opportunity to improve there. And when you think about code reviews, think back to, I don’t know, the DevEx framework, where we talk about flow and reducing cognitive load and getting faster feedback. Wow, code reviews play into all three of those criteria around developer experience. I think that’s really cool. But also, DORA, every year, continues to find and reiterate how culture matters. Think about the culture of a code review, and how much do you trust the other people on your team. We also look at how work is distributed across teams, who is being asked to do your code reviews and what sort of feedback is going into the code review based on who requested the code review.

I think that’s really interesting. But then I think the other thing that’s super interesting, we just got off this topic of AI, we see it’s starting to play in here. Well, what if you are on a team that’s interested in an AI experiment, bringing in some AI, and you have slow code reviews? Boy, this is where I think it all comes together. So now we start asking this question, on my team, how can I use AI to speed up my code reviews? And immediately, you can probably imagine 16 different ways. But then I think, and this is, again, where the power of DORA comes in, I think, now, before we introduce that, let’s set some baselines, how long are our code reviews taking?

What do our software delivery performance numbers look like? And literally, you can set that baseline in three minutes with your team in a conversation. With that baseline in place, now let’s introduce this AI or any new capability. Let that mature for a minute. Let’s get good at that capability, and then let’s retest against our baseline. Is it having the right impact? The impact that we expected? This process allows us to essentially take the scientific method and apply it to the work that we’re doing. To me, that is super exciting.
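The baseline-then-retest loop described here can be sketched in a few lines. Everything below is an assumption made for illustration: the review durations are invented, and using the median as the summary is just one reasonable choice, not something the report prescribes.

```python
from statistics import median

# Hypothetical code review turnaround times in hours (request to approval).
# "baseline" is measured before introducing a new capability, "retest" after
# the team has had time to get good at it. All values are invented.
baseline = [30.0, 26.0, 48.0, 20.0, 36.0]
retest = [10.0, 12.0, 30.0, 8.0, 16.0]


def compare(before_samples, after_samples):
    """Return (median before, median after, percent change) for two samples."""
    before = median(before_samples)
    after = median(after_samples)
    change_pct = 100 * (after - before) / before
    return before, after, change_pct


before, after, change_pct = compare(baseline, retest)
print(f"baseline median: {before:.1f}h")
print(f"retest median:   {after:.1f}h")
print(f"change:          {change_pct:+.1f}%")
```

The same comparison could be run on the software delivery metrics themselves, so the team can check whether a faster review step actually shows up in overall delivery performance.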

Abi: I share your fascination with code review, and I love that you touched on the developer experience framework because code review is this amazing intersection of so many technical and social factors, and not just within a team, but often cross-team. Documentation comes into play, technical debt and code quality come into play, process comes into play. So really interesting finding. Thanks for sharing that. Nathen, it’s always great to have you on the show, such a big fan of what you and your team are doing, continuing the DORA program. Thanks so much for your time today. Really enjoyed the conversation.

Nathen: Thank you so much, Abi. It’s always a pleasure and we’ll talk soon.
