
DX Core 4: 2024 benchmarks

In this episode, Abi and Laura dive into the 2024 DX Core 4 benchmarks, sharing insights across data from 500+ companies. They discuss what these benchmarks mean for engineering leaders, how to interpret key metrics like the Developer Experience Index, and offer advice on how to best use benchmarking data in your organization.

Timestamps

  • (0:42) What benchmarks are for
  • (3:44) Overview of the DX Core 4 benchmarks
  • (6:07) PR throughput data
  • (11:05) Key insights related to startups and mobile teams
  • (14:54) Change fail rate data
  • (19:42) How to best use benchmarking data


Transcript

Abi Noda: Good to be back with everyone. Last time on our show, we did our initial reveal of the DX Core 4 framework. We’ll talk a little bit more about this today to catch folks up who weren’t able to attend that previous session. But today, what we want to talk about is benchmarks. We just published our first set of Core 4 benchmarks, so we want to let everyone know what these are, discuss some of the interesting findings that we’re seeing in the data, and then also talk about advice for how to actually use benchmarks. Folks are always asking us for benchmarks, but then the next question we always get is, “So how should we actually use these? Which benchmark should we use?” So we want to talk about that today. As always, folks, please leave questions in the chat for the Q&A. Laura and I will be keeping an eye on that and weaving topics from the chat into the conversation.

So let’s dive in, Laura. I think we should start by just catching folks up on the DX Core 4. We did a session on this about a month ago, but the DX Core 4 is a new framework that we’ve published. People are always asking, “Why publish another framework? There’s already all these other frameworks out there.” And the reason is actually because people have constantly been coming to us asking us, “Hey, there’s DORA, there’s SPACE, there’s DevEx. With all these different frameworks, how should we actually get started measuring productivity?” So we’ve been asked that question. Nicole, Margaret, the researchers we collaborate with, are always asked that question as well. We really wanted to put something together that is our recommendation on a good starting point that encapsulates all those different frameworks. And of course, as soon as we put this framework together, folks started asking, “Do you have benchmarks? What is good for these metrics?” Laura, I mean, in your experience, why do people want benchmarks?

Laura Tacho: Yeah, I mean, I think we’re naturally inclined as leaders to want to contextualize our performance against other people in our industry. I think we want to be able to set high standards, see how we’re doing against our peers. Benchmarks can also be a really effective way to protect investment in stuff that’s working now. I think that’s sort of a secret way to use benchmarks that’s not often talked about because you can see that you’re doing well, which means you got to keep investing, got to keep doing what you’re doing, otherwise you’re going to fall behind.

A lot of us are into benchmarks, especially when it comes to developer productivity data, because of DORA. They just put out their 10th DORA report, and they really brought this idea of benchmarking software delivery performance to the masses, as it were. I’ll just pick this up because it’s permanently sitting on my desk. I think everyone’s seen this book. But to your point, Abi, DORA covers software delivery capabilities. We’ve come up with the Core 4 to cover a much more comprehensive look into developer productivity, and now we’re excited to share some benchmarking data so that you can see how your own teams are doing against the rest of the industry.

Abi Noda: I heard a really great quote a few weeks ago from a CTO. He said, “Fast is relative to your competition.” If you’re a FinTech company or a bank, your processes aren’t going to be necessarily the same, your velocity may not be the same as a consumer tech startup in Silicon Valley. So I think benchmarks really help you contextualize, what does good look like for a company or organization like ourselves? Laura, could you do the reveal of the benchmark page website we put together and walk folks through what does this benchmark data set actually consist of?

Laura Tacho: Yeah, so if you want to follow along at home, you can go to our website, getdx.com, go to the research tab, and you’re going to find a link to the Core 4 benchmarks. You’ll be on the same page and can look at some of the numbers as Abi and I talk through them. So right now, we’re sharing benchmarks on the four primary metrics that go along with the four pillars in the Core 4 framework, and this data set consists of data from 500+ companies that are using the Core 4 framework, along with data sets that have come through our industry research. This is primarily self-reported data, which I think is an interesting point here. We actually do have workflow data as well, meaning automatically collected data, and one thing Abi and I found very interesting was that there’s actually not that big of a difference between the two. For a lot of these things, the measurement can span multiple tools, and so what we found is that the self-reported data is actually much more reliable, which is why we’ve gone with that.

Another thing I thought was very interesting: for each of these benchmark segments, we have at least 30,000 data points, so there’s definitely a lot of data going into this benchmarking data. Right here you’ll see tech versus non-tech, and you can see some of the percentiles, but if you download the full set of benchmarking data, we also have it segmented out by company size. And then, also, mobile engineers have different workflows, different challenges, so we have a special segment just for mobile engineers, and you can find that in the full benchmarking data. You’ll just get a nice Excel sheet of the numbers.

Abi Noda: Yeah. I was going to say, Laura, that Excel sheet’s a big data dump, so we didn’t want to present it today, but I definitely encourage folks. Go to the website, you can download the data file. There’s a lot of data to comb through.

Laura Tacho: Yeah. Abi, I mean, obviously you’ve looked at that big data dump. We’ve all spent some good hours poring through all of it and kind of drawing our own conclusions. I’m really interested to have you share with everyone joining us today: what are some of your big observations, big trends that you’ve seen in the data?

Abi Noda: Well, there’s so much data. Honestly, every time I go stare at that spreadsheet, my eyes bleed a little bit, so we’re only going to be scratching the surface today. I’ve been continually going back and looking at the data because there’s just so much in there that I think is fascinating. I was immediately drawn to the PR throughput data. That’s just something that folks look to a lot as a signal of velocity. That’s a whole other topic we may touch on later today. But one thing we see in the data that was a little bit surprising was that, as a whole, we see a median of about 3.5 PRs merged per week across the industry. One thing that stood out was that we do see higher PR throughput from tech companies than non-tech companies, and we’ve been speculating about why that might be the case. But Laura, what are your thoughts on that, and what have you seen in that data? I know you’ve looked at it a little more closely.

Laura Tacho: Yeah, I think you brought up a good point in one of our earlier conversations, which is, I’ve worked in, I mean honestly, mostly tech companies, specifically mostly developer tooling companies, but I’ve also worked in companies that are in edtech, finance, whatever. And I think that you had an observation that the rhythm of business feels different in some of these companies. And I think, even looking within tech, a startup has a really different rhythm of business than an enterprise tech company, and we see that reflected here. We see 4.27, we’re looking at P75, versus 3.5.

Another thing, these differences I think can feel really subtle, like the difference between 4.27 and 3.59. But if you think about it, the difference between just the median performance, so P50 across all companies, we’re looking at 3.5. And some of the highest performers are tech startups. When we get over here to P90, we’re looking at almost five PRs per week. That’s closing five PRs versus 3.5 PRs. What does it take organizationally for you to have enough work to do to close another whole PR? I mean, this is not just about batch size. There’s a lot of organizational movement that’s behind that change. But just like, that’s so much work. It’s like a fifth, depending on how you’re looking at it. It’s so much more output. It’s so much faster.

Abi Noda: Yeah, and I think you brought up a good point. The numbers seem close, but when you extrapolate across an entire organization, in some cases we’re talking thousands more PRs per week across the organization. The rhythm of business point came to mind for me, but as we’ll discuss later, the developer experience data tells a little bit of a different story in terms of actual effectiveness, friction, and efficiency. So yeah, I don’t know if we have the answer on PR throughput, but it’s definitely a really interesting one to touch on.
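
To make that extrapolation concrete, here is a minimal back-of-the-envelope sketch. The organization size is an assumption for illustration; the per-developer figures are the ones discussed above, not exact benchmark values.

```python
# Back-of-the-envelope illustration of how a small per-developer gap compounds.
# The organization size is an assumption, not a figure from the benchmark data set.
developers = 2000                 # assumed size of a large engineering organization
median_prs_per_week = 3.5         # ~P50 PRs merged per developer per week (discussed above)
top_prs_per_week = 5.0            # ~P90 figure discussed for top-performing tech startups

extra_prs_per_week = developers * (top_prs_per_week - median_prs_per_week)
print(f"Extra PRs merged per week across the org: {extra_prs_per_week:,.0f}")  # 3,000
```

At that scale, the “3.5 versus 5” difference really does translate into thousands of additional PRs per week.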

And I saw a question about this to clarify the difference between tech and non-tech. The difference is non-tech refers to companies whose core business is not selling a technology product. So their core business isn’t software, so we’re talking banks, airlines, pharmaceutical companies, traditional enterprises that also have software engineering as a capability, but it isn’t their core product. So that’s the difference in terms of the industry segments.

Laura Tacho: Yeah. I think actually, weaving speed into developer experience, there’s quite a nice pattern match here, because we can look at the highest performers on the Developer Experience Index. The DXI is a predictive measure of the key developer experience drivers of engineering performance. So basically, how easy is it for a developer to get work done at your company? How much friction do they encounter?

And we can see that small tech companies have the best Developer Experience Index. So in this case, a higher number is better. And then we can see that they’re also the ones closing five PRs per week, compared to their counterparts more in the median range, who are closing far fewer PRs per week and whose Developer Experience Index is quite a bit lower. So we can look at developer experience as being directly related to how fast work can flow through a system. Diffs per engineer is, we’ve talked about this a lot, somewhat of a difficult metric to talk about sometimes because there’s been a lot of opposition to measuring things like speed through diffs per engineer. And Abi and I can’t emphasize this enough: we need to look at these numbers in relationship to other numbers. So telling the story of speed, but also developer experience, is important. Speed but also quality, speed but also innovation, we need to tie this all together, which is at the heart of the Core 4 framework.

Abi Noda: One of the things we saw that was interesting with the Developer Experience Index, and again, folks, go to the website. We have a white paper on the Developer Experience Index, or DXI. Like Laura said, it’s a measure of really the key performance drivers for engineering effectiveness. And Laura, you pointed this out: one thing that was really interesting in the data is that there’s quite a bit of variance within the tech startup segment. There’s a pretty large disparity between P90 and P50, which I interpreted as meaning that for startups there’s a pretty big spectrum, with some startups being really, really effective and some being pretty awful. Is that the right way to put it? Awful being a harsh word, maybe some are struggling more?

Laura Tacho: Struggling, yeah. I think that’s actually a really good question, and thank you, Julia, for asking about our sort of jargony P50, P75, P90, and thanks, Matt, for answering that question. So the 50th percentile is just the median performance, P75 would be the top quartile, and then P90 is the top 10% of performers here.

And so, we can look at these things in averages in different ways. And a lot of times, when we look at the mean average, which is just adding up all the figures and then dividing by the number of respondents, there’s a lot that gets lost. There’s a lot of context that gets lost because we’re just left with one number. But if we look at the distribution, and in the DX platform you can actually get this data at a more granular team level, you can start to see these outliers. You can see how different types of teams are performing really, really well or have amazing throughput or really high quality, where other teams might be struggling in different ways. And you can see the distribution, and that tells a different story and gives you even more information about where interventions might be most impactful, where a mean or some of these other averages might obscure that more targeted information pointing you to where an intervention might be most helpful.
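
As a minimal sketch of what those percentile cuts mean in practice (the per-team values below are made up for illustration, not benchmark data):

```python
import numpy as np

# Hypothetical per-team values, e.g., PRs merged per developer per week; illustrative only.
team_values = [2.1, 2.8, 3.2, 3.5, 3.6, 3.9, 4.1, 4.3, 4.8, 5.2]

p50, p75, p90 = np.percentile(team_values, [50, 75, 90])
print(f"P50 (median):            {p50:.2f}")  # half the teams are at or below this value
print(f"P75 (top quartile):      {p75:.2f}")
print(f"P90 (top 10% threshold): {p90:.2f}")

# The mean collapses the whole distribution into one number and can hide the
# outliers and struggling teams that the percentile view surfaces.
print(f"Mean:                    {np.mean(team_values):.2f}")
```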

Abi Noda: Laura, another thing that stood out to me, if we go to the percentage of time spent on innovation versus… So to give folks some context on this, this is one of the key metrics we look at in the Core 4: how much engineering time is actually going into elective work, developing new features and new capabilities, as opposed to other things like meetings or putting out fires and bugs and KTLO (keeping the lights on). One of the things, and we’re not showing it on the screen, it’s available in the data file that folks can download from this page, is that we saw mobile engineers have significantly lower KTLO. And this was a bit striking, the gap we saw in the data. Curious your thoughts on it. Why is that the case?

Laura Tacho: Yeah, I think, man, mobile engineers really have some just different workflow patterns. Their deployment targets are just really different. And so, it makes sense: if you’re not literally operating the devices your code runs on, you have less operational surface to cover, so it makes sense that you’re spending less time on maintenance and KTLO. You have more time for innovation.

I think it’s interesting: that time could have gone to KTLO, to new features, to bugs and so on, but mobile engineers are spending more time innovating, which I think is pretty cool. We don’t have a lot of data on mobile engineers and how they’re spending their time. We don’t have a lot of data on mobile engineers and the core DORA metrics or the key software delivery capabilities that DORA reports on. That’s always been a question that I get asked all the time: “Well, I’m a mobile engineer. Does this actually apply to me?” So it’s really exciting, and I’m so glad that we took mobile engineers as a distinct population segment and are able to call that out and offer it. Again, it’s not in the summary that we have on the website, but if you download the full version, you can see all these values specifically for mobile engineers, which is pretty cool.

Abi Noda: Laura, let’s switch over to change fail percentage. I know you’ve had some interesting observations there on what you’re seeing.

Laura Tacho: Yeah. So when I looked at this, for those of you who might not be familiar with change fail percentage, this is the percentage of changes that, when they get into production or a customer-facing environment, result in degraded service, so something that you have to go in and fix. Change failure rate is one of the four key metrics from DORA, and historically, anything under 5% is considered really good when it comes to DORA. Across companies that are high, medium, and low performers, we’re seeing anything from 15% to 40% change failure. Well, for the companies that are using DX Core 4, it’s a wildly different story. These are all pretty clustered up under 5%. And so, I was very curious about this. To me this means, hey, this is a good guard metric just to keep in the background, to know that if you slip, that’s a warning signal.
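
As a rough sketch of the calculation behind that metric (the deployment counts below are hypothetical, not from the benchmark data):

```python
# Hypothetical illustration of the change fail percentage calculation.
changes_shipped = 480      # assumed: changes deployed to production over the period
changes_failed = 19        # assumed: changes that degraded service and needed a fix or rollback

change_fail_pct = changes_failed / changes_shipped * 100
print(f"Change fail percentage: {change_fail_pct:.1f}%")  # ~4.0%, under the 5% bar discussed
```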

Even though the differences are subtle, I was also really surprised that the big tech companies here actually have the highest change fail percentage. And in this case, higher is worse, so it means more of your changes to production are failing. Abi, I wonder, do you have a hypothesis? Why are these values slightly higher for enterprise tech versus other parts of our population?

Abi Noda: Yeah, I can only speculate, but one thing we see with the larger scale-ups in tech companies is that they’ve had rapid growth. They’ve oftentimes over-hired a little bit. They have a lot of churn. And so, maintenance is a real challenge. Knowledge of systems has left the organization, and as a result, I think you see a lot of larger tech companies really focused on reliability now and reining in some of the side effects of the rapid growth. So that’s one perspective on the problem, but curious what your thoughts are, Laura.

Laura Tacho: Yeah, I mean, I think that’s true. Often when I talk to companies, and when you look at that distribution and can go down more granularly into the team or service level, you’ll also see disparity. You’ll see some old legacy thing that never gets updated, that no one knows. I don’t know, maybe it’s written in COBOL or whatever. No one’s touched it in decades, and then you have to go change something, and of course it breaks, versus the thing that’s being iterated on every single day.

I don’t know if this hypothesis has any weight to it, but I’m just thinking a lot about how AI is helping us write more code, and I have always sort of viewed source code as a liability, not really an asset. And I wonder if there’s some correlation there: do big tech companies just have more code lying around, in a way that looks slightly different from big non-tech enterprise companies? I don’t know. These are some of the things that I’m thinking about. I don’t really have any data to back it up, but it’s something that I’ve been thinking about as to why this disparity might be there. But it’s hard to say because the differences are so subtle in this population compared to, for example, DORA’s population, where we have 40% change failure versus 5%. So really interesting on this quality aspect.

Abi Noda: And Laura, you touched on this earlier, but for folks looking at this data, it’s quite clustered together. These numbers are all pretty close, but we want to emphasize again, and Laura actually brought this up with me when we originally started looking at this data, that these aggregates are at the organizational or company level. When we go and double-click into the team level for this metric, we see a lot more variance. We did see teams with 10%, 15%, even 20% change fail percentage. So folks should keep in mind that this is averaged across organizations. When you double-click at the team level, we see a lot more variance, more in the range that we see in some of DORA’s research benchmarks as well.

Laura Tacho: Yeah. One question here: “Is there any significance to the change fail rate being lower for non-tech as opposed to higher for tech?” Yeah, one other thing that I thought about here was, are bigger tech companies willing to accept more risk? Is that the kind of sociotechnical circumstance here, where we’re talking about things like finance, insurance, automotive, other industries that are maybe just not able to accept more risk? And then, we can also look at this again. We want to have opposition in our metrics. We want some tension between them. So we’re going to look at P75, and we can go over here to diffs per engineer. We do see that even though the big non-tech companies in the 500+ segment have a lower change failure rate, they’re shipping less frequently, so we can start to tell an interesting story about what might really be happening there. It’s hard to draw a ton of conclusions based on just the data points here, but definitely some trends are emerging that are interesting to discuss.

Abi Noda: Laura, let’s wrap up by talking about advice on how to best use benchmarks. Folks want this benchmark data. It’s really interesting to look at, how can folks put these benchmarks to use in their organizations to drive value and change?

Laura Tacho: I think that we always emphasize getting a baseline as soon as you possibly can. That’s another reason that self-reported data is so useful in this context: it’s just faster to get. And what we have seen is that the difference between the self-reported data and the automatically collected data is not significant enough to matter. For the purposes you need this for, it’s high fidelity enough. So you can look to external benchmarks like this Core 4 data, if you want to download it and look at it. Do that to set high standards, do some gap analysis, and try to figure out where you need to protect your investments, but you also should look at benchmarking data internally. And I think that’s the step a lot of us skip: we do an assessment based on external benchmarking data, but we don’t actually close the continuous improvement loop and look at that same benchmarking data quarter after quarter internally to see if we’re making progress against the goals that we’ve set. So that would be the first thing that I would start with. What do you think, Abi?

Abi Noda: Yeah, I think the most important thing to me is, look, we work with a lot of leaders who are trying to bring positive change to their organization, make lives better for developers, help the organization be more effective. And we’ve heard stories of folks going to their CEO or CTO and saying, “Hey, our lead time is X,” and the CEO or CTO says, “So what? Why do we care about lead time?” And so, it’s so important for folks to understand that one of the important pieces with benchmarks is that you’re able to tell a story to stakeholders. You’re able to go to people and say, “Look, here’s where we’re at, and we’re behind our peers. We’re behind industry top performers. Here’s how we can get better. Here’s how we can compete with industry peers.”

And being able to frame your data in that way is where benchmarks are critical. We find that without that, it’s just four PRs versus five PRs per week. Who really cares, right? Like I said earlier, fast is relative to your competition. So talking about these numbers in the context of your competitors and industry peers, that’s where I think you really get attention from executive stakeholders. That’s where you’re able to really drive investment, and that’s where you’re able to tell a really great story when you drive those improvements and you’re now ahead of your industry peers. So that’s the advice I have for folks on how these are used.

Laura Tacho: Yeah, I can share an example with the audience here, just an example of how I would do this. So innovation ratio, again, we’re looking at the percentage of time spent on building new features and capabilities. So this is an important business impact metric because we have the difference between building the thing right and building the right thing. And a lot of times, engineering doesn’t get to choose to build the right thing. We have to build the thing right, so we’re talking more about not what gets built but how it gets built. And so, the best thing that we can do is when there is opportunity for innovation, make sure we have capacity to do it by having low tech debt, frictionless processes, whatever.

When we look at some of these numbers, we can see, let’s just call it out as a 6% difference if we’re looking at tech companies, startups versus the enterprise. And so, we might say, “Oh, a 5%, 6% difference in innovation rate,” which, who cares? If I come to Abi and say, “Well, we’re like 6% behind our peers,” Abi’s going to say, “Laura, I don’t care. Tell me something meaningful,” which is totally fair. But if I come to him and say, “Abi, every single week, our competitors are spending two hours per developer working on brand new customer-facing features. That’s two hours that we’re not spending, and that adds up to two whole full-time engineering teams each quarter,” that is something that Abi is going to, I don’t know, throw his hands up at and say, “Okay, how do I fix this? What do you need?” That is a much more compelling story to tell with data than just coming with plain benchmarking data.

The whole point of having access to data like this is to figure out how it applies to the particular problems that your company has and specifically the goals that you have. So it’s not just about the 6% difference; figure out what that actually means. Figure out what it means for your competitors, and then figure out what you need to fix it, because not everything that can be fixed should be fixed. If you’re going after half a percent of innovation ratio change, that might have diminishing returns based on the amount of effort that you need to put in.
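
A minimal version of the arithmetic Laura walks through above, assuming a 40-hour week and an illustrative organization and team size (those are assumptions for the sketch, not benchmark figures):

```python
# Hypothetical illustration of turning an innovation-ratio gap into concrete terms.
hours_per_week = 40        # assumed working week
innovation_gap = 0.06      # ~6% difference in innovation ratio (from the discussion above)
developers = 250           # assumed organization size, for illustration
team_size = 6              # assumed size of one engineering team

hours_per_dev_per_week = hours_per_week * innovation_gap   # ~2.4 hours/developer/week
equivalent_fte = developers * innovation_gap                # same gap expressed as full-time engineers
print(f"~{hours_per_dev_per_week:.1f} hours per developer per week")
print(f"~{equivalent_fte:.0f} full-time engineers, or ~{equivalent_fte / team_size:.1f} teams' worth of capacity")
```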

Abi Noda: And Laura, it’s important to call out. Again, part of the philosophy and the design of this Core 4 framework is the balance in tension between these different dimensions and metrics. So I saw someone bring up, “Isn’t incentivizing focus on innovation over KTLO, won’t that lead to degraded quality and increased tech debt over time?” That is absolutely the case if you’re only focusing on this metric without the others, but we have change fail percentage, we have developer experience as other key metrics in this framework, and those tell the story of those opposing considerations that are also important for effective engineering. So again, that balance here, when you’re going and asking for investment in a specific area, you have options and different paths here in terms of where to tell the story you want to tell, anchoring to these different dimensions.

Laura Tacho: I think someone mentioned Goodhart’s law, and I think that was a great example of like, “Okay, are we just going to optimize for one thing,” and then it’s no longer a good measurement. Goodhart’s law really applies when you have a one-dimensional system of measurement, and that’s the opposite of what Core 4 is. But people can still make mistakes or hyper-focus on one thing, so whenever you see one of these numbers presented all by itself or outside of the context of a discussion, to me that’s a bit of a red flag that we might be getting into some bad territory. Make sure to look at these as a basket of metrics, like a nice collection. That’s how they were designed to be used, so that’s the appropriate way to use them. It’s not an appropriate way just to take out innovation ratio and look at it by itself. We can say that pretty authoritatively as the authors of this framework. That’s not how we designed it, so please don’t do that.

I think another question that we get about benchmarks is, what should we be aiming for? And so, Abi, I’ll let you take that one. And to answer another question: yes, we do have the raw data set, and you can go ahead and download it. Just go to GetDX, look at the research tab, then benchmarks. You can download the full raw anonymized data set at the top. You’re going to see P50, P75, P90. So Abi, what should people be looking at?

Abi Noda: Yeah, this is such a common question we get. What should we actually be aiming for once we get ahold of this data? And it depends, of course. It’s situationally dependent. If you’re behind P50, P50 might be the right place to aim. But generally our default recommendation is P75, which is really about being a top quartile performer. You want to be above P50, but P50 isn’t really aspirational because that’s the middle of the group. Whereas P75, that top quartile bar, is a good place to aim while not being too extreme like P90. So that’s our recommendation typically.
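
As a small sketch of the gap analysis Laura and Abi describe, here is one way to place your own baseline against the benchmark percentiles; all the numbers are illustrative stand-ins, not the published benchmark values.

```python
# Hypothetical gap analysis against benchmark percentiles; numbers are illustrative.
benchmark = {"P50": 3.5, "P75": 4.3, "P90": 5.0}   # e.g., PRs per engineer per week
our_value = 3.8                                     # assumed: your organization's current baseline

if our_value >= benchmark["P90"]:
    band = "at or above P90"
elif our_value >= benchmark["P75"]:
    band = "top quartile (P75 or above)"
elif our_value >= benchmark["P50"]:
    band = "above the median"
else:
    band = "below the median"

gap_to_target = benchmark["P75"] - our_value        # default target discussed: top quartile
print(f"Current band: {band}; gap to the P75 target: {gap_to_target:+.2f}")
```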

Laura Tacho: Yeah, I think that’s a very strong target for teams to try to aim for that top quartile performance.

Abi Noda: Well, Laura, let’s do some wrap up. So folks, again, thanks for joining the conversation today. You can go to our website, getdx.com, to download these benchmarks, look at the visualizations we have and the explanations about these metrics. We also have the original Core 4 white paper on the website as well, and more information on the Developer Experience Index. And of course, if you want help collecting this type of data or analyzing and benchmarking it, definitely check out the DX platform. You can get a free demo of the DX product at our website. Thanks so much, everyone, and Laura, thanks to you as well for joining me today.

Laura Tacho: Yeah, thanks Abi. Wonderful. Take care, everyone.