SPACE framework, PRs per engineer, AI research
In this episode, Brian Houck, Applied Scientist, Developer Productivity at Microsoft, covers SPACE, DORA, and some specific metrics the developer productivity research team is finding useful. The conversation starts by comparing DORA and SPACE. Brian explains why activity metrics were included in the SPACE framework, then dives into one metric in particular: pull request throughput. Brian also describes another metric Microsoft is finding useful, and gives a preview into where his research is heading.
Timestamps
- (0:48) SPACE framework’s growth and adoption
- (3:47) Comparing DORA and SPACE
- (6:30) SPACE misconceptions and common implementation challenges
- (9:34) Whether PR throughput is useful
- (15:13) Real-world example of using PR throughput
- (21:33) Talking about metrics like PR throughput internally
- (24:39) Where Brian’s research is heading
Transcript
Abi Noda: Brian, finally great to have you on the show. Super excited to chat with you today.
Brian Houck: Yeah, thrilled to be here.
Abi Noda: Let’s dive into it, Brian. You and I have been on the same conference circuit lately, bumping into each other at different places, and one of the recurring topics is this SPACE framework. It’s still something that’s really top of mind for folks. A lot of people are still just getting into it for the first time, but it’s been a couple years since you originally published it.
Would love to talk through some of your reflections on SPACE, some of the challenges you see with it, what folks are asking you about, how you’re seeing people use it. To start, it’s been two years. What have been some of your biggest learnings on SPACE since publishing it?
Brian Houck: Yeah, it’s actually been over three years since we published SPACE, which feels like a blink. First off, I just continue to feel incredibly fortunate to have been a part of it, and I feel very thankful that people are still so interested in it. It clearly has resonated and has some enduring relevance. I just checked the numbers today: SPACE still gets about 3000 downloads a week, and you can’t plan for that.
You certainly don’t want to set your expectations up for something like that. I have papers I’m incredibly proud of that don’t get 3000 downloads in their entire lifetime. The fact that people are still coming back to SPACE is really interesting to me. I think that over time, I am seeing the ways in which SPACE is answering questions for individuals and organizations, and certainly, I’m seeing the ways in which it still is leaving some unanswered questions and some challenges, which you certainly hinted at.
Abi Noda: Well, it sounds like you’ve reached Evergreen status with SPACE. It’s going to just keep compounding and growing, which is awesome, great for the industry, and a testament to the well-written research and paper that you were a part of. Start with what you just referred to, in terms of how are you seeing the relevance or ways in which companies are maybe trying to adopt and leverage SPACE changing in more recent times?
Brian Houck: Yeah. At its core, SPACE is a set of principles. It is a viewpoint of what the definition of productivity even is, and it’s sort of this notion that productivity is not just about output, it’s about a bunch of human factors and outcomes as well. I think we are constantly seeing the world around us changing, and that’s accelerating through things like increased AI adoption, changes in the way that developers are working, and also just the context in which we are working, through things like remote work into hybrid work, and now a push to return to office. I think the kinds of challenges that SPACE is being applied to are changing, even if the concepts of SPACE themselves aren’t necessarily having to change.
Abi Noda: One of the questions, I know you get asked this, I know Nicole gets asked this, is DORA and SPACE, how are they different? How do you use that? What’s your advice on that question?
Brian Houck: Yeah, it is a question I get constantly, and I think it actually highlights some of the challenges the SPACE framework presents, some of the limitations, if you will. DORA is, in my mind, not a system for measuring developer experience, it’s a way of measuring deployment efficiency, and it is incredible at that. It is the gold standard. I think a lot of the success that DORA has found comes from this notion that it is very prescriptive.
It says, “Here is a set of metrics that you can go and use, and using them is going to lead to the kinds of outcomes that you are looking for.” Clearly, there is a lot of appetite for that. That is not what SPACE is. SPACE is not a prescriptive framework that says, “Okay, here’s the recipe you should follow to deploy it. Here are our set of recommended metrics, and here’s how you should think about them, and here are the activities you could go and drive them.” That is clearly missing.
This is where things like the work that you are continuing with, the DX Core 4, I think, are really picking it up, because the kinds of questions I am getting are, “Hey, with DORA, I got these sets of metrics. What are the set of metrics that I should use?” In SPACE, we give some example metrics that, in many ways, feel like an afterthought. It’s like, “What are the specific metrics I should use, and how do I go about starting to collect them?”
When I talk about the differences between some of these frameworks, there’s a difference between that prescriptive element and also then what is just a set of ideas, ideals, if you will, to build out that framework?
Abi Noda: I was recently chatting with Nicole and her view on this, and I wonder if she, I didn’t ask her this, but I wonder if this view is a retrospective view, or if this is what she had in mind when you all were creating this. You would probably know better than I, but she shared that DORA is an instantiation of SPACE. Does that align with the way you view it, and maybe explain what that means, how you would characterize that description?
Brian Houck: She’s never actually said that to me, but that does resonate. Part of the power of SPACE is that it is a very broad umbrella. You can fit pretty much any situation under SPACE, and there are some challenges with that. Some of our dimensions maybe are a bit too much of a catch-all bucket, but SPACE is very, very broad.
DORA is a lot more of a narrow focused view of certain aspects of developer activity, particularly, again, related to deployment efficiency. All of the DORA metrics fit within the SPACE framework, but they fit within just a slice of the SPACE framework.
Abi Noda: Right, that makes sense. You talked about how one of the challenges with SPACE is it’s not a prescriptive set of metrics, but rather a framework for coming up with the metrics you want to use. I know we joked in a previous conversation, and listeners of this show will appreciate this, that it’s a little bit similar to Spotify Backstage, right? Folks always go looking at Spotify Backstage as, “Hey, I need an internal developer portal.”
Then they come to realize it’s actually not a developer portal, it’s a framework for building developer portals, so it’s not as out of the box ready as some would hope. How do you see that, when it comes to SPACE, how do you see that manifesting? Are people coming to you with mis-expectations about what SPACE is? Do they run into challenges? How do you see that nuance sort of manifest in the real world?
Brian Houck: It’s such a great question. The most common thing I see is I’ll go and talk to a customer and they say, “Hey, we’re trying to deploy, we’re trying to implement the SPACE framework, deploy the SPACE framework, and we’ve built a dashboard.” They’ll show me their dashboard, and two things will be true, that we actually recommend against in the SPACE framework. One is they will have every dimension, SPACE, and then their set of metrics beneath those dimensions.
It’s like, “No, no, no. We did not ever intend everyone to go and try to measure every dimension all at once.” It’s focus on the things that are most relevant for your organization, but we list five dimensions, and so people will go and build a dashboard that has all dimensions, and then they will have dozens of metrics across those dimensions. If you measure everything, you’re actually focused on nothing, because you’re always going to have some things going up and some things going down. It’s very complicated, and it makes analytics really, really tough to do over it.
Oftentimes, companies will, or organizations, rather, will fit on one end of the spectrum. What I just described is they’re biting off way more than they could chew, way more than anyone can chew. You can’t focus on everything all at once, or you’ll have organizations who just are so lost they don’t know where to start. It’s just like, “Okay, my dashboard is nothing because I don’t know any of the metrics I should be using. I don’t know where to start to begin with.”
Abi Noda: What are some examples that come to mind of folks who you feel like have understood SPACE and deployed it successfully? Outside of Microsoft of course, in your work, which we’ll talk about as well.
Brian Houck: You mentioned Spotify Backstage, and we know that Spotify themselves used to use the SPACE framework and have now modified it to their own purposes, which is the goal of things like SPACE: take it as a set of principles, and then adapt it to your own local needs. I think Meta is another example of a company that is passionate about SPACE, and they’ve been able to do some incredible research using SPACE as the language with which they are evaluating their developer efficiencies.
I’ll be totally honest, within Microsoft internally, we still struggle with how to most effectively use SPACE within ourselves, and the scale at which Microsoft operates makes it challenging. I think there are some companies like the Spotifys and Metas of the world, Netflix, that have been able to use it maybe even more effectively than we use it ourselves, sometimes.
Abi Noda: The adage, all models are flawed, some are useful, comes to mind, because I think both of us see time and time again, SPACE is the best thing we have today in terms of how to think about this problem of developer productivity, yet it doesn’t give a single answer to the question. A lot of organizations are really searching for, “What’s the answer look like for us?” It’s such a difficult challenge. I’m sure people hit you up all the time asking, “So how should we measure productivity?” What do you say to these folks?
Brian Houck: I think an important part of the SPACE paper itself is that there is no one-size-fits-all solution. There is no one metric or even just a couple of metrics that are necessarily going to be equally relevant for everyone. Step one, and we talked about this in the DevEx in Action paper that we worked on together, is ask your developers. You need to go and find out, what are the actual pain points you are trying to solve for?
Without doing that, you’re just defining metrics to chase problems that you may or may not know actually exist. The first place you should always start is talking to the developers themselves. That is always my recommendation.
Abi Noda: Love that. Collin and Ciera from Google were recently on this show, and they shared that’s the advice they give to leaders at Google as well. They say, “Start with asking your developers, or start with the survey data,” which is in aggregate asking developers, “Identify where your opportunities and focus areas are.” Then design the metrics that you’re going to use to measure and track improvement in those areas.
Switching gears a little bit, I want to talk about something that applies to work that I’m doing right now at DX with the DX Core 4, as well as, I think, a somewhat controversial element of the SPACE framework. That is measuring developer activity. I specifically want to talk to you about measuring pull request throughput. This is obviously a polarizing topic. I have taken a lot of heat on this. I’ve been on both sides of the issue. Recently, when we published the DX Core 4 and included this metric, even Peggy had concerns.
Funny enough, Nicole was a fan, and I’m still figuring out the right ways of advising organizations on the dos and don’ts of this. First of all, what’s been your experience? You were obviously part of including this metric in the SPACE framework. What was the reaction to that at the time?
Brian Houck: I think that, like much of the SPACE framework, while we were writing it, there was robust discussion over what are the dimensions that should be included? At the end of the day, activity was not included by accident. It was included for what I feel are very good reasons. To your point, it’s all about the context with which you use it. How are you using it? The nuance there is not the single answer, but is part of the answer. I’ll say from my perspective, maybe somewhat surprisingly, I am a huge fan of PR throughput as a metric.
I think like many traditional productivity metrics, it’s all about how you use it, and understanding the ways in which it is maybe not as useful, not as relevant. I think, for example, it is incredibly poor as an assessment of individual productivity. If you are using it to assess developers, I think it is probably an outright inappropriate metric. I think it is really, really powerful and really useful at measuring the health of your system. I think it is really useful at finding friction in the developer experiences related to flowing code, and it is a useful way of finding where developers are not able to efficiently complete their work.
I think that understanding, again, the ways in which you can use it and the ways that you can’t use it can lead it to be a really powerful metric to use in really interesting ways.
Abi Noda: What’s your opinion, and I’ve talked to a lot of organizations, like Meta, GitLab, Stripe is pretty well known for doing this, but where looking at PR throughput, PRs per engineer is a pretty central part of how they’re thinking about and measuring productivity. Generally speaking, what is your opinion on that practice?
Brian Houck: I think as a top level metric, as long as it is part of a basket of metrics, including things like developer satisfaction, then it’s useful, then I am a proponent, I’m a proponent for it here at Microsoft. I think you want, again, to measure across lots of dimensions as a way of setting up a system of checks and balances. At the end of the day, we see that increasing PR throughput is not only good for the business, but it is good for the developers.
We see that as you reduce the friction in your PR process, and thus you increase the PR throughput, not only can developers flow more code in total, but they are happier. They feel more productive, because no one wants friction in their tools and processes. PR throughput measures friction from both automation, coming from where do you have slow CI builds and test passes, but it’s also friction in your human processes, like the code review process.
It shines the light on a lot of different places of developer pain. As you remove that, it makes the lives of developers better, and it also allows you to more efficiently move code through your system, which is good for the business.
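To make the team-level framing above concrete, here is a minimal sketch of how PRs per engineer might be computed as a system-health signal rather than an individual score. The record layout, team names, and headcounts are hypothetical and exist only for illustration; nothing here reflects how Microsoft or DX actually compute the metric.

```python
from collections import defaultdict
from datetime import date

# Hypothetical merged-PR records: (team, author, merge date). Illustrative data only.
merged_prs = [
    ("payments", "dev_a", date(2024, 3, 4)),
    ("payments", "dev_b", date(2024, 3, 5)),
    ("payments", "dev_a", date(2024, 3, 6)),
    ("search", "dev_c", date(2024, 3, 4)),
]

# Hypothetical headcount per team.
team_size = {"payments": 2, "search": 1}


def weekly_throughput_per_engineer(prs, sizes):
    """Return {(team, iso_year, iso_week): merged PRs per engineer}.

    The author field is deliberately ignored: results are aggregated per
    team and week, so no per-individual count is ever produced.
    """
    counts = defaultdict(int)
    for team, _author, merged_on in prs:
        iso = merged_on.isocalendar()  # (ISO year, ISO week, weekday)
        counts[(team, iso[0], iso[1])] += 1
    return {key: n / sizes[key[0]] for key, n in counts.items()}


result = weekly_throughput_per_engineer(merged_prs, team_size)
for (team, year, week), value in sorted(result.items()):
    print(f"{team} {year}-W{week:02d}: {value:.2f} PRs per engineer")
```

In keeping with the conversation, a number like this would sit alongside other signals, such as developer satisfaction, rather than being read on its own.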
Abi Noda: Well, I suppose I put you on the spot a little bit with that question, but I will say I 100% agree. I have the exact same point of view on PR throughput being generally useful, so long as you avoid measuring individuals, so long as it’s part of a basket of metrics that counterbalance each other. It’s interesting, when Nicole and I last talked about it, she similarly, in a Nicole fashion, said, “It’s not perfect, but it’s pretty darn good.”
Peggy, I think, is a little more conservative in her concerns over how these types of metrics can be abused, I think rightfully so, but at least talking to other organizations and other folks like yourself, I think generally what we see is that as long as you avoid the don’ts, the do nots, and as long as you do it in a balanced way, that it really is a really useful input into understanding productivity.
Brian Houck: Yeah, and I actually have a real world example that I think proves that. The example I’m really fond of giving is if we look back to March of 2020, Microsoft, like much of the world, moved to remote work. We had not historically been a very remote-friendly company, and so most of our developers weren’t particularly used to it, and we weren’t sure that our developer infrastructure was going to handle it.
Looking at PR throughput was a way that we could evaluate, are our systems actually able to handle the load of all of our developers moving remote all at once? What we found is that, yeah, our PR throughput actually went up dramatically. There were some learning curves within the first few days, but after those first few days, we saw that PR throughput skyrocketed. I actually use this as a way to say that PR velocity just going up isn’t always a good thing.
It was an indication that while our systems were healthy, it let us dive in and see, “Oh, well, some of that is due to the fact that our developers are just working more hours, because we’re in the early days of a pandemic, and we’re quarantined, and there’s nothing else to do.” We also saw that 78% of our developers reported feeling burned out. It was an indication that our systems were healthy, but we needed to look more at developers. It’s not just like looking, “Is it going up?” It’s like looking at the reason why. I wouldn’t have known to look for that if we hadn’t seen that metric move so dramatically.
Abi Noda: I feel that one of the challenges with looking at PR throughput is that it gets associated unfairly with other metrics like counting lines of code, which I do think is not a very good metric. I think lines of code is easy to game. It is often used in an unhealthy way versus PR throughput. It sounds similar, but it’s really not the same thing because you’re really looking at the friction from whatever lines of code you did happen to write and what is the friction in moving that through the system. And I think not painting those things with the same brush is incredibly important.
Brian Houck: Yeah, that’s really interesting. Once upon a time, I definitely was on a stage criticizing pull request throughput for that exact reason you just brought up. I said, look, this is actually just lines of code repackaged up into bigger units, right? But I’ve since changed my perspective because I believe PRs are really units of value and work that are decoupled from the number of lines of code. So yeah, I see both sides of that argument and it’s a really interesting debate.
Abi Noda: And it certainly is something that elicits a lot of passion from people in the developer productivity space. I recently posted a little bit of data we published on seeing some PR lift at one of the companies we work with in relation to their adoption of GitHub Copilot. I should have double-checked myself, because as you can expect, I got a little bit blasted in the comment section on using PR throughput as… there were comments like, “since when is pull request a measure of productivity?”
First of all, I’m curious to ask, and don’t share anything you can’t, but anything you’ve seen as far as PR throughput specifically and its level of impact on that through some of the GenAI tooling that you all are experimenting with at Microsoft? Second question, what is your advice on how to frame this metric, how to communicate about this metric, in ways that don’t result in that kind of backlash, especially from developers?
Brian Houck: Yeah, it’s a great question, and both from internal research at Microsoft, and also confirmed by external research we’re seeing at other organizations, it is pretty clear that across most engineering organizations deploying AI tools, they are seeing uplift in their PR throughputs. It is helping devs complete tasks faster, and like all things, it really highlights the need for nuance, because it doesn’t help devs necessarily complete all kinds of tasks faster.
Importantly, it should be noted that not all dev work is just creating PRs. There is a lot of developer work that is not actively coding. In fact, we see from lots of studies that developers only spend about 15 to 20% of their day coding, and tools like GitHub Copilot may not be helping at all with some of the other 80% that is remaining. Like applying SPACE to almost any problem, the question is not, what is the one metric to use? It is, what is the basket of metrics to use?
Yes, we see that PR throughput may go up, and I think that’s a valuable, important signal that is important for developers. We know that that correlates to their own satisfaction and feeling of productivity, but it should be used in combination with metrics across the rest of the SPACE framework. Are devs happy using AI? Do they feel that it makes them more efficient? Do they feel that it is changing the way they’re collaborating with their peers in a healthy way, or in a harmful way?
I think as a metric, it is really useful for evaluating the impact of AI. I think in isolation, it’s not, and I think that’s true for a lot of problems.
Abi Noda: Then what’s your advice on how to introduce either data findings or use of this metric, PR throughput, to developers? What’s the right way to frame it? How would you communicate it in a way that doesn’t lead to backlash?
Brian Houck: When I am talking with developers internally at Microsoft, I am, or at least I try to be, unwavering in the way that I talk about metrics, framing them through the lens of, “This is how improving this metric will make your life better.” I do not advocate for metrics that I can’t say with research that improving it will improve the lives of developers. I think I try to anticipate and head off some of the criticisms.
I always am talking about that basket of metrics that is a set of checks and balances, but I will always go back to framing metrics with rigorous research that backs them, showing that, “Improving this is actually going to make your life better, and these are the ways it will make your life better.”
Abi Noda: Last question on this topic, have you thought about or experimented with any variations of this metric? One thing that comes up when I talk to organizations is they say, “Well, not all PRs are created equal,” so there’s obviously some fuzziness in this number. Have you experimented with, I don’t know, weighted PR throughput, or classifying different types of PRs and attributing them differently? Have you thought about that? Have you tried that? What’s your reaction to that idea if you haven’t thought about it?
Brian Houck: We have definitely thought about it. I think that we are continually trying to find ways of classifying PRs that makes this metric more actionable. The honest truth is we haven’t found anything yet that makes it any more actionable than just looking at everything in its sum total. I think it does highlight one of the advantages of metrics like this, where it’s, at first blush, it looks like it’s susceptible to gaming.
It’s like, “Oh, well, I’m just going to take all of my PRs and break them up into tiny little PRs.” Turns out that is a good thing. Turns out having smaller, more frequent PRs results in more total code flowing through the system with less total friction. It is resilient to gaming. I do think that looking at different ways of computing the percentiles, should I look at P80? Should I look at medians? Those are all healthy discussions.
In terms of completely different metrics, this is the one I keep on coming back to for what it is. Now, we deconstruct it as well. We look at how much of the friction in that PR throughput is coming from flaky tests, or coming from our build infrastructure. A metric that we care a lot about internally at Microsoft is how long it takes for code review to start. How much time are you just waiting for one of your peers to code review? That’s a valid sub-metric, but it’s not a replacement for looking at the PR completion in its entirety.
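The percentile question and the wait-for-review sub-metric described above can be sketched as follows. This is only an illustration under stated assumptions: the timestamps are invented, the 80th percentile is an arbitrary choice for the example, and `percentile_nearest_rank` and `summarize` are hypothetical helper names, not anything from Microsoft's tooling.

```python
from datetime import datetime
from statistics import median

# Hypothetical PR timestamps: (opened, first review started, completed). Invented data.
prs = [
    (datetime(2024, 3, 4, 9), datetime(2024, 3, 4, 10), datetime(2024, 3, 4, 13)),
    (datetime(2024, 3, 4, 9), datetime(2024, 3, 4, 11), datetime(2024, 3, 4, 17)),
    (datetime(2024, 3, 4, 9), datetime(2024, 3, 4, 9, 30), datetime(2024, 3, 4, 11)),
    (datetime(2024, 3, 4, 9), datetime(2024, 3, 4, 15), datetime(2024, 3, 5, 9)),
    (datetime(2024, 3, 4, 9), datetime(2024, 3, 4, 12), datetime(2024, 3, 4, 19)),
]


def hours(start, end):
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def summarize(records):
    """Summarize whole-PR completion time plus the wait-for-first-review sub-metric."""
    completion = [hours(opened, done) for opened, _review, done in records]
    wait = [hours(opened, review) for opened, review, _done in records]
    return {
        "median_completion_h": median(completion),
        "p80_completion_h": percentile_nearest_rank(completion, 80),
        "median_wait_for_review_h": median(wait),
    }


summary = summarize(prs)
print(summary)
```

The median describes the typical PR, while a higher percentile surfaces the slow tail; which one to track is the kind of judgment call discussed in the conversation.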
Abi Noda: Really fascinating. Well, we could talk about this topic forever, I’m sure, and look forward to ideas that we both have on it in the future. I want to ask you about your current work. We keep running into each other. I keep hearing your talks, but what do you focus on right now? Any interesting projects or research findings that you think would be useful to listeners?
Brian Houck: Yeah. I, as you might imagine, am doing a lot of work looking at how the use of AI is changing developer experiences, and in which ways it is impacting developer productivity. That involves looking at how AI is impacting developer experiences across the SPACE framework. The way that I’m doing that is by asking developers themselves, I am asking developers who are actually using AI, “How do you feel it is impacting your satisfaction, your activity levels, your efficiency, your collaboration, your overall performance, your outcomes?”
I think that work is really interesting to cut through some of the noise around people who maybe aren’t developers day to day, putting their own viewpoints on it versus the people who are using these tools day to day. That’s where I think I’m very proud of how it’s coming along. I am, as part of a specific pivot of that, looking at how the usage of AI tools is impacting code quality. One of the, I think, very valid criticisms of using increased PR throughput as a metric is, well, if we’re producing more code, we’re flowing it quicker through the system, but that code is of lower quality, we’re actually in a worse spot.
I’m trying to answer the question of whether AI is impacting different dimensions of quality: readability, maintainability, the security, the robustness of code. Then there is what I think of as a mad scientist project that I’m really excited about. I’m partnering with a brilliant PhD student, Isabel, at the University of Zurich, and we are looking at quantifying the relationship between sleep and developer productivity.
There’s lots of research out there that shows that developers, all humans, but there’s some unique developer research that shows that developers who feel like they have a poor night’s sleep feel less productive the next day. I’m going to see if I can quantify that. Then looking at, does the use of AI change that relationship? If I’m particularly tired on a given day, does using AI help fill that gap at all?
Abi Noda: Well, fascinating areas. Any early data on either of those two projects you’re able to share?
Brian Houck: Do not have any data I can share on the sleep study yet. That is still underway. For the SPACE and AI study, it is very clear that the vast majority of developers believe that the usage of AI is making them more productive, and is making them more productive across all of the dimensions of SPACE. We see it-
Abi Noda: Including quality?
Brian Houck: I’m sorry, so for productivity, the dimensions. The study’s looking at how AI is impacting productivity through the dimensions of SPACE. We are still analyzing whether it impacts quality.
Abi Noda: Interesting. I want to ask you for advice on how to ask questions to developers when you’re trying to understand this type of impact. One question I have personally that I’m still trying to figure out, is are you typically looking at longitudinal data before and after, or are you just asking developers at a point in time to kind of self-reflect on their before and after, or with or without? What are the ways you’re finding it more or less effective in terms of gathering good data from developers?
Brian Houck: It’s such a great question, and the constant tension that I have in my own work. If I truly want to get at causality, I need that longitudinal data, and I need to look at how things change before and after interventions. Doing that is really challenging, and it limits my ability to do other new research. Like many things, I blend both techniques. We do collect a bunch of sentiment data that we collect longitudinally, and we try to use that to prove causation.
Most of my studies are point in time, and it is looking at reflecting back over the last three months, how have things changed? It would be better to always collect longitudinally, but it’s tough to do that, and so you have to make compromises.
Abi Noda: Outside of the self-reported data and sentiment data, and outside of PR throughput, any other system-based telemetry metrics that you found a lot of value in relationships, or value in looking at?
Brian Houck: Yeah, so the metric that I care about a lot in my research is simply how much time per day do developers spend coding? I use that because if you go and ask developers how they want to spend their time, they will say, “I want to spend my time coding.” That gives them much more sense of fulfillment than dealing with compliance or managing their work items. We can see that in the data time and time again.
As a metric, the time per day that developers are actively coding positively correlates to their job satisfaction. It is one of the best descriptors we have for self-reported productivity. It is very predictive of their retention too. Again, it is a garbage measure of individual performance. When developers don’t spend a lot of time coding, it is clear that it is not because they are less productive. Coding is what they want to do. They’re not coding because there is friction put in their way by tools and processes. That’s a metric I like a lot.
Abi Noda: Well, Brian, really excited. I think whatever the findings of that sleep quality versus productivity, that is data that me and my wife will use against each other, because it is a constant debate between us. I’m really interested in that finding, because I always think about how in this whole productivity conversation, people are, what are the tools, what are the processes? Sometimes we forget that it might just be things outside of the work environment, like sleep quality, that actually potentially have some of the biggest impact.
Brian Houck: Yeah, and I’m thrilled that you caught onto that. That is the reason I do studies like this is we think about investing in our engineering infrastructure, and we should be doing that. We should be looking at how we can make our CI builds faster, how we can make tests less flaky. At the end of the day, the gains we have from things like that pale in comparison to addressing the human factors that impact productivity.
If we find that giving developers a little scent pod that sprays lavender at them as they sleep, that could be more impactful than anything we’ll ever do in our build system. Looking at the entire human experience, I think, is the only way to understand holistically, developer productivity.
Abi Noda: Well, Brian, thanks so much for your time today. I know you’re really busy preparing for another talk in Singapore in a couple of weeks, so thanks so much for making time to come on the show. Really enjoyed this conversation.
Brian Houck: Thank you so much. Thrilled to be here.