How Google measures and improves developer productivity

Abi: Collin, Ciera, so excited to finally have you on the show. Thanks for coming.

Ciera: Thanks for having us.

Collin: Yeah, happy to be here.

Abi: Google’s obviously such a unique company. The level of depth of the work you guys are doing I think is really remarkable. I want to start by asking you both to just share a little bit about your team. What type of work do you do, who is on your team? Yeah, share a little bit with listeners.

Ciera: Yeah, our team is the engineer productivity research team. We were started as a way of trying to understand what we can do to really improve developer productivity. Previous to our team’s creation, a lot of decisions about what types of tools we needed at Google were made just based upon an engineer at the company taking a good guess based upon what would help them. We got really far at Google with that, but that only takes you so far. At some point, the company’s really big. People are doing different types of development. The people building the tools may not know about all the different types of work that are being done.

We wanted to create a team that would better understand the on-the-ground developer experience so that we could figure out how to improve the tooling, the processes, everything that people do. We created the team with the idea of it being a mixed methods research team. From the start, it was not just a team of software engineers. We also got UX researchers involved right away. We continued to grow as a mixed team with… We try to grab people from a variety of fields, especially our UX researchers, who come from a pretty wide variety of fields.

Collin: Sure. It’s a really unique team in some ways. I think people are always surprised to find out that we have eight software engineers, and then we have eight or nine UX researchers. As Ciera said, the backgrounds, especially among the researchers, are quite diverse. We’ve had behavioral economists, I’m a psychologist, we have social psychologists, industrial organizational psychologists, we have somebody from public health. We have people from engineering and public policy across the board, trained as researchers in some sort of social or behavioral science methods, but really diverse. That really strengthens what Ciera was saying. We really lean into using a wide variety of methods, mostly in concert, not in isolation, but together so that we get a complete picture of what the developer experience is like. We do diary studies, we do survey research, we do interviews, we do qualitative analysis, we do logs analysis. We do a really wide range of things to understand exactly what’s happening as best we can and as holistically as we can.

Ciera: Yeah, the goal is always to try to figure out how to triangulate on developer productivity. We know there’s not a single metric for developer productivity, so we try to use all these different research methods to see, are they all aligned? Are they all telling us the same underlying story of what’s happening? Or are they misaligned, in which case we need to dig deeper to figure out what’s going on?

Abi: We’ll talk a lot more about the different methods and ways in which, what does mixed methods even mean? But I want to also ask, so how does your team actually interface with the rest of the organization? Who are you actually working with? What other teams are you partnered with? What does the daily work look like?

Collin: That’s evolved over time. One of our big customers is the first party developer team. They build all of our internal and homegrown tools. When they want to, for example, understand what makes developers productive and what could make them more productive, our data, our research is one of the places they go to understand how to even measure that.

That’s an evolving relationship and we’re focused on trying to improve how well our foundational research can be applied to these very product specific, or tool specific questions. But that’s the goal, is to make these tool infrastructure and process and practice improvements, and to just help them understand the impact that their work is having.

We also consult with a variety of other folks at Google. We work with people operations, to some extent. We work with some of the folks in real estate and workspaces. We work with some of the people in corporate engineering who do the tools for all Googlers, not just the engineers. It’s a pretty wide smattering of folks. When you focus on engineering productivity, you’re focusing on a big chunk of the Google population. There’s wide interest in what we find.

Abi: You’ve both already touched on this, but I want to double click on it. You come from different professional backgrounds. Ciera, I understand you have a software engineering background. Collin, you have a psychology background. It’s probably very clear to you how those things come together, blend together to apply to the work you’re doing, but can you share your own experience, your own perspective on this for listeners who may not understand why… First of all, why the pairing is so powerful, but also why you even need a psychologist?

Ciera: Well, I guess I can go for that one first, since I was the one who started hiring the psychologist on the team. Having the UX researchers on the team and having people with a psychology and social… It’s really a social sciences background, is important to provide us with context. I have a little bit of experience with doing user studies for my PhD work, but that wasn’t the primary goal of any of my PhD work. I was more… I was in software engineering, I was doing static analysis. A user study was something we’d do at the end, make sure it kind of… Users were happy with your tool, but it wasn’t the primary focus.

What we knew was that if we are doing logs analysis, that only provides you with part of the picture. It tells you what developers are doing. It doesn’t tell you why they’re doing that. It doesn’t tell you how they feel about it, if what they’re doing is good or bad. It doesn’t tell you if there’s room for improvement. It only gives you a number, but you can’t interpret that number unless you have more of the qualitative side of the world and you understand the behaviors and how those behaviors change over time depending upon how you change the context.

That’s why we started pulling in… We had started with one UX researcher on the team and it worked really, really well to have somebody who’s thinking about, “Wait, how do we design this survey to get the question that we want answered?” Because we quite frankly weren’t very good at designing surveys at the start. It was working out really well to have that.

The other powerful thing though is that having the engineers with UX researchers also meant we could start scaling UX research. I remember one of the first times we did that, where a UX researcher on the team was wanting to run a survey and she was planning on doing it as a daily survey or an end of week survey for people. She goes, “I really just wish though that I could send the survey when the developer is running a build, because that’s what I really want to know, is what they’re thinking about when they’re running that build.” We go, “Well, we can make that happen for you. We can plug into the build system and find out. Right when they’re running the build, we can send them a ping on chat and ask them to take the survey right then.”

That unlocked a whole new set of methods for us, that we could do that. That’s what we call experience sampling. It’s a common technique, but it’s not always available for UX researchers because they don’t have the software engineers to be able to scale them like that.

Collin: Yeah, I would say the combination of the behavioral research methods that we bring from the UXR side and the capability to scale that the engineers provide us, also the support for quantitative analysis. Half of our researchers are actually quantitative researchers, so it’s not like everybody on the UXR side is sitting over there doing interviews and then talking about emotions. We have people who do quantitative analysis and statistical modeling, and even ML on the research side. You can’t paint all of the researchers with a single brush, but the engineers offer us this scaling. They also offer us really deep subject matter expertise. We’re in a really fortunate position. The UXRs on our team, if we were UX researchers on a consumer facing product, it’d be like, those out there are the users and we’ve maybe met them sometimes and we try to understand what they mean.

We are sitting next to our users. Ciera is a software engineer at Google. I study people like her. The direct contact with your user base, as well as the domain expertise, because I’m not a software engineer. I knew almost nothing about software development when I joined this team, but the direct access to subject matter experts who are way deep in it and who are at the top of their field is a really powerful augmentation to having this quiver of arrows that is behavioral research methods. Those things like the domain expertise, the scalability, and the technical skills from the engineering side combined with the wide variety of behavioral research methods, and a facility with accounting for things like bias and the way people work and what to watch out for in survey responses or interviews, demand characteristics, all the stuff that plagues psychology research and makes it hard. It turns out whenever you’re talking to humans, those things are a factor and you need to account for them. We run surveys, we run surveys that we think aren’t terrible, but it’s super easy to run a terrible survey. Once you’ve run a survey, non-response bias analysis is a thing. That’s stuff that comes from the social sciences. We’re able to bring that expertise.

Ciera: I got a good example of the domain expertise coming into play. One of our first quant UX researchers was doing some data analysis early on in the team. She did a great job with the analysis and she showed us her early results. Immediately all the engineers said, “That’s wrong.” At first she thought that we were criticizing her work. We’re like, “No, no, no, your analysis is great, but there’s something wrong with the data.” She says, “Well, how would you know?” We’re like, “Well, you’re studying memory. The results that you’ve got, nothing’s in a power of two. Something’s wrong. This is supposed to be about machine memory. There’s something wrong with the underlying data here.” We went back and found out, yes, there was actually… The upstream data source we’re getting data from had a bug, there was some weird stuff coming in we had to filter. She knew then to filter all that out and it completely changed her results. Having that initial sniff test of this doesn’t look right from the domain expert’s perspective can be real helpful.
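The sniff test in this anecdote can be turned into a cheap automated check. What follows is only a minimal sketch, not the team's actual tooling: the function names and the sample readings are invented for illustration, and it encodes nothing more than the expectation voiced above that machine memory values should show up as powers of two.

```python
from typing import Iterable


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for zero, negatives, and everything else."""
    return n > 0 and (n & (n - 1)) == 0


def power_of_two_fraction(values: Iterable[int]) -> float:
    """Share of values that are exact powers of two (0.0 for an empty input)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(is_power_of_two(v) for v in values) / len(values)


# Hypothetical memory readings in MiB (made up for this example).
suspicious = [1000, 3000, 5120, 7777]
plausible = [1024, 2048, 4096, 8192]

# A fraction near zero on data that is supposed to describe machine memory
# is a hint to go inspect the upstream source before trusting any analysis.
print(power_of_two_fraction(suspicious), power_of_two_fraction(plausible))
```

A check like this does not prove the data is right; it only flags input that a domain expert would distrust at a glance.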

Collin: There’s one other thing that I think is pertinent here, because I think sometimes we talk to other teams about how can I build this capability. They mean, how can I build that mixed methods capability? A thing that we benefit from for our research, having professional engineers on the team, is the data engineering stuff.

At Google we have instrumentation and logs that a lot of folks won’t have, but that doesn’t mean it’s easy to work with those things. Even if you have a very good data science team, all that data engineering, cleaning the data, interpreting it, understanding where it has weird anomalies when it’s wrong, that requires data engineering skills that not everybody has. Even if they do, data scientists might not want to lean into that part of their job. They might want to, “I want to do the analysis and the research. I don’t want to munge through the data and figure out which rows to throw away.” That’s a necessary prerequisite in many cases, but it’s not the fun part.

We have software engineers who do a lot of the building, curation, maintenance, and monitoring of our data sources, and do it more efficiently than we could do as researchers. When it comes time to do the quantitative research, the researchers can focus on the meat of the problem. I have to say, relative to other contexts, it’s been easy to hire quantitative researchers to our team because we can just say, “Look at this playground of well-formed data that you get to work with.” There’s so much, and it’s curated, and you have engineers to help you with data engineering problems. You get to do the research and the fun part and have really a lot of muscle around the hard data engineering part that isn’t necessarily the fun part for a researcher.

Abi: Well, the background of your team and all the resources you have put toward this problem are inspiring and enviable for probably a lot of listeners, but I think also just speak to just how hard of a problem this is. That’s what we’re really here to talk about today.

I want to move into, we talked about some of the methodology, some of the backgrounds of folks working on these problems at Google. You’ve written several papers that are public about different ways in which Google is actually trying to measure productivity or aspects of productivity. Can you first start by explaining, just at high level, what you mean when you talk about a mixed methods approach to developer productivity? How does Google measure developer productivity, I guess is the question?

Ciera: When we’re measuring developer productivity, we have a general philosophy first. There is no single metric that’s going to get you developer productivity. You have to triangulate on this. We actually do that through multiple axes. The first one that we try to work across is speed, ease, and quality. These are three separate aspects of productivity that are kind of in tension with each other, where… Some of them are more obvious than others. I actually like a quote that Collin gave at one point where he told somebody that he can improve developer productivity at Google by removing code review. Developer velocity, by removing code review.

That’s a nice one at Google at least, because every Googler will say, “Absolutely not, you’re not taking away my code review. We need that. That’s a basic quality check there.” It makes the point well. But yeah, the first… We’re going over those three aspects. Even within those three we further measure across different methods.

We will measure using logs for speed. We’ll also measure people’s beliefs of how fast they think they’re going. We will also follow this up with diary studies and interviews to make sure that this all lines up and matches up together. For every single one of these, it’s not even that there’s one speed metric; we have multiple speed metrics. When we’re talking about mixed methods, it is both using multiple measures and also making sure that they’re validated against each other. As an example with our logs data, this is in our paper about cross-tool logging, we have a measure which is active coding time. We don’t just create that measure in our logs. We also do diary studies and we ask engineers to write down every few minutes during the day what they are doing, and we make sure then that what they said they did matches up with our logs. That helps give us some confidence that our logs data is actually accurate.

Collin: Yeah, I think there’s a bit here that really epitomizes our mixed methods approach. Not only are we using our logs data and our behavioral data in concert, but when we do those diary studies that Ciera is describing, we actually don’t treat either data source as the ground truth. The reflexive action would be, “I’ve got these logs data. Do they predict? How well do they match the subjective report data?” That’s not exactly what we do. We actually use the approach that psychologists have taken to inter-rater reliability. If you have a bunch of observers all trying to, say, count instances of sharing in a kindergarten class, there’s no objective ground truth, you just have a bunch of observations. What you calculate is the agreement among those people and you just treat them all as equally weighted in that analysis.

That’s actually what we do with our diary studies and our logs metrics. We say we have an active coding time metric defined on logs data, and then we have these diaries from developers that tell us when they were actively coding. We just calculate the agreement between those two sources. We assume there is some truth out there that we can’t directly observe without sitting next to the developer and probably bothering them. We take these two sources and we say, “Are these two lenses telling us about the same world?” Because that’s really what we want to know. We just want to know that they’re telling us the same information. In the end, often what we do is we lean into the logs-based metrics because they scale well. We can collect active coding time for every engineer all the time, and they’re passively collected, so we don’t have to bother anybody.
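One common way to quantify agreement between two observers is Cohen's kappa, which corrects raw agreement for the agreement you would expect by chance. The conversation does not say which statistic the team uses, so the sketch below is an assumption: it treats the logs metric and the diary as two equally weighted raters labeling the same time slots as "actively coding" or not. The function name and the sample data are invented for illustration.

```python
from typing import Sequence


def cohens_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters giving a yes/no label to the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    # Fraction of slots where the two sources give the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own base rate.
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one constant label")
    return (observed - expected) / (1 - expected)


# Hypothetical ten time slots for one engineer: True means "actively coding".
logs_says_coding = [True, True, False, False, True, False, True, True, False, False]
diary_says_coding = [True, True, False, False, True, False, False, True, False, True]

print(cohens_kappa(logs_says_coding, diary_says_coding))
```

Neither input is treated as ground truth here: swapping the two arguments gives the same value, which matches the "two lenses on the same world" framing.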

The diary studies require an engineer’s time, attention and effort and we can only do it for a small number of engineers at a time. I think we’ve done a study as large as 50 at a time, which is actually big for a diary study. But once we’ve found good evidence that we’re getting the same information from the two sources, then we can lean into the scalable method a little bit. But that elevation of… A lot of engineering organizations would say the subjective data’s like that’s the soft data and now I’ve got the hard data. We really don’t do that. We elevate that behavioral data and we say, “This is just another view of what’s happening in your organization, let’s treat it that way. Let’s see whether we’re learning about the same stuff from these sources,” and then other considerations, like scalability, and the investment required by the participant. That’s what determines how we decide to collect long-term data.

The other thing I’ll say is that we often use some of our behavioral methods where it’s either we don’t know how to do logs-based metrics or we can’t in principle. We talked about the fact that we run a survey. We run a quarterly engineering satisfaction survey. When we first started that survey, there was a lot of selling to execs like, “Hey look, this isn’t just people’s opinions, this is actually valuable data about how engineering productivity is working.” I think at this point we’ve gotten past that obstacle and people have bought into it, but one of the things that really helps make that point is that there are things that are very difficult to extract from logs. Technical debt is a thing that we’ve run into where it’s just hard to find good objective metrics that tell you how much technical debt there is, where it is, and whether it’s a problem.

Surveys can help you measure things that you don’t know how to measure objectively. They can also help you measure things that are in principle not measurable objectively. Like engineer satisfaction: there are no logs-based metrics that will tell us that directly, but we can ask engineers and they’re a good proxy. That’s two uses for behavioral methods in addition to triangulating: augmenting what we can do objectively. We can measure flow or satisfaction now. Maybe eventually we can measure them in some other way later, but we can get them now and track them longitudinally.

Abi: You’ve shared now quite a bit about how you use surveys and log-based metrics and data in combination, and how you use both to cross validate the other. Can you share, have you run into discrepancies between the two types of data? What have you learned from these instances where the data isn’t lining up?

Ciera: Yeah, we’ve had a couple cases where there’s been discrepancies. They’ve tended to fall into two categories. One is that the logs data was wrong. It happens rather regularly when there’s a discrepancy. The other is that there’s some underlying facet that we weren’t measuring yet, that is resulting in these things being different.

An example of the latter would be if the survey data was representing a larger concept than the logs data was. You’re asking engineers, maybe, how do you feel about your developer velocity? There are a lot of pieces that go into a developer’s velocity, whereas maybe you’re only measuring one small subpart of that. You might see those diverging, and it’s only because one of them is actually measuring a bigger concept.

As an example of the logs data being different, or incorrect even, we had a case with an experience sampling study where we were asking our engineers after every build to please take a survey. We were doing this to try to correlate build speeds with satisfaction and velocity. We got some weird survey responses back a few times, where engineers said, “What are you talking about? I didn’t do a build.” We’re like, “Well, that’s weird, because the logs data says you did a build.”

Well, it turned out that the logs data was actually not just including builds that engineers kicked off themselves, it was also including a bunch of robotic builds that were happening in the background that the developer wasn’t even paying attention to. Those were useful for other purposes for the developer tools, but the engineer didn’t care about it. That didn’t actually factor into their satisfaction. When you remove those builds, it actually gave you a very different picture about the build latency that developers were experiencing at Google than when you’ve factored them in.
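The effect of mixing automated builds into a latency metric is easy to see with a toy example. This is a minimal sketch under stated assumptions: the `BuildEvent` record, its `triggered_by_human` field, and all the durations are hypothetical, and real build logs would need their own way of telling the two kinds of build apart.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass(frozen=True)
class BuildEvent:
    user: str
    duration_s: float
    triggered_by_human: bool  # hypothetical flag; real logs may not expose this


def median_latency(events: List[BuildEvent], humans_only: bool) -> float:
    """Median build duration, optionally ignoring background (robotic) builds."""
    durations = [
        e.duration_s for e in events if e.triggered_by_human or not humans_only
    ]
    return median(durations)


# Made-up data: three builds the engineer started, four quick background builds.
events = [
    BuildEvent("dev1", 120, True),
    BuildEvent("dev1", 300, True),
    BuildEvent("dev1", 600, True),
    BuildEvent("dev1", 5, False),
    BuildEvent("dev1", 8, False),
    BuildEvent("dev1", 6, False),
    BuildEvent("dev1", 7, False),
]

print(median_latency(events, humans_only=False))  # all builds in the log
print(median_latency(events, humans_only=True))   # builds the developer waited on
```

With these invented numbers the two medians differ by well over an order of magnitude, which is the kind of gap that makes a logs metric disagree with what engineers report.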

Collin: I think there are also interesting aspects where just because there are humans in the loop here, you can’t assume a one-to-one relationship with some objective metrics. Build time is a good example. If you look at psychology broadly, the entire field of psychophysics is partially predicated on the idea that the thing you can measure in the world isn’t the same as the subjective experience of that thing. You can turn a light to be twice as bright. The subject’s impression of that light won’t be twice as much brightness necessarily. There are these weird compression and expansion effects that emerge, and there are mathematical functions that describe them.

Those non-linearities and qualitative shifts we see in software engineering too. If I reduce build latency, if I make builds shorter and shorter and shorter, you might be like, just everything’s a gain. I get a linear gain as I reduce build latency, but that’s not true. In fact, once a build is longer than X minutes, the developer has shifted to another task. Beyond 10 ish minutes or whatever, I can’t remember what the number is here, sorry, but beyond 10 minutes or whatever, the developer’s not sitting there staring at the build. They’ve gone off to do something else, or they’re taking a break, or they’re like… They’ve task switched.

Now if you’ve shortened your build from 30 minutes to 20 minutes, the gain that you’re going to effect in terms of straight up throughput or productivity has very little to do with that 10 minutes and a lot to do with what else they shifted to and when they will return. How long will it take them to do the task resumption that’s required, the “What was I doing anyway?”

Those things are features of humans, not features of builds, but of course they’re critical to understanding the way that humans react to build latency changes. That’s one example, but I think we see that in other cases too.

Abi: You talked earlier about how you don’t, you can’t… We all agree that you can’t reduce productivity down to a single number. There’s all this data you’re collecting. How do you go about figuring out which numbers actually matter? I’m talking more in the broader context for the folks you work with across the organization. What are the numbers that you steer people toward paying attention to? I’m sure, of course it depends on context, but what’s the way you approach that? Maybe you can’t share specific details, but where have you arrived?

Ciera: You said the right word there, which is it does depend on context. That’s usually the first thing we ask people. What is your goal? The goal for different people is going to be very different. The goal for our VPs is usually… They’re just trying to get a very broad sense of how are things going right now, and is there a fire that I need to go look at? They don’t need all the nitty-gritty details about what the current level of build latency is, and build latency in this tool versus that tool. That’s way too low level for them. They just want to know the broad strokes, are things going as smoothly as they were before?

For them, we provide them with very high level metrics. But as we drill down, there are teams that are going to… The teams that are responsible for those tools, they do need more specific metrics for those tools. We always encourage people to… We follow a goals, signals, metrics approach. We ask them to first write down your goals. What is your goal for speed? What is your goal for ease? What’s your goal for quality? Write those down first and then ask your question of, what are the signals that would let you know that you’ve achieved your goal? Regardless of whether they’re measurable. Signals are not metrics. What would be true of the world if you’ve achieved your goal? At that point, try to figure out what are the right metrics.

Some of the signals may not even have metrics. That’s useful to know sometimes, that yeah, we can’t measure this thing, or maybe we can’t measure it now, but we can create a new metric. We can… Then you can start talking about like, well what’s the right metric? Do we build a survey for this? Do we do logs? But we try to encourage people to really think from first principles of what are they going to need for their goals, as opposed to just giving a blanket set of metrics for the whole company.
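The goals, signals, metrics approach described above can be written down as a small data structure, which makes the "signals without metrics" gaps explicit. This is only an illustrative sketch: the class names, the helper function, and the example goals are all assumptions, not an actual Google template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Signal:
    description: str  # what would be true of the world if the goal were met
    metrics: List[str] = field(default_factory=list)  # may legitimately be empty


@dataclass
class Goal:
    facet: str  # "speed", "ease", or "quality"
    statement: str
    signals: List[Signal] = field(default_factory=list)


def unmeasured_signals(goals: List[Goal]) -> List[str]:
    """Signals that currently have no metric attached."""
    return [s.description for g in goals for s in g.signals if not s.metrics]


# Hypothetical goals for a build tool team.
goals = [
    Goal(
        facet="speed",
        statement="Engineers get build results quickly",
        signals=[
            Signal("Engineers rarely switch tasks while waiting on a build",
                   metrics=["build latency from logs", "survey: build speed rating"]),
        ],
    ),
    Goal(
        facet="quality",
        statement="Technical debt does not hinder the team",
        signals=[Signal("Engineers are not slowed down by old migrations")],
    ),
]

print(unmeasured_signals(goals))
```

Writing the signals first and the metrics second keeps the gaps visible: an empty `metrics` list is a prompt to decide between a survey, a logs metric, or simply a conversation.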

Collin: We do have an assortment of metrics that we hold up as a good starting point. They map to this speed, ease, and quality set of facets. Our philosophy there has been to lean into the metrics that we have a lot of confidence in. We know what we’re getting from the data side. We’ve done that validation work to understand that it reflects what developers are experiencing and that we’re not overlooking these non-linearities, or these weirdnesses in how people experience the state of the world.

We do ask the question a lot, what is it that you’re trying to do, and which metrics make sense? We just also lean into this notion that you need a variety of metrics, both across speed, ease, and quality, and across objective/subjective, volume rates, to really understand what’s happening. What we end up doing with stakeholders a lot is pushing them. What would you expect to see? What movements and metrics would you expect to see as a result of that state of the world? We often challenge them with, what if that’s not true? What if you see no difference on that metric? What might that mean?

In fact, one of the ways we prioritize projects is if you don’t have a good idea what you would do if the negative case were the outcome, we’re not sure how we’re actually helping you, because we don’t want to just be in the business of confirming what people already think they know. We push them on what if the null hypothesis comes out? There is no difference. Or what if the negative hypothesis comes out? A wide variety of metrics across that speed, ease, and quality framework, ideally across subjective and objectively sourced metrics, with just nice coverage of the whole experience, and then a lot of thinking about how do these metrics reflect the world that you’re trying to measure and understand?

Ciera: I want to put out there too, it’s okay to sometimes just not measure productivity. This is why we ask what are you going to do with the negative hypothesis? Because we’ve had teams that wanted to measure productivity. When you ask them that question, it’s like, “Oh, well we’re going to do it anyway, because it’s a huge performance improvement. It reduces our machine resource.” It’s like, “Great.”

Collin: Do it anyway.

Ciera: Go do it then. You’re telling me it’s going to be a slight to positive improvement for developers, and you’re going to do it anyway because it’s definitely going to be an improvement to machine resources. Why are we having this discussion? Just go do it. And that’s okay.

Collin: It is. We also have this conversation sometimes. A team leader will come to us and say, “I want to measure… We’re really trying to buy down the technical debt in our code base. Can we run your technical debt survey to understand whether we’ve done that?” I’m like, “How big is your team?” They’re like, “It’s 12 engineers.” It’s like, “Why don’t you guys just have a meeting and you can talk about where you still see hindrances? Your engineers are experts at this. They’ve actually waded around in the technical debt in your code base. They don’t need to take a survey to understand that.” A survey is great if we’re trying to understand broad strokes. Is it migrations? Is it… What is the flavor of the technical debt that most plagues large bodies of developers? But on your team, you don’t need to survey your team necessarily, and you don’t need to even use logs-based metrics to understand them. Having a focused conversation with a meaningful framework of what are we trying to do? What does that mean? How do we implement it? That is better spent time than trying to measure imperfectly in some cases.

Abi: Yeah, I think this anti-pattern of like, “Oh, I need a metric for my team of four to understand something that you can just talk to people about.” It goes back to this mixed methods concept, just talking to people is a good method to start with.

Two things you guys brought up that I really appreciate. One was in your goals, signals, metrics framework, the focus on signals. What is the thing that would be observably different in the world, or in the environment, if the thing you were trying to measure were altered? The second piece I really appreciated was you challenging teams and asking them, “Hey, what action or decision is this measurement actually going to inform? Because if it’s not going to inform anything, then there’s questionable value in the data, or in the effort that it’s going to take to get that data.” I love how you start with the problem and the context and move into the measurement piece.

You spoke earlier about the quarterly survey at Google. I would love to ask you a little bit more about that. Tell listeners just a little bit about the high level process around that. Who designs that? Who runs it? Who does it go out to? What kind of participation do you see?

Collin: Yeah. We’ve been running the survey, the engineering satisfaction survey, for five, five and a half years now. Q1 2018 was the first one after we piloted it. It hits about one third of Google engineers every quarter. We try to stagger our sampling a bit so that we’re not asking people every quarter, how do you feel about your productivity? But we get a quarterly signal. We have a lot of engineers at Google. We can meaningfully sample a third of them at a time and get a good signal. As far as who designs it, it’s sort of an evolving product. Initially, a couple of UXRs, with lots of collaboration with our engineers, decided on what topics we wanted to address and crafted the survey and piloted it with engineers, and then launched it and iterated on it.
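One simple way to reach roughly a third of a population each quarter without asking the same person every quarter is a rotating cohort assignment. The conversation does not describe how the sampling is actually implemented, so this sketch is an assumption throughout: the hash-based cohorts, the strict three-quarter rotation, the function names, and the engineer identifiers are all hypothetical.

```python
import hashlib


def cohort(engineer_id: str, num_cohorts: int = 3) -> int:
    """Stable cohort number derived from a hash of the (hypothetical) engineer ID."""
    digest = hashlib.sha256(engineer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cohorts


def surveyed_this_quarter(engineer_id: str, quarter_index: int,
                          num_cohorts: int = 3) -> bool:
    """True if this engineer's cohort is the one scheduled for the given quarter."""
    return cohort(engineer_id, num_cohorts) == quarter_index % num_cohorts


# Made-up IDs; quarter_index simply counts quarters since the survey started.
engineers = [f"eng-{i}" for i in range(3000)]
for quarter_index in range(3):
    invited = [e for e in engineers if surveyed_this_quarter(e, quarter_index)]
    print(quarter_index, len(invited))
```

Under this scheme every engineer is invited exactly once in any three consecutive quarters, while each quarter still yields a sample of about a third of the population. A real program might prefer random sampling with exclusion windows instead; that trade-off is not covered in the interview.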

Over time, as it’s gained some visibility, we’ve had a lot of engagement from stakeholder teams. Eng VPs across the company who are interested in the productivity of their organizations are like, “Hey, it’d be great if the survey also included questions like this, or the tools that my folks use, or the processes that I’m most concerned about.” We’ve had this accretion of material from outside.

At some point, we actually hit a breaking point where we had accreted a lot of material and we needed to start streamlining. In the last maybe three years, we’ve been more in a mode of streamlining and trying to make the survey a little sharper. We’re always looking to evolve it, but every quarter there’s a dedicated UX researcher who executes the survey. There’s a dedicated engineer who supports the survey execution as well. We have a team effort to do things like work on refinements to the survey, triage requests for changes. Everybody gets in on analysis because there’s just so much data that comes out of it. We ask a lot of questions. It’s a pretty hefty survey. We’ve tried to-

Ciera: We also have a data analysis pipeline that we’ve built up to automate a lot of this so that we don’t have to just be playing in a spreadsheet the entire time. We pull in all the data from our survey tool, do all of the basic slicing and aggregations, and we can turn the data around in a couple of hours.

Collin: Yeah, yeah. We’ve built a lot of infrastructure around our survey. As you may know, executing a survey consistently and longitudinally is a challenging thing in and of itself. Having specific people assigned to that task is obviously beneficial. The engineering support for building infrastructure to automate key steps and to manage the data has been huge, and really just a sustained investment in the program has been a big deal.

I have said to numerous people in the last couple of years that one of my biggest takeaways from that program has been how undervalued consistency is. Default, no, we’re not changing our survey. We’re going to stick with what we know has worked and that we have longitudinal tracking for. There’s actually immense value in having that kind of inertia for an instrument, a measurement instrument. Yeah, we’re flexible and we do evolve and adapt it, but by default, saying, “No, no. What we have is working. We’ve already measured it for X quarters. Let’s see how that number changes and we can talk about changes in the future if that’s necessary.”

Abi: Is part of your analysis… How do you deal with the distribution of the information back to… Does it go all the way back to engineers? Is it everyone? Is it just leadership? Is it presented? Or do they just get all the data themselves? How’s the data used? What’s the organizational communication or follow up on it to sustain this program?

Ciera: We have a pretty big emphasis within the team about being transparent about how we utilize this data. First of all, no data ever leaves our team un-aggregated. The responses are all private. Once we aggregate the data, it goes out onto dashboards. We do give some early access for teams that have signed up because they actually are needing the data very quickly to score their own quarterly goals, but the rest of the data though goes out to VPs and to the individual contributors at the same time. They can all see it, they can all see the aggregations, they can all play in our dashboards. They can all query the table, the aggregated tables, and learn from it.
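As a rough illustration of the "aggregate before sharing" practice Ciera describes, here is a minimal Python sketch. The field names (`org`, `score`), the 1 to 5 Likert coding, and the minimum group size are all assumptions made for the example; the conversation does not describe Google's actual schema or thresholds.

```python
from collections import defaultdict

# Hypothetical threshold: groups smaller than this are withheld so that
# individual answers cannot be inferred from the aggregated table.
MIN_GROUP_SIZE = 5


def aggregate_satisfaction(responses, min_group_size=MIN_GROUP_SIZE):
    """Roll individual Likert responses (1-5) up into per-group summaries.

    `responses` is a list of dicts with hypothetical keys "org" and "score".
    Only aggregated rows are returned; raw responses never leave this function.
    """
    scores_by_org = defaultdict(list)
    for response in responses:
        scores_by_org[response["org"]].append(response["score"])

    table = {}
    for org, scores in scores_by_org.items():
        if len(scores) < min_group_size:
            continue  # too few respondents to publish safely
        satisfied = sum(1 for s in scores if s >= 4)  # 4 or 5 counts as satisfied
        table[org] = {
            "n": len(scores),
            "pct_satisfied": round(100 * satisfied / len(scores), 1),
        }
    return table


example = (
    [{"org": "A", "score": s} for s in [5, 4, 3, 2, 4]]
    + [{"org": "B", "score": s} for s in [1, 5]]
)
print(aggregate_satisfaction(example))
```

In this sketch, org "B" is dropped because it has only two respondents, while org "A" is published as a count and a percentage rather than as individual answers.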

Collin: Yeah, some of this is about the history of the survey program. Early on, not necessarily everybody was bought into the idea of a developer satisfaction survey. At that time we could have shared it very, very widely and everybody would’ve been crickets, but we focused on getting key people to buy into it. We did just start a practice of sharing broadly because we could.

At this point, as Ciera said, we still share a report every quarter with anybody who wants to see it. The dashboards are available to anybody who wants to play with them. We have seen, not only does that foster a lot of goodwill, that transparency, but it means that there’s a lot of DIY work that can go into executing on that data. We’ve had folks in specific product areas or specific teams who use the aggregated tables that we make available to build their own dashboard that pulls in data from other sources that are meaningful for them, or that corroborates problems they know that they’re trying to tackle.

We’re a large team, I guess for what we do. We’re like 18 or 20 people, but we’re not that big. We can’t do bespoke analyses for every Eng team that’s out there that wants to do them, but they can do them for themselves when we provide them access to the data, as well as all the information about how the data were collected and what questions we ask precisely and really are open about those things, which we are.

Ciera: It’s been interesting too, how much people have really gotten into this. I’m always surprised to find new slide decks, in different places in the company, where people cite our survey data without our having talked with them. We didn’t even know about it, but because the survey is so widely used, they’re continuing to go back to it to either understand their team or the product that they’re building for other developers and say, “Hey, we need to invest more here, or we need to shift our focus on this particular area,” and they come back to the survey for it.

Collin: The one other thing that maybe we haven’t mentioned is that… The survey has a lot of structured questions in it. How satisfied are you with X? There’s a Likert scale, but it also has a fair number of open-ended questions. We’ll ask, how satisfied are you with the following engineering tools? If they say less than satisfied with some of them, there’s a follow-up question that’s like, “Hey, you said that you didn’t like some tools. Tell us more.”

Then our very engaged and thoughtful engineers will write long structured paragraphs about their grievances. That’s great. That’s so useful, because then the teams who have had their tools put in that negative light, yeah, that’s not super fun, but they’ve got this source of direct qualitative information about what is causing so many problems. The open-ended text questions have been a real goldmine for some of the product teams that either don’t want to or can’t run their own user research because it’s an internal developer tool and they don’t have a UXR, or whatever, but we provide fairly nicely organized and sanitized open text for them to mine and to understand so they can orient their roadmaps around it.

Abi: It’s really interesting to hear how you mentioned the buy-in for the survey has grown and usage of the data has grown organically. Part of probably the challenge at the beginning, and we’ve talked about this in other conversations, but is that leaders, especially engineering leaders, I think, tend to by default be a little bit skeptical of survey-based measures. I would love to ask you about your experience with that, whatever you can share. But then also, just advice for others out there who may be trying to get a developer survey program off the ground and are dealing with that exact problem. How do you get buy-in for a survey program? How do you get trust, or educate people on the validity of survey-based data and measurements?

Collin: Early on, we had a lot of conversations about the points I made before, which is there are things that are very difficult, or maybe even impossible to measure in other ways. Don’t you want to hear about them? It’s not an algorithm here. There’s no shared path to buy-in. I think the consistency point I mentioned before where executing consistently longitudinally, that helps a lot.

A one-off survey, you’re not going to get buy-in right away probably. Actually, a lot of what people are interested in is changes. We launched a feature in Q1, did that change this metric? It’s amazing how fast people buy into survey data when they see an outcome that they were hoping for. Of course, that doesn’t always happen. Survey data can be noisy. Even with large samples, we see fairly broad confidence intervals on some of our questions.

That can be a difficulty. Stakeholder engagement and getting subject matter experts’ input, so you’re asking the right questions, using the right words, providing the right options, that’s a key thing. You can lose credibility with engineers very quickly if you ask a question that’s a little off target or doesn’t reveal knowledge of the underlying domain.
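To make the point about confidence intervals concrete, here is a small Python sketch that computes a Wilson score interval for the share of respondents who answered favorably. The choice of the Wilson method and the sample sizes are assumptions for illustration only; the conversation does not say which interval method the team uses.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical numbers: the same 50% favorable rate at two sample sizes.
for n in (100, 2000):
    low, high = wilson_interval(n // 2, n)
    print(f"n={n}: {low:.3f} to {high:.3f}")
```

The interval narrows as the sample grows, but a question that only a small slice of respondents sees can still carry a wide interval even when the overall survey is large.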

Ciera: I was thinking about this thing about specific VPs that we convinced, who actively were not sure about survey data and then became some of our biggest fans and what happened there. A lot of it was working with them to help them understand how to utilize it. We encourage them to go to the survey data first, because if you just go look at logs data, logs data doesn’t really tell you whether it’s good or bad.

We mentioned that we’ve got this active coding time metric. We know the active coding time it takes to make a change at Google, to make each change list, but that number is useless by itself. You don’t know, is this a good thing? Is it a bad thing? Do we have a problem? Who knows? What we encourage our executives to do is go to the survey first, see where your top problems are, and we highlight to them, “Hey, it looks like the top five problems are the following things. Maybe one of those is something you think, ‘Oh, I can make impact on this one.’” It might not be the top one. The top problem might be something that you cannot independently yourself change. It might be something that’s very expensive to change, but maybe problem number 2, 3, 4 is cheap to change and it is in your control. Great.

Now that you know it’s an issue, now go look at the logs data and now try to see, “Okay, how big of an issue is this at scale? What’s the number we want to set our goal to?” We started convincing a few VPs to approach it like that and they really liked it. It works for them well. They’ve now gone through this a few times saying, “Yeah, I just find my next big problem, focus on that, see it improve in the logs, and then later in the survey as it starts to actually fix things for the developer. Now let’s go take the next big problem.”

Collin: Ciera’s also implicitly mentioning a thing that’s important, which is survey data are a lot more convincing when they’re corroborated by other data sources. When we go and we say, “Hey, we’ve measured self-reported productivity. Also, we have this logs based measure that was not survey based that says similar things are happening,” that immediately gains confidence for both metrics. When we see that agreement between behavioral research methods, or survey-based research methods and objective quantifiable research methods, that also bolsters the credibility of the whole program.

That doesn’t mean that has to be the way you gain credibility for survey research, but I think it sure does help with engineering leaders especially. It is a fast track to credibility when you say, “Hey, we have this measure of engineering velocity, it’s survey based,” and they’re like, “Hmm.” But then you say, “Hey, look, it actually… It’s predicted by these three throughput and velocity measures that we extract from logs.” They’re like, “Oh, okay, maybe I want to pay attention.”
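One simple way to check whether a survey measure and a logs measure tell a similar story is to correlate them across teams. The Python sketch below computes a Pearson correlation by hand; the team-level numbers and metric names are invented for illustration, and the conversation does not specify the statistical method the team actually uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-team data: percent reporting high velocity in the survey,
# and median hours of active coding time per change taken from logs.
survey_pct_high_velocity = [72, 65, 58, 80, 49]
logs_median_hours_per_change = [3.1, 3.8, 4.6, 2.7, 5.2]

r = pearson(survey_pct_high_velocity, logs_median_hours_per_change)
print(f"correlation: {r:.2f}")
```

With these invented numbers, a strongly negative value would mean teams reporting higher velocity also show shorter coding time per change, which is the kind of agreement between sources that Collin describes. Correlation across a handful of teams is only a first check, not proof that either metric is valid.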

Abi: Love that. Collin, I want to ask you one more follow up question since you’re the psychologist here in the room. As you’ve interfaced with leaders, not just at Google, but all across the industry, is there one thing you feel like is just fundamentally misunderstood about surveys as a measurement instrument? If so, what would that thing be?

Collin: My glib answer is that people are under the misimpression that it’s very easy to run a good survey, when in fact the easiest thing you can do is run a terrible, terrible survey. Really, there are people who’ve dedicated their careers to survey construction. I’m not even one of them. I just have some training in that area. But I think people misunderstand how difficult it is to construct, execute, and analyze a survey effectively. That’s one thing that I think is a misconception.

The conversation we just had about convincing Eng leaders about the validity of surveys, or any qualitative research, that problem does exist elsewhere. I’ve done UX research for aerospace engineers, I’ve done it for radiation oncologists, and I’ve done it for software engineers. All three are disciplines where there’s a lot of deep technical expertise and a lot of leaning into what the hard data say. The story’s roughly the same.

Emphasizing that, with a well-constructed survey, you can quickly learn things that are difficult or impossible to learn from objective data, that’s useful. I think pointing out to them that they’ve hired excellent engineers or oncologists or whoever, excellent technicians, excellent experts, and that asking them for their expert opinion is really valuable. Actually, when it’s really presented that way, it can be pretty persuasive.

I remember having a conversation with Ciera early on about why it’s useful to ask engineers these things. We go out as a company and we try to hire the best engineers that are out there, the smartest people that we can find, and then we’re like, “Ah, don’t ask them any questions.” That’s really silly. Our engineers are really great integrators of information. They’re observing all of these variables that impact their productivity. When you ask them, how’s your productivity? They’re rolling that up in a way that’s hard to understand and it can be a little bit messy because they’re human beings, but also, they’re really heavily integrative and they’re considering a lot of factors simultaneously in a way that is tailored to their own productivity. There’s value there.

Ciera: They’re not even just… They’re passively doing this, but they’re also actively doing this. Our engineers here are very much reflective practitioners. They’re frequently thinking about their own productivity and how to improve and how do they improve their team’s productivity. When we ask them, it’s not exactly just an idle thought. They put previous work into this and they have opinions that they would like us to know.

Collin: One of the things I love about working in UX research for technical user bases is the interest in self-improvement, self quantification and incremental change. If I were to tell a consumer of a digital product out there in the world, you can do that task 2% faster. They might be like, I wasn’t even aware I was doing that task, or that task isn’t important to me, or I don’t care how fast I do it. If I tell a Google engineer, you can code 2% faster if you do X, they’ll seriously consider doing it. They want to know more about how we know that. The engagement and the interest in optimization is just a whole different story in a technical population.

Abi: Well, thanks for these tips and these thoughts sharing the approaches and experiences you guys have had. I think this will be really helpful for both leaders who are on either end of that discussion, whether they’re trying to get buy-in for survey-based methods at their organization or if they’re on the other side of the table and skeptical about these methodologies. I hope that some of the things shared here can be helpful.

I want to move into talking about a very recent paper you co-published called A Human-Centered Approach to Developer Productivity. Really loved this paper. It referenced stuff that I hadn’t even come across before, even that Peter Drucker quote about knowledge worker productivity. Loved that. I went and read that paper. But just for listeners, this paper really talks about the challenge of measuring developer productivity, how we commonly get it wrong and points toward directionally how we should consider thinking about measuring productivity.

I want to start, and by the way, all listeners should go check out this paper. I’m sharing this paper with people all the time, but I want to ask you both, where did this come from? Why did you feel this paper was needed?

Collin: Yeah, I think a combination of experiences. One of the things that happens of course is that somebody wants to measure productivity and they look at what’s in front of them and they select somewhat out of convenience. What am I already measuring? Let’s count that thing. That leads them to these places, what Ciera was mentioning before. They’re over-indexing on one narrow metric potentially. They’re not even thinking about that metric in relation to their goals.

A little bit of it was trying to springboard into the, “Hey, productivity is a complex thing. You need to think about it holistically and in a multifaceted way.” The other thing is that, with the Drucker reference and also the reference to Taylorism in that paper, there’s a little bit of a desire to take a systematic and scientific approach to measuring productivity, which I’m totally for. That’s what we do, but that doesn’t mean a reductionist approach. It doesn’t mean that you can rely on productivity analysis methods like scientific management, which was invented in 1900, to understand what software engineers do.

It’s not that anybody actually believes that shoveling coal and software engineering are fundamentally the same on that front. We sort of joked a little bit about people not believing that software developers are human beings. That’s not practically an issue, of course, but when we try to address a hard problem, one of the things we do is simplify. One of the things we do is put constraints on ourselves. That can lead to this place of oversimplifying, especially knowledge worker productivity.

It was a set of conversations that Ciera and I had over a long period of time. I think at some point we wrote down engineering productivity for humans rather than for robots or whatever. It sat idle for a long time and we poked at it and sent it to a colleague of ours and he poked at it. At some point we had this opportunity to write for IEEE Software. That seemed like a good framing, a good starting point for the work that we do. It’s bringing that human element, that behavioral science, as well as mixed methods approach, into understanding this complex thing that is productivity. I don’t know. I guess that’s my memory of it.

Ciera: Yeah, I think, Collin, you’re right. My recollection here is that it was the series of conversations that you and I would have of coming back together in one of our meetings and going, “Oh, look, someone forgot developers are human again.” It was sometimes from things in our own job, but it was also… I remember… There’s a series of three papers I was reading when I was looking at published research where I got frustrated because somebody would do some research, for example, to understand hindrances to developer productivity, and then they’d get a bunch of hindrances from developers and they’d toss half of them away. Well, those are fluffy human problems, basically. Set them aside. We’re not going to talk about that. I’m going, “Well, no, but these things are tied together. You can’t separate out hindrances to productivity into fluffy human problems that are HR things versus hard tool problems.”

They’re connected. I guess an example of this, something I actually saw a paper reference. They talked about code review. I remember they were talking about how sometimes you might send your code review to somebody who is on vacation. They kind of tossed that as human problem. I’m like, “Well, no, no, no. That is also a tool problem.” Actually at Google, we even have a way of fixing this. It was very nice and simple. Our code review tool gives you a warning if your reviewer is on vacation. It lets you know that maybe you should assign this to someone else. If you ask it to auto assign, it won’t pick someone who’s on vacation. That’s a tool fixing the human problem. These things are tied together. Humans react to their tools, change the tools, it changes how the human works with it. They’re all the same.

I disagree. As another example, actually, if you look at our paper on anonymous code review, that’s another one where, yes, bias is a human problem, but we can address it with tools and then it’s not a problem anymore.

Collin: Less of a problem, maybe.

Ciera: Less of a problem. We’ll go with less of a problem. These things play together. We can’t just pretend that we can focus only on tools. We have to know that developers are humans and they’re going to react to them.

Collin: Yeah, I think you can also think of this as the tools, the infrastructure, the engineering processes we put in place, that’s like the structure that humans are working in, but humans aren’t only a consequence of their structure, their behavior is a consequence of what they bring to the table from outside as well as that structure. But we can look for structural solutions to some of the human problems that come in, and Ciera’s just given you a couple examples.

The other two important points are that the human problems often swamp the technical problems. If you’re super stressed because you have a childcare issue or because you have somebody sick in your family or whatever else is going on, you have a medical problem, those are human problems. Yeah, we can’t solve those directly, but they sure do make a huge difference in things like productivity. It’s not even that we can necessarily solve all the human problems, but if you fail to account for them, you’re missing a big piece of the puzzle.

Finally, I think the other thing is that sometimes the narrowing of the focus to what we can count or single metrics, it’s easier to lose empathy for the people that we’re talking about. Ciera and I spend a lot of time talking about privacy for our participants and privacy for our engineers. She mentioned we only share aggregated data. That’s because we want to evaluate engineering tools, infrastructure processes and practices. We’re not here to evaluate engineers. When you start to do things like count the lines of code somebody’s produced, a leaderboard is the next logical conclusion. But there’s not actually a lot of utility there from a how do I change the structure of an environment point of view? It’s not a good way to systematically improve engineering productivity, but it’s easy to forget that when you can just stack people up. Nobody at Google does that. We actively discourage it, but it’s a thing that would be easy to slip into if you forget that those are human beings on the other side of this number.

Studying the world rather than just the data is a really critical thing that all researchers need to keep in mind when they’re doing this kind of stuff. It’s not just numbers that we’re studying. It’s people, it’s an organization, it’s the world that we’re studying.

Abi: This problem of leaders, organizations, picking a convenient metric like commits, lines of code, pull requests, something like that, this is happening all over the place. You’re, of course, not the only people talking about this and warning people about this. A lot of my research focuses on that. A lot of people we know are writing about this and talking about this. This is a topic that comes up a lot on the show. Sindu wrote a paper about this. I just want to ask you. Explain to listeners the problem with using the reductionist metrics, number of commits, or pull requests to measure productivity or drive productivity.

Collin: I don’t think there’s one problem with it. I can think of a couple. One of them is just… There’s a human being on the other side of it. Set the empathy part aside, that human being is not just going to sit around and be at the whim of the counting. Goodhart’s law is about metrics, once they’re… Once you measure it, people want to game it, basically. I can’t… What’s the right phrasing of the law? Every measure fails to be a good measure or something.

Abi: Right.

Collin: You get the point. The idea is that if I just count one thing and there are human beings on the other end of it who are incentivized to look good on that single dimension, it ends up being, essentially, I want to say deformed by that process. If you tell engineers that more lines of code equals more productivity, they’re incentivized to write more lines of code, and they’re going to add lines of code to their changes. Not because they’re malicious, just because that’s the incentive structure you’ve put in place.

Ciera: You’ve also now disincentivized other types of work. I don’t think anyone would argue against, we want tech leads at Google, especially to be thinking about the overarching design and software architecture. Well, that’s not writing lines of code. If we would like to keep having good designs and good software architecture and high quality code, we probably don’t want to be measuring them based upon their LOC.

Collin: Yeah. You also fall into this trap of undesirable trade-offs. If throughput of lines of code is what you want to optimize for, are you not going to measure the quality of those lines at all? Are you not going to measure their reliability, their robustness, how many bugs there are? If you’re going to try to simplify to one or two convenient measures, are you adequately capturing the trade-offs that really exist between, for example, the velocity of your engineering and the quality of what comes out the other side? Expanding that again back to the human beings, if you’re in an environment where you’re like, write code fast and it better be good, what is your attrition going to look like long term? Are people just going to get burnt and be like, “I’m going to go find a job where it’s just less crazy”?

We don’t tend to touch attrition on our team too much because there’s a lot of sensitivity around it, but those things happen because there are human beings on the other side of these strategies. The measurement strategy, especially if it’s reductionist and let’s say just myopic, it’s more subject to the gaming that happens from the other side. It misses these critical trade-offs that you really do care about as an engineering leader. It can become insensitive to the reality that there’s a human on the other side who has to work on the other end of your measurement strategy.

Abi: Let’s see. This understanding, these ideas you’re sharing, I’m always quoting you guys, as you both know. I love your papers. I was sharing this paper with a group of folks a few weeks ago. One of them was like, “Oh my gosh, the fact that this needs to be said by researchers at Google is just… Shows the sad state of our industry right now.” I kind of want to ask you, why is this so difficult for us as an industry to just come around to understanding this and putting this practice to bed?

Ciera: Because we’re also humans and we would like an easy answer. Unfortunately, that’s what Collin and I are saying, is this is all nuanced and hard to do. It takes significant effort. There’s not an easy, simple, just look at that one graphic and know everything is okay. I don’t like that either. I would love it if it was simple and easy, but the world is a complex place.

Collin: Yeah. I only suspect that this is a pervasive problem. I haven’t been out at all the big tech companies to observe this going on in their engineering organizations. We see shades of it occasionally. When we look at the literature and outside writing, you see a little bit of this reductionism creeping in.

Why is it hard to come to terms with? Ciera’s right. We’re human beings too. This is a hard problem. It’s not unique to engineering productivity either. Measuring business performance has a lot of the same issues. We’re more comfortable with evaluating businesses on a narrower set of metrics. There’s a lot of people who are heavily invested in quantifying a business’s performance in various ways, but it’s not like those are perfect either. Revenue per headcount is probably not the right metric for efficiency for every business, but we use it a lot, we globally.

A lot of these problems are hard and human beings do want simple answers. We want to consider two, maybe three things instead of 12, so we reduce the field. We put constraints on it.

Abi: On that note, Collin, Ciera, both of you, when we were talking earlier, shared examples from other knowledge worker professions, or other industries that have struggled with, or maybe overcome to some extent, some of these similar challenges. Share with listeners a little bit about what we can learn, or the mistakes we can learn from, in some of these other fields.

Collin: Yeah, I think we talked earlier about medicine a little bit. I spent a few years working in UX research for a medical company. That field definitely has a great interest in things like productivity, or efficiency, or efficacy, but there’s also a lot of money involved in healthcare and people have to make hard decisions. The example I was giving was about investing in a piece of equipment like a linear accelerator, for example, for radiation treatment. That’s a multi-million dollar investment. A hospital administrator has to make that decision to make that investment for their clinic or their system or whatever. They have to think about, how is that as a capital expenditure? Am I getting my money’s worth when I invest $10 million in this thing?

You’re inclined to ask questions, “Well, how many patients per day can I treat on it?” Medicine is good. They don’t over index on those things. They look at them, but they also look at quality of life. They look at efficacy of treatment. They talk to their clinicians about the quality of the care, the patient experience of those things. They have tried to really do a good job with capturing this holistic experience and looking across many metrics of many flavors to understand what’s happening. It’s not a solved problem. Anybody who’s in medicine would say there’s still a lot of challenge there. But I think they have been really thoughtful about it because to their credit, a lot of the people who go into that field, they want the best holistic outcomes, especially the best holistic clinical outcomes. They’re working within those constraints.

The one thing to take away from that is that it’s hard and that we need a very well-rounded approach that looks at a lot of different dimensions.

Ciera: I was at a seminar recently where we were talking about standards of evidence, and medicine was one of the ones that came up. The other two that came up were psychology and education, and these fields are not perfect but they’re farther along than we are as a field in trying to say, what do we really know about what’s effective? They have created standards of evidence for their research, and then they try to publish out the research and standards of evidence in a way that people can look it up.

One of the examples we were exploring in that seminar was looking at the What Works Clearinghouse, which is a government run website for education. The idea is that teachers or educational researchers can go and look up and say, “What does work for teaching mathematics to this age range that is behind in mathematics at this level, or that has dysgraphia or whatever?” They can find out, what’s the best practice right now? We don’t have that in software engineering yet. We don’t even have standards of evidence yet. After that, we need to actually create a way of saying, okay, I want to make it so that every software engineer can go to a single website and say, yeah, what do I need to consider to improve my productivity in this particular field, in this particular space? But we’re pretty far behind them. We’ve got a lot of catch up.

Abi: That’s interesting to think about. Yeah, something I just think about is just how young this field is. Yeah, we’re behind, but this isn’t… People have been building homes and operating on… That sounds grim, but medical practice. Around a lot longer than software development. I want to, in this last part of the conversation, move into talking about this just recent paper you guys published, another amazing paper around defining, measuring and managing technical debt. I think this will be a great way to tie together all the themes we’ve been talking about today.

First off, you defined technical debt, or at least broke it down into its elements, which I think is a remarkable achievement. I want to recommend that all listeners go check that paper out. One thing you start off in the paper talking about is, and by this point in the conversation listeners aren’t going to be surprised by this, but is that you’ve been measuring tech debt with surveys for a while. For some people that I’ve shared that with, that’s been a very surprising, new concept. Can you just explain? We’ve talked about survey methodology in general, but how do you think about an approach to measuring tech debt with surveys?

Collin: We started measuring technical debt with surveys in 2018. It was shortly after we started running our quarterly survey. One thing that, I can’t explain why, but one thing that our team has collectively been good at is sniffing out the next thing that leadership is going to be interested in. A good nose for the next question of interest. Tech debt was definitely… There was a strong whiff of tech debt in the air.

We started just trying to break down technical debt as engineers define it. This is a departure from some people’s approach, which is to just say, here are the kinds of technical debt that I know exist based on a formal understanding of software engineering. That’s a reasonable approach, but the scent we were getting from leaders was, I’m paraphrasing, engineers keep saying technical debt is a problem. I don’t even know what they mean. Is it everything? Is it a specific thing? What are they even complaining about that I can fix? It was from a good place, a desire to help, a desire to take action.

Our first foray into measuring technical debt was just to try to ferret it out. When engineers say they’re hindered by technical debt, what is it that’s actually hindering them? It took us maybe three quarters of incremental improvement in our survey and some factor analysis and unpacking open-ended texts, but we arrived at a set of, I think it’s 10, is it 10 kinds of technical debt, that we thought were emerging as consistent themes that engineers were referencing, but also somewhat independent of each other. We really let the data tell us what do engineers mean when they complain about technical debt?

Now we still ask that question. You said you were hindered by technical debt, which flavors were you hindered by? That lets us understand where the big challenges are. It also lets product areas or teams know, “Hey, my engineers are saying they’re bothered by technical debt. What flavor is it and what can I do about it?” If it’s a migration or if it’s dead code, or if it’s whatever, I know then what action to take.

You may know that there’re not a lot of really great effective ways to measure technical debt. We let the data guide us in a survey form to understand what engineers were talking about. From there we had a jumping off point to do more analysis work. Ciera was involved in trying to use that as a basis for then going to objective metrics from the code base and saying, can we see these things from that angle? Now that we have a bunch of engineers telling us what kinds and where to look, can we find it in the code base? If you want to talk about that part of the paper.

Abi: Ciera, certainly there’s probably a research question around, as we have developed this greater understanding of what technical debt is and we’re getting signal on it through the surveys, how do we get the more objective measures? But beyond the research question, what else prompted the investigation? I believe there were also just some limitations of the survey signals you were getting and maybe again, some of the desires just by leaders to have more leading indicators.

Ciera: It wasn’t even just leaders, actually, I will note. It was ICs across the company too. Everyone was complaining about technical debt, but first defining what that was. And then there was a sense of, I had a lot of people phrase this to me as I want to see a heat map of our code base, and just tell me where the hotspots are. Where do I need to pay attention to? Leaders wanted to see the big wider code base, but even within a team, people were saying, can I just… Can you show me where the hotspot is in mine? Or can you show me that I have a feeling that my team is a hotspot compared to the rest of the company, and I’d like to back that up so I can buy my team some time and tell our leadership, “Hey, we need to take a quarter to turn down our technical debt,” but they didn’t feel like they had the data to support that.

That was where that research started from: let’s see if we can have a logs-based metric that agrees with the survey data and says this is roughly where tech debt is in the code base. Unfortunately, that did not work as we would’ve hoped. I don’t think it’s impossible in the future, but I think the metrics that exist right now, both within Google and outside of Google, just aren’t really representing technical debt. Things like the number of to-dos do not accurately represent how people think about technical debt. Number of lint errors I know is another one I’ve heard people talk about, like, “Oh, if you’ve got a lot of lint errors, that’s tech debt.” Well, not really. That doesn’t correspond with engineers’ perceptions either.

We did look at a wide variety of measures and we didn’t find anything useful. I don’t think that means that it’s impossible that there are measures out there. I think we need to keep exploring. That’s a place where there needs to be more research.

Abi: When you say you didn’t really find something that turned out to be a useful signal, can you just share for listeners, what kind of statistical analysis? What was the methodology to actually determine that? Or broad strokes, just even in layman’s terms.

Ciera: We were looking for correlations between the code that an engineer touched and how that metric looked for that code versus what that same engineer was responding on the survey. I don’t remember exactly which statistical method we were using. That was a while back.

Collin: Yeah, I think the challenge, one of the challenges of doing this part of the project was of course we ask engineers about their technical debt on a quarterly cadence. We’re asking an engineer, “Hey, how hindered were you by technical debt in Q1?” That’s a long time. And then we have to go back and say, “Well, which parts of the code base did they touch in those three months?” There’s a lot of uncertainty involved there in trying to discern well, where exactly might they have encountered it? Of course, all the complexities we associate with human research. Do they remember three months ago versus one month ago? Should there be a linear relationship between how much technical debt they encountered and how often they report it? You have to make assumptions there and some of them are probably wrong. There’s a lot of noise in that signal.
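The comparison described above can be sketched in a few lines. This is only an illustration, not the team's actual analysis: Ciera notes she does not remember which statistical method was used, so Spearman rank correlation is an assumed stand-in, and the file names, TODO counts, and survey scores below are all hypothetical. The sketch averages a code-level metric over the files each engineer touched in a quarter and correlates that with the engineer's self-reported hindrance for the same quarter.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical inputs: a per-file metric, the files each engineer touched
# in the quarter, and each engineer's survey answer (1 = not hindered,
# 3 = very hindered). None of these values come from the paper.
todo_count_by_file = {"a.py": 12, "b.py": 0, "c.py": 3, "d.py": 7}
files_touched_in_quarter = {
    "eng1": ["a.py", "d.py"],
    "eng2": ["b.py"],
    "eng3": ["c.py", "b.py"],
    "eng4": ["a.py"],
}
survey_hindrance = {"eng1": 2, "eng2": 3, "eng3": 1, "eng4": 2}


def metric_for(engineer):
    """Average the code metric over everything the engineer touched."""
    return mean(todo_count_by_file[f] for f in files_touched_in_quarter[engineer])


engineers = sorted(survey_hindrance)
metric = [metric_for(e) for e in engineers]
reported = [survey_hindrance[e] for e in engineers]
rho = spearman(metric, reported)
print(f"Spearman rho between TODO count and reported hindrance: {rho:.2f}")
```

Averaging over every file touched in a three-month window is exactly the kind of assumption Collin flags as noisy: the engineer may have hit the debt in only one of those files, or may not remember it by the time the survey arrives.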

Abi: When you analyzed the objective measures and found that none of them really correlated, for both of you, was that outcome a surprise?

Ciera: That outcome to me was not a huge surprise. There were a few metrics that I was hoping might actually correlate, but I was also not terribly surprised when nothing did.

Collin: Yeah, I was hopeful that we’d find something, but this resolution issue, the quarterly cadence of survey reporting versus trying to find fine-grained stuff in the code, is a pretty big obstacle. At some point we talked about doing something more experience-sampling focused, where when somebody submitted a change we’d ping them right away and say, “Hey, you just submitted a change. Was there a big challenge with technical debt or unnecessary complexity in the code that you were working with?” We talked about trying to home in on specific change lists and self-reported technical debt challenges with that change list. I don’t think we’ve ever… We’ve not done that at scale.

Ciera: No, we haven’t had time to go back and pick that up again, but it’s something that I think someone out there could definitely work on. The bigger surprise, to be honest, for me on the entire tech debt stuff was just the fact that we could find 10 different types of tech debt. I would not have been surprised, and I think a lot of people wouldn’t have been, if we had run those surveys and every single engineer said all of them: I’m hindered by all 10 of these things. That’s not actually what happens. I think that was the biggest surprise, that engineers actually had strong opinions that, “No, I’m only hindered by two of these.” They could point you to exactly those two.

Collin: Yeah, I think the other thing that was eventually surprising about the survey data is that once we had a pretty solid set of flavors of technical debt, we still have another option in there, so engineers can say none of these is right. I have another thing. I can’t remember what the proportion of engineers that select the other option is, but it’s in the tenths of a percent. It’s really small.

Ciera: Is it less than that? Oh wow.

Collin: Yeah. It’s very small. We have these 10 flavors. Not everybody picks all of them. Also, we seem to be covering the space of what people are complaining about pretty well, which was one of our stopping criteria for deciding we had a good question.

Abi: What you’re telling me is you two should be Nobel Prize candidates because you’ve successfully defined technical debt, which I feel like is as elusive as productivity data itself. One of the things that I really loved about this paper was, in the latter half of the paper, you discuss the fact that you weren’t able to find objective measures that correlate and you suggest that… You explicitly say, I’ll read the line from the paper. You say that this points out that human cognition and reasoning play a big role in developer productivity, particularly because the conception of an ideal state of a system and using that imagined state as a benchmark against which the current state can be judged or measured. That was such a profound couple of sentences you had in your paper. I love it because it really resonates with me, this idea that so many things in software development are intangible, that we can’t see them in our logs, and that they can only really be understood and measured by using human judgment, in this case asking people questions and measuring against this imagined state. But I also feel like this is a pretty abstract concept, so I want to try to break it down a little bit, just discuss it here to help listeners understand and appreciate it more. Yeah, help me do that. Maybe Ciera, starting with you, explain what you’re trying to convey here.

Ciera: What we’re trying to convey is really just that the engineer’s ultimately the judge. One engineer might view one project as having technical debt and not another project, even if some underlying metric was exactly the same, simply because of the context it’s in. It’s more about, did it make sense for this code to look like this? Does it actually impact us in reality? When I think about this, a lot of times I like to go back to Martin Fowler’s technical debt quadrants, because those quadrants are entirely about human judgment too. Did we make a conscious decision to take on the tech debt? Then, was that decision good? That’s again, a value judgment that a human puts on this.

Collin: The example we give in the paper is about migrating from Python 2 to Python 3. That’s a really simplistic but concrete case. If your code base is all Python 2 and there is no Python 3, there’s no needed migration. There’s no “it could be better, but it’s not.” It’s just the Python 2 code base, perfectly reasonable, no technical debt.

But as soon as Python 3 exists in the world, all of a sudden our imagined state has changed and all of a sudden we’re not what we could be. That’s a very concrete example, but we can imagine ideal states that don’t have anything to do with the release of new languages or whatever. But I think that it’s that disconnect between what could be and what is. That expert judgment is a really critical thing to technical debt. Martin Fowler’s quadrants are a great riff on this, but even Ward Cunningham’s original conception of this. He was using financial debt as a metaphor and the metaphor is useful.

Nobody would argue that financial debt is inherently good or bad. His point was it’s a tool. This is a way you think about how you prioritize, how you scope projects, what you invest in rapidly versus robustly. There’s no unequivocally good or bad for going into debt. It’s really about, do you have a plan to repay it? Is the investment going to pay back a return? Are you going to get more out of incurring this debt than you will lose by having to pay the interest on it?

That’s the crux of the metaphor that’s sometimes lost. Technical debt is not just that code is bad quality, or even that it’s just brittle, or even that it’s just old. It’s that the debt was incurred for expediency or for convenience or for short-term gain, and we maybe didn’t have a good long-term plan for management. The technical debt management framework that we reference in the paper, which our colleagues at Google have put together, is a lot about process. It’s a lot about track what technical debt you’re incurring, be thoughtful about it, create a plan to track it and pay it down, budget the resources to do that, and just be intentional about this process. Nobody says have zero technical debt. That’s not a thing, but have technical debt that you have a good grasp on and you have a plan to deal with.

Ciera: This is why we don’t just simply ask people, “Did you encounter technical debt?” The answer is always yes.

Collin: It’s yes.

Ciera: The question is, did the technical debt you encountered hinder your productivity? That’s where it’s a problem that we actually need to solve.

Abi: Yeah, I love that. This idea of… Yeah, I love the Python 2, Python 3 example because it’s so concrete: the existence of a better way, or a better imagined state, completely changes our evaluation and judgment of the current state. I imagine this… We’re talking about technical debt now, but this applies to a lot of the survey-based measurements, even around speed and ease, that you capture. I feel like this goes back to this idea that with an objective metric alone, it’s often very hard to define, as you said earlier, what good is. You rely on the counterbalance or the judgment of humans to tell you. I’m curious, have you actually put that to work in terms of using human judgment to give you a better understanding of what a good metric X or metric Y is?

Collin: Boy, for our survey-based metrics, we mostly look at the percentage of people who answer in a favorable way. We know what headroom we have until everybody is happy. Good, in those cases, is always a hundred percent of engineers reporting being satisfied with their velocity. That’s what good looks like and we know how we’re progressing towards that.

What realistically good looks like is more of an empirical question. We observe where it is, we observe how it moves, we have an idea where we might get. I think we have a few metrics where we have consistently high scores and we think those are probably good benchmarks for about as good as one might do on such questions. But yeah, that’s a little bit difficult sometimes.
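As a rough illustration of the "percentage favorable" idea, here is a minimal sketch. The response labels and the set of answers counted as favorable are assumptions made for the example; the conversation does not say how the actual survey scale is worded.

```python
# Assumed labels for a satisfaction question; the real scale is not
# described in the conversation.
FAVORABLE = {"satisfied", "very satisfied"}


def percent_favorable(responses):
    """Share of responses, in percent, that fall in the favorable set."""
    if not responses:
        raise ValueError("need at least one response")
    favorable = sum(1 for r in responses if r in FAVORABLE)
    return 100.0 * favorable / len(responses)


def headroom(responses):
    """Percentage points remaining until every respondent is favorable."""
    return 100.0 - percent_favorable(responses)


# Hypothetical answers to "How satisfied are you with your velocity?"
answers = [
    "very satisfied",
    "satisfied",
    "neutral",
    "dissatisfied",
    "satisfied",
]
print(percent_favorable(answers), headroom(answers))
```

The theoretical target is zero headroom, but as Collin says, what is realistically achievable has to be observed over time, for example by comparing against questions that consistently score high.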

For other metrics, we do a lot of this triangulation with self-report. We try to understand where the differences in the metric make a difference to the person. Build speed or build latency was a good example. But even the rate at which you can iterate a bit of code. That’s the thing we’ve been working on lately is how fast can you get feedback on coding edits and then build again and iterate. What does good look like for that? It’s really not clear. We can just be descriptive about what it looks like, what does the distribution of those durations look like? We can get some interesting inklings. Some things are so short that nothing meaningful happened in that time, and some things are so long that something’s gone awry and you’re measuring the wrong thing. But within a range, we know that sometimes faster is better, but it’s difficult to say how fast can you get before you get too short, or how long can you get before you get uselessly long. It’s a very challenging problem.

Abi: Collin, Ciera, it’s been a fascinating discussion. So interesting for me to hear about how you’re approaching your research, your methodology, how you’re measuring developer productivity at Google, and being able to discuss all these recent papers you’ve put out. Super excited to continue following both of your work. Thanks for all the work and all the public information you’re putting out. Really appreciated this conversation today.

Collin: Thanks so much for having us.

Ciera: Thank you. It’s nice talking with you about it.
