Podcast

How Airbnb measures developer productivity

Christopher Sanson is a product manager at Airbnb who is dedicated to enhancing developer productivity and tooling. Today, we learn more about Airbnb's developer productivity team and how various teams use metrics, both within and outside the organization. From there, we dive even deeper into their measurement journey, highlighting their implementation of DORA metrics and the challenges they overcame throughout the process.

Timestamps

  • (2:43) Who is the developer productivity customer
  • (4:49) The evolution of developer productivity at Airbnb
  • (9:26) Approach before DORA metrics
  • (14:29) Getting buy-in for DORA metrics
  • (17:49) Planning how to deliver new metrics to the organization
  • (21:12) How Airbnb calculates deployment frequency
  • (23:29) Implementing a proof of concept
  • (27:20) Statistical measurement strategies and tactics
  • (31:11) Operationalizing developer productivity metrics
  • (34:26) How Airbnb reviews data
  • (35:41) How Airbnb uses DORA metrics

Listen to this episode on Spotify, Apple Podcasts, Google Podcasts, Overcast, or wherever you listen to podcasts.

Transcript

Abi Noda: Christopher, thanks so much for coming on the show today. Really excited to dive in.

Christopher Sanson: Yeah, thanks Abi, happy to be here.

Abi Noda: So I want to start with the overview of developer productivity at Airbnb, specifically the developer productivity organization. Maybe start by sharing with listeners how the DevProd organization is structured, what it’s made up of, and what your mission is.

Christopher Sanson: Sure, yeah, absolutely. So DevProd at Airbnb is part of the larger developer platform team. So we have a developer platform team dedicated to helping our developers be as productive as they can, ship value to our users, enjoy their work, and feel empowered to do their best work. And so within Developer Platform, we have teams focused on developer tooling, CI, CD, web platform, service platform, core services, and developer productivity is part of the scope of that team.

Abi Noda: And how exactly does developer productivity then fit into that broader organization? I know you specifically, for example, are focused on insights part of what you do. So how does developer productivity fit within the broader org?

Christopher Sanson: Yeah, it’s part of just the larger charter, right? In order to understand what tools to build, the impact that we’re having, and how developers are being empowered at Airbnb, developer metrics play a key part of that. So we use it at multiple levels. And then we look at it holistically, right? So the developer metrics are one aspect, but we look at the whole picture. So we complement that with survey data, customer interviews, and customer feedback, and try to get a sense, from multiple angles, of the developer experience at Airbnb.

Abi Noda: Who do you consider to be your customer? And this answer I’ve seen depends on the different companies I’ve spoken to. But when you talk about these interviews, the insights, the metrics, is this strictly for the platform teams within Airbnb? Or is there a broader group of leaders that are wanting these types of insights?

Christopher Sanson: That’s a good question. So it’s all of the above. We have multiple stakeholders who want to see these metrics packaged in different ways. So as a developer platform team, of course, we use them ourselves in order to help guide our roadmap. The executive team wants the big picture around how we’re doing as an organization at a high level, like long-term trends, changes year over year, and things like that. And then the product teams themselves are also consumers. The larger product teams also have their own small platform teams that we work closely with directly.

And then, leaders at the product team level also want to get insights on their specific team, where they can improve, how they compare against other teams within the org, and things like that.

Abi Noda: In terms of your day-to-day work, then, how do you juggle those competing customers or stakeholders? How do you sort of prioritize, communicate, collaborate with these different stakeholders?

Christopher Sanson: That’s sort of the job. We do that with everything we do on developer platform, from feature development to developer insights.

We collect data at multiple levels. So, we’ll have a broad-based developer survey. We have DORA metrics and other sort of broad quantitative indicators as well. And then what we also tend to do is segment that data, either by product team or by developer role, like backend or web.

So we both have the big picture data as well as the ability to slice it into more narrow scopes for different audiences. And then it becomes a question of reporting it out.

So we’ll have dashboards that are dedicated more towards the big-picture executive level. We have more team-facing dashboards that go into more depth and look at more of the key drivers. So essentially we collect the data in a way that is packageable in different ways for different audiences.

Abi Noda: That makes sense.

Last year I had a guest on the show, Willie Yao, who was an early employee at Airbnb. Actually, I believe he became one of the first DevEx leaders at Airbnb. So I’m curious to ask about the evolution of developer productivity, developer experience as a theme, as a concern at Airbnb, generally speaking.

Who is thinking about and talking about developer productivity, DevEx at Airbnb? Is that primarily your organization or is this something your CTO, C-suite is constantly talking about? Where are the discussions happening?

Christopher Sanson: Yeah. I mean, it’s top of mind for everybody. So absolutely our leadership cares and talks about developer productivity and empowering developers to be productive. I think there’s a larger general trend and acceptance across the industry that this is really critical and impactful to the business’s success. I think DORA started this maybe 10 years ago, but it has sort of become accepted practice at this point that, no, actually, having productive developers leads to desirable business outcomes. So it’s really top to bottom: everybody cares and invests mental energy and resources into improving it.

Abi Noda: It’s funny that you mention that the principle or concept that better software delivery leads to better business outcomes is widely accepted. I hear from leaders all the time who are asking for advice on how we get leadership at our company to understand that. So I’m curious at Airbnb, is that just implicitly accepted, or is that a principle that comes from executive leadership or is there real data that’s discussed at Airbnb that sort of validates that proposition?

Christopher Sanson: Yeah, it’s just part of the culture, which comes from the bottom and the top and is built up over years. So Airbnb is a hospitality company. There are principles inside like ‘be a host.’ It tends to attract folks who want to help others succeed and to be successful. And so that applies to developer tooling as well. There’s a natural sensibility around that. The hard part comes down to quantifying the impact. That is still a challenge. Everyone sort of knows that developer productivity leads to business outcomes, but it’s still a large challenge, I think, for everyone, around measuring the impact of it. What’s the ROI of this project on business value? Or how many more features can we ship with this productivity gain? That’s where it gets really challenging to get down to specific numbers.

Abi Noda: And of course, you’re on this journey of trying to answer some of those types of questions. And today we’ll get into the nitty gritty journey of the road to implementing DORA metrics at Airbnb.

Before we dive into that story. I want to ask you, and you’ve already touched on this a little bit, but at a high level, what’s your, or Airbnb’s, point of view on how to measure developer productivity or what types of insights you should get on developer productivity to really get a firm understanding?

Christopher Sanson:

Yeah. So again, as we talked about before, we look at developer product metrics as one part of a larger story. I don’t think you can look at them in isolation and just say like, oh, let’s just keep improving this number 10% year over year, and that’ll be enough.

We use the quantitative data to complement things like survey data, customer feedback, and larger business objectives across the company, both internally and externally, and try to get a 360 view of things, and then use them to reinforce each other. If we’re hearing about something from customers, does the data back it up? It helps us quantify the severity of the issue and how widespread it is.

Abi Noda: We’ll obviously talk about how you get quant metrics, such as DORA metrics. And at the conference we just attended, you shared a little bit about how you do surveys as well.

But I want to ask about the other stuff. And you talked about customer research and customer interviews. How is that driven within your team? Do you have UX researchers whose sole job is to look at this stuff, or is it happening more on an ad hoc basis?

Christopher Sanson: Yeah, it’s more on an ad hoc basis. We don’t have a team of dedicated UX researchers; that’s largely done by the PMs, but also just the engineering team.

Again, it’s a really tight-knit company, but a lot of the engineers who are on the developer platform team used to be on product teams and know each other and move back and forth. And so they’re there on Slack, on call, meeting with teams.

So there’s a close connection between the platform engineers and product engineers. So it comes from all directions. I think in terms of larger-scale efforts, like developer surveys and developer insights, that’s usually a combination of some kind of working group across product management and engineering.

Abi Noda: Shifting into the main topic we wanted to talk about today, which is what you presented at the DPE Summit: your journey with DORA metrics, which I’m really excited to bring to the listeners.

I want to start before the story you told at the DPE Summit, and ask about what you were doing before DORA metrics. What metrics, if any, had Airbnb’s developer productivity org, or other organizations within the company, tried in the past? What kind of legacy approaches had you all tried?

Christopher Sanson: Yeah, so this predates me a little bit, but DORA didn’t invent the concept of developer metrics. It wasn’t like, oh, wait, we should actually measure this stuff. Airbnb had been measuring many aspects of the developer experience for many years.

I think what changed, or where the evolution happened, comes down to a few takeaways. It actually reminds me of the story you told about when you were at GitHub, when you started looking at developer metrics, went around and interviewed stakeholders, got everyone’s favorite metric, and tried to combine them all together.

I think past efforts looked a lot like that. What happened was you tend to get lots of different opinions, and so it often became about who was the most influential person in the room; that person’s metric would then be elevated up.

What everybody has found, and what we found as well is that it’s not as straightforward as that. Actually just getting these metrics is really tough. It feels easy to just add another metric to the list, but like there’s actually gonna be a real cost to getting that metric, right?

Because oftentimes the data is hard to get. There’s a lot of effort that goes into cleaning it. And then even just defining how to measure the metric in reality becomes really challenging. The metric sounds very straightforward, and it is if you ask a question about it, but when you actually try to instrument it, there are a lot of nuances there that I’m sure we’ll get into.

And then the last part is that too much data can be overwhelming. There’s this concept of more is better. And the more data we have, the more insights we’ll be able to extract and the more causation we’ll be able to see between metrics.

And I think what often happens is it’s just this wall of data and people’s eyes sort of gloss over and it doesn’t become as actionable. So I think what we wanted to do going in was really kind of have a plan around, like a strategy around the metrics that we really care about.

What do we already know about the company and the developers that we want to dig into more? And what’s our action plan going to be around these so that we can drive change versus just maybe creating something that’s interesting, but doesn’t actually lead to sort of any impact?

Abi Noda: And what prompted this journey? You obviously had developer metrics already. What is it that prompted you to look for something better? Was it the CTO of your company saying, hey guys, what are our metrics? And you’re like, we have a bunch. Or was it something more internal within your developer productivity organization saying, hey, we need a standard North Star here?

Christopher Sanson:

Yeah, it came from both directions, really. So I think we saw an opportunity within Developer Insights and Developer Platform to get a better quantitative signal around the developer experience. At the same time, leadership was also interested in getting better insights into the developer experience. And so it was a very happy combination. And you mentioned, too, that one of the challenges was getting executive and leadership buy-in.

And so we did do a lot of socialization, like when we initially pitched DORA metrics, because some people had metric fatigue around past efforts and they’re like, well, why will this be any different? It will just be the same thing all over again.

So, there were a lot of conversations to sort of get people excited about it again. But in general, there was sort of executive sponsorship. A few people had used DORA previously and had success with it. So, that really helped align the org top to bottom. And so we weren’t stuck with this position of us trying to constantly sell it to people who weren’t sort of inclined to be interested.

Abi Noda: I remember from your talk, you had this great slide that depicted how this goes, right? You had all these different metrics, lines of code, PR frequency, people chiming in, like, I saw this, let’s do this metric. So did that happen again a little bit as you were trying to evangelize or recommend or measure, did you go down that rabbit hole again?

Christopher Sanson: One of the big benefits of DORA metrics is that they sort of address that and stop it before it begins. I’m a product manager, so I just treat metrics like a product: let’s start with an MVP, let’s not boil the ocean, let’s not try to get every metric we think is going to be interesting.

And the advantage of DORA metrics, besides being research-based and benchmarked across the industry and all these other benefits, is that they’re fairly straightforward: four key metrics, pretty understandable, pretty comprehensive. And it was kind of a way to head off a lot of those conversations before they start and just say, look, let’s get these in place. And then inevitably people will say, those are great, but (and this is one of the challenges of DORA metrics) they’re hard to take action against, because they do roll up so much data underneath.

So then you’re like, well, how do we improve them? What are the key drivers? And then you can investigate. But I think for us, step one was: let’s just start with the industry-standard, research-backed metrics and then build from that foundation.

Abi Noda: I’m really interested in learning a little bit more about this evangelism, and the salesmanship around getting buy-in for something like DORA metrics. Can you recollect a particular conversation or particular stakeholders within the organization where you felt like you did a good job? What can listeners take away or learn from your experience there?

Christopher Sanson: Don’t take for granted that it’s going to be this slam dunk across the org. And don’t go build a whole dashboard and then present the final product. That’s the other thing: oh, I’m gonna go into a cave, I’m gonna build this thing that’s gonna be so amazing when we’re done that everyone’s gonna immediately fall in love with it and it’s gonna sell itself. That’s not usually a good strategy. So we engaged with people pretty early on and met with them. And that’s where a lot of these concerns came up, like, well, why is this gonna be different? And, no, actually we tried this already, and here are some of the challenges.

So we just really listened to people and really invested the time to sort of address their questions.

And again, as you go through that, you’ll also find people who are sympathetic or agree with the direction, who can champion it to others.

So it’s kind of the same way you sell anything, really, in a healthy way. We’re all on the same team, right? I’m not selling anything that they don’t agree with, hopefully. It’s more a question of the nuance of how to do it exactly, and whether now is the right time.

Abi Noda: You shared the rationale for why DORA metrics are a great place to start and why they helped you preempt the usual rabbit hole that folks tend to go down when you start talking about metrics. But I would like to know, behind the scenes, was there a plan B? Was there a close second? What was the alternative? Was there a competing proposal internally, or competing ideas around what could work?

Christopher Sanson: There’s always business as usual. So the alternative was to keep measuring lots of different metrics, mostly at the team level. The team that owns CI measures CI metrics, and so on.

In the absence of a centralized dashboard, teams will often go off on their own and figure out how to measure this. Then you have a lot of fragmentation. People are like, oh, I measure PRs per dev. There are lots of different angles on it. You see a lot of people just do it themselves. I think that was certainly the alternative. What DORA metrics gave us was a little more rigor, a little more discipline around it.

Abi Noda: That ties into the question I wanted to ask, which is really around what it was that you were proposing. Was it getting buy-in to invest time and money into obtaining the DORA metrics? Was that it? Or was it more the idea that Airbnb as an organization should lean in and pay attention to these metrics? Was it more the ideology, or just the budget and approval to invest in standing these up?

Christopher Sanson: Yeah, I mean, it was a little bit of both.

If we get the buy-in that we want to leverage these metrics in how we communicate to developers and how we talk about the success of our team, that is the driving factor. And then to deliver value, we have to back into the investment to actually make it happen.

We didn’t just go and ask for headcount. We went and said, hey, we should actually be using these, and then used that to justify the expense. That was more the approach. But there was a non-trivial amount of work to actually build this up. It was less about spinning up a new team and more that this was worth prioritizing within developer platform.

Abi Noda: Let’s talk about the work and effort that goes into spinning these metrics up. Because so many people say, ‘oh yeah, we track the DORA metrics,’ and far less is said about how the sausage is actually made.

I shared the story of how we did this at GitHub, which was quite the undertaking, and I know it took quite an effort at Airbnb as well.

I’m not sure the best place to start, but maybe just take us through. You got the green light, you got the approval. What happened next? How did you plan the engineering effort, the actual initiative around delivering these metrics to the organization?

Christopher Sanson: It was me and a couple of engineers, plus a couple of contractors for data science and research. We said, okay, here’s what it says on the tin. There’s lead time to change; this is how it’s defined. Can we get this information? Then we played it out: let’s do a POC, let’s try to wire it up.

We were fortunate at Airbnb in that we have a fairly paved path developer workflow between GitHub, BuildKite for CI, and Spinnaker for deployments. We didn’t have too fragmented of an ecosystem where we’re trying to pull deployment information from like 20 different tools or anything like that. So that alone was really key for making this more feasible.

It was straightforward, again, because we had for years streamed this data to a central database. We’re capturing all the GitHub events, we’re capturing all the Spinnaker events. That’s what we’d been setting KRs and metrics against. So we had a lot of the data already available, but it was still quite a process to set the DORA metrics up.

Even step one was the nuance we talked about around what this thing actually is. Like lead time: where do you start it? From PR merge? PR create? When they open a Jira ticket? And when is the deployment finished? Do you count feature flags? Is it when it’s fully rolled out, or when the 1% traffic rollout is over?

So a lot of the conversation was defining our opinion of how it should actually be measured. We first said, look, we want this to be quantitative, not qualitative. We’re going to define it as when the PR is created to when the deployment is successfully completed in Spinnaker.

We’re not going to worry about feature flags. We’re not going to try to go back to when the story or ticket was opened. Those were the boundaries that we decided on.
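With those boundaries fixed, the per-change calculation is simple; here’s a minimal sketch (the field names and timestamps are illustrative, not Airbnb’s actual schema):

```python
from datetime import datetime

def lead_time_hours(pr_created_at: str, deploy_completed_at: str) -> float:
    """Lead time from PR creation to successful prod deployment, in hours."""
    start = datetime.fromisoformat(pr_created_at)      # e.g. from a GitHub event
    end = datetime.fromisoformat(deploy_completed_at)  # e.g. from a Spinnaker event
    return (end - start).total_seconds() / 3600

# A change opened Monday 9am and fully deployed Tuesday 3pm:
lead_time_hours("2024-01-15T09:00:00", "2024-01-16T15:00:00")  # 30.0
```

Everything before PR creation (ticketing) and after deployment (feature-flag rollout) is deliberately out of scope.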

Abi Noda: Yeah, I was going to ask about lead time, ’cause as we’ve joked about, there are a lot of questions that can come up in the process of figuring that out. I want to ask you about some of the calculations for the other metrics as well. And first, was the scope of this all four metrics or just some of them? ’Cause I know a lot of organizations do lead time and deployment frequency, and then some of the other ones get tricky. Did you guys go for all four or just a few?

Christopher Sanson: We went in phases. So for the initial POC, we experimented and wanted to do all four. We can get into that. We found that challenging.

The change failure rate and mean time to resolve metrics were harder to have high confidence in. So after the POC, in order to operationalize it (and we can talk about how we did that), we focused on lead time and deployment frequency for a couple of reasons. But we had to pivot along the way with how we even calculated lead time, because our first attempt didn’t work out perfectly.

The other ones we had as well, but we didn’t really go as far in terms of operationalizing them.

Abi Noda: Makes sense. So before we move on to the POC, the feedback you got, and things like that, I do want to ask: how did you calculate deployment frequency? What was it? Because that, again, is something I’ve seen organizations handle differently.

Is it just counting the number of deployments? Are you calculating the number per day or per week? And is that global, or by team? What was your number?

Christopher Sanson: Spinnaker helped a lot with this. We have a default release pipeline and things that are labeled as deploying to prod. We calculated the total number of deployments to prod across the org and were like, ‘oh, we’re done.’ And then we actually looked at the data and said, ‘oh no, we’re not done.’

There were a whole lot of robot accounts and a whole bunch of mislabeled pipelines. So we cleaned the data. We went through and removed robot accounts. We had debates like, ‘does an automated biweekly deployment count or not?’

We ended up counting anything that was a service deployment with a user-facing change. We also use Spinnaker pipelines for things like running smoke tests or health checks, so we filtered those out, and that left us with what we were pretty confident were our representative deployments. And because we have service-owner mapping, we can map those to specific teams.
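A sketch of that kind of cleaning, with hypothetical event fields, account names, and pipeline labels standing in for the real Spinnaker data:

```python
# Illustrative exclusion lists; the real ones would come from Airbnb's own data.
ROBOT_ACCOUNTS = {"deploy-bot", "ci-robot"}
NON_DEPLOY_PIPELINES = {"smoke-test", "health-check"}

def count_prod_deployments(events: list[dict]) -> int:
    """Count user-facing prod deployments, excluding robots and utility pipelines."""
    return sum(
        1
        for e in events
        if e["environment"] == "prod"
        and e["actor"] not in ROBOT_ACCOUNTS
        and e["pipeline_type"] not in NON_DEPLOY_PIPELINES
    )

events = [
    {"environment": "prod", "actor": "alice", "pipeline_type": "release"},
    {"environment": "prod", "actor": "deploy-bot", "pipeline_type": "release"},  # robot
    {"environment": "prod", "actor": "bob", "pipeline_type": "smoke-test"},      # not a deploy
    {"environment": "staging", "actor": "carol", "pipeline_type": "release"},    # not prod
]
count_prod_deployments(events)  # 1
```

With a service-to-team mapping, the same filter can be grouped per team before counting.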

Abi Noda: How did that work for you all? Did you feel like there were some hard limitations or things you had to figure out? Tell us about the tech stack.

Christopher Sanson: A lot of it was already in place, so we didn’t have to make any major overhauls to our tooling. We had Spinnaker, which we self-host internally. All the events and data get streamed to a database. And then we have an internal metrics dashboard, from a third-party vendor, that we use for displaying and reporting the data.

Abi Noda: I want to ask you now about the POC that you talked about. How did you design this POC? What was kind of the rollout process and what did you learn?

Christopher Sanson: We mocked up what we thought it should look like. We did some low-fidelity wireframes of the data and how it would work.

We looked at other examples across the industry of people who had built DORA metrics dashboards and what those looked like. We started wiring it up, dealt with the idiosyncrasies of our tooling, and adjusted to what was possible and what wasn’t. Then we got to the first concept and started spot-checking the data: ‘Okay, here it is, does this look right?’ Digging into the data and seeing if everything was playing out the way we thought it should.

Abi Noda: And what were some of the things you found?

Christopher Sanson: Yeah, so we found something that seemed off pretty early on. The data wasn’t unreasonable; the metrics seemed like, okay, this could be it. But if it’s the first time you’re doing it, it’s hard to know if it’s actually correct, versus if you’ve been doing it for a while and you have trends you can track against.

What ended up happening is that our initial approach was to just follow a single change through the pipeline. So we’d take a PR from when it was created, through merge to main, through release, and then when that change was deployed to users, that would be done.

So for every PR, we could see the lead time, and then we would aggregate that all together to get our org- and team-level values, like P50 and P90 values. But what we found is that didn’t work, because we have a multi-monorepo structure: separate repos for the major frameworks, one each for web, back end, and data, and a native one that iOS and Android share.

It’s not as straightforward as that, right? A single PR doesn’t just get released once. A single PR may touch multiple services, creating multiple snapshots. Spinnaker and the service owners release those separately.

So a single PR may get released multiple times, and likewise, the inverse: you may have multiple PRs that get rolled up into a single snapshot, and not all snapshots get released. As we dug into it, it was like, oh, it’s challenging to just draw a straight line for these changes. So we had to step back and reassess.

Abi Noda: I mean, what you’re describing is a problem I’ve run into personally. I’ve written about it. We’ve heard it from others. So how did you solve that?

Christopher Sanson: We ended up decoupling the PR and release sides, essentially. We gave up on the idea that we could trace a single PR all the way through. Instead, we looked at it as a PR lead time and a release lead time, then combined those together to get a total lead time, which still gives you the signal you want: the effective lead time for your teams.

The PR side was pretty straightforward: PR creation to successful merge to mainline. Then we were able to decompose that into the major steps, like time to first review and time in code review.

Then what we looked at too, which is interesting, was how long it took from when the code was finally approved to when it was actually merged into main. And then we looked at the other side of the equation: from when the snapshot was created, which happens right when a merge occurs, how long until it was released.

Then we looked at how long the snapshot sat around before the release started, and how long the actual deploy took. So essentially, we were able to stitch together all the different phases of the lead time to get the number. And that worked really well for us. We spot-checked it, and it checked all the boxes. That’s where we landed.
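The stitching described above amounts to summing the PR-side and release-side phases. This is only an illustrative shape; the phase names are invented for the example, not taken from Airbnb’s pipeline:

```python
def total_lead_time(phases: dict[str, float]) -> float:
    """Sum the decomposed phases (all in hours) into one total lead time."""
    pr_lead = (
        phases["time_to_first_review"]
        + phases["time_in_code_review"]
        + phases["approval_to_merge"]
    )
    release_lead = (
        phases["snapshot_wait"]      # snapshot created (at merge) -> release starts
        + phases["deploy_duration"]  # release starts -> deploy completes
    )
    return pr_lead + release_lead

total_lead_time({
    "time_to_first_review": 2.0,
    "time_in_code_review": 4.0,
    "approval_to_merge": 1.0,
    "snapshot_wait": 6.0,
    "deploy_duration": 0.5,
})  # 13.5
```

Reporting the phases separately also shows which step dominates the total, which is what makes the decomposition actionable.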

Abi Noda: You decomposed the problem. Just to clarify, you do still report a total lead time that’s the sum of all the sub-steps?

Christopher Sanson: Yeah. So we measure total lead time, and then we have it broken down into the PR lead time and then deploy lead time.

Abi Noda: This question is prompted by a talk I think we both heard at the DPE Summit, from Grant at LinkedIn. One of the things that struck me was how intentional they were about whether they were looking at medians, means, or P50. He had some statistical stuff that I had never even heard of, which I want to ask him more about. But what are the cuts of the numbers that you primarily focus on?

Christopher Sanson: We look at P50 and P90, which we report on. And then we have P99, and we have others that we sometimes will dig into to understand if those are the right two to report.

But we’ve essentially standardized on P50 and P90, and we tend to use those across almost all the metrics we report, just because people can understand them, and they capture the signal from our point of view. And that’s where we see a lot of the changes.

A lot of the metrics will have the P50 as the main experience, and then you have these outlier long tails, failure-mode-type metrics. Like code review: time to first code review at P50 is super short, and P90 is super long, because the review either happens right away or just sits around for a while if it’s some cross-team review. So we tend to find value there.
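A minimal sketch of that P50/P90 split, using a simple linear-interpolation percentile (the sample numbers are invented to show the short-head/long-tail shape he describes):

```python
def percentile(samples: list[float], p: float) -> float:
    """Linear-interpolated percentile for 0 <= p <= 100 (numpy's default method)."""
    xs = sorted(samples)
    k = (len(xs) - 1) * p / 100
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

# Time-to-first-review in hours: mostly fast, with a long cross-team tail.
reviews = [0.5, 0.5, 1, 1, 1, 2, 2, 3, 24, 48]
percentile(reviews, 50)  # 1.5   -> the typical experience
percentile(reviews, 90)  # ~26.4 -> dominated by the long tail
```

Reporting both numbers surfaces exactly the bimodal behavior a mean would hide.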

Abi Noda: Another common challenge I want to ask you about is dealing with time zone stuff, like excluding weekends. Is that something you guys have done or are looking at?

Christopher Sanson: Yeah, good question. Airbnb has a holiday break as well, a company-wide break at the end of the year. The first thing teams would say when we shared the numbers with them was, ‘I was on holiday,’ or ‘we had weekends,’ or ‘that affected my numbers,’ and we looked into it. In an ideal world, we would be able to isolate that out.

But in practice, it was too complicated. You solve one problem, and two more pop up. Weekends aren’t the same everywhere. PTO isn’t the same everywhere. So we reported it as is, and said, just take that into consideration when you look at the numbers.

So with lead time, to use it as an example, we would show a rolling 90-day average. We would also show daily and weekly changes, so you could see spikes and say, okay, is this trend significant, or is this week an outlier?
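The rolling view can be sketched as a trailing-window average over a daily series (the data shape here is hypothetical):

```python
from datetime import date, timedelta

def rolling_average(daily: dict[date, float], as_of: date, window_days: int = 90) -> float:
    """Average a daily metric over a trailing window ending at `as_of`.

    `daily` maps dates to metric values (e.g. median lead time per day);
    days with no data are simply skipped.
    """
    start = as_of - timedelta(days=window_days - 1)
    vals = [v for d, v in daily.items() if start <= d <= as_of]
    return sum(vals) / len(vals)

# 90 days of steady 10-hour lead times, with a one-day spike.
daily = {date(2024, 3, 1) + timedelta(days=i): 10.0 for i in range(90)}
daily[date(2024, 3, 10)] = 100.0
rolling_average(daily, date(2024, 5, 29))  # 11.0
```

The 90-day window smooths the spike, while a daily or weekly view of the same series would show it clearly, which is the point of reporting both.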

We would do a little bit of cleaning at the beginning of the year, since the rolling 90-day window would include the Thanksgiving and Christmas holiday break; we would shorten that window. But in general, we didn’t go too far down the rabbit hole trying to account for all that.

Abi Noda: I think that’s a really good solution, because I remember talking to one leader about this problem, and they said it doesn’t really matter if the number is perfectly precise; what matters is how that number is changing within your organization. So as long as there’s consistency and folks are taking it into consideration, it matters less that the absolute number doesn’t exclude non-business hours. It sounds like you landed on a similar conclusion.

Christopher Sanson: Yeah, the example I used to give is: if you’re judging a race and your stopwatch is five seconds off, it doesn’t matter. You still know who won the race, and if you use that same stopwatch again, you still know whether people got faster or slower. So you still have really good signal.

Was it the absolute truth of the universe? Maybe not. But what we found is that the more we tried to account for these things and intervene in the numbers, the more convoluted and confusing it got, the more explanation it needed, and the more edge cases surfaced. So keeping it simple and understandable was more beneficial than trying to over-engineer it.

Abi Noda: Your stopwatch example reminds me that I purchased three different tire pressure gauges for my car, not counting the sensor built into the car. None of them gives the same reading, and it drives me absolutely crazy. So yeah, I’ve applied the same principle to my tire pressure.

I want to ask you about getting these metrics together. What was the process of operationalizing them? What was the process of rolling these out and communicating about them? Of course, with any metric, there’s always this concern that people are going to game these or people are going to be graded by these. So, in your case, what was the process to bring the metrics to life within the organization?

Christopher Sanson: One thing I like about DORA metrics is: please, game them. Go ahead, game them. Deploy more frequently with fewer issues. Go ahead.

As metrics go, they’re harder to game than most because they balance each other out, which is another benefit of them. So we were lucky in that. I think this was a key lesson again with developer metrics: don’t try to get the data and then figure out what to do with it.

We have this larger org-wide effort called Commitment to Craft, a CTO-led initiative at Airbnb with goals not just around developer productivity but around the state of development at Airbnb in general, goals that we want all teams to help achieve and hit.

So we could go to market, if you will, with lead time and deployment frequency through this Commitment to Craft effort.

Once we had a baseline that we really felt confident in, we included it as part of that. We were able to go out against that baseline and say, ‘Hey, here’s the number. Here’s where we think it should be, or want it to be. And here, based on our analysis, is the way to get there.’

So that was our vehicle to enact change across the organization.

Abi Noda: What did you see as the actualization of that? I mean, were there teams that were part of commitment to craft that heard about it there, and then they started going to your dashboard, filtering it down to just their team, and talking about these metrics? What have you observed happening?

Christopher Sanson: Exactly that. All teams are accountable for these Commitment to Craft goals, and we do monthly reviews and progress updates, where we talk about the top-level number as well as breakdowns by team. Teams are accountable for hitting those goals.

The other thing we did was try to help them understand how to make that number better. One of our big takeaways from all this was that, by far, the largest driver of our lead time was the lag time to deploy: how long from when a change is successfully merged until the deploy is kicked off.

What we found is that a lot of teams are doing continuous delivery, so their services are deployed on an ongoing basis, every day or two. But there was also a large number of mostly older services, just in maintenance mode, that are deployed manually, or at least kicked off manually. They get deployed a lot less frequently, so they would have changes sitting around that wouldn’t get released.

We could go to teams and say, ‘Hey, here’s the lead time number that every team should be hitting. And by the way, the way to get there is just to deploy these services that sit around more frequently.’
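
The decomposition described here, splitting lead time out to see how long merged changes sit before a deploy starts, can be sketched roughly like this. The change records, field names, and threshold are hypothetical, not Airbnb's actual schema:

```python
from datetime import datetime

# Hypothetical change records: when each change merged and when its deploy
# was kicked off. Measuring this gap separately shows how much of lead time
# is just merged code waiting around for a manual deploy.
changes = [
    {"service": "checkout",
     "merged": datetime(2023, 5, 1, 10), "deploy_started": datetime(2023, 5, 1, 11)},
    {"service": "legacy-billing",
     "merged": datetime(2023, 5, 1, 9), "deploy_started": datetime(2023, 5, 8, 9)},
]

def deploy_lag_hours(change):
    """Hours a merged change sat before its deploy was kicked off."""
    return (change["deploy_started"] - change["merged"]).total_seconds() / 3600

# Services whose changes routinely sit for more than a day are candidates
# for continuous delivery (or, per the next exchange, decommissioning).
stale = [c["service"] for c in changes if deploy_lag_hours(c) > 24]
print(stale)  # ['legacy-billing']
```

A report like this gives teams a concrete target: the number improves simply by deploying the flagged services more often.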

We saw a lot of engagement from product teams asking, ‘Hey, how do we get our number better?’

We got those questions like, ‘Hey, can we exclude this or exclude that?’ Or, ‘can we just exclude this service from the data, because we don’t really work on it that much anymore?’ We’re like, ‘no, that’s the whole point of the data: either decommission it or adopt it.’

Abi Noda: I want to ask you about how you do these reviews of the data, and how you provide the data by team. We talked a bit about the tech stack and how this is built, but how do you get the team data? Where’s that coming from, how accurate is it, and how is it kept up to date? I imagine teams come into the dashboard and say, hey, this isn’t what our team does or is made of. How did you tackle that problem?

Christopher Sanson: The confidence in the data is pretty high. Based on the updated approach we took, and because we have service-owner mapping for which teams own which services, we had a pretty good idea. We were able to filter pretty cleanly, again based on how our data is structured by different org levels.

We would look at the high-level orgs and their lead time based on the services they own. So, teams largely felt that was accurate and correctly represented their progress. I think we did have philosophical questions like ‘is that the right deployment cadence for us or should we be deploying that frequently?’

It was actually great because it spurred a lot of conversations around like, ‘Hey, what is the right frequency or what is the right number to hit here in these different use cases?’

Abi Noda: I want to wrap up today’s conversation with some of the higher-level takeaways and recommendations for listeners. There were a bunch from your talks that I want to pull from. One of the top recommendations you shared was the importance of using DORA metrics, and really any metric, alongside other signals. So share more about what you mean by that and how you accomplished it at Airbnb.

Christopher Sanson: DORA metrics have made a big difference in answering the top-level productivity question, but they have some known shortcomings. They’re not particularly actionable directly; if you want to improve a number, there are so many inputs that it’s hard to know where to start. They’re also not great for measuring the impact of specific projects, because the first question people ask is, ‘oh, that one project we just spent a lot of resources on, how much did it improve X?’

Typically it’s very hard to draw that line, because lots of other things are changing at the same time. They’re also overly focused on the DevOps pipeline, very focused on the tooling. So, to address that, we wanted to step back and use them as one part of a larger puzzle, where we also look at things like developer survey feedback. Not just the CSAT scores but the free responses: what are people telling us is bad, needs improving, or isn’t bad but just needs to be better?

Then, what are the key drivers that are driving these high-level metrics that we can focus more directly on, whether it’s CI flakiness or build times, or things like that?

Then again, beyond just the tooling, the work environment itself. How much focus time are developers getting? How streamlined is their roadmap process? How much tech debt is there? How good is the documentation? So DORA metrics won’t tell you a lot of that.

But theoretically, they all relate to each other, and hopefully, as you move one, you move the others, and it all works together. That’s what we’ve seen as well. We heard about CI flakiness in the survey and saw it show up in lead time, so we drove an effort to improve it. And then, sure enough, six months later, it showed up as better in the survey data and in the DORA metrics. That’s the win: your confidence level is higher because you see the same thing in multiple places.

Abi Noda: That makes sense. I want to double-click on one thing you talked about: the challenge of measuring specific initiatives and projects. This is something I see folks struggling with all the time. A common problem is, ‘wait, the DORA metrics didn’t change, because 100 other things changed at the same time.’ I imagine you don’t have a magic solution for this; I think everyone’s trying to figure it out. But what do you use for measuring specific projects? Is it just more granular quant metrics, or CSAT scores? What do you lean into?

Christopher Sanson: We use OKRs at Airbnb. We tend to think of these things as either leading or lagging indicators. Things like survey data and DORA metrics tend to be lagging indicators; they take longer to show up and are less concrete. So we tend not to use those for KRs, because they’re less direct. We want metrics that teams feel they can directly influence, and then, based on what we know, we trust that these will ladder up into the outcomes we desire. So we set KRs around things like CI flakiness, build times, or code review wait times, things that are more measurable. Then the belief, and the hope, is that these improvements will emerge over time in things like the developer survey and DORA metrics. It’s usually not as immediate as people hope, and they are very noisy signals, but I think they’re still valuable. Just don’t use them that way.

Don’t think of them as solutions to that problem. Think of them as complements to let you know at a high level if things are going in the right direction. Where’s the bottleneck? Where do we need to prioritize?

Abi Noda: I always share similar advice: the top-level survey metrics and quant metrics are influenced by so many things, and there’s a delay. Even if you make an improvement, there’s a lag before that improvement reaches internal customers and affects people’s perceptions and daily experiences. Maybe folks have to adopt the thing before you see the impact in the numbers. So yeah, the more you can focus on the granular things you can directly control, the better off you’ll be, and hopefully you’ll have something to show for your efforts.

Christopher, this has been an awesome overview of how you’re tackling the elusive problem of measuring developer productivity at Airbnb, your journey of DORA metrics, and many useful insights for listeners. Thanks so much for coming on the show today.

Christopher Sanson: Great, thanks. It’s been my pleasure.