Uber’s journey of measuring AI impact on developer productivity

Ty Smith:

I’m Ty, and this is my colleague, Abhishek. We’re here representing Uber today. It’s good to see a lot of familiar faces from some of the past events in the pretty small developer experience community.

We’re here today to probably talk about something you haven’t heard much of, AI. But hopefully this one’s a little bit different than some of the past AI talks. While we have given some talks recently that were kind of a portfolio of AI developer productivity things that we’re doing, today we wanted to narrow in a little bit more and talk about our journey with measurement around ROI. So it’s going to be a little more focused in that space. This isn’t a success story necessarily. It’s a journey. It’s things we’ve tried that worked, things that broke, and we’re still figuring out what comes next.

The world’s changed drastically over the last few years. I think everybody feels it. A couple years ago, we were talking about autocomplete. And today at Uber, one in nine of our PRs is fully production ready with zero human authoring from our background agents. That’s quite a distance that’s covered, and it doesn’t include your normal Claude Code usage or anything else, which is another huge percentage of usage. This pace of change outran our measurement playbooks for how we thought about anything developer productivity before then. And so that’s what this talk is really about.

Steve Yegge, he coined these stages in his Welcome to Gas Town blog, and I liked these, where, as we’re looking through the evolution of AI and agent usage and developer productivity, it really kind of breaks it down into these eight. We start with really just autocomplete and basic usage, moving into the chat agent where people are interacting with it in the IDE, to basic agent usage that’s very conservative with permission access, up into more autonomy, the YOLO modes that came out, into background agents, then starting to move from a single agent to multiple agents to a lot of agents.

Now people are getting overwhelmed by the amount of agents and the context switching and we’re seeing all these orchestration tools pop up. So the question is, how do we measure any of this? It’s moving so fast. We can’t use what we started with. The world has changed and our measurements need to change too for agentic coding.

And with agentic coding, we need to look at the full SDLC. Writing code used to be the biggest bottleneck, but now that we’ve made drastic improvement there, you see those huge bottlenecks just become more prominent in other parts of the SDLC. And so our investments aren’t just with agents to write code, they’re investments in all of these places. They’re in the planning and the research and the design and the code review and on call and deployment and collaboration with peers and support, and in all of these types of areas we have agentic support. It’s the entire SDLC. And the developer has moved from just writing code, being an author, to being an orchestrator of these agents and these systems, trying to accomplish a goal of building and delivering software that’s customer facing. And that shift changes a lot of how we think about productivity.

Last year, our company leadership, Dara and the other C levels and the board, they threw down the gauntlet. They said, “We are going to fundamentally transform how this company operates and we’re going to become a generative AI-powered company.” This is a strategic pillar for the entire company for all roles, not just for engineering. And at Uber, this is a strategic bet. It’s not a pilot, it’s not an experiment. This isn’t something that we see that we’re leaning in partially. Generative AI as a company pillar means that every suitable task, every function, every role are augmented by AI. And our CEO, Dara, his framing to engineers is pretty explicit. He sees engineers with AI as superhumans. And with these, he’s not satisfied with the amount of velocity that we’ve had in the past. He wants to hire more engineers because they’re even more valuable than they were before.

So that’s the bet the company’s making. And when you make a bet that’s that large, the next question is obvious. How do you know it’s working?

So the journey has been executed at lightning speed. Things are changing frequently, often changing month over month, sometimes week over week. Choosing the right tool and converging on it was way too high risk for us. We can’t see where the industry’s going. We need to enable our internal engineers, our internal engineering community with a variety of tools, with a portfolio of options, and then figure out what’s successful, put investments behind that one. And with each of these waves comes more options, and that creates more measurement challenges. With one tool, it can be manageable. With many tools at the same time, that becomes significantly harder. And then with agents, writing code with zero human input, that changes entirely. We had to rethink at every phase.

So we’ve already been thinking about developer productivity for years before AI came along, us and the industry. And we all know there’s no silver bullet to solve how you measure it. We take a qualitative and a quantitative approach that’s probably similar to many of your companies. We do developer surveys, things like NPS and DX surveys, customer interviews, experience sampling surveys, and then we mix those with a bunch of quantitative proxy metrics that give us a general picture of, are we heading in the right direction with productivity? Those quantitative metrics could be things like PR throughput, review times, the PR authoring time, mean time to merge, mean time to deploy, all the typical type of stuff that you would expect.

These have existed for a long time, and they’ve been heavily used at Uber, and I’m sure at many of your companies as well. AI didn’t create the need to measure developer productivity, but for Uber and I think for the rest of the industry, it really put the spotlight on needing to measure that. It raised the stakes. Of course, everybody is seeing this get more and more expensive, with heavier and heavier usage. And I’m sure all of our finance departments are coming to us and saying, “What is the exact value that we’re getting out of this? How would you quantify that?”

So it really has taken what was the measurement for the individuals, to closing the gap on that so we can say, “These engineers are doing all of these things that’s augmented by AI, and here’s the output for the company. Here’s the value for it.” And the industry has fundamentally shifted. And so what used to work, isn’t feasible, isn’t really suitable now. And we need to figure out if things are working well, how do we improve it further?

So typically when you’re doing these measurements, you want to start with the questions that your stakeholders are asking. What do we actually care about in this data? What are the types of debugging? What’s the information we want out of this? And so these are the types of questions that are not hypothetical, that are being asked of us today by leadership, by finance, and they drive things like budget, and roadmap, and purchasing decisions. These are big questions. Is it working? Is it actually making developers more productive? Who is it making more productive? What should we invest in? Is it specific models, specific tools, specific harnesses? Where are the areas they’re not good at? When should we double down? And what does productivity even mean? Are old metrics broken now? If agents are inflating the volume, is that volume necessarily value that’s an output? What’s the right unit of measurement?

These are all big questions. And our existing measurement playbook wasn’t really built to solve these. So here’s the mental model that we’re taking. We are roughly categorizing this into four eras. The pre-AI, the AI-assisted, the agentic, and the software factory, coming soon. And for each of these, we’re asking these three questions. For adoption, how are they using it? For engagement, how well are they using it? And then what is the value for the company? And then for each one of these sets of questions, we have two types of data. The qualitative, of course, that we talked about, and then the different quantitative approaches that we’re going to dig deeper into here.

So I want to start with one of these, which is the qualitative signal. I’m only going to touch on this once at the beginning here before we spend the rest of the time on the quantitative, but I want to impart how important this is. We were running these developer surveys long before AI showed up. And when they started to show up, the telemetry coming out of these tools was non-existent for the most part. And we needed to make a decision. Do we wait to build telemetry and build a real understanding of these before investing? Or do we bias for action and do we take a bet? And we wanted to bias for action.

And so we needed to rely on the qualitative data. We needed to double down on focus questions, and surveys, and experience sampling surveys to get that information from our developers, to start to build the plane and launch it before we could make those future approaches. And then we could add in the other telemetry as it was built out. And so as we think about the qualitative data, I wanted to mention a few notes around how to think about this. First, when you’re designing these questions, you want to anchor to the behavior, not the perception. So you don’t ask, “Do you find AI helpful?” It’s, “Did you accept a suggestion from Copilot today?” Perception questions, they invite socially desirable answers. Behavioral questions don’t.

Second, we want to track this over time, longitudinally, because if we’re taking one single snapshot, it’s very hard to see the relative change that’s been happening. And we’re not treating that specific snapshot in time as absolute truth. Its absolute numbers aren’t as important as the relative change.

Third, validate against the telemetry. Our survey said that our power users felt more productive, and when we checked that against the PR output and the numbers agreed, we could believe those and trust those signals even more. And when they diverged, that divergence was the more interesting spot that we could debug, and figure out where our data was wrong.

And fourth, you need to watch for selection bias. The developers who fill out your surveys aren’t a random sampling. The power users are more likely to respond. The frustrated users are more likely to respond. And the big silent majority in the middle, the people who are just using these and feel mostly neutral, you’ll have a lot more trouble getting them to engage with you. And so you need to be careful not to let a few very strong voices dominate all of the roadmap decisions.

And so now Abhishek is going to walk you through the quantitative story, but just know that for every one of these eras, we were especially using the qualitative upfront to sanity check and make the quantitative decisions.

Abhishek Tibrewal:

Thanks, Ty. Before I start, a quick question to the room. How many of you are finding that the metrics that you had for AI last year are starting to break down? Okay, quite a few. The same thing happened with us, not once, but three times. So let me take you through a journey, not the version where everything worked perfectly, but the version where the playbook kept failing and we had to iterate on what came next. Let’s start with the AI-assisted era. Developers were using tools like Cursor and Copilot, and with any tool launch, we have to really answer three main questions, as Ty said. Are they using it? Are they using it well? Is it making them better?

So let’s start with the first one. Adoption. And the traditional metrics worked here. Things like monthly active users, MAU10, MAU20, these metrics told us the reach, where to focus enablement, and which orgs needed the most support. One classic example that I always talk about while talking about adoption in the AI-assisted era is, the iOS developers were not picking up these tools, and so we figured out that Xcode was lagging behind in AI capabilities. And so we reprioritized our efforts in Swift LSP and made Cursor available to them. Adoption data did exactly what it was supposed to do. It told us where to intervene and it told us where to act. So far, so good. Adoption metrics worked. Question one answered.

As these numbers were growing, the next question came up. Are they using it well? So when we thought about engagement, the first thing that came to our mind was looking into the demographics of the engineers, things like org, role, tenure, and so on. When we sliced and diced data with all these different dimensions, nothing worked. The insight wasn’t in who the user was. And so we turned to behavioral segments. We looked into the distribution of suggestions shown per week, and we found something very interesting. It followed a Pareto distribution. So a very small number of engineers saw a dramatically higher number of responses per week, and surprisingly, this cohort also correlated with much higher output. And we coined a term, power user. These users were about 8% of the total engineering cohort. We thought it was a big win, but correlation is not causation.
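
A minimal sketch of the behavioral segmentation described above, in Python. It is an illustration under stated assumptions, not Uber’s pipeline: the function name `power_user_cohort`, the input shape (engineer ID mapped to suggestions shown per week), and the sample numbers are all hypothetical. The 8% default mirrors the cohort size mentioned in the talk.

```python
def power_user_cohort(suggestions_per_week, top_fraction=0.08):
    """Return (cohort_ids, share_of_total) for the heaviest users.

    suggestions_per_week: dict of engineer id -> suggestions shown per week
    (hypothetical input shape). top_fraction: size of the cohort to cut.
    """
    # Rank engineers from most to fewest suggestions shown.
    ranked = sorted(suggestions_per_week.items(), key=lambda kv: kv[1], reverse=True)
    cohort_size = max(1, round(len(ranked) * top_fraction))
    cohort = ranked[:cohort_size]
    total = sum(suggestions_per_week.values())
    # In a heavy-tailed distribution a small cohort holds a large share.
    share = sum(count for _, count in cohort) / total if total else 0.0
    return [engineer for engineer, _ in cohort], share


# Made-up data: 2 heavy users and 23 light users.
usage = {"eng_0": 600, "eng_1": 500}
usage.update({f"eng_{i}": 10 for i in range(2, 25)})
cohort, share = power_user_cohort(usage)
print(cohort, round(share, 2))
```

A cut like this only identifies who engages heavily. It says nothing about whether the tool caused their higher output.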

And so with this, we had to really answer two main questions. Are power users more productive because of AI? Or do productive users tend to use AI more? One anecdote that I would really like to give here is, during our early days of migration from Java to Kotlin, we saw Kotlin developers showed very high positive sentiment, and they also had much higher throughput as compared to the Java devs, while working on the same repo. Was it Kotlin or was it the highly enthusiastic, highly productive developers who adopted Kotlin first? At that time, we didn’t have the answer, but in hindsight, maybe it was both. So with AI, we had to really answer the same question.

Going into impact, our hypothesis was simple. AI saves time and saving time means shipping more. Let’s talk about shipping more first. We looked into PRs per SWE. I know PRs per SWE isn’t the best metric out there. We run developer surveys, collect qualitative feedback. As Ty said, we track what slows a developer down. We do all sorts of things, but in order to do a rigorous study, we needed something quantifiable, consistent, and reliable. And so PRs per SWE gave us exactly that proxy. To make things very concrete and rigorous, we basically ran a difference-in-differences study.

The core question was, if there was no Copilot, what would the same developer have shipped? And that’s your counterfactual, and the delta between that counterfactual and what the developers actually shipped is the incremental impact that we were looking at. We got 19% incremental impact for this power user cohort. We thought it was a very big win. We ran a similar study for different cohorts of engineers, and the results were quite heterogeneous. Junior to mid-level engineers saw the highest lift, whereas the senior-most engineers and management also saw some incremental gains, even with the limited focus time that they have and how thinly spread they are.
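
The counterfactual logic above can be sketched in a few lines of Python. This is the textbook two-group, two-period difference-in-differences estimate, not Uber’s actual study design; the function name and the PRs-per-SWE figures are made up for illustration.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences estimate.

    The control group's change stands in for the counterfactual: what the
    treated group would have done without the tool (parallel-trends assumption).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical PRs per SWE per month, before and after the tool launch.
treated_pre, treated_post = [10, 12, 11], [14, 15, 13]
control_pre, control_post = [8, 9, 10], [9, 10, 11]

did = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
# Naive comparisons that skip the counterfactual.
naive_pre_post = mean(treated_post) - mean(treated_pre)
naive_cross_section = mean(treated_post) - mean(control_post)
print(did, naive_pre_post, naive_cross_section)
```

With these made-up numbers, both naive comparisons come out larger than the difference-in-differences estimate, because they absorb the trend shared by both groups and the gap that already existed between them.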

I want to talk about why rigor mattered here. If we had just looked at pre and post numbers, or just calculated the delta in PRs per SWE between the power users and non-Copilot users, we would have highly overestimated the impact. And there are tooling and high-investment decisions made out of some of these numbers, and so we had to do the right thing. And so going back to our hypothesis, one of the hypotheses was that AI saves time. And as these numbers were growing and we were showing the impact, the next question came in, what is the ROI? And we thought that we could wrap some new metric around saving time, and show leadership that number. And Ty, you were in a lot of these conversations with the leadership about ROI. Why don’t you talk about the ROI number?

Ty Smith:

Yeah, thanks. This was an interesting time because like I mentioned, a lot of these business leaders, they do want an ROI metric. And we started to see one that was commonly used in the industry, which was dev years saved or dev time saved. And we saw it in non-engineering roles as well. You start to see lawyer years saved, or HR years saved in some of these tools. And so we took an approach that’s not entirely unusual, where for each one of our AI solutions, we had a group of experts in that space come together and estimate the time savings for the task. And then we double checked that by doing customer surveys and reviews. And then we had quantitative tracking for the number of actions, or if those actions varied in size, what that looked like over time. And we started to aggregate all of that.

And we thought, “Hey, this might be a silver bullet. This is great. We’re saving a ton of developer time with the AI.” And when we started to use this, it really broke down in a few ways that were initially unexpected.

First one, this is a really important side effect that we didn’t think about ahead of time, is a lot of folks are really nervous about this change to AI. What does our field look like in the future? Do we have jobs in the future? And using the developer time saved as a metric has a subconscious impact to a lot of folks to say, am I being measured on replaceability? If this is one dev year saved, is that one engineer that we don’t hire? And as I mentioned at the beginning, that wasn’t the intention, our CEO is very bullish on hiring more. But this was a point of contention with the sentiment and the anxiety that we were coming to. So that was one big problem.

The second was this space is changing so fast and things are changing, our baselines had to keep being revisited. We would set a goal and then new stuff would come out and now we’re changing it and changing it and changing it as the industry is changing. And so the trust in that from the stakeholders was questionable. And then the operational overhead of maintaining that became quite high. Even with the automation around the different parts, all of these different projects around the company being aggregated and shown in different leadership updates, it was a high load.

Finally, the other part was it didn’t actually answer the question that the business leaders were asking. They don’t see an employee’s time for a year as an output for the business. They see it as a cost for the business. What they’re wanting is something like, this made the company X amount of money. This was new revenue. This new feature unlocked. It’s the actual business output. So in the end, for these three reasons, we decided to retire this. It just didn’t make sense for the goal we were trying to get.

Abhishek Tibrewal:

And so that didn’t work. And before we really got time to regroup and think about this, the next era arrived, the agentic era. And the foundational assumption underneath everything that we had just built, that humans write code and AI assists, just flipped. Now one agent task can write tens of PRs. And so think about what that does to PRs per SWE. Think about what that does to the engagement signal that we found in suggestions shown per week. Every metric that we had just validated broke. And so we had to start fresh. We had to start new.

So we started with something that we knew could be easier. Adoption in agentic AI. Adoption was literally on fire in agentic AI. 95% of engineers use AI monthly. Claude Code adoption more than doubled in less than three months. Our in-house background agents, Minions, now write one in nine diffs, which was less than 1% a few months ago, and the rate of growth is exponential there. And here’s where the journey told us something reassuring. The adoption metrics still worked. Traditional metrics told us where to reach, where to do enablement. We could track rollout curves or upticks, we could do education sessions and all that. And then like clockwork, the next question came in. Are they using it well?

From the AI-assisted era, we knew that the users who engaged with the tool deeply got more value. And so we asked the same question about agents. Who is engaging more deeply? But here we can’t measure suggestions, as agents run in the background. And so we instead turned the metric into a frequency-based metric, which was 20 days a month usage on agentic tools. This felt like a very good proxy at that time, but the numbers jumped fast, from 7% to 61% in less than six months. What we had really done is turn a behavioral metric into a frequency-based metric, and we had just measured adoption again. In reality, we were just asking whether you had used Claude Code or Minions or some agentic tool more than 20 days a month. And that’s just an activity metric.

And by the time we realized it, a lot of teams had taken KRs around this metric, and so it was hard to go back. We were stuck with it. And so this was the second time that we were wrong. We adopted a metric from the last era and didn’t evolve it. And so while these numbers were growing, we had to really look for impact, and the impact metrics were already breaking in the agentic era. But before that, let’s talk about what we are doing next with the engagement signal.

The concept we are working towards is the AI-native engineer. And what we are really doing is, instead of just taking the frequency metric, we are asking whether the engineer has used agents with breadth, depth, and consistency. Breadth is how much of the software development lifecycle the engineer has really delegated to agents. An engineer who is just using agents for coding is fundamentally different from an engineer who is using them for planning, coding, review, testing, and deployment. Depth is really how much of the task in each phase the engineer has delegated to agents, and consistency is just the frequency.
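
One way to picture the three dimensions is as a small per-engineer profile. The sketch below is purely illustrative: the phase list, the `PhaseUsage` record, the `ai_native_profile` function, and the 20-working-day month are assumptions, not the definition Uber is building.

```python
from dataclasses import dataclass

# Assumed SDLC phases, taken from the examples named in the talk.
SDLC_PHASES = ("planning", "coding", "review", "testing", "deployment")


@dataclass
class PhaseUsage:
    tasks_total: int      # tasks the engineer handled in this phase
    tasks_delegated: int  # of those, tasks handed to an agent


def ai_native_profile(usage_by_phase, active_days, working_days=20):
    """Summarize agent usage as breadth, depth, and consistency in [0, 1]."""
    used = {
        phase: u
        for phase, u in usage_by_phase.items()
        if phase in SDLC_PHASES and u.tasks_total > 0 and u.tasks_delegated > 0
    }
    # Breadth: fraction of SDLC phases where any work was delegated.
    breadth = len(used) / len(SDLC_PHASES)
    # Depth: average delegated share within the phases that are used.
    depth = (
        sum(u.tasks_delegated / u.tasks_total for u in used.values()) / len(used)
        if used
        else 0.0
    )
    # Consistency: share of working days with any agent activity.
    consistency = min(active_days / working_days, 1.0)
    return {"breadth": breadth, "depth": depth, "consistency": consistency}


coding_only = ai_native_profile({"coding": PhaseUsage(10, 5)}, active_days=20)
full_cycle = ai_native_profile(
    {phase: PhaseUsage(10, 5) for phase in SDLC_PHASES}, active_days=20
)
print(coding_only, full_cycle)
```

Both hypothetical engineers are active on all 20 days, so a frequency-only metric cannot tell them apart. The breadth component can.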

We are still building towards it. It needs to be tool agnostic so that it doesn’t break when new tooling shifts arise. And as I said, while we were building this, the impact metrics had already broken. So let’s talk about that. One agent task can now write tens of PRs. PR size is growing. We are shipping more code than ever, but does that really mean more value? We can’t tell yet. Think about it. A one-line PR that fixes a null pointer exception for users in Brazil versus a 2,000-line PR that introduces an internal tool that nobody uses. PRs per SWE remains the same, but there’s a huge difference in value. They’re not even close. Don’t get me wrong. PRs per SWE has given us a lot of great directional insights on top of qualitative feedback. But in my opinion, PRs per SWE has always been an activity metric. It worked when humans wrote code because activity and value were roughly correlated, but now agents have just flipped the whole scenario.

A very good example I can think of: say you want to change a method name in 500 different files. You write one agent task. And so the PR count has gone up. So the activity has inflated, but the value still remains unchanged. And so this was the third time that we were wrong, and we didn’t realize it fast enough because the world has changed very rapidly. And so what are we doing about this?

Before we build a new metric, we have to really understand the value part of it. So we are building a classification framework for PRs. Basically, for every PR we are asking three main questions. What kind of PR is it? Is it a bug PR, a refactor PR, a test, a chore, and so on? The second question is, how complex is the PR? Is it a trivial task or is it very tough? And the third is, who wrote this PR? Is it human authored? Is it agentic or is it fully autonomous? Early data shows that 70% of the tasks from our in-house background agents, Minions, are toil. Things like refactoring, bug fixes, conflict changes, feature flag cleanups and all of that. But classification really tells us what exactly a PR is doing. It doesn’t tell us whether it moved the product forward.
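
The three questions map naturally onto a three-field label per PR. The following Python sketch shows one possible shape; the `PRLabel` record, the category names, the `TOIL_KINDS` set, and the sample PRs are all hypothetical and are not Uber’s taxonomy or data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PRLabel:
    kind: str        # e.g. "bug", "refactor", "test", "chore", "feature"
    complexity: str  # e.g. "trivial", "moderate", "hard"
    author: str      # "human", "agent_assisted", or "autonomous"


# Assumption: which kinds count as toil is a policy choice, not a given.
TOIL_KINDS = frozenset({"refactor", "bug", "chore"})


def toil_share(labels, author):
    """Fraction of one author type's PRs whose kind is classed as toil."""
    mine = [label for label in labels if label.author == author]
    if not mine:
        return 0.0
    return sum(label.kind in TOIL_KINDS for label in mine) / len(mine)


def work_mix(labels):
    """Count PRs by (author, complexity) to see who does which work."""
    return Counter((label.author, label.complexity) for label in labels)


# Made-up sample of classified PRs.
sample = [
    PRLabel("refactor", "trivial", "autonomous"),
    PRLabel("chore", "trivial", "autonomous"),
    PRLabel("bug", "moderate", "autonomous"),
    PRLabel("feature", "hard", "autonomous"),
    PRLabel("feature", "hard", "human"),
]
print(toil_share(sample, "autonomous"), work_mix(sample))
```

Labels like these describe the kind of work being done. On their own they still say nothing about whether the product moved forward.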

And so for that, we had to really build an entirely new North Star, feature velocity. PRs measure activity, features measure value. That is the whole insight. We are calling it feature velocity, basically the number of features shipped per unit of time. It’s agent proof. It doesn’t care who wrote the PR. It only cares about whether the value was delivered. It doesn’t stand alone. There are three supporting pieces to it. The first is flow efficiency, things like cycle time, review latencies, build times. If AI is truly accelerating the work, friction should drop. Second is quality. And we all know here that shipping faster doesn’t mean better quality. It only means faster debt. Third is capability expansion. AI should really let us explore and do things that we could not have done before AI.
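
As a rough sketch, the definition above (features shipped per unit of time) reduces to a count over ship dates. The function name and dates below are hypothetical, and the hard part, deciding what counts as a feature, is deliberately left out. Note that no author field appears anywhere, which is what makes the count indifferent to who wrote the PRs.

```python
from collections import Counter
from datetime import date


def feature_velocity(ship_dates):
    """Count features shipped per ISO (year, week)."""
    return Counter(tuple(d.isocalendar())[:2] for d in ship_dates)


# Made-up ship dates for three features.
shipped = [date(2024, 1, 1), date(2024, 1, 3), date(2024, 1, 8)]
velocity = feature_velocity(shipped)
print(velocity)
```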

We are still working towards it. And honestly, I don’t even know whether we’ll get to the right estimate because what counts as feature is harder than it sounds at Uber scale and the attribution of PRs to services and getting alignment with different teams and leadership on what we call as feature adds another layer of complexity to it. But for the first time, we have a North Star that we think is agent proof. It doesn’t break when agents write the code.

So let me show you what it looks like when we put it all together. This is the framework that we have been building across this entire journey. PR classification tells you the what, what kind of work agents are doing, but feature velocity tells you the so what. Is it really moving the product forward? And if we really cross these two, we can really answer questions that have been unanswerable until now. Think about questions like, is AI truly accelerating the product roadmap? That’s the question that we have been asked by leadership multiple times every month. And we think that this framework has the potential to answer that question. Think about questions like, can we really trust agents to do autonomous work? Are agents really doing the real work or easy chores? The thing that I talked about, capability expansion, are agents really doing the things that we could not have done pre-AI? Or is it just making things faster?

And there are six operational levers to it. In order to really increase high autonomy, high complexity feature like this, we think that these are the six levers that we can pull and maybe Ty, you can talk more about that.

Ty Smith:

Sure. Thanks, Abhishek. So as we’re trying to increase the overall autonomy and capabilities of the agents plus all of the integration points that they have, I don’t think anybody looks at it today and says, “It’s a solved problem. It can do any type of engineering work.” Everybody’s still building towards that. So we see these as kind of like six rough categories of things that as we build in, we will see further capabilities and velocity unlocked.

The first is obviously the model changes. They’re great, they’re not perfect today. We’re strong believers in being very quick to be on the latest and greatest to unlock those capabilities, to get those in the hands of the right folks. And as we stay on top of that and we make those available, we will increase the overall level of the velocity.

Second is the improvements in the harnesses. I’m sure you all have seen the improvements in Claude Code and Codex and OpenCode and all of the others that are coming out. It’s not just the model itself that’s creating the value, but it’s these complex harnesses with integration points, new features. It might be a team of them working together, subagents, et cetera.

Uber’s going through a heavy, heavy process right now of skillification of many things. We had a huge MCP investment last year, and now skills are kind of the new solution for lower context bloat with tools. We have an entire community going through all of the different points internally at Uber that we might need to create tools around and give those to the agents. By giving the skills to the agents, we give them more capabilities, thus taking them towards better autonomy and feature velocity.

Feedback loops are absolutely critical. I’m sure we’re similar to you in that not everything is tied down in one tight feedback loop for an agent. We have plenty of big repos that are separate from each other and some require a manual deployment or some weird lookup. Having an agent understand all of that abstraction and complexity that isn’t right there can be very difficult. And by intentionally looking for feedback loops to solve, we can create surface areas where future engineers can add agents and get value.

Great example here would be maybe an IDE upgrade. You have folks that are using IDEs, they have their laptops or the remote dev environments, we need to update those. Now, can an agent do that? Well, what does it mean? How do you test that, the IDE with the right plugins, with the right experience and the repos? That seems more challenging for an agent to do, but you can do it, you just have to build it. You maybe use the IDE orchestration API used for testing and wire it up in a VM or a container and hand that to an agent. And now you have a feedback loop for that task, and the agent can take care of those upgrades going forward.

These are the types of feedback loops that we’re intentionally trying to create and think about. Another one of these levers is context. And not just context like the code context in the repo that it’s working in, although that’s obviously incredibly important, but all of the surrounding context that’s needed for the agents to make the right decisions. And not just the technical context, but the business reasons as well. I think it’s frequent now when you look at a PR and the agent will give you a description of it.

And the description is the exact code change. It’s like, I added this method and I changed this variable name and I wired it up to DI over here, but what is it really solving? What’s the business problem that it was solving? That’s what you would want out of a human, like why are we making this change? The agents don’t know that without the context. That adds more autonomy, that adds more alignment to the agent and helps with feature velocity.

Finally, the tech foundations. The foundations I think are obvious. We see an upcoming 10X velocity in code throughput because of agents. Now, can your existing infra and code base handle 10X the volume? That’s huge, right? Not just things like our CI or our merge queue, which we’re afraid of the volume coming through those and we’re actively working on optimizing as much as possible, but also things like tech debt or your architecture. Can it really stand to be 10X’d without things falling over? These are the types of things that we need to get in place to have the increase in feature velocity that agents can utilize and really hit the goals that we’re thinking about.

So the software factory era was the fourth one that we mentioned, and this is an industry term that started to become popular, where goals go in, high level problems go in, and software comes out and is deployed. The engineers, they manage systems and they manage intent and they don’t need to focus as much on the implementation. It starts to become abstracted away. Every era we’ve described, the metric traps in AI-assisted, the borrowed definitions in agentic, they all point us towards questions we genuinely don’t know the answers to yet. The journey we just walked, it’s the foundation, but it doesn’t give us the map. So with the agentic era, sorry, with the software factory era, we don’t know yet. We’re starting to figure it out.

There’s still a lot of open questions there. And honestly, we don’t know exactly what’s coming next. The space is moving too fast for anyone to claim they figured it all out. But some of the questions that we’re actively thinking about are things like, how do you measure judgment when an agent writes 80% of the code? Does output per engineer still really mean anything when a single engineer can orchestrate dozens of agents and we begin to have automated agents and creations of value that don’t have a single responsible individual? Are we accumulating tech debt faster than we can detect it? We don’t have answers to these yet, but we’re naming them because we think that’s the right habit to do, to acknowledge the questions, to resist the temptation for easy metrics or scapegoats, and to start to be ready to solve these hard problems.

So let’s talk about what we learned a little bit. Here, we have two rows and two lessons from each era. The green row is what we added, the measurement practices we built up. The red row, the bottom row, is the assumptions that we had to unlearn, the ones that broke. In the beginning era, the pre-AI era, we had surveys and NPS. They were already part of our measurement methodology for engineering health alongside baselines like PR throughput, cycle time, and a bunch of the other ones that we talked about earlier. The assumption that held in this era is that output roughly equals value when humans are writing the code.

In the AI-assisted era, when tools like Copilot and Cursor launched, usable telemetry wasn’t ready yet. So we ran these focused AI qualitative feedback surveys and interviews. And then once telemetry arrived, we started to layer in causal methods. And what broke were the correlations: a high acceptance rate didn’t mean productivity, and ROI isn’t necessarily time saved. In the agentic era, background agents are writing a significant amount of the code, and it’s growing still. The volume is up, but one agent task coupled to one PR coupled to value is broken, right? Volume inflates without any real value.

So what broke here was the assumption that activity means more value, and we’re now tying ROI to business outcomes, not code output. And in the software factory era, we don’t know yet. We’re still writing this chapter. Once we figure that out, I’m sure we’ll come back to DX Annual in another year and tell you all about what the findings are.

Four things that we would tell ourselves two years ago when all this was starting. Use the qualitative signals first, bias for action. As you’re inventing a metric, this qualitative feedback is your fastest feedback loop. Collect baselines, and collect them before AI arrives, because this is becoming the new way of working, and at some point that data is gone once adoption starts and we can’t look back to say, what does a non-AI era look like compared to the AI era?

Earn the causal number. Correlation results will be used in ways that you didn’t intend. And expect the metrics to break. Every era was breaking the previous one for us, so budget for that, learn, evolve it, change it, invent new ones. The metric that outlasts agents is the one tied to outcomes, not output. And thank you, that’s all for our content. I believe we’re open for a couple questions now.

Speaker 2:

We’ve really only got a couple of minutes for questions. And again, we’ve got such great stuff coming in from the audience. Thank you for all that participation. This is really hard to pick because we’re limited on time. But one that I found was pretty compelling is how do you define a feature?

Ty Smith:

Abhishek, you want to take this?

Abhishek Tibrewal:

Sure. It’s very tough to define a feature. I can give a quick 30-second rundown on what we think. We are not calling an XP a feature, and we are not calling a Jira a feature. What we are really doing is we are taking all these data sets: PRs, diffs, XPs, Flipr configs. And what we are saying is maybe there are one-to-many relationships between these, and we are clustering together all the data sets that can be combined, since we lack telemetry joining all these from PRD, ZRDs right to the point where it shipped from XP.

So what we are really doing is clustering all of that together and we are saying, can we really find a deliverable feature from these clusters? And each cluster could be a feature. And you might have a question around what about infra teams and other things, right? So what we are really thinking about is maybe it could be split into three layers.

The first could be like user-facing features, second could be features supporting that feature, and third could be infra-related features. And we don’t have a clean answer yet, but we are still experimenting with it. We’ll probably come back and share more when we have it.
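The clustering idea described in this answer can be sketched roughly as follows. This is a minimal illustration, not Uber’s actual pipeline: it assumes that links between artifacts (PRs, diffs, XPs, config flags) have already been extracted as pairs, and it groups them with a simple union-find. The function name `cluster_artifacts` and all artifact IDs are hypothetical.

```python
from collections import defaultdict

def cluster_artifacts(links):
    """Group artifacts into clusters using union-find over pairwise links.

    `links` is an iterable of (artifact_a, artifact_b) pairs, where each
    artifact is a hypothetical string ID such as "pr:101" or "xp:some_test".
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in links:
        union(a, b)

    # Collect every artifact under the root of its cluster.
    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)

    # Sort for deterministic output.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])

# Hypothetical links: two PRs share a diff, which is tied to an experiment;
# a third PR is tied only to a config flag.
links = [
    ("pr:101", "diff:a1"),
    ("pr:102", "diff:a1"),
    ("diff:a1", "xp:checkout_test"),
    ("pr:200", "flag:new_map_pin"),
]
candidate_features = cluster_artifacts(links)
```

Each resulting cluster is only a candidate feature. Whether it is user-facing, supporting, or infra-related would still need a separate labeling step, which this sketch does not attempt.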

Speaker 2:

Thank you. We’ve had several questions that you… And you touched on this at the end. You talked about, yes, eventually you want to be able to understand if you’re creating more value with all this new throughput. So I think I’ll reframe because there’s like four questions in here kind of just about that. What does success look like when you’re able to actually start measuring those metrics about value? What’s the threshold? What does success look like for that?

Ty Smith:

I think right now, obviously costs are going up everywhere. Levels of investment are going up. We think we’re just getting started in adoption. Right now, there’s still a lot of folks that are only in single-agent usage. Out of our non-engineering cohorts, just some of those are organically using some of these tools. There’s a lot more that can happen to give value there.

And with that, there’s a lot more constant investment that would go into that. And at some point, you do need to look at the bottom line and say, what is the amount that we need to spend? Do we need to think about budgets? Do we need to think about limits? Do we need to think about guidelines? What are the right models? Do we use the most expensive model or a lesser model? Do we break up the tasks?

All of these are the big questions where you would potentially want to use the outcome metric or the ROI metric to help with the decision making. And so I think success is having something that’s high enough confidence there that we can look back and we could make the right decisions to keep it moving forward without just unknown unlimited spend.

Speaker 2:

Great answer. Listen, Ty, Abhishek, thank you so much. I always love hearing what Uber is doing in this space. You’ve been ahead of the game for so long with this stuff. So thank you very, very much. I really appreciate it.

Abhishek Tibrewal:

Thank you so much for this.

Speaker 2:

Yeah. Thanks so much.

Ty Smith:

Thank you everyone.
