
Measuring AI code assistants and agents with the AI Measurement Framework

In this episode of Engineering Enablement, DX CTO Laura Tacho and CEO Abi Noda break down how to measure developer productivity in the age of AI using DX’s AI Measurement Framework. Drawing on research with industry leaders, vendors, and hundreds of organizations, they explain how to move beyond vendor hype and headlines to make data-driven decisions about AI adoption. They cover why some fundamentals of productivity measurement remain constant, the pitfalls of over-relying on flawed metrics like acceptance rate, and how to track AI’s real impact across utilization, quality, and cost. The conversation also explores measuring agentic workflows, expanding the definition of “developer” to include new AI-enabled contributors, and avoiding second-order effects like technical debt and slowed PR throughput. Whether you’re rolling out AI coding tools, experimenting with autonomous agents, or just trying to separate signal from noise, this episode offers a practical roadmap for understanding AI’s role in your organization—and ensuring it delivers sustainable, long-term gains.

Show Notes

AI’s hype vs. reality gap

  • Bold headlines are often misleading. Claims like “90% of code will be written by AI” typically come from cherry-picked studies in narrow scenarios, not representative of the median developer experience.
  • Organizations need their own data. Vendor marketing and public research can set unrealistic expectations—measuring AI’s real-world impact in your own environment is the only way to guide strategy and investment.

AI doesn’t change engineering fundamentals

  • Core principles remain the same. Scalability, maintainability, reliability, and meeting customer needs still define good engineering.
  • AI builds on—not replaces—these foundations. Use AI to lift existing strengths, not as an excuse to rebuild productivity measurement from scratch.

The AI Measurement Framework

  • Three dimensions matter most: utilization (how widely AI is used), impact (how it changes performance), and cost (what you spend on tools, licenses, training).
  • Track them together for the full picture. Over-indexing on one—like utilization—can lead to false conclusions about overall value.

The pitfalls of acceptance rate

  • Acceptance rate is unreliable. AI code that’s accepted in the IDE is often rewritten, heavily modified, or deleted before shipping.
  • Better options exist. Tagging PRs for AI contributions or using file-level observability can identify AI-authored changes across all IDEs and tools, avoiding blind spots.

Collecting measurement data

  • Tool telemetry (from GitHub, GitLab, or AI vendors) shows patterns in daily and weekly adoption.
  • Quarterly surveys reveal long-term trends in developer satisfaction, productivity, and maintainability perceptions.
  • In-workflow experience sampling asks targeted questions at the moment of work—e.g., “Was this PR authored with AI?”—to get precise, low-bias feedback.

Perception vs. reality in time savings

  • Developers often feel faster with AI—but logs may say otherwise. A meta-study found that self-reports overstated gains; in some cases, AI users were slower.
  • Triangulate survey and system data to confirm that perceived improvements match actual throughput and quality metrics.

Measuring agentic workflows

  • Treat agents as team extensions, not digital employees. Measure productivity for the combined human-agent team, just as you would for a team using CI/CD tools like Jenkins.
  • Focus on maturity, not just usage. There’s a big difference between using AI for autocomplete and delegating multi-step tasks to autonomous loops.

Expanding the definition of developer

  • AI enables more contributors. Designers, PMs, and other non-engineers can now produce functional code and prototypes.
  • Apply the same quality gates—code review, testing, maintainability checks—to their contributions as to full-time engineers.

Thinking beyond AI

  • AI is one tool in the toolbox. Many bottlenecks—like unclear requirements, inefficient processes, and infrastructure delays—can’t be solved by AI alone.
  • Balance investment to ensure you’re addressing all productivity levers, not just AI adoption.

Watching for second-order effects

  • More AI-generated code can create new bottlenecks. Extra output can slow PR reviews, increase cognitive load, and lower maintainability.
  • Impact metrics reveal trade-offs early, helping you prevent short-term speed gains from causing long-term technical debt.

Rolling out metrics successfully

  • Aggregate at team or department level. Avoid individual tracking to build trust and reduce fear around AI adoption.
  • Be transparent about data use so developers know it’s for enablement, tool evaluation, and rollout strategy—not performance surveillance.

Timestamps

(00:00) Intro

(01:26) The challenge of measuring developer productivity in the AI age

(04:17) Measuring productivity in the AI era — what stays the same and what changes

(07:25) How to use DX’s AI Measurement Framework

(13:10) Measuring AI’s true impact from adoption rates to long-term quality and maintainability

(16:31) Why acceptance rate is flawed — and DX’s approach to tracking AI-authored code

(18:25) Three ways to gather measurement data

(21:55) How Google measures time savings and why self-reported data is misleading

(24:25) How to measure agentic workflows and a case for expanding the definition of developer

(28:50) A case for not overemphasizing AI’s role

(30:31) Measuring second-order effects

(32:26) Audience Q&A: applying metrics in practice

(36:45) Wrap up: best practices for rollout and communication


Transcript

Laura Tacho: I think we're getting ready to get into the big topic that's in the room, which is: how do we measure developer productivity in the age of AI? Abi, do you want to give us an overview of some of the problems that you and I are seeing out there when we talk to companies about measuring productivity in the age of AI?

Abi Noda: Yeah. Well, right now I think it's a big priority for every organization to look at how do we bring AI into the SDLC? How do we get AI in the hands of our developers, how do we do it in a way that actually moves the needle in terms of productivity and quality, and how do we right-size our investment? What are the right investments? So I just went too long already in terms of the types of questions we see organizations asking. And at the same time, as we were just showing in the intro slide you had up, there are all kinds of numbers and headlines now coming out of research and the different vendors. We hear sound bites of, "Hey, AI is going to be writing 90% of our code in the next six months," from Anthropic. We hear Google coming out and saying, "Hey, we're seeing a 10% improvement." And then we just saw that meta study, which came out with a finding that, "Hey, actually senior developers are slower when they're using AI for certain types of tasks." So all these different polarizing viewpoints, combined with the moment we're in as an industry, mean that organizations really need ways to get their own data, to get a really accurate understanding of: how is this impacting our organization? How should we be thinking about decisions? How do we actually successfully plan and roll out AI? So that's what I'm seeing and hearing across the industry.

Laura Tacho: Yeah, same for me. We do talk to a lot of the same people, but I think this is just a really common pain, and it's just sort of the state of the world that we're in. What I've noticed is that there is a really big gap between the expectations of what AI can do and the reality. As you said, there are big promises, there's a lot of hype, there's cherry-picked data that doesn't necessarily represent mainstream or median adoption. We're looking at very narrow studies in very specific circumstances, and then coming up with 90% of code being written by AI, or other very big claims. It's great to keep the industry optimistic, and there's a lot of promise here, but there's a gap between those kinds of expectations and the reality of what most engineering leaders are seeing in their organization. And a lot of engineering leaders are stuck in that gap right now, because we just don't know how to measure, what to measure, or how to communicate it.

And what I always say is, "Data beats hype every time." So in order to combat that hype, we need really solid data. With that, let's get into the big question, the one I know is on everyone's mind and the reason you're spending 30 minutes with us today: how do we measure engineering productivity in the AI era? And Abi, I just want to pose the question to you: do we need to rethink the way that we are measuring productivity now that AI is here to assist and augment developer workflows?

Abi Noda: The response I give is always both yes and no, right? There are some things that are changing, and some things that I think remain constant. What remains constant is that, overall, the inputs and outputs of software development remain the same, and we still need holistic ways of measuring overall engineering productivity. And in fact, it's important that those remain constant so that we can compare pre-AI baselines against post-AI improvements and gains. At the same time, there's also a lot that's changing: the SDLC itself is evolving really rapidly, in ways we are all still only beginning to fully understand, and we're introducing new types of tools and paradigms of software development that are fundamentally different from how software has been built before. And so that introduces the need for new types of measurements, new approaches to measurement. So, a little bit of yes and no. Laura, what do you think?

Laura Tacho: Yeah, I think broadly the same. My guidance is always that to really answer this question, to understand how AI is impacting your organization's performance, we need to go back to the foundations of what good performance looks like, the physics of what makes good software: something that serves customer needs, that's easy to change, scalable, reliable. All of those fundamentals don't change just because AI has entered the room, and we really need to return to the fundamentals in a lot of ways. When you have a really good grasp of engineering performance and developer productivity, and how to measure it, you can then apply AI and see where things are lifted and where they're perhaps dipping a little bit, and really understand the impact. At the same time, of course, you do need some specific measurements to talk about, or to understand, how AI is penetrating your org. What does adoption look like? How are people actually using it? Are we able to do different … solve use cases in different ways because of AI? And really understand the full picture of everything that's happening.

So as you said, in a lot of ways the world is exactly the same and yet everything is entirely different, but I can’t emphasize enough that we don’t need to go back to a blank slate and try to rebuild the world of developer productivity, just because AI has entered the world. It’s really just a continuation and addition, on top of the fundamentals that are always true and will be evergreen.

Abi Noda: We've been spending the past few months working at this problem. We've worked with leading researchers in the field. We've also been collaborating closely with several of the leading AI code assistant vendors, and hundreds of organizations have also given their input on this. The goal here is to provide an evolving, but at least point-in-time, recommendation on how organizations should be thinking about measurement in the AI era. One of the things, Laura, I'll hand it off to you to talk about this, is that this is the Core 4, which is the previous framework we published at the end of last year around how to measure engineering productivity, plus AI measurement. Can you speak more on that, and also share with folks how this mirrors the organizational adoption journey that we're seeing?

Laura Tacho: Yeah, so this is the AI Measurement Framework. Abi and I, as he explained, co-authored this in collaboration with industry leaders, because we thought the last thing you need is a vendor-specific framework. We wanted to make your life easier by uniting a lot of industry practice and thought, and giving you something that is vendor-agnostic, able to be used for whatever your use case is, and able to meet you where you are. We released this publicly just a few weeks ago, but we've been working on it for months now, with a lot of research. The framework comes down to three main dimensions: utilization, impact, and cost. And I think a general, universal truth of developer productivity metrics and frameworks like this is that it's so important to look at those three dimensions together, as part of a whole story, a basket of metrics, and not over-index on one.

And so we just want to caution you and highlight that right from the start: these need to be looked at together, in context, in order for this to work the best way. How we landed on utilization, impact, and cost is that this often mirrors the adoption journey of AI tools within an organization. The first thing that organizations look at is utilization. Are users actually adopting the tool? If you think about any other dev tool, like a CI/CD tool, you would do the same thing. You would look at how many builds are actually running, how many projects have automated their CI/CD with this particular tool. The mental model for AI is similar, so you can look at things like daily active users to understand utilization.

Impact, though, is where we return to the Core 4 and to those foundational definitions of performance, because those don't change. Those are the things that are staying the same. So we can look at the baseline we had before introducing AI tools, and then see how those values change over time once we get developers onboarded into using tools for code authoring and for other parts of the SDLC.

And then finally, moving on to cost. We want to make sure that we're investing the right amount, and so that might mean checking to see: are we investing too much? Are we investing too little? Is this investment the right amount? So we're looking at things like spend on licenses and usage-based consumption, but then also at things like training and enablement, and trying to get the full picture of ROI. When you look at these things together, you can tell a really comprehensive story, and you have the full picture of what AI is actually doing in your organization. And then those impact metrics tell you: are these tools helping us get to market faster? Are they increasing quality and maintainability, and do we have visibility there? So it's not just about short-term gains, but also about long-term, sustainable gains as well.
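To make the "look at the three dimensions together" point a bit more concrete, here is a minimal sketch of how utilization, impact, and cost might sit side by side in a single report. All of the numbers, field names, and signals here are hypothetical and just one possible way to combine the dimensions; they are not prescribed by the framework itself.

```python
# Illustrative sketch only: hypothetical numbers showing how the three
# dimensions of the AI Measurement Framework (utilization, impact, cost)
# might be summarized side by side rather than in isolation.
from dataclasses import dataclass


@dataclass
class AiMeasurementSnapshot:
    licensed_seats: int             # utilization: seats purchased
    weekly_active_users: int        # utilization: developers using AI weekly
    pr_throughput_baseline: float   # impact: PRs per developer per week, pre-AI
    pr_throughput_current: float    # impact: PRs per developer per week, now
    monthly_spend_usd: float        # cost: licenses plus usage-based consumption

    def utilization_rate(self) -> float:
        return self.weekly_active_users / self.licensed_seats

    def throughput_change_pct(self) -> float:
        baseline = self.pr_throughput_baseline
        return 100 * (self.pr_throughput_current - baseline) / baseline

    def cost_per_active_developer(self) -> float:
        return self.monthly_spend_usd / max(self.weekly_active_users, 1)


snapshot = AiMeasurementSnapshot(
    licensed_seats=500,
    weekly_active_users=340,
    pr_throughput_baseline=3.1,
    pr_throughput_current=3.4,
    monthly_spend_usd=19_000,
)

print(f"Utilization: {snapshot.utilization_rate():.0%} of seats active weekly")
print(f"Impact: PR throughput change of {snapshot.throughput_change_pct():+.1f}%")
print(f"Cost: ${snapshot.cost_per_active_developer():,.0f} per active developer per month")
```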

Abi Noda: Yeah, I can't emphasize enough that this very much mirrors, as you said, Laura, the adoption journey for organizations: starting with, "Hey, let's get these tools enabled. Let's get these tools in the hands of developers. Let's experiment with a lot of different tools." Then moving into, "Okay, we're experimenting with different tools, folks are starting to adopt them and use them. How can we actually understand what kind of impact they're having on the SDLC, on developer experience, on developer productivity?"

And then folks are just starting to talk about this, in all honesty, which is the cost aspect, right? Hey, in a world where a single developer can burn thousands of dollars worth of tokens on AI tooling, in minutes, how do we think about what is the right amount of spend per developer? What are the right places where AI spend is giving us positive net ROI? Those are the types of questions organizations are starting to ask.

I would say, in terms of what we're seeing, folks are mostly in the first two columns of their journey right now: turning these tools on, encouraging developers and enabling them to learn how to use these tools, and increasing the maturity of how they're using them. Then they're really starting to study impact, not just longitudinally over time, but also looking across different tools. So vendor evaluations and bake-offs: which tools are actually working most effectively for developers, or for different types of developers, or for senior versus junior developers. So there are a lot of different use cases for this right now. And ultimately, I think what a lot of organizations are looking to do is detach themselves from the marketing and the hype and the headlines, and really get data grounded in their own organization, so they can have rational discussions about what impact they can expect to see today and in the future, and how to get more value out of these tools.

Laura Tacho: Yeah, one question, Abi, that I get all the time is: how do we know that AI is not hurting our organization long term? That it's not causing long-term tech debt problems, that we're not just kicking the can down the road? Because when you look at the headlines, a lot of it is just about activity or output, how many lines of code were authored, how many PRs, and those can give you useful but limited insight. We do recommend tracking some of those as part of utilization, just to understand the surface area that these tools are touching, because you need to make decisions based on that. If you're an organization that has a lot of adoption, and actually no code written by these tools is being shipped to production, that's something you want to know. So there is utility in understanding how much code is being written, but we agreed a long time ago that lines of code isn't developer productivity, and that doesn't change just because AI is the one authoring it.

And so my response when someone asks me, "Well, how do we make sure that it's not just garbage code?" is kind of like, "Well, how do you make sure it's not garbage code right now?" We look at things like quality and change failure rate, we look at developer satisfaction, we look at change confidence and maintainability, and all of those things together can help you get a full picture of the real impact that AI is having, so that you don't get too fixated on a number like percentage of code or acceptance rate, for example, which I think we'll talk about in a little bit, Abi, and which we didn't include in here. We don't want to get tunnel vision, and we want to make sure to have a really broad picture, not just of near-term gains, but also of mid- and longer-term gains as well.

Abi Noda: To add to that, we definitely recommend looking at downstream quality signals to understand the potential side effects of increased AI adoption. One thing we look at really closely with the organizations we work with is code maintainability, which is a perceptual measure, a self-reported measure from developers of how easily they feel they can understand, maintain, and iterate on the code base. As we would expect, as more code is written by something other than humans, humans are less knowledgeable about that code. And we do see, not in all cases, but in many cases, a decrease in self-reported code maintainability scores. Now, what I get asked a lot is: so what? What do we do with that? And I think it's really interesting, because on one hand that's both an intuitive and slightly concerning signal. On the other hand, I really like to acknowledge that perhaps AI-augmented coding is just a new abstraction layer. We started out writing machine code, then moved into higher-level abstractions. Perhaps this is just the next abstraction, and if that's the case, then codebase maintainability maybe is not actually as important in a world where we're just operating at a higher abstraction level.

Laura, I'll also move into the topic of acceptance rate, and I see a lot of questions about this in the Q&A and comments, so I want to address it. A lot of organizations want to track how much of their code is being generated by AI. One of the proxy metrics we've seen for that is acceptance rate. However, our point of view, and I think the consensus point of view amongst many practitioners and researchers, is that acceptance rate is an incredibly flawed and inaccurate metric, because when developers accept code, much of that code is often later modified or deleted, then re-authored by humans. And so in terms of techniques for tracking AI-authored code, and I've seen comments about this, yes, tagging the code, tagging PRs, is one easy way to get started. Some of the AI tool vendors have, or are also developing, different types of techniques and technologies for trying to assess this more accurately.

At DX, we've been developing a technology that is really observability at the file system level, looking at the rate of changes to files to be able to distinguish changes coming from human typing from AI tools that are making batch modifications to the files. So that's another approach to consider: looking at the file system level, because this is able to cross-cut all IDEs, all AI tools, and CLI agentic tools. For folks who are interested in learning more about that, sign up for a demo of DX; that technology is something that's coming out soon. But I just wanted to give clarity to that question, because one of the questions I hear often from organizations is, "Hey, we really can't figure out a way to track AI-generated code."
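DX's file-level detection is its own product, but to make the general idea being described more concrete, here is a small, hypothetical sketch of that kind of heuristic: classify recorded file-change events by how large and how bursty they are, on the assumption that human typing shows up as small, incremental edits while a tool applying a batch modification touches many characters and files within a very short window. The event shape, file names, and thresholds below are invented for illustration and are not DX's actual implementation.

```python
# Hypothetical sketch: label file-change events as "likely human typing" vs.
# "likely AI batch edit" based on the size and burstiness of nearby changes.
from typing import NamedTuple


class ChangeEvent(NamedTuple):
    timestamp: float      # seconds since the editing session started
    path: str             # file that changed
    chars_changed: int    # size of the change


def classify_events(events: list[ChangeEvent],
                    window_seconds: float = 2.0,
                    batch_chars: int = 500,
                    batch_files: int = 3) -> list[tuple[ChangeEvent, str]]:
    """Label each change as 'likely_ai_batch' or 'likely_human_typing'."""
    labeled = []
    for event in events:
        # Look at everything that happened within the surrounding time window.
        window = [e for e in events
                  if abs(e.timestamp - event.timestamp) <= window_seconds]
        total_chars = sum(e.chars_changed for e in window)
        distinct_files = len({e.path for e in window})
        if total_chars >= batch_chars or distinct_files >= batch_files:
            labeled.append((event, "likely_ai_batch"))
        else:
            labeled.append((event, "likely_human_typing"))
    return labeled


events = [
    ChangeEvent(0.0, "api/handlers.py", 18),     # small edits: looks like typing
    ChangeEvent(1.2, "api/handlers.py", 24),
    ChangeEvent(30.0, "api/models.py", 900),     # large multi-file burst: looks like a tool
    ChangeEvent(30.4, "api/schemas.py", 650),
    ChangeEvent(30.6, "tests/test_models.py", 700),
]

for event, label in classify_events(events):
    print(f"{event.path:>24}  {label}")
```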

Laura Tacho: Yeah, absolutely. I'm really excited to see what that kind of data unlocks in terms of decision-making and usage patterns. There were a couple of other questions about measurement that maybe this is a good time to get into, because actually getting the data is a bit of a question mark for some of the metrics we have here in the framework, or for other things you might want to be measuring. The tools themselves are very new, so the telemetry they offer is also maturing; some have more mature API endpoints that you can call for team utilization, and it's not necessarily standardized. Broadly speaking, just to give a bit of an overview, there are three different ways you can think about gathering these metrics. The first is what's been mentioned a couple of times: can you get some of these metrics from tools? Are you looking at GitHub? Can you get it from the AI tool itself? And yes, a lot of them do have APIs available for you to get usage data or consumption data, for example, and you can put together a picture from that. So those are workflow metrics.

Those workflow metrics also come from looking at GitHub, looking at GitLab, getting data from your systems to understand what PR throughput looks like, and other kinds of metrics from those workflow systems.

The second kind falls into the category of periodic surveys, so a quarterly survey. Someone asked the question of how you measure developer satisfaction or developer experience, and a survey is a great way to do that. We recommend running surveys quarterly, and these are going to give you longer-term trends. You can see how those lines move up or down over the course of several quarters: we introduced this tool in Q1 of this year, the Q2 results show these interesting things, and you can do some analysis from there. These are really valuable for looking at trends over time.

The third way of getting this data is something called experience sampling, and that is asking one very targeted question, extremely close to the work. So you can imagine a developer closing or merging a pull request and being asked, "Did you use an AI tool to author code in this?" Or asking the reviewer, "Was it more difficult, the same difficulty, or less difficult to understand this code because it was authored with AI?" We're asking very, very targeted questions in the workflow, and that allows you to get extremely specific feedback about a very specific thing, because you're staying very close to the work. A lot of the recency bias, or other things that might be a concern for folks in the periodic surveys, tends to dissipate a bit with experience sampling, because we're catching people in the moment and getting their feedback.

Beyond looking at usage statistics from the tools themselves, that's also an interesting technique for capturing adoption and usage, because if you're doing experience sampling in such a way that you're getting feedback about who's opening a PR authored with AI assistance, you can get a pulse on how many of your developers are using AI tools on a daily or weekly basis.

So those are the three main ways to get this data. And so I wanted just to take a second to answer some of those questions in the Q&A about measurement techniques.
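To make experience sampling a bit more concrete, here is a minimal, hypothetical sketch of asking one targeted question in the workflow: a handler that posts a single survey question as a comment when a pull request is merged. The payload shape loosely follows a generic "pull request closed" webhook, and post_comment is a placeholder rather than any specific vendor's API.

```python
# Minimal, hypothetical sketch of in-workflow experience sampling: when a pull
# request is merged, post back a single targeted question.
import json

SAMPLING_QUESTION = (
    "Quick question (1 of 1): was any of the code in this PR authored with an "
    "AI assistant? Reply with: none / some / most"
)


def post_comment(repo: str, pr_number: int, body: str) -> None:
    # Placeholder: a real integration would call your SCM's comment API here.
    print(f"[{repo}#{pr_number}] {body}")


def handle_webhook(raw_payload: str) -> None:
    """Ask one sampling question when a merged pull request event arrives."""
    payload = json.loads(raw_payload)
    if payload.get("action") == "closed" and payload["pull_request"].get("merged"):
        post_comment(
            repo=payload["repository"]["full_name"],
            pr_number=payload["pull_request"]["number"],
            body=SAMPLING_QUESTION,
        )


# Simulated webhook delivery, purely for illustration.
example_event = json.dumps({
    "action": "closed",
    "pull_request": {"number": 482, "merged": True},
    "repository": {"full_name": "acme/payments-service"},
})
handle_webhook(example_event)
```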

Abi Noda: As always, there's often more than one way to get to the same data point, and as we always like to remind folks, it's always better to get multiple data points that you can triangulate. I saw a question in the chat about self-reported … or sorry, AI-driven time savings. That's something that we generally recommend as a survey-based measure, to start to get a baseline, but it's also something that you can triangulate your way into by looking at logs-based data. For example, for the Google headline around 10% productivity gains that came out, they measured time savings through analysis of logs-based, system-level data. On the topic of AI-driven time savings measured through surveys, I did want to touch on that recent meta study that came out, because one of the really interesting findings was that developers' self-reports that they were faster with AI than without it were actually wrong in many cases: observationally, they were slower when using AI than without it.

So I think I’ve commented on this on LinkedIn and online, but it’s a really interesting phenomenon. I think that is actually pretty pervasive right now in the industry where folks feel the magic of these AI tools, and the superpower of these AI tools and the ease of these AI tools, and sort of conflate that with actual time savings or productivity gains. I see even Mitchell Hashimoto talking about, for example, “Hey, it was incredible what I was just able to build with AI, but was I actually faster? Questionable.” So I think there’s a lot of that going on in the industry right now. So taking a step back, again, we need multiple data points to cross validate, to triangulate, to correlate, really understand the ground truth of what’s happening, but again, combining both self-reported data with telemetry and systems data is the way to go, as always.
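As a small, hypothetical sketch of what that triangulation can look like in practice, the snippet below puts self-reported time savings from a survey next to a system-derived signal (median PR cycle time before and after AI adoption) and flags where perception and logs disagree. All of the developer IDs and numbers are invented for illustration.

```python
# Hypothetical sketch of triangulating self-reported time savings against a
# system-derived signal. The data below is invented purely for illustration.

survey_hours_saved = {   # developer -> self-reported hours saved per week with AI
    "dev_a": 6.0,
    "dev_b": 5.0,
    "dev_c": 1.0,
}

cycle_time_hours = {     # developer -> (median PR cycle time pre-AI, post-AI)
    "dev_a": (40.0, 31.0),
    "dev_b": (38.0, 41.0),   # feels faster, but the system data says slower
    "dev_c": (36.0, 34.0),
}

for dev, reported in survey_hours_saved.items():
    before, after = cycle_time_hours[dev]
    observed_change_pct = 100 * (after - before) / before
    perceived = "faster" if reported > 0 else "no change"
    observed = "faster" if after < before else "slower"
    note = "" if perceived == observed else "  <-- perception and logs disagree"
    print(f"{dev}: reports {reported:.1f}h/week saved ({perceived}); "
          f"cycle time {observed_change_pct:+.0f}% ({observed}){note}")
```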

Laura Tacho: Yeah, absolutely. There are many different ways to arrive at the data, and having data that confirms what another source says is always going to give you confidence that the signal you're getting is real. So far we've been talking about cases where there's a human in the loop: individual developers using AI in a couple of different modalities. Maybe it's chat, using it as a pair programming partner and rubber-ducking some things with ChatGPT or whatever. We've got code completion in the IDE. Then we have agentic workflows, which are actually quite different from those two earlier modalities. Can you talk a little bit, Abi, about where we landed with measuring those kinds of workflows, and even what we could expect to change in the next six months or 12 months?

Abi Noda: Yeah. First of all, there's a lot of industry discussion just around, "Hey, we need clear definitions of what an agent is," and of AI tools in general. So there's a distinction between, say, something like AI-powered autocomplete in an IDE, versus some of the newer tools that are autonomous loops, if you will, that can go … really complete entire end-to-end tasks, and can even do discovery of tasks fully autonomously. That's an important distinction when understanding AI impact in general: we not only need to think about adoption from a daily and weekly usage standpoint, but also about maturity. Are we talking about developers using AI for autocomplete, or agentic tools? I think the really interesting question that we've been wrestling with as part of this research is: how do we think about the measurement of that? Do we begin to look at agents as people, in terms of, are we measuring the agent's productivity? Are we measuring the agent's … even experience? That's something that's come up in our discussions.

We acknowledge that the space is evolving very quickly and that our guidance will be updated as well. But as of now, our point of view and recommendation is to treat agents as extensions of people. What that effectively means is that we're treating people as managers of agents, and treating people plus agents as teams. So the way we should be measuring the efficacy and productivity of agentic software development is by looking at the agents as extensions of the people, and measuring them as a group, as a team, essentially.

Laura Tacho: To illustrate this, I have maybe a bit of a funny example, but I think it works really well, which is thinking about the OG agent, Jenkins. I know Jenkins is a tool that we love to hate. I used to work at CloudBees, so I feel like I get a pass and I can talk about Jenkins. But whether it's Jenkins or any other CI/CD tool, we don't treat Jenkins as an employee, or think about him as his own team. We look at the efficiency gains in the context of the team in which those CI/CD tools, or Jenkins, are operating. And so this is an important distinction as well when we're talking about agentic AI. They're not digital employees, necessarily; they belong to the teams that are still overseeing the work that's coming in and out. One thing that's very interesting to think about with agentic AI, autonomous loops as you called them, Abi, or just AI in general, is expanding the definition of what a developer is.

So when we're talking about measuring AI as part of development teams, of course we're thinking about people who are trained as developers, but one of the other things we're observing is that there are now a lot more people in an organization who are able to contribute code. We have to make sure we're capturing the full footprint of where AI is having an impact, because it's not just people who have developer or engineer in their title who are now contributing code. This enables folks who were maybe adjacent to engineering to contribute code, and we want to make sure to bring that phenomenon into the fold when we're measuring, as well.

Abi Noda: Laura, I'd love to start moving into some of the questions we're getting. I also want to make sure we wrap up our talk track here with two really important points we're seeing. One is that one of the biggest questions I'm getting, Laura, is, "Hey, how do we go beyond code gen?" All the hype right now is around AI code assistants and vibe coding tools. How do we think about other ways to impact the SDLC with AI? I would admit that I think we're really early there. We've actually been trying to do a study on that internally. We've been talking to a lot of companies, and quite frankly we're not seeing that much yet. We're not seeing a lot of major case studies, so I think it's really early days there. I think one piece of guidance around that is to aim before you fire. Going back to the importance of more classical developer productivity signals and metrics, really understanding where the opportunities in our SDLC are before committing to AI experimentation and investments is really important right now.

And the other thing is that we should also remember that AI is just one tool. It's certainly a transformative tool, but we need to be thinking beyond AI in terms of engineering productivity overall. There are still a lot of human bottlenecks, and a lot of tooling and systems bottlenecks outside of AI tools. So organizations that are ultimately looking to accelerate need to not be focused solely on AI right now. That's another important takeaway for listeners today.

Laura Tacho: Yeah, absolutely. I want to answer this question about how you measure second-order effects of AI: much more code generated and more time needed for code review, the risk of a massive flood of code to work through, moving the bottleneck down the line. This is where those impact metrics in that second column become really important, because the utilization metrics in the far-left column just show penetration of the tool: how is it being adopted in your organization? It's the impact metrics that matter here, and I'll share again the DX Core 4 framework. If you're not familiar with this, this is a framework that Abi and I authored and released last year.

This unifies DORA, SPACE, and DevEx, and gives people really clear guidance on what to measure for developer productivity and engineering org performance. This is where you're going to see the hotspots from those second-order effects of AI.

So we can think of a hypothesis: what would happen if we have too much code generated, and that then leads to a maintainability burden on the organization? We might expect PR throughput to slow down, because the system is not as efficient and there's too much friction. We should expect to see developer experience take a dip. And so this is where understanding the fundamentals of what makes up engineering organization performance, and we're using the Core 4 here as an example, seeing the before and after, and keeping a close eye on things like developer experience, change failure rate, throughput, and innovation ratio, is where you're going to see those second-order effects really come in. That way you're not trading the short-term efficiency gain of generating more code and getting a lot of work into the system faster for longer-term pain; we're not just kicking the can down the road.
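One minimal, hypothetical way to watch for that kind of second-order effect is to compare a workflow signal before and after the rollout. The sketch below uses median PR review time with invented records and an assumed rollout date; it is an illustration of the before-and-after idea, not real data from any organization.

```python
# Hypothetical sketch: compare median PR review time before and after an
# assumed AI rollout date to spot a review-time bottleneck forming.
from datetime import date
from statistics import median

ROLLOUT_DATE = date(2025, 3, 1)   # assumed rollout date, not real data

prs = [  # (merged_date, hours from PR opened to first approval)
    (date(2025, 1, 20), 18.0),
    (date(2025, 2, 5), 20.0),
    (date(2025, 2, 18), 17.0),
    (date(2025, 3, 10), 26.0),
    (date(2025, 4, 2), 30.0),
    (date(2025, 4, 22), 28.0),
]

before = [hours for merged, hours in prs if merged < ROLLOUT_DATE]
after = [hours for merged, hours in prs if merged >= ROLLOUT_DATE]

print(f"Median review time before rollout: {median(before):.0f}h")
print(f"Median review time after rollout:  {median(after):.0f}h")
if median(after) > median(before):
    print("Review time is climbing: more generated code may be shifting the "
          "bottleneck to code review, worth checking alongside DevEx survey data.")
```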

Abi Noda: Laura, I'm going to try to take a few questions here, rapid fire. So we have a question from Roxana: once you're measuring all these things, how do you use them? What types of actions do you take based on them? My response is that it depends where you are in your journey. We see folks, for example, just starting out with the utilization metrics at the very beginning of their rollout efforts, to really drive enablement, training, encouragement, and comms, just to make that shift and get developers to start using these tools. We see a lot of organizations using this data to evaluate tools. There's a new tool literally every day right now in the AI space, so being able to take a data-driven approach to evaluations matters. We're also seeing folks use the data to, again, aim before they fire: to think about planning and rollout strategy, what parts of the organization to focus on, and which tools are most conducive to different types of developers across the organization. And lastly, folks are using this data to really understand: hey, amidst all this hype, what is the ROI? What is the impact we are seeing right now in our organization? That's a really important question that I think every board, every corporation is trying to get answers to. So being able to show up to that conversation with real data, and a narrative around it, is really valuable.

I'm going to move into another question I really liked, which is from Tony: how much are we thinking about the learning curve of AI tools? This is something I touched on earlier, but it's really important to not only think about utilization in terms of logins, meaning who touched an AI tool each day. By that measure, I think a lot of organizations are making rapid progress. A few weeks ago I was out saying, "Hey, look, a lot of even the top organizations we see are only at about 60 to 70% weekly or daily active usage." I think just in a few weeks that number has already increased. Now, another way to look at utilization is from a maturity angle. So not just how often someone is touching a tool, but how they're using it. There's a big difference between developers using autocomplete versus pulling up a tool like Claude Code and giving it large, end-to-end, multi-step tasks. So maturity is really important. I know there are folks out there working on maturity models for developer AI fluency or adoption. That's something at DX we're looking at as well. The space is, again, evolving so quickly that we haven't published guidance there yet, but we are planning to. Laura, if I can take one more question. I said this would be rapid fire.

Laura Tacho: Please.

Abi Noda: Which was a question from Carrie about their design department, who is interested in using AI tools to generate full prototypes. One of the key points we make in the AI Measurement Framework, the white paper, which by the way, folks, you can access just by visiting our website, getdx.com, is that the very definition of who is a developer is being somewhat challenged and is expanding, right? In a world where business analysts, designers, and product managers can now generate code, we need to rethink the definition of who is a software developer. That being said, the biggest value we're seeing from organizations right now is in giving designers and PMs some of these vibe coding tools as a way to do more rapid prototyping and exploration. We have not seen any case where designers and PMs are now using these tools to contribute production-level code. It seems, from Carrie's question, that there may be some tension around whether to allow PMs to contribute to production code. Our recommendation is that that's not something we're seeing happen successfully right now. However, I don't see any reason to necessarily prohibit it. I think the key would be to have that code go through the same code reviews and the same quality gates as any other code in the organization.

Laura Tacho: Yeah, absolutely. Thanks, Abi. Let's tie up for today with just a little bit of guidance on how to actually bring these metrics into your organization, and what a rollout looks like. As with the Core 4 and every other sort of universal truth of developer productivity metrics, when we suggest tracking utilization, we're always talking about aggregating it at the team level. We're never talking about using it for individual performance management, or tracking it at an individual level. I think that's worth calling out again and again, because there is some sensitivity around measuring developer productivity in general, and in the age of AI there's extra sensitivity: people don't always feel confident about whether they're allowed to use AI, or whether they're putting themselves in a risky position by relying a lot on AI to get their job done, and whether that's a good thing or a bad thing.

And so looking at aggregate values, maybe at the team level or department level, is enough to give you the directional trends needed to provide support and intervention. So make sure that it's clearly communicated who's going to see the data and what decisions are being made from it; that's going to put you in a better position to have success with these metrics. The point I'd like to end on today is that we can't lose sight of the bigger picture: AI is another tool. AI works because it improves developer experience, and developer experience leads to better organizational outcomes. AI is not a silver bullet that's magically going to allow you to get to market in 30 seconds when things used to take 30 weeks. There is a lot of promise, though, and like any tool, we really need to deeply understand how it's working and the ways it works best, so that we can provide support and enablement to engineers and identify the use cases that are the most valuable to our company. And the only way to do that is to have really good data.

So instead of focusing on the hype and the headlines, focus on your data. Treat this as an experiment, with your scientist hat on. Have your baseline of your Core 4 metrics, or whatever you're using to measure developer performance and organizational performance, so you can see how AI impacts it, and then also keep an eye on utilization, impact, and cost in order to have the most complete picture of how AI is impacting your organization.
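As a small, hypothetical sketch of what "aggregate at the team level" can mean in practice, the snippet below rolls individual usage records up into team-level adoption rates so that only aggregated numbers are reported. The team names, developer IDs, and records are invented for illustration.

```python
# Hypothetical sketch of aggregating utilization to the team level before
# reporting, so that no individual-level numbers leave the data pipeline.
from collections import defaultdict

# Individual records: (team, developer, used an AI tool this week?)
records = [
    ("payments", "dev_a", True),
    ("payments", "dev_b", True),
    ("payments", "dev_c", False),
    ("platform", "dev_d", True),
    ("platform", "dev_e", False),
]

team_totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [active, total]
for team, _developer, used_ai in records:
    team_totals[team][0] += int(used_ai)
    team_totals[team][1] += 1

# Only the aggregated, team-level view is reported.
for team, (active, total) in sorted(team_totals.items()):
    print(f"{team}: {active}/{total} developers used AI this week ({active / total:.0%})")
```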

Abi Noda: I would just plus-one that the framing and the rollout are really important. A lot of these tools enable a level of telemetry into the developer day-to-day that is even more invasive, if you will, than the types of telemetry we've had before, literally down to keystroke-level telemetry. And so, naturally, there's a lot of concern from developers and organizations about where this is all headed. Lots of organizations are really pushing adoption from the top down, even mandating AI adoption. There's a lot of pressure and a lot of fear right now about becoming obsolete; that's the narrative around AI. So it's really important for organizations to be proactive in how they communicate about the use of measurements in the organization. Be clear about what you are and what you aren't using this data for, and make sure that this is ultimately about helping the organization and all developers make this transition in a rational and data-driven way.

Laura Tacho: Yeah, data beats hype every time, and so if you're stuck trying to explain why your organization isn't replacing or shipping 50% of your code with AI, having the data, using the AI Measurement Framework, is the best way to get yourself out of that stuck place. Data beats hype every time. You need data to make these big decisions with big budgets, and so we hope you find the AI Measurement Framework useful.

Please find Abi and me on LinkedIn; we're posting and talking about AI, developer productivity, and developer experience all the time, and we'd love to interact with you there. Thanks so much for spending a bit of your day with us, and we will see you around the internet.