
Takeaways: How to measure engineering productivity in the AI era

Brook Perry

Marketing

Organizations are caught between conflicting claims about AI’s impact on software development. Headlines range from Anthropic’s prediction that “AI will be writing 90% of code in six months” and Google’s measured “10% productivity improvement” to the recent METR study showing senior developers are actually slower with AI than without it.

These headlines highlight a growing gap between industry expectations and the reality of what engineering teams are seeing on the ground. This means that engineering leaders need their own data to make informed decisions about where and how to invest in AI, and clear evidence to communicate the impact they’re seeing back to the business.

DX’s CEO and CTO, Abi Noda and Laura Tacho, recently held a live discussion on measuring AI in engineering, covering topics such as the metrics they recommend using, how to think about measuring agents, and the expanding definition of “developer.”

Below is a summary of some of the key questions covered during the discussion. You can also watch the full recording here.

Q: Do we need to rethink the way we’re measuring productivity now that AI is here?

Laura: To really answer this question or understand how AI is impacting your organization’s performance, we need to go back to the fundamentals of what makes good software. Those are things like serving customer needs and being easy to change, scalable, and reliable. These fundamentals don’t change just because AI has entered the picture.

When you have a solid grasp of engineering performance, developer productivity, and how to measure them, you can then apply AI and see where things are improving or where they might be lagging. That helps you understand the true impact. At the same time, you need specific metrics to understand how AI is penetrating your organization: what does adoption look like? How are people using it? Are we solving problems in new ways because of AI? This gives you the full picture of everything that’s happening.

So, in many ways, the world is exactly the same, and yet it’s entirely different. But I can’t emphasize enough that we don’t need to start from scratch or rebuild the world of developer productivity just because AI is here. It’s really more of an extension, building on the fundamentals that have always been true and will always be relevant.

Q: Is there a framework to measure AI’s impact?

Abi: We’ve been working on this problem for months, collaborating with leading researchers in the field and several AI code assistant vendors. Hundreds of organizations have also provided input. Our goal has been to offer a research-backed recommendation for how organizations should think about measurement in the AI era.

What we’ve developed builds on DX Core 4—the framework we published at the end of last year for measuring engineering productivity—and extends it specifically for AI measurement.

Laura: The AI Measurement Framework is something we wanted to develop because we saw a critical gap. Existing approaches were often vendor-specific or overly narrow. Our goal was to create something vendor-agnostic, research-backed, and practical—something that could unite industry practices and meet organizations wherever they are on their AI journey.

The framework focuses on three core dimensions: utilization, impact, and cost.

These three dimensions mirror the typical adoption journey of AI tools within organizations:

  • Utilization is where most organizations start—are users actually adopting the tool? This follows the same logic you’d apply to any dev tool. For a CI/CD tool, you’d look at how many builds are running or how many projects have automated their processes. For AI, you can track metrics like daily active users to understand adoption patterns (see the sketch after this list).
  • Impact is where we return to those foundational performance definitions that don’t change—the Core 4 metrics. You establish a baseline before introducing AI tools, then track how those values change over time as developers get onboarded for code authoring and other parts of the software development lifecycle.
  • Cost ensures you’re investing appropriately. This means checking whether you’re spending too much, too little, or just right. Look at license costs, usage-based consumption, training and enablement expenses—the full ROI picture. When you examine these dimensions together, you can tell a comprehensive story about what AI is actually doing in your organization.
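To make the utilization dimension concrete, here is a minimal sketch of computing daily active users from a hypothetical export of AI-tool usage events. The event shape and field names are illustrative assumptions, not a DX or vendor API:

```python
from collections import defaultdict
from datetime import date

# Hypothetical usage events exported from an AI coding tool's admin API.
# Each event records which user triggered an AI interaction and when.
events = [
    {"user": "alice", "timestamp": "2025-07-01T09:15:00"},
    {"user": "bob",   "timestamp": "2025-07-01T10:02:00"},
    {"user": "alice", "timestamp": "2025-07-02T14:30:00"},
]

def daily_active_users(events):
    """Count distinct users per calendar day."""
    users_by_day = defaultdict(set)
    for e in events:
        day = date.fromisoformat(e["timestamp"][:10])
        users_by_day[day].add(e["user"])
    return {day: len(users) for day, users in sorted(users_by_day.items())}

print(daily_active_users(events))
# {datetime.date(2025, 7, 1): 2, datetime.date(2025, 7, 2): 1}
```

The same rollup can feed a trend line of adoption over time, which is usually the first chart leaders ask for.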

Abi: This really does mirror the adoption journey Laura described. Organizations typically start with “Let’s get these tools enabled and in developers’ hands. Let’s experiment with different tools.” Then they move to “Okay, people are adopting these tools—what kind of impact are they having on the SDLC, developer experience, and productivity?”

The cost aspect is just starting to get serious attention, and honestly, it needs to. In a world where a single developer can burn thousands of dollars in AI tokens within minutes, organizations are asking critical questions: What’s the right spend per developer? Where is AI spend delivering positive net ROI?

Most organizations we’re seeing are still in the first two phases—enabling tools, encouraging developers to learn them, and increasing usage maturity. Then they’re studying the impact both longitudinally over time and across different tools through vendor evaluations and bake-offs. They want to understand which tools work most effectively for different types of developers, or compare senior versus junior developer outcomes.

Ultimately, what organizations are looking to do is detach themselves from the marketing, hype, and headlines. They want data grounded in their own organization to have rational discussions about realistic impact expectations today and in the future, and how to extract more value from these tools.

Q: How do you ensure AI isn’t hurting long-term code quality?

Laura: When someone asks me, “How do we make sure that it’s not just garbage code?” my response is: “How do you make sure it’s not garbage code right now?” We look at things like quality, change failure rate, developer satisfaction, change confidence, and maintainability. All of those things together help you get a full picture of the real impact AI is having, so that you don’t get too fixated on numbers like the percentage of AI-generated code or acceptance rate.

Abi: One thing we look at really closely with the organizations we work with is code maintainability, which is a perceptual measure—a self-reported measure from developers on how easily they feel they can understand, maintain, and iterate on the codebase.

As we would expect, as more code is written by AI rather than by humans, humans become less knowledgeable about that code. And we do see, not in all cases but in many cases, a decrease in self-reported code maintainability scores. What I get asked a lot is, “So what? What do we do with that?”

I think it’s really interesting, because on one hand that’s both an intuitive and slightly concerning signal. On the other hand, perhaps AI-augmented coding is just a new abstraction layer. We started out writing machine code, then moved into higher-level abstractions. Perhaps this is just the next abstraction, and if that’s the case, then codebase maintainability may not actually be as important in a world where we’re operating at a higher abstraction level.

Q: How do you actually track AI-generated code?

Abi: A lot of organizations want to track how much of their code is being generated by AI. One of the proxy metrics we’ve seen for that is acceptance rate. However, our point of view, and the consensus point of view amongst many practitioners and researchers, is that acceptance rate is an incredibly flawed and inaccurate metric, because when developers accept code, much of it is later modified, deleted, or rewritten by humans.

In terms of techniques for tracking AI-authored code, tagging the code and tagging PRs is one easy way to get started. Some of the AI tool vendors have, or are developing, different types of techniques and technologies to assess this.

At DX, we’ve been developing a technology that provides observability at the file-system level, looking at the rate of changes to files to detect whether changes come from human typing or from AI tools making batch modifications. This approach cuts across all IDEs, AI tools, and CLI agentic tools.
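DX hasn’t published the details of its detection approach, but as a rough sketch of the underlying idea: a large amount of content changing in a very short window looks more like an agent’s batch edit than human typing. The event shape and thresholds below are illustrative assumptions, not DX’s implementation:

```python
from dataclasses import dataclass

@dataclass
class FileChange:
    path: str
    bytes_changed: int
    seconds_since_prev_change: float  # gap since the previous change to this file

# Illustrative threshold; a real system would calibrate this empirically.
HUMAN_TYPING_MAX_BYTES_PER_SEC = 15.0

def classify_change(change: FileChange) -> str:
    """Label a file modification as likely human typing or an AI batch edit,
    based purely on how fast content is changing."""
    if change.seconds_since_prev_change <= 0:
        return "ai_batch"  # effectively instantaneous write
    rate = change.bytes_changed / change.seconds_since_prev_change
    return "human_typing" if rate <= HUMAN_TYPING_MAX_BYTES_PER_SEC else "ai_batch"

changes = [
    FileChange("src/app.py", bytes_changed=24, seconds_since_prev_change=6.0),    # ~4 bytes/sec
    FileChange("src/app.py", bytes_changed=4800, seconds_since_prev_change=0.5),  # large batch write
]
print([classify_change(c) for c in changes])  # ['human_typing', 'ai_batch']
```

The appeal of this kind of heuristic is that it sits below the tools themselves, so it doesn’t depend on any one vendor reporting its own numbers.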

Q: How do you gather these metrics in practice?

Laura: There are three main ways to think about gathering these metrics:

1. Tool-based metrics: Many AI tools expose APIs for usage and consumption data. You can also look at workflow metrics from GitHub, GitLab, and other systems to understand PR throughput and other workflow indicators (see the sketch after this list).

2. Periodic surveys: A quarterly survey is a great way to measure developer satisfaction or developer experience. These give you longer-term trends and let you see how those lines are moving up or down over the course of several quarters.

3. Experience sampling: This involves asking one very targeted question extremely close to the work. You can imagine a developer closing or merging a pull request, and asking, “Did you use an AI tool to author code in this?” Or asking the reviewer, “Was it more or less difficult to understand the code because it was authored with AI?” Asking these questions in the workflow gets you extremely specific feedback at the moment the work happens.
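To make the first bucket concrete, here is a minimal sketch of pulling one workflow metric, merged PRs in a date range, from GitHub’s search API. The repository name and dates are placeholders, and GitLab exposes equivalent queries:

```python
import requests

GITHUB_SEARCH = "https://api.github.com/search/issues"

def merged_pr_count(repo: str, since: str, until: str, token: str | None = None) -> int:
    """Count PRs merged in a repo within a date range (YYYY-MM-DD)
    using GitHub's issue search API."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    query = f"repo:{repo} is:pr is:merged merged:{since}..{until}"
    resp = requests.get(GITHUB_SEARCH, params={"q": query}, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()["total_count"]

# Placeholder repository and dates for illustration.
print(merged_pr_count("acme/web-app", "2025-07-01", "2025-07-31"))
```

Counts like this only become meaningful when trended against a pre-AI baseline and read alongside the survey and sampling data above.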

Organizations can either use a solution like DX, which combines all three measurement approaches in one platform, or choose to build and manage their own system.

Abi: As always, there’s often more than one way to get to the same data point, and we recommend collecting multiple data points you can triangulate. It’s always better to cross-validate, correlate, and really understand the ground truth of what’s happening by combining self-reported data with telemetry and systems data.

One interesting finding from that recent METR study was that developers’ self-reports of being faster with AI were actually wrong in many cases—observationally, they were slower on tasks where they used AI than on tasks where they didn’t. I think there’s a lot of that going on in the industry right now, where folks feel the magic of these AI tools and conflate that with actual time savings or productivity gains.

Q: How should we think about measuring AI agents?

Abi: There’s a lot of industry discussion around needing clear definitions of what an agent is. There’s a distinction between something like AI-powered auto-complete in an IDE, versus some of the newer tools that run autonomous loops, completing entire end-to-end tasks and even discovering tasks fully autonomously.

The really interesting question we’ve been wrestling with is: Do we begin to look at agents as people in terms of measuring the agent’s productivity? Are we measuring the agent’s experience?

As of now, our point of view and recommendation is to treat agents as extensions of people. In practice, that means treating people as managers of agents and treating people plus agents as teams. The way to measure the efficacy and productivity of agentic software development is to look at agents as extensions of the people and to measure them together, as a team.

Laura: To illustrate this, I have a funny example: thinking about the OG agent Jenkins. Whether it’s Jenkins or any other CI/CD tool, we don’t treat Jenkins as an employee, or think about it as its own team. We look at the efficiency gains in the context of the team in which those CI/CD tools are operating. This is an important distinction when we’re talking about agentic AI: they’re not necessarily digital employees; they belong to the teams that still oversee the work.

One thing that’s very interesting to think about with AI in general is the expanding definition of what a developer is. We’re observing that there are now a lot more people in an organization who are able to contribute code. We have to make sure we’re capturing the full footprint of where AI is having an impact, because it’s not just people who have “developer” or “engineer” in their title that are now contributing code.

Q: How do you use these metrics to drive action?

Abi: It depends on where you are in your journey. We see folks just starting out with the utilization metrics at the very beginning of their rollout efforts to really drive enablement, training, encouragement, and communications to get developers to start using these tools.

We see a lot of organizations using this data to evaluate tools. There’s literally a new tool every day right now in the AI space, so being able to take a data-driven approach to evaluations is valuable.

We’re also seeing folks use the data to plan their rollout strategy—what parts of the organization do we focus on? Which tools are most conducive to different types of developers across our organization?

And lastly, folks are using this data to really understand ROI. What is the impact we are seeing right now in our organization? That’s a really important question that every board, every corporation is trying to get answers to. Being able to show up to that conversation with real data and a narrative around that is really valuable.

Guidelines for rolling out AI metrics

Focus on team-level aggregation: When tracking utilization metrics, always aggregate at the team level. Never use these metrics for individual performance management. There’s sensitivity around measuring developer productivity in general, but also extra sensitivity in the age of AI around whether developers are allowed to use AI, or if they’re putting themselves in a risky position by relying on it.
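As a minimal illustration of team-level aggregation, the reporting layer should only ever surface per-team rollups. The data shape and team mapping below are hypothetical:

```python
from collections import defaultdict

# Hypothetical per-user utilization pulled from a vendor API, plus an
# org-chart mapping. Individual rows never leave this function.
weekly_ai_sessions = {"alice": 14, "bob": 3, "carol": 9, "dave": 0}
team_of = {"alice": "payments", "bob": "payments", "carol": "platform", "dave": "platform"}

def team_level_utilization(sessions: dict[str, int], teams: dict[str, str]) -> dict[str, float]:
    """Aggregate AI usage to team averages so no individual numbers are reported."""
    totals, counts = defaultdict(int), defaultdict(int)
    for user, n in sessions.items():
        team = teams[user]
        totals[team] += n
        counts[team] += 1
    return {team: totals[team] / counts[team] for team in totals}

print(team_level_utilization(weekly_ai_sessions, team_of))
# {'payments': 8.5, 'platform': 4.5}
```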

Communicate clearly: Be proactive in communicating about the use of measurements in your organization. Be clear about what you are and what you aren’t using this data for. Make sure that this is ultimately about helping the organization and all developers make this transition in a rational and data-driven way.

Remember the bigger picture: AI is another tool that works because it improves developer experience, and developer experience leads to better organizational outcomes. AI is not a magic silver bullet. There are still plenty of human, tooling, and systems bottlenecks outside of AI tools. Organizations looking to accelerate shouldn’t focus solely on AI right now.

Treat it as an experiment: Have your baseline of your core productivity metrics. See how AI impacts them, and keep an eye on utilization, impact, and cost to have the most complete picture of how AI is impacting your organization.

As Laura puts it: “Data beats hype every time.” Instead of focusing on the hype and headlines, focus on your data. If you’re stuck trying to explain why your organization isn’t shipping 50% of its code with AI, data gathered with the AI Measurement Framework is the best way to get yourself out of that stuck place.

The AI Measurement Framework whitepaper is available here. For more on how DX measures AI adoption and impact, get a demo.

Published
July 17, 2025