How do we interpret AI impact without overclaiming causation?
A better way of framing the impact of AI.
This post was originally published in Engineering Enablement, DX’s newsletter dedicated to sharing research and perspectives on developer productivity. Subscribe to be notified when we publish new issues.
Last week I hosted a live AMA about measuring the impact of AI and where organizations are with their rollout today. (Special thanks to Jesse Adametz for hosting the discussion.) One of the questions that came up had to do with measuring ROI: specifically, how should we think about correlation vs. causation? Leaders want to be careful about whether they should attribute changes in metrics to AI tools.
I answered the question live. You can download and watch the full recording here, or read this week’s newsletter for my take.
Many organizations make this mistake: They compare developers who use AI heavily to those who use it less, notice that the heavy users have higher throughput, and then conclude that AI must be the cause. The problem with this kind of analysis—“do people who use AI more have higher code throughput?”—is that in many cases, the developers who use AI more are the ones who were already coding more in the first place.
That’s why most of the time, longitudinal analysis (looking at how things change over time) is more informative. But it’s also harder to do. You need clean data over a long enough period, and you still have to account for confounding factors. For example, one big confound right now is the heightened pressure in many companies to increase throughput. Leaders are simultaneously pushing for more output and more AI usage. Because of this, some of the increase in throughput is likely due to this pressure itself—a classic case of Goodhart’s law, where once a metric becomes a target, it stops being a good measure.
When thinking about the ROI of AI more broadly, it’s useful to break it into two buckets:
- Amplification: How much more productive are humans thanks to AI? Here we look at:
- Throughput (are engineers shipping more by using AI?)
- Time saved (how much time do developers feel they’re saving in specific workflows?)
- Developer experience scores (are AI tools improving the overall developer experience?) We can then convert improvements in developer experience into time savings, for example using something like a developer experience index.
- Augmentation: To what extent are you actually extending your engineering capacity by using agents, as if they were additional headcount? One unit we like to use here is human-equivalent hours: how much work are agents delivering, and how long would that work have taken a human? Dividing what the agents cost by the human-equivalent hours they deliver gives you an effective agent hourly rate. If that rate is low relative to the cost of a human hour, it indicates a high-ROI place to invest.
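The agent hourly rate arithmetic above can be sketched in a few lines. All of the numbers here are invented for illustration, not real benchmarks:

```python
# Sketch: an effective "agent hourly rate", i.e. cost per human-equivalent
# hour of agent-delivered work. Inputs below are hypothetical.

def agent_hourly_rate(human_equivalent_hours: float, agent_cost_dollars: float) -> float:
    """Cost per human-equivalent hour of work delivered by agents."""
    return agent_cost_dollars / human_equivalent_hours

# Suppose agents delivered work estimated at 400 human-equivalent hours
# last month, at a total cost (tokens, seats, infrastructure) of $3,000.
rate = agent_hourly_rate(human_equivalent_hours=400, agent_cost_dollars=3000)
print(f"${rate:.2f} per human-equivalent hour")  # prints "$7.50 per human-equivalent hour"
```

Comparing that effective rate against a loaded human hourly cost is what makes the "additional headcount" framing concrete.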
This amplification/augmentation framing is also helpful when talking to executives. You can say: to some extent we’re amplifying our humans, and to some extent we’re augmenting our workforce with agents.
A related question tends to come up in this discussion: if we agree that lines of code (LOC) is a poor metric, what should we use instead to measure AI’s impact?
LOC is a noisy metric for a simple reason: a low-effort change can involve many lines of code, and a high-effort change can involve very few. That was already true before AI, and AI-generated code, which tends to produce more lines than a human would, inflates the metric further.
For these reasons—both pre- and post-AI—we’ve preferred metrics like PR throughput. It’s still imperfect, but it gives a more normalized view of “change throughput”: how many atomic changes are we pushing through the system? That makes it a less noisy high-level signal than raw LOC.
At DX, we’ve also developed a metric called TrueThroughput, which is a weighted version of PR throughput. It incorporates lines of code as one of several inputs to weight PRs, but LOC is not the sole or primary metric. Even so, all of these metrics are imperfect; you’re really choosing between different degrees of imperfection. In that space, LOC is significantly less useful than PR throughput as a primary signal. This lines up with broader industry experience as well. Many large tech companies that have studied this problem in depth have also converged on some form of change throughput as their preferred signal for tracking impact, rather than relying on raw lines of code.
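To make the idea of weighted change throughput concrete, here is a minimal sketch. The weighting scheme below is entirely made up for illustration; it is not DX’s TrueThroughput formula. It only shows the general shape: count PRs, but weight each one by several inputs, with LOC dampened so large diffs don’t dominate:

```python
# Illustrative weighted PR throughput. Weights and inputs are invented
# for illustration; this is NOT DX's actual TrueThroughput formula.
import math
from dataclasses import dataclass

@dataclass
class PullRequest:
    lines_changed: int
    files_touched: int

def pr_weight(pr: PullRequest) -> float:
    # LOC is one of several inputs, log-dampened so a huge diff
    # doesn't swamp the signal; file count is a second input.
    return 1.0 + 0.5 * math.log1p(pr.lines_changed) + 0.25 * pr.files_touched

def weighted_throughput(prs: list[PullRequest]) -> float:
    # Raw PR count treats every change equally; this sums weights instead.
    return sum(pr_weight(pr) for pr in prs)

prs = [PullRequest(12, 2), PullRequest(800, 15), PullRequest(3, 1)]
print(f"raw PR count: {len(prs)}, weighted throughput: {weighted_throughput(prs):.1f}")
```

The design point is the log on lines changed: a 10,000-line generated diff counts for more than a one-line fix, but not a thousand times more.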
To summarize my perspective on the correlation vs. causation question: a simpler way to frame AI’s ROI is to look at how much it amplifies your existing developers, and how much it augments your organization with agent-driven capacity.
Download and rewatch the full AMA discussion here. My response to this specific question starts at 16:50.