How to measure AI performance in software engineering

Real benchmarks, the right metrics, and a six-month roadmap for engineering leaders

Taylor Bruneaux

Analyst

This guide is written for engineering leaders and CTOs measuring the impact of AI coding tools on their software development teams.


TL;DR

  • Industry-wide AI tool adoption has reached 93%, but most organizations see only 5–15% gains in PR throughput. This is real, but far below vendor claims
  • Measuring AI performance requires three layers: utilization, impact, and cost. Tracking any one without the others produces a misleading picture
  • Developers save an average of 3.9 hours per week; daily AI users merge 60% more PRs than non-users
  • Quality metrics are the most overlooked dimension: some teams are shipping up to 50% more defects since AI adoption
  • Structured enablement, not tool access, is the primary driver of AI performance gains

AI-assisted engineering is now a core part of engineering strategy. DX’s analysis of 400+ companies found that industry-wide adoption has reached 93%, and developers are saving an average of 3.9 hours per week, up from 3.6 hours just one quarter ago.

The tools are working. The harder problem is knowing how well they’re working, and for whom.

Microsoft claims AI writes 20–30% of code. Google puts that number at 30%. Some predictions have AI authoring 90% of all code within months. But even with a 65% increase in AI tool usage across those 400+ companies, median pull request throughput increased by about 8%. Most organizations landed in the 5–15% range.

That’s a real gain, not a failure. But it’s a far cry from the 3x or 10x improvements that vendor marketing implies.

Most engineering teams are measuring AI performance incompletely. They are tracking adoption or output in isolation, while missing the quality risks and cost signals that determine whether AI is truly working. This guide provides a framework for measuring all three dimensions together.

Why traditional engineering metrics don’t capture AI performance

Measuring AI development productivity requires a different strategy than tracking traditional engineering performance metrics. AI coding assistants are used in a fragmented way. Developers might write code with GitHub Copilot in their IDE, brainstorm using ChatGPT, and use Claude for documentation, all within the same hour.

DX’s longitudinal analysis identified five structural reasons why AI productivity gains are lower than expected:

  1. Coding represents only ~14% of a developer’s day, so accelerating it yields modest overall gains
  2. AI speeds up code generation, but creates new bottlenecks in review and integration
  3. Social friction between pro- and anti-AI engineers slows team-level adoption
  4. Skill and tooling gaps mean most developers are still early on the learning curve
  5. AI tools lack the institutional context needed for complex, real-world engineering problems
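
The first reason in the list above can be made concrete with a rough Amdahl's-law style calculation. This is a minimal sketch, not DX's methodology: the function name and the speedup factors are hypothetical, and it assumes the rest of the workday is unchanged when coding gets faster.

```python
def overall_speedup(coding_share: float, coding_speedup: float) -> float:
    """Whole-workday speedup when only the coding share of the day gets faster."""
    if not 0.0 <= coding_share <= 1.0:
        raise ValueError("coding_share must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    # Time left after the speedup, as a fraction of the original day.
    remaining = (1.0 - coding_share) + coding_share / coding_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    CODING_SHARE = 0.14  # ~14% of the day, taken from the list above
    for factor in (1.5, 2.0, 10.0):  # hypothetical speedups on coding alone
        gain = overall_speedup(CODING_SHARE, factor) - 1.0
        print(f"coding {factor}x faster -> overall gain {gain:.1%}")
```

Under these assumptions, doubling coding speed lifts overall capacity by about 7.5%, and even an unbounded coding speedup tops out near 16%.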

What makes this measurement problem unusual is the disconnect between what developers report and what shows up in the data. Developers report saving nearly 4 hours a week. But those savings aren’t showing up proportionally in throughput.

The gap between perceived productivity and measured output is itself a signal. It suggests developers are reinvesting AI-generated time into work that doesn’t produce PRs: deeper thinking, better architecture, harder problems. A measurement approach that only counts output will systematically undervalue what AI is doing for your teams.

AI performance metrics framework

DX developed the AI Measurement Framework in collaboration with researchers, leading AI vendors, and customers. It is research-backed, vendor-agnostic, and designed to track AI’s impact across three dimensions: utilization, impact, and cost. Each layer tells a different part of the story.

Layer 1: Utilization metrics — are developers using the tools?

Adoption is the foundation. Before measuring impact, you need to know whether developers are using the tools and how consistently. With industry-wide adoption now at 93%, the question has shifted from whether developers have access to how effectively they’re using what they have.

| Metric | What to measure | Measurement method | Target benchmark | Warning signs |
| --- | --- | --- | --- | --- |
| Monthly active users (MAU) | % of developers using AI tools monthly | Tool analytics dashboards | 93%+ (industry-wide) | <75% adoption |
| Weekly active users (WAU) | % of developers using AI tools weekly | Tool analytics dashboards | 74%+ (heavy + moderate users) | <50% adoption |
| Daily active users (DAU) | % of developers using AI tools daily | Integration analytics or surveys | 44%+ (heavy users) | <25% daily usage |
| % of PRs that are AI-assisted | Share of merged PRs with AI involvement | Git analytics + AI tool telemetry | Trending upward quarter over quarter | Flat or declining |
| % of committed code that is AI-generated | Share of merged code authored by AI | Developer surveys + telemetry | 27%+ (Q1 2026 benchmark) | <15% |
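
One way to derive the active-user percentages from raw tool telemetry is sketched below. It assumes a trailing-window definition (active today, in the last 7 days, in the last 30 days), which is only one possible reading of "daily", "weekly", and "monthly"; the event format, function name, and sample data are all hypothetical.

```python
from collections.abc import Iterable
from datetime import date, timedelta


def active_share(
    events: Iterable[tuple[str, date]],
    headcount: int,
    as_of: date,
    window_days: int,
) -> float:
    """Share of all developers with at least one AI usage event in the trailing window."""
    if headcount <= 0:
        raise ValueError("headcount must be positive")
    window_start = as_of - timedelta(days=window_days - 1)
    active = {dev for dev, day in events if window_start <= day <= as_of}
    return len(active) / headcount


if __name__ == "__main__":
    # Hypothetical usage log: (developer id, day with at least one AI interaction).
    events = [
        ("ana", date(2026, 3, 31)),
        ("ana", date(2026, 3, 30)),
        ("ben", date(2026, 3, 27)),
        ("cy", date(2026, 3, 5)),
    ]
    as_of = date(2026, 3, 31)
    headcount = 4  # includes one developer with no usage at all
    for label, window in (("DAU", 1), ("WAU", 7), ("MAU", 30)):
        print(f"{label}: {active_share(events, headcount, as_of, window):.0%}")
```

Dividing by total headcount rather than by licensed seats keeps the percentage comparable to the adoption benchmarks in the table.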

Layer 2: AI impact metrics — is performance improving?

Once adoption is established, measure the direct effect on how developers work and what they deliver.

| Metric | What to measure | Measurement method | Target benchmark | Warning signs |
| --- | --- | --- | --- | --- |
| AI-driven time savings | Hours saved weekly via AI assistance | Periodic pulse survey | 3.9 hrs average; 4.72 hrs for daily users | <2 hours reported savings |
| PR throughput | PRs completed per developer/week | Git analytics by AI usage cohort | Daily users: 2.4 PRs/week; 10–15% YoY gain | No measurable change |
| Developer experience score (DXI) | Team satisfaction, confidence, and flow | Quarterly developer experience survey | Maintain or improve baseline | >10% score decrease |
| Code maintainability | How easy it is to understand and modify code | Developer surveys | Stable or improving | Declining quarter over quarter |
| Change confidence | Developer confidence a change won't break production | Developer surveys | Stable or improving | Declining quarter over quarter |
| Change failure rate | % of changes causing degraded performance | CI/CD pipeline monitoring | Stable or improving | >2 percentage point increase |
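
The numeric warning signs in the table can be checked mechanically each quarter. The sketch below is illustrative only: the dictionary keys and function name are hypothetical, the thresholds are copied from the table, and the change failure rate is assumed to be expressed in percent.

```python
def impact_warnings(baseline: dict, current: dict) -> list[str]:
    """Return the Layer 2 warning signs triggered by the current quarter's numbers."""
    warnings = []
    if current["hours_saved_weekly"] < 2.0:
        warnings.append("reported time savings below 2 hours per week")
    # A drop of more than 10% relative to the baseline DXI score.
    if current["dxi"] < baseline["dxi"] * 0.90:
        warnings.append("DXI score down more than 10%")
    # Change failure rate is in percent, so the difference is in percentage points.
    if current["cfr_percent"] - baseline["cfr_percent"] > 2.0:
        warnings.append("change failure rate up more than 2 percentage points")
    return warnings


if __name__ == "__main__":
    # Hypothetical quarter-over-quarter values.
    baseline = {"hours_saved_weekly": 3.6, "dxi": 70.0, "cfr_percent": 4.0}
    current = {"hours_saved_weekly": 3.9, "dxi": 62.0, "cfr_percent": 6.5}
    for warning in impact_warnings(baseline, current):
        print("WARNING:", warning)
```

The survey-based rows (code maintainability, change confidence) are left out because the table defines their warning sign only as a declining trend, not a numeric threshold.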

Layer 3: Cost metrics — is AI spend delivering measurable ROI?

The final layer connects utilization and impact to investment. As AI adoption scales, cost tracking becomes as important as productivity tracking.

| Metric | What to measure | Measurement method | Target benchmark | Warning signs |
| --- | --- | --- | --- | --- |
| AI spend per developer | Total AI tooling cost divided by developer headcount | Finance + procurement data | Trending down as adoption matures | Rising without corresponding impact |
| Net time gain per developer | Time savings minus AI spend (converted to hours) | Survey data + spend data | Positive and growing | Negative or flat |
| Agent hourly rate (for agentic tools) | Human-equivalent hours delivered per dollar of AI spend | Agentic tool telemetry + spend | Higher than human hourly rate | Below human hourly equivalent |
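
The first two cost metrics reduce to simple arithmetic. The sketch below assumes AI spend is converted to hours by dividing it by a loaded hourly cost per developer; that conversion, the function names, and the dollar figures are assumptions for illustration, not figures from DX.

```python
WEEKS_PER_MONTH = 52 / 12  # average number of weeks in a month


def spend_per_developer(total_monthly_spend: float, headcount: int) -> float:
    """Total AI tooling cost divided by developer headcount."""
    if headcount <= 0:
        raise ValueError("headcount must be positive")
    return total_monthly_spend / headcount


def net_hours_gained_per_month(
    hours_saved_weekly: float,
    monthly_spend_per_dev: float,
    loaded_hourly_cost: float,
) -> float:
    """Monthly time savings minus AI spend expressed in developer hours."""
    if loaded_hourly_cost <= 0:
        raise ValueError("loaded_hourly_cost must be positive")
    hours_saved = hours_saved_weekly * WEEKS_PER_MONTH
    spend_as_hours = monthly_spend_per_dev / loaded_hourly_cost
    return hours_saved - spend_as_hours


if __name__ == "__main__":
    # Hypothetical inputs: $30,000/month total spend, 500 developers, $100/hour loaded cost.
    per_dev = spend_per_developer(30_000, 500)
    net = net_hours_gained_per_month(3.9, per_dev, 100.0)
    print(f"spend per developer: ${per_dev:.2f}/month")
    print(f"net time gain: {net:.1f} hours/month")
```

Because the time savings are self-reported, the result is best read as a trend to watch over quarters rather than a precise ROI figure.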

AI performance benchmarks: What the data shows

Usage frequency is the strongest AI performance indicator

Data across 400+ companies shows a consistent correlation between AI usage frequency and pull request throughput. In Q1 2026, developer-reported throughput by cohort was:

  • Daily AI users: 2.4 PRs/week (up from 2.3 the previous quarter)
  • Weekly users: 1.9 PRs/week
  • Monthly users: 1.6 PRs/week
  • Non-users: 1.5 PRs/week

The throughput gap between heavy and light users has remained steady quarter over quarter.
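
To see how the cohort figures above relate to the "60% more PRs" headline, the relative uplift over non-users can be computed directly. This is a small sketch using the Q1 2026 numbers quoted above; the function name is hypothetical.

```python
def uplift_vs_baseline(throughput: dict[str, float], baseline_key: str) -> dict[str, float]:
    """Relative throughput difference of each cohort against the baseline cohort."""
    baseline = throughput[baseline_key]
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return {cohort: value / baseline - 1.0 for cohort, value in throughput.items()}


if __name__ == "__main__":
    # PRs per developer per week by AI usage cohort, from the list above.
    prs_per_week = {"daily": 2.4, "weekly": 1.9, "monthly": 1.6, "non-user": 1.5}
    for cohort, uplift in uplift_vs_baseline(prs_per_week, "non-user").items():
        print(f"{cohort:>8}: {uplift:+.0%} vs non-users")
```

This is a cross-sectional comparison, so it describes a correlation between usage frequency and throughput, not a causal effect.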

Throughput alone doesn’t tell the full story. Our analysis found that even as AI usage increased by 65%, median PR throughput grew by about 8%. Most organizations were in the 5–15% range.

These gains are meaningful. An organization with 500 engineers seeing a 10% improvement is getting the output equivalent of 50 additional engineers without the headcount cost. The mistake is measuring that against an expectation of 2–3x.

Quality metrics: The AI performance indicator most teams overlook

Throughput is only half the picture. When DX looked at quality alongside output, the results were volatile in both directions: some organizations saw meaningful quality improvements as AI usage increased, while others saw it degrade significantly.

The clearest signal is Change Failure Rate. The industry benchmark sits at 4%. Some companies in DX’s data have seen their rate rise by almost 2 percentage points since adopting AI tools. Against a 4% baseline, that means they are now shipping roughly 50% more defects than before.
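
The difference between a percentage-point change and a relative change is easy to blur, so a small worked example may help. The deployment counts below are hypothetical, chosen to match the 4% benchmark and a 2-point rise; the function names are illustrative.

```python
def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Fraction of changes that caused a failure."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return failed_changes / total_changes


def compare_rates(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point change, relative change) between two rates."""
    if before <= 0:
        raise ValueError("before must be positive")
    point_change = (after - before) * 100
    relative_change = (after - before) / before
    return point_change, relative_change


if __name__ == "__main__":
    before = change_failure_rate(40, 1000)  # hypothetical: 4% before AI adoption
    after = change_failure_rate(60, 1000)   # hypothetical: 6% after
    points, relative = compare_rates(before, after)
    print(f"{points:.1f} percentage points, {relative:.0%} more failed changes")
```

Reporting both numbers side by side avoids a 2-point rise being read as a 2% increase.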

The likely causes are code hygiene practices, the presence or absence of formal AI training, and the complexity of the codebase. But the underlying mechanism is consistent: AI accelerates code production faster than teams can absorb the review and validation burden that comes with it.

Same-engineer analysis: the gold standard for measuring AI impact

The most rigorous approach to measuring AI ROI is tracking performance over time within the same engineers, rather than comparing across teams or roles. Measuring engineers’ productivity against their own pre-AI baseline eliminates confounding variables like tenure, seasonality, and team composition.

Using this methodology, one major financial services company found that engineers using AI tools showed a 30% increase in pull request throughput year-over-year, compared to just 5% among non-adopters. This same-engineer approach is the clearest way to isolate how AI tools directly affect developer productivity.
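
A same-engineer comparison can be sketched as a paired before/after calculation per person, summarized by adoption group. The record layout, function names, and PR counts below are hypothetical, and the sketch assumes both periods cover the same calendar window in consecutive years to limit seasonality effects.

```python
from statistics import median


def pct_change(before: float, after: float) -> float:
    """Relative change from the engineer's own baseline."""
    return (after - before) / before


def same_engineer_change(records: list[dict]) -> dict[str, float]:
    """Median within-engineer throughput change for AI adopters and non-adopters."""
    changes: dict[str, list[float]] = {"adopters": [], "non_adopters": []}
    for record in records:
        if record["prs_before"] <= 0:
            continue  # no usable baseline for this engineer
        group = "adopters" if record["uses_ai"] else "non_adopters"
        changes[group].append(pct_change(record["prs_before"], record["prs_after"]))
    return {group: median(values) for group, values in changes.items() if values}


if __name__ == "__main__":
    # Hypothetical PR counts for the same quarter in two consecutive years.
    records = [
        {"engineer": "a", "uses_ai": True, "prs_before": 20, "prs_after": 26},
        {"engineer": "b", "uses_ai": True, "prs_before": 10, "prs_after": 13},
        {"engineer": "c", "uses_ai": True, "prs_before": 30, "prs_after": 36},
        {"engineer": "d", "uses_ai": False, "prs_before": 20, "prs_after": 21},
        {"engineer": "e", "uses_ai": False, "prs_before": 40, "prs_after": 42},
        {"engineer": "f", "uses_ai": False, "prs_before": 10, "prs_after": 10},
    ]
    for group, change in same_engineer_change(records).items():
        print(f"{group}: median change {change:+.0%}")
```

The median is used here so that one unusually productive or unproductive engineer does not dominate the group result; a mean or a formal paired test would be reasonable alternatives.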

How to measure AI performance: a six-month implementation roadmap

Months 1–2: Set your baseline

Start by understanding where your teams are today:

Use this data to identify the teams and workflows that stand to benefit most from AI tool adoption.

Months 3–4: Roll out and start tracking

Begin introducing AI tools in a controlled, measurable way:

  • Launch with a few pilot teams or opt-in users
  • Track weekly adoption at the team and individual levels
  • Run short pulse surveys to capture time savings and early feedback
  • Share adoption progress in company-wide meetings to build momentum

Months 5–6: Connect usage to outcomes

Link tool usage to real productivity outcomes:

  • Compare AI usage data against your baseline metrics
  • Group users into cohorts: heavy, frequent, occasional, and non-users, then compare results
  • Identify where AI is driving the most value and document what high-performing users are doing
  • Use these insights to improve onboarding and training programs
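
The cohort step in the list above might look like the following. The thresholds that separate heavy, frequent, occasional, and non-users are assumptions made for this sketch (the article does not define them), as are the field names and sample data.

```python
from collections import defaultdict
from statistics import mean


def cohort_for(active_days_per_week: float) -> str:
    """Assign a usage cohort. Thresholds are illustrative, not a DX definition."""
    if active_days_per_week >= 4:
        return "heavy"
    if active_days_per_week >= 1:
        return "frequent"
    if active_days_per_week > 0:
        return "occasional"
    return "non-user"


def throughput_by_cohort(developers: list[dict]) -> dict[str, float]:
    """Average PRs per week for each usage cohort."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for dev in developers:
        grouped[cohort_for(dev["active_days_per_week"])].append(dev["prs_per_week"])
    return {cohort: mean(values) for cohort, values in grouped.items()}


if __name__ == "__main__":
    # Hypothetical per-developer averages over the measurement period.
    developers = [
        {"active_days_per_week": 5.0, "prs_per_week": 2.5},
        {"active_days_per_week": 4.5, "prs_per_week": 2.3},
        {"active_days_per_week": 2.0, "prs_per_week": 1.9},
        {"active_days_per_week": 0.5, "prs_per_week": 1.6},
        {"active_days_per_week": 0.0, "prs_per_week": 1.5},
    ]
    for cohort, prs in throughput_by_cohort(developers).items():
        print(f"{cohort:>10}: {prs:.1f} PRs/week")
```

Whatever thresholds you choose, keep them fixed between quarters so cohort results stay comparable over time.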

Ongoing: Optimize and expand

  • Share monthly usage and impact reports with engineering leadership
  • Run quarterly deep dives that combine metrics with engineer interviews
  • Adapt your measurement strategy as new tools emerge and usage patterns mature

Where AI drives the most measurable value for engineering teams

Research consistently shows that coding represents only a small part of a developer’s typical day. Even cutting that time in half wouldn’t move the overall needle significantly. This is why leading engineering organizations measure AI impact across the full SDLC, not just at the code generation step.

Developers report that the highest-value use cases are often the least discussed:

  • Stack trace analysis and debugging: AI interprets complex error messages and suggests fixes quickly. Developers consistently rank this use case above code generation for time saved.
  • Developer onboarding: Time to 10th PR has dropped from 86 days in Q1 2024 to 33 days in Q4 2025, correlating directly with rising AI adoption. New hires who start with AI tools ramp faster and stay ahead of peers who don’t.
  • Code refactoring and cleanup: AI suggestions for improving code quality and maintainability reduce the burden of technical debt review.
  • Test generation and documentation: AI reduces time spent on repetitive but necessary tasks that often get deprioritized.
  • Learning new frameworks or languages: AI assistance accelerates cross-training and upskilling without requiring dedicated learning time.

It’s also worth naming what AI does not yet solve. Meeting-heavy days and frequent interruptions remain the biggest drags on developer productivity. In annualized analyses of developer time, each costs more than AI saves. AI is a local optimizer. Fixing the human and systemic processes surrounding the code requires a different set of interventions.

Five pitfalls that undermine AI performance measurement

Pitfall 1: Benchmarking AI metrics against vendor claims instead of industry data

Vendor marketing sets expectations at 3x or 10x developer productivity improvements. When leaders see more modest results, a 10% increase in throughput or a few hours of weekly time savings, they assume something is wrong.

The data says otherwise. A 5–15% throughput gain is where most organizations land, and that is a meaningful return. The mistake is calibrating expectations against hype rather than against validated signals from peer companies.

Pitfall 2: Tracking throughput KPIs without tracking quality

Faster shipping only creates value if quality holds. DX’s Q1 2026 data shows Change Failure Rate swinging significantly across companies. Some are improving. Others are seeing up to 50% more defects than before AI adoption.

Change Failure Rate, Change Confidence, and Code Maintainability should be core software quality metrics in any AI measurement program, not afterthoughts. Engineering leaders who track software development KPIs only on the output side are missing half the picture.

Pitfall 3: Evaluating AI performance metrics before adoption has matured

Measuring AI tool effectiveness before developers have learned to use them strategically produces misleading results. Allow for a 3–6 month learning curve before drawing conclusions.

Early measurements should focus on adoption trends, not productivity outcomes. The most significant gains come when developers move from no AI use to regular, periodic use. That transition is where the measurable lift is largest.

Pitfall 4: Ignoring shadow AI in your generative AI monitoring

A significant share of AI-driven productivity is happening outside of enterprise-licensed tools. Developers using personal licenses or preferred tools outside of code authoring still report meaningful time savings and AI-authored code.

A measurement program that only tracks enterprise tool telemetry will undercount actual usage and misattribute the gains it does see. Acceptable use policies are critical for security and governance. Complete monitoring also requires accounting for all AI activity, not just the sanctioned stack.
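
One simple way to estimate the telemetry gap is to compare the developers who report AI use in surveys with those who appear in enterprise tool telemetry. The sketch below assumes both sources can be keyed by the same developer identifier; the function name and sample identifiers are hypothetical.

```python
def shadow_ai_report(survey_users: set[str], telemetry_users: set[str]) -> dict:
    """Developers reporting AI use without telemetry, and the share telemetry misses."""
    shadow = survey_users - telemetry_users
    all_users = survey_users | telemetry_users
    undercount = len(shadow) / len(all_users) if all_users else 0.0
    return {"shadow_users": sorted(shadow), "telemetry_undercount": undercount}


if __name__ == "__main__":
    # Hypothetical data: survey respondents who report using AI vs. users seen in telemetry.
    survey_users = {"ana", "ben", "cy", "dee"}
    telemetry_users = {"ana", "ben", "eli"}
    report = shadow_ai_report(survey_users, telemetry_users)
    print("no telemetry but reports AI use:", report["shadow_users"])
    print(f"share of AI users missed by telemetry: {report['telemetry_undercount']:.0%}")
```

In practice, survey responses are often anonymous, in which case the same comparison can be run on team-level counts instead of individual identifiers.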

Pitfall 5: Confusing activity metrics with AI performance indicators

More PRs do not automatically mean more business value. DX’s longitudinal research identifies “false velocity” as a real risk: teams may produce more activity without delivering proportionally more outcomes.

Leaders should ensure that throughput increases correspond to real progress on business goals, not just increased coding activity.

Turning AI metrics into engineering decisions

Metrics are only useful if they change how you operate. The data surfaces four consistent patterns worth knowing how to read.

Low adoption rates

Low adoption rates signal cultural resistance, insufficient training, or friction in tool integrations, not a problem with the technology itself.

DX’s data shows that structured enablement is the key differentiator. Organizations that invest in training, rollout programs, and governance see significantly better results than those that distribute licenses and expect developers to self-serve. A 25% increase in structured enablement correlates with meaningful improvements across code maintainability, change confidence, speed, and engagement.

High time savings but flat throughput

High time savings but flat throughput can mean developers are reinvesting their efficiency into higher-quality work: better architecture, deeper learning, harder problems. That is a positive signal. It can also mean the bottleneck has shifted.

If AI is accelerating code generation but code review and integration haven’t scaled with it, the saved time gets consumed downstream.

Slipping quality metrics

Slipping quality metrics indicate it’s time to revisit training and code review standards, with specific attention to AI-generated code. For teams where Change Failure Rate is rising, the focus should shift from velocity to validation. Investing in AI-driven automated testing is no longer optional.

Power users emerging

Power users emerging is an opportunity. The workflows those developers have discovered can be documented and shared to lift the rest of the organization.

DX’s data shows that junior engineers who use AI daily are now matching staff+ engineers in weekly time savings. Consistent, high-frequency use is a learnable skill, and it belongs in any developer productivity metrics program.

Measuring AI performance: The bottom line for engineering leaders

The question is no longer whether AI coding assistants deliver value. The data shows they do. Developers are saving nearly 4 hours per week, daily users are shipping 60% more PRs than non-users, and developer onboarding time, measured as time to 10th PR, has been cut by more than half since Q1 2024.

The question is how to measure that value accurately enough to separate signal from noise, catch quality risks before they compound, and make the case for continued investment with confidence.

The organizations getting the most from AI are not necessarily using the most advanced models. They are the ones measuring utilization, impact, and cost together, adapting quickly as the tools evolve, and investing in the human and systemic conditions that allow AI to be effective.

Frequently asked questions

What are the best metrics for measuring AI performance on a software team?

The most useful AI performance metrics for engineering leaders fall into three categories: utilization (DAUs/WAUs, % of PRs that are AI-assisted, % of code that is AI-authored), impact (developer time savings, PR throughput by cohort, change failure rate, code maintainability, change confidence), and cost (AI spend per developer, net time gain per developer).

Tracking only one category produces a misleading picture. Utilization tells you whether AI is being used; impact tells you whether it is working; cost tells you whether the return justifies the investment.

What is a realistic AI productivity gain for an engineering team?

DX’s longitudinal analysis of 400+ companies found a median PR throughput increase of 7.76% over a period where AI usage increased 65%. Most organizations landed in the 5–15% range.

A 10% improvement for a team of 500 engineers is the output equivalent of 50 additional engineers without the headcount cost. The mistake is measuring results against vendor claims of 3–10x gains, rather than against validated peer benchmarks.

Why aren’t AI productivity gains higher, given how much time developers report saving?

Developers report saving an average of 3.9 hours per week with AI tools, but those savings don’t translate proportionally into output.

Research shows coding represents only ~14% of a developer’s day, so accelerating it has a limited ceiling. Time saved writing code is often consumed by increased review burden for AI-generated output, bottlenecks in downstream processes that haven’t scaled with code velocity, and the ongoing learning curve of using AI tools effectively.

Where that time is reinvested remains an open research question.

How do you measure AI performance before you have enough adoption data?

In the first 3–6 months, focus on utilization metrics rather than productivity outcomes. Track weekly and daily active users, adoption by team, and developer-reported time savings through pulse surveys.

Draw conclusions about productivity impact only after developers have had time to build effective workflows with the tools. The most significant measurable gains come when developers move from no AI use to regular, periodic use — and that transition takes time.

What is shadow AI, and why does it matter for measurement?

Shadow AI refers to AI tools developers use outside of their organization’s officially sanctioned stack — personal licenses, preferred tools, or AI assistants used for tasks beyond code authoring. DX’s data shows that developers with no enterprise AI tool telemetry still report meaningful time savings and AI-authored code, confirming shadow AI is widespread.

A measurement program that only tracks enterprise telemetry will undercount actual AI usage and misattribute the productivity gains it does capture. Acceptable use policies and broader monitoring approaches are both necessary.

Last Updated
May 19, 2026