The cost of implementing AI in engineering: why ROI is lower than expected and how to measure it accurately

What the data actually shows — and how to use it to justify your investment

Taylor Bruneaux

Analyst

Boards are approving AI budgets. Engineering teams are shipping with AI tools. And in most organizations, nobody can say with confidence what it is adding up to.

That is not a technology problem. Our research across 400+ companies found that AI tool usage increased 65% over 15 months — and median PR throughput moved 8%. The tools are being used. The measurement infrastructure to capture what they are delivering largely does not exist.

This piece covers why standard productivity metrics fail to capture AI’s value, what the real cost of implementing AI looks like beyond licensing fees, and how engineering leaders can build an ROI framework that holds up internally.

What is AI ROI in engineering?

AI ROI for engineering is the measurable return an organization realizes from AI coding tools and automation, calculated as the business value generated relative to the full cost of building, deploying, running, and maintaining those systems. Accurate measurement requires tracking three dimensions simultaneously: utilization, impact, and cost. Improvements in one often come at the expense of another, and approaches that measure throughput alone routinely produce misleading results.

The cost of implementing AI is also structurally different from traditional software. Traditional software cost is largely front-loaded: you build it, ship it, and the cost curve flattens. AI engineering introduces a permanent operational cost that scales with usage, compounds with model updates, and does not follow the predictable infrastructure curves most organizations use to build business cases.

The expectation gap and why it’s normal

Our longitudinal analysis of engineering velocity ran from November 2024 to February 2026 across a sample from 400+ companies where AI adoption rose sharply. During the study period, AI tool usage increased by an average of 65%. Median PR throughput increased by 8%. We validated findings against TrueThroughput, DX’s AI-weighted PR metric that accounts for relative size and complexity. Both signals showed consistent trends.

The distribution tells a more complete story.

Statistic	PR throughput gain
P10	-3%
P25	2%
P50 (median)	8%
Average	13%
P75	17%
P90	44%

For most organizations, today’s AI coding tools are delivering a 5 to 15% throughput gain. Leaders whose numbers fall in this range are not behind. They are in line with the industry.
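As a rough illustration, the distribution above can be turned into a small lookup that places a measured gain in its benchmark band. This is a minimal sketch: the function name `benchmark_band` and the band labels are hypothetical, and only the cut points come from the table.

```python
from bisect import bisect_right

# Percentile cut points from the table above (PR throughput gain, in %).
BENCHMARK = [(10, -3.0), (25, 2.0), (50, 8.0), (75, 17.0), (90, 44.0)]


def benchmark_band(gain_pct: float) -> str:
    """Return the percentile band that a measured throughput gain falls into."""
    cuts = [value for _, value in BENCHMARK]
    idx = bisect_right(cuts, gain_pct)  # number of cut points at or below the gain
    if idx == 0:
        return "below P10"
    if idx == len(BENCHMARK):
        return "at or above P90"
    lower, upper = BENCHMARK[idx - 1][0], BENCHMARK[idx][0]
    return f"P{lower} to P{upper}"


print(benchmark_band(5.0))   # P25 to P50
print(benchmark_band(8.0))   # P50 to P75
```

A gain exactly on a cut point is assigned to the band above it, which is a design choice of this sketch rather than anything the research prescribes.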

The mistake is measuring against an expectation set by vendor marketing rather than validated signals.

A 10% throughput improvement across 500 engineers is the equivalent output of 50 additional engineers without the headcount cost. That is a real, defensible return. It requires accurate measurement and credible communication to hold up internally.

Why the cost of implementing AI is higher than most models assume

The most common ROI modeling error is building a cost estimate around the proof of concept and scaling it linearly. The true cost of implementing AI at production scale includes categories that rarely appear in early business cases.

Cost category	What gets missed
Inference at scale	Most organizations default to the largest model without evaluating whether a smaller one would do. Production costs are substantially higher than the build estimate.
Integration and organizational change	Security review, compliance validation, tooling integration, workflow redesign, and adoption work are routinely absent from business cases.
Model update costs	Each foundational model update can require rework of prompt logic, output validation, and downstream processes. This is an ongoing cost line, not a one-time expense.

The total cost of ownership for AI coding tools must include build cost, inference at projected scale, infrastructure, integration and services, and model update and maintenance costs over the solution lifecycle. Projections that omit any of these categories understate cost and overstate ROI. The omissions tend to be the costs that grow largest over time.
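One way to keep every category visible is to model total cost of ownership as an explicit sum over categories and years. The sketch below uses the cost categories this article names. The class, the function, and every dollar figure are hypothetical placeholders, not benchmarks.

```python
from dataclasses import dataclass


@dataclass
class AnnualCost:
    """One year of AI tooling cost, split by category (all figures hypothetical)."""
    build: float
    inference: float
    infrastructure: float
    integration_and_services: float
    model_updates: float

    def total(self) -> float:
        return (self.build + self.inference + self.infrastructure
                + self.integration_and_services + self.model_updates)


def lifecycle_tco(years: list[AnnualCost]) -> float:
    """Sum every category across every year of the solution lifecycle."""
    return sum(year.total() for year in years)


# Placeholder figures for a three-year lifecycle, chosen only for illustration.
plan = [
    AnnualCost(build=300_000, inference=100_000, infrastructure=50_000,
               integration_and_services=200_000, model_updates=50_000),
    AnnualCost(build=0, inference=250_000, infrastructure=50_000,
               integration_and_services=50_000, model_updates=100_000),
    AnnualCost(build=0, inference=400_000, infrastructure=50_000,
               integration_and_services=50_000, model_updates=100_000),
]

full_tco = lifecycle_tco(plan)
# A proof-of-concept view that counts only build and first-year inference.
poc_only = plan[0].build + plan[0].inference
print(full_tco, poc_only)  # 1750000 400000
```

With these made-up inputs, the proof-of-concept view captures less than a quarter of the three-year total, which is the kind of gap the paragraph above warns about.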

Why standard productivity metrics don’t capture AI’s value

Even with accurate cost accounting, most measurement approaches fall short on the benefit side. The problem is not that AI is underperforming. It is that the metrics most organizations use cannot see where AI is and is not creating value.

The metrics organizations reach for first (PR throughput, story points, and velocity) capture only one slice of where AI does and does not create value. Our research on measuring AI’s impact on developer productivity shows five factors consistently limit measurable gains.

1. Coding isn’t the main bottleneck. Microsoft’s research puts coding at approximately 14% of a developer’s typical day. Even cutting that time in half would free up only about 7% of the day, which is not enough to meaningfully move overall throughput. As one developer in our sample put it: “A four-day task might take three. But that doesn’t mean I’m shipping 3x more PRs.”

2. Speeding up one stage creates bottlenecks in others. AI has accelerated code generation, but code review and integration remain largely unassisted. Time saved writing code often shifts to extended reviews, fact checking, or issue remediation. The net productivity gain can be zero.

3. Social friction slows adoption. Pro/anti-AI polarization, unclear norms, and the absence of peer champions keep teams from developing shared workflows. An isolated solo adopter cannot realize meaningful gains, because software development is a team sport.

4. Skill and tooling gaps compound each other. Using AI effectively is its own discipline. Developers early on the learning curve extract less value, and immature tooling steepens that curve further. Developers with 1,000+ hours of agentic coding experience report they still have significant learning ahead. Most developers have a fraction of that.

5. AI tools lack institutional context. AI performs well on self-contained, well-documented problems. Most real engineering work is neither. An AI assistant cannot reason over a Slack thread from an archived channel or the mental model of the engineer who built the system.

These five factors explain why self-reported developer time savings are not showing up proportionally in output. They also point directly to where investment should go next.
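The first factor can be made concrete with Amdahl's law, which bounds the overall speedup when only one part of a process gets faster. Applying it here is our framing for illustration, not a method the research describes. It assumes coding is the only accelerated activity and that output scales with time saved.

```python
def overall_speedup(coding_share: float, coding_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the coding share gets faster."""
    return 1.0 / ((1.0 - coding_share) + coding_share / coding_speedup)


# Coding at 14% of the day, made twice as fast.
halved = overall_speedup(0.14, 2.0)
print(f"{(halved - 1) * 100:.1f}%")   # 7.5%

# Upper bound: coding takes no time at all.
ceiling = overall_speedup(0.14, float("inf"))
print(f"{(ceiling - 1) * 100:.1f}%")  # 16.3%
```

Under these assumptions, even instantaneous coding caps the overall gain at roughly 16%, so larger gains have to come from the rest of the development lifecycle.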

What actually drives AI ROI

Our research points to three conditions that separate organizations seeing durable AI gains from those treading water.

1. Build foundational readiness first

AI tools are amplifiers. They take what developers are already doing and scale it.

A well-documented, well-structured codebase with fast feedback loops gives developers more to work with. Poor documentation and high friction increase the remediation burden, because developers in those conditions have less context to guide AI output effectively.

Assess AI readiness at the service and team level before scaling AI use, across four domains.

Domain	What to assess	Why it matters for AI
Validation maturity	Code coverage, automated linting, recent commits, dependency health, code complexity	AI tools perform best on well-maintained codebases. Stale repositories produce lower-quality output and higher remediation burden.
Documentation and context	AGENTS.md, README, architecture decision records, API schema, onboarding guide	Documentation provides the context layer AI tools need to produce relevant output. For agent adoption, this extends to machine-readable context files.
CI/CD feedback loops	Pipeline health, build time, test flakiness rate	Slow builds and flaky tests consume the time AI saves. Fast, reliable feedback loops are the primary guardrail against AI-generated defects reaching production.
Standards	Vulnerability SLAs, security scorecard, dependency policy, compliance tier	AI-generated code introduces a larger attack surface. Without automated enforcement, increased velocity can accelerate the introduction of vulnerabilities.

2. Identify where AI creates value

Coding represents approximately 14 to 16% of a developer’s time. The highest-leverage AI opportunities are likely elsewhere in the development lifecycle. Identifying them requires data on where developer-reported friction is highest — which is exactly what DX AI Strategic Planning is designed to surface.

The starting point is understanding where engineers experience the most friction. Google’s developer productivity research team describes using periodic developer experience surveys to identify the top hindrances, then directing AI investment toward those specific pain points. This data-driven approach produces more durable gains than broad rollout.

This assessment must be ongoing. Accelerating one part of the SDLC can create bottlenecks in others.

When evaluating use cases, three factors determine suitability:

Whether context is readily accessible
Whether the task is tightly scoped with clear outcomes
The business criticality and risk threshold

Areas with high suitability and low current AI penetration are the highest-leverage targets. The untapped opportunities are in planning, orchestration, code review, and operations.
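The prioritization logic above could be sketched as a simple scoring exercise. Everything in this example is an assumption made for illustration: the 0 to 1 scales, the equal weighting of the three suitability factors, the `leverage` formula, and the scores assigned to each use case.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    context_access: float  # 0..1, how readily accessible the needed context is
    scoping: float         # 0..1, how tightly scoped the task is
    risk: float            # 0..1, business criticality and risk (higher = riskier)
    ai_penetration: float  # 0..1, share of this work already AI-assisted

    def suitability(self) -> float:
        # Equal weighting is an assumption; risk counts against suitability.
        return (self.context_access + self.scoping + (1.0 - self.risk)) / 3.0

    def leverage(self) -> float:
        # High suitability and low current penetration score highest.
        return self.suitability() * (1.0 - self.ai_penetration)


def rank(use_cases: list[UseCase]) -> list[UseCase]:
    """Order use cases from highest to lowest leverage."""
    return sorted(use_cases, key=lambda u: u.leverage(), reverse=True)


# Invented scores, for illustration only.
cases = [
    UseCase("code generation", 0.8, 0.8, 0.4, ai_penetration=0.7),
    UseCase("roadmap scoping", 0.2, 0.1, 0.9, ai_penetration=0.0),
    UseCase("code review", 0.8, 0.8, 0.4, ai_penetration=0.1),
]

for case in rank(cases):
    print(f"{case.name}: {case.leverage():.2f}")
```

With these invented inputs, code review ranks first because it is as suitable as code generation but far less saturated, while roadmap scoping ranks last despite having no current AI penetration.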

Some stages of development are not ready for automation. Scoping a quarter’s roadmap or evaluating an architectural trade-off requires judgment today’s models cannot reliably provide. Leaders need to be intentional about what stays human-led.

3. Measure gains and trade-offs together

The DX AI Measurement Framework tracks AI’s impact across three dimensions. They must be measured together, because improvements in one can come at the expense of another.

Dimension	The question it answers	Key signals
Utilization	How much are developers actually using AI tools?	DAUs/WAUs, % of PRs AI-assisted, % of committed code AI-generated, tasks assigned to agents. Tracked via DX Usage Analytics.
Impact	How is AI changing engineering productivity?	PR throughput, perceived rate of delivery, DXI, code maintainability, change confidence, change fail percentage. Tracked via DX Impact Analysis.
Cost	Are our AI spend and ROI optimal?	Total and per-developer AI spend, net time gain per developer, agent hourly rate. Optimized via DX AI Workflow Optimization.

How to calculate AI ROI in engineering

The formula is straightforward. Getting the inputs right is where the discipline is.

ROI = (Total Business Value − Total Cost) / Total Cost

Total cost includes five categories:

Build cost
Inference at projected scale
Infrastructure
Integration and services
Model update and maintenance costs across the solution lifecycle

The cost model built around a proof of concept is not the cost model for production.

Total business value is built from three measurable components: developer time savings traceable to a pre/post baseline, quality improvements with a measurable defect cost, and throughput gains tied to business outcomes rather than activity counts.

What the calculation looks like in practice

Consider a 200-person engineering organization operating near the industry median, with an 8% throughput improvement after a year of AI tool adoption at 75%+ usage.

Pre-AI baseline: 4 merged PRs per engineer per month
Post-AI: roughly 4.3 merged PRs per engineer per month (4 × 1.08 = 4.32)
Effective gain: output equivalent of roughly 16 additional engineers (200 × 0.08)
At $200K fully-loaded cost per engineer: approximately $3.2M in equivalent capacity
All-in AI spend (licensing, integration, training, model updates): $800K annually
Net return: $2.4M
ROI: 300%

That number holds only if the throughput gain is real and sustained, change fail percentage has not risen, and the cost model is complete. Any assumption left unvalidated produces a business case that does not survive board review.
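The worked example can be reproduced in a few lines, which also makes it easy to test how sensitive the result is to each input. The function and variable names are ours. The sketch assumes, as the example does, that a throughput gain converts linearly into equivalent engineering capacity.

```python
def compute_roi(total_value: float, total_cost: float) -> float:
    """ROI = (Total Business Value - Total Cost) / Total Cost."""
    return (total_value - total_cost) / total_cost


headcount = 200
throughput_gain = 0.08            # 8% improvement
loaded_cost = 200_000             # fully loaded cost per engineer, USD
ai_spend = 800_000                # all-in annual AI spend, USD

equivalent_engineers = headcount * throughput_gain    # about 16
capacity_value = equivalent_engineers * loaded_cost   # about $3.2M
net_return = capacity_value - ai_spend                # about $2.4M
roi = compute_roi(capacity_value, ai_spend)           # about 3.0, i.e. 300%

# Throughput gain at which capacity value just covers AI spend.
break_even_gain = ai_spend / (headcount * loaded_cost)  # about 0.02, i.e. 2%

print(f"ROI: {roi:.0%}, break-even gain: {break_even_gain:.0%}")
```

Under these inputs the break-even throughput gain is about 2%, so the case stays positive even if the validated gain turns out well below 8%, provided the cost model is complete.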

The process that makes the calculation defensible

Start with a pre-deployment baseline using actuals, not estimates. Capture PR throughput, perceived rate of delivery, developer satisfaction, change fail percentage, and time allocation across SDLC phases. Without this, post-deployment claims cannot be substantiated.
Define specific outcomes before deployment. “Improve developer productivity” is not measurable. “Increase merged PRs per engineer per month by 8% within six months, without increasing change fail percentage above current baseline” is.
Measure at two checkpoints. At 60 to 90 days, capture early utilization signals and surface adoption friction before it calcifies. At six months, impact becomes attributable. AI systems deliver their most significant value after workflow integration and proficiency ramp. Organizations that force an ROI verdict too early often pull back exactly when the return is beginning to compound.
Aggregate across utilization, impact, and cost. Present with assumptions visible. Stakeholders who can see the methodology can engage with it constructively.
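A specific outcome like the one quoted above can be expressed as an explicit check against the baseline. This is a minimal sketch with hypothetical names (`Snapshot`, `outcome_met`) and invented numbers. A real implementation would also need to account for the measurement window and for noise in both metrics.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metrics captured at one point in time, from actuals rather than estimates."""
    merged_prs_per_engineer: float  # per month
    change_fail_pct: float          # percentage of changes that fail


def outcome_met(baseline: Snapshot, current: Snapshot,
                target_gain: float = 0.08) -> bool:
    """True only if throughput hit the target without change fail percentage rising."""
    gain = current.merged_prs_per_engineer / baseline.merged_prs_per_engineer - 1.0
    quality_held = current.change_fail_pct <= baseline.change_fail_pct
    return gain >= target_gain and quality_held


# Invented figures for illustration.
baseline = Snapshot(merged_prs_per_engineer=4.0, change_fail_pct=5.0)

print(outcome_met(baseline, Snapshot(4.4, 5.0)))  # True: 10% gain, quality held
print(outcome_met(baseline, Snapshot(4.4, 6.5)))  # False: quality regressed
print(outcome_met(baseline, Snapshot(4.2, 5.0)))  # False: gain is only 5%
```

Tying the throughput target and the quality guardrail into one pass/fail condition keeps a throughput gain from being reported as a win while change fail percentage is rising.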

Three risks that accompany AI-driven velocity gains

Increased velocity from AI comes with risks that, left unmonitored, can erode the gains. Leaders should track whether throughput improvements are coming at a cost.

Defective code. AI-generated code can introduce defects that are difficult to catch, particularly in complex production systems. Amazon’s experience in early 2026 is instructive. AI-generated code contributed to outages resulting in approximately 120,000 lost orders in one incident and a 99% drop in North American orders in another. Amazon responded with a 90-day safety reset, mandatory two-person code review, and audits across 335 Tier-1 systems. Change fail percentage and code maintainability are the signals to watch.

Cognitive debt. As AI accelerates code production, teams risk losing shared understanding of their own systems. Dr. Margaret-Anne Storey, co-author of the SPACE and DevEx frameworks, describes this as cognitive debt: the erosion of the collective mental model of what a system does, how it was designed, and how it can be safely changed. It manifests as loss of confidence when making changes, heavier review burden, debugging friction, and slower onboarding. Whether it proves to be a material risk at scale is still an open question, but early signals warrant monitoring.

False velocity. More PRs do not necessarily mean higher business velocity. METR’s research documents a tendency for developers and teams to over-report the perceived benefits of AI tools. Leaders should ensure throughput increases correspond to real progress on business outcomes.

Where leaders should focus next

The gains visible today, a median of 8% with most organizations in the 5 to 15% range, are meaningful but modest. They are also not the ceiling.

The organizations closing the gap share one characteristic: they treat measurement as an engineering requirement, not a reporting exercise. They establish baselines before deployment, define specific outcomes, track utilization alongside impact and cost, and surface tradeoffs before they become problems.

That requires a measurement infrastructure most organizations do not yet have. Building it means three things in practice.

Assess foundational readiness by service and team using DX AI Readiness. Address gaps in documentation, validation maturity, CI/CD reliability, and security standards before scaling AI use.
Find the highest-leverage opportunities using developer-reported friction data and AI suitability evaluation. Direct investment toward areas with high friction and low current AI penetration. Be explicit about what stays human-led.
Measure with the full AI Measurement Framework, tracking utilization, impact, and cost together. Gains should be traceable to baselines. Tradeoffs should be visible before they compound.

The organizations that capture the next wave of AI-driven productivity gains will be the ones that invest in the measurement infrastructure that makes those gains legible — to their teams, their CFOs, and their boards.


FAQ

What is AI ROI in engineering?

AI ROI in engineering is the measurable business value generated from deploying AI tools and automation across the software development lifecycle, relative to the full cost of building, running, and maintaining those systems. Accurate measurement requires tracking utilization, impact, and cost simultaneously. Approaches that measure throughput alone routinely overstate or understate true returns.

What is a realistic expectation for AI productivity gains in engineering teams?

For most organizations, AI coding tools are delivering a 5 to 15% increase in PR throughput. Our longitudinal analysis across 400+ companies found a median gain of 8% during a period when AI tool usage increased by 65%. At the 90th percentile, gains reached 44%, still well below vendor claims of 3 to 10x. Calibrating to these validated signals is the first step toward communicating AI impact credibly.

Why is the cost of implementing AI often underestimated?

The cost of implementing AI goes beyond licensing fees. It includes inference cost at production scale, developer time on learning curves, increased code review burden, defect remediation, integration and organizational change, and ongoing model update costs. The most common error is building a cost model around the proof of concept and scaling it linearly. The costs that grow largest over time are almost never the ones that appear in the original business case.

How do you measure AI productivity gains accurately?

Accurate measurement starts with a pre-deployment baseline using actuals, not estimates. Impact metrics should cover DX Core 4: throughput, perceived rate of delivery, code maintainability, and change confidence, alongside developer satisfaction and change fail percentage. Cost metrics should include per-developer AI spend and net time gain. Tracking only throughput misses quality tradeoffs and produces impact claims that will not withstand scrutiny.

What’s the difference between AI velocity and business velocity?

AI velocity refers to engineering throughput: PRs merged, code generated. Business velocity refers to how much of that activity translates into shipped product value and business outcomes. More PRs do not necessarily mean more value delivered. METR’s research documents a tendency for developers and teams to over-report the perceived benefits of AI tools. Leaders should watch change fail percentage as a leading indicator of whether gains are durable.

How should engineering leaders communicate AI ROI to the CFO?

Ground the case in validated signals: adoption data, before/after throughput trends, quality metrics, and cost-adjusted returns. Show the full cost model, including inference at scale, integration, and model update costs. Calibrate expectations to industry benchmarks. Demonstrate that measurement is ongoing and that tradeoffs are tracked alongside gains. Present assumptions visibly. That rigor builds credibility more effectively than optimistic projections.

How long does it take to see AI ROI in engineering?

Early utilization signals typically appear within 60 to 90 days. Attributable impact on throughput and quality usually requires six months, after workflow integration and proficiency ramp. Organizations that force an ROI verdict at 30 days are measuring adoption friction, not return. Our research surfaces a consistent pattern: teams pull back on AI investment exactly when the return is beginning to compound.

Last Updated

May 21, 2026