The cost of implementing AI in engineering: why ROI is lower than expected and how to measure it accurately
What the data actually shows — and how to use it to justify your investment
Taylor Bruneaux
Analyst
Boards are approving AI budgets. Engineering teams are shipping with AI tools. And in most organizations, nobody can say with confidence what it is adding up to.
That is not a technology problem. Our research across 400+ companies found that AI tool usage increased 65% over 15 months, while median PR throughput rose only 8%. The tools are being used. The measurement infrastructure to capture what they are delivering largely does not exist.
This piece covers why standard productivity metrics fail to capture AI’s value, what the real cost of implementing AI looks like beyond licensing fees, and how engineering leaders can build an ROI framework that holds up internally.
What is AI ROI in engineering?
AI ROI for engineering is the measurable return an organization realizes from AI coding tools and automation, calculated as the business value generated relative to the full cost of building, deploying, running, and maintaining those systems. Accurate measurement requires tracking three dimensions simultaneously: utilization, impact, and cost. Improvements in one often come at the expense of another, and approaches that measure throughput alone routinely produce misleading results.
The cost of implementing AI is also structurally different from traditional software. Traditional software cost is largely front-loaded: you build it, ship it, and the cost curve flattens. AI engineering introduces a permanent operational cost that scales with usage, compounds with model updates, and does not follow the predictable infrastructure curves most organizations use to build business cases.
The expectation gap and why it’s normal
Our longitudinal analysis of engineering velocity ran from November 2024 to February 2026 across a sample from 400+ companies where AI adoption rose sharply. During the study period, AI tool usage increased by an average of 65%. Median PR throughput increased by 8%. We validated findings against TrueThroughput, DX’s AI-weighted PR metric that accounts for relative size and complexity. Both signals showed consistent trends.
The distribution tells a more complete story.
| Percentile | PR throughput gain |
|---|---|
| P10 | -3% |
| P25 | 2% |
| P50 (median) | 8% |
| Average | 13% |
| P75 | 17% |
| P90 | 44% |
For most organizations, today’s AI coding tools are delivering a 5 to 15% throughput gain. Leaders whose numbers fall in this range are not behind. They are in line with the industry.
The mistake is measuring against an expectation set by vendor marketing rather than validated signals.
A 10% throughput improvement across 500 engineers is the equivalent output of 50 additional engineers without the headcount cost. That is a real, defensible return. It requires accurate measurement and credible communication to hold up internally.
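To place an organization's own number against the distribution above, a small lookup is enough. The sketch below is illustrative only: the function name `benchmark_band` is hypothetical, and the thresholds are copied from the percentile table in this section rather than from any DX API.

```python
from bisect import bisect_right

# Percentile benchmarks copied from the distribution table above
# (PR throughput gain, in percent). The "Average" row is left out
# because it is not a percentile.
BENCHMARKS = [(10, -3.0), (25, 2.0), (50, 8.0), (75, 17.0), (90, 44.0)]


def benchmark_band(gain_pct: float) -> str:
    """Return the percentile band an observed throughput gain falls into."""
    thresholds = [value for _, value in BENCHMARKS]
    # Number of benchmark thresholds that are <= the observed gain.
    idx = bisect_right(thresholds, gain_pct)
    if idx == 0:
        return "below P10"
    if idx == len(BENCHMARKS):
        return "at or above P90"
    lower, upper = BENCHMARKS[idx - 1][0], BENCHMARKS[idx][0]
    return f"P{lower} to P{upper}"


if __name__ == "__main__":
    # Hypothetical observed gains, in percent.
    for gain in (-5.0, 5.0, 8.0, 20.0, 50.0):
        print(f"{gain:+.1f}% -> {benchmark_band(gain)}")
```

A gain that lands between P25 and P75 is in line with the sample, which is the point of calibrating against measured signals instead of vendor claims.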
Why the cost of implementing AI is higher than most models assume
The most common ROI modeling error is building a cost estimate around the proof of concept and scaling it linearly. The true cost of implementing AI at production scale includes categories that rarely appear in early business cases.
| Cost category | What gets missed |
|---|---|
| Inference at scale | Most organizations default to the largest model without evaluating whether a smaller one would do. Production costs are substantially higher than the build estimate. |
| Integration and organizational change | Security review, compliance validation, tooling integration, workflow redesign, and adoption work are routinely absent from business cases. |
| Model update costs | Each foundational model update can require rework of prompt logic, output validation, and downstream processes. This is an ongoing cost line, not a one-time expense. |
The total cost of ownership for AI coding tools must include build cost, inference at projected scale, integration and services, and model update and maintenance costs over the solution lifecycle. Projections that omit any of these categories understate cost and overstate ROI. The omissions tend to be the costs that grow largest over time.
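One way to see how much an incomplete projection understates cost is to model each year of the lifecycle with the four categories named above and compare the full total against a total with categories left out. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, and every dollar figure is invented purely for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AnnualCost:
    """One year of cost in USD, using the four categories named above."""
    build: float
    inference: float
    integration_and_services: float
    model_updates_and_maintenance: float


def lifecycle_tco(years: list[AnnualCost], omit: tuple[str, ...] = ()) -> float:
    """Sum every category over every year, skipping any category in `omit`."""
    return sum(
        amount
        for year in years
        for category, amount in asdict(year).items()
        if category not in omit
    )


# Hypothetical three-year plan: build cost is front-loaded, while
# inference grows with usage and model updates recur every year.
plan = [
    AnnualCost(300_000, 100_000, 150_000, 50_000),
    AnnualCost(0, 250_000, 50_000, 100_000),
    AnnualCost(0, 400_000, 50_000, 100_000),
]

full = lifecycle_tco(plan)
partial = lifecycle_tco(plan, omit=("inference", "model_updates_and_maintenance"))
missed_share = 1 - partial / full

if __name__ == "__main__":
    print(f"Full TCO: ${full:,.0f}")
    print(f"TCO with omissions: ${partial:,.0f}")
    print(f"Share of cost missed: {missed_share:.0%}")
```

With these made-up inputs, the omitted categories are the ones that keep growing after year one, which mirrors the pattern described above.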
Why standard productivity metrics don’t capture AI’s value
Even with accurate cost accounting, most measurement approaches fall short on the benefit side. The problem is not that AI is underperforming. It is that the metrics most organizations use cannot see where AI is and is not creating value.
The metrics organizations reach for first, PR throughput, story points, and velocity, only capture one slice of where AI does and does not create value. Our research on measuring AI’s impact on developer productivity shows five factors consistently limit measurable gains.
1. Coding isn’t the main bottleneck. Microsoft’s research puts coding at approximately 14% of a developer’s typical day. Even cutting that time in half would not meaningfully move overall throughput. As one developer in our sample put it: “A four-day task might take three. But that doesn’t mean I’m shipping 3x more PRs.”
2. Speeding up one stage creates bottlenecks in others. AI has accelerated code generation, but code review and integration remain largely unassisted. Time saved writing code often shifts to extended reviews, fact checking, or issue remediation. The net productivity gain can be zero.
3. Social friction slows adoption. Pro/anti-AI polarization, unclear norms, and the absence of peer champions keep teams from developing shared workflows. An isolated solo adopter cannot realize gains in a meaningful way, because software development is a team sport.
4. Skill and tooling gaps compound each other. Using AI effectively is its own discipline. Developers early on the learning curve extract less value, and immature tooling steepens that curve further. Developers with 1,000+ hours of agentic coding experience report they still have significant learning ahead. Most developers have a fraction of that.
5. AI tools lack institutional context. AI performs well on self-contained, well-documented problems. Most real engineering work is neither. An AI assistant cannot reason over a Slack thread from an archived channel or the mental model of the engineer who built the system.
These five factors explain why self-reported developer time savings are not showing up proportionally in output. They also point directly to where investment should go next.
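The first factor can be made concrete with Amdahl's-law arithmetic. The sketch below assumes coding is 14% of a developer's time (the figure cited above), that AI halves that time, and that every other stage is unchanged. The function names are illustrative.

```python
def overall_time_saved(stage_share: float, stage_reduction: float) -> float:
    """Fraction of total time saved when a single stage gets faster.

    stage_share: fraction of total time spent in the stage (0.14 for coding).
    stage_reduction: fraction of that stage's time removed (0.5 for "half").
    """
    return stage_share * stage_reduction


def overall_speedup(stage_share: float, stage_reduction: float) -> float:
    """Amdahl's-law speedup of the whole workflow, all other stages unchanged."""
    return 1.0 / (1.0 - overall_time_saved(stage_share, stage_reduction))


if __name__ == "__main__":
    saved = overall_time_saved(0.14, 0.5)
    speedup = overall_speedup(0.14, 0.5)
    print(f"Total time saved: {saved:.1%}")
    print(f"Overall speedup: {speedup:.3f}x")
```

Under these assumptions, halving coding time frees about 7% of total time, a speedup of roughly 1.075x. Any review or remediation time added downstream (the second factor) subtracts from that figure.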
What actually drives AI ROI
Our research points to three conditions that separate organizations seeing durable AI gains from those treading water.
1. Build foundational readiness first
AI tools are amplifiers. They take what developers are already doing and scale it.
A well-documented, well-structured codebase with fast feedback loops gives developers more to work with. Poor documentation and high friction increase the remediation burden, because developers in those conditions have less context to guide AI output effectively.
Assess AI readiness at the service and team level before scaling AI use, across four domains.
| Domain | What to assess | Why it matters for AI |
|---|---|---|
| Validation maturity | Code coverage, automated linting, recent commits, dependency health, code complexity | AI tools perform best on well-maintained codebases. Stale repositories produce lower-quality output and higher remediation burden. |
| Documentation and context | AGENTS.md, README, architecture decision records, API schema, onboarding guide | Documentation provides the context layer AI tools need to produce relevant output. For agent adoption, this extends to machine-readable context files. |
| CI/CD feedback loops | Pipeline health, build time, test flakiness rate | Slow builds and flaky tests consume the time AI saves. Fast, reliable feedback loops are the primary guardrail against AI-generated defects reaching production. |
| Standards | Vulnerability SLAs, security scorecard, dependency policy, compliance tier | AI-generated code introduces a larger surface area. Without automated enforcement, increased velocity can accelerate the introduction of vulnerabilities. |
2. Identify where AI creates value
Coding represents approximately 14 to 16% of a developer’s time. The highest-leverage AI opportunities are likely elsewhere in the development lifecycle. Identifying them requires data on where developer-reported friction is highest — which is exactly what DX AI Strategic Planning is designed to surface.
The starting point is understanding where engineers experience the most friction. Google’s developer productivity research team describes using periodic developer experience surveys to identify the top hindrances, then directing AI investment toward those specific pain points. This data-driven approach produces more durable gains than broad rollout.
This assessment must be ongoing. Accelerating one part of the SDLC can create bottlenecks in others.
When evaluating use cases, three factors determine suitability:
- Whether context is readily accessible
- Whether the task is tightly scoped with clear outcomes
- The business criticality and risk threshold
Areas with high suitability and low current AI penetration are the highest-leverage targets. The untapped opportunities are in planning, orchestration, code review, and operations.
Some stages of development are not ready for automation. Scoping a quarter’s roadmap or evaluating an architectural trade-off requires judgment today’s models cannot reliably provide. Leaders need to be intentional about what stays human-led.
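The prioritization logic in this section can be sketched as a simple ranking. Everything here is a hypothetical illustration, not the DX methodology: the 0-to-1 scales, the use of `min` to combine the three suitability factors, the multiplicative leverage score, and all example values are assumptions made for the sake of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    context_accessible: float  # 0..1, how readily context is available
    tightly_scoped: float      # 0..1, how clear the task and outcome are
    risk_tolerance: float      # 0..1, higher means lower criticality and risk
    friction: float            # 0..1, developer-reported friction
    ai_penetration: float      # 0..1, how much AI is already used here

    def suitability(self) -> float:
        # Assumption: the weakest of the three factors limits suitability.
        return min(self.context_accessible, self.tightly_scoped, self.risk_tolerance)

    def leverage(self) -> float:
        # High suitability, high friction, and low penetration score highest.
        return self.suitability() * self.friction * (1.0 - self.ai_penetration)


def rank(use_cases: list[UseCase]) -> list[UseCase]:
    """Order use cases from highest to lowest leverage."""
    return sorted(use_cases, key=lambda u: u.leverage(), reverse=True)


# Invented example values, for illustration only.
cases = [
    UseCase("code generation", 0.9, 0.9, 0.8, friction=0.4, ai_penetration=0.8),
    UseCase("code review", 0.8, 0.8, 0.7, friction=0.9, ai_penetration=0.2),
    UseCase("roadmap scoping", 0.3, 0.2, 0.3, friction=0.7, ai_penetration=0.1),
]

if __name__ == "__main__":
    for case in rank(cases):
        print(f"{case.name}: suitability={case.suitability():.2f}, "
              f"leverage={case.leverage():.3f}")
```

A low suitability score on a high-friction stage, as with roadmap scoping in this made-up data, is a signal to keep the work human-led for now.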
3. Measure gains and trade-offs together
The DX AI Measurement Framework tracks AI’s impact across three dimensions. They must be measured together, because improvements in one can come at the expense of another.
| Dimension | The question it answers | Key signals |
|---|---|---|
| Utilization | How much are developers actually using AI tools? | DAUs/WAUs, % of PRs AI-assisted, % of committed code AI-generated, tasks assigned to agents. Tracked via DX Usage Analytics. |
| Impact | How is AI changing engineering productivity? | PR throughput, perceived rate of delivery, DXI, code maintainability, change confidence, change fail percentage. Tracked via DX Impact Analysis. |
| Cost | Is our AI spend and ROI optimal? | Total and per-developer AI spend, net time gain per developer, agent hourly rate. Optimized via DX AI Workflow Optimization. |
How to calculate AI ROI in engineering
The formula is straightforward. Getting the inputs right is where the discipline is.
ROI = (Total Business Value − Total Cost) / Total Cost
Total cost includes five categories:
- Build cost
- Inference at projected scale
- Infrastructure
- Integration and services
- Model update and maintenance costs across the solution lifecycle
The cost model built around a proof of concept is not the cost model for production.
Total business value is measured across the three AI Measurement Framework dimensions: developer time savings traceable to a pre/post baseline, quality improvements with a measurable defect cost, and throughput gains tied to business outcomes rather than activity counts.
What the calculation looks like in practice
Consider a 200-person engineering organization operating near the industry median, with an 8% throughput improvement after a year of AI tool adoption at 75%+ usage.
- Pre-AI baseline: 4 merged PRs per engineer per month
- Post-AI: roughly 4.3 merged PRs per engineer per month (4.32 at an 8% gain)
- Effective gain: output equivalent of roughly 16 additional engineers (8% of 200)
- At $200K fully-loaded cost per engineer: approximately $3.2M in equivalent capacity
- All-in AI spend (licensing, integration, training, model updates): $800K annually
- Net return: $2.4M
- ROI: 300%
That number holds only if the throughput gain is real and sustained, change fail percentage has not risen, and the cost model is complete. Any assumption left unvalidated produces a business case that does not survive board review.
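The arithmetic in the example above can be reproduced, and stress-tested, in a few lines. The sketch below uses the figures from the example (200 engineers, $200K fully-loaded cost, $800K all-in spend) and then reruns the same calculation at the P25 and P75 gains from the distribution table. Holding AI spend constant across those scenarios is a simplifying assumption, and the function names are illustrative.

```python
def equivalent_engineers(headcount: int, throughput_gain: float) -> float:
    """Capacity gained, expressed as a number of additional engineers."""
    return headcount * throughput_gain


def roi(total_business_value: float, total_cost: float) -> float:
    """ROI = (Total Business Value - Total Cost) / Total Cost."""
    return (total_business_value - total_cost) / total_cost


# Figures taken from the worked example above.
HEADCOUNT = 200
LOADED_COST_PER_ENGINEER = 200_000
ALL_IN_AI_SPEND = 800_000


def scenario_roi(throughput_gain: float) -> float:
    """ROI for a given throughput gain, with AI spend held constant."""
    value = equivalent_engineers(HEADCOUNT, throughput_gain) * LOADED_COST_PER_ENGINEER
    return roi(value, ALL_IN_AI_SPEND)


if __name__ == "__main__":
    # P25, median, and P75 gains from the distribution table.
    for label, gain in (("P25", 0.02), ("P50", 0.08), ("P75", 0.17)):
        print(f"{label} gain {gain:.0%}: ROI {scenario_roi(gain):.0%}")
```

At the 8% median this returns the 300% ROI shown above. At the P25 gain of 2% the same spend only breaks even, which is why an unvalidated throughput assumption is the weakest point in the business case.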
The process that makes the calculation defensible
- Start with a pre-deployment baseline using actuals, not estimates. Capture PR throughput, perceived rate of delivery, developer satisfaction, change fail percentage, and time allocation across SDLC phases. Without this, post-deployment claims cannot be substantiated.
- Define specific outcomes before deployment. “Improve developer productivity” is not measurable. “Increase merged PRs per engineer per month by 8% within six months, without increasing change fail percentage above current baseline” is.
- Measure at two checkpoints. At 60 to 90 days, capture early utilization signals and surface adoption friction before it calcifies. At six months, impact becomes attributable. AI systems deliver their most significant value after workflow integration and proficiency ramp. Organizations that force an ROI verdict too early often pull back exactly when the return is beginning to compound.
- Aggregate across utilization, impact, and cost. Present with assumptions visible. Stakeholders who can see the methodology can engage with it constructively.
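A measurable outcome like the one quoted in the second step can be written as an explicit check against the baseline. This is a minimal sketch: the `Snapshot` type, its field names, and the sample values are hypothetical, and a real implementation would read from whatever system holds your actuals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Actuals captured at one point in time."""
    merged_prs_per_engineer_month: float
    change_fail_pct: float


def outcome_met(baseline: Snapshot, current: Snapshot, target_gain: float = 0.08) -> bool:
    """True if throughput rose by at least `target_gain` (0.08 means 8%)
    and change fail percentage did not rise above the baseline."""
    gain = (current.merged_prs_per_engineer_month
            / baseline.merged_prs_per_engineer_month) - 1.0
    throughput_ok = gain >= target_gain
    quality_ok = current.change_fail_pct <= baseline.change_fail_pct
    return throughput_ok and quality_ok


if __name__ == "__main__":
    # Hypothetical pre-deployment baseline and six-month checkpoint.
    baseline = Snapshot(merged_prs_per_engineer_month=4.0, change_fail_pct=12.0)
    six_months = Snapshot(merged_prs_per_engineer_month=4.4, change_fail_pct=11.5)
    print("Outcome met:", outcome_met(baseline, six_months))
```

Pairing the throughput target with a quality guardrail in one check keeps a rising change fail percentage from being reported as a win.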
Three risks that accompany AI-driven velocity gains
Increased velocity from AI comes with risks that, left unmonitored, can erode the gains. Leaders should track whether throughput improvements are coming at a cost.
Defective code. AI-generated code can introduce defects that are difficult to catch, particularly in complex production systems. Amazon’s experience in early 2026 is instructive. AI-generated code contributed to outages resulting in approximately 120,000 lost orders in one incident and a 99% drop in North American orders in another. Amazon responded with a 90-day safety reset, mandatory two-person code review, and audits across 335 Tier-1 systems. Change fail percentage and code maintainability are the signals to watch.
Cognitive debt. As AI accelerates code production, teams risk losing shared understanding of their own systems. Dr. Margaret-Anne Storey, co-author of the SPACE and DevEx frameworks, describes this as cognitive debt: the erosion of the collective mental model of what a system does, how it was designed, and how it can be safely changed. It manifests as loss of confidence when making changes, heavier review burden, debugging friction, and slower onboarding. Whether it proves to be a material risk at scale is still an open question, but early signals warrant monitoring.
False velocity. More PRs do not necessarily mean higher business velocity. METR’s research documents a tendency for developers and teams to over-report the perceived benefits of AI tools. Leaders should ensure throughput increases correspond to real progress on business outcomes.
Where leaders should focus next
The gains visible today, a median of 8% with most organizations in the 5 to 15% range, are meaningful but modest. They are also not the ceiling.
The organizations closing the gap share one characteristic: they treat measurement as an engineering requirement, not a reporting exercise. They establish baselines before deployment, define specific outcomes, track utilization alongside impact and cost, and surface tradeoffs before they become problems.
That requires a measurement infrastructure most organizations do not yet have. Building it means three things in practice.
- Assess foundational readiness by service and team using DX AI Readiness. Address gaps in documentation, validation maturity, CI/CD reliability, and security standards before scaling AI use.
- Find the highest-leverage opportunities using developer-reported friction data and AI suitability evaluation. Direct investment toward areas with high friction and low current AI penetration. Be explicit about what stays human-led.
- Measure with the full AI Measurement Framework, tracking utilization, impact, and cost together. Gains should be traceable to baselines. Tradeoffs should be visible before they compound.
The organizations that capture the next wave of AI-driven productivity gains will be the ones that invest in the measurement infrastructure that makes those gains legible — to their teams, their CFOs, and their boards.
FAQ
What is AI ROI in engineering?
AI ROI in engineering is the measurable business value generated from deploying AI tools and automation across the software development lifecycle, relative to the full cost of building, running, and maintaining those systems. Accurate measurement requires tracking utilization, impact, and cost simultaneously. Approaches that measure throughput alone routinely overstate or understate true returns.
What is a realistic AI productivity gains expectation for engineering teams?
For most organizations, AI coding tools are delivering a 5 to 15% increase in PR throughput. Our longitudinal analysis across 400+ companies found a median gain of 8% during a period when AI tool usage increased by 65%. At the 90th percentile, gains reached 44%, still well below vendor claims of 3 to 10x. Calibrating to these validated signals is the first step toward communicating AI impact credibly.
Why is the cost of implementing AI often underestimated?
The cost of implementing AI goes beyond licensing fees. It includes inference cost at production scale, developer time on learning curves, increased code review burden, defect remediation, integration and organizational change, and ongoing model update costs. The most common error is building a cost model around the proof of concept and scaling it linearly. The costs that grow largest over time are almost never the ones that appear in the original business case.
How do you measure AI productivity gains accurately?
Accurate measurement starts with a pre-deployment baseline using actuals, not estimates. Impact metrics should cover DX Core 4: throughput, perceived rate of delivery, code maintainability, and change confidence, alongside developer satisfaction and change fail percentage. Cost metrics should include per-developer AI spend and net time gain. Tracking only throughput misses quality tradeoffs and produces impact claims that will not withstand scrutiny.
What’s the difference between AI velocity and business velocity?
AI velocity refers to engineering throughput: PRs merged, code generated. Business velocity refers to how much of that activity translates into shipped product value and business outcomes. More PRs do not necessarily mean more value delivered. METR’s research documents a tendency for developers and teams to over-report the perceived benefits of AI tools. Leaders should watch change fail percentage as a leading indicator of whether gains are durable.
How should engineering leaders communicate AI ROI to the CFO?
Ground the case in validated signals: adoption data, before/after throughput trends, quality metrics, and cost-adjusted returns. Show the full cost model, including inference at scale, integration, and model update costs. Calibrate expectations to industry benchmarks. Demonstrate that measurement is ongoing and that tradeoffs are tracked alongside gains. Present assumptions visibly. That rigor builds credibility more effectively than optimistic projections.
How long does it take to see AI ROI in engineering?
Early utilization signals typically appear within 60 to 90 days. Attributable impact on throughput and quality usually requires six months, after workflow integration and proficiency ramp. Organizations that force an ROI verdict at 30 days are measuring adoption friction, not return. Our research surfaces a consistent pattern: teams pull back on AI investment exactly when the return is beginning to compound.