Why most engineering AI adoption stalls

Eighty-four percent of developers now use or plan to use AI coding tools.^[1] Ninety percent of the technology professionals DORA surveyed report using them.^[2] Yet in the same Stack Overflow survey, only three point one percent say they highly trust the output.^[1] Forty-five point seven percent actively distrust it, up from thirty-one percent a year earlier. Positive sentiment has fallen from above seventy percent in 2023–2024 to fifty-nine point seven percent in 2025. The honeymoon is statistically over.

The story isn't that AI adoption is failing. It's that adoption and value have come apart. Throughput goes up; bugs, churn, and review time go up faster. Engineers feel faster. The data says they aren't. The standard advice — more licenses, more training, better dashboards — is the wrong intervention because the problem isn't the tooling. The problem is the operating model the tooling now sits inside.

This essay names what that operating model has to become. Most of the gains are still on the table. They go to teams that change how work flows, gets reviewed, and ships.

The honeymoon

Every adoption story starts the same way. Personal Copilot subscriptions in 2023. A formal rollout in 2024. By mid-2026, numbers nobody had a frame for in 2022.

Stack Overflow's 2025 developer survey — the largest annual census of practitioners, around forty-nine thousand respondents — finds that eighty-four percent now use or plan to use AI tools, up from seventy-six percent in 2024.^[1] Daily use among professional developers sits at fifty-one percent. DORA's 2025 AI-Assisted Software Development Report, surveying roughly five thousand technology professionals globally with more than a hundred hours of qualitative interviews, puts adoption at ninety percent.^[2] Eighty-five percent of those users report at least some productivity gain.

The first thirty days look like a unicorn. Time-to-first-PR for new hires collapses. The internal Q&A bot absorbs the questions that used to consume staff engineers' afternoons. Tickets that previously bounced between three people get resolved by one. The autocomplete catches typos before the typist notices them. Across every team I've talked to in the last eighteen months, the early-adoption presentation slide has the same shape: a hockey stick in tickets closed, in PRs merged, in self-reported satisfaction.

What's harder to see in the slide is that nothing else has changed. The PR template is the same. The review checklist is the same. The definition of done is the same. The on-call runbook still says "page Sarah." Roadmap planning still groups work by epic-and-ticket, with story points sized to a humans-typing-code throughput model. The CI is still tuned to a pre-AI push volume.

This works during the honeymoon because the model is doing the easy work. Autocomplete is a productivity layer on top of how the team already operated. Q&A bots offload knowledge retrieval, not decisions. None of it crosses into the org's load-bearing structures — review, ownership, testing, planning, deployment. The early gains are the AI-equivalent of what an org gets from buying its team better keyboards. Real, measurable, and capped.

The plateau begins when teams start asking the model to do harder things. That's when the operating model that built the honeymoon starts to crack.

The whiplash

For most of 2024 and the first half of 2025, the consensus narrative was that AI engineering tools were on a roughly linear improvement curve and the productivity story would scale with model capability. The data from the second half of 2025 forced a revision.

Faros AI, an engineering-intelligence vendor, published the most consequential study of the year in early 2026: telemetry from twenty-two thousand developers across roughly four thousand teams, two years of IDE / version-control / CI signal, comparing low- and high-AI-adoption windows inside the same organisations.^[3] The headline is in the title they chose for the report: The AI Acceleration Whiplash.

Throughput rises sharply. Epic completion climbs 66.2 percent. Task throughput climbs 33.7 percent. PR merge rate climbs 16.2 percent. Deployments per week, somehow, fall 11.7 percent.

Then comes the other side of the ledger. Bugs per developer rise fifty-four percent. Code churn — the percentage of code rewritten within two weeks of being merged — rises tenfold. Incidents per PR triple. The unit of change inflates: median PR size grows fifty-one percent. Median review time grows fivefold.

This is the operating-model bottleneck made visible. AI accelerates the generation side of the loop. The review side, the test side, the deployment side, the on-call side — all of those scale roughly linearly with the same humans they had before. The system was not designed to absorb a fifty-percent increase in PR size at a 5× review-time multiplier. So it doesn't.

Faros makes a second claim that is harder to swallow than the first. Looking at the data sliced by engineering-maturity tier, they conclude that maturity does not insulate teams. Their words: "engineering maturity is not a shield." This contradicts the framing in the most authoritative survey of the same year, DORA's 2025 report, which characterises AI as an amplifier under which strong teams compound and weak teams degrade.^[2] We'll return to that tension in the next section. Both can be partially right, and the synthesis is more useful than either claim alone.

A second body of evidence is uglier still. The independent research nonprofit METR ran a pre-registered randomised controlled trial in 2025 on sixteen experienced open-source developers, each working in repositories they had spent years inside.^[5] The result: when allowed to use AI tools — Cursor Pro plus Claude 3.5 / 3.7 Sonnet — developers took nineteen percent longer to complete real tasks. The same developers had predicted a twenty-four percent speedup before the trial. After the slowdown they had actually lived, they still reported feeling about twenty percent faster. METR ran a follow-up in February 2026 with a different design, larger N, and new attempts to control for selection bias. The perception/reality gap held, even as the point estimate moved.^[6]

Developers expected AI to speed them up by twenty-four percent. They were nineteen percent slower. They still believed they had been twenty percent faster.

The METR finding has caveats — sixteen developers, mature open-source codebases averaging twenty-two thousand stars and a million lines of code, early-2025 model generation predating the agentic-IDE cohort common in mid-2026. None of these are reasons to dismiss the finding. They are reasons to bound it. The bound that matters: AI can make experienced engineers slower on the kinds of work where being slow is least visible — quality, judgment, navigating depth — while making them feel like they sped up.

A third body of evidence quietly corroborates this. A two-year longitudinal study of Norway's NAV IT (26,317 commits across 703 repositories, twenty-five Copilot users versus fourteen non-users, with thirteen follow-up interviews) found no statistically significant change in commit-based activity after Copilot adoption.^[7] Worse: developers who adopted Copilot were already more active than non-adopters before they adopted it. The "X percent productivity gain from AI" numbers that dominate vendor case studies are partly measuring who self-selects into adoption.

Add to all of this Stack Overflow's 2025 trust numbers, which read like a graph of a honeymoon ending in real time.^[1] Positive sentiment has dropped from above seventy percent in 2023–2024 to fifty-nine point seven percent. Respondents who "highly trust" AI accuracy: three point one percent. Active distrust: forty-five point seven percent, up from thirty-one percent the year before. Sixty-six percent cite "AI solutions that are almost right, but not quite" as their top frustration.

This is not adoption failing. This is adoption running into the wall of an unchanged operating model.

Amplifier — and where the frame breaks

DORA 2025's headline framing is one of the cleanest organising ideas in this field. Nathen Harvey, the report's lead author, puts it this way:

In well-organised organisations with strong practices, AI amplifies that flow and accelerates value delivery. And in fragmented organisations with brittle processes, AI will expose those pain points and bottlenecks.^[4]

In other words: AI is not a transformer. It is an amplifier of whatever you already have. Teams whose existing control systems are mature — meaningful tests, disciplined version control, fast feedback loops, deployable trunk — compound. Teams whose systems are brittle find that AI accelerates the bad parts faster than the good parts.

This framing is right that the effect exists. AI does amplify. But Faros's contradiction is also right: maturity alone does not protect against the quality and review-capacity collapse documented in their telemetry. The synthesis is not "one report is wrong." It is that the two reports are measuring different things at different time horizons.

DORA measures self-reported outcomes around one year into adoption, in a snapshot dominated by teams still riding the honeymoon. Faros measures telemetry across two years, an interval long enough for the operating-model bottleneck to become visible in the numbers. DORA's framing is correct about which teams will benefit at all. Faros's framing is correct that even those teams will not benefit without operating-model change.

The implication is sharper than either headline alone. Engineering maturity is necessary and not sufficient. A team with strong tests, fast CI, and disciplined trunk-based development will still hit the whiplash if its definition of done, its PR-size norms, and its review-capacity allocation were written for a pre-AI throughput model. The amplifier is real; the amplifier amplifies the wrong things first.

The wrong things, specifically, are volume metrics. AI lifts the most legible numbers (PRs merged, tickets closed, lines committed) well before it lifts the harder-to-measure ones. By the time the team notices that bugs per developer are up fifty-four percent, churn is up tenfold, and incidents per PR have tripled, the new throughput rhythm is already established and the org has hired against it.

This is why so many AI adoption programmes look successful for six months and then quietly decay. The KPIs they were measured against were the ones AI helps with first. The ones that matter — defect rate, mean time to detect, time spent on rework, cycle time including review — surface their decline later.

The choice that determines which side of the whiplash a team lands on is operating-model design. Specifically: the team has to decide, before the whiplash arrives, that it will treat PR size as a load-bearing variable, that it will scale review capacity alongside generation capacity, that it will run evals where it currently runs tests, and that it will update its definition of done to validate behaviour, not just a passing lint run. These are not technical choices. They are managerial choices, and they have to be made consciously.

Three operating-model assumptions that must die

Most engineering operating models were written for a humans-typing-code world. Three assumptions buried inside them survive on momentum and now have to die.

Assumption 1 — "PR size is irrelevant if review is fast"

Engineering tradition treats PR size as a hygiene preference, not a load-bearing constraint. Small PRs are nice. Big PRs are reviewable if reviewers are fast and motivated. Linters and CI catch the obvious problems; reviewers catch the human-judgment ones; the system tolerates significant variance in PR size because human typing speed is the rate-limiting step.

AI removes the rate-limiting step. Faros's telemetry shows median PR size growing fifty-one percent inside two years of adoption, with median review time growing fivefold over the same window.^[3] The arithmetic is brutal: a team whose two reviewers absorbed a hundred average-sized PRs per week cannot absorb a hundred PRs at 1.5× the size, each taking five times as long to review, without something breaking. What breaks is reviewer depth. Reviewers move from line-by-line judgment to skim-and-approve. Subtle bugs slide through. Trust in the review system erodes, and engineers learn that the review motion is theatrical, which makes them care less about it themselves.
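The capacity arithmetic can be made concrete with a minimal sketch. The hundred PRs per week, the two reviewers, and the fivefold review-time multiplier come from the paragraph above. The thirty-minute baseline review is a hypothetical figure chosen only for illustration, and the sketch assumes that the fivefold growth in median review time translates directly into reviewer effort, which the telemetry alone does not establish.

```python
def weekly_review_hours(prs_per_week: int, minutes_per_review: float,
                        review_time_multiplier: float = 1.0) -> float:
    """Total reviewer hours needed per week for a given PR volume."""
    return prs_per_week * minutes_per_review * review_time_multiplier / 60


PRS_PER_WEEK = 100            # from the text
REVIEWERS = 2                 # from the text
BASELINE_REVIEW_MINUTES = 30  # hypothetical: assumed pre-AI review time per PR
REVIEW_TIME_MULTIPLIER = 5.0  # fivefold growth in median review time (Faros)

before = weekly_review_hours(PRS_PER_WEEK, BASELINE_REVIEW_MINUTES)
after = weekly_review_hours(PRS_PER_WEEK, BASELINE_REVIEW_MINUTES,
                            REVIEW_TIME_MULTIPLIER)

print(f"before: {before / REVIEWERS:.0f} h per reviewer per week")
print(f"after:  {after / REVIEWERS:.0f} h per reviewer per week")
```

Under these assumptions each reviewer goes from 25 to 125 hours of review per week. No reviewer has 125 hours, so either reviews get shallower or PRs queue.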

The first death has to be the assumption that PR size is reviewable on demand. Treating median PR size as a tracked KPI — and decomposing PRs above a stated cap by default — is now a load-bearing operational discipline.

If your team has not changed its PR-size norms since adopting AI, it is running pre-AI review capacity against post-AI generation volume. That arithmetic has one outcome.

Assumption 2 — "Engineers review what the AI wrote"

The implicit contract of AI-assisted engineering is that the engineer is the gate. The agent writes, the human checks, the system stays correct because the human is paid to catch mistakes.

The METR study and the NAV IT longitudinal study attack this assumption from different angles, and the attacks converge. METR's finding is that experienced engineers using AI tools are slower than they think they are; they perceive a twenty-percent speedup while delivering a nineteen-percent slowdown.^[5] The slowdown lives in the gap between "I generated something" and "I confirmed it is right." Reviewers underestimate that gap because the generated artefact looks done.

NAV IT goes the other direction: developers who adopted Copilot were already more active than non-adopters before adoption.^[7] Vendor case studies that compare adopters to non-adopters are partly measuring selection. The "X percent productivity gain" is, in part, the differential between the kinds of people who self-select into AI tools and the kinds who don't.
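The selection effect is easier to see with numbers. The sketch below uses invented weekly commit counts (not data from NAV IT or any vendor) in which adoption changes nothing, yet a naive adopter-versus-non-adopter comparison still reports a large gain. The difference-in-differences calculation is a standard way to subtract out pre-existing differences; it is named here as an illustration, not as the method the NAV IT authors used.

```python
def naive_gain(adopters_after: float, non_adopters_after: float) -> float:
    """Cross-sectional comparison: adopters vs non-adopters after rollout."""
    return adopters_after / non_adopters_after - 1


def diff_in_diff(adopters_before: float, adopters_after: float,
                 non_before: float, non_after: float) -> float:
    """Change among adopters minus change among non-adopters."""
    return (adopters_after - adopters_before) - (non_after - non_before)


# Hypothetical weekly commits per developer, invented for illustration only.
# Adopters were already more active before the tool arrived.
adopters_before, adopters_after = 12.0, 12.0
non_before, non_after = 8.0, 8.0

gain = naive_gain(adopters_after, non_after)
effect = diff_in_diff(adopters_before, adopters_after, non_before, non_after)

print(f"naive 'productivity gain': {gain:.0%}")
print(f"difference-in-differences: {effect} commits per week")
```

The naive comparison credits the tool with a fifty percent gain that existed before adoption; once each group is compared with its own baseline, the measured effect is zero.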

Both findings have the same operational consequence: "engineers review what the AI wrote" is an assertion, not a control. Teams that want it to be a control have to instrument it with review-time-per-PR tracking, mandatory test-coverage deltas, and structured review checklists that surface the questions the engineer might skim past. Trust in review is a thing the team builds, not a thing it inherits from job titles.

Stack Overflow's three-point-one-percent "highly trust" number is not a complaint. It is a leading indicator.

Assumption 3 — "Adoption is transformation"

The third assumption is the most expensive. It says: once developers are using the AI tool every day, the transformation is complete. The work changes. The numbers move. The bet pays off.

NAV IT's null finding (no statistically significant commit-activity change after Copilot adoption, despite high daily use) contradicts this directly.^[7] Adoption is a usage metric. Transformation is an output metric. They correlate weakly without deliberate operating-model change. A team can have a hundred percent daily AI use and still produce roughly the same volume of value it did before, because nothing about how work flowed, got reviewed, or shipped was redesigned for the new rhythm.

The fix is to stop measuring adoption as a proxy for transformation. The leading indicators we recommend tracking are the share of code paths covered by behavioural evals running in CI, median PR size, and review-time-per-LOC. None of these are usage metrics. All of them move only when the operating model changes.
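A minimal sketch of how the three indicators might be computed from exported PR data. The `PullRequest` record, its field names, and the idea of counting "paths" as integers are all assumptions made for illustration; real inputs would come from whatever the team's version-control and CI systems export.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PullRequest:
    changed_lines: int     # lines added plus lines deleted
    review_minutes: float  # reviewer time spent on this PR


def leading_indicators(prs: list[PullRequest], eval_covered_paths: int,
                       total_paths: int) -> dict[str, float]:
    """Compute the three indicators; assumes prs is non-empty and total_paths > 0."""
    sizes = [pr.changed_lines for pr in prs]
    minutes_per_loc = [pr.review_minutes / pr.changed_lines
                       for pr in prs if pr.changed_lines > 0]
    return {
        "median_pr_size": median(sizes),
        "median_review_minutes_per_loc": median(minutes_per_loc),
        "eval_coverage_share": eval_covered_paths / total_paths,
    }


# Hypothetical week of PRs, invented for illustration.
week = [PullRequest(100, 20), PullRequest(400, 40), PullRequest(50, 25)]
indicators = leading_indicators(week, eval_covered_paths=3, total_paths=12)
print(indicators)
```

Review minutes per line falling while PR size rises is the skim-and-approve pattern in numeric form, which is why the two are worth reading together rather than separately.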

What "AI-native" actually means

OpenAI published the most operationalised existing definition of AI-native engineering in early 2026, in a developer-facing Codex guide.^[8] They organise work along three axes — Delegate, Review, Own — applied across every SDLC phase from Plan to Maintain.

The agent becomes the first-pass implementer; the engineer becomes the reviewer, editor, and source of direction. — OpenAI

The framework is genuinely good. It moves the conversation off "which IDE?" and onto "which agent should do which task, and where does human judgment apply?" The Plan / Design / Build / Test / Review / Document / Deploy / Maintain matrix is concrete enough to grade against. Several teams we've talked to in the last quarter have started using it as the spine of their internal AI-engineering rubric.

It also has a clear gap. OpenAI's framework is about task partitioning. It does not name the operating-model prerequisites for the partitioning to work. Specifically: it does not say what happens to PR review capacity when the agent is doing first-pass implementation. It does not say how the team's definition of done changes for code the agent wrote. It does not say what evals run in CI to catch the failures the human reviewer will not. It does not say how platform quality — CI, test, version control — becomes the load-bearing precondition of the framework.

OpenAI tells you what AI-native engineering looks like at the task level. It does not tell you what AI-native engineering looks like at the operating-model level. The gap is where most teams stall.

The Whitescroll working definition extends OpenAI's framing one layer down:

An AI-native engineering team is one whose operating model is designed for a world where the implementation cost has collapsed and the review cost has not.

This is a noun definition, not a posture. A team is AI-native to the degree that its conventions, rituals, capacity allocations, and definitions of done have been deliberately redesigned for the new arithmetic. Tooling is a precondition; it is not the definition.

The seven-question checklist below operationalises the definition. A team can grade itself out of seven. Most teams we work with score two or three on first pass. A score of five or above correlates, in our experience, with teams that have crossed the whiplash. The benchmark is not perfect; the questions are the operational levers we have seen matter.

1. PR-size discipline. Is median PR size tracked as a team KPI? Is there a stated cap above which PRs are decomposed by default?
2. Review capacity scaling. Has the team explicitly increased reviewer time-allocation in proportion to generation throughput? Is review-time-per-PR a metric anyone looks at?
3. Evals in CI. For every LLM-touched surface, do behavioural evals run on every PR, failing the build if the eval suite regresses?
4. Definition of done updated for AI-generated code. Does the team's DoD distinguish between human-written and AI-generated code, requiring different validation for each?
5. Delegate / Review / Own partitioning is explicit. For each task type, has the team named which work is delegated to agents, which is reviewed, which is owned? Is the partition reviewed quarterly?
6. Platform quality at maturity bar. Is the team's CI, test, and version-control infrastructure at the DORA-equivalent of high-performer? If not, AI is amplifying the wrong things.
7. Trust calibration. Are engineers explicitly taught when to override the agent, with documented heuristics rather than intuition alone? Is the override rate tracked?

A team that scores seven out of seven is rare. A team that scores zero is the majority case six months into a tooling rollout. The checklist is not a one-time grade. It is a year-long change programme disguised as a self-assessment.
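For teams that want to record the grade and re-run it each quarter, the checklist reduces to a small tally. This is a sketch only: the question labels are copied from the checklist above, while the yes/no answer format and the `grade` helper are assumptions, not a Whitescroll tool.

```python
CHECKLIST = [
    "PR-size discipline",
    "Review capacity scaling",
    "Evals in CI",
    "Definition of done updated for AI-generated code",
    "Delegate / Review / Own partitioning is explicit",
    "Platform quality at maturity bar",
    "Trust calibration",
]


def grade(answers: dict[str, bool]) -> tuple[int, list[str]]:
    """Return the score out of seven and the questions still answered 'no'.

    Unanswered questions count as 'no'.
    """
    gaps = [q for q in CHECKLIST if not answers.get(q, False)]
    return len(CHECKLIST) - len(gaps), gaps


# Hypothetical first-pass answers for one team.
first_pass = {
    "PR-size discipline": True,
    "Platform quality at maturity bar": True,
}
score, gaps = grade(first_pass)
print(f"score: {score}/7")
for question in gaps:
    print(f"  gap: {question}")
```

Storing the `gaps` list alongside the score makes it possible to see which questions moved between quarterly reviews, not just how many.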

The first three changes that compound

Of the seven levers in the checklist, three compound faster than the others. We always start with these. They are the changes that, in our experience, unlock the rest.

1 — Cap PR size, decompose by default

This is the boring one. It is also the one with the most direct operational effect.

The Shopify operating model is the most public example. CTO Mikhail Parakhin said on the Latent Space podcast that Shopify's monthly PR merge rate grew about thirty percent month-on-month through late 2025, with estimated complexity per PR also rising.^[9] The bottleneck did not stay at generation. It moved through the system: first to review, then to CI/CD throughput, then to deployment cycle time as the probability of any single PR breaking a test grew. Shopify's response was to invest in PR review tooling that did not exist on the market: top-tier models running expensive multi-turn review passes rather than swarms of cheap agents. Parakhin's framing is direct:

I would claim by now a good model writes code on average with fewer bugs than the average human. But since they write so much more of it, more of it will make it into production. So you have to have very rigorous PR reviews. — Mikhail Parakhin, CTO, Shopify^[9]

The implication for any team that is not Shopify: PR size is now the variable that protects the rest of the operating model from the whiplash. Treat it as you would a tracked SLI. Set a cap; a reasonable starting point is four hundred changed lines, excluding test code, tuned from there. Make decomposition the default expectation, not the exception. Track the median weekly. The discipline does not have to be heavy-handed; it has to exist.
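One way to make the cap mechanical is to count changed non-test lines from `git diff --numstat` output, which prints added lines, deleted lines, and path separated by tabs, with `-` in place of the counts for binary files. The sketch below assumes that input; the `is_test_path` naming convention and the sample data are hypothetical and would need adjusting to the repository's layout.

```python
from statistics import median

PR_LINE_CAP = 400  # starting point from the text; tune per team


def is_test_path(path: str) -> bool:
    """Hypothetical convention for recognising test files."""
    parts = path.lower().split("/")
    directories, filename = parts[:-1], parts[-1]
    return (
        "test" in directories
        or "tests" in directories
        or filename.startswith("test_")
        or ".test." in filename
        or ".spec." in filename
    )


def changed_non_test_lines(numstat: str) -> int:
    """Sum added plus deleted lines, skipping binary and test files."""
    total = 0
    for line in numstat.strip().splitlines():
        added, deleted, path = line.split("\t", 2)
        if added == "-" or is_test_path(path):
            continue
        total += int(added) + int(deleted)
    return total


def needs_decomposition(numstat: str, cap: int = PR_LINE_CAP) -> bool:
    return changed_non_test_lines(numstat) > cap


# Invented sample in `git diff --numstat` format.
sample_numstat = (
    "120\t30\tsrc/billing/invoice.py\n"
    "200\t10\ttests/test_invoice.py\n"
    "-\t-\tdocs/diagram.png\n"
    "310\t5\tsrc/billing/tax.py\n"
)
print(changed_non_test_lines(sample_numstat), needs_decomposition(sample_numstat))

# Weekly median over hypothetical PR sizes.
weekly_median = median([120, 465, 80, 390])
print(weekly_median)
```

The check is deliberately a flag, not a hard block: whether an over-cap PR is rejected or merely routed for decomposition is a team decision the sketch leaves open.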

2 — Run evals where you run tests

The second compounding change is to install behavioural evals on every LLM-touched code path, running in CI on every PR, failing the build when they regress.

This sounds obvious until you look for it. The 2024–2026 practitioner discourse on evals — Hamel Husain, Eugene Yan, Chip Huyen, the Pragmatic Engineer newsletter — has been consistent that evals are the skill most teams skip and most regret skipping. Husain's framing is the clearest:^[10]

Documentation tells the agent what to do. Telemetry tells it whether it worked. — Hamel Husain

Without evals in CI, the team has no early-warning system for the kinds of regressions Faros documents — generated code that looks right and breaks at runtime. The eval suite is to LLM-touched code what the test suite was to pre-AI code: the load-bearing safety net that lets the team move fast without breaking things silently.

A reasonable starting structure for a CI eval suite has four categories. Regression evals: does the system still produce the right output for cases it got right last week? Capability evals: can the system handle the cases we said it could handle? Behaviour evals: does the system avoid what we said it would avoid (refusals, hallucinations, leakage)? Cost evals: do token cost and latency stay inside budget? One of these is rarely enough. Running all four in CI buys most of the protection.

# What an evals-in-CI step looks like, conceptually.
- name: evals
  run: |
    npm run evals:regression    # are we still right on what we were right on?
    npm run evals:capability    # can we handle what we said we could?
    npm run evals:behaviour     # do we avoid what we said we would?
    npm run evals:cost          # latency + tokens inside budget?
  # block merge if any suite regresses by more than the configured threshold

The team that runs this discovers regressions days, not months, after they ship. The team that doesn't, doesn't.
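The "regresses by more than the configured threshold" gate in the step above can be sketched as a comparison of per-suite pass rates against a stored baseline. Everything here is an assumption for illustration: the suite names mirror the four categories, each suite is assumed to report a pass rate between 0 and 1 (for cost, the share of cases inside budget), and the two-point threshold is arbitrary.

```python
def regressed_suites(baseline: dict[str, float], current: dict[str, float],
                     max_drop: float = 0.02) -> list[str]:
    """Return suites whose pass rate fell by more than max_drop.

    A suite present in the baseline but missing from the current run
    is treated as a regression, so a skipped suite cannot pass silently.
    """
    failures = []
    for suite, base_rate in baseline.items():
        rate = current.get(suite)
        if rate is None or base_rate - rate > max_drop:
            failures.append(suite)
    return sorted(failures)


# Hypothetical pass rates, invented for illustration.
baseline = {"regression": 0.99, "capability": 0.90, "behaviour": 0.97, "cost": 1.0}
current = {"regression": 0.99, "capability": 0.84, "behaviour": 0.96}

blocked_by = regressed_suites(baseline, current)
print("merge blocked by:", blocked_by)
# In CI, a non-empty list would be turned into a non-zero exit code.
```

Comparing against a baseline rather than a fixed pass mark keeps the gate meaningful for suites that were never at a hundred percent.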

3 — Update the definition of done for AI-generated code

The third change is the smallest visible artefact and the largest cultural shift. Most teams have a single definition of done that they apply to every PR. AI-native teams have two — one for human-written code, one for AI-generated code — and the second is stricter than the first.

A starting point for the AI-generated DoD that we install:

## Definition of Done — AI-generated code
- [ ] Behaviour validated against eval suite (not just lint + type pass)
- [ ] Test coverage delta is non-negative for touched files
- [ ] Reviewer has read the diff end-to-end, not skimmed
- [ ] Generated comments / docstrings checked for accuracy, not just presence
- [ ] No new dependencies introduced without an explicit reason
- [ ] Subtle behavioural changes (error handling, side effects) called out in PR description

The point is not the specific checklist. It is that the team makes the AI-generated path a deliberately slower one. The asymmetry — fast generation, slow review — is what protects the operating model. Teams that try to keep both fast end up with the Faros pattern. Teams that deliberately slow review end up with the Shopify pattern. The Shopify pattern is the one that compounds.
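Some items on a checklist like this can be checked mechanically, which leaves reviewer attention for the ones that cannot. As a sketch of the coverage-delta item, the function below compares per-file coverage percentages before and after a change. The dictionary inputs are an assumed format; real numbers would come from whatever coverage tool the team already runs.

```python
def coverage_regressions(before: dict[str, float], after: dict[str, float],
                         touched: list[str]) -> dict[str, float]:
    """Return the coverage delta for every touched file whose coverage fell.

    Files with no baseline (new files) are compared against 0.0, so they
    can never count as a regression.
    """
    drops = {}
    for path in touched:
        old = before.get(path, 0.0)
        new = after.get(path, 0.0)
        if new < old:
            drops[path] = round(new - old, 2)
    return drops


# Hypothetical per-file line coverage in percent.
before = {"src/invoice.py": 80.0, "src/tax.py": 60.0}
after = {"src/invoice.py": 75.5, "src/tax.py": 60.0, "src/refund.py": 40.0}

drops = coverage_regressions(
    before, after, touched=["src/invoice.py", "src/tax.py", "src/refund.py"]
)
print(drops or "coverage delta is non-negative for touched files")
```

An empty result satisfies the checklist item; anything else names the files the PR description should explain.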

What to do Monday

None of these changes require a tool. None of them require a vendor. None of them require a budget reallocation. They require the engineering leader to decide — out loud, in front of the team — that the operating model has to be different now.

The first move is the smallest. Pick one team. Run the seven-question checklist with them. Find the question where they score the lowest. Make that question the focus for a quarter. If review capacity hasn't scaled with generation throughput, that's a staffing decision and a metrics decision. If evals aren't in CI, that's a one-engineer-for-a-month investment. If PR size has drifted, that's a cultural recalibration that costs nothing and pays off in weeks.

The second move is to measure what you decided to fix. Not adoption. Not AI use. Not satisfaction. The leading indicators are median PR size, review-time-per-PR, code churn, and the share of LLM-touched paths covered by CI evals. These are unsexy in a board deck. They are the ones that distinguish teams that crossed the whiplash from teams that didn't.

The third move is to read the seven-question checklist again in three months and grade the same team. If the score has moved by two questions in a quarter, the operating model is shifting. If it has moved by zero, the team rolled out tools and called it transformation — and is now living the back half of the Faros curve.

The teams that compound are not the ones with the biggest license budget. They are the ones that decided, before the whiplash hit them, that AI-native engineering is an operating-model commitment, not a tooling rollout. That commitment is what we install. That commitment is also what any team can install for itself, starting Monday.

If your team scored under four out of seven and you'd like a second pair of eyes on the change programme, we read every note that lands at hello@whitescroll.com.

References

[1] Stack Overflow. 2025 Developer Survey — AI. 2025. Primary source; ~49,000 respondents; the largest annual developer census.
[2] Google Cloud DORA. 2025 DORA AI-Assisted Software Development Report. 2025. Primary source; ~5,000 respondents + 100+ hours qualitative interviews. Google Cloud sponsorship; methodology independent.
[3] Faros AI. The AI Acceleration Whiplash — 2026 Engineering Report. 2026. Vendor-published telemetry; 22,000 developers across 4,000 teams; two years of IDE / VCS / CI signal. Not peer-reviewed; cite as telemetry, not as a controlled study.
[4] Jellyfish. What the 2025 DORA Report Tells Us About AI in Software Engineering. 2025. Secondary source; interview with DORA lead author Nathen Harvey.
[5] METR. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. July 2025. Pre-registered RCT; 16 experienced developers; 246 real tasks in mature codebases. Most-cited skeptical finding in this domain.
[6] METR. Uplift Update — Follow-up Study. February 2026. Methodology refinement of the 2025 RCT; perception/reality gap holds.
[7] Saito et al. A Two-Year Mixed-Methods Study of GitHub Copilot Adoption at NAV IT. arXiv preprint. 2025–2026. Independent academic; non-vendor; named public-sector org; 26,317 commits across 703 repositories.
[8] OpenAI. Build an AI-Native Engineering Team — Codex Guide. 2026. Vendor framework; the strongest existing operationalisation of "AI-native" at the task level.
[9] Latent Space. Inside Shopify's AI-First Engineering Playbook — interview with Mikhail Parakhin. 2026. Primary CTO interview; single-source self-report; quotes are attributed, not asserted as audited data.
[10] Husain, Hamel. Evals as Skills for Coding Agents. Substack. 2025. Practitioner essay; widely cited in the AI-engineering eval discourse.

Cite this essay: Whitescroll. "Why Most Engineering AI Adoption Stalls." 2026-06-19. whitescroll.com/writing/why-most-engineering-ai-adoption-stalls.