Good tools, stalled gains.
A 120-engineer platform org had every AI coding tool licensed and a flat adoption curve to show for it. Throughput was up; review time, churn, and rework were up faster. The plateau wasn't the model — it was how the team worked around it.
They didn't need another tool. They needed the conventions, reviews, and evals that turn raw assistance into shipped, trusted change.
Installed the operating model, not a tool.
Over six weeks we installed four pillars and an eval loop the team could run without us. We set conventions for how AI-authored change is scoped and reviewed, tightened the review pipeline so machine-written diffs get human judgment where it counts, stood up an eval harness that scores output against the team's own bar, and trained the leads to own and tune all three.
- Conventions for prompting, scoping, and PR hygiene — written down, not tribal
- A review pipeline that puts human attention on risk, not on boilerplate
- An eval harness scoring real output against the team's quality bar
- Lead enablement, so the model keeps improving after we leave
"Nothing about our stack changed. Everything about our cycle time did."
Four pillars, one eval loop.
The eval loop is the part that compounds. Once the team can measure output against its own bar, every tool decision after we leave is an experiment with a scoreboard — not a guess.