AI Made Developers 19% Slower. The Cause Was Context, Not the Model.
In July 2025, METR published the results of a randomized controlled trial studying the productivity impact of AI coding tools on experienced developers. The headline finding was uncomfortable: developers using Cursor on large, familiar codebases were 19% slower than the same developers working without AI assistance.
That is not a cherry-picked outlier. This was a controlled experiment. Experienced open-source contributors. Real production codebases they knew well. Current, capable tooling. And they ended up slower.
The part that deserves at least as much attention as the finding itself: after experiencing the slowdown firsthand, the participants still believed AI had made them about 20% faster. Measured against the actual 19% slowdown, that is a perception gap of roughly 39 percentage points. The tool made them slower and they did not notice.
The instinctive reaction to this study in developer communities was either "AI is overhyped" or "the study was flawed." Both miss the more useful conclusion, which is buried in the methodology: the slowdown was not evenly distributed. And understanding where it concentrated tells you exactly what to fix.
What the METR study actually measured
METR study results (July 2025, randomized controlled trial):
-> Developers expected AI would make them 24% faster
-> Actual measured result: 19% slower with AI enabled
-> After experiencing the slowdown, they still believed it made them 20% faster
-> The perception-reality gap: roughly 39 percentage points (20% faster believed vs. 19% slower measured)
Participant profile:
-> Experienced open-source developers (not beginners)
-> Working on large, familiar codebases they knew well
-> Using Cursor, one of the most capable tools available
The participant profile matters. These were not beginners learning to use AI tools. They were experienced developers working on codebases they already knew. That is the scenario where AI is supposed to shine: not learning mode, but production mode, with a developer who knows what they want and just wants to go faster.
And on large, familiar codebases, those developers ended up slower. The study is important not because it proves AI tools are useless — they clearly are not — but because it isolates the specific condition under which the productivity narrative inverts: large, established codebases where system understanding matters more than generation speed.
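The arithmetic behind these figures is easy to check directly. The sketch below treats each figure as a signed speedup (positive means faster, negative means slower) and uses only the three numbers quoted in this section. The function and variable names are illustrative, not taken from the study.

```python
# Signed speedups in percent, as quoted above: positive = faster, negative = slower.
FORECAST_SPEEDUP = 24.0    # what developers expected before the study
MEASURED_SPEEDUP = -19.0   # what was measured: 19% slower
BELIEVED_SPEEDUP = 20.0    # what developers still believed afterwards


def gap_in_points(believed: float, measured: float) -> float:
    """Distance between a believed and a measured speedup, in percentage points."""
    return believed - measured


forecast_gap = gap_in_points(FORECAST_SPEEDUP, MEASURED_SPEEDUP)
perception_gap = gap_in_points(BELIEVED_SPEEDUP, MEASURED_SPEEDUP)

print(f"forecast vs. measured: {forecast_gap:.0f} points")
print(f"belief vs. measured:   {perception_gap:.0f} points")
```

The second number is the one that matters for decision-making: it is the distance between what developers reported after the fact and what the clock recorded.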
Why experienced developers on large repos got slower
The METR study's own analysis of contributing factors is illuminating. The 44% acceptance rate — meaning just under half of AI suggestions were actually used — is the most telling number. That means more than half of all AI output required time to read, evaluate, and reject. On a small project with simple conventions, that review is fast. On a large production system where every suggestion needs to be evaluated against years of architectural decisions, shared patterns, hidden dependencies, and cross-service contracts — that review is expensive.
Why AI made experienced developers slower on large repos:
-> 44% of AI suggestions were accepted — meaning 56% required review and rejection
-> Time spent reviewing bad suggestions exceeded time saved by good ones
-> Relevant context rarely made it into the model's view automatically
-> Developers spent significant time "getting AI into context" before useful output began
-> Cross-file, cross-service understanding required manual stitching the AI could not do
-> Debugging AI-generated code that looked right but violated system conventions
The last item is underappreciated. When an AI coding tool suggests something that looks syntactically correct but violates a system convention (using a pattern the team deprecated, missing a required audit log write, ignoring a state machine that governs a particular flow), debugging that suggestion takes longer than writing the correct implementation from scratch would have. The developer has to recognize that something is wrong, trace why, fix it, and often explain to the model why its suggestion was incorrect. That cycle is a net negative on time.
All of these failure modes share a root cause: the model did not have accurate context about how this specific system works. It was generating based on what looks right in general, not what is right for this codebase. The retrieval failed before the generation started. This is the same wall we describe in "AI coding quietly hitting a retrieval wall."
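The trade-off in this section can be sketched as a toy expected-value model. This is an illustration under stated assumptions, not the study's analysis: only the 44% acceptance rate comes from the text above, and every per-suggestion cost in minutes is hypothetical, chosen to show how the same acceptance rate can be a net win on a small project and a net loss on a large one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuggestionCosts:
    """Per-suggestion time costs in minutes. All values used below are hypothetical."""
    minutes_saved_per_accept: float       # time a good suggestion saves
    review_minutes_per_suggestion: float  # time to read and evaluate any suggestion
    rework_minutes_per_reject: float      # extra time lost on a rejected one (re-prompting, debugging)


def net_minutes_per_suggestion(accept_rate: float, costs: SuggestionCosts) -> float:
    """Expected minutes gained (positive) or lost (negative) per suggestion."""
    gain = accept_rate * costs.minutes_saved_per_accept
    loss = (costs.review_minutes_per_suggestion
            + (1 - accept_rate) * costs.rework_minutes_per_reject)
    return gain - loss


def break_even_accept_rate(costs: SuggestionCosts) -> float:
    """Acceptance rate at which the expected net time is zero.

    Solves a*saved - review - (1 - a)*rework = 0 for a.
    A result above 1.0 means the tool cannot break even at these costs.
    """
    return ((costs.review_minutes_per_suggestion + costs.rework_minutes_per_reject)
            / (costs.minutes_saved_per_accept + costs.rework_minutes_per_reject))


ACCEPT_RATE = 0.44  # acceptance rate quoted in the article

# Hypothetical cost profiles: review and rework are cheap on a small project,
# expensive on a large system with years of conventions.
small_repo = SuggestionCosts(3.0, 0.5, 0.5)
large_repo = SuggestionCosts(3.0, 1.5, 2.0)

for name, costs in [("small repo", small_repo), ("large repo", large_repo)]:
    net = net_minutes_per_suggestion(ACCEPT_RATE, costs)
    needed = break_even_accept_rate(costs)
    print(f"{name}: net {net:+.2f} min/suggestion, break-even acceptance {needed:.0%}")
```

With these made-up costs, the small project comes out ahead at a 44% acceptance rate while the large one loses time on every suggestion and would need a much higher acceptance rate to break even. The point of the sketch is the shape of the trade-off, not the specific numbers.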
The perception gap is the most dangerous part
The finding that developers felt faster while being slower is not a curiosity. It is an organizational risk. Teams making tooling and infrastructure decisions based on developer perception are measuring the wrong thing. If your developers report that AI tools are helping them ship faster but your delivery metrics are not improving — or are getting worse — the METR finding is a plausible explanation.
The perception gap likely has a straightforward explanation: certain tasks feel dramatically faster with AI. Generating boilerplate, writing tests for simple functions, scaffolding new components — these are subjectively fast with AI assistance, and they happen frequently enough to dominate a developer's day-to-day perception. The time lost reviewing incorrect suggestions, debugging subtle convention violations, and fighting through the "getting into context" overhead is less salient because it is spread across many interactions rather than experienced as a single costly moment.
Where AI still reliably accelerates development
The METR study is not a verdict on AI coding tools in all contexts. It is a verdict on a specific scenario: experienced developers on large, established codebases. That scenario is also one of the most common and most important in professional software development — which is why the finding matters. But it is worth being precise about where the study's conclusion does and does not apply.
Where AI does reliably make developers faster:
-> new projects and greenfield code with no existing conventions to honor
-> isolated, self-contained tasks that don't require cross-repo understanding
-> boilerplate, tests, documentation, and well-defined transformations
-> when the developer already knows the answer and wants generation speed
None of these scenarios require the AI to understand a complex, established system. They are all either greenfield (no prior context to honor) or self-contained (the relevant context fits in the local view). In these cases, AI generation speed is the bottleneck and better generation quality translates directly to faster output.
The problem for most professional development teams is that this category is not where most of their work lives. Most of their work involves systems that have years of decisions baked in, conventions that exist for reasons the documentation does not fully explain, cross-service dependencies that are not visible from any single file, and edge cases that live in institutional memory rather than inline comments. That is exactly the context that current AI tools cannot reconstruct from local retrieval alone.
The fix is not less AI. It is better context.
The correct takeaway from the METR study is not "stop using AI coding tools." It is "fix the retrieval problem." The 19% slowdown was not because the models were too weak to help. It was because the models did not have accurate, current context about the systems they were helping with. The suggestions that required review and rejection were largely not wrong in a generic sense; they were wrong for this specific system. Better context produces fewer of those suggestions, which means the review overhead drops and the net productivity effect turns positive again.
What changes when AI has accurate system context:
-> suggestions respect actual conventions, not generic internet patterns
-> cross-service dependencies are visible before they break something
-> existing implementations surface before the AI builds a duplicate
-> "getting into context" is immediate, not a 10-minute setup ritual
-> the 56% rejection rate drops because fewer suggestions violate system reality
The practical implication for engineering teams is that measuring developer productivity on AI-assisted work without controlling for context quality is measuring noise. Two teams using the same AI tools can have opposite productivity outcomes depending on whether those tools have accurate system context or not. The team with good context gets faster. The team without it gets slower, and may not notice until the slowdown shows up in its delivery metrics.
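One way to picture "controlling for context quality" is to split task timings by a context-quality label before comparing AI-assisted and unassisted work. The sketch below is a minimal illustration with made-up data: the record layout, the "good"/"poor" labels, and every timing are assumptions, and a real comparison only means something when the tasks are comparable (for example, randomly assigned, as in the METR trial).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical task records: (context_quality, used_ai, minutes_to_complete).
TASKS = [
    ("good", False, 100), ("good", False, 120),
    ("good", True, 80), ("good", True, 96),
    ("poor", False, 100), ("poor", False, 120),
    ("poor", True, 125), ("poor", True, 139),
]


def time_change_pct(with_ai, without_ai):
    """Percent change in mean completion time with AI (negative = faster)."""
    return (mean(with_ai) / mean(without_ai) - 1) * 100


def change_by_context(tasks):
    """Compute the AI time change separately for each context-quality group."""
    groups = defaultdict(lambda: {True: [], False: []})
    for quality, used_ai, minutes in tasks:
        groups[quality][used_ai].append(minutes)
    return {q: time_change_pct(g[True], g[False]) for q, g in groups.items()}


# Pooled view: ignores context quality entirely.
pooled = time_change_pct(
    [m for _, used_ai, m in TASKS if used_ai],
    [m for _, used_ai, m in TASKS if not used_ai],
)
by_context = change_by_context(TASKS)

print(f"pooled: {pooled:+.0f}%")
for quality, change in by_context.items():
    print(f"{quality} context: {change:+.0f}%")
```

In this made-up data set the pooled comparison shows no effect at all, while the two groups move in opposite directions. Each group needs both AI-assisted and unassisted tasks for the comparison to be defined.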
Why this matters beyond individual developers
The METR finding concentrated on individual developer productivity. But the implications compound at the team level. When AI tools produce suggestions that violate conventions, those violations spread. A code review might catch them. It might not. Over time, AI-assisted development on systems without grounded context produces architectural drift — the codebase accumulates patterns that each made sense locally to an AI tool without system context, but that together undermine the coherence of the system.
This is a problem a single developer can partially absorb through careful review. It is a problem a team cannot absorb without a shared, accurate source of system truth that every developer's AI tools draw from. Individual context strategies — CLAUDE.md files, manual @-mentions, editor rules — help at the margin. They do not solve the structural problem that every developer is working from their own partial, manually maintained, inevitably stale representation of the system.
Final take
The METR study is important because it is rigorous and the finding is counterintuitive. AI made experienced developers slower on large codebases. Not because the models were bad — because the context was bad. The 44% suggestion acceptance rate is a diagnostic: more than half of what the AI produced was wrong enough to reject, and reviewing wrong suggestions is slower than just writing the right thing.
The path from "AI makes us slower" to "AI makes us faster" on large production systems runs directly through context quality. Better retrieval, better semantic grounding, better cross-service visibility — these are the inputs that determine whether the productivity curve points up or down. The models are already capable. The missing piece is the system understanding layer that lets those models reason accurately about your specific codebase rather than about software in general.