AI Code Ships Faster and Breaks More Often. Here's What the Data Shows.

The business case for AI coding tools is almost always made on velocity: PRs merge faster, deployment frequency goes up, feature output increases. These numbers are real. They are also incomplete. The velocity case for AI tools rests on measuring one side of a ledger that has two sides — and the side that does not get measured is where the costs accumulate.

The data on AI code quality is now substantial enough to be taken seriously. CodeRabbit's analysis of production PRs found that AI-authored changes produce 1.7 times more issues per PR than human-authored changes and carry a 24% higher incident rate. GitClear analyzed 211 million lines of code and found that AI assistant adoption doubled code duplication while halving refactoring activity. These are not theoretical concerns from skeptics of AI tools. They are empirical findings from production codebases where AI tools are already deployed and delivering the velocity gains their vendors promised.

The productivity gain is real. The quality cost is also real. Engineering leadership is currently measuring one and not the other.

The one-sided productivity case

When an engineering organization evaluates AI coding tools, the metrics are almost universally velocity-oriented. Time to merge. Deployment frequency. Features shipped per sprint. Developer satisfaction with their tooling. These are the metrics that are easy to measure, easy to present to leadership, and easy to frame as a positive ROI story.

They are also the metrics that benefit most directly and immediately from AI adoption. The velocity improvement from AI coding tools is felt quickly and is distributed across the team. Developers who use Cursor, GitHub Copilot, or Claude Code typically report getting things done faster. The sprint retrospective looks good. The roadmap delivery looks good. The quarterly engineering review looks good.

What the quarterly review does not show: whether the services built with AI are generating more incidents than the services built without it. Whether the code duplication rate is increasing in a way that will slow the next year of development. Whether refactoring commits, the ones where the team cleans up and consolidates the codebase, have declined, meaning the codebase is growing faster than it is being maintained. These signals take months to surface, and by the time they do, the velocity metrics have already been used to justify broad AI adoption.

What the quality data actually shows

The CodeRabbit finding — 1.7x more issues per PR for AI-authored code — is worth disaggregating. The issues are not distributed uniformly across code types. They cluster in complex conditional logic, edge case handling, and state management. These are the areas where AI tools are most likely to generate code that looks correct and passes tests but contains subtle behavioral errors that surface in production under load or in edge cases that tests did not cover.

The 24% higher incident rate for AI-generated code is the more consequential number for engineering leadership. Incidents have direct business impact: they affect users, they require on-call response, they create reputational and financial cost. A 24% increase in incident rate is not a marginal quality degradation — it is a substantial increase in operational risk that compounds across every service where AI-generated code is deployed.

What the research shows — AI code quality findings:

CodeRabbit analysis (production PRs, 2024–2025):
  -> AI-authored changes: 1.7x more issues per PR vs. human-authored
  -> Incident rate: 24% higher for AI-generated code
  -> Issue types: logic errors, edge case handling, missing validation
  -> Pattern: issues cluster in complex conditional logic and state management

GitClear analysis (211 million lines of code):
  -> Code duplication: doubled with AI assistant adoption
  -> Refactoring commits: halved over the same period
  -> Churn rate (code written then deleted within 2 weeks): increased significantly
  -> Interpretation: AI generates faster but produces more throwaway and duplicate code

METR study (experienced developers using AI coding tools, 2025):
  -> Senior engineers with AI tools: 19% slower on complex tasks vs. without
  -> Cause: time spent reviewing, correcting, and re-prompting AI output
  -> Implication: velocity gains are task-dependent, not universal

The consistent finding across all three: velocity metrics improve.
Quality metrics — defect rates, duplication, incident frequency — worsen.
The productivity story is only half the story.

The GitClear duplication finding is a different kind of problem. Code duplication is not immediately costly: duplicate code works. The cost is paid over time, as the duplicated implementations diverge, as bugs in one implementation are fixed but not propagated to the others, as the codebase grows denser and harder to reason about. Doubling the duplication rate means the codebase accumulates this kind of maintenance debt roughly twice as fast, and that debt will slow future development.

The halved refactoring activity compounds the duplication problem. Refactoring is how a codebase stays healthy over time: teams consolidate duplicate implementations, extract shared abstractions, clean up patterns that have been superseded. When refactoring activity halves, the codebase accumulates technical debt at the rate of development without the cleanup cycle that normally keeps it manageable. The result, visible in the GitClear data, is a codebase that is growing faster and degrading faster simultaneously.

The METR study finding, that senior engineers using AI coding tools are 19% slower on complex tasks, complicates the simple velocity story further. The productivity benefit of AI tools is not uniform across task types. For well-defined, bounded tasks, AI tools genuinely accelerate delivery. For complex tasks that require system-level understanding, the overhead of reviewing, correcting, and re-prompting AI output can exceed the time saved. If that holds more broadly, aggregate velocity gains may be masking slowdowns in exactly the high-value, high-complexity work that engineering leaders care most about.
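To see how an aggregate gain can hide a slowdown, here is a back-of-the-envelope sketch. The 60/40 split of hours and the 30% speedup on bounded tasks are made-up assumptions for illustration; only the 19% slowdown is taken from the METR figure cited in this post.

```python
def hours_with_ai(baseline_hours, time_change):
    """Scale baseline hours by a fractional change (-0.30 means 30% faster)."""
    return baseline_hours * (1 + time_change)


# Hypothetical sprint: 100 engineering hours before AI adoption.
bounded_before = 60.0   # assumed hours on well-defined, bounded tasks
complex_before = 40.0   # assumed hours on complex, system-level tasks

bounded_after = hours_with_ai(bounded_before, -0.30)  # assumed 30% faster
complex_after = hours_with_ai(complex_before, 0.19)   # 19% slower (METR figure)

total_before = bounded_before + complex_before
total_after = bounded_after + complex_after
aggregate_change = (total_after - total_before) / total_before
complex_change = (complex_after - complex_before) / complex_before

print(f"Aggregate time change:    {aggregate_change:+.1%}")
print(f"Complex-work time change: {complex_change:+.1%}")
```

Under these assumptions the team looks roughly 10% faster overall while its complex work takes 19% longer. A dashboard that reports only the aggregate would never show the second number.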

Why velocity and quality metrics diverge

The divergence between velocity improvement and quality degradation has a structural cause: 43.8% of AI coding sessions involve minimal human interaction. The developer submits a prompt, reviews the output at a surface level, and ships it. The generation is fast. The review is shallow. The defects that would have been caught by the cognitive overhead of writing the code — the moments where a developer would have paused, thought about edge cases, realized an assumption was wrong — are not caught by a review that is primarily checking whether the code looks plausible.

This is not negligence. It is the rational response to the tool's value proposition. AI coding tools are valuable because they reduce the time required to produce working code. If the developer rebuilds all the cognitive overhead of writing the code during the review process, the velocity gain disappears. The tool is implicitly asking developers to trust the output at a level that, in practice, means some proportion of the output ships with defects that a more engaged review would have caught.

The 1.7x issue rate is the cost of that trust. It is not an indictment of the tools — it is the predictable consequence of optimizing for generation speed without correspondingly investing in review quality or codebase context quality. The tool generates correct-looking code faster. The review process does not scale with the generation rate. Defects accumulate.

The leadership measurement gap

Engineering leaders who adopted AI tools in 2024 and 2025 are now sitting on enough historical data to measure both sides of the ledger. Most are not doing so. The velocity metrics are being tracked because they were the justification for adoption. The quality metrics are not being tracked because they were not part of the original evaluation framework.

The AI productivity dashboard — what gets tracked vs. what should be:

What most engineering orgs track after AI adoption:
  + PR merge time              (down 30–45% with AI tools)
  + Deployment frequency       (up significantly)
  + Lines of code shipped      (up substantially)
  + Feature delivery speed     (consistently reported as faster)
  + Developer NPS on tooling   (high — developers like AI tools)

What most engineering orgs do NOT track alongside those numbers:
  - Defects per PR by authorship type (AI-authored vs. human-authored)
  - Incident rate change post-AI adoption by service
  - Code duplication rate over time
  - Refactoring activity as a share of total commits
  - MTTR for incidents in AI-heavy vs. human-authored services
  - Post-merge revision rate for AI-generated changes

The gap between what is tracked and what is not tracked
is where the quality cost of AI adoption accumulates invisibly.

A team that does not track defect rates by authorship type does not know whether their AI-generated services are generating more incidents than their human-authored services. A team that does not track code duplication over time does not know whether their codebase health is degrading. A team that does not track refactoring activity does not know whether the cleanup cycle that keeps their codebase maintainable has stalled.

The measurement gap is not a technical problem. The data is available in every git repository and every incident management system. It is a prioritization problem: the metrics that justified AI adoption are being tracked; the metrics that would reveal its quality costs are not. Engineering managers need visibility into both dimensions — and most current AI tool evaluation frameworks give them only one.
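As one example of how little is needed, here is a minimal sketch that compares MTTR between AI-heavy and human-authored services from incident timestamps. The group labels and the sample records are hypothetical; in practice the timestamps would come from an export of your incident management system.

```python
from datetime import datetime
from statistics import mean


def mttr_hours(incidents):
    """Mean time to resolve, in hours, for (opened, resolved) timestamp pairs."""
    durations = [
        (resolved - opened).total_seconds() / 3600 for opened, resolved in incidents
    ]
    return mean(durations)


# Made-up incident records, grouped by a hypothetical service label.
incidents_by_group = {
    "ai_heavy": [
        (datetime(2025, 3, 1, 9, 0), datetime(2025, 3, 1, 12, 0)),
        (datetime(2025, 3, 8, 14, 0), datetime(2025, 3, 8, 19, 0)),
    ],
    "human_authored": [
        (datetime(2025, 3, 2, 10, 0), datetime(2025, 3, 2, 11, 0)),
        (datetime(2025, 3, 9, 8, 0), datetime(2025, 3, 9, 11, 0)),
    ],
}

mttr = {group: mttr_hours(records) for group, records in incidents_by_group.items()}
for group, hours in mttr.items():
    print(f"{group}: MTTR {hours:.1f} hours")
```

The hard part is not the arithmetic. It is deciding how a service gets labeled "AI-heavy" and applying that label consistently.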

What engineering managers need to measure alongside velocity

Closing the measurement gap does not require new tooling for most teams. It requires adding quality dimensions to the metrics that are already being tracked. Several measurements are particularly diagnostic:

Defect rate by authorship type. Tag PRs by their AI involvement level — fully AI-generated, AI-assisted with significant human modification, human-authored — and track the post-merge defect rate for each category. This surfaces whether the 1.7x issue rate from the CodeRabbit data is showing up in your specific codebase, and in which categories of work. It also creates a feedback loop for developers: if fully AI-generated PRs in complex services are generating twice the post-merge defects, that is information that changes review behavior.
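A minimal sketch of that calculation, assuming each PR already carries an authorship tag and a count of defects later traced back to it. The `PullRequest` record, the tag names, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    pr_id: int
    authorship: str          # hypothetical tag: "ai_generated", "ai_assisted", "human"
    post_merge_defects: int  # defects later traced back to this PR


def defect_rate_by_authorship(prs):
    """Return mean post-merge defects per PR for each authorship tag."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [defect count, PR count]
    for pr in prs:
        totals[pr.authorship][0] += pr.post_merge_defects
        totals[pr.authorship][1] += 1
    return {tag: defects / count for tag, (defects, count) in totals.items()}


# Made-up sample data, for illustration only.
sample_prs = [
    PullRequest(1, "ai_generated", 2),
    PullRequest(2, "ai_generated", 1),
    PullRequest(3, "ai_assisted", 1),
    PullRequest(4, "ai_assisted", 0),
    PullRequest(5, "human", 1),
    PullRequest(6, "human", 0),
]

rates = defect_rate_by_authorship(sample_prs)
for tag, rate in sorted(rates.items()):
    print(f"{tag}: {rate:.2f} defects per PR")
```

With real data, the ratio between the `ai_generated` and `human` rates is the number to compare against the 1.7x figure.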

Incident attribution by service and authorship. When incidents are investigated, the postmortem should capture whether the failing code was AI-generated. Over time, this creates a map of which services carry elevated incident risk from AI authorship. The 24% higher incident rate is an average — the actual distribution across services will show which services have elevated risk and which do not, which is actionable information for deployment risk assessment.
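Raw incident counts can mislead here, because a service with more AI-authored changes will naturally have more AI-attributed incidents. The sketch below normalizes by the number of merged changes. It assumes postmortems yield one (service, authorship) tag per incident; the service names and all numbers are invented.

```python
from collections import Counter


def incidents_per_100_changes(incident_tags, merged_changes):
    """Incidents per 100 merged changes for each (service, authorship) pair."""
    incident_counts = Counter(incident_tags)
    return {
        key: 100 * incident_counts[key] / changes
        for key, changes in merged_changes.items()
        if changes > 0
    }


# Hypothetical postmortem tags: one (service, authorship) entry per incident.
incident_tags = [
    ("billing", "ai"),
    ("billing", "ai"),
    ("billing", "human"),
    ("search", "human"),
    ("search", "human"),
]

# Hypothetical counts of merged changes over the same period.
merged_changes = {
    ("billing", "ai"): 40,
    ("billing", "human"): 50,
    ("search", "ai"): 20,
    ("search", "human"): 80,
}

risk_map = incidents_per_100_changes(incident_tags, merged_changes)
for (service, authorship), rate in sorted(risk_map.items()):
    print(f"{service:8} {authorship:6} {rate:.1f} incidents per 100 changes")
```

In this invented data, AI-authored changes carry elevated risk in `billing` but not in `search`, which is the kind of per-service difference an average hides.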

Code duplication rate over time. GitClear's finding of doubled duplication is detectable in any codebase with static analysis tooling. Tracking this quarterly shows whether AI adoption is inflating the duplication rate in ways that will slow future development. If duplication is increasing, the codebase is accumulating maintenance debt at a rate that the velocity gains may not justify.
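For teams without a static analysis tool in place, a rough proxy is easy to build. The sketch below is not GitClear's methodology; it is a simple sliding-window heuristic that measures what fraction of three-line blocks appear more than once. The file names and contents are made up.

```python
from collections import Counter


def duplication_rate(files, window=3):
    """Fraction of `window`-line blocks that occur more than once across files.

    `files` maps a path to its source text. Lines are stripped and blank
    lines dropped before comparison, so whitespace differences are ignored.
    """
    blocks = []
    for text in files.values():
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        for i in range(len(lines) - window + 1):
            blocks.append(tuple(lines[i:i + window]))
    if not blocks:
        return 0.0
    counts = Counter(blocks)
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(blocks)


# Two made-up files that each contain their own copy of the same retry loop.
RETRY_BODY = (
    "    for _ in range(3):\n"
    "        try:\n"
    "            return fn()\n"
    "        except IOError:\n"
    "            pass\n"
)
sample_files = {
    "orders.py": "def retry(fn):\n" + RETRY_BODY,
    "payments.py": "def fetch(fn):\n" + RETRY_BODY,
}

print(f"Duplication rate: {duplication_rate(sample_files):.0%}")
```

The absolute number from a heuristic like this matters less than its trend. Run the same measurement each quarter and watch the direction.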

Refactoring activity as a share of commits. A healthy codebase has a consistent proportion of commits that are refactoring: cleaning up, consolidating, improving structure without changing behavior. If that proportion is declining after AI adoption, the codebase is growing faster than it is being maintained. The GitClear finding of halved refactoring activity is a leading indicator of future velocity loss, not a lagging one.
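One cheap way to approximate this is to classify commits by their subject lines. This is a keyword heuristic, not how GitClear measures refactoring, and it only works if your team writes descriptive commit messages. The keyword list and sample subjects are assumptions; real subjects could be pulled with `git log --pretty=format:%s`.

```python
import re

# Hypothetical keyword list; adjust to your team's commit conventions.
REFACTOR_PATTERN = re.compile(
    r"\b(refactor|cleanup|clean up|consolidate|extract|dedupe|deduplicate|simplify)",
    re.IGNORECASE,
)


def refactoring_share(subjects):
    """Fraction of commit subjects that look like refactoring work."""
    if not subjects:
        return 0.0
    hits = sum(1 for subject in subjects if REFACTOR_PATTERN.search(subject))
    return hits / len(subjects)


# Made-up commit subjects, for illustration only.
sample_subjects = [
    "Add invoice export endpoint",
    "Refactor retry logic into shared module",
    "Fix null check in cart total",
    "Clean up duplicated date parsing",
    "Add feature flag for search v2",
]

print(f"Refactoring share: {refactoring_share(sample_subjects):.0%}")
```

Compare the share for the quarters before and after AI adoption. A sustained drop is the signal worth investigating.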

What high-quality AI-assisted development looks like in practice

The data does not argue against AI coding tools. It argues against deploying them without accounting for their quality costs. The teams that are producing better outcomes with AI tools share a common set of conditions that differ from the default deployment.

Conditions that produce better AI-assisted development outcomes:

Context quality
  -> AI session has access to the full codebase — not just open files
  -> Existing implementations are findable before new ones are generated
  -> Conventions are demonstrated through examples, not just stated in rules files
  -> Cross-service dependencies are visible before changes are made

Human interaction quality
  -> Developer reviews generated code with genuine understanding, not an approval scan
  -> Complex logic is explained by the AI before being accepted
  -> Edge cases are explicitly verified, not inferred from test passage
  -> Changes to shared abstractions are validated against all consumers

Process structure
  -> AI-generated changes in high-risk services get additional review scrutiny
  -> Defect rates are tracked by authorship type and fed back to developers
  -> Refactoring activity is measured and protected from velocity pressure
  -> Incident postmortems tag AI-generated code as a relevant variable

What these conditions have in common:
  They require the AI tool to have system-level context.
  They require the developer to build genuine comprehension.
  They require leadership to measure quality alongside velocity.
  None of them happen automatically when you roll out AI tools.
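The process conditions in particular can be written down as explicit policy rather than left to habit. Here is a minimal sketch of a review rule that adds scrutiny for AI-generated changes in high-risk services. The tag names, risk tiers, and approval counts are hypothetical and would need tuning for any real team.

```python
def required_approvals(authorship, service_risk):
    """Return how many reviewer approvals a PR needs under this sample policy.

    Both the tags and the thresholds are hypothetical placeholders.
    """
    approvals = 1  # baseline for every PR
    if service_risk == "high":
        approvals += 1  # high-risk services always get a second reviewer
        if authorship == "ai_generated":
            approvals += 1  # extra scrutiny for fully AI-generated changes
    return approvals


for authorship in ("human", "ai_generated"):
    for risk in ("low", "high"):
        count = required_approvals(authorship, risk)
        print(f"{authorship:13} {risk:5} -> {count} approval(s)")
```

A rule like this only helps if PRs are tagged honestly, which is one more reason to treat authorship tagging as routine metadata rather than a judgment on the developer.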

The most important condition in that list is context quality. The root cause of most AI code quality problems is the same: the AI generated code that was locally reasonable but globally wrong because it did not know the full system. The retry logic was implemented inline because the AI did not know about the shared retry module. The event was published directly because the AI did not know about the event bus abstraction. The edge case was missed because the AI did not know about the database behavior that creates it.

Giving AI sessions access to a semantic index of the full codebase changes this. When the tool can find the existing implementation before generating a new one, duplication drops. When it can trace cross-service dependencies before making changes, incident rates from unexpected breakage drop. When conventions are demonstrated through indexed examples rather than stated in rules files, convention violations drop. The quality problems that the data documents are largely context problems — and context problems have context solutions.
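The "find it before you generate it" step can be illustrated with a toy lookup. This sketch uses plain keyword overlap, not a semantic index, and it is not a description of how any particular product works. The index entries and function names are invented.

```python
import re


def tokenize(text):
    """Split identifiers and prose into a set of lowercase word tokens."""
    # Break camelCase and snake_case into separate words first.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text).replace("_", " ")
    return set(re.findall(r"[a-z0-9]+", spaced.lower()))


def find_existing(query, index, min_overlap=2):
    """Return names of indexed implementations that share words with the query."""
    query_tokens = tokenize(query)
    scored = []
    for name, description in index.items():
        overlap = len(query_tokens & tokenize(name + " " + description))
        if overlap >= min_overlap:
            scored.append((overlap, name))
    # Best match first; ties broken alphabetically for stable output.
    return [name for _, name in sorted(scored, key=lambda item: (-item[0], item[1]))]


# Hypothetical index of existing shared code: qualified name -> description.
code_index = {
    "shared.retry.with_backoff": "Retry a callable with exponential backoff",
    "events.bus.publish": "Publish an event through the event bus",
    "billing.invoice.total": "Compute invoice total with tax",
}

print(find_existing("retry http call with backoff", code_index))
```

Keyword overlap misses anything phrased differently from the existing code, which is the gap a semantic index is meant to close. The workflow is the same either way: search first, generate second.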

Kognita maintains a managed semantic index across the full codebase, automatically re-indexed on every merge, available to every AI coding session and to every team member who needs to understand what the system does. The velocity gains from AI tools do not require accepting the quality costs — but capturing both requires infrastructure that the tools themselves do not provide.

Final take

The productivity case for AI coding tools is real and the data supports it. The quality cost is also real and the data supports that too. Engineering leadership that is tracking only velocity is making adoption decisions on half the available information.

A 1.7x issue rate and a 24% higher incident rate are not tradeoffs that anyone knowingly accepted in exchange for faster feature delivery. They are quality costs that have not yet been priced into the productivity calculation. The teams that will get the most value from AI tools over the next three years are the ones that measure both sides of the ledger now, before the quality debt compounds to the point where it offsets the velocity gains that justified adoption.