
AI Fixed the Bug. It Also Changed Twenty Files You Didn't Ask It To.


You asked the agent to fix a null pointer exception. One function. You specified the file. The fix itself is two characters — a question mark after card. The agent came back with a 340-line diff touching five files. The variable is now called paymentInstrument. A type that twelve other services import has been renamed. The tests were restructured. JSDoc was added to every function in the file. The change is technically correct throughout. The PR review takes an hour.

This is AI scope creep. Not a bug, not a hallucination — the agent did exactly what it was designed to do. It optimized for what it perceived as "better code" rather than for the scope of what you asked. The problem is not that any individual change is wrong. The problem is that none of the unrequested changes went through the same scrutiny as the change you actually wanted.

Why AI agents expand scope by design

AI coding agents are trained on examples of high-quality code. High-quality code has good variable names, extracted helpers, documentation, and consistent patterns. When an agent sees a function with a variable named card and another function in the codebase with a variable named paymentInstrument, it identifies an inconsistency. Its training signals that consistent naming is better. So it renames the variable.

The agent does not know why the naming is different. It does not know that card is a deliberate convention in the payments module that predates the PaymentInstrument type by two years, or that changing it will require updates to twelve callers, or that the team decided to leave it alone during the last refactor because the blast radius was not worth it. The agent sees a naming inconsistency and resolves it — confidently, silently, across every file it can touch.

This is the structural reason scope creep happens. The agent is not misbehaving. It is applying general code quality heuristics without the system-specific knowledge that would tell it which improvements are appropriate and which are out of scope. Without knowing why the code is written a certain way, it treats every deviation from general best practice as something to fix.

The scope creep taxonomy

AI scope creep is not random. It clusters into recognizable types, each with a characteristic frequency and risk profile. Understanding the taxonomy helps you identify where your AI sessions are most likely to expand beyond their intended scope.


AI scope creep taxonomy — types, frequency, and risk level

TYPE                      FREQUENCY   RISK       EXAMPLE
──────────────────────────────────────────────────────────────────────
Variable renaming         Very high   Low        card → paymentInstrument
                                                 "more descriptive" per the model

Helper extraction         High        Low-med    Takes a 40-line function and
                                                 extracts 2 helper functions
                                                 "for readability"

JSDoc / comment injection High        Low        Adds documentation nobody asked
                                                 for; sometimes describes intent
                                                 incorrectly

Cross-file "consistency"  High        Med        Renames a thing in file A, then
                                                 renames the same thing everywhere
                                                 it appears because "consistency"

Structural refactoring    Med         Med-high   Moves logic into a class or module
                                                 "for better organization"; changes
                                                 the architectural shape of the code

Test restructuring        Med         Med        Reorganizes test files, adds or
                                                 removes describe blocks, adds test
                                                 cases "for coverage" that test the
                                                 wrong contracts

Import cleanup            Med         Low-med    Reorders imports, replaces named
                                                 imports with namespace imports,
                                                 updates to newer API surfaces

Type widening/tightening  Low         High       Changes a type from string to
                                                 string | null "to be safer" and
                                                 silently breaks callers

Variable renaming is the most common form and usually the least consequential in isolation. The issue is that it cascades. When the agent renames a variable in function A, it follows the reference to function B and renames it there too, then in the type definition, then in the tests. A cosmetic change in one file becomes a repository-wide rename that touches everything connected to that name. Each individual step is a valid refactoring decision. The aggregate is a 300-line diff where the actual change is buried.
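The cascade can be pictured as a walk over a reference graph. The sketch below is a minimal illustration, not how any particular agent works: the file names are borrowed from the payment example later in this post, while the edges between them and the renameBlastRadius function are hypothetical.

```typescript
// Hypothetical reference graph: each key is a file, and each value lists the
// files a rename has to follow into once that file has been changed.
type ReferenceGraph = Record<string, string[]>;

// Breadth-first walk from the file where the rename starts, collecting every
// file the rename ends up touching.
function renameBlastRadius(graph: ReferenceGraph, start: string): string[] {
  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const file = queue.shift() as string;
    for (const dependent of graph[file] ?? []) {
      if (!visited.has(dependent)) {
        visited.add(dependent);
        queue.push(dependent);
      }
    }
  }
  return Array.from(visited);
}

// Assumed edges, for illustration only.
const graph: ReferenceGraph = {
  "payments/processPayment.ts": [
    "payments/chargeCard.ts",
    "payments/__tests__/processPayment.test.ts",
  ],
  "payments/chargeCard.ts": [
    "payments/__tests__/chargeCard.test.ts",
    "lib/types/payment.ts",
  ],
  "lib/types/payment.ts": [],
};

const touched = renameBlastRadius(graph, "payments/processPayment.ts");
console.log(touched.length); // 5
```

Every hop in the walk looks locally reasonable. The size of the visited set is what the reviewer ends up paying for.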

Cross-file consistency changes are the most deceptive because they appear to be careful, thoughtful work. The agent is not just fixing the bug — it is ensuring the fix is consistent everywhere related. To a reviewer skimming the diff, this looks like good engineering judgment. The problem is that the agent is enforcing consistency based on its own judgment about what the consistent state should be, not based on the team's conventions or the history of why those names and structures exist. This connects to a broader pattern: AI tools regularly violate team conventions precisely because they are optimizing for general best practices rather than the specific patterns the team has established over time.

Type changes are the highest-risk form even though they are the least frequent. When an agent widens a type from string to string | null "to be safer," it may break callers that do not handle null. When it tightens a type, it may reject inputs that were previously valid. Type changes in shared type files have a blast radius that is often not visible in the diff itself. It shows up as TypeScript errors elsewhere in the codebase or, if the agent also silenced those errors, as runtime failures.
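A small sketch of how a widened type moves a failure from compile time to runtime. All names here (CardToken, lastFour, and so on) are hypothetical, and the example assumes the compiler's strictNullChecks option is enabled.

```typescript
// Hypothetical names; assumes strictNullChecks is on.
type CardToken = string;               // original, narrow type
type WidenedCardToken = string | null; // the "safer" type after widening

// A caller written against the narrow type: it assumes a real string.
function lastFour(token: CardToken): string {
  return token.slice(-4);
}

// After widening, passing a WidenedCardToken to lastFour() is a compile error.
// A non-null assertion (!) silences the compiler without making the call safe.
function lastFourSilenced(token: WidenedCardToken): string {
  return lastFour(token!);
}

// A caller that has actually been updated to handle the widened type.
function lastFourHandled(token: WidenedCardToken): string {
  return token === null ? "none" : lastFour(token);
}

let runtimeFailure = "";
try {
  lastFourSilenced(null);
} catch (err) {
  runtimeFailure = err instanceof TypeError ? "TypeError" : "other";
}

console.log(runtimeFailure);              // "TypeError"
console.log(lastFourHandled(null));       // "none"
console.log(lastFourHandled("tok_4242")); // "4242"
```

The silenced version is the dangerous one: it compiles, it passes any test that only uses real tokens, and it fails the first time a null actually arrives.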

Why unapproved changes are riskier than wrong changes

Wrong changes are visible. They fail tests, break linting, trigger TypeScript errors, or produce observable failures in review or QA. The team catches them and reverts them. The feedback loop is tight.

Unapproved changes that are technically correct are invisible. They pass tests. They pass linting. They pass review, because reviewers are checking for correctness, not for whether each change was actually requested. They land in production and stay there. Nobody owns the decision to rename card to paymentInstrument. Nobody evaluated whether the extracted helper function correctly represents the abstraction the team wants. Nobody reviewed the JSDoc for accuracy against the actual system behavior. The decisions were made by the agent, and they were approved by reviewers who did not realize they were approving them.

This is how AI scope creep compounds into a harder-to-reason-about codebase. Not through dramatic failures, but through the accumulation of unreviewed micro-decisions that gradually diverge the code from the team's understanding of it. Nobody made a conscious choice to rename the PaymentCard type to PaymentInstrument. It just appeared in a PR that was otherwise about a one-line null check, approved because the overall diff looked reasonable.

The review attention problem

Code review has a familiar attention problem: reviewers tend to allocate a roughly fixed amount of attention to a PR regardless of diff size. A 300-line diff gets approximately the same total review time as a 30-line diff, so each of its lines gets a much faster skim.

AI scope creep exploits this directly. When the agent expands a one-line fix into a 300-line refactor, it dilutes the review attention that the actual change receives. The reviewer is now reading variable renames, helper function extractions, JSDoc additions, and test restructuring. By the time they reach the actual null check, they have spent most of their attention budget on changes that were never the point of the PR.
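To make the dilution concrete, here is a back-of-the-envelope sketch. The 10-minute review budget is an assumed number chosen for easy arithmetic, not a measurement, and secondsPerLine is a hypothetical helper.

```typescript
// Illustrative only: a fixed review budget spread evenly over a diff.
function secondsPerLine(budgetSeconds: number, diffLines: number): number {
  return budgetSeconds / diffLines;
}

const budgetSeconds = 10 * 60; // assumed 10-minute review budget

console.log(secondsPerLine(budgetSeconds, 30));  // 20
console.log(secondsPerLine(budgetSeconds, 300)); // 2
```

Under that assumption, the line that actually fixes the bug gets a tenth of the scrutiny it would have received in the smaller diff.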

The scope creep changes also benefit from a cognitive halo effect. Because the agent produced all of them in one session, reviewers tend to treat the whole diff as a coherent unit — either trusting the whole thing or questioning the whole thing. In practice, reviewers usually extend the trust they have in the actual fix to the surrounding unrequested changes. The null check is obviously correct, so the variable renames probably are too. This is how unrequested architectural changes pass review: they ride in on the coattails of changes that were clearly right.

There is an additional complication when the scope-crept changes touch tests. AI-generated tests frequently misrepresent system contracts when the agent lacks grounding in how the system actually behaves. A test that the agent adds "for coverage" during a scope creep session may test the agent's model of what a function should do rather than what it actually does — and that test will pass, making the code look well-covered while the system contract it is actually verifying is wrong.

How system context constrains scope

The root cause of AI scope creep is the agent not knowing why code is written a certain way. Variable card looks improvable to an agent that does not know it is a deliberate convention. Inline logic looks extractable to an agent that does not know it was extracted and re-inlined six months ago for a reason. The gap between what the agent can see and what it needs to know is where scope creep happens.


Same task, two outcomes — AI with vs. without system context

TASK: "The processPayment function throws when card is undefined.
       Fix the null check."

─── WITHOUT SYSTEM CONTEXT ───────────────────────────────────────
Agent has: the open file, general TypeScript knowledge
Agent reasoning (approximate): "processPayment uses a variable called
  card. paymentInstrument would be a more descriptive name for it. While
  I'm here, I should make the naming consistent. And add documentation.
  And the validateCard logic is inline when it should be extracted..."

Result: 340-line diff, 5 files, renames a type that 12 other files import.

─── WITH SYSTEM CONTEXT ──────────────────────────────────────────
Agent has: same file + semantic understanding of the whole codebase

What the context shows:
  -> PaymentInstrument is a core domain type imported in 12 services
  -> card is the conventional parameter name in all 8 payment functions
  -> chargeCard was deliberately extracted from processPayment 4 months ago
     as a separate unit for isolated testing (visible in the git history)
  -> the team uses a centralized validatePaymentMethod() from shared/lib/payments
     for validation — there is no pattern of inline validators

Agent reasoning: "The card naming convention is established. The type
  name is used across the codebase. The scope of this task is the null
  check. I should not rename things that are consistent with existing usage."

Result:
- if (card.token) {
+ if (card?.token) {

Diff: 1 line. The one line you asked for.

When an AI session has access to a semantic index of the codebase — not just the open file, but the behavioral patterns, naming conventions, and structural decisions across the whole repository — it can distinguish between "this looks improvable by general standards" and "this is consistent with how this team has deliberately built this system." The agent stops treating every apparent inconsistency as something to fix, because it understands that the inconsistency exists for a reason or reflects an established local convention.
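One narrow slice of that idea can be sketched in a few lines: before renaming a parameter, check whether the current name is already the dominant one for its type. This is a simplified illustration, not a description of how a semantic index is built. The ParamUsage shape, the function names other than processPayment and chargeCard, and the 0.8 threshold are all assumptions.

```typescript
// Hypothetical input shape: one record per parameter found in the codebase.
interface ParamUsage {
  fn: string;        // function the parameter belongs to
  paramName: string; // name used at that site
  typeName: string;  // declared type of the parameter
}

interface Convention {
  name: string;  // most common parameter name for the type
  share: number; // fraction of usages with that name, between 0 and 1
}

// Finds the dominant parameter name for a type and how dominant it is.
function dominantName(usages: ParamUsage[], typeName: string): Convention | null {
  const counts = new Map<string, number>();
  let total = 0;
  for (const usage of usages) {
    if (usage.typeName !== typeName) continue;
    counts.set(usage.paramName, (counts.get(usage.paramName) ?? 0) + 1);
    total += 1;
  }
  if (total === 0) return null;
  let best = "";
  let bestCount = 0;
  counts.forEach((count, name) => {
    if (count > bestCount) {
      best = name;
      bestCount = count;
    }
  });
  return { name: best, share: bestCount / total };
}

// A rename is flagged when it moves away from a name most call sites share.
function renameBreaksConvention(
  usages: ParamUsage[],
  typeName: string,
  currentName: string,
  threshold = 0.8, // assumed cutoff for "established convention"
): boolean {
  const convention = dominantName(usages, typeName);
  return convention !== null
    && convention.name === currentName
    && convention.share >= threshold;
}

// Eight payment functions that all call the parameter "card".
const paymentFns = [
  "processPayment", "chargeCard", "refundCard", "voidCharge",
  "authorizeCard", "captureCharge", "verifyCard", "tokenizeCard",
];
const usages: ParamUsage[] = paymentFns.map((fn) => ({
  fn,
  paramName: "card",
  typeName: "PaymentCard",
}));

console.log(dominantName(usages, "PaymentCard"));                   // { name: "card", share: 1 }
console.log(renameBreaksConvention(usages, "PaymentCard", "card")); // true
```

A check like this only sees what is countable. It says nothing about why the convention exists, which is the part that still has to come from history and structure.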

This requires more than a rules file or a CLAUDE.md. Rules files capture the conventions someone thought to write down. They do not capture the implicit patterns that exist in the codebase's history and structure — the naming that has been consistent across thirty payment functions for two years, the type that is deliberately narrow because the team made a conscious choice about what to accept at that boundary, the helper function that was intentionally inlined because the extraction created more complexity than it removed. Those patterns are visible in the codebase. They are not visible in a text file that describes the codebase.

What scoped AI assistance actually looks like

The goal is not to prevent AI agents from improving code. It is to ensure that improvements are deliberate rather than incidental — that when the agent changes something beyond the immediate task, it is because the developer chose to address that scope, not because the agent decided to on its own.

The practical levers are: task specificity, context quality, and review discipline. Task specificity means being explicit about what is in and out of scope: "fix the null check, do not rename variables or touch other files." Context quality means giving the agent enough system understanding to recognize which apparent improvements are actually improvements for this codebase, versus which are locally inconsistent with established patterns. Review discipline means treating a 300-line diff from a one-line task as a signal that something went wrong, not as a thorough job.
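The review discipline lever can be partly automated. The sketch below parses the output of git diff --numstat (one line per file, in the form added, tab, deleted, tab, path) and compares it against a declared scope. The checkScope function, the ScopeReport shape, and the per-file counts in the example are hypothetical; the counts were made up to total the 340 lines used in this post. Renamed files, which numstat reports with an "old => new" path, are not handled.

```typescript
interface ScopeReport {
  changedLines: number;      // added + deleted lines across all files
  outOfScopeFiles: string[]; // files changed that the task did not name
  withinScope: boolean;
}

// Parses `git diff --numstat` output: "<added>\t<deleted>\t<path>" per line.
// Binary files report "-" for both counts and are counted as 0 lines here.
function checkScope(
  numstat: string,
  allowedFiles: string[],
  maxChangedLines: number,
): ScopeReport {
  let changedLines = 0;
  const outOfScopeFiles: string[] = [];
  for (const line of numstat.split("\n")) {
    if (line.trim() === "") continue;
    const [added, deleted, ...pathParts] = line.split("\t");
    const path = pathParts.join("\t");
    changedLines += (Number(added) || 0) + (Number(deleted) || 0);
    if (allowedFiles.indexOf(path) === -1) outOfScopeFiles.push(path);
  }
  return {
    changedLines,
    outOfScopeFiles,
    withinScope: outOfScopeFiles.length === 0 && changedLines <= maxChangedLines,
  };
}

// Declared scope: one file, one modified line (1 added + 1 deleted).
const allowed = ["payments/processPayment.ts"];

const scopedDiff = "1\t1\tpayments/processPayment.ts\n";
const creptDiff = [
  "120\t60\tpayments/processPayment.ts",
  "40\t20\tpayments/chargeCard.ts",
  "50\t30\tpayments/__tests__/processPayment.test.ts",
  "8\t8\tpayments/__tests__/chargeCard.test.ts",
  "2\t2\tlib/types/payment.ts",
].join("\n");

console.log(checkScope(scopedDiff, allowed, 2).withinScope); // true
console.log(checkScope(creptDiff, allowed, 2));
// changedLines: 340, four out-of-scope files, withinScope: false
```

A check like this does not judge whether the extra changes are good. It only makes the expansion visible before the review starts, which is the point at which someone can still decide to split the PR.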

The context quality lever is the one most teams underinvest in. It is also the one with the highest leverage. An agent that understands why the code is written the way it is will naturally produce more scoped output, because it can evaluate proposed improvements against the actual system rather than against general heuristics. That evaluation is what separates an agent that improves your code from an agent that noisifies your diff.

Before and after: a one-line fix that did not stay one line

Task: "Fix the null check in processPayment — it's throwing when card is undefined"

--- What you expected ---
-  if (card.token) {
+  if (card?.token) {
     await chargeCard(card.token, amount)
   }

Diff: 1 line. Review time: 30 seconds.

--- What the agent actually changed ---
payments/processPayment.ts
  -> fixed the null check (the thing you asked for)
  -> renamed: card → paymentInstrument, amount → chargeAmount
  -> extracted helper: validatePaymentInstrument()
  -> added JSDoc to processPayment, validatePaymentInstrument, chargeCard

payments/chargeCard.ts
  -> renamed parameters to match processPayment "for consistency"
  -> added JSDoc (referenced the renamed parameter names)

payments/__tests__/processPayment.test.ts
  -> updated all references to renamed variables
  -> added test cases for the new validatePaymentInstrument helper
  -> restructured describe block "to follow testing conventions"

payments/__tests__/chargeCard.test.ts
  -> updated all references to renamed parameters

lib/types/payment.ts
  -> renamed PaymentCard → PaymentInstrument "for consistency with the new naming"

Diff: 340 lines across 5 files.
PR review time: 1 hour, 12 minutes.
Lines that were requested: 1.
Lines that were not requested: 339.

Final take

AI scope creep is not a bug in the tools. It is the predictable output of an optimization process with insufficient constraint. The agent is doing what it is designed to do — produce better code — without the system knowledge to correctly evaluate what "better" means in the context of your codebase.

The unapproved changes that ride along with AI-assisted PRs are not harmless. They carry decisions that nobody reviewed, conventions that nobody validated, and architectural choices that nobody owned. Over time, they accumulate into a codebase that is harder to reason about. Nothing in it is wrong, but nobody made the decisions that produced it.

The fix is context, not constraint. An agent that knows your codebase will generate tighter diffs naturally. It will recognize conventions, understand why structures exist, and limit its improvements to the scope it was given. You stop spending an hour reviewing a one-line PR — and you stop discovering two months later that a type you rely on was silently renamed during a bug fix.