The Feature Flag Graveyard: Why Stale Flags Mislead AI and Block Deployments

The codebase has 47 feature flags. Six are active experiments. Four are permanent killswitches for emergency use. The other 37 — nobody knows. ENABLE_NEW_CHECKOUT_FLOW is set to true everywhere and has been for 18 months. LEGACY_PAYMENT_FALLBACK is set to false but references a service that still exists. BETA_INVITE_SYSTEM wraps a code path that runs for every user. Nobody dares touch them. They have become load-bearing walls.

Feature flags were supposed to make deployment safer. Add a flag, roll out gradually, clean up when done. The first two steps happen reliably. The third step — clean up when done — almost never does. Not because engineers are lazy, but because “clean up when done” requires answering a question nobody budgets time for: what exactly does this flag wrap, in what state is it running in production, and what would break if it were removed?

That question is expensive to answer. So teams defer it. Then they defer it again. Eventually the flag becomes a permanent feature of the codebase, and anyone who might know the answer has left or moved on to different work. The flag stays. The dead code stays. The cognitive overhead accumulates quietly, until the codebase is full of things nobody fully understands.

How flag accumulation happens

Flags are cheap to create. Any engineer can add a flag check in an afternoon. The deployment risk of the wrapped feature is reduced immediately. The flag pays for itself on the first deploy. From the perspective of the engineer adding it, the calculus is clear: low cost, high safety benefit.

Removal is the inverse. Removing a flag requires understanding everything the flag wraps, confirming the flag state in every environment, verifying that the dead branch is truly dead and that no test depends on it behaving a specific way, and making a judgment call about whether any downstream system expects the legacy behavior. That is an afternoon of investigation in the best case. It is several days when the flag has been in the codebase for 18 months and the original author is gone.

The asymmetry is fundamental. Creation is fast and local. Removal is slow and system-wide. Until you have a way to answer system-wide questions quickly, the incentive will always favor creating flags over removing them. And so the graveyard grows.

Sprint planning makes it worse. Cleanup work without a visible customer impact is perpetually deprioritized. “Remove old feature flags” does not compete well against “ship new feature” in a backlog refinement meeting. Product managers are not wrong to deprioritize it — the cost of flag accumulation is diffuse and invisible until it is not. By then the cleanup backlog is enormous and every item on it requires significant investigation before it can be executed safely.

The four types of stale flags

The four types of stale flags and what each one costs:

  TYPE 1: Permanent-by-accident
  Pattern: ENABLE_NEW_CHECKOUT_FLOW = true everywhere for 18 months
  Cost:
  → two code paths maintained where only one runs
  → the false-branch receives improvements it will never use
  → AI tools write tests for dead code, reducing signal in test suites
  → new engineers read both branches and try to understand both
  Removal complexity: high — requires confidence that the false branch
  is truly dead in all environments and that no test depends on it

  TYPE 2: Emergency killswitch past its emergency
  Pattern: DISABLE_BULK_EXPORT = false for 12 months
  Cost:
  → the bulk export code carries a disabled mode that never runs
  → on-call engineers look at the killswitch during incidents and
    must rule it out before investigating the real issue
  → the "emergency" it protected against may no longer exist —
    but nobody verified that before leaving the flag in place
  Removal complexity: medium — requires confirming the failure mode
  the killswitch protected against is no longer reachable

  TYPE 3: Experiment with no committed outcome
  Pattern: NEW_RECOMMENDATION_ALGO — neither variant was adopted
  Cost:
  → two recommendation algorithms maintained in parallel
  → neither is the canonical one
  → new features added to one but not the other
  → A/B test was inconclusive and nobody made the call
  Removal complexity: high — requires a product decision that was deferred,
  then system confirmation of which variant is actually in production

  TYPE 4: "Just in case" flag for behavior nobody has triggered
  Pattern: FALLBACK_EMAIL_PROVIDER = false, never triggered in 18 months
  Cost:
  → a secondary email provider integration maintained indefinitely
  → credentials rotated, dependencies updated, tests maintained
  → the fallback scenario has not occurred and may never occur
  → if the primary provider has been stable, the fallback is theoretical
  Removal complexity: low-medium — requires confirming fallback was
  never triggered and that the failure scenario is acceptably covered

What distinguishes these four types is not the cost of keeping them in place, which is real but manageable in isolation. It is the cost of removing them. Each type requires a different kind of understanding before removal is safe. Permanent-by-accident flags need confidence that one branch is dead in every environment. Killswitch flags need confirmation that the emergency scenario is no longer live. Experiment flags need a product decision that was explicitly deferred. “Just in case” flags need operational history showing the fallback was never invoked.

All four of those confirmations require system-level knowledge. Not local knowledge about the flag itself, but knowledge about what the flag wraps, how its state is evaluated, and what would happen across the system if the wrapped code were unconditionally committed. That knowledge does not come from reading the flag declaration. It comes from tracing the system.
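The four types, and the confirmation each one needs before removal, can be sketched as a small classifier. This is a minimal TypeScript sketch under stated assumptions, not part of any real flag service: the `FlagRecord` fields, the `classify` function, and the six-month staleness threshold are all hypothetical and exist only to illustrate the taxonomy above.

```typescript
// Hypothetical metadata for one flag. None of these fields come from a real flag service.
type FlagKind = "rollout" | "killswitch" | "experiment" | "fallback";

interface FlagRecord {
  name: string;
  kind: FlagKind;
  monthsUnchanged: number;  // months since the production value last changed
  outcomeDecided?: boolean; // experiments: was a variant ever adopted?
  everTriggered?: boolean;  // fallbacks: was the fallback ever invoked?
}

interface Classification {
  type: string;
  removalComplexity: string;
  confirmBeforeRemoval: string;
}

const STALE_AFTER_MONTHS = 6; // assumed threshold, tune per team

const NOT_STALE: Classification = {
  type: "not stale",
  removalComplexity: "n/a",
  confirmBeforeRemoval: "nothing yet",
};

function classify(flag: FlagRecord): Classification {
  if (flag.monthsUnchanged < STALE_AFTER_MONTHS) {
    return NOT_STALE;
  }
  if (flag.kind === "killswitch") {
    return {
      type: "killswitch past its emergency",
      removalComplexity: "medium",
      confirmBeforeRemoval: "the failure mode it guarded against is no longer reachable",
    };
  }
  if (flag.kind === "experiment" && !flag.outcomeDecided) {
    return {
      type: "experiment with no committed outcome",
      removalComplexity: "high",
      confirmBeforeRemoval: "a product decision, then which variant production actually runs",
    };
  }
  if (flag.kind === "fallback") {
    // A fallback that has actually fired is doing its job and is not stale.
    if (flag.everTriggered) {
      return NOT_STALE;
    }
    return {
      type: "just in case",
      removalComplexity: "low-medium",
      confirmBeforeRemoval: "the fallback was never triggered and the failure scenario is covered",
    };
  }
  // Rollout flags and decided experiments whose value has not moved in a long time.
  return {
    type: "permanent-by-accident",
    removalComplexity: "high",
    confirmBeforeRemoval: "one branch is dead in every environment and no test depends on it",
  };
}

const checkout = classify({ name: "ENABLE_NEW_CHECKOUT_FLOW", kind: "rollout", monthsUnchanged: 18 });
const emailFallback = classify({
  name: "FALLBACK_EMAIL_PROVIDER",
  kind: "fallback",
  monthsUnchanged: 18,
  everTriggered: false,
});
```

The classifier only names the confirmation that is needed. It cannot supply it: that still requires tracing what the flag wraps.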

What stale flags actually cost

The cost of an individual stale flag is manageable. The cost of 37 stale flags compounds across multiple dimensions that teams do not usually account for explicitly.

Code complexity

Every feature flag is a branch. A codebase with 37 stale flags has 37 conditional branches whose condition is either always true or always false in production, a fact the code itself does not record. Every engineer reading a flag-wrapped section has to evaluate whether the flag might be off somewhere, whether the dead branch matters, and whether changes to the dead branch have any effect. That cognitive overhead is not zero. It multiplies across every code review, every debugging session, and every new engineer onboarding.

Test coverage gaps

When a flag wraps two code paths, tests cover combinations of flag states. Some of those combinations never occur in production. Tests that exercise the dead branch pass reliably — because the dead branch has no interaction with production behavior — but they provide no signal about what actually runs. A test suite that exercises production behavior at 80% coverage but appears to cover 95% because the dead-branch tests are included is actively misleading. It erodes confidence in the test suite as an indicator of real risk.
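The gap between reported and real coverage is simple arithmetic, sketched below in TypeScript. The `RegionCoverage` shape and the line counts are hypothetical, chosen only so the totals reproduce the 80% and 95% figures above. The `deadInProduction` marker is an assumption too: a coverage tool does not know it, so it has to come from flag state.

```typescript
// Hypothetical coverage summary for one group of code.
interface RegionCoverage {
  region: string;
  lines: number;
  coveredLines: number;
  deadInProduction: boolean; // assumed to come from flag state, not from the coverage tool
}

// Line coverage across all regions passed in, as a percentage.
function coveragePercent(regions: RegionCoverage[]): number {
  let lines = 0;
  let covered = 0;
  for (const r of regions) {
    lines += r.lines;
    covered += r.coveredLines;
  }
  return lines === 0 ? 0 : (covered * 100) / lines;
}

// The same calculation, restricted to code that production can actually execute.
function liveCoveragePercent(regions: RegionCoverage[]): number {
  return coveragePercent(regions.filter((r) => !r.deadInProduction));
}

// Illustrative numbers only.
const suite: RegionCoverage[] = [
  { region: "live paths", lines: 250, coveredLines: 200, deadInProduction: false },
  { region: "dead branches behind stale flags", lines: 750, coveredLines: 750, deadInProduction: true },
];

const reported = coveragePercent(suite);  // 95
const live = liveCoveragePercent(suite);  // 80
```

In this line-based model the reported number rises with every well-tested dead branch, while the number that reflects production risk does not move.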

Dead code maintenance

Dead code gets maintained. Dependencies in the dead branch get upgraded during routine dependency updates. Linter errors in the dead branch get fixed during code cleanup passes. Engineers spend real time on code that executes nowhere. That time is not dramatic in a single sprint, but across a year of 37 stale flags it adds up to weeks of engineering effort spent maintaining code that could be deleted.

The AI coding problem specifically

Stale feature flags are a pre-existing problem. AI coding tools make it significantly worse, in ways that are not immediately obvious.

When an AI coding tool sees a flag-wrapped code path, it does not know whether the flag is on or off in production. From the model's perspective, both branches are live code. It will suggest improvements to dead branches. It will write tests for code paths that never execute. It will “fix” issues in the legacy fallback that has been disabled for 14 months. It will optimize paths that appear conditional but are always enabled. The model is working correctly — it is the codebase that is lying to it.

What AI coding tools see vs. what production actually runs:

  ─────────────────────────────────────────────────────
  Flag: ENABLE_NEW_CHECKOUT_FLOW
  Code:
    if (featureFlags.get('ENABLE_NEW_CHECKOUT_FLOW')) {
      return newCheckoutService.process(cart)
    } else {
      return legacyCheckoutService.process(cart)
    }

  What the AI sees: a conditional with two live branches
  What production runs: the if-branch, always, for 18 months
  What the else-branch is: dead code nobody removed

  AI behavior:
  → suggests improvements to legacyCheckoutService (dead code)
  → writes tests for both branches (one branch never executes)
  → proposes "optimization" that modifies a path nobody uses
  → misses that newCheckoutService is the only real surface to test
  ─────────────────────────────────────────────────────
  Flag: LEGACY_PAYMENT_FALLBACK
  Code:
    if (!featureFlags.get('LEGACY_PAYMENT_FALLBACK')) {
      return primaryPaymentGateway.charge(order)
    } else {
      return legacyGateway.charge(order)  // references active service
    }

  What the AI sees: a killswitch with a live fallback service
  What production runs: the if-branch (flag is false everywhere)
  Reality: legacyGateway still exists, but this path never runs

  AI behavior:
  → treats legacyGateway as a live dependency to maintain
  → includes it in dependency upgrade analysis
  → flags it as "missing error handling" (never reached)
  → assumes changes to it could affect production behavior
  ─────────────────────────────────────────────────────
  Flag: BETA_INVITE_SYSTEM
  Code:
    if (featureFlags.get('BETA_INVITE_SYSTEM')) {
      return betaInviteService.sendInvite(user)
    }

  What the AI sees: a beta feature for a subset of users
  What production runs: this code path, for every single user
  What "beta" means here: it shipped 14 months ago and was never graduated

  AI behavior:
  → suggests keeping it gated "until beta is complete"
  → proposes lower test coverage than a permanent feature warrants
  → treats it as optional code rather than the primary invite path

The first example is the most common failure mode: the AI invests engineering effort in dead code because it cannot distinguish a stale always-true flag from a real conditional. The second example shows the reverse: the AI maintains a dependency that has no production path because the flag that would reach it is always false. The third is the most insidious: a “beta” flag that gates the path every user takes, leading the AI to reason about it as optional code rather than the canonical path.

None of these are failures of the AI model. They are failures of information. The AI has code but not context. It sees structure but not state. It cannot know that ENABLE_NEW_CHECKOUT_FLOW has been true in every environment for 18 months unless something tells it. A stale flag is not just technical debt for humans to navigate — it is misinformation for AI systems that produce code based on what they read.
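One way to give a tool that missing state is to hand it a short note about each flag alongside the code. The following is a minimal TypeScript sketch of that idea, not a feature of any particular AI tool. The `FlagState` shape, the `buildFlagContext` function, and the six-month default threshold are assumptions, and it models production only rather than every environment.

```typescript
// Hypothetical snapshot of a flag's production state.
interface FlagState {
  name: string;
  productionValue: boolean;
  monthsUnchanged: number;
}

// Turn one flag's state into a sentence a person or a coding tool can read.
function describeFlag(flag: FlagState, staleAfterMonths: number = 6): string {
  if (flag.monthsUnchanged >= staleAfterMonths) {
    return (
      `${flag.name}: ${flag.productionValue} in production for ${flag.monthsUnchanged} months. ` +
      `Treat code that runs only when it is ${!flag.productionValue} as dead.`
    );
  }
  return `${flag.name}: currently ${flag.productionValue} in production, changed recently. Treat both branches as live.`;
}

// Build a plain-text block that could be included with the code shown to the tool.
function buildFlagContext(flags: FlagState[]): string {
  // An explicit arrow function keeps map's index argument away from staleAfterMonths.
  return flags.map((f) => describeFlag(f)).join("\n");
}

const context = buildFlagContext([
  { name: "ENABLE_NEW_CHECKOUT_FLOW", productionValue: true, monthsUnchanged: 18 },
  { name: "BETA_INVITE_SYSTEM", productionValue: true, monthsUnchanged: 14 },
]);
```

The note does not remove the flag. It only stops the reader from treating both branches as equally live, and it is only as accurate as the state snapshot behind it.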

As AI coding tools become more central to how teams write and modify code, the cost of codebase misinformation rises. Every stale flag that causes an AI tool to work on dead code, maintain unused dependencies, or underweight the importance of the canonical path is a productivity leak that compounds with every session. The problem existed before AI. AI made it urgent.

Why “just audit the flags” does not work

Flag audit initiatives are not new. Most teams have tried them and most teams have not finished them. The pattern is consistent: someone proposes a flag audit, a list of flags is assembled, engineers are assigned to evaluate specific flags, and work stalls when each evaluation requires more investigation than anticipated.

The investigation problem is not engineering reluctance. It is that answering the questions safely requires system-wide tracing that takes time proportional to flag age and complexity. A flag added last month can be removed in an hour by the engineer who added it. A flag added 18 months ago, touching a service with two dependents, by an engineer who has since moved to a different team, with no associated cleanup ticket and no documentation of original intent, is a multi-day investigation.

The audit stalls on exactly the flags that most need to be removed: the ones with the most accumulated complexity, the most uncertainty about production state, and the least institutional memory to draw on. Easy flags get cleaned up. Hard flags become permanent. The graveyard grows from the bottom up.

What flag archaeology actually requires

Flag audit questions that need system-level answers:

  For each flag, to determine if it is safe to remove:

  Is this flag still referenced in the codebase?
  → text search finds the declaration, but does any active path call it?
  → are there dead callers that were removed but left the flag behind?

  What does this flag wrap?
  → a single method? a service? an entire workflow?
  → how many code paths pass through the conditional?
  → are there nested flags that depend on this one being set?

  What is its current state in each environment?
  → production: true or false?
  → staging: same as production, or being used for testing?
  → is the value hardcoded, env-var driven, or from a feature flag service?

  What would break if this flag were removed?
  → if we delete the false branch, do any integration tests break?
  → does any other service call this service expecting the legacy behavior?
  → are there database migrations or schema changes tied to flag state?

  When was it last changed?
  → when was the flag value last modified in the feature flag service?
  → has any code in the wrapped paths been modified recently?
  → is there a Jira ticket that references this flag?

  Who created it and why?
  → is there a commit message or ticket that explains the original purpose?
  → was it a rollout flag, an emergency killswitch, or an experiment?
  → was there ever a cleanup ticket filed and abandoned?

The questions above cannot be answered by text search. You can find every reference to LEGACY_PAYMENT_FALLBACK in the repository with a grep. What grep does not tell you is whether any of those references are in live code paths, whether the service referenced in the dead branch is still active and maintained, whether any integration test depends on the flag being evaluated as true, or whether the failure scenario the killswitch was designed for can still occur.
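The audit questions can be written down as a record in which every answer starts out unknown, which makes the limits of text search concrete. This is a minimal TypeScript sketch: the `FlagAudit` fields, the helper functions, and the file path in the example are hypothetical and do not correspond to any tool's API.

```typescript
// One record per flag. A field left undefined means the question is still open.
interface FlagAudit {
  flag: string;
  referenceLocations?: string[];                // where the flag name appears
  wrappedUnits?: string[];                      // methods, services, or workflows it wraps
  stateByEnvironment?: Record<string, boolean>; // e.g. { production: true, staging: true }
  breaksOnRemoval?: string[];                   // tests, services, or migrations that would break
  lastChanged?: string;                         // date the flag value was last modified
  originalPurpose?: string;                     // rollout, killswitch, experiment, ...
}

// List the audit questions that nobody has answered yet.
function openQuestions(audit: FlagAudit): string[] {
  const checks: Array<[string, unknown]> = [
    ["Is this flag still referenced in the codebase?", audit.referenceLocations],
    ["What does this flag wrap?", audit.wrappedUnits],
    ["What is its current state in each environment?", audit.stateByEnvironment],
    ["What would break if this flag were removed?", audit.breaksOnRemoval],
    ["When was it last changed?", audit.lastChanged],
    ["Who created it and why?", audit.originalPurpose],
  ];
  return checks.filter(([, answer]) => answer === undefined).map(([question]) => question);
}

// Removal is safe only when every question is answered and nothing would break.
function safeToRemove(audit: FlagAudit): boolean {
  return (
    openQuestions(audit).length === 0 &&
    audit.breaksOnRemoval !== undefined &&
    audit.breaksOnRemoval.length === 0
  );
}

// What a text search alone can fill in (the path is made up for illustration).
const afterGrep: FlagAudit = {
  flag: "LEGACY_PAYMENT_FALLBACK",
  referenceLocations: ["src/payments/charge.ts"],
};

const stillOpen = openQuestions(afterGrep); // five of the six questions remain
```

Text search answers the first question and leaves the other five open, and those five are the ones that decide whether removal is safe.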

Behavioral understanding is what the audit requires. Not just where the flag appears — but what the flag controls, how its state flows through the system, what code paths are actually live versus dead, and what the blast radius of removal would be. That is semantic understanding, not text search. It requires tracing execution across services, connecting configuration state to behavioral outcomes, and identifying the downstream dependencies that would be affected by removing the conditional.

The manual version of this can take up to two weeks on a complex flag. Not because engineers are slow, but because building that understanding from first principles (reading code, tracing calls, checking environment configurations, verifying test coverage) is inherently time-consuming when you are starting from scratch for each flag in a list of 37.

Making the system traceable

Kognita indexes the codebase semantically — not as text, but as a graph of logical units, dependencies, and execution paths. A flag audit that used to be a two-week manual archaeology project becomes a series of queries against that index.

Is ENABLE_NEW_CHECKOUT_FLOW still referenced in live code paths? The semantic index traces the flag from its declaration through its evaluators to the code paths that branch on it — and identifies whether any of those paths are reachable. Not by reading every file, but by following the execution graph.

What does LEGACY_PAYMENT_FALLBACK wrap? The index surfaces the service, the method calls, the downstream dependencies, and the database tables touched if the flag were true — the full blast radius of the dead branch, assembled from the code without manual tracing.
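The two queries above, reachability and blast radius, are both graph traversals. The following TypeScript sketch shows the general technique on a toy call graph. It is not Kognita's implementation or API: the `Edge` shape, the function names, and every node other than `primaryPaymentGateway.charge` and `legacyGateway.charge` are assumptions made for illustration.

```typescript
// An edge in a hypothetical call graph. `requires` marks an edge that is only
// taken when the named flag has the given value.
interface Edge {
  from: string;
  to: string;
  requires?: { flag: string; value: boolean };
}

type FlagValues = Record<string, boolean>;

// Breadth-first search that skips edges whose flag condition does not hold.
function reachable(edges: Edge[], entry: string, flags: FlagValues): string[] {
  const order: string[] = [entry];
  const seen: Record<string, boolean> = { [entry]: true };
  for (let i = 0; i < order.length; i++) {
    for (const edge of edges) {
      if (edge.from !== order[i] || seen[edge.to]) continue;
      if (edge.requires && flags[edge.requires.flag] !== edge.requires.value) continue;
      seen[edge.to] = true;
      order.push(edge.to);
    }
  }
  return order;
}

// Nodes reachable only when `flag` takes the value production never uses.
function deadBranchOnly(edges: Edge[], entry: string, flags: FlagValues, flag: string): string[] {
  const live = reachable(edges, entry, flags);
  const flipped = reachable(edges, entry, { ...flags, [flag]: !flags[flag] });
  return flipped.filter((node) => live.indexOf(node) === -1).sort();
}

const FLAG = "LEGACY_PAYMENT_FALLBACK";
const graph: Edge[] = [
  { from: "chargeOrder", to: "primaryPaymentGateway.charge", requires: { flag: FLAG, value: false } },
  { from: "chargeOrder", to: "legacyGateway.charge", requires: { flag: FLAG, value: true } },
  { from: "legacyGateway.charge", to: "legacyPaymentService" },
  { from: "legacyGateway.charge", to: "auditLog" },
  { from: "primaryPaymentGateway.charge", to: "auditLog" },
];

const production: FlagValues = { [FLAG]: false };
const liveNodes = reachable(graph, "chargeOrder", production);
// → ["chargeOrder", "primaryPaymentGateway.charge", "auditLog"]
const removable = deadBranchOnly(graph, "chargeOrder", production, FLAG);
// → ["legacyGateway.charge", "legacyPaymentService"]
```

`auditLog` stays out of the removable set because the live branch also reaches it. That distinction between code exclusive to the dead branch and code shared with the live one is what a text search cannot make.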

Is there a Jira ticket for cleaning up BETA_INVITE_SYSTEM? Kognita's Jira MCP integration connects the code artifact to the project history — finding cleanup tickets that were filed and abandoned, or confirming that no cleanup was ever planned.

The index updates automatically. When a flag is removed, the references disappear from the index. When a new flag is added, it appears. The audit is not a one-time project; it is a continuous query against a system that always reflects the current codebase. The graveyard that has been building for two years does not have to be cleaned up in a single two-week sprint. It can be addressed incrementally, flag by flag, with confidence in each removal because the system can answer the safety questions before the deletion is made.

Final take

Flags do not become load-bearing walls because teams are lazy. They become load-bearing because removal requires understanding you can only get by tracing the whole system, and the whole-system trace has always been too expensive to justify for a single flag cleanup task.

The graveyard is not a discipline problem. It is a tooling problem. When the audit questions — what does this flag wrap, what is its production state, what would break if it were removed — can be answered in minutes instead of days, the calculus changes. Cleanup becomes proportionate to the problem. Flags get removed when they should be removed, not when a two-week project can be justified in a planning meeting.

Make the system traceable, and the graveyard starts to shrink. That is not aspirational — it is a consequence of making the right answers cheap to obtain.