What Code Indexing Cannot Answer (And What Fills the Gap)

Code indexing — semantic or otherwise — is good at answering one category of question: "where is X in the codebase?" It finds definitions, surfaces similar patterns, maps imports. This is a real improvement over grep. But the questions that slow engineering teams down the most are not "where is X" questions — they are "what happens if I change X" and "why does X work this way" questions. Those require context that source code indexing, by design, does not contain.

What code indexing answers and what it does not

The category of questions a code index can answer is limited to what is derivable from static source analysis — definitions, relationships, patterns, imports. The category of questions it cannot answer is larger and includes the questions that actually slow teams down:

Answerable vs. unanswerable questions for a code index

Code index answers:
  ✓ "Where is X defined?"
  ✓ "What functions are called by X?"
  ✓ "Are there other files that look like X?"
  ✓ "What does this class import?"
  ✓ "Where is this pattern used across the codebase?"

Code index cannot answer:
  ✗ "What breaks if I change X?"
  ✗ "Who owns X and who should I talk to about changing it?"
  ✗ "Why was X implemented this way?"
  ✗ "What is the safest way to remove X?"
  ✗ "Is X still actively used in production?"

The unanswerable questions are not obscure. They come up in every sprint: impact analysis before a refactor, safe deletion of old utilities, understanding historical decisions before changing them. These are the questions that currently route to the most experienced engineer on the team — the one who happens to have the cross-system mental model. When that person is unavailable, the work stalls.
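
To make the boundary concrete, here is a minimal sketch of what a static index holds. It is a toy: the two in-memory files, the `IndexEntry` shape, and the regex-based extraction are assumptions made for illustration (a real indexer would use a parser, embeddings, or both). The point is that every field is derived from source text, so there is nowhere to look up an owner, a production usage signal, or a reason.

```typescript
// Hypothetical in-memory repo: file path -> source text.
const files: Record<string, string> = {
  "src/auth.ts": "export function validateToken(t: string): boolean { return t.length > 0; }",
  "src/api.ts": 'import { validateToken } from "./auth";\nexport const ok = validateToken("x");',
};

interface IndexEntry {
  file: string;
  exports: string[]; // names declared with `export`
  imports: string[]; // module specifiers exactly as written in the source
}

// Collect capture group 1 for every match. The pattern must carry the `g` flag
// so that exec() advances through the text instead of looping forever.
function captureAll(text: string, pattern: RegExp): string[] {
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    found.push(match[1]);
  }
  return found;
}

function buildIndex(sources: Record<string, string>): IndexEntry[] {
  return Object.keys(sources).map((file) => ({
    file,
    exports: captureAll(sources[file], /export\s+(?:function|const|class)\s+(\w+)/g),
    imports: captureAll(sources[file], /from\s+["']([^"']+)["']/g),
  }));
}

// "Where is X defined?"
function whereIsDefined(index: IndexEntry[], symbol: string): string[] {
  return index.filter((e) => e.exports.includes(symbol)).map((e) => e.file);
}

// "What does this file import?"
function importsOf(index: IndexEntry[], file: string): string[] {
  const entry = index.find((e) => e.file === file);
  return entry ? entry.imports : [];
}

const index = buildIndex(files);
console.log(whereIsDefined(index, "validateToken")); // files exporting the symbol
console.log(importsOf(index, "src/api.ts"));         // module specifiers it imports
```

Both queries resolve from the index alone. A question like "is `validateToken` still used in production?" has no corresponding field to query.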

The impact analysis gap

Impact analysis is the most consequential gap. Before changing something significant — an API contract, an event schema, a shared utility — a developer needs to know everything that depends on it. Code indexing surfaces what is in the current repo. It cannot surface what is in other repos, what is in mobile clients, what is in third-party integrations:

Impact analysis query vs. full impact reality

"What breaks if I change the user auth token format?"

  Code index retrieval:
    → finds auth.ts, tokenValidator.ts, jwtHelper.ts
    → finds 12 files that import from auth module
    → returns them ranked by semantic similarity

  What you actually need to know:
    → 3 mobile clients parse token fields directly (different repos)
    → 1 third-party webhook validates token signature
    → legacy admin panel has hardcoded token parsing
    → 2 cron jobs run token validation offline
    → 4 downstream services extract userId from token format

This is the mechanism behind "it was a simple change" post-mortems, where a developer changed something that looked locally safe and broke three things in adjacent systems that were not visible from the repo they were working in. Code indexing finds everything in the repo. Impact lives across the organization's full system surface. These are different scopes, and many outages happen in the gap between them. It is the same structural problem that makes API contract changes break more than expected.
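
One way to picture the scope difference is a consumer graph that records dependents wherever they live. The sketch below is illustrative only: the `Consumer` shape, the repo names, and the `auth-token-format` capability label are all assumptions, and the hard part in practice is collecting this data, not querying it.

```typescript
type Scope = "same-repo" | "other-repo" | "external";

interface Consumer {
  name: string;
  repo: string;       // hypothetical repo identifier
  dependsOn: string;  // capability consumed, e.g. "auth-token-format"
  external?: boolean; // true for third-party integrations outside the org
}

// Hypothetical consumer records, loosely mirroring the token-format example.
const consumers: Consumer[] = [
  { name: "tokenValidator.ts", repo: "backend", dependsOn: "auth-token-format" },
  { name: "ios-client", repo: "mobile-ios", dependsOn: "auth-token-format" },
  { name: "legacy-admin-panel", repo: "admin", dependsOn: "auth-token-format" },
  { name: "partner-webhook", repo: "n/a", dependsOn: "auth-token-format", external: true },
  { name: "invoice-export", repo: "billing", dependsOn: "invoice-schema" },
];

// Group every dependent of a capability by how far it sits from the repo
// the developer is working in.
function impactOf(
  capability: string,
  currentRepo: string,
  all: Consumer[],
): Record<Scope, string[]> {
  const result: Record<Scope, string[]> = {
    "same-repo": [],
    "other-repo": [],
    external: [],
  };
  for (const c of all) {
    if (c.dependsOn !== capability) continue;
    const scope: Scope = c.external
      ? "external"
      : c.repo === currentRepo
        ? "same-repo"
        : "other-repo";
    result[scope].push(c.name);
  }
  return result;
}

const impact = impactOf("auth-token-format", "backend", consumers);
console.log(impact);
```

A repo-scoped index can only populate the `same-repo` bucket. The other two buckets are where the surprises in a post-mortem tend to come from.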

The intent gap

Code tells you what was built. It rarely tells you why. Constants with specific values, unusual data structures, and unexpected logic branches often exist for reasons that are not documented in the code, not recorded in git history, and not reachable by semantic search:

The gap between code and intent

"Why does this payment retry have a 7-day limit?"

  Code index answer:
    → RETRY_LIMIT_DAYS = 7 in constants.ts
    → Used in PaymentRetryScheduler

  What the index cannot tell you:
    → This was set by legal (statute of limitations on chargebacks)
    → Changing it requires legal approval, not just a code review
    → There is a Jira epic from 2023 where this was debated
    → The previous value (14 days) caused a compliance violation

When a developer changes a constant without understanding the intent behind it, they may be undoing a compliance requirement, a contractual obligation, or a hard-won production fix. Code indexing cannot warn them because the intent is not in the code. It is in a Jira ticket from 2023, or in the memory of the engineer who negotiated it with legal. This is the institutional memory problem that arises when architecture decisions are made verbally and never written down.
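
If the intent were captured somewhere machine-readable, a check could surface it at change time. The following is a sketch under that assumption: `DecisionRecord` and `reviewChange` are hypothetical names, and the record simply restates the payment retry example above. The ticket field is left unset because the example does not give an identifier.

```typescript
interface DecisionRecord {
  constant: string;
  value: number;
  rationale: string;
  requiredApprovers: string[]; // sign-off needed beyond a normal code review
  ticket?: string;             // unset when the identifier is unknown
}

// Hypothetical record restating the RETRY_LIMIT_DAYS example.
const decisions: DecisionRecord[] = [
  {
    constant: "RETRY_LIMIT_DAYS",
    value: 7,
    rationale: "Set by legal; the previous value of 14 days caused a compliance violation",
    requiredApprovers: ["legal"],
  },
];

// Return warnings for a proposed change. An empty array means nothing is on
// record for this constant, or the value is not actually changing.
function reviewChange(
  constant: string,
  newValue: number,
  records: DecisionRecord[],
): string[] {
  const record = records.find((r) => r.constant === constant);
  if (!record || record.value === newValue) return [];
  return [
    `${constant}: ${record.value} -> ${newValue}`,
    `Rationale on record: ${record.rationale}`,
    ...record.requiredApprovers.map((a) => `Requires approval from: ${a}`),
  ];
}

for (const line of reviewChange("RETRY_LIMIT_DAYS", 14, decisions)) {
  console.log(line);
}
```

An empty result here means only that no decision is on record, not that the change is safe. That distinction is the intent gap in miniature.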

The ownership gap

Every significant piece of code has an owner — a team, a person, a service domain — but code indexes do not encode ownership. When you find a function during debugging, you often need to know: who maintains this? Who should review a change? Who do I talk to if it is wrong? A semantic index of source files has no answer to these questions. A behavioral understanding of the system — which team owns which service, which service owns which capability — can start answering them.
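
As a rough illustration, ownership can be modeled as data that lives beside the index. The sketch below assumes a simple longest-prefix rule over file paths; the `OwnershipRule` shape, the paths, and the team names are hypothetical, and real ownership is often messier than a path prefix can express.

```typescript
interface OwnershipRule {
  pathPrefix: string;
  team: string;
}

// Hypothetical ownership rules.
const rules: OwnershipRule[] = [
  { pathPrefix: "src/", team: "platform" },
  { pathPrefix: "src/payments/", team: "payments" },
  { pathPrefix: "src/auth/", team: "identity" },
];

// The most specific (longest) matching prefix wins.
// Returns undefined when no rule covers the path.
function ownerOf(filePath: string, ownership: OwnershipRule[]): string | undefined {
  let best: OwnershipRule | undefined;
  for (const rule of ownership) {
    const matches = filePath.startsWith(rule.pathPrefix);
    if (matches && (best === undefined || rule.pathPrefix.length > best.pathPrefix.length)) {
      best = rule;
    }
  }
  return best?.team;
}

console.log(ownerOf("src/payments/retryScheduler.ts", rules));
console.log(ownerOf("docs/README.md", rules));
```

The `undefined` case matters as much as the matches: code with no recorded owner is exactly the code that routes questions to whoever has been around longest.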

What a semantic layer above code indexing provides

The questions that code indexing cannot answer require a layer of context above the source files: behavioral relationships, ownership, cross-repo impact, and connected work-in-progress:

What a semantic layer adds above raw code indexing

What a semantic layer adds above code indexing:
  → Behavioral ownership: which team owns which capability
  → Change impact: downstream consumers of this code
  → Jira integration: open tickets that touch this area
  → Historical context: recent changes + associated intent
  → Cross-repo visibility: all consumers, not just this repo

Kognita's approach is to build this layer explicitly — reconstructing behavioral ownership, cross-repo consumer graphs, and Jira-integrated context so that the questions code indexing cannot answer become queryable. "What touches the payment service?" gets an answer that includes the downstream consumers in other repos. "What is actively changing in this area?" gets an answer that includes open Jira tickets alongside the code.
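
To show the shape of such a query, here is a hypothetical sketch. It is not Kognita's data model or API: `ServiceContext`, `whatTouches`, the service and repo names, and the `EXAMPLE-1` ticket key are placeholders invented for illustration.

```typescript
interface ServiceContext {
  service: string;
  ownerTeam: string;
  consumers: { name: string; repo: string }[];
  openTickets: { key: string; summary: string }[];
}

// Hypothetical context record; every value here is a placeholder.
const contexts: ServiceContext[] = [
  {
    service: "payment-service",
    ownerTeam: "payments",
    consumers: [
      { name: "checkout-web", repo: "storefront" },
      { name: "PaymentRetryScheduler", repo: "payments" },
    ],
    openTickets: [{ key: "EXAMPLE-1", summary: "Placeholder ticket touching retries" }],
  },
];

// "What touches the payment service?" answered from one record that combines
// ownership, cross-repo consumers, and in-flight work.
function whatTouches(service: string, all: ServiceContext[]) {
  const ctx = all.find((c) => c.service === service);
  if (!ctx) return undefined;
  return {
    owner: ctx.ownerTeam,
    consumers: ctx.consumers.map((c) => `${c.name} (${c.repo})`),
    activeWork: ctx.openTickets.map((t) => `${t.key}: ${t.summary}`),
  };
}

console.log(whatTouches("payment-service", contexts));
```

The lookup itself is trivial. The value is in the record: each field comes from a different system, and none of them is derivable from the source files alone.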

Final take

Code indexing is the right tool for navigating source code: finding definitions, surfacing patterns, understanding structure. The tools that do it well are genuinely useful, and teams should use them. But treating a code index as a complete answer to codebase intelligence overestimates what it does.

The questions that actually slow teams down — impact analysis, architectural intent, cross-system dependencies, ownership — are not in the index. They are in the behavioral layer above the code. That layer does not build itself from a vector index. It requires semantic enrichment that understands what the code does and who depends on it — not just where it lives.