Why Context Windows Will Never Be Enough

The AI industry currently treats larger context windows like a universal solution. Every few months, another announcement appears:

The arms race (headlines)

100k tokens
200k tokens
1 million tokens
10 million tokens

The assumption is straightforward: if the model can see more of the repository, it will understand the repository better.

But software systems are not just large text blobs. They are execution graphs, dependency networks, architectural systems, behavioral surfaces, and operational flows. Those structures do not emerge automatically from larger token windows. That is why context scaling alone will never fully solve AI coding.

The industry is solving the wrong problem

Most people frame repository understanding as a context size problem. Increasingly, the harder problem is a repository reconstruction problem. Those are not the same thing.

Example: a one-million-token repository dump

Imagine giving an LLM an entire monorepo inside a giant context window. Will the model now understand service ownership, runtime execution, architectural boundaries, async workflows, hidden side effects, or operational dependencies? Not really. Repository understanding is not merely about visibility — it is about structure.

Humans don't understand repositories by reading everything

Senior engineers do not understand large codebases by reading every file sequentially. They build abstractions, mental maps, dependency models, architectural understanding, and execution intuition. They navigate repositories as graphs of related components, not as linear text.

Example mental model:

Compressed into “checkout execution flow”

CheckoutController
  → PaymentService
  → FraudEngine
  → StripeGateway
  → EventBus
  → EmailWorkflow

LLMs currently consume mostly flattened text — a completely different representation.
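To make the contrast concrete, here is a minimal Python sketch that holds the same checkout flow two ways: as an adjacency list and as one flat string. It assumes the arrows above form a single call chain, which is only one reading of the diagram, and the helper name `downstream` is hypothetical.

```python
# Assumption: the arrows in the mental model form one call chain.
CHECKOUT_FLOW = {
    "CheckoutController": ["PaymentService"],
    "PaymentService": ["FraudEngine"],
    "FraudEngine": ["StripeGateway"],
    "StripeGateway": ["EventBus"],
    "EventBus": ["EmailWorkflow"],
    "EmailWorkflow": [],
}

# The same names as a model typically receives them: one flat string.
FLATTENED = " ".join(CHECKOUT_FLOW)


def downstream(graph, start):
    """Return every node reachable from `start`, in visit order."""
    seen, stack, order = {start}, [start], []
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                stack.append(nxt)
    return order


# The graph answers "what runs after PaymentService?" directly.
print(downstream(CHECKOUT_FLOW, "PaymentService"))
# The flat string can only answer "does this name appear?".
print("PaymentService" in FLATTENED)
```

The flat string still contains every name, but the ordering and direction of the calls have to be inferred rather than looked up.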

Bigger context windows create attention dilution

This problem gets worse as context grows. Suppose a monorepo contains many retry systems, payment abstractions, notification frameworks, and event buses. A developer asks: “Where do we retry failed Stripe charges?” A gigantic window may include retry middleware, Kafka retries, HTTP retries, cron retries, webhook retries, and payment retries. The model now faces signal dilution: more context does not necessarily improve relevance, and sometimes it buries the one relevant passage among near-matches.

The model still does not know what matters

Even if the entire repository fits into context, the model still lacks importance weighting, architectural prioritization, operational boundaries, ownership understanding, and execution salience. The repository is visible — meaning is not automatically reconstructed.

Repository meaning is non-local

Software behavior is distributed. A single line may trigger behavior hundreds of files away:

One line, many consumers

eventBus.publish(new OrderPlacedEvent(order));

That may activate inventory reservation, fraud analysis, analytics, email workflows, billing pipelines, and async retries — across services, queues, consumers, and orchestrators. No contiguous token window naturally captures that operational graph.
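A small sketch of why the publishing line tells a reader so little on its own. The `EventBus` class below is a hypothetical in-process stand-in written for illustration, not the API of any particular framework, and the three handlers are invented examples of the consumers listed above.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-process pub/sub bus, for illustration only."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The publisher never names its consumers; the bus looks them up.
        return [handler(payload) for handler in self._handlers[event_type]]


bus = EventBus()
# In a real repository these registrations would live in separate files or services.
bus.subscribe("OrderPlaced", lambda order: f"reserve inventory for {order}")
bus.subscribe("OrderPlaced", lambda order: f"run fraud analysis on {order}")
bus.subscribe("OrderPlaced", lambda order: f"send confirmation email for {order}")

results = bus.publish("OrderPlaced", "order-42")
print(results)
```

Nothing at the call site mentions inventory, fraud, or email. That knowledge lives entirely in the subscription table, which in practice is scattered across the codebase.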

Context windows flatten structure

Repositories are hierarchical, graph-structured systems. Context windows are flat sequences. Flattening destroys graph relationships, execution topology, dependency directionality, and architectural boundaries. The model receives ordered tokens instead of behavioral structure, which is an enormous mismatch.

Example: false confidence from visibility

Suppose the model sees:

userRepository.delete(userId);

But the actual deletion workflow also includes:

Downstream (easy to miss in attention)

billingService.cancelSubscriptions()
s3AssetCleaner.purgeFiles()
auditLogger.recordDeletion()

If retrieval or attention misses those flows, the model concludes that user deletion only removes database records, even though the code for every downstream step was technically inside the context. Visibility is not understanding.
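One way to picture the gap is a call-graph walk. In the sketch below, only the four operation names come from the example. The entry point `UserDeletionWorkflow.run` and the edges are assumptions made for illustration.

```python
from collections import deque

# Hypothetical call edges; only the four operation names come from the example.
CALLS = {
    "UserDeletionWorkflow.run": [
        "userRepository.delete",
        "billingService.cancelSubscriptions",
        "s3AssetCleaner.purgeFiles",
        "auditLogger.recordDeletion",
    ],
}


def side_effects(calls, entry_point):
    """Breadth-first walk of the call graph from `entry_point`."""
    seen, queue, found = {entry_point}, deque([entry_point]), []
    while queue:
        current = queue.popleft()
        for callee in calls.get(current, []):
            if callee not in seen:
                seen.add(callee)
                found.append(callee)
                queue.append(callee)
    return found


# Starting from the repository call alone finds no downstream work.
print(side_effects(CALLS, "userRepository.delete"))
# Starting from the workflow that owns the deletion finds all four steps.
print(side_effects(CALLS, "UserDeletionWorkflow.run"))
```

The answer depends on where the walk starts, which is exactly the information a flat context does not supply.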

Bigger windows also increase noise

As repositories scale, giant contexts introduce duplicated abstractions, stale implementations, dead code, test utilities, generated files, and framework boilerplate. The model must separate operationally relevant behavior from semantically similar noise — which becomes harder as context grows.
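Some of that noise can be dropped before anything is indexed. The sketch below filters file paths with the standard library's `fnmatch`. The patterns and paths are hypothetical, and path rules only catch generated files and test utilities. Dead code and stale implementations need deeper analysis.

```python
from fnmatch import fnmatch

# Hypothetical ignore patterns; a real list would be tuned per repository.
NOISE_PATTERNS = ["*/generated/*", "*_test.py", "*/node_modules/*", "*.min.js"]


def is_noise(path):
    """True if the path matches any ignore pattern."""
    return any(fnmatch(path, pattern) for pattern in NOISE_PATTERNS)


paths = [
    "src/billing/retry.py",
    "src/billing/retry_test.py",
    "src/generated/api_client.py",
    "web/node_modules/lib/index.js",
]

kept = [p for p in paths if not is_noise(p)]
print(kept)
```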

This is why retrieval still matters

People imagine future systems dumping repositories into gigantic windows. In practice, retrieval becomes more important — the problem shifts from “can the model see enough?” to “can the system surface the right operational structure?” That is a fundamentally different problem.

The future is probably hierarchical context

The strongest systems likely will not use one giant flat context. Instead they assemble layered representations: semantic summaries, execution graphs, repository maps, dependency structures, behavioral flows, and retrieved implementation detail. That starts looking less like autocomplete — and more like repository cognition.
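A rough sketch of what layered assembly could look like: pack the coarse layers first and admit implementation detail only if the budget allows. The layer contents are invented, the function name `assemble_context` is hypothetical, and word count stands in for token count.

```python
def assemble_context(layers, budget):
    """Greedily pack layers, in priority order, into a word budget.

    `layers` is a list of (name, text) pairs, highest priority first.
    Word count stands in for token count here; a real system would
    use the model's tokenizer.
    """
    chosen, used = [], 0
    for name, text in layers:
        cost = len(text.split())
        if used + cost <= budget:
            chosen.append(name)
            used += cost
    return chosen, used


# Hypothetical layer contents, ordered from coarse structure to fine detail.
LAYERS = [
    ("repository map", "checkout payments notifications"),
    ("execution graph", "CheckoutController -> PaymentService -> StripeGateway"),
    ("semantic summary", "PaymentService charges cards and emits events"),
    ("implementation detail", "def charge(card, amount): " + "pass " * 50),
]

print(assemble_context(LAYERS, budget=20))
```

With a budget of 20 words the three structural layers fit and the long implementation body is left out. That is the reverse of what a raw repository dump would prioritize.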

Context windows solve storage, not understanding

Large windows help memory, visibility, persistence, and retrieval bandwidth. They do not automatically solve architecture understanding, execution reasoning, dependency awareness, or operational modeling. Those require structure-aware systems.

The bigger shift

The industry is slowly moving from text generation toward repository understanding. And repository understanding is fundamentally a graph problem, not merely a token problem. The future probably belongs to systems that reconstruct execution intent, architectural structure, operational relationships, and semantic ownership, because once an AI system can actually reconstruct how the repository behaves, hallucinations should drop substantially.

Codebases are graphs, not documents

Most AI systems still treat repositories like giant text corpora — conceptually, a collection of files. That mental model breaks almost immediately at scale. Large software systems are not documents; they are graphs.

Documents are mostly linear

Natural language is usually sequential; meaning is primarily local. That is why standard RAG works reasonably well for blogs, PDFs, and support articles.

Codebases behave completely differently

Software meaning is distributed across relationships. CheckoutController may depend on payment, fraud, inventory, events, analytics, and notifications — none of which are necessarily local in the filesystem. Operational meaning lives in the graph connecting them.

This is why naive retrieval fails

Traditional RAG retrieves nearby text, semantically similar chunks, or lexical matches. Software meaning is often many hops away.

Example query: “Where do we cancel subscriptions when users delete accounts?” Relevant behavior may span:

No single chunk holds the whole story

UserDeletionWorkflow
  → AccountCleanupService
    → BillingCancellationHandler
      → StripeSubscriptionManager

No single chunk contains the complete behavior — because the meaning exists in the graph.
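The chain above can be written down as edges and searched. This is a minimal breadth-first sketch. The edges mirror the diagram, and the function name `find_path` is hypothetical.

```python
from collections import deque

# Edges taken from the chain above.
GRAPH = {
    "UserDeletionWorkflow": ["AccountCleanupService"],
    "AccountCleanupService": ["BillingCancellationHandler"],
    "BillingCancellationHandler": ["StripeSubscriptionManager"],
    "StripeSubscriptionManager": [],
}


def find_path(graph, start, goal):
    """Return the shortest call path from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = find_path(GRAPH, "UserDeletionWorkflow", "StripeSubscriptionManager")
print(" -> ".join(path))
print(f"hops: {len(path) - 1}")
```

The answer to the query is three hops from the place where account deletion starts, so a retriever that returns only the best-matching chunk sees at most one link of the chain.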

The file system lies

A directory tree such as src/services/controllers/utils/ is not the real architecture; execution flow is. Directories may slice code one way while runtime operations cut across them. Humans learn that over time; AI systems usually do not.

Humans already think graphically

Senior engineers reason through dependency chains, ownership boundaries, execution flows, event propagation, and service topology. When debugging payments, they mentally traverse request → validation → fraud → charge → event → async workflows. That is graph traversal, not document retrieval.

Embeddings alone cannot capture this

Dense embeddings excel at semantic similarity. Graphs encode causality, dependencies, execution order, ownership, and runtime relationships, not merely semantic neighborhoods. A query like “retry failed Stripe charges” may retrieve HTTP retry middleware, Kafka retries, webhook retries, and cron retries instead of the actual billing recovery workflow, because semantic similarity is not operational understanding.
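A toy illustration of that ranking failure. Token-set overlap (Jaccard similarity) is used here as a crude stand-in for embedding similarity. Real dense embeddings capture far more than shared words, so this exaggerates the effect. The chunk names and descriptions are invented.

```python
def jaccard(a, b):
    """Token-set overlap, a crude stand-in for embedding similarity."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)


# Hypothetical chunk descriptions; the names echo the examples above.
CHUNKS = {
    "HttpRetryMiddleware": "retry failed http requests with backoff",
    "KafkaRetryConsumer": "retry failed kafka messages",
    "WebhookRetryJob": "retry failed stripe webhook deliveries",
    "BillingRecoveryWorkflow": "recover unpaid invoices by charging the card again",
}

query = "retry failed stripe charges"
ranked = sorted(CHUNKS, key=lambda name: jaccard(query, CHUNKS[name]), reverse=True)
print(ranked)
```

The workflow that actually recovers failed charges ranks last because it is described in different vocabulary, while the unrelated retry systems share the query's words.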

Chunking quietly breaks graph structure

Suppose an execution flow exists across controller → payment service → Stripe client → event publisher. Naive chunking isolates each into separate retrieval units. The AI sees disconnected syntax instead of connected behavior — the graph disappears.
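The loss can be counted. The sketch below models the flow as directed edges and assumes the simplest chunking, one retrieval unit per component. The chunk ids and function names are hypothetical.

```python
# The execution flow from the paragraph above, as directed edges.
EDGES = [
    ("controller", "payment service"),
    ("payment service", "Stripe client"),
    ("Stripe client", "event publisher"),
]

# Naive chunking: one retrieval unit per component (chunk ids are hypothetical).
CHUNK_OF = {
    "controller": 0,
    "payment service": 1,
    "Stripe client": 2,
    "event publisher": 3,
}


def severed_edges(edges, chunk_of):
    """Edges whose endpoints land in different chunks."""
    return [(a, b) for a, b in edges if chunk_of[a] != chunk_of[b]]


def visible_edges(edges, chunk_of, retrieved):
    """Edges that survive when only the chunk ids in `retrieved` reach the model."""
    return [(a, b) for a, b in edges
            if chunk_of[a] in retrieved and chunk_of[b] in retrieved]


print(len(severed_edges(EDGES, CHUNK_OF)), "of", len(EDGES), "edges cross a chunk boundary")
print(visible_edges(EDGES, CHUNK_OF, retrieved={0, 2}))
```

Every edge crosses a chunk boundary. If retrieval returns only the controller and Stripe client chunks, no complete edge reaches the model at all.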

Most AI coding agents operate on flattened repositories

Current systems often look like:

What the LLM sees

repository
  ↓
split into chunks
  ↓
embed chunks
  ↓
retrieve chunks
  ↓
send text to LLM

Notice what disappears: topology, execution structure, relationships, dependencies. The repository graph becomes isolated text fragments — one of the core reasons agents get lost in large systems.

Example: architectural hallucination

Suppose the convention is that all business logic belongs in domain services. The AI retrieves only controller chunks and generates policy checks inside the controller. The code compiles, but it is architecturally wrong. The failure comes from the model not perceiving repository structure, not from running out of tokens.
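A convention like this can be expressed as a dependency rule and checked mechanically. The component names, layer assignments, and allowed-dependency table below are all hypothetical, chosen only to mirror the scenario.

```python
# Hypothetical layer assignment for illustration.
LAYER_OF = {
    "CheckoutController": "controller",
    "CheckoutService": "domain",
    "RefundPolicy": "policy",
}

# Convention from the text: business rules are reached through domain services.
ALLOWED = {
    "controller": {"domain"},
    "domain": {"policy"},
    "policy": set(),
}


def violations(dependencies):
    """Return dependency edges that break the layering convention."""
    bad = []
    for source, target in dependencies:
        src_layer, dst_layer = LAYER_OF[source], LAYER_OF[target]
        if src_layer != dst_layer and dst_layer not in ALLOWED[src_layer]:
            bad.append((source, target))
    return bad


# What the model generated: a policy check directly inside the controller.
generated = [("CheckoutController", "RefundPolicy")]
# What the convention expects: the controller delegates to a domain service.
conventional = [("CheckoutController", "CheckoutService"),
                ("CheckoutService", "RefundPolicy")]

print(violations(generated))
print(violations(conventional))
```

The rule is trivial once the layers are explicit. The hard part is that nothing in the retrieved controller chunks states the rule.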

The future is probably graph-aware retrieval

Instead of retrieving isolated chunks by semantic similarity alone, future systems likely retrieve connected operational subgraphs:

Reason over behavior, not fragments

Payment Retry Flow
  → retry scheduler
  → failed charge queue
  → Stripe recovery worker
  → notification workflow

Now the AI reasons over behavior instead of disconnected syntax.
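A minimal sketch of that idea: take whatever a similarity search returns as a seed, then expand along graph edges for a bounded number of hops. The edges are assumed from the flow above, the unrelated `HTTP retry middleware` node is added for contrast, and `expand` is a hypothetical helper name.

```python
# Hypothetical edges; the node names follow the flow sketched above.
GRAPH = {
    "retry scheduler": ["failed charge queue"],
    "failed charge queue": ["Stripe recovery worker"],
    "Stripe recovery worker": ["notification workflow"],
    "notification workflow": [],
    "HTTP retry middleware": [],
}


def expand(graph, seeds, max_hops):
    """Return the seeds plus every node within `max_hops` outgoing edges."""
    selected = list(seeds)
    frontier = list(seeds)
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor not in selected:
                    selected.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return selected


# Assume a similarity search returned a single seed chunk.
print(expand(GRAPH, ["retry scheduler"], max_hops=3))
```

The expansion pulls in the connected workflow and leaves the lexically similar but unconnected middleware out. The hop limit is a design choice: too small truncates the flow, too large reintroduces noise.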

Code search is quietly becoming graph search

The problem is evolving from “find similar text” toward “reconstruct operational meaning” — a graph problem, not a document retrieval problem.

The important insight

Many AI coding failures happen because repositories are graph-shaped while AI systems still perceive them as documents. That mismatch creates hallucinations, missing dependencies, duplicated logic, architectural mistakes, and incomplete reasoning.

The systems that win will likely be the ones that best reconstruct execution topology, dependency structure, behavioral relationships, and operational flows — because codebases are not documents. They are distributed behavioral graphs disguised as text.