The Hidden Cost of Bad Chunking
Most discussions around retrieval systems obsess over embeddings, vector databases, rerankers, context windows, and model quality.
But in real-world code retrieval systems, one layer quietly determines whether everything works or collapses: chunking.
Bad chunking destroys retrieval quality long before embeddings even become relevant. And most systems are chunked badly.
Why chunking matters more than people think
Retrieval systems never search entire repositories directly. They search chunks.
That means the chunk becomes the fundamental unit of meaning. If your chunk boundaries are wrong: retrieval quality drops, reranking becomes noisy, context windows fill with garbage, agents hallucinate, and architectural understanding collapses.
The embedding model can only represent what exists inside the chunk. If the chunk itself is semantically broken, retrieval becomes fundamentally unreliable.
The naive approach: fixed-length chunking
Most systems still do something like:
chunk_size = 500  # tokens
overlap = 100     # tokens
This works reasonably for generic text. It works terribly for software systems, because code meaning does not align to token counts.
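To make the recipe concrete, here is a minimal sketch of a fixed-length chunker with overlap. The function name `fixed_length_chunks` and the synthetic `tok0`, `tok1`, ... tokens are assumptions for illustration; a real pipeline would use the tokenizer of its embedding model.

```python
def fixed_length_chunks(tokens, chunk_size=500, overlap=100):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Synthetic tokens stand in for the output of a real tokenizer.
tokens = [f"tok{i}" for i in range(1200)]
chunks = fixed_length_chunks(tokens)
print([len(c) for c in chunks])  # [500, 500, 400]
```

Nothing in this loop looks at what the tokens mean. The boundaries fall wherever the counter says they fall.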
Example: broken function boundaries
Suppose we have this method:
public PaymentResult processPayment(Order order) {
    validateOrder(order);
    FraudCheckResult fraud = fraudEngine.check(order);
    if (!fraud.isAllowed()) {
        metrics.incrementFraudRejects();
        return PaymentResult.rejected();
    }
    StripeCharge charge = stripeClient.charge(order);
    eventBus.publish(new PaymentProcessedEvent(order));
    return PaymentResult.success(charge.id());
}
Naive chunking might split this into three fragments.
Fragment 1:
public PaymentResult processPayment(Order order) {
    validateOrder(order);
    FraudCheckResult fraud = fraudEngine.check(order);
    if (!fraud.isAllowed()) {
        metrics.incrementFraudRejects();
Fragment 2:
        return PaymentResult.rejected();
    }
    StripeCharge charge = stripeClient.charge(order);
    eventBus.publish(new PaymentProcessedEvent(order));
Fragment 3:
    return PaymentResult.success(charge.id());
}
The semantic meaning of the execution flow is now destroyed.
What retrieval sees
A developer searches for "where do we publish payment events". The relevant logic exists in fragment 2, but that fragment no longer contains method context, fraud logic, execution intent, function ownership, or surrounding business flow. The chunk can still match the query, yet it cannot tell the reader which method it belongs to or what ran before it.
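The effect is easy to reproduce. The sketch below cuts the method into windows of nine whitespace-separated tokens and then looks up the fragment that mentions `eventBus.publish`. The window size, the whitespace tokenization, and the substring lookup are all assumptions chosen to keep the example small; they stand in for a real tokenizer and a real retriever.

```python
JAVA_SOURCE = """\
public PaymentResult processPayment(Order order) {
    validateOrder(order);
    FraudCheckResult fraud = fraudEngine.check(order);
    if (!fraud.isAllowed()) {
        metrics.incrementFraudRejects();
        return PaymentResult.rejected();
    }
    StripeCharge charge = stripeClient.charge(order);
    eventBus.publish(new PaymentProcessedEvent(order));
    return PaymentResult.success(charge.id());
}
"""


def naive_chunks(text, chunk_size):
    """Cut the text into consecutive windows of chunk_size whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]


chunks = naive_chunks(JAVA_SOURCE, chunk_size=9)

# Substring match stands in for "the retriever returned this chunk".
hits = [c for c in chunks if "eventBus.publish" in c]
for hit in hits:
    print("retrieved:", hit)
    print("names the method:", "processPayment" in hit)
    print("shows the fraud check:", "fraudEngine" in hit)
```

The retrieved fragment contains the publish call, but neither the method name nor the fraud check that guards it.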
Another failure: class fragmentation
Suppose a repository contains:
class UserDeletionWorkflow {
    deleteDatabaseRecords();
    revokeSessions();
    purgeS3Assets();
    cancelBillingSubscriptions();
    enqueueGDPRCleanup();
}
Fixed token chunking may separate deletion orchestration, GDPR cleanup, billing cancellation, and storage cleanup into unrelated chunks. The system loses the fact that these operations belong to the same behavioral workflow.
Why this matters for AI agents
Humans reconstruct missing context naturally. LLMs often cannot. An AI agent retrieving fragmented chunks may incorrectly conclude that billing subscriptions are never cancelled during account deletion simply because that logic lives in another disconnected chunk.
This is one of the hidden causes of hallucinations in code agents — not model intelligence, but retrieval fragmentation.
Chunking defines semantic boundaries
Good chunking aligns with actual software meaning. Bad chunking aligns with arbitrary token counts. That distinction becomes enormous at repository scale.
Better strategy: AST-aware chunking
Instead of chunking by token length, chunk by syntactic unit: complete methods, classes, interfaces, controllers, transaction scopes, logical execution blocks. Now retrieval preserves semantic integrity.
Instead of splitting mid-function, processPayment becomes one coherent retrieval unit. Searches like "where do we charge stripe" or "where are payment processed events emitted" retrieve the same operational flow, so a match on either query returns the fraud check, the charge, and the event publication together.
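As a rough illustration of chunking by syntactic unit, the sketch below uses Python's standard `ast` module to emit one chunk per top-level function or class. Two assumptions to flag: the payment example is rendered in Python here so that the standard library can parse it, and the snake_case names are my translation of the Java identifiers. For Java sources you would need a Java-capable parser instead, but the shape of the loop would be similar.

```python
import ast

SOURCE = '''\
def process_payment(order):
    validate_order(order)
    fraud = fraud_engine.check(order)
    if not fraud.is_allowed():
        metrics.increment_fraud_rejects()
        return PaymentResult.rejected()
    charge = stripe_client.charge(order)
    event_bus.publish(PaymentProcessedEvent(order))
    return PaymentResult.success(charge.id)


class UserDeletionWorkflow:
    def run(self, user):
        delete_database_records(user)
        revoke_sessions(user)
'''


def ast_chunks(source):
    """Return one chunk per top-level function or class, with its exact source text."""
    tree = ast.parse(source)  # parses only; nothing in SOURCE is executed
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks


for chunk in ast_chunks(SOURCE):
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Chunk sizes now vary with the code, which is the point: the boundary is the end of the function, not the end of a token budget.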
But AST chunking is still not enough
Syntax does not always equal meaning. Execution behavior spans multiple files:
class StripeService {
    retryCharge();
}
class RetryPolicy {
    execute();
}
AST chunking preserves syntax, but still loses operational relationships across those boundaries.
The next step: behavioral chunking
The most advanced systems increasingly move toward behavioral units: execution chains, dependency flows, service interactions, event propagation, transactional boundaries.
Example chain:
CheckoutController → PaymentService → FraudEngine → StripeClient → EventPublisher
That is vastly more meaningful retrieval context than isolated syntax fragments.
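One way to picture a behavioral unit is as a walk over a call graph. The sketch below is only that, a picture: the edges in `CALL_GRAPH` are hypothetical and hand-written, whereas a real system would have to derive them from static analysis or runtime traces.

```python
# Hypothetical call edges for the checkout example.
CALL_GRAPH = {
    "CheckoutController": ["PaymentService"],
    "PaymentService": ["FraudEngine", "StripeClient", "EventPublisher"],
    "FraudEngine": [],
    "StripeClient": [],
    "EventPublisher": [],
}


def execution_chain(graph, entry_point):
    """Depth-first walk from entry_point, listing each component once in call order."""
    seen = set()
    order = []

    def visit(name):
        if name in seen:
            return  # guards against cycles and repeated callees
        seen.add(name)
        order.append(name)
        for callee in graph.get(name, []):
            visit(callee)

    visit(entry_point)
    return order


chain = execution_chain(CALL_GRAPH, "CheckoutController")
print(" → ".join(chain))
```

The resulting list of components is what would be embedded and retrieved together as one chunk.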
Chunking is actually compression
Chunking is not just splitting text — it is semantic compression. You are deciding what meaning survives retrieval. Every chunking strategy destroys information differently.
Fixed-length chunking optimizes simplicity
Pros: easy, fast, model-agnostic, predictable, uniform embedding cost.
Cons: destroys execution flow, splits semantics, breaks architectural reasoning, creates retrieval fragmentation.
AST chunking optimizes syntax
Pros: preserves functions and classes, language-aware, structurally cleaner.
Cons: misses multi-hop behavior, ignores runtime relationships, weak architectural context across files.
Behavioral chunking optimizes operational meaning
Pros: preserves execution semantics, improves agent reasoning, stronger architectural retrieval.
Cons: much harder to build — graph analysis, repository-aware indexing, dependency resolution.
Why most RAG systems fail on real codebases
Small demos hide chunking problems. Large repositories expose them immediately. As systems scale, abstractions deepen, execution spreads across services, naming becomes inconsistent, and indirection increases. Naive chunking collapses under that complexity. The retrieval problem becomes less like document search and more like reconstructing distributed behavior.
The future is probably multi-layer chunking
The strongest systems will likely maintain multiple parallel representations: file-level chunks, function-level chunks, execution-flow chunks, repository graph chunks, semantic summaries. Different retrieval tasks need different granularities.
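A small sketch of what "multiple parallel representations" could look like in practice: the same code indexed at several granularities, with the retrieval task deciding which layer to search. The task names, the `LEVEL_FOR_TASK` routing table, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    level: str  # "file", "function", or "flow"
    name: str
    text: str


# Hypothetical routing table: which granularity suits which kind of question.
LEVEL_FOR_TASK = {
    "locate_symbol": "function",
    "explain_flow": "flow",
    "summarize_module": "file",
}


def candidates(index, task):
    """Return only the chunks stored at the granularity this task needs."""
    level = LEVEL_FOR_TASK[task]
    return [chunk for chunk in index if chunk.level == level]


# Placeholder texts; a real index would hold source code or summaries.
index = [
    Chunk("file", "PaymentService.java", "full file text"),
    Chunk("function", "processPayment", "method body"),
    Chunk("flow", "checkout", "CheckoutController → PaymentService → StripeClient"),
]

print([c.name for c in candidates(index, "explain_flow")])
```

In a real system the filtered candidates would then go through vector search and reranking; the routing step only narrows which layer is searched.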
The real insight
Most people think retrieval quality comes from better embeddings. Often the bigger improvement comes from better semantic boundaries — because retrieval systems can only understand the meaning you preserve inside the chunk. Once meaning is fragmented, no embedding model can fully reconstruct it.
The problem with “500 token chunks”
That recipe appears in almost every RAG tutorial. For generic text it often works. For codebases it becomes one of the fastest ways to destroy retrieval quality — because software systems are not continuous prose. They are structured behavioral graphs, and arbitrary token windows ignore that.
Why 500 tokens became popular
Fixed token chunking is simple, fast, deterministic, model-friendly, easy to batch, and gives uniform embedding cost. It works well for blog posts, PDFs, documentation, and support articles — where meaning is relatively distributed. Code is different.
Meaning does not respect token boundaries
Suppose a checkout flow spans:
CheckoutController → PaymentService → FraudEngine → StripeClient → EventPublisher
A fixed token chunker may produce unrelated islands:
fraudEngine.check(order)
stripeGateway.charge(order)
eventPublisher.publish(...)
A developer searching for "where does checkout complete successfully" does not want isolated syntax fragments. They want the behavioral execution flow, which is a different retrieval problem entirely.
Chunking by length vs chunking by meaning
Length-based chunking (split after N tokens) optimizes implementation simplicity, embedding consistency, and throughput — and destroys execution semantics, architectural context, and behavioral continuity.
Contextual chunking splits by semantic boundary: complete methods, controllers, transaction scopes, execution chains, event flows, service boundaries. Retrieval preserves operational meaning.
Example: authentication flow
Imagine identifiers across a repo:
- AuthController
- SessionService
- JwtTokenProvider
- RefreshTokenStore
- CookieManager
A user searches for "where are refresh tokens rotated". With length-based chunking you might retrieve only tokenStore.save(refreshToken), which is barely useful. With contextual chunking you retrieve the full rotation story: invalidate old token, generate new JWT, persist refresh token, update secure cookie. Now the developer actually understands the system.
Why overlap does not really solve this
Increasing overlap (e.g. 500 tokens / 200 overlap) reduces fragmentation slightly — but introduces duplicated retrieval, wasted context window, reranker noise, inflated embedding storage, and repeated irrelevant context. Large overlap becomes an expensive band-aid over fundamentally broken boundaries.
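The storage side of that cost is simple arithmetic. With a sliding window, each new chunk advances by chunk_size minus overlap tokens, so the number of embedded tokens grows by roughly chunk_size / (chunk_size - overlap). The sketch below works this out for a hypothetical one-million-token repository; the repository size is an assumption picked for round numbers.

```python
import math


def overlap_cost(total_tokens, chunk_size, overlap):
    """Return (chunk_count, embedded_tokens, inflation) for a sliding window."""
    stride = chunk_size - overlap
    if total_tokens <= chunk_size:
        return 1, total_tokens, 1.0
    # Index of the first window that reaches the end of the input.
    last_index = math.ceil((total_tokens - chunk_size) / stride)
    chunk_count = last_index + 1
    # Every window is full except possibly the last one.
    last_length = total_tokens - last_index * stride
    embedded = last_index * chunk_size + last_length
    return chunk_count, embedded, embedded / total_tokens


for overlap in (0, 100, 200):
    count, embedded, inflation = overlap_cost(1_000_000, 500, overlap)
    print(overlap, count, embedded, round(inflation, 2))
```

For this input, going from 100 to 200 tokens of overlap raises the inflation factor from about 1.25 to about 1.67, so roughly two thirds more tokens are embedded and stored than the repository contains.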
The hidden cost: embedding pollution
When unrelated concepts share one chunk, embeddings become semantically diluted. A chunk mixing retry payment, send analytics, cache invalidation, and audit logging weakly represents all four at once — search quality drops because the vector no longer strongly represents any single operational meaning.
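Dilution can be shown with a deliberately crude model. The sketch below uses term counts as a stand-in for an embedding and compares a query against a focused chunk and a mixed one. Real embedding models are not bag-of-words vectors, and the chunk texts are invented for the example, so treat this as an intuition aid rather than a measurement.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: term counts. A stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


focused = "retry payment charge after gateway timeout"
mixed = (
    "retry payment charge after gateway timeout "
    "send analytics event to dashboard "
    "invalidate cache entry for user "
    "write audit log record"
)

query = embed("retry payment")
focused_score = cosine(query, embed(focused))
mixed_score = cosine(query, embed(mixed))
print(round(focused_score, 3), round(mixed_score, 3))
```

Both chunks contain the same retry logic, but the mixed chunk scores lower because the extra concerns enlarge its vector without adding anything the query asks for.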
Why small chunks also fail
Shrinking aggressively (e.g. 120-token chunks) loses dependencies, execution flow, surrounding intent, and business meaning. You retrieve isolated syntax atoms instead of coherent logic.
The real tradeoff
Chunking balances semantic precision vs context preservation. Too large → dilution. Too small → fragmentation. The optimal chunk is rarely defined by token count. It is defined by behavioral cohesion.
Better chunking strategies
1. AST chunking — by method, class, interface, module. Good baseline for code retrieval.
2. Dependency-aware chunking — expand using imports, call graphs, inheritance, interfaces (e.g. PaymentService + StripeClient + RetryPolicy + FraudEngine).
3. Execution-flow chunking — build around runtime behavior: validate cart → reserve inventory → charge payment → publish event. Dramatically more useful for agents.
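For the second strategy, one possible shape is to start from a seed unit and pull in its dependencies breadth-first until a token budget is used up. The token counts, the dependency edges, and the budget in this sketch are hypothetical; only the class names come from the example above.

```python
from collections import deque

# Hypothetical units: name -> (token_count, direct dependencies).
UNITS = {
    "PaymentService": (220, ["FraudEngine", "StripeClient"]),
    "StripeClient": (180, ["RetryPolicy"]),
    "RetryPolicy": (90, []),
    "FraudEngine": (260, []),
}


def expand_chunk(units, seed, token_budget):
    """Add the seed, then its dependencies nearest first, while they fit the budget."""
    members, used = [], 0
    queue, seen = deque([seed]), {seed}
    while queue:
        name = queue.popleft()
        size, deps = units[name]
        if used + size > token_budget:
            continue  # does not fit; its own dependencies are not expanded either
        members.append(name)
        used += size
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return members, used


print(expand_chunk(UNITS, "PaymentService", token_budget=700))
print(expand_chunk(UNITS, "PaymentService", token_budget=800))
```

Breadth-first order is a design choice here: direct collaborators are included before anything two hops away, so a tight budget drops the most distant code first.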
Why AI coding systems expose chunking problems faster
Many “hallucinations” are retrieval boundary failures: an agent concludes that payments are not retried because retry behavior lives in another disconnected chunk. The industry is slowly realizing that retrieval quality is downstream of chunk quality, and chunk quality is downstream of repository understanding.
The bigger shift
The future probably moves toward semantic, graph-aware, and behavioral chunking — and repository cognition systems — instead of increasingly arbitrary token heuristics. The issue is not that 500 is the wrong number. The issue is that tokens are not software boundaries, and retrieval systems that ignore that eventually collapse under repository complexity.