BM25 vs Sparse Vector Search for Code Retrieval
Most code search systems still rely on a simple assumption: developers search for code using the exact words already present in the repository.
That works surprisingly well — until it doesn't.
A developer asks: where do we invalidate expired sessions
But the implementation actually lives inside:
- JwtTokenCleanupJob
- SessionRevocationService
- ExpiredCredentialInterceptor
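Suddenly lexical search starts struggling: “invalidate” appears in none of these names, and “expired” and “sessions” match only if the index splits camelCase identifiers and normalizes plurals.
This is where the discussion around BM25, dense embeddings, and, increasingly, sparse vector retrieval becomes important. But most explanations stay theoretical. The reality is much easier to understand when you look at concrete repository behavior.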
Why BM25 still works extremely well for code
BM25 is fundamentally lexical. It ranks documents by how often the query's tokens occur in them and how rare those tokens are across the corpus, with a correction for document length. That sounds primitive compared to AI embeddings, but codebases are full of highly valuable lexical signals.
public class PaymentRetryScheduler {
    ...
}
A developer searches: PaymentRetryScheduler. BM25 immediately wins. Dense embeddings are unnecessary here. The identifier itself already contains extremely precise meaning.
The same applies to:
- Table names
- Exception classes
- Kafka topics
- Environment variables
- Feature flags
- API routes
- Protobuf names
- Terraform resources
If someone searches STRIPE_WEBHOOK_SECRET, you do not want “semantic similarity.” You want the exact match instantly.
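This is why tools like ripgrep, Lucene, Elasticsearch, and Sourcegraph still rely heavily on lexical matching. ripgrep matches exact strings and regular expressions without ranking at all, while Lucene and Elasticsearch use BM25 as their default scoring. And honestly, they should.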
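The frequency-and-rarity ranking described in this section can be sketched in a few lines. This is a minimal illustration, assuming the common BM25 formulation with k1 = 1.2 and b = 0.75 and a deliberately simple tokenizer that keeps identifiers whole. The class name `Bm25Sketch` and the tokenizer are assumptions made for this example, not the implementation of any particular search engine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class Bm25Sketch {
    // Commonly used defaults; both are tunable assumptions.
    static final double K1 = 1.2;
    static final double B = 0.75;

    // Lowercase, then split on anything that is not a letter, digit, or underscore.
    // Identifiers such as STRIPE_WEBHOOK_SECRET stay as a single token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9_]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Returns one BM25 score per document for the given query.
    static double[] score(List<String> docs, String query) {
        List<List<String>> tokenized = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        double totalLen = 0;
        for (String doc : docs) {
            List<String> tokens = tokenize(doc);
            tokenized.add(tokens);
            totalLen += tokens.size();
            // Count each term once per document for document frequency.
            for (String t : new HashSet<>(tokens)) docFreq.merge(t, 1, Integer::sum);
        }
        double avgLen = totalLen / docs.size();
        double[] scores = new double[docs.size()];
        for (String term : tokenize(query)) {
            int df = docFreq.getOrDefault(term, 0);
            if (df == 0) continue; // term appears nowhere, contributes nothing
            // Rarity: terms found in fewer documents get a larger weight.
            double idf = Math.log(1 + (docs.size() - df + 0.5) / (df + 0.5));
            for (int i = 0; i < docs.size(); i++) {
                List<String> tokens = tokenized.get(i);
                long tf = tokens.stream().filter(term::equals).count();
                // Frequency, saturated by K1 and normalized by document length.
                double norm = tf + K1 * (1 - B + B * tokens.size() / avgLen);
                scores[i] += idf * tf * (K1 + 1) / norm;
            }
        }
        return scores;
    }
}
```

With this tokenizer, a query for PaymentRetryScheduler scores only the documents that contain that exact identifier, and every other document scores zero. That is the behavior you want for table names, environment variables, and feature flags.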
Where BM25 starts breaking
The problem appears when the developer searches conceptually instead of lexically.
public class TokenBucketLimiter {
    public boolean allowRequest(User user) {
        ...
    }
}
The developer searches: where do we stop users from spamming requests
BM25 sees almost zero overlap:
| Query | Code |
|---|---|
| stop users | allowRequest |
| spamming | TokenBucket |
| requests | limiter |
Humans instantly understand the relationship. BM25 does not.
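The mismatch in the table can be checked mechanically. The sketch below assumes a simple analyzer that splits camelCase and lowercases, with no stemming. `OverlapSketch` and its method names are hypothetical, chosen only for this illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class OverlapSketch {
    // Split camelCase and punctuation into lowercase tokens,
    // e.g. "allowRequest" becomes [allow, request].
    static Set<String> tokens(String text) {
        Set<String> out = new LinkedHashSet<>();
        String spaced = text.replaceAll("([a-z0-9])([A-Z])", "$1 $2");
        for (String t : spaced.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Tokens that appear in both the query and the code.
    static Set<String> overlap(String query, String code) {
        Set<String> shared = tokens(query);
        shared.retainAll(tokens(code));
        return shared;
    }
}
```

For the query and class above, the shared set is empty: “users” and “requests” do not equal “user” and “request” without stemming. A stemmer would recover those two generic tokens, but nothing would connect “spamming” to a token bucket, which is the part of the query that actually identifies the code.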
Another real example: payment recovery
Suppose your repository contains:
- ChargeRecoveryWorker
- FailedPaymentOrchestrator
- RetryableStripeException
A developer searches retry failed stripe payments. BM25 works partially because of “stripe” and “retry.”
But what if the repository uses internal terminology?
- RevenueRecoveryPipeline
- BillingFailureCoordinator
- SoftDeclineProcessor
Now lexical overlap disappears. BM25 starts retrieving garbage:
- Generic retry utilities
- HTTP retry interceptors
- Kafka retry consumers
- Retryable database transactions
…instead of the actual billing recovery system.
This is where semantic retrieval helps
Embedding models attempt to solve this by representing queries and code as vectors whose distance reflects similarity of meaning rather than shared words. Conceptually, retry failed stripe payments lands near billing recovery, soft decline handling, and payment orchestration. This is the promise of dense vector search, and sometimes it works very well.
But dense embeddings introduce new problems
The problem is that codebases are not essays. They contain enormous semantic overlap.
Example. Query: retry webhook failures. Retrieved result: HttpRetryInterceptor. Technically related. Operationally useless.
Another example: query user onboarding emails might retrieve EmailTemplateRenderer instead of WelcomeSequenceWorkflow. The embedding model understands semantic neighborhood — but not repository intent. That distinction matters enormously in large systems.
Sparse retrieval is becoming more interesting
Sparse vector retrieval sits between BM25 and dense embeddings. Instead of producing opaque dense vectors, sparse models assign weights to individual vocabulary terms, including related terms that never appear in the original text. Think of it as controlled semantic expansion.
Query: stop users from spamming requests. Sparse expansion might internally weight terms like: rate limit, throttle, 429, bucket, quota, abuse. Now retrieval can match RateLimitInterceptor, TokenBucketLimiter, QuotaExceededException without abandoning lexical grounding entirely. That balance is extremely useful for code retrieval.
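To make the expansion concrete, here is a small sketch that scores identifiers against a weighted query vector with a dot product. The expansion weights are hand-written assumptions for illustration; a learned sparse model would produce its own. `SparseExpansionSketch` and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SparseExpansionSketch {
    // Sparse vector for an identifier: each camelCase part gets weight 1.0.
    static Map<String, Double> docVector(String identifier) {
        Map<String, Double> vector = new LinkedHashMap<>();
        String spaced = identifier.replaceAll("([a-z0-9])([A-Z])", "$1 $2");
        for (String t : spaced.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) vector.put(t, 1.0);
        }
        return vector;
    }

    // Query vector: original terms at full weight, optionally with expansion terms.
    static Map<String, Double> queryVector(boolean expanded) {
        Map<String, Double> q = new LinkedHashMap<>();
        for (String t : "stop users from spamming requests".split(" ")) q.put(t, 1.0);
        if (expanded) {
            // Hypothetical weights, invented for this example.
            q.put("rate", 0.8);
            q.put("limit", 0.8);
            q.put("throttle", 0.7);
            q.put("bucket", 0.6);
            q.put("quota", 0.6);
            q.put("429", 0.5);
            q.put("abuse", 0.4);
        }
        return q;
    }

    // Dot product over the terms the two sparse vectors share.
    static double score(Map<String, Double> query, Map<String, Double> doc) {
        double sum = 0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            sum += e.getValue() * doc.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }
}
```

Without expansion, all three identifiers score zero. With it, RateLimitInterceptor, TokenBucketLimiter, and QuotaExceededException each get a positive score through ordinary term matching, and every match can be traced back to a named term and its weight.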
Sparse retrieval preserves important exact signals
Suppose a developer searches S3MultipartUploadManager. Dense embeddings may retrieve upload services, storage abstractions, and blob handlers. Sparse retrieval preserves the actual identifier importance — because identifiers in code carry enormous semantic weight.
Developers do not search like normal users. They search using half-remembered class names, implementation concepts, framework jargon, and internal company terminology. Retrieval systems need to support all of these simultaneously.
Why hybrid retrieval usually wins
The best systems rarely choose one strategy. They layer them.
Query
→ BM25 candidate retrieval
→ Sparse semantic expansion
→ Repository graph expansion
→ Dense reranking
→ Context stitching
Example: “Where is user deletion handled?”
Imagine the repository contains:
- DeleteAccountWorkflow
- UserAnonymizationService
- GDPRCleanupJob
- S3AssetPurgeWorker
Step 1 — BM25 retrieves DeleteAccountWorkflow. Good start.
Step 2 — sparse expansion associates “delete user” with anonymization, cleanup, purge, retention, and GDPR. Now additional relevant systems appear.
Step 3 — graph expansion discovers the workflow: DeleteAccountWorkflow → GDPRCleanupJob → S3AssetPurgeWorker → BillingSubscriptionCanceller. Now the system understands the full operational workflow — massively more useful than retrieving isolated chunks.
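Step 3 can be sketched as a bounded breadth-first walk over a dependency graph, starting from the hits that earlier stages produced. The edges below are an assumption that reads each arrow in the workflow above as a “calls” relationship. `GraphExpansionSketch` and the hop limit are hypothetical choices for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class GraphExpansionSketch {
    // Hypothetical "calls" edges mirroring the workflow in the text.
    static final Map<String, List<String>> CALLS = Map.of(
        "DeleteAccountWorkflow", List.of("GDPRCleanupJob"),
        "GDPRCleanupJob", List.of("S3AssetPurgeWorker"),
        "S3AssetPurgeWorker", List.of("BillingSubscriptionCanceller"));

    // Breadth-first walk from the seed hits, at most maxHops edges deep.
    // The result keeps discovery order: seeds first, then each hop.
    static List<String> expand(List<String> seeds, int maxHops) {
        Set<String> seen = new LinkedHashSet<>(seeds);
        Deque<String> frontier = new ArrayDeque<>(seeds);
        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            Deque<String> next = new ArrayDeque<>();
            for (String node : frontier) {
                for (String callee : CALLS.getOrDefault(node, List.of())) {
                    // Set.add returns false for nodes already visited.
                    if (seen.add(callee)) next.add(callee);
                }
            }
            frontier = next;
        }
        return new ArrayList<>(seen);
    }
}
```

Starting from DeleteAccountWorkflow with a limit of three hops returns the whole chain in order. The hop limit matters in practice: in a densely connected repository, an unbounded walk would pull in far more than the workflow the developer asked about.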
Dense retrieval alone often collapses at scale
Small demos make embeddings look magical. Large repositories expose the weaknesses quickly. Imagine a monorepo containing retry middleware, retry queues, retry payment jobs, HTTP retry wrappers, Kafka retry consumers, and retry cron workers. A dense search for retry failed events may return 50 semantically “similar” chunks. But similarity is not enough. Developers need the correct operational subsystem — a much harder problem.
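One way to keep dense similarity from flooding the results is to let it reorder only a lexically retrieved shortlist, which is what the layered pipeline earlier does. The sketch below compares the two approaches. Every identifier and score in it is invented to show the mechanism; none of it is measured data, and `RerankSketch` is a hypothetical name.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class RerankSketch {
    static final class Candidate {
        final String name;
        final double lexical; // e.g. a BM25 score
        final double dense;   // e.g. a cosine similarity

        Candidate(String name, double lexical, double dense) {
            this.name = name;
            this.lexical = lexical;
            this.dense = dense;
        }
    }

    // Hypothetical candidates for the query "retry failed events".
    // Dense scores are deliberately close together, as in a retry-heavy monorepo.
    static final List<Candidate> SAMPLE = List.of(
        new Candidate("FailedEventRetryQueue", 7.1, 0.83),
        new Candidate("KafkaRetryConsumer", 2.2, 0.82),
        new Candidate("HttpRetryWrapper", 2.0, 0.84),
        new Candidate("RetryCronWorker", 1.9, 0.81));

    // Dense-only: rank everything by similarity alone.
    static List<String> denseOnly(List<Candidate> all, int k) {
        Comparator<Candidate> byDense = Comparator.comparingDouble(c -> c.dense);
        return all.stream()
            .sorted(byDense.reversed())
            .limit(k)
            .map(c -> c.name)
            .collect(Collectors.toList());
    }

    // Layered: keep the best lexical matches, then reorder them by similarity.
    static List<String> shortlistThenRerank(List<Candidate> all, int shortlist, int k) {
        Comparator<Candidate> byLexical = Comparator.comparingDouble(c -> c.lexical);
        Comparator<Candidate> byDense = Comparator.comparingDouble(c -> c.dense);
        return all.stream()
            .filter(c -> c.lexical > 0)
            .sorted(byLexical.reversed())
            .limit(shortlist)
            .sorted(byDense.reversed())
            .limit(k)
            .map(c -> c.name)
            .collect(Collectors.toList());
    }
}
```

With these invented numbers, dense-only ranking puts the generic HttpRetryWrapper first by a margin of 0.01, while the shortlist approach returns FailedEventRetryQueue. The layered version is only as good as its lexical stage, which is why the earlier sections on expansion matter.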
Chunking is often more important than embeddings
This part gets ignored constantly. Many retrieval failures are actually chunking failures.
Naive chunking — split every 500 tokens — creates disasters. Example: a chunk ends after validateCard(); and the rest of the function exists in another chunk. Now retrieval loses execution meaning entirely.
Better chunking means logical execution units: complete methods, service boundaries, controller + downstream calls, transaction scopes, dependency clusters. AST-aware chunking dramatically improves retrieval quality.
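The difference between the two chunking strategies is easy to see on a tiny class. The sketch below is a crude stand-in for AST-aware chunking: it finds method-level blocks by tracking brace depth, which a real implementation would replace with a proper parser. `ChunkingSketch`, the sample class, and the helper calls inside it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkingSketch {
    // Hypothetical source file used for the comparison.
    static final String SAMPLE = String.join("\n",
        "class Checkout {",
        "  void pay() {",
        "    validateCard();",
        "    charge();",
        "  }",
        "  void refund() {",
        "    if (eligible()) { reverse(); }",
        "  }",
        "}");

    // Naive baseline: cut every `size` characters, ignoring structure.
    static List<String> fixedSize(String source, int size) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < source.length(); i += size) {
            chunks.add(source.substring(i, Math.min(source.length(), i + size)));
        }
        return chunks;
    }

    // One chunk per method-level block, found by tracking brace depth.
    // Braces inside string literals or comments would confuse this;
    // that is the limitation a parser removes.
    static List<String> byMethod(String source) {
        List<String> chunks = new ArrayList<>();
        int depth = 0;
        int start = -1;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '{') {
                depth++;
                if (depth == 1) start = i + 1; // class body begins here
            } else if (c == '}') {
                depth--;
                if (depth == 1) { // a method-level block just closed
                    chunks.add(source.substring(start, i + 1).trim());
                    start = i + 1;
                }
            }
        }
        return chunks;
    }
}
```

A fixed-size cut that happens to land right after validateCard(); leaves charge(); in the next chunk, so neither chunk describes what pay() does. The brace-depth version returns pay() and refund() as two complete units, including the nested block inside refund().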
The real goal is repository understanding
The future is not “keyword search vs embeddings.” The future is repository cognition. Modern systems increasingly need to reconstruct execution flow, ownership boundaries, architectural relationships, dependency impact, and operational behavior.
That requires combining lexical precision, semantic expansion, structural reasoning, graph traversal, and repository-aware ranking — not just cosine similarity.
So is BM25 obsolete?
Not even close.
In fact, for many code retrieval tasks, BM25 remains one of the strongest signals available. The systems that work best today usually look like: BM25 for exactness, sparse retrieval for semantic expansion, dense reranking for contextual relevance, and graph traversal for repository awareness — not “replace everything with embeddings.”
Because code retrieval is fundamentally different from document retrieval. Software systems are not paragraphs. They are interconnected behavioral graphs disguised as text.