BM25 vs Sparse Vector Search for Code Retrieval

Most code search systems still rely on a simple assumption: developers search for code using the exact words already present in the repository.

That works surprisingly well — until it doesn't.

A developer asks: where do we invalidate expired sessions

But the implementation actually lives inside:

  • JwtTokenCleanupJob
  • SessionRevocationService
  • ExpiredCredentialInterceptor

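Suddenly lexical search starts struggling. The query shares only fragments with these names ("expired", "session"), and even those match only if the indexer splits identifiers into sub-tokens.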

This is where the discussion around BM25, dense embeddings, and, increasingly, sparse vector retrieval becomes important. Most explanations stay theoretical, but the trade-offs are much easier to understand when you look at concrete repository behavior.

Why BM25 still works extremely well for code

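BM25 is fundamentally lexical. It ranks documents by how often each query term appears in them (term frequency), how rare that term is across the corpus (inverse document frequency), and how long the document is relative to the average. That sounds primitive compared to AI embeddings, but codebases are full of highly valuable lexical signals.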
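
To make the scoring concrete, here is a minimal sketch of a BM25-style scorer over a tiny in-memory corpus. The class name Bm25Sketch, the token-list representation, and the choice of k1 = 1.2 and b = 0.75 are assumptions made for illustration (they are common defaults, not anything specific to a particular search engine), and the IDF uses the smoothed form ln(1 + (N - n + 0.5) / (n + 0.5)).

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration class; not taken from any library.
final class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation (assumed default)
    static final double B = 0.75;  // length-normalization strength (assumed default)

    /** Scores one document (a list of tokens) against the query over a small corpus. */
    static double score(List<String> query, List<String> doc, List<List<String>> corpus) {
        double avgLen = corpus.stream().mapToInt(List::size).average().orElse(1.0);
        double total = 0.0;
        for (String term : new HashSet<>(query)) {
            long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
            if (docsWithTerm == 0) {
                continue; // term never occurs in the corpus, so it contributes nothing
            }
            int tf = Collections.frequency(doc, term);
            // Rare terms get a larger weight than common ones.
            double idf = Math.log(1.0 + (corpus.size() - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
            // Long documents are penalized relative to the average length.
            double norm = tf + K1 * (1.0 - B + B * doc.size() / avgLen);
            total += idf * tf * (K1 + 1.0) / norm;
        }
        return total;
    }
}
```

In this sketch a document containing both "payment" and "retry" outranks one containing only "retry", and a document with neither term scores zero, which is exactly the behavior that makes exact identifiers so effective as queries.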

Example (Java)
public class PaymentRetryScheduler {
    ...
}

A developer searches: PaymentRetryScheduler. BM25 immediately wins. Dense embeddings are unnecessary here. The identifier itself already contains extremely precise meaning.

The same applies to:

  • Table names
  • Exception classes
  • Kafka topics
  • Environment variables
  • Feature flags
  • API routes
  • Protobuf names
  • Terraform resources

If someone searches STRIPE_WEBHOOK_SECRET, you do not want “semantic similarity.” You want the exact match instantly.

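This is why tools like ripgrep, Lucene, Elasticsearch, and Sourcegraph still rely heavily on lexical matching: ripgrep does plain regex matching with no ranking at all, while Lucene and Elasticsearch use BM25 as their default scoring function. And honestly, they should.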

Where BM25 starts breaking

The problem appears when the developer searches conceptually instead of lexically.

Example (Java)
public class TokenBucketLimiter {
    public boolean allowRequest(User user) {
        ...
    }
}

The developer searches: where do we stop users from spamming requests

BM25 sees almost zero overlap:

Query         Code
stop users    allowRequest
spamming      TokenBucket
requests      limiter

Humans instantly understand the relationship. BM25 does not.
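
The mismatch is easy to reproduce. The sketch below splits identifiers on camelCase boundaries, lowercases the pieces, and intersects the query tokens with the code tokens. The class and method names are hypothetical, and the tokenizer deliberately does no stemming, so "requests" and "request" count as different tokens (a stemming analyzer would recover that one match, but none of the others).

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not taken from any library.
final class IdentifierTokens {
    // Split on non-alphanumerics, on lower-to-upper boundaries (allowRequest),
    // and before the last capital of an acronym run (GDPRCleanup -> GDPR, Cleanup).
    private static final String BOUNDARY =
        "[^A-Za-z0-9]+|(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    /** Returns the lowercased sub-tokens of the given text, in first-seen order. */
    static Set<String> tokenize(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String part : text.split(BOUNDARY)) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Tokens that appear in both the query and the code. */
    static Set<String> overlap(String query, String code) {
        Set<String> shared = tokenize(query);
        shared.retainAll(tokenize(code));
        return shared;
    }
}
```

Calling overlap("where do we stop users from spamming requests", "TokenBucketLimiter allowRequest User user") yields an empty set under these assumptions, so a purely lexical ranker has nothing to score.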

Another real example: payment recovery

Suppose your repository contains:

  • ChargeRecoveryWorker
  • FailedPaymentOrchestrator
  • RetryableStripeException

A developer searches retry failed stripe payments. BM25 works partially because of “stripe” and “retry.”

But what if the repository uses internal terminology?

  • RevenueRecoveryPipeline
  • BillingFailureCoordinator
  • SoftDeclineProcessor

Now lexical overlap disappears. BM25 starts retrieving garbage:

  • Generic retry utilities
  • HTTP retry interceptors
  • Kafka retry consumers
  • Retryable database transactions

…instead of the actual billing recovery system.

This is where semantic retrieval helps

Embedding models attempt to solve this by mapping queries and code into a shared vector space, where closeness approximates similarity of meaning rather than overlap of exact words. Conceptually, retry failed stripe payments ends up near billing recovery, soft decline handling, and payment orchestration. This is the promise of dense vector search. And sometimes it works very well.
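
As a rough illustration of what "semantically related" means mechanically, the sketch below compares vectors by cosine similarity and returns the closest entry. The class name and the two-dimensional vectors in the usage note are made up for illustration; a real embedding model would produce vectors with hundreds of dimensions, and the numbers here are not the output of any model.

```java
import java.util.Map;

// Hypothetical illustration class; not taken from any library.
final class CosineSketch {
    /** Cosine of the angle between two vectors; 0.0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Name of the indexed vector most similar to the query, or null if the index is empty. */
    static String nearest(double[] query, Map<String, double[]> index) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : index.entrySet()) {
            double score = cosine(query, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

With made-up vectors such as {0.9, 0.1} for BillingFailureCoordinator and {0.1, 0.9} for HttpRetryInterceptor, a query vector of {0.8, 0.2} resolves to BillingFailureCoordinator even though the query and the class name share no words.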

But dense embeddings introduce new problems

The problem is that codebases are not essays. They contain enormous semantic overlap.

Example. Query: retry webhook failures. Retrieved result: HttpRetryInterceptor. Technically related. Operationally useless.

Another example: query user onboarding emails might retrieve EmailTemplateRenderer instead of WelcomeSequenceWorkflow. The embedding model understands semantic neighborhood — but not repository intent. That distinction matters enormously in large systems.

Sparse retrieval is becoming more interesting

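Sparse vector retrieval sits between BM25 and dense embeddings. Instead of producing opaque dense vectors, sparse models assign weights to individual vocabulary terms, most of them zero, and can give weight to related terms that never appear in the original text. Think of it like controlled semantic expansion.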

Query: stop users from spamming requests. Sparse expansion might internally weight terms like: rate limit, throttle, 429, bucket, quota, abuse. Now retrieval can match RateLimitInterceptor, TokenBucketLimiter, QuotaExceededException without abandoning lexical grounding entirely. That balance is extremely useful for code retrieval.
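
A minimal sketch of the matching step, assuming sparse vectors are stored as term-to-weight maps. The class name and every weight in the usage note are invented for illustration; in a real system the expansion weights would come from a learned sparse encoder, not be written by hand.

```java
import java.util.Map;

// Hypothetical illustration class; not taken from any library.
final class SparseSketch {
    /** Dot product of two sparse vectors; only terms present in both contribute. */
    static double dot(Map<String, Double> query, Map<String, Double> doc) {
        // Iterate over the smaller map so the cost depends on the shorter vector.
        Map<String, Double> small = query.size() <= doc.size() ? query : doc;
        Map<String, Double> large = (small == query) ? doc : query;
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : small.entrySet()) {
            sum += entry.getValue() * large.getOrDefault(entry.getKey(), 0.0);
        }
        return sum;
    }
}
```

If the expanded query carries, say, {rate: 1.0, limit: 1.0, throttle: 0.8, bucket: 0.5} and TokenBucketLimiter is indexed as {token: 1.0, bucket: 1.0, limiter: 1.0}, the score comes entirely from the shared term "bucket". Every non-zero contribution can be traced back to a named term, which is what keeps the result explainable.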

Sparse retrieval preserves important exact signals

Suppose a developer searches S3MultipartUploadManager. Dense embeddings may retrieve upload services, storage abstractions, and blob handlers. Sparse retrieval preserves the actual identifier importance — because identifiers in code carry enormous semantic weight.

Developers do not search like normal users. They search using half-remembered class names, implementation concepts, framework jargon, and internal company terminology. Retrieval systems need to support all of these simultaneously.

Why hybrid retrieval usually wins

The best systems rarely choose one strategy. They layer them.

Hybrid retrieval pipeline (example)
Query
  → BM25 candidate retrieval
  → Sparse semantic expansion
  → Repository graph expansion
  → Dense reranking
  → Context stitching

Example: “Where is user deletion handled?”

Imagine the repository contains:

  • DeleteAccountWorkflow
  • UserAnonymizationService
  • GDPRCleanupJob
  • S3AssetPurgeWorker

Step 1 — BM25 retrieves DeleteAccountWorkflow. Good start.

Step 2 — sparse expansion associates “delete user” with anonymization, cleanup, purge, retention, and GDPR. Now additional relevant systems appear.

Step 3 — graph expansion discovers the workflow: DeleteAccountWorkflow → GDPRCleanupJob → S3AssetPurgeWorker → BillingSubscriptionCanceller. Now the system understands the full operational workflow — massively more useful than retrieving isolated chunks.
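
Step 3 can be sketched as a bounded breadth-first traversal from a retrieved seed. The adjacency-map representation, the class name, and the maxHops parameter are assumptions for illustration; how the edges are extracted from the repository (call graph, imports, job scheduling config) is out of scope here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not taken from any library.
final class GraphExpansionSketch {
    /** Returns the seed plus every node reachable within maxHops edges, in BFS order. */
    static List<String> expand(String seed, Map<String, List<String>> edges, int maxHops) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        for (int hop = 0; hop <= maxHops && !frontier.isEmpty(); hop++) {
            int levelSize = frontier.size(); // nodes exactly `hop` edges from the seed
            for (int i = 0; i < levelSize; i++) {
                String node = frontier.poll();
                visited.add(node);
                for (String next : edges.getOrDefault(node, List.of())) {
                    if (seen.add(next)) { // skip nodes already queued, so cycles terminate
                        frontier.add(next);
                    }
                }
            }
        }
        return visited;
    }
}
```

Seeding the traversal with DeleteAccountWorkflow and the chain of edges described above, three hops are enough to reach BillingSubscriptionCanceller. The hop limit matters in practice: without it, expansion in a densely connected repository can pull in large parts of the codebase.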

Dense retrieval alone often collapses at scale

Small demos make embeddings look magical. Large repositories expose the weaknesses quickly. Imagine a monorepo containing retry middleware, retry queues, retry payment jobs, HTTP retry wrappers, Kafka retry consumers, and retry cron workers. A dense search for retry failed events may return 50 semantically “similar” chunks. But similarity is not enough. Developers need the correct operational subsystem — a much harder problem.

Chunking is often more important than embeddings

This part gets ignored constantly. Many retrieval failures are actually chunking failures.

Naive chunking — split every 500 tokens — creates disasters. Example: a chunk ends after validateCard(); and the rest of the function exists in another chunk. Now retrieval loses execution meaning entirely.

Better chunking means logical execution units: complete methods, service boundaries, controller + downstream calls, transaction scopes, dependency clusters. AST-aware chunking dramatically improves retrieval quality.
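
The difference shows up even in a toy comparison. The sketch below contrasts fixed-size line windows with a chunker that closes a chunk only when brace depth returns to zero. This is a stand-in for AST-aware chunking, not an implementation of it: it counts braces naively (braces inside strings or comments would confuse it), it treats each top-level block as one unit, and the class name is hypothetical. A real system would use a parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not taken from any library.
final class ChunkingSketch {
    /** Naive baseline: fixed-size windows of lines, ignoring structure. Assumes size > 0. */
    static List<String> fixedSize(List<String> lines, int size) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += size) {
            int end = Math.min(i + size, lines.size());
            chunks.add(String.join("\n", lines.subList(i, end)));
        }
        return chunks;
    }

    /** Closes a chunk only on a line whose closing brace returns the depth to zero. */
    static List<String> byBraceDepth(List<String> lines) {
        List<String> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int depth = 0;
        for (String line : lines) {
            current.add(line);
            boolean sawClose = false;
            for (char c : line.toCharArray()) {
                if (c == '{') {
                    depth++;
                } else if (c == '}') {
                    depth--;
                    sawClose = true;
                }
            }
            if (depth == 0 && sawClose) {
                chunks.add(String.join("\n", current));
                current.clear();
            }
        }
        if (!current.isEmpty()) {
            chunks.add(String.join("\n", current)); // trailing lines outside any block
        }
        return chunks;
    }
}
```

On a two-function input, fixedSize with a window of two lines separates validateCard(); from the rest of its function, while byBraceDepth keeps each function whole.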

The real goal is repository understanding

The future is not “keyword search vs embeddings.” The future is repository cognition. Modern systems increasingly need to reconstruct execution flow, ownership boundaries, architectural relationships, dependency impact, and operational behavior.

That requires combining lexical precision, semantic expansion, structural reasoning, graph traversal, and repository-aware ranking — not just cosine similarity.

So is BM25 obsolete?

Not even close.

In fact, for many code retrieval tasks, BM25 remains one of the strongest signals available. The systems that work best today usually look like this:

  • BM25 for exactness
  • Sparse retrieval for semantic expansion
  • Dense reranking for contextual relevance
  • Graph traversal for repository awareness

That is a long way from “replace everything with embeddings.”

Because code retrieval is fundamentally different from document retrieval. Software systems are not paragraphs. They are interconnected behavioral graphs disguised as text.