Reranking vs Cosine Similarity

Most AI retrieval systems still look roughly like this: embed the query, compare it to precomputed document vectors with cosine similarity, and return the top‑k chunks. That pipeline is simple, fast, and scalable, which is why it became the default. As repositories grow, a second stage becomes critical: hybrid lexical + dense retrieval for recall, then reranking for precision. Understanding bi‑encoders vs cross‑encoders is central to how modern code search actually behaves, and it connects directly to why retrieval is the bottleneck in large systems.

The default embedding pipeline

query
  ↓
embedding model
  ↓
cosine similarity
  ↓
top-k chunks

The core problem

Suppose a developer searches for “retry failed Stripe charges”. Your repository may contain many “retry-shaped” symbols:

HttpRetryInterceptor
KafkaRetryConsumer
WebhookRetryHandler
FailedPaymentRecoveryWorkflow

A standard embedding search may retrieve several of them because they are semantically similar — but only one is operationally correct. Cosine similarity optimizes for general neighborhood overlap, not billing recovery intent.

What cosine similarity actually does

Most embedding retrieval uses bi‑encoders: the query and each chunk are embedded independently, then compared with cosine similarity. Conceptually, similar directions in vector space mean “semantically related.” That works surprisingly well at scale because documents are embedded ahead of time — retrieval becomes nearest‑neighbor search over millions of chunks with one query embedding call.
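The comparison step itself is small enough to sketch in pure Python. The block below is a minimal illustration, not a production retriever: the three‑dimensional vectors are made‑up stand‑ins for real embeddings, and a real system would replace the linear scan in `top_k` with an approximate nearest‑neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical precomputed document vectors (real embeddings have
# hundreds of dimensions and come from an embedding model).
DOC_VECS = {
    "HttpRetryInterceptor": [0.9, 0.1, 0.2],
    "KafkaRetryConsumer": [0.8, 0.0, 0.4],
    "FailedPaymentRecoveryWorkflow": [0.6, 0.7, 0.1],
}

# Hypothetical query vector: the only embedding computed at query time.
QUERY_VEC = [0.7, 0.5, 0.1]

results = top_k(QUERY_VEC, DOC_VECS, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Note that the document vectors never change per query. That is the property that makes bi‑encoder retrieval cheap at scale.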

Why bi‑encoders lose interaction detail

The query and chunk never meet during encoding. The model never directly compares query ↔ document tokens; it compresses each side into a fixed‑size vector. That compression loses nuance: execution meaning, repository intent, exact phrasing, and operational specificity — especially where abstractions overlap and naming varies (chunking makes this worse).

Example failure

For the query “where do we retry failed Stripe charges”, chunks like RetryableHttpClient, KafkaRetryScheduler, and FailedPaymentRecoveryWorkflow may all score highly. A bi‑encoder may rank generic infrastructure above the actual recovery workflow because the vector representation is too coarse for operational disambiguation.

Enter reranking

Reranking adds a second‑stage relevance model. Instead of asking “what is generally similar?”, it asks “what best answers this exact query?” Most rerankers use cross‑encoders, which process query and document together and emit a relevance score — enabling token interaction, fine phrasing, and contextual relationships.

Why cross‑encoders are slower

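Cross‑encoders score the query against each candidate directly, and none of that work can be precomputed because every score depends on the query. If you retrieve 100 chunks, the reranker runs 100 forward passes, one per query‑chunk pair, which is expensive compared to a single ANN search. That is why production systems combine both.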
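To make the cost gap concrete, here is a rough count of query‑time model calls under each strategy. It is only back‑of‑the‑envelope arithmetic: the function name and the corpus size are illustrative assumptions, and it counts forward passes only, ignoring index lookup time and batching.

```python
def model_calls_per_query(corpus_size: int, pool_size: int) -> dict:
    """Count model forward passes needed at query time under each strategy."""
    return {
        # Documents were embedded offline, so only the query is encoded.
        "bi_encoder_only": 1,
        # Scoring every chunk jointly with the query: one pass per pair.
        "cross_encoder_only": corpus_size,
        # One query embedding, then one pass per pooled candidate.
        "two_stage": 1 + min(pool_size, corpus_size),
    }

# Assumed sizes: a million chunks, a pool of 100 candidates.
calls = model_calls_per_query(corpus_size=1_000_000, pool_size=100)
for strategy, count in calls.items():
    print(f"{strategy}: {count} forward passes")
```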

Two-stage retrieval (common pattern)

Query
  ↓
Bi-encoder retrieval
  ↓
Top 100 candidates
  ↓
Cross-encoder reranking
  ↓
Top 5 final chunks
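The pattern above can be sketched as a single function that takes both scorers as parameters. This is a minimal sketch under stated assumptions: `two_stage_search`, `cheap_score`, and `pair_score` are hypothetical names, and the two score tables are invented values standing in for real bi‑encoder and cross‑encoder outputs.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]

def two_stage_search(
    query: str,
    chunks: List[str],
    cheap_score: Scorer,
    pair_score: Scorer,
    pool_size: int = 100,
    final_k: int = 5,
) -> List[Tuple[str, float]]:
    """Generate candidates with a cheap scorer, then rerank with an expensive one."""
    # Stage 1: cheap scoring over the whole corpus, keep only the pool.
    pool = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)[:pool_size]
    # Stage 2: the expensive scorer only ever sees the pooled candidates.
    reranked = sorted(
        ((c, pair_score(query, c)) for c in pool),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:final_k]

# Hypothetical scores standing in for real model outputs.
CHEAP = {
    "HttpRetryInterceptor": 0.82,
    "KafkaRetryConsumer": 0.80,
    "WebhookRetryHandler": 0.78,
    "FailedPaymentRecoveryWorkflow": 0.75,
    "UserProfileSerializer": 0.10,
}
PAIR = {
    "HttpRetryInterceptor": 0.20,
    "KafkaRetryConsumer": 0.15,
    "WebhookRetryHandler": 0.55,
    "FailedPaymentRecoveryWorkflow": 0.95,
    "UserProfileSerializer": 0.01,
}

def cheap_score(query: str, chunk: str) -> float:
    return CHEAP[chunk]  # stand-in for cosine similarity of embeddings

def pair_score(query: str, chunk: str) -> float:
    return PAIR[chunk]  # stand-in for a cross-encoder relevance score

QUERY = "retry failed Stripe charges"
final = two_stage_search(QUERY, list(CHEAP), cheap_score, pair_score, pool_size=4, final_k=2)
print(final)
```

With these toy numbers the recovery workflow is fourth in the first stage and first after reranking. Shrink `pool_size` to 3 and it never reaches the second stage at all.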

Why this matters more for code than documents

Code retrieval has heavy semantic overlap, repeated abstractions, duplicated patterns, infrastructure noise, and shared terminology. The word “retry” may appear across HTTP clients, payment systems, Kafka workers, cron schedulers, and queue handlers. Cosine similarity struggles to separate conceptual similarity from operational relevance; cross‑encoders help — but only if the right candidate is in the initial pool.

The hidden weakness of rerankers

Rerankers still depend on candidate retrieval quality. If the correct chunk never appears in the initial top‑k — because naming differs, embeddings miss it, or chunking fragmented the workflow — the reranker never sees it. No amount of reranking recovers a missing candidate.
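One way to check for this failure is to measure first‑stage recall at the pool size before any reranking happens. The sketch below assumes you have a small set of queries with known relevant chunks; the ranking shown is a made‑up example, not measured output.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the first k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    found = set(ranked_ids[:k]) & relevant
    return len(found) / len(relevant)

# Hypothetical first-stage ranking for "retry failed Stripe charges".
first_stage = [
    "HttpRetryInterceptor",
    "KafkaRetryConsumer",
    "WebhookRetryHandler",
    "RetryableHttpClient",
    "FailedPaymentRecoveryWorkflow",
]
relevant = {"FailedPaymentRecoveryWorkflow"}

# With a pool of 3 the reranker can never surface the right chunk.
print(recall_at_k(first_stage, relevant, k=3))
# With a pool of 5 the right chunk is at least available to rerank.
print(recall_at_k(first_stage, relevant, k=5))
```

If recall at the pool size is low, a better reranker will not help. The fix belongs upstream, in chunking or candidate generation.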

Layered dependency

chunking quality
  ↓
retrieval quality
  ↓
reranking quality
  ↓
final reasoning quality

The important insight

Cosine similarity answers “what is generally semantically related?” Reranking answers “what is most relevant for this query?” Those are different problems. Reranked systems often feel dramatically smarter because retrieval precision improved — not necessarily because the base model improved.

Bi‑encoders vs cross‑encoders

Bi‑encoder: fast, scalable, vector‑search friendly, precomputable. Best for candidate generation.

Cross‑encoder: highly precise, query‑aware, slower. Best for reranking top candidates.

The future probably combines multiple layers

Strong systems increasingly look like layered pipelines — lexical and sparse recall, dense retrieval, cross‑encoder reranking, and graph expansion — because repository retrieval is less about “similar text” and more about reconstructing operational meaning.
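Combining a lexical ranking with a dense one needs some merge rule. One widely used option is reciprocal rank fusion (RRF), which sums `1 / (k + rank)` across rankings. The post does not prescribe a fusion method, so treat this as one possible sketch; the two input rankings are invented, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list using RRF scores."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for ids it ranks near the top.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical rankings from a lexical retriever and a dense retriever.
lexical = ["FailedPaymentRecoveryWorkflow", "WebhookRetryHandler", "HttpRetryInterceptor"]
dense = ["HttpRetryInterceptor", "FailedPaymentRecoveryWorkflow", "KafkaRetryConsumer"]

fused = reciprocal_rank_fusion([lexical, dense])
print(fused)
```

RRF uses only rank positions, so it sidesteps the problem that lexical scores and cosine scores live on different scales. The fused list then becomes the candidate pool for the reranker.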

Where the field is heading (conceptual)

Query
  ↓
BM25 retrieval
  ↓
Sparse retrieval
  ↓
Bi-encoder retrieval
  ↓
Cross-encoder reranking
  ↓
Graph-aware expansion
  ↓
LLM reasoning

Final takeaway

Embeddings and cosine similarity made semantic retrieval possible at scale — but they are fundamentally approximate. Reranking exists because semantic similarity alone is not enough when repositories contain overlapping abstractions and operational relevance requires finer reasoning. Bi‑encoders optimize speed; cross‑encoders optimize precision; modern retrieval increasingly needs both.