Code Embeddings Find Similar Code. They Don't Know What Your Code Does.
Code embeddings are the retrieval engine behind Cursor, Copilot, and most AI coding tools. They convert source code into high-dimensional vectors and use cosine similarity to find "related" code when you ask a question. This works well for navigation — finding code that shares vocabulary and structure with your query. It works poorly for the harder question that developers actually need answered during debugging: "does this code do what I think it does?" Similarity is not equivalence, and the gap between them is where wrong AI answers are generated.
What embeddings actually encode
A code embedding encodes the textual content of a chunk: for code, that means the vocabulary, structure, and patterns present in it. It is good at capturing surface-level semantic relationships: functions that deal with "authentication" cluster together, payment-related code clusters together, retry logic clusters together. It is not good at capturing behavioral properties that are not expressed in the text:
What a code embedding captures:
→ Text patterns: variable names, function names, comments
→ Structural patterns: how the code is organized
→ Domain vocabulary: "auth", "payment", "user", "retry"
→ Style patterns: common idioms in the same language
What a code embedding does NOT capture:
→ What the function actually returns at runtime
→ Whether two similar-looking functions are equivalent
→ Which of two nearly identical functions has the bug
→ Side effects that don't appear in the function body
→ The reason this code exists (intent, not implementation)
This is not a flaw to be fixed; it is the nature of how embeddings work. They are trained on text and encode textual patterns. Code intent, side effects, and runtime behavior are often not encoded in the text of a function. A function named handleRequest does not tell you what request, what handling, or what happens when it fails. The embedding captures the pattern; the meaning requires context that is not in the chunk.
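As a minimal sketch of behavior living outside the chunk, consider two handlers built from exactly the same function text. The names make_handler and fail_open are hypothetical, chosen only for illustration. Any embedding computed from the handler's text alone would be identical for both, yet they return opposite answers for the same request because the deciding value is captured state.

```python
def make_handler(fail_open: bool):
    """Build a handler; fail_open is captured state, not part of the handler's text."""
    def handle_request(request: dict) -> bool:
        # Allow or reject requests without a token, depending on captured config.
        if "token" not in request:
            return fail_open
        return True
    return handle_request


strict = make_handler(fail_open=False)
lenient = make_handler(fail_open=True)

# Both handlers are compiled from the same source text.
same_text = strict.__code__ == lenient.__code__

anonymous = {"path": "/admin"}
strict_result = strict(anonymous)    # rejects the anonymous request
lenient_result = lenient(anonymous)  # allows the same request

print(same_text, strict_result, lenient_result)
```

The point is not that closures are unusual, but that the information needed to tell these two handlers apart is simply absent from the text an embedding model would see.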
Similar-looking code can do opposite things
The most dangerous consequence of similarity-based retrieval is returning code that looks like what you need but does something different or opposite. Two functions that share the same structure, vocabulary, and domain can have completely contrary effects:
Two functions with high cosine similarity:
processUserPayment(userId, amount):
→ validates card, charges stripe, updates DB, sends receipt
processRefund(userId, amount):
→ validates card, reverses stripe charge, updates DB, sends notification
Embedding similarity: very high (same vocabulary, same structure)
Behavioral equivalence: opposite (charge vs. refund)
Risk if swapped: catastrophic
This kind of high-similarity / wrong-behavior retrieval is responsible for a subset of AI coding errors that look like hallucination but are not. The model retrieved the wrong function (not an invented one) and used it confidently because the embedding similarity was high. This connects to the broader retrieval problem described in why reranking is needed beyond cosine similarity.
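The effect can be illustrated with a toy stand-in for an embedding. The sketch below uses a bag-of-tokens vector, which is an assumption made for the sake of a self-contained example and far cruder than a learned embedding model. The helper names toy_embed and cosine and the unrelated renderSidebar description are hypothetical. Even this crude vector scores the payment and refund descriptions as close, because most of their tokens are shared.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Bag-of-tokens vector: a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


payment = ("processUserPayment(userId, amount): validates card, "
           "charges stripe, updates DB, sends receipt")
refund = ("processRefund(userId, amount): validates card, "
          "reverses stripe charge, updates DB, sends notification")
unrelated = "renderSidebar(theme): loads icons, computes layout, paints menu"

sim_refund = cosine(toy_embed(payment), toy_embed(refund))
sim_unrelated = cosine(toy_embed(payment), toy_embed(unrelated))

print(f"payment vs refund:    {sim_refund:.2f}")
print(f"payment vs unrelated: {sim_unrelated:.2f}")
```

Nothing in either score says that one function moves money in and the other moves it out; the similarity only reflects shared vocabulary.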
The dead code and duplicate function problem
Legacy codebases accumulate deprecated functions, renamed utilities, and multiple versions of the same capability at different stages of migration. Code embeddings treat all of them similarly because they encode the same patterns. They cannot distinguish an active function from a deprecated wrapper or an abandoned utility that was never removed:
Duplicate function retrieval problem:
sendNotificationEmail(userId, message) — active, used by 8 callers
sendNotificationEmail_v2(userId, message) — dead code, deprecated Q2
sendEmailNotification(userId, template) — new version, replaces both
Embedding similarity: all three score > 0.92
Index retrieval: returns all three with similar confidence
AI answer: suggests the wrong one 2/3 of the time
This is the mechanism behind AI dependency hallucination in codebases: the AI confidently suggests using a function that exists in the codebase but should not be called. From the embedding perspective, all three versions scored nearly identically. Only behavioral context (which one has active callers, which one was deprecated, which one is the current interface) allows distinguishing them.
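One of those signals, whether a function has callers at all, can be approximated statically. The following is a minimal sketch using Python's standard ast module on a single hypothetical source string. The snake_case function names are Python analogues of the ones above, and the call sites are invented for illustration. It only sees direct calls by name within one file, so a zero count is a hint that something may be orphaned, not proof.

```python
import ast
from collections import Counter

# Hypothetical module contents, invented for illustration.
SOURCE = '''
def send_notification_email(user_id, message): ...
def send_notification_email_v2(user_id, message): ...
def send_email_notification(user_id, template): ...

def on_signup(user_id):
    send_email_notification(user_id, "welcome")

def nightly_digest(user_id):
    send_notification_email(user_id, "digest")

def legacy_job(user_id):
    send_notification_email(user_id, "hello")
'''


def count_call_sites(source: str) -> Counter:
    """Count direct by-name calls to each function defined in the source."""
    tree = ast.parse(source)
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    calls = Counter({name: 0 for name in defined})
    for node in ast.walk(tree):
        # Only plain calls like f(...) are counted; method calls are ignored.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in defined:
                calls[node.func.id] += 1
    return calls


calls = count_call_sites(SOURCE)
for name in ("send_notification_email",
             "send_notification_email_v2",
             "send_email_notification"):
    print(name, calls[name])
```

Entry points such as on_signup also show zero callers here, which is why a caller count works better as one signal among several than as a verdict on its own.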
Why this matters for AI coding quality
AI coding tools are only as good as the context they retrieve. If retrieval surfaces the wrong function because it looks similar to the right one, the model generates code that uses the wrong function — confidently, because the retrieved context said it was the right place to look. The developer sees a plausible-looking suggestion with no indication that the underlying retrieval was wrong.
In a well-maintained codebase with clean naming and no dead code, this problem is manageable. In a real codebase with three years of accumulated decisions, deprecated wrappers, renamed modules, and utility functions of ambiguous status, it is a constant source of subtle errors that are difficult to diagnose because they look like model mistakes rather than retrieval failures.
What behavioral enrichment adds to embedding-based indexing
The solution is not replacing embeddings — they are the right tool for semantic navigation. The solution is adding a layer of behavioral analysis that captures what embeddings cannot:
Semantic enrichment adds:
→ Caller graphs: who calls this function (not just what it calls)
→ Data flow: what state it reads vs. what it writes
→ Service ownership: which service this behavior belongs to
→ Deprecation signals: is this actively used or orphaned?
→ Execution frequency: hot path vs. edge case
Kognita's indexing pipeline adds this enrichment layer. The embeddings handle semantic retrieval; the behavioral analysis adds the signals that allow distinguishing between similar-looking functions: caller graphs, data flow patterns, service ownership, and deprecation signals. The result is retrieval that is more likely to surface the function you should call, not just the function that looks most like the one you described.
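To make the idea concrete, here is a rough sketch of how behavioral signals could adjust a similarity ranking. It is not a description of Kognita's pipeline: the Candidate fields, the rerank function, the penalty weights, and the example scores and caller counts are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    similarity: float        # score returned by the embedding index
    caller_count: int        # from a caller graph
    deprecated: bool         # from deprecation signals
    current_interface: bool  # marked as the interface callers should use


def rerank(candidates: list[Candidate]) -> list[Candidate]:
    """Order candidates by similarity adjusted with behavioral signals."""
    def score(c: Candidate) -> float:
        adjusted = c.similarity
        if c.deprecated:
            adjusted -= 0.5   # arbitrary weight, assumed for this sketch
        if c.caller_count == 0:
            adjusted -= 0.2   # likely orphaned
        if c.current_interface:
            adjusted += 0.1   # prefer the interface new code should call
        return adjusted
    return sorted(candidates, key=score, reverse=True)


# Hypothetical retrieval results: all three look alike to the index.
retrieved = [
    Candidate("sendNotificationEmail_v2", 0.95, 0, True, False),
    Candidate("sendNotificationEmail", 0.94, 8, False, False),
    Candidate("sendEmailNotification", 0.93, 3, False, True),
]

ranked = rerank(retrieved)
for c in ranked:
    print(c.name)
```

With similarity alone, the deprecated variant would rank first in this example; with the adjustments, it drops to last and the current interface moves to the top. In practice the weights would need tuning against real retrieval results rather than being fixed constants.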
Final take
Code embeddings are the right foundation for AI coding tools. They enable the kind of conceptual retrieval — "find me everything related to retry logic" — that transforms AI coding quality compared to keyword search. But similarity is not the same as correctness, and the gap matters more as codebases grow and accumulate overlapping, deprecated, and ambiguous code.
Better AI coding answers require knowing not just what looks like what, but what does what — and that requires behavioral context beyond what embeddings alone can encode.