Why AI Gives You a Different Answer About Your Own Codebase Every Day
There is a specific kind of frustration that builds up slowly in engineering teams that have adopted AI coding tools. It is not the obvious failures — the hallucinated APIs, the clearly wrong implementations, the responses that are obviously off-base. Those are easy to catch and easy to attribute to model limitations.
This frustration is subtler. You ask the same architectural question twice with slightly different wording and get two completely different approaches. Both look reasonable in isolation. Neither is consistent with the other, and neither is necessarily consistent with what your team already uses. One developer's AI session produces a repository pattern for database access. Another developer's session, the same week, produces inline SQL. Both got merged. Now you have two patterns in the codebase and no canonical one.
This is the AI consistency problem. It is not random. It has a specific cause. And it gets worse as teams grow, because more sessions mean more divergence.
The same question, different days
A developer discussion that spread through engineering communities in 2025 captured the problem precisely:
Same question, same codebase, different days:
Monday:
"Generate a database access layer."
→ Clean repository pattern with a unit-of-work abstraction
Thursday:
"Generate a database access layer."
→ Raw SQL with inline connection strings
Both "work." Neither is consistent with the other.
Neither is consistent with what the team already uses.
This is not a failure of model capability. Both outputs represent reasonable approaches to the problem as stated. The model is not wrong in any absolute sense; it is just not anchored to your specific system. Without knowing that your team has an established data access convention, without knowing that a repository pattern is already in use across six services, the model generates whatever approach the current context and sampling produce. That varies.
Why AI answers about your codebase are inherently variable without grounding
The inconsistency is not a bug. It is an expected property of how these systems work without external grounding.
Why AI gives different answers about your codebase without grounded context:
-> no memory between sessions — each starts from zero
-> retrieval is probabilistic — different files surface on different prompts
-> the model is sampling from its training distribution, not from your system
-> phrasing variation changes which examples get retrieved
-> temperature and model behavior introduce genuine randomness
-> your CLAUDE.md describes conventions; the model generates interpretations of them
The first two items are the most important. Every session starts from zero: there is no memory of the repository pattern established in Monday's session when Thursday's session opens. And retrieval is probabilistic: slightly different phrasing changes which files and symbols get surfaced as context, which changes which patterns the model treats as precedent.
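The effect of phrasing on retrieval can be sketched with a toy example. This is a minimal illustration, not how any particular tool retrieves context: the file paths, the one-line summaries, and the word-overlap scoring are all hypothetical stand-ins for a real index and a real ranking function.

```python
import re

# Hypothetical one-line summaries standing in for an indexed codebase.
FILES = {
    "orders/repository.py": "order repository unit of work database access layer",
    "reports/export.py": "report export raw sql query database connection string",
    "billing/repository.py": "invoice repository database access layer",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(prompt, files, top_k=1):
    """Rank files by word overlap with the prompt (ties broken by path)."""
    query = tokenize(prompt)
    ranked = sorted(
        files.items(),
        key=lambda item: (-len(query & tokenize(item[1])), item[0]),
    )
    return [path for path, _ in ranked[:top_k]]


if __name__ == "__main__":
    for prompt in (
        "Generate a database access layer",
        "Write code to query the database with a connection",
    ):
        print(prompt, "->", retrieve(prompt, FILES))
```

Both prompts ask for roughly the same thing, yet the first surfaces a repository file and the second surfaces the raw SQL export file. Whatever surfaces is what the model treats as precedent.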
Even with a well-written CLAUDE.md that describes your conventions, the model is interpreting a text description and generating output that it believes matches that description. The same text can be interpreted multiple ways when the task is complex. And as context rot sets in over a long session, the conventions described at the start of the session get progressively less weight.
The underlying dynamic is that without persistent grounding in the actual codebase, AI answers about your system are probabilistic guesses. The model is sampling from a distribution of plausible answers shaped by its training data and whatever local context it retrieved. That distribution may contain your team's convention as its most likely output — but it also contains every other reasonable approach, and which one surfaces varies session to session.
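A rough way to picture this is temperature sampling over a handful of candidate approaches. The sketch below is an illustration under stated assumptions: the approach names and their scores are invented, and a real model samples token by token rather than picking a whole architecture at once.

```python
import math
import random

# Invented scores for how strongly an ungrounded model might favor each
# approach to "Generate a database access layer". Not measured values.
LOGITS = {
    "repository_pattern": 2.0,
    "raw_sql": 1.5,
    "active_record": 1.0,
}


def softmax(logits, temperature):
    """Turn scores into probabilities; higher temperature flattens them."""
    scaled = {name: score / temperature for name, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {name: math.exp(value - peak) for name, value in scaled.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def sample_approach(logits, temperature, rng):
    """Draw one approach according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


def simulate_sessions(count, temperature, seed):
    """Simulate independent sessions that share no memory of each other."""
    rng = random.Random(seed)
    return [sample_approach(LOGITS, temperature, rng) for _ in range(count)]


if __name__ == "__main__":
    print(softmax(LOGITS, temperature=1.0))
    print(simulate_sessions(10, temperature=1.0, seed=0))
```

Under these made-up scores the repository pattern is the single most likely output, but it wins only about half the time. The other sessions land on something else, which is the Monday versus Thursday effect in miniature.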
What architectural drift actually costs
Teams that catch this early think of it as an annoyance — a code review discipline problem. Teams that let it accumulate discover that it is a compounding liability.
What architectural drift from inconsistent AI output costs:
-> code review time: reviewers must catch convention violations
-> refactoring time: inconsistent patterns compound into systemic mess
-> onboarding confusion: new engineers see multiple "correct" approaches
-> debugging time: similar-looking code that behaves differently
-> trust erosion: developers stop trusting AI suggestions entirely
The trust erosion item is underappreciated. When AI-generated code consistently requires correction for convention violations, developers start adding mental overhead to every AI interaction: assume the output is probably using the wrong pattern, and add review time accordingly. This is a rational response to inconsistency. It also eliminates a significant fraction of the productivity benefit the tool was supposed to provide.
The onboarding problem compounds over time. A new engineer who joins the team and reads the codebase sees multiple "correct" approaches to the same problems. They do not know which one is canonical because there may not be a canonical one anymore — the codebase itself has diverged. Their mental model of "how we do things here" is incoherent because the codebase reflects multiple AI sessions' worth of different interpretations of the same conventions.
Consistency is an indexing problem, not a prompting problem
The instinctive fix for AI inconsistency is better prompting: more detailed CLAUDE.md, stricter editor rules, more explicit convention documentation. These help at the margin. They do not solve the structural problem.
The structural problem is that the AI is generating against a text description of your conventions rather than against the actual conventions as they exist in the code. The gap between description and reality is always present and always growing. As the codebase evolves, the CLAUDE.md lags behind. As the team grows, different developers' descriptions of the same convention diverge slightly. As sessions accumulate state, the initial convention description gets diluted.
The fix is to give the model access to the conventions as they actually exist in the code, not as they are described in a text file. When the model can see that six existing services use a repository pattern for data access, it does not need to be told "use a repository pattern" — it can observe that this is how the team works and generate code that follows suit. The convention is grounded in behavior, not described in words.
This is what semantic codebase retrieval does differently from keyword search or raw file access. It reconstructs the actual patterns the team uses from the code itself — the canonical implementations, the consistent structures, the repeated abstractions. When that reconstruction is served as context to an AI coding session, the model's output is anchored to what the codebase actually is, not to what someone wrote about it.
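As a very rough sketch of what "reconstructing patterns from the code itself" can mean, the example below counts which data access style each file appears to use and reports the dominant one. It is a deliberately crude heuristic, not a description of how semantic retrieval products work: the source snippets, the `Repository` naming rule, and the `execute` check are all assumptions made for illustration.

```python
import ast
from collections import Counter

# Hypothetical source files from four services.
SOURCES = {
    "orders/data.py": "class OrderRepository:\n    def get(self, id): ...\n",
    "billing/data.py": "class InvoiceRepository:\n    def get(self, id): ...\n",
    "users/data.py": "class UserRepository:\n    def get(self, id): ...\n",
    "reports/data.py": (
        "def load(conn):\n"
        "    return conn.execute('SELECT * FROM reports')\n"
    ),
}


def classify(source):
    """Label a file by the data access style it appears to use."""
    nodes = list(ast.walk(ast.parse(source)))
    for node in nodes:
        if isinstance(node, ast.ClassDef) and node.name.endswith("Repository"):
            return "repository_pattern"
    for node in nodes:
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "execute"
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            return "inline_sql"
    return "unknown"


def observed_conventions(sources):
    """Count how many files use each recognized style."""
    return Counter(classify(src) for src in sources.values())


def dominant_convention(sources):
    """Return (style, example paths) for the most common known style."""
    counts = observed_conventions(sources)
    counts.pop("unknown", None)
    if not counts:
        return None
    style, _ = counts.most_common(1)[0]
    examples = [p for p, src in sorted(sources.items()) if classify(src) == style]
    return style, examples


if __name__ == "__main__":
    print(observed_conventions(SOURCES))
    print(dominant_convention(SOURCES))
```

The point of the design is that the convention is derived from what the files do, and the example paths that come back are real precedents that can be handed to the model as context.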
Cross-team consistency is harder and more important
Individual inconsistency — one developer getting different answers on different days — is annoying but manageable with careful review discipline. Cross-team inconsistency is an architectural problem.
When multiple developers are each maintaining their own CLAUDE.md, their own context strategies, their own editor rules, they are each working from their own interpretation of the team's conventions. Those interpretations drift. The AI sessions each developer runs produce output consistent with their individual context — but inconsistent with each other.
A shared, managed semantic index solves this at the root. Every developer's AI sessions draw context from the same indexed representation of the same codebase. The starting point is identical. The conventions surfaced are the actual conventions in the code, not each developer's written interpretation of them. The result is not perfect consistency — sessions still vary, models still sample — but the baseline is shared and accurate rather than individual and approximated.
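The difference in starting points can be made concrete with a small sketch. Everything here is hypothetical: the developer names, their personal notes, and the contents of the shared index are invented, and the fingerprint is only a convenient way to check whether two sessions begin from the same context.

```python
import hashlib

# Hypothetical per-developer convention notes (individual context files).
PERSONAL_NOTES = {
    "alice": ["Use the repository pattern for data access."],
    "bob": ["Prefer repositories, but raw SQL is fine for reports."],
}

# Hypothetical entries served from one shared index of the codebase.
SHARED_INDEX = [
    "orders/repository.py: OrderRepository",
    "billing/repository.py: InvoiceRepository",
]


def starting_context(developer, use_shared_index):
    """Return the context a developer's session would start from."""
    if use_shared_index:
        return SHARED_INDEX
    return PERSONAL_NOTES[developer]


def context_fingerprint(snippets):
    """Hash the context so two sessions' starting points can be compared."""
    digest = hashlib.sha256()
    for snippet in sorted(snippets):
        digest.update(snippet.encode("utf-8"))
        digest.update(b"\0")  # separator so snippet boundaries matter
    return digest.hexdigest()


if __name__ == "__main__":
    for shared in (False, True):
        prints = {
            dev: context_fingerprint(starting_context(dev, shared))[:12]
            for dev in PERSONAL_NOTES
        }
        print("shared index" if shared else "personal notes", prints)
```

With personal notes the two fingerprints differ, so the sessions diverge before the model has generated anything. With the shared index they match, and the only variation left is the model's own sampling.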
What non-technical teams experience
The consistency problem is not limited to developers. When product managers, support leads, or operations teams use AI tools to ask questions about the system — "how does subscription cancellation work," "what happens to pending charges when an account is paused" — they experience the same inconsistency. Ask the same question on different days and get answers that differ in consequential ways.
For non-technical users, the inconsistency is harder to detect. A developer who knows the codebase can recognize when the AI's answer contradicts the actual implementation. A product manager relying on an AI explanation has no such check. They may make planning and scoping decisions based on whichever version of the system the AI described that day, which may or may not match how the system actually behaves.
Grounded context addresses this for non-technical users too. When answers are anchored to an indexed representation of the actual system rather than to probabilistic generation alone, the substance of the answer to "how does cancellation work" stays the same today as it was last week, because it is derived from the codebase, not guessed from training data. The wording may still vary; the described behavior should not.
Final take
AI giving different answers about your codebase on different days is not a model problem and it is not primarily a prompting problem. It is a grounding problem. The model has no persistent, accurate source of ground truth about your specific system. Without that anchor, every session produces a fresh probabilistic sample from the space of reasonable answers — consistent in quality, inconsistent in specifics.
The fix is not to describe your conventions more precisely in a text file. It is to give the model direct access to the conventions as they exist in the code. Semantic codebase retrieval makes AI answers about your system repeatable in the way that matters: not because the model stops sampling, but because every session starts from the same grounded representation of what the system actually is.