Kognita

Your CLAUDE.md Is Taxing Every Single Prompt

Every time you open a session, CLAUDE.md is loaded into the system prompt, and it stays there. With every message you send, every file Claude reads, and every tool call it makes, that file is re-sent and re-weighed. You pay for it on the very first turn, before you have asked anything, and on every turn after. It is the most easily ignored line item in your context budget because it never shows up as a mistake. It just quietly shrinks the room left for actual code.

With a small file, nobody notices. But all the pressure on CLAUDE.md is toward growth (more conventions, more module notes, more gotchas), because that is how you keep it accurate. And the larger it grows to stay true, the more of every request it eats.

It is a fixed cost on every request

The thing to internalize is that CLAUDE.md is not a one-time cost. It is a standing tax:

The same tokens, re-sent every turn
CLAUDE.md is paid on every turn

  Session start   -> CLAUDE.md loaded into the system prompt
  Every message   -> it is still there, re-sent, re-weighed
  Every tool call -> still there
  Whether or not the file is relevant to what you asked

  A 4,000-token CLAUDE.md across a 40-turn session is not
  4,000 tokens. It is 4,000 tokens carried through every turn,
  competing for attention the whole way.

This is the same dynamic we describe in context window vs. an indexed codebase: anything loaded into the prompt is loaded for the whole session, whether or not the current question needs it. A file describing your deployment setup is still consuming tokens while you fix a CSS bug.
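
To make the arithmetic in the diagram concrete, here is a minimal sketch of the standing cost. The 4,000-token file and the 40-turn session are the figures from the example above. The function name is illustrative, and the model is deliberately simple: it counts the file at full size on every turn and ignores any prompt caching a provider may apply.

```python
def standing_tokens(file_tokens: int, turns: int) -> int:
    """Total input tokens attributable to a file that rides along on every turn.

    Simplifying assumption: the file is counted at full size on each turn.
    """
    if file_tokens < 0 or turns < 0:
        raise ValueError("file_tokens and turns must be non-negative")
    return file_tokens * turns


# Figures from the example above: a 4,000-token file over a 40-turn session.
total = standing_tokens(file_tokens=4_000, turns=40)
print(f"{total:,} tokens carried across the session")  # 160,000 tokens carried across the session
```

The point of the sketch is the shape of the cost, not the exact bill: the total scales with the number of turns, even though the file itself never changes.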

Accuracy and budget pull in opposite directions

Here is the trap. The more accurately you want CLAUDE.md to describe a real, growing system, the bigger it has to be — and the bigger it is, the more it costs on every request and the more it crowds out the code you actually want the model looking at:

You cannot be both cheap and complete
The accuracy/budget trap

  Make CLAUDE.md MORE accurate
    -> add module map, conventions, gotchas, service list
    -> file grows to stay true as the codebase grows
    -> larger fixed cost on every request, less room for code

  Keep CLAUDE.md SMALL
    -> cheap per turn, leaves room for actual files
    -> but too thin to describe a real system accurately
    -> the model falls back to guessing

  There is no setting that is both cheap and complete.

No length escapes this. A thin file is cheap and inaccurate. A thick file is accurate-ish and expensive, and it pushes you toward the other failure mode: a file so long the model skims it, which is its own problem, covered in how long CLAUDE.md should be. The token tax and the length ceiling are the same wall seen from two sides.

The opportunity cost is the real bill

The raw token cost is the smaller problem. The bigger one is opportunity cost: every token spent on a standing description of the codebase is a token not spent on the actual files relevant to the task. Context windows are finite, and attention spreads thinner the more you fill them. A bloated CLAUDE.md does not just cost money — it actively degrades the model's focus on the code in front of it.

So you are paying twice: once in tokens for the file, and again in quality because the file displaced room the model needed for the real problem. And much of what the file describes is irrelevant to any given query — you are carrying the whole map to answer a question about one street.
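
The "whole map for one street" point can be sketched with a small tally. Everything in this example is hypothetical: the section names, the token counts (chosen to add up to a 4,000-token file), and the topic tags are made up for illustration, and real relevance is fuzzier than a tag match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    tokens: int
    topics: frozenset[str]


# Hypothetical CLAUDE.md sections with made-up token counts (4,000 in total).
SECTIONS = [
    Section("Deployment setup", 1_200, frozenset({"deploy", "ci"})),
    Section("Service list", 1_500, frozenset({"backend", "services"})),
    Section("Frontend conventions", 800, frozenset({"css", "frontend"})),
    Section("Gotchas", 500, frozenset({"backend", "deploy"})),
]


def split_by_relevance(sections: list[Section], task_topics: set[str]) -> tuple[int, int]:
    """Return (relevant_tokens, irrelevant_tokens) for a task's topics."""
    total = sum(s.tokens for s in sections)
    relevant = sum(s.tokens for s in sections if s.topics & task_topics)
    return relevant, total - relevant


# Fixing a CSS bug: only the frontend section shares a topic with the task.
relevant, wasted = split_by_relevance(SECTIONS, {"css"})
print(f"relevant: {relevant:,} tokens, carried but unused: {wasted:,} tokens")
# relevant: 800 tokens, carried but unused: 3,200 tokens
```

Under these assumed numbers, four fifths of the file is along for the ride on a CSS task, and the split would look different but similarly lopsided for a deployment task.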

Retrieval has no standing cost

The structural fix is to stop loading codebase knowledge as a fixed block and start retrieving it on demand. A semantic index puts no codebase content in the prompt until you ask, then returns only the slice relevant to the current query:

Fixed prompt tax vs. pay-per-query retrieval
Static load vs. retrieved-on-demand

  CLAUDE.md (static):
    -> fixed tokens, every turn, relevant or not
    -> grows with the codebase to stay accurate
    -> crowds the window before you ask anything

  Semantic retrieval (Kognita via MCP):
    -> no standing block of codebase text; nothing loaded until asked
    -> pulls only the chunks relevant to THIS query
    -> the index can be enormous; the prompt stays lean

The index itself can be arbitrarily large — millions of lines across dozens of repos — because none of it sits in the prompt by default. Only the chunks that match the question are pulled in. The budget you spend tracks the question, not the size of the codebase.
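
As a rough illustration of retrieve-on-demand, the toy below scores a handful of knowledge chunks against a query and returns only the best match. This is not how Kognita works internally: a real semantic index would use embeddings rather than the bag-of-words cosine similarity used here, and the chunks, query, and function names are all invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word split; a rough stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return up to k chunks with non-zero similarity to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in ranked[:k] if score > 0]


# Hypothetical knowledge chunks that would otherwise all sit in a static file.
CHUNKS = [
    "Deployment: services are deployed with a blue/green rollout script.",
    "Styling: button styles live in the shared css theme file.",
    "Database: migrations run before the api service starts.",
    "Auth: session tokens are rotated by the auth service.",
]

hits = retrieve("why is the button css wrong", CHUNKS, k=1)

# Word counts as a crude proxy for prompt size.
static_words = sum(len(tokenize(c)) for c in CHUNKS)
retrieved_words = sum(len(tokenize(c)) for c in hits)
print(hits)
print(f"static load: {static_words} words, retrieved: {retrieved_words} words")
```

Adding more chunks to `CHUNKS` increases `static_words` but leaves the retrieved size bounded by `k`, which is the property the section describes: the prompt cost follows the question rather than the size of the index.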

Where Kognita fits

Kognita keeps the codebase representation server-side as a continuously updated semantic index and serves relevant context to the agent through MCP only when a query needs it. There is no standing block consuming your window every turn, no file that has to grow to stay accurate, and no tradeoff between completeness and cost. The index holds everything; the prompt holds only what this question requires. Keep CLAUDE.md small and stable for genuine global norms, and let retrieval carry the rest without taxing every request.

Final take

A CLAUDE.md that is complete enough to be useful is large enough to be expensive, and you pay that price on every turn whether the file is relevant or not. The instinct to keep adding to it is exactly what makes it cost more and leaves the model less focused.

Codebase knowledge should be a query, not a constant. Retrieve what the question needs; stop paying for the whole map on every request.