Your Codebase Documentation Is Always Out of Date. That's Not a Discipline Problem.

The new engineer finds the architecture diagram in Confluence. It shows three services communicating through a message queue. They start working. Three weeks in, they discover a fourth service that does not appear on the diagram, a second queue that handles high-priority events separately, and a webhook integration with a third-party payment processor that was added six months ago. The diagram was not wrong when it was written. It just never kept up.

This is the standard experience on teams that have shipped at any real velocity. Documentation sprints happen. READMEs get written. Architecture diagrams go into Confluence. Six months later, the service split in two, the queue got replaced, a new integration appeared, and the documentation is now a snapshot of a system that no longer exists.

The instinct is to blame discipline. If developers updated the docs with every PR, this would not happen. But that instinct misunderstands the problem. Documentation decay is not a discipline failure. It is a math problem.

Why documentation decays: the math does not work

Documentation is a static artifact. It describes a system at a point in time. The system is dynamic. Every change — a new service, a renamed queue, a replaced integration, a refactored workflow — creates a gap between the document and reality. That gap is documentation debt.

In a slow-moving codebase, documentation debt accumulates slowly. Teams have time to pay it down. In a fast-moving codebase, the debt accumulates faster than anyone can repay it. Not because developers are lazy. Because the rate of change exceeds the rate of writing.
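The arithmetic can be made concrete with a toy model. This is a minimal sketch, not a measurement: the weekly rates are invented for illustration, and `documentation_debt` is a hypothetical helper that counts debt in undocumented changes.

```python
def documentation_debt(changes_per_week: float,
                       doc_updates_per_week: float,
                       weeks: int) -> list[float]:
    """Return outstanding documentation debt at the end of each week.

    Debt is counted in undocumented changes and never drops below zero.
    """
    debt = 0.0
    history = []
    for _ in range(weeks):
        # Each week adds new changes and pays down as many as the team can document.
        debt = max(0.0, debt + changes_per_week - doc_updates_per_week)
        history.append(debt)
    return history


# Assumed rates: both teams can document 3 changes per week.
slow_team = documentation_debt(changes_per_week=2, doc_updates_per_week=3, weeks=26)
fast_team = documentation_debt(changes_per_week=10, doc_updates_per_week=3, weeks=26)

print(slow_team[-1])  # 0.0
print(fast_team[-1])  # 182.0
```

Under these assumed rates both teams put the same effort into writing. The slow team stays at zero debt, while the fast team is 182 undocumented changes behind after six months.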

Documentation is also not on the critical path. A merged pull request does not fail because the Confluence page was not updated. A deployment does not halt because the architecture diagram still shows the old queue topology. The system ships. The documentation drifts. There is no forcing function that makes them stay in sync.

The teams most likely to have accurate documentation are the ones shipping least. The teams most likely to have stale documentation are the ones shipping most. That is exactly backwards from what a fast-moving team needs, and because the cause is structural, no process change fixes it.

The documentation lifecycle: every codebase has all of these simultaneously

Birth

Documentation starts accurate. A new project launches. A documentation sprint produces READMEs, architecture diagrams, ADRs. The team feels good about this. The information is correct because the system was just built and nobody has changed it yet.

Middle age

Accuracy drifts over the next few months. Small changes accumulate. A service gets a new dependency. A queue gets renamed. A feature flag changes default behavior. Most documentation stays roughly right. Someone with experience can still use it with reasonable success if they know to treat certain details skeptically.

Decay

A major refactor, a service split, or a significant integration change breaks accuracy in ways that are not obvious from the document itself. The documentation still looks authoritative. It has headings, diagrams, decision records. But what it describes stopped matching the real system at some point nobody can precisely identify.

Death

The documentation actively misleads. An on-call engineer consults the architecture diagram during an incident and makes a decision based on a topology that no longer exists. A product manager writes a spec based on a service description that is fourteen months stale. A new engineer builds a mental model from documentation that contradicts the actual system. At this stage every use of the document costs the team time, so the original documentation sprint has become a liability rather than an investment.

Most large codebases have documentation in all four of these stages simultaneously. The troubling part is that the stages are not labeled. There is no badge on a Confluence page that says "this section is in the death stage." It all looks the same from the outside.

Documentation decay timeline: a realistic service evolution example

Service evolution — what becomes stale and when:

Month 0: single order-service, one RabbitMQ queue, Stripe webhook
  docs: accurate

Month 3: payment-service split from order-service, retry logic added
  docs: partially accurate (mentions order-service, misses payment-service)

Month 6: high-priority queue added (separate from standard queue),
  new webhook integration with Avalara for tax calculation
  docs: significantly misleading (wrong queue topology, missing integration)

Month 9: order-service decomposed into order-service and fulfillment-service,
  Stripe replaced by Adyen, Avalara still present
  docs: actively harmful (describes Stripe flow that no longer exists,
  omits fulfillment-service entirely)
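The same timeline can be expressed as a diff between what the docs claim and what exists. This is a rough sketch: the component labels (such as `priority-queue` and `adyen`) are shorthand invented here for the items in the timeline, and modeling a system as a flat set of names is a deliberate simplification.

```python
# What the documentation describes: a snapshot taken at month 0.
DOCUMENTED = {"order-service", "rabbitmq-queue", "stripe-webhook"}

# What actually exists at each point in the timeline (labels are illustrative).
SYSTEM_BY_MONTH = {
    0: {"order-service", "rabbitmq-queue", "stripe-webhook"},
    3: {"order-service", "payment-service", "rabbitmq-queue", "stripe-webhook"},
    6: {"order-service", "payment-service", "rabbitmq-queue", "priority-queue",
        "stripe-webhook", "avalara-webhook"},
    9: {"order-service", "fulfillment-service", "payment-service",
        "rabbitmq-queue", "priority-queue", "adyen", "avalara-webhook"},
}


def drift(documented: set[str], actual: set[str]) -> tuple[list[str], list[str]]:
    """Return (missing from docs, documented but gone) as sorted lists."""
    missing = sorted(actual - documented)
    obsolete = sorted(documented - actual)
    return missing, obsolete


for month, actual in SYSTEM_BY_MONTH.items():
    missing, obsolete = drift(DOCUMENTED, actual)
    print(f"Month {month}: missing={missing} obsolete={obsolete}")
```

The documented set never changes, but both lists grow. By month 9 the docs omit five components and still describe one that is gone.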

The “update the docs” instinct is wrong

The most common response to stale documentation is a process change: add a documentation update to the definition of done, create a documentation review step in the PR checklist, assign a documentation owner to each service. These interventions treat documentation as the source of truth. They assume that if the process is correct, the documentation will be accurate.

But the code is the source of truth. Documentation is a derivative. The order-service does not call payment-service because the architecture diagram says it does. The diagram says it does because someone looked at the code and wrote it down. The code is always the primary record. The document is always secondary.

Updating documentation manually to track a fast-moving codebase is like updating a printed map every time a road gets built. The map can never be current because by the time you finish printing it, the roads have changed again. The structural problem is not the quality of the map. It is that the map is the wrong interface for a dynamic territory.

This framing matters because it changes what the solution needs to be. If documentation is a derivative of code, and the code changes faster than documentation can track, then the answer is not better documentation. It is making the code itself queryable.

What bad documentation actually costs

The cost of stale documentation is usually invisible until it is not. Nobody files a ticket for “documentation was wrong and I wasted two hours.” The cost shows up in slower onboarding, misinformed estimates, and incidents that take longer to resolve than they should.

New engineer onboarding

New engineers build mental models from whatever documentation exists. If that documentation describes a system that no longer matches reality, the mental model is wrong from the start. They spend their first weeks discovering discrepancies between the diagram and the code, which they usually interpret as their own confusion rather than as documentation decay. Onboarding takes longer, and their early work is more tentative.

Incident response

An on-call engineer gets paged at 2am for a payment processing failure. They open the architecture diagram. It shows order-service calling Stripe directly. They start investigating the Stripe integration. The actual problem is in Adyen, which replaced Stripe eight months ago, and which does not appear on the diagram at all. Twenty minutes are spent chasing the wrong path because the documentation was confidently wrong.

Product specification

A product manager writes a feature spec that depends on how the order workflow currently functions. They consult the documentation. The documentation describes the pre-split architecture. The spec asks for changes to a service that was decomposed six months ago. Engineering receives the spec and has to explain why the specified approach does not map to the actual system. The spec gets rewritten. The sprint starts late.

Engineering estimates

An engineering manager estimates effort for a new integration based on their understanding of the current architecture, partially informed by documentation that describes the system as it existed fourteen months earlier. The estimate misses a dependency that was added since then. The work takes longer than expected. On paper the estimate was simply wrong, but the underlying cause was that the system was less well understood than it appeared.

The documentation-as-query shift

Instead of trying to keep a static document current, derive answers from the current code on demand.

“What does the order service call?” is not a question a README should answer. It is a question the current codebase should answer directly. The README was written by a human who looked at the code and summarized what they saw. The summary aged. The code did not age — the code is always current because it is the system.

When the question is routed to the code rather than to a derivative document, the answer reflects the current state of the system. Not the state from six months ago. Not the state before the service split. The state right now.
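For a sense of what routing the question to the code can look like, here is a minimal sketch using Python's standard `ast` module. The embedded source and the client names (`avalara`, `payment_client`, `queue`) are hypothetical stand-ins. A real query would read the files at the current commit and would need much more than syntax matching to resolve cross-service calls.

```python
import ast

# Hypothetical order-service source. In practice this would be read from the
# repository at the current commit, not embedded as a string.
ORDER_SERVICE_SOURCE = '''
def place_order(order):
    tax = avalara.calculate_tax(order)
    payment_client.authorize(order, tax)
    queue.publish("order_created_standard", order)
    if order.priority:
        queue.publish("order_created_priority", order)
'''


def outbound_calls(source: str) -> list[str]:
    """List every `name.method(...)` call in the source, in dotted form."""
    calls = set()
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)):
            calls.add(f"{node.func.value.id}.{node.func.attr}")
    return sorted(calls)


def published_queues(source: str) -> list[str]:
    """List string literals passed as the first argument to `.publish(...)`."""
    queues = set()
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "publish"
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            queues.add(node.args[0].value)
    return sorted(queues)


print(outbound_calls(ORDER_SERVICE_SOURCE))
print(published_queues(ORDER_SERVICE_SOURCE))
```

Nothing in this answer was written down by a person in advance. If a queue is renamed in the source, the next run reports the new name.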

This changes the role of documentation. Documentation is no longer trying to describe system behavior. It is describing things the code cannot tell you on its own: why decisions were made, what trade-offs were considered, what historical context shaped the architecture. That is a much narrower scope, and it changes much more slowly. It is actually maintainable.

What a queryable codebase replaces — and what it does not

Not all documentation becomes obsolete under this model. Decision records, reasoning for architectural choices, post-mortems, historical context — these still matter and they genuinely do not live in the code. Code can tell you what the system does. It cannot tell you why the team chose RabbitMQ over Kafka in 2022, or why fulfillment was split from order management after the Q3 scaling incident.

That category of documentation should still exist, should still be written carefully, and should still be maintained. But it changes rarely. Once written, a decision record about a service split stays accurate for years. The reasoning does not change even as the implementation evolves.

The category that becomes obsolete is the “what does the system currently do?” documentation. That question should never require opening Confluence. It should be answerable from the current codebase directly.

What documentation should and should not try to cover

What documentation should still cover:
  -> why the service split happened (decision record)
  -> why Adyen replaced Stripe (context, not behavior)
  -> which trade-offs were made in the queue topology
  -> historical migration context
  -> team ownership and escalation paths

What documentation should not try to cover:
  -> what the system currently does
  -> which services call which services
  -> what the current queue names are
  -> what integrations are active
  -> how a workflow actually executes today

This is where Kognita changes the model

Kognita maintains a managed semantic index of the actual codebase, not a static document. When the order service gets split into order-service and fulfillment-service, the index reflects the split. When RabbitMQ is replaced by SQS, the index reflects SQS. When a new Avalara integration is added, the index shows the Avalara integration.

The index is derived from the current state of the repository, re-indexed automatically on every change. There is no documentation sprint that produces it. There is no human who has to remember to update it. It is always current because it is derived from the thing that is always current: the code.
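The general idea of an index that stays in sync because it is recomputed from the source can be illustrated with a toy example. To be clear, this is not Kognita's implementation or API. `DerivedIndex` and `extract_queues` are hypothetical names, and the regex stands in for whatever real analysis an index would perform.

```python
import hashlib
import re


def extract_queues(source: str) -> list[str]:
    """Toy 'analysis': queue names passed as literals to publish(...)."""
    return sorted(set(re.findall(r'publish\("([^"]+)"', source)))


class DerivedIndex:
    """Maps file path -> derived facts, re-deriving only files whose content changed."""

    def __init__(self, extract):
        self._extract = extract
        self._hashes: dict[str, str] = {}
        self._facts: dict[str, list[str]] = {}

    def refresh(self, files: dict[str, str]) -> list[str]:
        """Sync with the given repo snapshot; return the paths that were re-indexed."""
        reindexed = []
        for path, source in files.items():
            digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
            if self._hashes.get(path) != digest:
                self._hashes[path] = digest
                self._facts[path] = self._extract(source)
                reindexed.append(path)
        # Forget files that no longer exist in the snapshot.
        for path in set(self._hashes) - set(files):
            del self._hashes[path]
            del self._facts[path]
        return sorted(reindexed)

    def query(self, path: str):
        return self._facts.get(path)


index = DerivedIndex(extract_queues)
repo = {"order_service.py": 'queue.publish("order_created", order)'}
first = index.refresh(repo)    # new file, so it is indexed
second = index.refresh(repo)   # nothing changed, nothing re-derived
repo["order_service.py"] = (
    'queue.publish("order_created_standard", order)\n'
    'queue.publish("order_created_priority", order)'
)
third = index.refresh(repo)    # content changed, so facts are re-derived
print(index.query("order_service.py"))
```

The point of the sketch is that nobody edits the facts by hand. The only way to change them is to change the code, so the two cannot drift apart.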

When someone asks “what does the order service call?” the answer comes from the current codebase. Not from a diagram someone drew fourteen months ago. Not from a README that was last touched before the Adyen migration. From the actual, current system.

This is not a documentation tool. It is a different category: a queryable system representation that is always synchronized with reality because reality is its source.

Documentation vs. system queries: the same questions, very different answers

Static documentation (written once):
  Q: What does the order service call?
  A: Publishes to the order_created queue, calls Stripe for payment
  (Written 14 months ago. Stripe is gone. fulfillment-service not mentioned.)

System query (derived from current code):
  Q: What does the order service call?
  A: Publishes to order_created_standard and order_created_priority queues.
     Calls payment-service via gRPC on port 9001.
     payment-service calls Adyen for authorization.
     fulfillment-service subscribes to order_created_standard.
     Avalara called synchronously for tax calculation before authorization.
  (Derived from the actual codebase. Current as of last commit.)

The gap between those two answers is not an edge case. It is the normal state for any codebase that has been in active development for more than a year. The static answer is what Confluence returns. The derived answer is what the actual system does. The difference between them is the cost of treating documentation as a source of truth.

Final take

Documentation decay is not a discipline problem. It is a category error. Treating the map as the territory — writing documentation about what the system does and then expecting that documentation to stay accurate — ignores the fundamental asymmetry between static artifacts and dynamic systems.

The fix is not writing better documentation or enforcing stricter documentation processes. Teams with excellent documentation hygiene still have stale architecture diagrams. The fix is making the territory itself queryable, so that questions about current system behavior are answered by the current system rather than by a human-written approximation of it.

Documentation should describe what the code cannot tell you: why decisions were made, what was considered and rejected, what historical context shaped the system. The code itself should answer everything else. When that split is correct, documentation becomes small, stable, and genuinely useful — instead of large, drifting, and quietly harmful.