The Hardest Codebase to Understand Is the One You Just Acquired

The original engineers are gone. The documentation is three years behind the code. The commit messages say "fix" and "update." There is one service nobody touches and everyone is afraid of. The person who wrote it left eight months ago and took the context with them. Post-acquisition engineering onboarding is the worst-case codebase scenario: no institutional memory, business pressure to ship from day one, and an AI coding tool that starts hallucinating the moment the patterns get unfamiliar.

Acquisitions are sold on the strength of the technology. The due diligence covers architecture reviews, security audits, dependency inventories. What it does not capture — what no document captures — is the tacit knowledge that made the codebase navigable to the people who built it. The naming conventions that made sense internally. The service boundary that was drawn for a business reason that no longer applies. The module that was on the roadmap for deprecation when the acquisition closed. That knowledge does not transfer in a data room. It transfers through the engineers, and in most acquisitions, a meaningful share of those engineers do not stay.

What the acquiring team inherits is the system without the map. Everything needed to redraw that map is inside the code, but getting from "the code contains this" to "the engineers understand this" is the problem, and the tools built for that problem mostly assume you can ask the people who wrote it.

Why acquired codebases are structurally different

In an organic codebase, the team that built it is still there. When a new engineer joins, there are senior engineers who can explain why the authentication service is structured the way it is, why OrderService and FulfillmentService have a hard boundary, why there are two billing modules that look redundant but are not. That knowledge is refreshed constantly through code review, architecture discussions, and the informal accumulation of context that happens when a team lives inside a system for years.

In an acquired codebase, the institutional memory left with the team. Whatever was in people's heads is now gone — or accessible only through expensive, unreliable channels: emails, Slack history, the occasional engineer who stayed through the transition and is now pulled in every direction at once. The commit history exists, but it was written for people who already understood the context. "Fix checkout flow" means something to the engineer who wrote it and nothing to the engineer reading it two years later with no background.

The architectural decisions that made sense in 2021 are not explained anywhere. A microservice split that was driven by team structure (Conway's Law in action) looks arbitrary to an engineer who does not know that the two teams responsible for it stopped talking to each other eighteen months before the acquisition. A data model that seems unnecessarily complex was built around a compliance requirement that was scoped out before launch, while the tables and columns added to support it were never removed from the schema. The code carries the consequences of decisions made for reasons that no longer exist and were never written down.

This is structurally different from a standard engineer onboarding situation in one critical way: in standard onboarding, the context exists and can be surfaced through human interaction. In post-acquisition onboarding, the context exists only in the code. There is no human to interrupt. The codebase is the only source of truth, and the codebase is not easy to query.

How standard AI tools fail on unfamiliar code

Standard AI coding tools — Cursor, GitHub Copilot, and similar — work by pattern-matching against training data. They are good at codebases that look like the public repositories they were trained on: standard frameworks, conventional naming, common architectural patterns. When a codebase diverges from those patterns, the model fills the gap with its priors. The output looks confident and compiles cleanly. It is often wrong for this specific system.

An acquired startup's codebase is almost always at least a little weird. Startups build for speed, not convention. They adopt internal frameworks because an off-the-shelf solution did not fit their constraints. They name services after internal jokes or founder preferences. They structure their service boundaries around the two engineers who built the initial version, not around the domain model that emerges later. The resulting codebase is coherent to the people who built it and opaque to the pattern-matching model that has never seen anything quite like it.

The failure mode is not random errors — it is confident misdirection. An AI tool that has never seen the acquired startup's custom job queuing framework will not say "I don't know this pattern." It will suggest the standard library pattern it does know, wired up correctly according to that library's conventions, connected to the wrong abstraction layer entirely. The engineer who trusts that suggestion has now introduced a dependency on a library the codebase does not use, bypassing the actual job management layer. This is the same failure pattern that afflicts legacy codebases more broadly — and acquired codebases have it in concentrated form.

How standard AI tools fail on acquired codebases:

  Failure mode          | Root cause                    | Consequence
  ----------------------|-------------------------------|---------------------------
  Framework confusion   | Custom internal framework     | Suggestions follow common
                        | not in training data          | patterns, bypass actual
                        |                               | system abstractions
                        |                               |
  Convention mismatch   | Idiosyncratic naming or       | Generated code violates
                        | structural patterns the       | conventions the AI never
                        | model has not seen            | detected
                        |                               |
  Dead code inclusion   | Legacy configs and deprecated | AI treats dead code paths
                        | modules coexist with live     | as live, produces code
                        | code, no differentiation      | that calls retired routes
                        |                               |
  Dependency blindness  | Implicit service dependencies | Changes that look isolated
                        | not visible from call sites   | break downstream consumers
                        | in a single file              | the AI did not know existed
                        |                               |
  Architectural drift   | AI assumes standard service   | Suggestions that fit how
                        | boundaries based on naming    | the service is named,
                        |                               | not how it actually
                        |                               | works
                        |                               |
  Confidence inflation  | Model fills uncertainty gaps  | Plausible, confident output
                        | with training priors          | that is wrong for this
                        |                               | specific system

The problem compounds as the codebase diverges further from common patterns. A service with an unusual name and an internal convention for dependency injection will confuse the model more than a service with a standard name and a conventional pattern. The more the acquired startup built to its own conventions, which is often the mark of a deliberate engineering team, the worse the standard AI tools perform on day one.

What the first month actually looks like without help

The first month in an acquired codebase without institutional memory or adequate tooling is a slow, expensive grind. The engineers on the acquiring team are reading code to understand a system they have not seen before, without being able to ask the authors, without documentation that reflects the current state, and with only each other to consult — and they all started at the same time.

The changes that get made in the first month are the most dangerous. An engineer who does not yet understand the system's dependency structure makes a change that looks isolated and breaks a downstream consumer they did not know existed. An engineer who has not traced the authentication flow adds a new API endpoint that bypasses a validation step that every other endpoint runs. An engineer who does not know that a deprecated service still has live traffic removes a code path that was serving real requests in production. These are not careless mistakes. They are the natural result of working in a system without a map.

Code duplication is a persistent first-month problem. Engineers who cannot find existing functionality build new versions of it. The acquired codebase has a utility for formatting currency that six different modules already use; the new engineer, not knowing it exists, writes a second one. The acquired codebase has a client library for the third-party SMS gateway; the new engineer, not finding it, installs a different library and creates a second integration. The codebase accumulates new technical debt in the first month because the engineers adding it cannot see what is already there.

The configuration gap produces the most embarrassing failures. The acquired codebase's staging and production environments differ in ways that were never written down — because the people who set them up knew the differences and did not need to write them down. A change that works in staging fails in production because the production environment has a Redis instance with a shorter key expiry, a CDN configuration that rewrites certain headers, or a database connection pool that behaves differently under the acquired system's load patterns. None of these are in any document. All of them have caught new engineers in their first month.
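
One way to surface this kind of drift is to diff the effective configuration of the two environments instead of relying on anyone's memory. The sketch below is a minimal illustration under stated assumptions: the keys and values in `STAGING` and `PRODUCTION` are hypothetical, and a real system would first have to export its live settings into this shape.

```python
from typing import Any

# Hypothetical settings; the keys and values are invented for illustration.
STAGING = {"redis": {"key_ttl_seconds": 3600}, "db": {"pool_size": 5}}
PRODUCTION = {
    "redis": {"key_ttl_seconds": 300},
    "db": {"pool_size": 5},
    "cdn": {"rewrite_headers": True},
}


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested settings into dotted keys such as 'redis.key_ttl_seconds'."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_drift(
    staging: dict[str, Any], production: dict[str, Any]
) -> dict[str, tuple[Any, Any]]:
    """Return {key: (staging_value, production_value)} for every differing key.

    A key that is absent in one environment shows up as None on that side.
    """
    left, right = flatten(staging), flatten(production)
    return {
        key: (left.get(key), right.get(key))
        for key in sorted(left.keys() | right.keys())
        if left.get(key) != right.get(key)
    }


for key, (staging_value, production_value) in config_drift(STAGING, PRODUCTION).items():
    print(f"{key}: staging={staging_value!r} production={production_value!r}")
```

A diff like this only covers settings that can be exported. Behavior that lives in infrastructure, such as how a connection pool reacts under load, still has to be found by other means.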

The specific questions that have no quick answer

Post-acquisition onboarding generates a specific category of question that takes weeks to answer by reading code and would take minutes with the right context layer. These are not questions about individual functions or specific bugs. They are questions about system behavior: why does this work this way, what depends on what, what is the intended behavior of this path.

“What does this service actually do?” sounds like an easy question. It is not. The service name may not reflect its current responsibility. The README may describe what it was supposed to do when it was created. The code may contain the accumulated result of three rounds of scope changes that happened without documentation updates. Answering this question correctly requires reading the service, tracing its callers, understanding what it writes and where, and synthesizing a coherent picture from those inputs. That takes days, not minutes.

“What breaks if I change the schema of this table?” is a question that requires knowing every consumer of that table: services that read it, background jobs that query it, external sync processes that depend on it, API endpoints that serialize it. In an acquired codebase where the service topology is not well understood, finding all those consumers requires a search that spans the entire repository. Grep can find direct SQL references. It will miss the ORM queries that reference the column under a different alias, the batch export job that constructs its query dynamically, and the legacy reporting module that was never migrated off the old schema.
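
The gap between a text search and the real set of consumers is easy to reproduce at toy scale. In the sketch below, the file paths, file contents, and the `grep` helper are all hypothetical stand-ins for a repository and a `grep -l` run; only the second file's dynamically built table name matters.

```python
import re

# Hypothetical file contents standing in for a repository.
SOURCES = {
    "reports/direct_query.py": 'cursor.execute("SELECT email FROM user_accounts")',
    "jobs/batch_export.py": (
        'table = "user_" + entity_plural  # resolved at runtime\n'
        'query = f"SELECT * FROM {table}"'
    ),
}


def grep(pattern: str, sources: dict[str, str]) -> list[str]:
    """Return the paths whose text matches the pattern, similar to `grep -l`."""
    return sorted(path for path, text in sources.items() if re.search(pattern, text))


matches = grep(r"user_accounts", SOURCES)
missed = sorted(set(SOURCES) - set(matches))

print("found by text search:", matches)
print("missed by text search:", missed)
```

Both files read the same table at runtime, but only the one containing the literal table name is found. The export job assembles the name from parts, so no pattern over the source text will match it.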

“Are there consumers of this queue that are not obvious from the code?” is the question that precedes the production incident. Queue consumers in acquired codebases are often registered at application startup, in configuration files, or through a mechanism specific to the acquired team's internal framework, none of which are obvious from reading the producer code. The engineer who changes the message format without finding all consumers discovers the missing ones when they start failing on messages they can no longer parse.

What engineers need to understand when inheriting an acquired codebase:

  Service topology:
  -> What are the services and how do they communicate?
     (REST, event queues, gRPC, direct DB calls — which ones use what)
  -> Which services are stateful vs. stateless?
  -> Are there any services that run on separate infrastructure
     not visible from the application repositories?

  Data layer:
  -> Which databases does each service own?
  -> Are there shared databases — and if so, which services write to them?
  -> What background jobs touch persistent state and on what schedule?

  Business logic location:
  -> Where does the core domain logic live? In services, models, or
     somewhere in between?
  -> Are there undocumented validation rules that live outside the
     obvious entry points?
  -> Which parts of the codebase are actively maintained vs. frozen?

  Configuration and environment:
  -> Where are environment-specific config values stored?
  -> Are there production config differences that are not reflected
     in staging or the repo?
  -> What secrets management pattern is in use?

  Ownership and risk:
  -> Which services are "load-bearing" (breakage causes an immediate incident)?
  -> Are there services nobody has touched in over a year?
  -> What parts of the codebase were the last engineers actively working on
     when the acquisition closed?
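
Answers to a checklist like this are more useful when they are recorded as data. As a small illustration of the shared-database question, the sketch below finds tables with more than one writer. The `WRITES` inventory is hypothetical and would have to be assembled by hand or by tooling.

```python
from collections import defaultdict

# Hypothetical inventory: service name -> tables that service writes to.
WRITES = {
    "BillingService": ["invoices", "user_accounts"],
    "PaymentService": ["payments", "user_accounts"],
    "SubscriptionWorker": ["subscriptions"],
}


def shared_tables(writes: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return {table: [writers]} for every table written by more than one service."""
    writers: dict[str, list[str]] = defaultdict(list)
    for service, tables in writes.items():
        for table in tables:
            writers[table].append(service)
    return {
        table: sorted(services)
        for table, services in writers.items()
        if len(services) > 1
    }


print(shared_tables(WRITES))
```

Tables with several writers are the ones where a schema change needs coordination across services, so they are a reasonable place to start reading.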

What semantic codebase intelligence gives you

The core problem is that the information needed to answer these questions exists in the codebase — it is just not accessible in the form the questions are being asked. The codebase knows which services depend on the payment service. It knows which table columns every consumer reads. It knows every caller of the authentication method. That information is distributed across hundreds of files, encoded in call graphs and import chains and configuration values, and not directly queryable in plain language.

Semantic codebase intelligence is an approach that builds an execution-aware, call-graph-grounded model of a codebase from the code itself. Rather than pattern-matching against training data — which fails on unusual patterns — it analyzes the actual structure of the specific system: what calls what, what depends on what, what the execution paths through the system look like. The model it builds reflects this codebase, not a generic approximation of codebases that look similar.

For acquired codebases, this matters because the unusual patterns are exactly the patterns that need to be understood. The custom job queuing framework is a load-bearing part of the system. The non-standard authentication chain is the actual security boundary. The idiosyncratic service boundary that would look wrong to an outside observer exists for a reason — even if that reason is now historical — and any engineer who changes it without understanding it is taking on unquantified risk. Semantic indexing that builds from execution paths rather than pattern recognition captures these correctly, because the execution paths are there regardless of whether the pattern is conventional.

The difference in practice is the difference between a tool that says "this looks like it might be an authentication service" and a tool that says "this is the authentication entry point, it calls these four classes in this order, it validates against this database table, and these seventeen endpoints depend on it." One is a pattern-match guess. The other is a map of the actual system.
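
The "what calls what" idea can be shown at toy scale with Python's standard `ast` module. This is a deliberately simplified sketch, not a description of how any particular product builds its index. It resolves calls by name only, ignores dynamic dispatch, and runs against a hypothetical `SOURCE` module whose function names are invented here.

```python
import ast

# Hypothetical module source; the function names are illustrative only.
SOURCE = '''
def verify(token):
    return lookup_session(token) is not None

def lookup_session(token):
    return {"abc": "user-1"}.get(token)

def login_endpoint(request):
    return verify(request["token"])

def admin_endpoint(request):
    return verify(request["token"]) and request["role"] == "admin"
'''


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the names it calls directly."""
    graph: dict[str, set[str]] = {}
    for node in ast.parse(source).body:
        if not isinstance(node, ast.FunctionDef):
            continue
        calls = graph.setdefault(node.name, set())
        for child in ast.walk(node):
            if not isinstance(child, ast.Call):
                continue
            if isinstance(child.func, ast.Name):         # plain call: verify(...)
                calls.add(child.func.id)
            elif isinstance(child.func, ast.Attribute):  # method call: obj.get(...)
                calls.add(child.func.attr)
    return graph


def callers_of(graph: dict[str, set[str]], target: str) -> list[str]:
    """Return every function that calls `target` directly."""
    return sorted(caller for caller, callees in graph.items() if target in callees)


graph = build_call_graph(SOURCE)
print(callers_of(graph, "verify"))
```

The answer comes from the structure of this specific code, not from what an authentication module usually looks like. A real index has much more to resolve, including imports across files, framework registration, and calls wired through configuration.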

Kognita for acquired codebases

Kognita connects to the acquired repository — no legacy authors required. The engineering team on the acquiring side connects the repository once. From that point, any team member can ask plain-language questions about the acquired system and get answers grounded in the actual code, not in generic patterns or outdated documentation. The acquired system's unusual conventions, custom frameworks, and idiosyncratic service boundaries are indexed as they are, not normalized toward what a standard system would look like.

Questions that would take weeks to answer by reading take minutes. "What are all the callers of this authentication method?" returns every entry point, including the ones wired through the acquired team's custom middleware registration pattern that no one on the acquiring team had seen before. "What services does the payment service depend on?" returns the full dependency graph, including transitive dependencies and the async queue consumers that are not visible from the payment service code alone. "Which parts of the codebase touch user billing?" returns every service, job, and controller that interacts with billing data: the complete picture, not the partial one that file search returns.

The benefit is not limited to developers. Engineering managers trying to assess integration risk need to understand what the acquired system does before they can plan the migration. Product managers trying to scope which features from the acquired product can be preserved need to understand which parts of the codebase implement those features. The same access problem that affects contractors dropped into an unfamiliar system affects every non-author who needs to reason about an acquired codebase — and Kognita makes the system queryable for all of them.

Questions Kognita can answer about an acquired system:

  Topology questions:
  Q: "What are all the services in this repository and how do they
      communicate with each other?"
  A: Returns a grounded map of services, their communication patterns
     (sync REST, async queue, etc.), and which pairs are coupled tightly
     vs. loosely — derived from actual call graph analysis, not naming.

  Dependency questions:
  Q: "What services does the payment service depend on, directly
      and transitively?"
  A: PaymentService -> FraudCheckService (sync, every transaction)
                    -> NotificationService (async, on completion)
                    -> AuditLogService (sync, writes before response)
     NotificationService -> EmailQueueWorker (async, SQS)
                         -> SMSGatewayAdapter (sync, legacy)
     Surfaces the full dependency graph, including hidden transitive deps.

  Impact questions:
  Q: "What breaks if I change the schema of the user_accounts table?"
  A: Returns every service that reads from user_accounts, every
     background job that queries it, any external sync processes
     identified in the index — with the specific columns each consumer
     depends on.

  Ownership questions:
  Q: "Which parts of the codebase touch user billing?"
  A: Returns BillingService, PaymentService, SubscriptionWorker,
     AdminBillingController, and three data export jobs — with the
     specific billing-related operations each one performs.

  Risk questions:
  Q: "What are all the callers of the AuthenticationService
      verify() method?"
  A: Returns every entry point that calls into authentication
     verification, including any that bypass the standard middleware
     path — a complete picture of the attack surface before any
     security-sensitive change.
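
Once the direct edges are known, the traversal behind a dependency answer like the one above is a breadth-first walk. The sketch below illustrates that idea only and does not describe how Kognita computes its results. The edges are copied from the example answer, and the function names are invented here.

```python
from collections import deque

# Direct dependencies copied from the example dependency answer.
DEPENDS_ON = {
    "PaymentService": ["FraudCheckService", "NotificationService", "AuditLogService"],
    "NotificationService": ["EmailQueueWorker", "SMSGatewayAdapter"],
}


def transitive_dependencies(service: str, graph: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every service reachable from `service`."""
    seen = {service}  # excludes the starting service and guards against cycles
    order: list[str] = []
    queue = deque(graph.get(service, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(graph.get(current, []))
    return order


def dependents_of(service: str, graph: dict[str, list[str]]) -> list[str]:
    """Return every service that depends on `service`, directly or transitively."""
    return sorted(
        candidate
        for candidate in graph
        if service in transitive_dependencies(candidate, graph)
    )


print(transitive_dependencies("PaymentService", DEPENDS_ON))
print(dependents_of("SMSGatewayAdapter", DEPENDS_ON))
```

Running the walk in reverse, as `dependents_of` does, is what turns a dependency question into an impact question: it lists what is exposed when a downstream service changes.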

The index updates as code changes. Engineers who join the acquired system in month three are working with the same quality of context as engineers who started in month one — and the context they have reflects the system as it exists in month three, not as it was documented before the acquisition closed. There is no maintenance overhead. No documentation to keep current. The codebase is the source of truth, and the index keeps pace with it.

Final take

The cost of a slow acquisition onboarding is not just productivity loss in the first quarter. It is wrong architecture decisions made in ignorance of the acquired system's actual structure — decisions that create technical debt that takes years to pay down. It is services rewritten when they should have been extended, because the engineers doing the integration did not know the service existed or did not understand what it did. It is third-party integrations duplicated, security boundaries misunderstood, data models partially migrated because no one had a complete picture of the schema's consumers.

The cost is also human. Engineers who spend their first months in an acquired codebase making mistakes they could have avoided with better context are demoralized, not just unproductive. The feedback loop of "change something, break something, revert, understand why too late" is the worst possible introduction to a system that the acquiring team is being asked to take ownership of and improve.

Semantic codebase intelligence compresses the timeline from months to weeks, not by making the system simpler, but by making it legible from day one. The acquired codebase contains everything the engineering team needs to know. The question is whether they can access that knowledge in a form that maps to the questions they are actually asking. When the engineers who built it are gone and the documentation is three years out of date, the code itself is the only reliable source of truth. What changes is whether that truth is queryable.