How to Scope a Technical Migration Without the Mid-Project Surprises

The migration was scoped at three weeks. Six weeks in, the team has migrated four of the twelve services they thought were affected. The other eight surfaced mid-project: two because they shared a database schema that was not in the original blast radius assessment, three because they called an internal API that was part of the migration scope, and three because they wrote to a queue that was being deprecated. The scope was not wrong on day one — it was incomplete. The codebase knew the answer. Nobody asked it.

This is the standard pattern for technical migrations that go over time and budget. Not execution failures — scope failures. The work itself gets done competently. The problem is that the work expands mid-project because the initial scoping did not find all the affected surfaces before the first commit landed. Every week of mid-project discovery is a week of unplanned work that the timeline was not built to absorb.

Why migrations fail on scope, not execution

Technical migrations fail on scope in two distinct ways. The first is the wrong blast radius: the team identified the component being migrated but did not find all the other components that are affected by it. The second is wrong dependencies: the team knew what they were migrating but did not fully map what that component connects to — what calls it, what reads from it, what consumes its output.

Both failure modes are discoverable before the work starts. Neither requires guessing. The system's actual structure — which services call which APIs, which services share schema, which workers subscribe to which queues — is encoded in the codebase. The problem is not that the information is hidden. The problem is that the scoping process does not systematically extract it before committing to a timeline.

What makes this particularly expensive is that the failures compound. When a service surfaces mid-migration that was not in the original scope, the team does not just add it to the backlog and keep moving. That service often has its own dependencies on the thing being migrated. It may already have work in progress from another team that now conflicts. It may require rollback safeguards that were not built into the migration plan. One missed surface becomes three weeks of unplanned work — and the migration that was scoped at three weeks is not even done at six.

The three types of hidden migration scope

Every technical migration has a visible scope — the component being changed — and a hidden scope — everything that depends on or interacts with that component. Hidden scope falls into three categories, each with a distinct failure mode when missed.

The first category is shared data model consumers. When a migration involves changing a database schema — renaming a table, altering column types, restructuring a data model — every service that reads from or writes to that schema is in the blast radius. In practice, this means services that the migrating team does not own, services that were built years ago by engineers who have left, and services that share the database for historical reasons that are no longer obvious from the architecture. Schema changes that look contained affect all of them simultaneously.

The second category is API consumers. When a migration involves changing or deprecating an internal API endpoint — updating its contract, changing its response shape, removing a field, requiring a new parameter — every service that calls that endpoint is affected. These consumers do not appear in the repository of the service being migrated. They are distributed across the codebase in services the migrating team may not have looked at. They are only visible through cross-repo tracing.

The third category is event and queue subscribers. When a migration involves changing a message schema, deprecating a queue, or altering the event taxonomy of a messaging system, every subscriber to those events is in the blast radius. Subscribers are often asynchronous workers or analytics pipelines — services that do not call the migrated component directly and do not appear in any dependency graph that only traces synchronous API calls. They are the most frequently missed category and the most likely to cause silent failures: the subscriber keeps running, processes the wrong shape, and drops records or produces corrupted state with no error thrown.

None of these appear reliably in architecture diagrams. All three live in the code.
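
The silent failure mode described for subscribers can be illustrated with a minimal sketch. The message shapes and field names here are hypothetical, chosen only to show how a lenient parser turns a renamed field into a missing value without raising, while a strict parser fails loudly.

```python
# Hypothetical message shapes; the field names are illustrative only.
OLD_MESSAGE = {"order_id": "o-1", "total_cents": 1250}
NEW_MESSAGE = {"order_id": "o-1", "amount_cents": 1250}  # field renamed upstream


def lenient_handler(message: dict) -> dict:
    """Parse the old shape with .get(), so a renamed field quietly becomes None."""
    return {
        "order_id": message.get("order_id"),
        "total": message.get("total_cents"),
    }


def strict_handler(message: dict) -> dict:
    """Fail loudly when a required field is missing from the message."""
    required = ("order_id", "total_cents")
    missing = [name for name in required if name not in message]
    if missing:
        raise KeyError(f"message is missing required fields: {missing}")
    return {"order_id": message["order_id"], "total": message["total_cents"]}


if __name__ == "__main__":
    # The lenient subscriber keeps running and records a total of None.
    print(lenient_handler(NEW_MESSAGE))
    try:
        strict_handler(NEW_MESSAGE)
    except KeyError as exc:
        print("strict handler rejected the message:", exc)
```

A subscriber written like lenient_handler is the kind that stays invisible after a schema change: nothing crashes, so nothing points back to the migration.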

Why the initial estimate is always wrong

Teams scope migrations against what they know. They know what they are migrating — that is the starting point of the entire effort. They know the services they own. They know the architecture they have worked in recently. What they do not reliably know is what depends on what they are migrating, especially across service ownership boundaries and across time.

The initial estimate reflects the visible blast radius: the component being migrated plus whatever dependencies are immediately apparent. In a small system with one or two services, that is sufficient. In a distributed system with dozens of services developed by multiple teams over multiple years, the visible blast radius is consistently smaller than the actual blast radius. The gap between the two is what turns a three-week migration into a six-week one.

This is not a failure of the engineers doing the scoping. A senior engineer who knows the system well will catch more hidden dependencies than one who does not. But no individual has a complete current mental model of a production system with dozens of services and years of accumulated changes. The services added by other teams eighteen months ago, the workers that were built to handle a now-standard pattern, the legacy data pipeline that still reads from the old schema for a reporting job — all of it is real, all of it is in the blast radius, and none of it appears reliably in one person's recollection of the architecture.

The manual blast radius process and why it fails

The standard approach to blast radius assessment is a combination of four methods: grep for references to the component being migrated, check the architecture diagram, ask the team, and read the service documentation. Each has a specific failure mode.

Manual blast radius discovery — methods and where each fails:

  Method: grep for the service name or endpoint path
  Failure: misses semantic dependencies
    -> BillingService does not import PaymentsService — it calls the HTTP API
    -> grep for "PaymentsService" returns zero results in BillingService
    -> the dependency exists, grep cannot see it

  Method: check the architecture diagram
  Failure: diagrams are stale
    -> WebhookHandlerService was added in a sprint eight months ago
    -> the architecture diagram has not been updated since last year
    -> the service does not appear — it is real, it is in production, it is invisible

  Method: ask the team
  Failure: knowledge is incomplete and person-dependent
    -> the engineer who built AnalyticsWorkerService left six months ago
    -> nobody else remembers it reads Stripe events directly
    -> the current team cannot report what they do not know

  Method: check service documentation
  Failure: documentation describes intent, not behavior
    -> AdminPanelService README says "displays payment information"
    -> it does not say which fields, which API version, or which service it calls
    -> behavior must be read from the code — not the README

The combined failure rate of these methods is not random. It is systematic. Grep misses the dependencies that are most likely to be forgotten because they do not use the internal service name: they call an HTTP endpoint, read from a shared database table, or subscribe to a queue topic. Architecture diagrams miss services that were added after the last diagram update, which in practice means most recently added services. Team knowledge misses everything built by people who have left and everything operated by teams who were not invited to the scoping conversation. Documentation misses whatever its authors did not write down, which is usually the exact fields, API versions, and endpoints a service depends on.

The result is a blast radius assessment that is accurate for what the team already knew and systematically incomplete for everything else. The things it misses are exactly the things that surface mid-migration.
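
To make the grep limitation concrete, here is a minimal sketch that searches the same source text two ways: by service name only, and by the kinds of reference a consumer actually uses. The snippets, the ReportingJob and SearchService names, the table name, and the topic name are all hypothetical. Real code would read files from each repository, and regex matching is still only an approximation of a behavioral trace.

```python
import re

# Hypothetical source snippets keyed by service name; in practice these would
# be file contents read from each repository.
SOURCES = {
    "BillingService": 'resp = http.get(BASE_URL + "/v1/charges", params=q)',
    "ReportingJob": 'cursor.execute("SELECT id, amount FROM charges")',
    "AnalyticsWorkerService": 'consumer.subscribe("payments.events")',
    "SearchService": 'index.refresh("products")',
}

# One pattern per kind of reference to the component being migrated.
PATTERNS = {
    "service_name": re.compile(r"PaymentsService"),
    "http_path": re.compile(r"/v1/charges\b"),
    "db_table": re.compile(r"\bFROM\s+charges\b", re.IGNORECASE),
    "queue_topic": re.compile(r"payments\.events"),
}


def find_references(sources: dict, patterns: dict) -> dict:
    """Map each service to the kinds of reference found in its source text."""
    hits = {}
    for service, text in sources.items():
        kinds = sorted(kind for kind, pat in patterns.items() if pat.search(text))
        if kinds:
            hits[service] = kinds
    return hits


if __name__ == "__main__":
    name_only = {"service_name": PATTERNS["service_name"]}
    print("name-only search:", find_references(SOURCES, name_only))
    print("reference search:", find_references(SOURCES, PATTERNS))
```

With these inputs the name-only search finds nothing, while the broader search flags three services, each through a different kind of reference.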

What a pre-migration discovery actually needs

A complete pre-migration blast radius assessment requires four specific inputs, all of which are extractable from the codebase and the issue tracker before any work begins.

Pre-migration discovery checklist — what to find and what breaks when you miss it:

  1. Shared data model consumers
     Find: every service that reads from or writes to the affected schema
     Miss it: schema changes silently corrupt reads in services you did not touch
     How it surfaces: data integrity errors in a service three hops away, two weeks after migration

  2. API consumers
     Find: every service that calls the endpoint being changed or deprecated
     Miss it: a downstream service sends requests to the old contract and gets 4xx errors
     How it surfaces: production errors in a service the migration team does not own

  3. Event and queue subscribers
     Find: every consumer of the message schema being modified
     Miss it: subscribers still parsing the old shape silently drop or misprocess messages
     How it surfaces: missing records in analytics, silent processing failures, stale state

  4. In-progress Jira work
     Find: every open epic or ticket that touches the migration blast radius
     Miss it: two teams modify the same component from incompatible directions mid-sprint
     How it surfaces: merge conflicts, broken contracts, or duplicate refactors in the same sprint

  Summary: every category you skip is a surface you discover mid-migration
  instead of before the first commit.

The first three inputs come from the codebase. The fourth, open Jira work in progress that overlaps with the migration scope, comes from connecting the codebase trace to the active sprint. When the two sources are combined, you get the full picture: the structural blast radius from the code plus the active collision risk from the issue tracker. Missing either one produces an incomplete scope.

The key shift in methodology is moving from "what does the team remember about this component?" to "what does the system actually show about this component?" Memory-based scoping is bounded by who is in the room and what they have recently touched. System-based scoping is bounded only by what is in the codebase, and the codebase covers every service, including those built by teams who are not in the room.
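
One way to picture system-based scoping is as a traversal over a reverse dependency graph. The sketch below assumes the edges have already been extracted from the code. The edge list, the payments_db and catalog_db names, and the InvoiceEmailer, ReportingJob, and SearchService services are hypothetical.

```python
from collections import deque

# Hypothetical edges: (consumer, provider, kind). Each edge says the consumer
# depends on the provider through a schema, an API, or a queue.
EDGES = [
    ("BillingService", "PaymentsService", "api"),
    ("InvoiceEmailer", "BillingService", "api"),
    ("ReportingJob", "payments_db", "schema"),
    ("PaymentsService", "payments_db", "schema"),
    ("AnalyticsWorkerService", "payments.events", "queue"),
    ("SearchService", "catalog_db", "schema"),
]


def blast_radius(target: str, edges: list) -> dict:
    """Return {consumer: hops} for everything that transitively depends on target."""
    consumers_of = {}
    for consumer, provider, _kind in edges:
        consumers_of.setdefault(provider, []).append(consumer)

    hops = {}
    queue = deque([(target, 0)])
    while queue:
        node, depth = queue.popleft()
        for consumer in consumers_of.get(node, []):
            # Breadth-first order means the first visit is the shortest path.
            if consumer not in hops and consumer != target:
                hops[consumer] = depth + 1
                queue.append((consumer, depth + 1))
    return hops


if __name__ == "__main__":
    for service, distance in blast_radius("payments_db", EDGES).items():
        print(f"{service}: {distance} hop(s) from payments_db")
```

In this toy graph a change to payments_db reaches InvoiceEmailer three hops away, which is the kind of consumer a scoping conversation is unlikely to name.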

A concrete migration scoping example

Consider a migration from Stripe v1 to the v2 API. The obvious scope is the PaymentsService — it owns the Stripe integration, it holds the API key, it is the service the team is migrating. On day one, the blast radius looks contained: migrate PaymentsService, update its tests, verify the new API behavior, ship it.

This is what a grep-and-ask scoping process produces. It is also incomplete by four services.

Payments migration — visible scope versus actual blast radius

Payments service migration — Stripe v1 to v2:

  Obvious blast radius (what the team scoped):
  -> PaymentsService (the service being migrated)
     Calls Stripe v1 API directly
     Owns charge creation, refund logic, card tokenization

  Actual blast radius (what the team discovered mid-project):
  -> BillingService
     Calls /v1/charges to retrieve charge history for invoices
     Expects v1 response shape: { id, amount, currency, status }
     v2 returns: { id, amount, currency, payment_status }  ← field renamed

  -> WebhookHandlerService
     Parses Stripe v1 webhook event shapes
     Event type "charge.succeeded" renamed to "payment_intent.succeeded" in v2
     Handler silently discards v2 events — no error thrown

  -> AdminPanelService
     Renders v1 field "failure_code" in the charge detail view
     v2 deprecates failure_code in favor of last_payment_error.code
     Field returns null in v2 — UI shows nothing, no error

  -> AnalyticsWorkerService
     Logs Stripe v1 event names for funnel tracking
     v2 event taxonomy is different — existing dashboards stop updating
     No error thrown, records just stop appearing

  Total blast radius: 5 services (not 1)
  Additional work discovered: +3.5 weeks
  Scope on day 1: PaymentsService only

Each of the four additional services has a different failure mode. BillingService breaks because a field it expects was renamed in the v2 response. WebhookHandlerService silently drops events because the event name changed. AdminPanelService renders blank fields because the v1 field it displays was deprecated and now returns null. AnalyticsWorkerService stops updating dashboards because it logs v1 event names that no longer exist in v2. Only BillingService produces an error at all, and that error points at a missing field rather than at the migration. The other three fail silently. All four are in production and affecting real users before anyone realizes they were in the blast radius.
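
The field-level breakages in this example can be expressed as a simple set difference between what each consumer expects and what the new contract provides. This is a sketch under assumptions: the field and event names come from the example above, the mapping of consumers to expectations is a hypothetical summary of what a trace might report, and failure_code is treated as absent from the v2 shape because it no longer carries a value.

```python
# Field and event names taken from the example above. The consumer mappings
# are a hypothetical summary of what a trace of each service might report.
V2_CHARGE_FIELDS = {"id", "amount", "currency", "payment_status"}
V2_EVENT_TYPES = {"payment_intent.succeeded"}

EXPECTED_FIELDS = {
    "BillingService": {"id", "amount", "currency", "status"},
    "AdminPanelService": {"failure_code"},
}
EXPECTED_EVENTS = {
    "WebhookHandlerService": {"charge.succeeded"},
}


def contract_gaps(expectations: dict, available: set) -> dict:
    """Map each consumer to the names it expects that the new contract lacks."""
    report = {}
    for service, names in expectations.items():
        gap = sorted(names - available)
        if gap:
            report[service] = gap
    return report


if __name__ == "__main__":
    print("missing fields:", contract_gaps(EXPECTED_FIELDS, V2_CHARGE_FIELDS))
    print("missing events:", contract_gaps(EXPECTED_EVENTS, V2_EVENT_TYPES))
```

Running a check like this before the migration turns each silent failure into a named line item in the scope.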

None of this is discoverable by asking the PaymentsService team who depends on their service. BillingService calls the HTTP API directly — not PaymentsService as a dependency, but the Stripe endpoint that PaymentsService was previously wrapping. WebhookHandlerService was built by the platform team two years ago and no longer has an active owner. AdminPanelService was built by the frontend team and is not on the engineering lead's mental map of the payments blast radius. AnalyticsWorkerService was built by the data team and operates independently. A cross-repo behavioral trace finds all four. A team conversation finds none of them with confidence.

Kognita for migration scoping

Kognita provides semantic cross-repo tracing that finds all consumers of a service, schema, or API before work starts. The question "what calls the v1 payments API?" returns a complete answer — not grep results bounded by what the search pattern happened to match, but a behavioral trace across the entire indexed codebase that finds every service that interacts with that endpoint regardless of how the interaction is coded.

The answer includes services in different repositories, services that call the API indirectly through an HTTP client rather than a service import, services that parse the response shape rather than calling the endpoint directly, and workers that subscribe to events produced by the migrated component. These are exactly the categories that manual blast radius assessment misses. They are the categories that surface mid-migration.

The index is managed — no local setup, no repository access required for non-engineering team members, no manual re-indexing. It updates automatically, so the answer to "what calls this endpoint?" reflects the current state of the code on the day the question is asked, not the state from the last architecture review. A product manager or scrum master can run the pre-migration discovery query before the scoping session without waiting for an engineer to run a search manually.

The Jira connection

The structural blast radius from the codebase is the necessary input. The Jira connection is what makes it operationally complete. Open epics and in-progress work that touch the migration blast radius need to be discovered before the migration starts — not when two teams collide mid-sprint having made incompatible changes to the same component.

The specific query that matters is: which open Jira tickets or active sprints involve work that overlaps with the components in this migration's blast radius? In a typical engineering organization, this question requires manually checking every open ticket and tracing each one's implementation scope — work that nobody does in practice because it takes too long to be a realistic part of migration planning.

When Jira context is connected to codebase context, this becomes a single pre-migration check. The output is not a list of all open tickets — it is a filtered list of tickets that specifically touch the services, schemas, or endpoints in the migration blast radius. That list changes the sequencing conversation before the migration starts rather than forcing a coordination scramble after two teams have already started working in conflicting directions.
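
The filtering step can be sketched as an intersection between each open ticket's components and the blast radius. This sketch does not call the Jira API. It assumes tickets have already been exported into plain records, and the ticket keys, status names, and component assignments are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    key: str
    status: str
    components: frozenset  # services, schemas, or endpoints the ticket touches


# Assumed status names; a real workflow may use different ones.
OPEN_STATUSES = ("To Do", "In Progress")


def overlapping_tickets(tickets, blast_radius, open_statuses=OPEN_STATUSES):
    """Return (key, touched components) for open tickets inside the blast radius."""
    hits = []
    for ticket in tickets:
        touched = ticket.components & blast_radius
        if ticket.status in open_statuses and touched:
            hits.append((ticket.key, sorted(touched)))
    return hits


BLAST_RADIUS = frozenset({
    "PaymentsService", "BillingService", "WebhookHandlerService",
    "AdminPanelService", "AnalyticsWorkerService",
})

TICKETS = [
    Ticket("PAY-101", "In Progress", frozenset({"BillingService", "SearchService"})),
    Ticket("WEB-7", "Done", frozenset({"WebhookHandlerService"})),
    Ticket("CAT-12", "To Do", frozenset({"SearchService"})),
    Ticket("ADM-3", "To Do", frozenset({"AdminPanelService", "PaymentsService"})),
]

if __name__ == "__main__":
    for key, touched in overlapping_tickets(TICKETS, BLAST_RADIUS):
        print(key, "touches", ", ".join(touched))
```

Only the open tickets that touch a service in the blast radius are returned, which keeps the sequencing conversation focused on real collisions rather than the whole backlog.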

Final take

Migration surprises are not bad luck. They are the predictable result of scoping against what the team knows rather than against what the system contains. Every service that was in the blast radius on day one of the migration but was not discovered until week three was always in the blast radius. It did not appear mid-project — it surfaced mid-project. The difference is when you find it: before the first commit, when it changes the scope estimate and the timeline, or after multiple services are already partially migrated and the cost to handle it has compounded.

Making the system queryable before the migration starts is the only way to get a reliable scope. Not a best-effort scope built on memory, diagrams, and whoever attended the architecture walkthrough — a complete scope built on what the codebase actually shows about its own structure. That is the difference between a migration that finishes when it was supposed to and one that finishes when the last hidden surface finally surfaces.