API Contract Changes Break More Than the Team That Made Them

The team shipped the API change on Tuesday. It was well-tested. The endpoint was versioned. They deprecated the old field with a migration guide in the PR description. Three days later, the analytics worker started silently dropping records — it was still reading the old field, which now returned null. The analytics team had no idea the field changed. They were not in the PR review. The field was still in the response, just always null. Nobody threw an error.

That is the standard anatomy of a cross-service API breakage. Not a 500. Not an immediate failure. A quiet degradation that shows up in the wrong place, at the wrong time, noticed by the wrong team — after data has already been lost or corrupted. The changing team did everything right by their own standards. The problem was not in the change itself. The problem was in what they did not know: who else was reading that field.

The anatomy of a cross-service API break

The most damaging API breaking changes are not the ones that throw immediately. They are the ones where the consumer keeps running and the data silently goes wrong. A field that used to be a string is now an integer. A field that used to return null is now absent. A field that was always populated is now sometimes empty because of a new business rule. In each case, the response still comes back. The consumer still processes it. The downstream effect surfaces later: in a support ticket, in a data audit, in a dashboard that stopped updating for no apparent reason.

How loud a break is says little about how severe it is. An exception thrown immediately is easy to trace: the stack trace leads back to the API change, the on-call engineer finds the culprit in minutes, and the incident is resolved the same day. A silent failure that corrupts analytics data for a week before anyone notices is far more expensive: the lost data may be unrecoverable, the root cause is harder to trace, and the business impact accumulates while the engineering team has no signal that anything is wrong.

The break taxonomy matters for understanding why consumer discovery needs to happen before the change, not after. If every API break threw immediately, post-hoc debugging would be sufficient. Because the most common breaks are silent, the only reliable defense is knowing who reads each field before changing it.

Why breaking changes slip through

Breaking changes reach production undetected through a combination of five factors. Each is plausible on its own; together they are catastrophic.

The first factor is consumer knowledge gaps. The changing team knows their consumers — or thinks they do. They know about the services that appear in their own service's documentation, that their teammates have mentioned, that are in the same team's ownership. They do not reliably know about the analytics worker built by the data team two years ago, the admin tool built by the ops team for internal use, or the webhook handler maintained by a platform engineer who left eight months ago. The consumer map in the changing team's head is a subset of the actual consumer map.

The second factor is the difference between schema validation and semantic correctness. A strongly-typed API with schema validation will catch type errors at the contract boundary: if a consumer sends the wrong type or a required field is absent, the validation layer rejects the request. What schema validation does not catch is semantic changes — a field that changes what it means while keeping the same type and name. The consumer's request passes validation. The response passes validation. The data is wrong.
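A minimal sketch of that gap, written in Python for illustration. The `validate_order` helper, the `amount` field, and the switch from cents to whole dollars are all hypothetical; they are not taken from any system described here.

```python
def validate_order(payload: dict) -> bool:
    """Hypothetical schema check: 'amount' must be present and an integer."""
    return isinstance(payload.get("amount"), int)


def revenue_in_dollars(payload: dict) -> float:
    """Consumer logic written back when 'amount' meant cents."""
    return payload["amount"] / 100


# Before the change: amount is expressed in cents.
old_response = {"order_id": "A1", "amount": 1999}
# After the change (assumed): same name, same type, now whole dollars.
new_response = {"order_id": "A1", "amount": 19}

old_valid = validate_order(old_response)
new_valid = validate_order(new_response)
old_revenue = revenue_in_dollars(old_response)
new_revenue = revenue_in_dollars(new_response)

print(old_valid, new_valid)      # both responses pass the schema check
print(old_revenue, new_revenue)  # the consumer's number is now off by 100x
```

Both payloads satisfy the schema, so nothing at the contract boundary has a reason to complain. Only the consumer's interpretation is wrong.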

The third factor is the false safety of deprecation without removal. When an API team deprecates a field instead of removing it, they typically believe they are being conservative — giving consumers time to migrate. In practice, deprecation-without-removal removes the urgency for consumers to update. The field is still there. The consumer still reads it. The consumer team does not see a production error, does not get paged, and does not escalate the migration until something forces the issue. Meanwhile, the deprecated field is providing increasingly degraded data that the consumer team is treating as accurate.

The fourth factor is ownership drift. Services built by people who have since left the team go through a period of implicit ownership before someone new explicitly takes responsibility. During that period, nobody is actively monitoring them, and nobody represents them in the PR review for changes that affect them. The change ships. The unowned service breaks. Nobody catches it in review because nobody who would notice was in the review.

The fifth factor is integration tests against mocks. Most service-level integration tests run against a mocked version of the API they depend on. The mock reflects the API contract as it was when the test was written. When the real API changes, the mock does not update automatically. The consumer's tests keep passing against the now-stale mock while the consumer is broken against the real API in production.
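The stale-mock problem can be sketched in a few lines of Python. The `region` field, the response shapes, and the helper names are hypothetical, chosen only to show a test that stays green while production behavior changes.

```python
def extract_region(order: dict) -> str:
    """Consumer code under test: reads a top-level 'region' field."""
    return order.get("region") or "unknown"


# Mock recorded when the test was written; it encodes the old contract.
STALE_MOCK_RESPONSE = {"order_id": "o_1", "region": "eu-west"}

# Assumed shape of the real API after the change: the old field is now
# null and the value lives under a new nested key.
REAL_RESPONSE = {
    "order_id": "o_1",
    "region": None,
    "shipping": {"region": "eu-west"},
}


def consumer_test_passes(api_response: dict) -> bool:
    """The consumer's assertion, parameterised by the response it sees."""
    return extract_region(api_response) == "eu-west"


passes_against_mock = consumer_test_passes(STALE_MOCK_RESPONSE)
passes_against_real = consumer_test_passes(REAL_RESPONSE)
production_value = extract_region(REAL_RESPONSE)
```

CI only ever sees `STALE_MOCK_RESPONSE`, so the suite passes. In production the same code quietly returns the fallback value.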

Three API change scenarios — what breaks silently versus loudly

The failure mode depends on how the consumer handles the changed field, not on how the API team made the change. The same field change produces immediate visible failures in some consumers and silent degradation in others, depending on how each consumer was written.

Three API change scenarios — what breaks silently vs. loudly, and why:

  Scenario 1: Field type change
  Change: transaction.amount goes from string ("1999") to integer (1999)
  Loud break: consumers using strict type validation throw immediately
  Silent break: consumers that pass the value through without a type check
    keep running, but strict comparisons against stored string values in
    the database now fail. Records stop matching. No error thrown.

  Scenario 2: Field renamed, old field deprecated (not removed)
  Change: user.failure_code deprecated, user.last_payment_error.code added
  Loud break: consumers that throw on null or missing values
  Silent break: consumers reading failure_code get null, not an error
    Display layer shows empty fields. Support tickets start coming in
    three days later about missing payment failure reasons.
    Engineering does not know why — no error was thrown.

  Scenario 3: Enum value added
  Change: order.status adds new value "partially_fulfilled"
  Loud break: consumers with exhaustive switch statements that throw on
    unrecognized values (rare in practice)
  Silent break: consumers with a default case silently route
    "partially_fulfilled" orders into the default handling path —
    treating partial fulfillment as complete fulfillment.
    Inventory counts drift. No error is thrown.

  Common thread: the consumer keeps running. The data gets worse.
  The failure surfaces downstream, not at the API boundary.
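Scenarios 1 and 2 can be sketched as follows, in Python for illustration. The consumer functions and the stored values are hypothetical; only the field names come from the scenarios above.

```python
# Scenario 1: the consumer matches API values against amounts that were
# stored as strings when the API still returned strings.
stored_amounts = {"1999", "2500"}


def is_known_amount(transaction: dict) -> bool:
    """No type check: membership simply fails when the type changes."""
    return transaction["amount"] in stored_amounts


match_before = is_known_amount({"amount": "1999"})
match_after = is_known_amount({"amount": 1999})  # no exception raised


# Scenario 2: the consumer reads the deprecated field and renders it.
def failure_reason_label(user: dict) -> str:
    """Display helper that tolerates null by showing an empty label."""
    code = user.get("failure_code")
    return code if code is not None else ""


label_before = failure_reason_label({"failure_code": "insufficient_funds"})
label_after = failure_reason_label(
    {
        "failure_code": None,
        "last_payment_error": {"code": "insufficient_funds"},
    }
)
```

In both cases the "after" call returns a legitimate-looking value (`False`, an empty string), which is why nothing upstream of the consumer notices.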

The pattern in all three scenarios is the same: the consumer that is most likely to fail silently is the one that handles unexpected values with a default path rather than throwing. This is actually better engineering practice from a resilience standpoint — defensive code that does not crash on unexpected input is more robust in most contexts. The cost is that it absorbs changes that should have been coordinated, making problems invisible until they accumulate into something large enough to notice.
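The trade-off between the two styles shows up clearly in scenario 3. This is a sketch under assumptions: the `pending`, `cancelled`, and `fulfilled` statuses and the `quantity` and `shipped` fields are invented for the example; only `partially_fulfilled` comes from the scenario.

```python
def units_to_release_lenient(order: dict) -> int:
    """Defensive consumer: any status it does not special-case is
    handled by the default path, which assumes full fulfillment."""
    if order["status"] in ("pending", "cancelled"):
        return 0
    return order["quantity"]  # default path


def units_to_release_strict(order: dict) -> int:
    """Exhaustive consumer: unknown statuses raise instead of guessing."""
    status = order["status"]
    if status in ("pending", "cancelled"):
        return 0
    if status == "fulfilled":
        return order["quantity"]
    raise ValueError(f"unrecognized order status: {status!r}")


order = {"status": "partially_fulfilled", "quantity": 10, "shipped": 4}

lenient_result = units_to_release_lenient(order)

try:
    units_to_release_strict(order)
    strict_raised = False
except ValueError:
    strict_raised = True
```

The lenient version releases all 10 units although only 4 shipped, and reports nothing. The strict version fails on the first such order, which is disruptive but points straight at the uncoordinated change.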

The "find all consumers" problem

Before changing an API, the changing team needs to know three things: who calls this endpoint, who reads this specific field, and who depends on the current response shape. In a monolith, these questions are answerable with standard tooling — the codebase is one repository, the dependencies are explicit, and a search finds all callers. In a distributed system with multiple repositories, multiple teams, and services that call APIs over HTTP rather than through explicit dependencies, this is not a grep. It is a cross-repo behavioral trace.

The difficulty scales with system age and team size. In a new system with three services, the consumer map fits in one person's head. In a system that has been in production for five years, built by ten teams, with forty services and three generations of engineers, the consumer map exists in parts — each team knows their own services, nobody knows the whole. A complete consumer trace requires synthesizing the actual call patterns across every repository, not assembling a map from memory.

The AI coding tool angle makes this worse in a specific way. When a developer uses Cursor or Claude Code to update an API response shape, the model generates the change with awareness of the current file and the files it is instructed to check. It does not have access to the analytics worker in a different repository that reads the field being changed. It does not know about the mobile client maintained by the platform team. The change looks complete because the model's context is complete within its window. The blast radius extends beyond that window into repos the model was not given access to.

How contract testing helps and where it stops

Consumer-driven contract testing — Pact being the most common implementation — is the correct architectural response to the consumer discovery problem. The model is sound: each consumer publishes a contract describing what it expects from the API, and the API provider runs tests against all published contracts before shipping. A breaking change fails the provider's contract tests before it reaches production.
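The core idea can be illustrated without any framework. The sketch below is not Pact's API; it is a hand-rolled approximation in which each consumer declares the fields it reads and the types it expects, and the provider checks a candidate response against every declaration. The contracts and field names are hypothetical.

```python
# Hypothetical consumer-published expectations: field name -> expected type.
CONSUMER_CONTRACTS = {
    "OrderService": {"id": str, "status": str},
    "CheckoutService": {"id": str, "amount": str},
}


def verify_provider(response: dict, contracts: dict) -> dict:
    """Return {consumer: [violations]} for a candidate provider response."""
    failures = {}
    for consumer, expected in contracts.items():
        problems = [
            f"{field}: expected {expected_type.__name__}"
            for field, expected_type in expected.items()
            if not isinstance(response.get(field), expected_type)
        ]
        if problems:
            failures[consumer] = problems
    return failures


# Candidate response after changing 'amount' from string to integer.
candidate = {"id": "o_1", "status": "pending", "amount": 1999}
failures = verify_provider(candidate, CONSUMER_CONTRACTS)
print(failures)
```

Run in the provider's CI, a non-empty `failures` result blocks the change and names the consumer that would break. A consumer that never published an entry in `CONSUMER_CONTRACTS` is simply not checked.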

The problem is coverage. Contract testing works for the services that have published contracts. It does not work for the services that have not. In practice, most teams have partial coverage: the core services maintained by the API team have contracts, the services owned by adjacent teams may or may not, and the legacy workers and reporting jobs almost certainly do not. The coverage gap is not a failure of the contract testing approach — it is a reflection of the organizational reality that adopting contract testing requires every consumer team to do work. Not every consumer team does.

Contract testing coverage vs. actual API consumer map — where breakages live:

  Services with published consumer contracts (Pact or equivalent):
  -> OrderService           [contract exists, tests run in CI]
  -> CheckoutService        [contract exists, tests run in CI]
  -> MobileClient           [contract exists, platform team maintains it]

  Services that call the same API without contracts:
  -> AnalyticsWorkerService [reads /orders endpoint, no contract]
  -> AdminReportingService  [reads /orders endpoint, no contract]
  -> PartnerWebhookService  [consumes order events, no contract]
  -> LegacyExportJob        [reads order fields for nightly CSV, no contract]

  What contract testing catches: breakages in the three services with contracts
  What contract testing misses: breakages in the four services without contracts

  The four services without contracts are:
  -> older (built before contract testing was adopted)
  -> maintained by different teams (analytics, ops, platform)
  -> lower-visibility (workers, not user-facing services)
  -> exactly the services where silent failures go unnoticed longest
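Expressed as data, the gap is a set difference. The sketch below uses the service names from the listing above and assumes the full consumer list has already been obtained by some other means, which is the hard part.

```python
# All services observed calling the API (assumed to be known already).
observed_consumers = {
    "OrderService",
    "CheckoutService",
    "MobileClient",
    "AnalyticsWorkerService",
    "AdminReportingService",
    "PartnerWebhookService",
    "LegacyExportJob",
}

# Services that have published a consumer contract.
contracted_consumers = {"OrderService", "CheckoutService", "MobileClient"}

# Consumers a contract test run cannot protect.
uncovered = sorted(observed_consumers - contracted_consumers)
coverage = len(observed_consumers & contracted_consumers) / len(observed_consumers)

print(uncovered)
print(f"contract coverage: {coverage:.0%}")
```

A green contract-test run here speaks for three of seven consumers. The other four appear only if someone builds `observed_consumers` from the codebase rather than from the contract registry.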

The services without contracts are systematically the higher-risk ones. They are older — built before contract testing was standard practice. They are lower-visibility — workers and reporting jobs rather than user-facing services. They are owned by teams with less direct stake in the API contract — analytics, ops, data teams who depend on the API but did not build it. And they are the services where silent failures go unnoticed the longest, because they do not have active engineering attention watching for production errors.

Contract testing is the right long-term solution and worth investing in. But "we will eventually have full contract coverage" is not a defense against breakages happening today, and partial coverage creates a false confidence that all consumers are protected when only the well-maintained subset is.

What cross-service API change discovery actually needs

A reliable pre-change consumer discovery process requires three specific inputs. The first is a complete trace of all callers of the endpoint — not callers in the changing team's repository, but callers across every repository that has been indexed. The second is a field-level reference trace: not just "what calls this endpoint" but "what reads this specific field." An endpoint may have fifty consumers and only three that read the particular field being changed — those three are the blast radius for a field-level change, and a consumer map at the endpoint level overstates the risk while a field-level trace gives the accurate picture.

The third input is open work in progress that touches the same endpoint or field from the consumer side. When a consumer team is actively modifying the code that reads a field, a change to that field's behavior creates an immediate collision — the consumer's in-progress work will be built against either the old shape or the new shape, and that decision needs to be made explicitly rather than discovered when the change ships and the in-progress work fails CI.
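The gap between an endpoint-level map and a field-level trace can be sketched as follows. The `FieldRead` records, the `/payments` path, and the service names are hypothetical stand-ins for whatever a real cross-repo trace would produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRead:
    """One observed read of a response field by a consuming service."""
    service: str
    endpoint: str
    field: str


READS = [
    FieldRead("BillingService", "/payments", "processor_response"),
    FieldRead("BillingService", "/payments", "amount"),
    FieldRead("AdminPanelService", "/payments", "processor_response"),
    FieldRead("ReceiptMailer", "/payments", "amount"),
    FieldRead("SearchIndexer", "/payments", "id"),
]


def endpoint_consumers(reads: list, endpoint: str) -> list:
    """Every service that calls the endpoint at all."""
    return sorted({r.service for r in reads if r.endpoint == endpoint})


def field_consumers(reads: list, endpoint: str, field: str) -> list:
    """Only the services that read one specific field."""
    return sorted(
        {r.service for r in reads if r.endpoint == endpoint and r.field == field}
    )


all_callers = endpoint_consumers(READS, "/payments")
blast_radius = field_consumers(READS, "/payments", "processor_response")
```

Four services call the endpoint, but only two read the field being changed. Those two are the teams to talk to first.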

Pre-change consumer discovery — what to find before changing an API field:

  Field being changed: payment.processor_response (string → structured object)

  Cross-repo trace results:
  -> PaymentsService         reads field to log raw response          [owner: payments team]
  -> BillingService          reads field to extract error codes        [owner: billing team]
  -> FraudDetectionService   reads field to parse processor signals    [owner: risk team]
  -> AdminPanelService       displays field in charge detail view      [owner: frontend team]
  -> DisputeHandlerService   reads field to determine chargeback type  [owner: operations team]
  -> AuditLogWorker          stores field for compliance records       [owner: data team]

  Open Jira work touching this field:
  -> RISK-214  [in sprint] FraudDetectionService adding new processor signal parsing
  -> DATA-89   [in sprint] AuditLogWorker schema update — touches processor_response

  Action required before changing this field:
  -> Coordinate with 5 additional teams (not 0)
  -> Sequence change after RISK-214 ships or align on shared interface
  -> Align DATA-89 schema update with the new structured object shape
  -> Total impact: field change that looked self-contained affects 6 services
     and has active sprint conflicts with 2 of them
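Once the trace and the open tickets exist as data, the action list above falls out of a simple join. The sketch below uses the values from the listing and assumes the payments team is the one making the change; the data structures are illustrative, not any tool's output format.

```python
# Service -> owning team, taken from the trace listing above.
trace = {
    "PaymentsService": "payments team",
    "BillingService": "billing team",
    "FraudDetectionService": "risk team",
    "AdminPanelService": "frontend team",
    "DisputeHandlerService": "operations team",
    "AuditLogWorker": "data team",
}

# Open tickets that touch the field from the consumer side.
open_tickets = [
    {"key": "RISK-214", "service": "FraudDetectionService"},
    {"key": "DATA-89", "service": "AuditLogWorker"},
]

changing_team = "payments team"  # assumption for this example

# Teams other than the changing team that own an affected service.
teams_to_coordinate = sorted(
    {owner for owner in trace.values() if owner != changing_team}
)

# Affected services with in-flight work that will collide with the change.
sprint_conflicts = {
    t["service"]: t["key"] for t in open_tickets if t["service"] in trace
}

print(len(trace), "services affected")
print(len(teams_to_coordinate), "additional teams:", teams_to_coordinate)
print("sprint conflicts:", sprint_conflicts)
```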

The output of that discovery process changes the conversation entirely. A field change that was going to be handled as a standard PR with a deprecation note in the description becomes a cross-team coordination effort with a specific sequencing decision. That is not a worse outcome — it is an accurate picture of what the change actually requires. The alternative is making the change without the picture and handling the coordination reactively when consumers start breaking.

Kognita for API change discovery

Kognita provides semantic cross-repo tracing that finds all callers of an API and all field references before the change is made. The blast radius question — "who reads this field?" — is answered from the current state of the codebase across all indexed repositories, not from memory or outdated architecture documentation.

The distinction from grep or basic code search is the semantic layer. A field referenced as response.processor_response in one service, destructured as const { processorResponse } = payment in another, and accessed as payment['processor_response'] in a legacy worker are all references to the same field. A keyword search that only matches one syntax misses the others. A behavioral trace identifies all of them as consumers of the same field regardless of how the reference is written.
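To make the keyword-search gap concrete, here is a deliberately naive sketch that normalizes identifiers before comparing them. It is a lexical approximation written for this article, not a description of how Kognita works, and it falls well short of a behavioral trace: it would also match an unrelated variable that happens to share the name.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def normalize(identifier: str) -> str:
    """Collapse snake_case and camelCase spellings to one lowercase key."""
    return re.sub(r"[^a-z0-9]", "", identifier.lower())


def references_field(line: str, field: str) -> bool:
    """True if any identifier on the line normalizes to the field name."""
    target = normalize(field)
    return any(normalize(tok) == target for tok in IDENTIFIER.findall(line))


lines = [
    "log(response.processor_response)",
    "const { processorResponse } = payment",
    "raw = payment['processor_response']",
    "total = payment.amount",
]

keyword_hits = [ln for ln in lines if "processor_response" in ln]
normalized_hits = [ln for ln in lines if references_field(ln, "processor_response")]
```

The plain substring search finds two of the three references and misses the destructured one. The normalized match finds all three and still ignores the unrelated `amount` line.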

The Jira MCP integration connects the structural blast radius to the active sprint context: open tickets that touch the same endpoint or field from the consumer side surface automatically, so the changing team knows about active collisions before the change ships rather than after. The index is managed and continuously updated — the consumer map reflects the current codebase, not the architecture as it was documented at the last major review.

For non-engineering stakeholders — product managers assessing the scope of a proposed API change, engineering managers evaluating the risk of a migration — the plain-language interface means the consumer discovery query does not require a senior engineer to run a search manually. The blast radius is queryable by anyone who needs to understand it.

Final take

Distributed systems are only as stable as the contract awareness between services. A team that changes an API without knowing its full consumer set is not being reckless — they are working blind. They are making a decision that affects services they do not own, teams they did not consult, and data pipelines that will fail silently while the incident tracker stays quiet. The information they need to make the change safely exists in the codebase. The failure is in not extracting it before the change ships.

Making the consumer map queryable before changes is the structural fix. Contract testing is the long-term investment. Cross-repo consumer discovery is what makes safe API changes possible in the time between now and full contract coverage. The choice is between finding all consumers before the change and finding them one by one after something breaks.