AI-Generated Tests Pass. They Just Don't Test the Right Things.

Coverage went up. CI is green. The AI wrote forty tests in the time it used to take to write four. And yet the regression you shipped last week (the one where a network timeout causes callers to silently skip charges instead of retrying) was not caught by any of them.

The tests were not wrong. They passed because the function did exactly what the tests expected. The problem is that the tests were written to match the implementation, not to verify what the rest of the system depends on. The function returns null on timeout. The tests verified that null comes back. Nobody tested what the callers do with null — because the AI that generated the tests did not know the callers existed.

This is the state of AI-generated testing in most codebases right now. Coverage numbers are climbing. Meaningful test coverage — the kind that catches failures before production — is not keeping pace.

Why green CI is a misleading signal in AI-assisted codebases

Coverage and CI status have always been proxies for test quality, not measures of it. A test suite that exercises every branch of a function can still fail to catch a regression if it never tests what callers expect from that function. That gap has always existed. AI coding tools widen it significantly.

When a developer writes tests manually, they bring system knowledge to the task. They know that chargeCard is called by the subscription renewal job, which retries on false but not on null, because they wrote the renewal job last quarter. They know that a successful charge needs to include a chargeId field because they traced an incident where the downstream status updater broke when that field was missing. That knowledge shapes the tests they write.

An AI coding tool generates tests from the function it can see. It does not know the callers. It does not know the incident history. It does not know what downstream services consume the return type. It writes tests that are internally consistent with the implementation — and those tests will pass every time, because they are testing the same assumptions the implementation encodes. This is not a failure of the AI tool. It is a context problem. The tool has access to the function. It does not have access to the system.
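To make the caller dependency concrete, here is a minimal sketch of a renewal job that retries on an explicit failure but not on null. The names renewSubscription, ChargeFn, and RenewalOutcome are hypothetical, the result shape is assumed, and the real job is presumably asynchronous; the sketch is synchronous only to stay short.

```typescript
// Hypothetical shapes: the article does not show the real ChargeResult
// or the real renewal job, so everything here is illustrative.
type ChargeResult = { success: boolean; chargeId?: string };
type ChargeFn = (userId: string, amount: number) => ChargeResult | null;
type RenewalOutcome = "charged" | "retry_scheduled" | "skipped";

// Retries on an explicit failure, but treats null as "nothing to do".
// That first branch is how a timeout turns into a silently skipped charge.
function renewSubscription(charge: ChargeFn, userId: string, amount: number): RenewalOutcome {
  const result = charge(userId, amount);
  if (result === null) {
    return "skipped";
  }
  return result.success ? "charged" : "retry_scheduled";
}

// Stand-ins for chargeCard under two conditions.
const declinedCharge: ChargeFn = () => ({ success: false });
const timedOutCharge: ChargeFn = () => null;

console.log(renewSubscription(declinedCharge, "user_1", 100)); // retry_scheduled
console.log(renewSubscription(timedOutCharge, "user_1", 100)); // skipped
```

A test written against chargeCard alone never reaches the "skipped" branch. Only a test that drives the caller with a timed-out charge exposes it.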

The result is a coverage number that is technically accurate and practically misleading. Ninety-four percent coverage of a function whose callers were never tested is not ninety-four percent confidence in the system. It is ninety-four percent confidence in the function's internal behavior, which was never in question.

How AI generates tests and what it cannot know without system context

AI coding tools generate tests by reasoning about the code they are given. They infer input domains from type signatures and parameter names. They identify branches from conditionals. They construct mock setups from imports and dependencies. For a well-scoped function, this process produces tests that are comprehensive within the function's visible behavior: valid inputs return expected outputs, invalid inputs throw or return error states, external dependencies are called with the right arguments.

What AI tools cannot infer from the function alone:

Which callers depend on this function and what they expect from it
Which parts of the return type are load-bearing for downstream consumers
What failure modes have been encountered in production and need regression coverage
Which behaviors were changed in hotfixes and need explicit guards
How this function interacts with state machines, queues, or event flows that span multiple services
Which contracts this function is implicitly obligated to honor by its position in the call graph

All of this context exists in the codebase. It is just not in the file the AI is looking at.

Consider the payment handler scenario in detail. The function is four screens of TypeScript. An AI tool can comprehend it completely. But the contracts it needs to test are defined by the subscription renewal job in a different service directory, the status updater in a third location, the billing state machine in a fourth, and an incident retrospective that exists only in a Jira ticket. No single file contains all of that. No prompt context window holds all of that without deliberate retrieval. The AI writes the tests it can write and the system gaps remain untested.

What AI-generated tests cover vs. what production failure modes they miss

Payment handler: AI-generated tests vs. production failure modes

Function under test:
  async function chargeCard(userId: string, amount: number): Promise<ChargeResult | null>

What AI-generated tests cover:
  -> chargeCard returns { success: true } when Stripe responds OK
  -> chargeCard returns { success: false } when Stripe returns card_declined
  -> chargeCard throws when amount is zero or negative
  -> chargeCard calls stripeClient.charges.create with correct params
  -> TypeScript types are satisfied end-to-end

What AI-generated tests do NOT cover:
  -> What happens when chargeCard is called while a refund for the same userId
     is already in-flight (race condition in the billing state machine)
  -> What CallerA (subscription renewal) does when chargeCard returns null vs.
     when it throws (the caller handles these differently; chargeCard currently
     returns null on network timeout, and CallerA silently skips the charge
     instead of retrying)
  -> What happens when the idempotency key collides across retries
     (the existing retry wrapper reuses the key on first retry, causing
     Stripe to return the original charge — the test suite never exercises this)
  -> What the downstream SubscriptionStatusUpdater expects in the ChargeResult
     payload (it reads result.chargeId, but the function returns result.id —
     a field name mismatch that exists in production right now)
  -> Behavior under the 30-second Stripe timeout (a separate code path that
     sets a PENDING status; no AI-generated test covers the PENDING state)

Coverage report: 94%
CI: green
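The idempotency item in the list above is one of the easier gaps to exercise once someone knows it exists. The following is a rough sketch of a fake gateway for tests, assuming only the general property of idempotency keys: a repeated key returns the original result instead of creating a second charge. createFakeGateway and FakeCharge are hypothetical names, and this is not Stripe's client.

```typescript
// A fake gateway for tests. It models one property of idempotency keys:
// a repeated key replays the original charge. Names are illustrative.
interface FakeCharge {
  id: string;
  amount: number;
}

function createFakeGateway() {
  const chargesByKey: Record<string, FakeCharge> = {};
  let created = 0;
  return {
    charge(amount: number, idempotencyKey: string): FakeCharge {
      if (Object.prototype.hasOwnProperty.call(chargesByKey, idempotencyKey)) {
        return chargesByKey[idempotencyKey]; // replay: no new charge is created
      }
      created += 1;
      const charge: FakeCharge = { id: "ch_" + created, amount: amount };
      chargesByKey[idempotencyKey] = charge;
      return charge;
    },
    createdCount(): number {
      return created;
    },
  };
}

const gateway = createFakeGateway();
const first = gateway.charge(100, "renewal-user_1-cycle-1");
const retry = gateway.charge(100, "renewal-user_1-cycle-1"); // retry reuses the key
const nextCycle = gateway.charge(100, "renewal-user_1-cycle-2");

console.log(first.id === retry.id); // true
console.log(gateway.createdCount()); // 2
```

With a fake like this in place, a test can assert what the retry wrapper does when it gets the original charge back, which is the path the generated suite never touches.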

The three types of tests AI writes well vs. the tests that only system context enables

AI tools are genuinely good at generating certain categories of tests. Identifying them clearly helps teams allocate AI-generated testing to where it adds value and supplement it where it does not.

Where AI-generated tests are reliable

Unit tests for pure functions. Functions with no side effects, clear input-output contracts, and no downstream consumers are exactly where AI-generated tests shine. Utility functions, transformations, validators, formatters — these are well-covered by tests that reason from the function's visible behavior. The test quality matches the test scope.

Branch coverage for conditional logic. AI tools reliably identify branches and construct inputs that exercise each one. If a function has five code paths, an AI tool will typically find all five. This is genuinely useful work that is tedious to do manually.

Type contract verification. AI tools do well at checking that a function's output conforms to its TypeScript type, that required fields are present, and that optional fields are handled, because the type definition is usually in the same file or imported by it.

Where AI-generated tests require system context to be useful

Caller contract tests. Testing that a function's output satisfies the expectations of every caller — not just the TypeScript type, but the behavioral contract. This requires knowing the callers and what they do with the output. An AI tool with only the function in context cannot write these.

Regression guards for production failures. Tests that explicitly guard against known failure modes — "this function returned null instead of throwing and the caller swallowed it silently" — require incident history and knowledge of what has broken before. This context lives in git history, incident retrospectives, and Jira tickets, not in the function itself.

Integration boundary tests. Tests that verify how this function participates in a broader flow — the state machine transition it triggers, the event it publishes and what subscribers expect, the queue message format it produces. These require understanding the function's position in the system graph, which is invisible from a single file.
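As an illustration of an integration boundary test, the sketch below checks that a charge is blocked while another billing operation is in flight. It assumes a simple rule table; only CHARGE_IN_FLIGHT and PENDING are named in this article, so the other states, the rules, and the attemptCharge helper are all hypothetical.

```typescript
// Hypothetical billing states and rules. A real suite would read these
// from the billing state machine rather than restate them.
type BillingState = "ACTIVE" | "CHARGE_IN_FLIGHT" | "REFUND_IN_FLIGHT" | "PENDING";

const chargeAllowedFrom: Record<BillingState, boolean> = {
  ACTIVE: true,
  CHARGE_IN_FLIGHT: false,
  REFUND_IN_FLIGHT: false,
  PENDING: false,
};

interface AttemptResult {
  status: BillingState;
  gatewayCalled: boolean;
}

// Stand-in for the guarded part of chargeCard: consult the state machine
// first, and only reach the payment gateway when the state allows it.
function attemptCharge(current: BillingState, callGateway: () => void): AttemptResult {
  if (!chargeAllowedFrom[current]) {
    return { status: current, gatewayCalled: false };
  }
  callGateway();
  return { status: "CHARGE_IN_FLIGHT", gatewayCalled: true };
}

// Boundary check: a refund in flight must block the charge entirely.
let gatewayCalls = 0;
const blocked = attemptCharge("REFUND_IN_FLIGHT", () => { gatewayCalls += 1; });
console.log(blocked.status, blocked.gatewayCalled, gatewayCalls); // REFUND_IN_FLIGHT false 0
```

The assertion that matters is the last one: the gateway was never called. That is a statement about the flow, not about the function's return value.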

What "tests that test the right things" actually requires

A test suite tests the right things when it guards the behaviors that matter to the system, not just the behaviors that are easy to observe from the function's code. Getting there requires answering questions that cannot be answered from the function file alone.

Who calls this function? What do they do with each possible return value? What was the last production incident involving this function, and is there a regression test for the specific failure mode? What does this function's output feed into downstream, and are the field names and types consistent with what those consumers expect? If this function throws instead of returning an error value, what happens to the caller's transaction?

These questions have answers. They are encoded in the codebase — in caller files, in the git log, in incident-driven test comments, in downstream consumer implementations. But finding and synthesizing those answers requires traversing the codebase, not just reading a single file.
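The field-name question in particular lends itself to a mechanical check. Here is a small sketch, assuming the fields each consumer reads have already been collected from the consumer code. The fieldsReadBy table and the missingFields helper are hypothetical, and the payload shape is taken from the chargeId versus id mismatch described earlier.

```typescript
// Hypothetical record of which fields each consumer reads from the result.
// In practice this would be gathered from the consumer implementations.
const fieldsReadBy: Record<string, string[]> = {
  SubscriptionStatusUpdater: ["chargeId"],
  SubscriptionRenewalJob: ["success"],
};

// Returns, per consumer, the fields it reads that the payload lacks.
function missingFields(
  payload: Record<string, unknown>,
  reads: Record<string, string[]>
): Record<string, string[]> {
  const gaps: Record<string, string[]> = {};
  for (const consumer of Object.keys(reads)) {
    const missing = reads[consumer].filter((field) => !(field in payload));
    if (missing.length > 0) {
      gaps[consumer] = missing;
    }
  }
  return gaps;
}

// The payload as described in the article: the charge ID is exposed as id.
const currentPayload = { success: true, id: "ch_123" };

// Reports chargeId as missing for SubscriptionStatusUpdater.
console.log(missingFields(currentPayload, fieldsReadBy));
```

The check is trivial once the consumer table exists. Building that table is the part that requires traversing the codebase.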

System context an AI tool needs to write meaningful integration tests — and where it lives

System context an AI tool needs to write meaningful integration tests
for the chargeCard function — and where that context lives

1. Call graph: who calls chargeCard and what they expect back
   Lives in: execution traces, service dependency graph, not in chargeCard.ts

2. Caller-specific error handling: does SubscriptionRenewalJob retry on null?
   Does the API gateway surface the error or swallow it?
   Lives in: SubscriptionRenewalJob.ts, api/billing/route.ts — not imported here

3. State machine constraints: what billing states are valid before a charge?
   What states does a successful charge transition to?
   Lives in: BillingStateMachine.ts, state transition tests — separate service

4. Idempotency behavior: how does the retry wrapper interact with Stripe keys?
   Lives in: shared/lib/stripe-retry.ts — not in scope of the function itself

5. Known production failure modes: the null-vs-throw behavior was introduced
   in a hotfix 4 months ago; it's not documented, it's a live behavior
   Lives in: git history, incident retrospective, runtime behavior — nowhere
   an AI session can retrieve without a semantic index of the whole system

6. Contract expectations: what fields does ChargeResult need to include for
   every downstream consumer?
   Lives in: the consumers themselves, not in the ChargeResult type definition

Without this context, AI writes unit tests for the function.
With this context, AI writes tests for the system.

The pattern holds across functions. The context needed to write meaningful tests for any non-trivial function is distributed across the codebase. The function's file contains the implementation. The system context — callers, consumers, contracts, failure history — lives elsewhere. Without retrieval that spans the whole codebase, AI-generated tests are bounded by what is visible in a single editing session.

How codebase grounding improves AI test generation

The gap between implementation tests and system tests closes when the AI session has access to the full codebase graph. Not just the function under test, but its callers, its consumers, its position in state machines and event flows, and the behavioral patterns established by similar functions elsewhere in the system.

With a semantic index of the codebase — one that captures execution relationships, not just text — an AI tool generating tests for chargeCard can retrieve the subscription renewal job and see how it handles the return value. It can retrieve the status updater and verify which fields it reads from the result. It can retrieve the billing state machine and understand which states are valid preconditions for a charge. It generates tests that guard these contracts because it knows the contracts exist.
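A toy version of that retrieval step might look like the following. This is a sketch of a one-hop call graph lookup, not Kognita's API or data model; the edge list, the symbol names, and the filesToRetrieve helper are hypothetical, and the file paths reuse the ones mentioned in this article where available.

```typescript
// Hypothetical call edges. A real index would derive these from the code.
interface CallEdge {
  caller: string;
  callee: string;
}

const callEdges: CallEdge[] = [
  { caller: "SubscriptionRenewalJob", callee: "chargeCard" },
  { caller: "billingRoute", callee: "chargeCard" },
  { caller: "chargeCard", callee: "stripeRetry" },
  { caller: "InvoiceMailer", callee: "formatInvoice" }, // unrelated edge
];

const definedIn: Record<string, string> = {
  SubscriptionRenewalJob: "SubscriptionRenewalJob.ts",
  billingRoute: "api/billing/route.ts",
  chargeCard: "chargeCard.ts",
  stripeRetry: "shared/lib/stripe-retry.ts",
  InvoiceMailer: "InvoiceMailer.ts",
  formatInvoice: "formatInvoice.ts",
};

// Collects the files of every direct caller and callee of the target:
// the one-hop neighborhood a test generator would need in context.
function filesToRetrieve(target: string, edges: CallEdge[]): string[] {
  const files: string[] = [];
  for (const edge of edges) {
    let neighbor: string | null = null;
    if (edge.callee === target) {
      neighbor = edge.caller;
    } else if (edge.caller === target) {
      neighbor = edge.callee;
    }
    if (neighbor === null) {
      continue;
    }
    const file = definedIn[neighbor];
    if (file !== undefined && files.indexOf(file) === -1) {
      files.push(file);
    }
  }
  return files.sort();
}

console.log(filesToRetrieve("chargeCard", callEdges));
```

For chargeCard this returns the renewal job, the billing route, and the retry wrapper, and leaves the unrelated invoice files out. Consumers that read the result without calling the function, such as the status updater, would need a data-flow edge rather than a call edge, which this sketch does not model.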

This is not about giving the AI more text to read. It is about giving the AI the right text: the behavioral context that defines what a function is supposed to do for the system, not just what it currently does in isolation. A function that returns null on timeout and a caller that silently drops charges on null are two separate facts. A semantic index that captures the execution relationship between them makes the gap visible. An AI session with that context writes a test for it.
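The "two separate facts" idea can be sketched as a join between what a function can return and how each caller reacts. Everything below is hypothetical (the Outcome and Reaction vocabularies, the two tables, and findSilentDrops); it only illustrates how pairing the facts surfaces the gap.

```typescript
type Outcome = "success" | "failure" | "null";
type Reaction = "proceed" | "retry" | "skip";

// Fact 1 (producer side): what chargeCard can hand back, and when.
const producerFacts: Record<Outcome, string> = {
  "success": "Stripe responds OK",
  "failure": "card declined",
  "null": "network timeout",
};

// Fact 2 (consumer side): how each caller reacts to each outcome.
const callerReactions: Record<string, Record<Outcome, Reaction>> = {
  SubscriptionRenewalJob: { "success": "proceed", "failure": "retry", "null": "skip" },
};

interface Gap {
  caller: string;
  outcome: Outcome;
  cause: string;
}

// Joins the two facts: any non-success outcome that a caller skips
// is a candidate for a dedicated test.
function findSilentDrops(): Gap[] {
  const gaps: Gap[] = [];
  const outcomes: Outcome[] = ["success", "failure", "null"];
  for (const caller of Object.keys(callerReactions)) {
    for (const outcome of outcomes) {
      if (outcome !== "success" && callerReactions[caller][outcome] === "skip") {
        gaps.push({ caller: caller, outcome: outcome, cause: producerFacts[outcome] });
      }
    }
  }
  return gaps;
}

// Flags SubscriptionRenewalJob skipping on null, caused by a network timeout.
console.log(findSilentDrops());
```

Neither table is interesting alone. The flagged pair only appears when both are available in the same place.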

AI-generated tests for chargeCard: without vs. with codebase context

Testing chargeCard — before and after codebase grounding

--- Before: AI session with only chargeCard.ts in context ---

test('returns success on valid charge', async () => {
  mockStripe.charges.create.mockResolvedValue({ id: 'ch_123', status: 'succeeded' })
  const result = await chargeCard('user_1', 100)
  expect(result.success).toBe(true)
})

test('returns failure on card declined', async () => {
  mockStripe.charges.create.mockRejectedValue({ code: 'card_declined' })
  const result = await chargeCard('user_1', 100)
  expect(result.success).toBe(false)
})

test('throws on invalid amount', async () => {
  await expect(chargeCard('user_1', 0)).rejects.toThrow('Invalid amount')
})

// Coverage: 94%. Tests the implementation. Does not test the system.

--- After: AI session grounded in full codebase context via Kognita ---

test('returns null on network timeout — callers must handle null, not just false', async () => {
  mockStripe.charges.create.mockRejectedValue({ type: 'StripeConnectionError', code: 'ETIMEDOUT' })
  const result = await chargeCard('user_1', 100)
  // Regression guard: null was introduced in a hotfix; SubscriptionRenewalJob
  // silently skips on null, so this test pins the behavior and forces any
  // change to it to be deliberate
  expect(result).toBeNull()
})

test('ChargeResult includes chargeId field expected by SubscriptionStatusUpdater', async () => {
  mockStripe.charges.create.mockResolvedValue({ id: 'ch_123', status: 'succeeded' })
  const result = await chargeCard('user_1', 100)
  // SubscriptionStatusUpdater reads result.chargeId — not result.id
  expect(result).toHaveProperty('chargeId')
})

test('concurrent charge attempt for same userId returns existing charge state', async () => {
  // BillingStateMachine enforces one active charge per user
  // This test guards the contract between chargeCard and the state machine
  await BillingStateMachine.setChargeInFlight('user_1')
  const result = await chargeCard('user_1', 100)
  expect(result?.status).toBe('CHARGE_IN_FLIGHT')
  expect(mockStripe.charges.create).not.toHaveBeenCalled()
})

// Coverage: 91%. Tests system contracts. Guards actual production failure modes.

The before-and-after difference is not the quantity of tests. Both suites contain three. The difference is what the tests are guarding. The before suite verifies that the function behaves according to its own implementation. The after suite verifies that the function satisfies its contracts with the rest of the system, contracts that were not visible without traversing the codebase.

The practical implication for teams using AI coding tools heavily: the test quality ceiling is set by the context the AI has access to. Teams that give AI sessions a managed semantic index of the whole codebase (not just the current file, and not a rules file describing conventions) get tests that reflect system behavior. Teams that do not provide that index get tests that reflect implementation details. The coverage numbers look similar. The failure modes caught do not.

Kognita's semantic index is built for exactly this scenario. It captures execution-aware relationships across repos — who calls what, what feeds into what, which behaviors are contracts and which are implementation details. When AI coding sessions are grounded in that index, the tests they generate for a payment handler know about the subscription renewal job, the status updater, and the billing state machine. Not because those files were manually included in context, but because the index maintains the relationships between them and surfaces them automatically when the test generation task requires them.

Final take

Green CI in an AI-assisted codebase is a weaker signal than it used to be. AI tools generate tests that pass by design — they match the implementation they are given. The tests are not wrong. They are just not testing what matters to the system.

The failure mode is invisible in the short term. Coverage goes up. Tests are green. The regressions accumulate in the gaps between functions: in caller contracts that were never verified, in downstream field names that drifted from the type definition, in timeout behaviors that have never been exercised in a test suite. These gaps are not found by running tests. They are found in incidents.

The fix is not writing more tests manually. It is grounding AI test generation in system context — execution relationships, caller contracts, integration points, known failure modes. When AI sessions have that context, they write tests that guard the system rather than verify the implementation. The coverage number becomes meaningful again.