AI Code Review Is Catching the Wrong Things

The pull request passed all AI review checks. No style violations. No obvious security issues. The tests were green and the reviewer approved. Two days after merge, the new payment event handler in billing-service silently conflicted with an existing handler in payments-service — both writing to the same ledger endpoint, no deduplication, no idempotency key. The AI reviewer had seen three files. It had not seen the system.

That is the core failure mode of AI code review today. Not that the tools are weak — they are genuinely useful for what they do. The problem is that reviewing a diff is not the same as reviewing a change. A diff shows you what was added. A change is what the diff does to a running system with sixteen other services, a shared event bus, two consumer teams, and a set of architectural conventions nobody wrote down.

What AI code review tools are actually doing

Most AI code review tools — whether they are GitHub Copilot review suggestions, CodeRabbit, Sourcery, or a custom agent wired to GPT-4 — are operating on the same input: the diff. They see the lines added, the lines removed, and a narrow window of surrounding context from the same file. Some pull in the PR description and linked issues. A few can be given additional files on request.

That scope is reasonable for what they are optimized to catch: style violations, obvious security anti-patterns, missing null checks, naming issues, test coverage gaps, and common algorithm mistakes. On those tasks they save real time. A reviewer does not need to manually scan for hardcoded API keys, inconsistent error handling, or trailing whitespace. Automating that is genuinely useful.

The problem is not what they catch. It is what they cannot catch — because the information those issues depend on does not live in the diff.

Five categories AI code review consistently misses

1. Architectural mismatches

A developer modifies the shape of OrderCreatedEvent in the orders service to include a new customerId field. The diff looks clean: one new field, no breaking changes in the modified file, tests pass. What the diff does not show is that fulfillment-service, analytics-worker, and reporting-service all consume that event and deserialize it with strict schemas. The AI reviewer sees a clean addition. The system sees three broken consumers.

Architectural mismatches are not edge cases. In any system with more than two or three services sharing data structures, event schemas, or database tables, they are a routine risk. Every shared schema change is a potential mismatch. The diff will not tell you who depends on the shape you just changed.
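
To make the failure concrete, here is a minimal sketch of what a strict consumer might look like. The parser, the `totalCents` field, and the payloads are hypothetical and not taken from any real service. The sketch assumes "strict schema" means the consumer rejects fields it does not know about.

```typescript
// Hypothetical event shape the consumer was written against, before the PR.
interface OrderCreatedEventV1 {
  orderId: string;
  totalCents: number;
}

// Fields this consumer knows about. A strict parser rejects anything else.
const KNOWN_FIELDS: ReadonlySet<string> = new Set(["orderId", "totalCents"]);

// Strict deserializer: throws on unknown fields instead of ignoring them.
function parseOrderCreatedStrict(raw: string): OrderCreatedEventV1 {
  const data = JSON.parse(raw) as Record<string, unknown>;
  for (const key of Object.keys(data)) {
    if (!KNOWN_FIELDS.has(key)) {
      throw new Error(`Unknown field in OrderCreatedEvent: ${key}`);
    }
  }
  const orderId = data["orderId"];
  const totalCents = data["totalCents"];
  if (typeof orderId !== "string" || typeof totalCents !== "number") {
    throw new Error("Malformed OrderCreatedEvent");
  }
  return { orderId, totalCents };
}

// The payload as published before and after the producer-side change.
const beforeChange = JSON.stringify({ orderId: "o-1", totalCents: 1999 });
const afterChange = JSON.stringify({
  orderId: "o-1",
  totalCents: 1999,
  customerId: "c-7", // the new field from the PR
});
```

Calling `parseOrderCreatedStrict(beforeChange)` succeeds, while `parseOrderCreatedStrict(afterChange)` throws. Nothing in the producer's diff hints at that, because the parser lives in another service.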

2. Duplicate logic

A developer adds a calculateShippingCost() function to checkout-service. The implementation is correct, the tests are solid, and the PR description explains the business logic. The AI reviewer approves it. What nobody mentioned is that fulfillment-service already has a ShippingCostCalculator that does the same thing — with slightly different rounding logic for international orders.

Now the codebase has two shipping cost implementations. They will diverge over time. When the business rule changes, someone will update one and miss the other. Six months later, checkout shows one price and the fulfillment confirmation shows another. The AI reviewer could not have caught this because it only saw the new function, not the existing one in a different service.
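
A small sketch shows how quickly two "equivalent" implementations can disagree. The rate and the specific rounding rules below are assumptions made for illustration. The article only says the rounding differs for international orders.

```typescript
// Hypothetical flat rate in cents per kilogram. Not from any real service.
const RATE_CENTS_PER_KG = 333;

// The new function added to checkout-service: rounds to the nearest cent.
function calculateShippingCost(weightKg: number): number {
  return Math.round(weightKg * RATE_CENTS_PER_KG);
}

// The existing calculator in fulfillment-service. Assumed here to round
// international orders up to the next cent.
class ShippingCostCalculator {
  calculate(weightKg: number, international: boolean): number {
    const raw = weightKg * RATE_CENTS_PER_KG;
    return international ? Math.ceil(raw) : Math.round(raw);
  }
}

const weightKg = 1.25; // 1.25 * 333 = 416.25 cents before rounding

const checkoutPrice = calculateShippingCost(weightKg); // 416
const fulfillmentPrice = new ShippingCostCalculator().calculate(weightKg, true); // 417
```

Both functions pass their own unit tests. The one-cent disagreement only shows up when the same order flows through both services.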

3. Silent side effects

A schema migration renames a column in the users table from account_status to subscription_state. The migration file is clean, the ORM model is updated, the application service that triggered the migration is updated. The AI reviewer sees a tidy refactor. What it does not see: a nightly billing job in billing-worker that queries the column by name using raw SQL, a support tool in internal-ops-service that reads it directly, and an analytics pipeline that ingests it from the replica. All three break on the next deploy.

Side effects invisible to the diff are the most dangerous category because they often fail quietly. A nightly billing job whose query starts erroring may log a failure that nobody is watching; from the outside, subscription states simply stop updating. The AI reviewer had no way to know those consumers existed.
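
One rough way to surface these consumers is to search every service's source for the old column name before the migration merges. The sketch below assumes the sources are already loaded into memory. The file paths and contents are made up for illustration, and a plain text search will miss dynamically built queries.

```typescript
interface ColumnReference {
  file: string;
  line: number;
  text: string;
}

// Find every line across the given sources that mentions the column by name.
function findColumnReferences(
  sources: Record<string, string>,
  column: string
): ColumnReference[] {
  // Escape regex metacharacters so the column name is matched literally.
  const escaped = column.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`\\b${escaped}\\b`);
  const hits: ColumnReference[] = [];
  for (const file of Object.keys(sources)) {
    const content = sources[file] ?? "";
    content.split("\n").forEach((text, index) => {
      if (pattern.test(text)) {
        hits.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return hits;
}

// Hypothetical sources from services that are not part of the diff.
const sources: Record<string, string> = {
  "billing-worker/jobs/nightly.ts": [
    "const sql =",
    "  \"SELECT id FROM users WHERE account_status = 'active'\";",
  ].join("\n"),
  "internal-ops-service/support.ts":
    "const q = 'SELECT account_status FROM users WHERE id = $1';",
  "checkout-service/models/user.ts": "subscription_state: string;",
};

const staleReferences = findColumnReferences(sources, "account_status");
```

Here `staleReferences` contains two entries, one in billing-worker and one in internal-ops-service. The analytics pipeline reading from the replica would need its own check, since it may not live in a code repository at all.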

4. Convention violations

Teams develop patterns over time that never get written down in a linting rule. The team always uses a specific retry decorator on any method that calls the payments API. Background jobs always register themselves with the job registry service. Event handlers always set an idempotency key before writing to the ledger. These conventions exist because previous engineers learned them the hard way. A new engineer writes a background job that skips the registry. The AI reviewer approves it — because the convention is not in the diff, and the model has never seen the codebase before.

A human reviewer who has been on the team for a year catches this in thirty seconds. The AI reviewer, working from the diff alone, cannot.
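
For readers who have not worked with the idempotency convention mentioned above, here is a minimal sketch of what it protects against. `InMemoryLedger` and the event fields are hypothetical stand-ins; the real signature of the ledger endpoint is not given in this article.

```typescript
// Hypothetical event shape carrying a key that identifies the business fact.
interface PaymentReceivedEvent {
  amountCents: number;
  idempotencyKey: string;
}

// Stand-in for a ledger that deduplicates writes by idempotency key.
class InMemoryLedger {
  private readonly entries = new Map<string, number>();

  // Returns true if the entry was written, false if the key was already seen.
  createEntry(idempotencyKey: string, amountCents: number): boolean {
    if (this.entries.has(idempotencyKey)) {
      return false;
    }
    this.entries.set(idempotencyKey, amountCents);
    return true;
  }

  count(): number {
    return this.entries.size;
  }
}

const ledger = new InMemoryLedger();
const event: PaymentReceivedEvent = {
  amountCents: 4200,
  idempotencyKey: "payment-123",
};

// Two handlers in two services process the same event.
const firstWrite = ledger.createEntry(event.idempotencyKey, event.amountCents); // true
const secondWrite = ledger.createEntry(event.idempotencyKey, event.amountCents); // false
```

With the key, the second handler's write is a no-op and the ledger holds one entry. A handler that skips the key would add a second entry for the same payment, and nothing in its own diff would look wrong.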

5. Behavioral regressions

A refactor in checkout-service changes when the order confirmation email gets sent — from immediately after payment confirmation to after fulfillment picks up the order. The code change is clean, the logic is sound, all tests pass. But the product specification says the confirmation email is the purchase receipt, and it should arrive before the customer closes the browser. The AI reviewer sees a technically correct change. It does not know what the email is supposed to do.

Behavioral regressions that cross the product-engineering boundary are invisible to a tool that only reads code. The model would need to know what the system is supposed to do — not just what it does.
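
One way to give a reviewer, human or AI, access to that expectation is to write it down as an executable check. The sketch below is an assumption about how such a check could look. The event names are hypothetical, and it encodes only the ordering part of the spec.

```typescript
// Hypothetical lifecycle events for a single order, in the order they occur.
type OrderLifecycleEvent =
  | "payment_confirmed"
  | "confirmation_email_sent"
  | "fulfillment_picked_up";

// Product expectation: the confirmation email is the receipt, so it must go
// out before fulfillment picks up the order.
function emailSentBeforeFulfillment(
  events: readonly OrderLifecycleEvent[]
): boolean {
  const email = events.indexOf("confirmation_email_sent");
  const fulfillment = events.indexOf("fulfillment_picked_up");
  if (email === -1) return false; // the email was never sent
  if (fulfillment === -1) return true; // email is out, fulfillment still pending
  return email < fulfillment;
}

const beforeRefactor: OrderLifecycleEvent[] = [
  "payment_confirmed",
  "confirmation_email_sent",
  "fulfillment_picked_up",
];
const afterRefactor: OrderLifecycleEvent[] = [
  "payment_confirmed",
  "fulfillment_picked_up",
  "confirmation_email_sent",
];
```

`emailSentBeforeFulfillment(beforeRefactor)` is true and `emailSentBeforeFulfillment(afterRefactor)` is false. Once the expectation exists in a form like this, it becomes something a retrieval layer can hand to the reviewer alongside the diff.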

What AI review catches vs. what it misses

AI code review catches well:
  ✓ Style violations (naming, formatting, lint rules)
  ✓ Obvious security anti-patterns (SQL injection, hardcoded secrets)
  ✓ Missing null checks, uncaught exceptions
  ✓ Complexity metrics (cyclomatic complexity, function length)
  ✓ Common algorithm mistakes
  ✓ Dependency version issues

AI code review misses:
  ✗ Architectural mismatches (change X without updating Y)
  ✗ Duplicate logic already in another service
  ✗ Side effects not visible in the diff
  ✗ Team conventions the model has never seen
  ✗ Behavioral regressions against the product spec

This is a retrieval problem, not a model problem

GPT-4, Claude, and Gemini are all capable of reasoning about architectural mismatches, duplicate logic, and silent side effects. The reasoning ability is there. The failure is not the model — it is what the model is given to work with.

When you hand a capable model a three-file diff and ask it to review a change, it will review those three files well. It will not hallucinate the other twelve services that exist in the repository. It will not invent the architectural dependencies it has never been shown. It will not know the team convention it has never read. It does the best it can with what it has — which is the diff, and only the diff.

The missed catches are almost all cases where the relevant information exists somewhere in the codebase — just not in the diff. That is a retrieval gap. The model needs to be told about PaymentEventListener in another service before it can catch the conflict. It needs a list of which services consume the event schema before it can flag the breaking change. It needs the team's idempotency convention surfaced as context before it can notice the violation.

Give the same model the same diff plus the right system context, and it is in a position to catch all five categories, including behavioral regressions if the product expectation is part of that context. The model capability has been there for two years. The retrieval infrastructure to feed it the right context has not.

What context-grounded code review looks like

Context-grounded review means the reviewing agent does not just receive the diff — it receives the diff plus the system knowledge required to evaluate what the diff does to the running system. Specifically:

Cross-repo impact: which other services reference the interfaces, schemas, or event types this PR touches
Behavioral semantics: what the changed component is supposed to do, derived from how it is used across the system
Similar implementations: existing functions or services that solve the same problem, surfaced before the reviewer approves a duplicate
Team conventions: patterns extracted from how the team has solved similar problems in the past — retry strategies, registration patterns, idempotency conventions
Dependency graph: consumers of the types and schemas being modified, including in services not present in the diff
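
As a rough sketch, the five kinds of context above could be carried in one structure and rendered next to the diff before the model sees it. The type, the field names, and the output layout are all assumptions for illustration, not a description of any particular tool's format.

```typescript
// Hypothetical container for the five kinds of system context listed above.
interface ReviewContext {
  crossRepoImpact: string[];
  behavioralSemantics: string[];
  similarImplementations: string[];
  teamConventions: string[];
  dependencyGraph: string[];
}

// Render the diff plus any non-empty context sections into one review input.
function buildReviewInput(diff: string, context: ReviewContext): string {
  const sections: Array<[string, string[]]> = [
    ["Cross-repo impact", context.crossRepoImpact],
    ["Behavioral semantics", context.behavioralSemantics],
    ["Similar implementations", context.similarImplementations],
    ["Team conventions", context.teamConventions],
    ["Dependency graph", context.dependencyGraph],
  ];
  const rendered = sections
    .filter(([, items]) => items.length > 0)
    .map(
      ([title, items]) =>
        `## ${title}\n` + items.map((item) => `- ${item}`).join("\n")
    );
  return ["# Diff", diff, ...rendered].join("\n\n");
}

const reviewInput = buildReviewInput("+ PaymentReceivedHandler.ts", {
  crossRepoImpact: ["payments-service/listeners/PaymentEventListener"],
  behavioralSemantics: [],
  similarImplementations: [],
  teamConventions: ["RefundProcessedHandler uses event.idempotencyKey"],
  dependencyGraph: [],
});
```

Empty sections are dropped so the model is not handed headings with nothing under them. How the five arrays get populated is the hard part, and that is the retrieval problem this article is about.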

This is not asking the model to do something new. It is giving the model the context it needs to do what it is already capable of doing. The difference in review quality is substantial.

Same PR, without vs. with system context

PR #2847: Add payment-received event handler
  Files changed: 3
  + billing-service/src/handlers/PaymentReceivedHandler.ts
  + billing-service/src/handlers/PaymentReceivedHandler.test.ts
  ~ billing-service/src/app.module.ts

AI review output:
  ✓ No style violations
  ✓ Handler follows existing patterns
  ✓ Unit tests present
  ✓ No obvious security issues
  → LGTM

What the AI reviewer did not know:
  × payments-service already publishes the same event
  × notification-service has its own PaymentReceivedListener
  × Both handlers write to ledger-service — now writing twice
  × No deduplication key defined — idempotency broken

With full system context

PR #2847: Add payment-received event handler
  Files changed: 3

AI review output (with system context):
  ⚠ CONFLICT DETECTED
  billing-service/PaymentReceivedHandler conflicts with:
    → payments-service/listeners/PaymentEventListener (line 84)
    → notification-service/handlers/PaymentReceivedListener (line 201)

  Both downstream handlers write to ledger-service.createEntry().
  No idempotency key present — duplicate ledger entries possible.

  Similar pattern found:
    → billing-service/handlers/RefundProcessedHandler (uses event.idempotencyKey)
    Recommend aligning to that convention before merge.

  Architectural note:
    The payment event topology currently routes through payments-service.
    Adding a second publisher in billing-service creates a split-origin
    pattern that caused reconciliation issues in Q3 (see OrderRefundedEvent).
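
The conflict check in that output is simple once the handler inventory exists. Below is a minimal sketch, assuming a registry of handler records has already been extracted from the codebase. The record shape, the event names, and the `usesIdempotencyKey` flag are hypothetical, and the sample data is modeled on the example above.

```typescript
// Hypothetical record describing one event handler somewhere in the system.
interface HandlerRecord {
  service: string;
  handler: string;
  event: string;
  writesTo: string;
  usesIdempotencyKey: boolean;
}

interface Conflict {
  event: string;
  writesTo: string;
  handlers: string[];
}

// Flag any event with two or more handlers writing to the same target
// where at least one of them does not use an idempotency key.
function findDuplicateWriters(records: readonly HandlerRecord[]): Conflict[] {
  const groups = new Map<string, HandlerRecord[]>();
  for (const record of records) {
    const key = `${record.event}|${record.writesTo}`;
    const group = groups.get(key) ?? [];
    group.push(record);
    groups.set(key, group);
  }
  const conflicts: Conflict[] = [];
  groups.forEach((group) => {
    const first = group[0];
    if (first === undefined || group.length < 2) return;
    if (group.some((record) => !record.usesIdempotencyKey)) {
      conflicts.push({
        event: first.event,
        writesTo: first.writesTo,
        handlers: group.map((r) => `${r.service}/${r.handler}`),
      });
    }
  });
  return conflicts;
}

const target = "ledger-service.createEntry";
const registry: HandlerRecord[] = [
  { service: "billing-service", handler: "PaymentReceivedHandler", event: "payment-received", writesTo: target, usesIdempotencyKey: false },
  { service: "payments-service", handler: "PaymentEventListener", event: "payment-received", writesTo: target, usesIdempotencyKey: false },
  { service: "notification-service", handler: "PaymentReceivedListener", event: "payment-received", writesTo: target, usesIdempotencyKey: false },
  { service: "billing-service", handler: "RefundProcessedHandler", event: "refund-processed", writesTo: target, usesIdempotencyKey: true },
];

const conflicts = findDuplicateWriters(registry);
```

`conflicts` holds a single entry for the payment event, listing all three handlers. The check itself is a few lines; building and maintaining the registry across repositories is where the real work sits.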

How Kognita grounds code review

Kognita maintains a continuously updated semantic index of the full codebase — not individual files, but the relationships between them. Execution paths, event consumers, shared schemas, dependency graphs, and behavioral patterns are all indexed as a connected graph, not a pile of text chunks.

When a PR is opened, a reviewing agent queries Kognita through the MCP endpoint and gets back the system-level context relevant to that diff: which services reference the changed interfaces, whether similar implementations exist elsewhere, which architectural conventions apply to the pattern being added, and what the downstream consumers of any modified schemas are.

The agent does not need to scan the whole codebase — it asks Kognita. Kognita returns the relevant subgraph: the three services that consume the event type being modified, the existing handler that nearly duplicates the new one, the idempotency convention the team uses in this context. The model then reviews the diff in that context, not in isolation.
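
To illustrate what "returning the relevant subgraph" means in general terms, here is a sketch of a reverse-dependency lookup over an in-memory edge list. This is not Kognita's API or data model. The function, the edge format, and the `warehouse-dashboard` node are hypothetical, chosen only to show the traversal.

```typescript
// Each edge reads "consumer depends on dependency".
type DependencyEdge = readonly [string, string];

// Return everything that directly or transitively depends on `changed`.
function affectedBy(changed: string, edges: readonly DependencyEdge[]): string[] {
  // Build a reverse adjacency list: dependency -> its direct consumers.
  const consumersOf = new Map<string, string[]>();
  edges.forEach(([consumer, dependency]) => {
    const list = consumersOf.get(dependency) ?? [];
    list.push(consumer);
    consumersOf.set(dependency, list);
  });

  // Breadth-first walk outward from the changed node.
  const seen = new Set<string>();
  const queue: string[] = [changed];
  while (queue.length > 0) {
    const node = queue.shift() as string;
    for (const consumer of consumersOf.get(node) ?? []) {
      if (!seen.has(consumer)) {
        seen.add(consumer);
        queue.push(consumer);
      }
    }
  }
  seen.delete(changed); // guard against cycles that lead back to the start
  return Array.from(seen).sort();
}

const edges: DependencyEdge[] = [
  ["fulfillment-service", "OrderCreatedEvent"],
  ["analytics-worker", "OrderCreatedEvent"],
  ["reporting-service", "OrderCreatedEvent"],
  ["warehouse-dashboard", "reporting-service"], // hypothetical indirect consumer
  ["billing-service", "PaymentReceivedEvent"],
];

const affected = affectedBy("OrderCreatedEvent", edges);
// ["analytics-worker", "fulfillment-service", "reporting-service", "warehouse-dashboard"]
```

The traversal returns only the consumers reachable from the changed type, so unrelated services such as `billing-service` never enter the review context.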

Because Kognita re-indexes automatically as the codebase changes, the context the agent receives is current. An agent reviewing a PR in a fast-moving monorepo does not get stale dependency data — it gets the actual current state of the system.

What this changes for the review process

The most immediate change: fewer "LGTM, no issues" approvals on PRs that later cause production incidents. The payment event handler conflict described at the top of this article is caught automatically, before merge, with a specific explanation of what will break and a pointer to the convention that prevents it.

The second change is upstream. When the AI reviewer can catch architectural mismatches and convention violations consistently, human reviewers can redirect their attention to higher-level concerns: is this the right abstraction, does this belong in this service, is the approach aligned with where the system is heading. The AI handles the forensics; the human handles the design judgment.

The third change is for junior engineers specifically. Convention violations and duplicate logic are not signs of incompetence — they are the natural consequence of working in a large system that predates you. A reviewer with system context catches those problems early, before they ship, without the implicit career cost of being the engineer whose code caused a production incident in week three.

Teams that run context-grounded review also tend to see an improvement in their architectural discipline over time. When every PR gets feedback on whether it aligns with existing patterns, those patterns become more visible. Engineers learn the conventions faster. Drift accumulates more slowly. The review process becomes a feedback loop for system health, not just a gate on individual file quality.

The underlying principle

The problem is not that AI reviewers are weak. The leading models are strong enough to catch every category of issue described in this article — if they are given the relevant context. The problem is that the standard integration model for AI code review gives them the diff and nothing else.

A diff is not a change. A diff is the minimum representation of what changed. A change is what that diff does to a system with history, consumers, conventions, and behavioral expectations. Reviewing the first without knowledge of the second produces exactly the results you would expect: strong results on line-level issues, blind spots on everything that crosses service boundaries.

The fix is not a better model. It is a better retrieval layer — one that can answer "what else in this system is affected by this diff?" before the review starts. That is the gap context-grounded review closes.