
Incident Response Is Slow Because the Context Lives in Nobody's Head Anymore


2:17am. PagerDuty fires. The on-call engineer is paged for the payment processing service. They open their laptop and check the dashboard: elevated error rate on POST /v2/charges. They pull up the service, which they have never worked in directly. The senior engineer who built most of it is on vacation. They open Cursor for help, but Cursor has no context about this service. They start reading code. At 2:30am they still do not know what the charge handler calls downstream.

This is not an unusual story. It is the default incident experience on most engineering teams in 2026. The tooling has improved enormously. The context problem has not.

The modern incident anatomy

Most incidents are not caused by the component that alerts. They are caused by a change upstream, a dependency that degraded, a database query that started timing out, a rate limit that was hit on a third-party API. The alerting component is the symptom. The cause is somewhere else.

The on-call engineer sees POST /v2/charges failing. The actual problem is two hops further down the call chain: AccountLimitService is timing out because a deploy yesterday introduced a slow query on the account_limits table. ChargeHandler calls FraudCheckService, which calls AccountLimitService. The on-call engineer does not know that chain exists until they trace it manually.
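The symptom-versus-cause pattern can be sketched in a few lines of Python. This is a minimal illustration, not a real tracing tool: the call chain comes from the example above, while the latency figures, the threshold, and the function name `deepest_degraded` are assumptions invented for the sketch.

```python
# Call graph from the example: caller -> services it calls directly.
CALLS = {
    "ChargeHandler": ["FraudCheckService"],
    "FraudCheckService": ["AccountLimitService"],
    "AccountLimitService": [],
}

# Illustrative p99 latencies in milliseconds (made-up numbers).
P99_MS = {
    "ChargeHandler": 4200,
    "FraudCheckService": 4100,
    "AccountLimitService": 4000,
}

THRESHOLD_MS = 1000  # assumed cutoff for "degraded"


def deepest_degraded(service, calls, p99_ms, threshold_ms):
    """Follow degraded callees downward and return the last degraded service.

    Returns None if the starting service is healthy. Assumes the call
    graph has no cycles.
    """
    if p99_ms.get(service, 0) <= threshold_ms:
        return None
    for callee in calls.get(service, []):
        found = deepest_degraded(callee, calls, p99_ms, threshold_ms)
        if found is not None:
            return found
    return service


print(deepest_degraded("ChargeHandler", CALLS, P99_MS, THRESHOLD_MS))
# AccountLimitService
```

The alert fires on ChargeHandler, but the walk ends at AccountLimitService, the deepest service that is still degraded. The walk is only possible if the call graph is known in advance, which is exactly what the on-call engineer lacks.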

Understanding the actual cause requires two types of context: system topology (what calls what, what depends on what, what would fail if X degraded) and change context (what changed recently that could explain this behavior). Both are hard to reconstruct at 2am with no documentation and no senior engineer awake to answer Slack messages.

Why incident response is a context problem

The common explanation for slow incident response is that root cause analysis is hard. That is sometimes true. But in practice, most of the time in a typical incident is not spent on hard analysis. It is spent on orientation: figuring out the topology of the system, understanding what could be upstream of the failure, identifying what changed recently.

System topology is the map: which services call which other services, what the dependency graph looks like, where shared databases live, which queues connect which workflows. On most teams this map exists in engineers' heads, in occasional architecture diagrams that are months out of date, and implicitly in the code, which the on-call engineer has to read at 2am to reconstruct the map.
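As a rough sketch of what a queryable map could look like, here is the topology as a plain Python dictionary with a breadth-first traversal. Only the ChargeHandler chain comes from this article; the `DEPENDS_ON` structure and the `downstream` helper are hypothetical, and a real map would be generated from code or traces rather than written by hand.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON = {
    "ChargeHandler": ["FraudCheckService"],
    "FraudCheckService": ["AccountLimitService"],
    "AccountLimitService": [],
}


def downstream(service, depends_on):
    """Return every service reachable from `service`, in breadth-first order."""
    seen = []
    queue = deque(depends_on.get(service, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # guards against cycles and shared dependencies
        seen.append(current)
        queue.extend(depends_on.get(current, []))
    return seen


print(downstream("ChargeHandler", DEPENDS_ON))
# ['FraudCheckService', 'AccountLimitService']
```

The traversal itself is trivial. The expensive part in practice is having an accurate `DEPENDS_ON` to traverse.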

Change context is the timeline: what deployed in the last week, which services had migrations, which library versions were bumped. Teams with good deployment pipelines can reconstruct this from CI/CD history. But when the relevant change landed in a service several hops away from the one that alerted, teams often miss it until much later.
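One way to picture the change-context query is a filter over deploy history that covers the whole dependency chain rather than only the alerting service. The sketch below is illustrative only: the `DEPLOYS` records, their dates, and the `recent_changes` helper are invented, and real data would come from CI/CD history.

```python
from datetime import date, timedelta

# Hypothetical deploy log (dates and summaries are made up).
DEPLOYS = [
    {"service": "AccountLimitService", "day": date(2026, 5, 12),
     "summary": "query change on account_limits"},
    {"service": "ChargeHandler", "day": date(2026, 4, 2),
     "summary": "logging cleanup"},
    {"service": "SearchService", "day": date(2026, 5, 10),
     "summary": "index rebuild"},
]


def recent_changes(deploys, services, today, days=14):
    """Deploys to any of `services` within the last `days` days, newest first."""
    cutoff = today - timedelta(days=days)
    hits = [d for d in deploys if d["service"] in services and d["day"] >= cutoff]
    return sorted(hits, key=lambda d: d["day"], reverse=True)


# Query the full chain, not just the service that alerted.
chain = {"ChargeHandler", "FraudCheckService", "AccountLimitService"}
for d in recent_changes(DEPLOYS, chain, today=date(2026, 5, 15)):
    print(d["day"], d["service"], d["summary"])
# 2026-05-12 AccountLimitService query change on account_limits
```

Filtering on ChargeHandler alone would return nothing for this window, which is how a change elsewhere in the chain gets missed.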

Strip those two types of context from an incident and almost any team looks slow. Give an engineer instant answers to both, and the path to root cause shortens dramatically.

How AI tools help and where they stop

Claude Code and Cursor can read code and answer questions about it. They are genuinely useful for understanding what a specific function does, for generating a theory about why a particular code path would fail, for explaining error messages. This helps.

But they need to be pointed at the right code first. The on-call engineer needs to know: which service owns this behavior? Which services does it call? What changed in this area recently? Those questions require system-wide context that an AI session without grounding cannot answer.

An ungrounded Claude session asked "what does the payment service call when a charge fails?" will speculate based on general patterns of how payment services tend to work. It will not give a specific, accurate answer about the actual architecture of this system. The on-call engineer still has to read the code to find out.

What the on-call engineer needs vs. what they typically have access to

What the on-call engineer needs at 2:17am:
  - Which services does POST /v2/charges touch?
  - What is the downstream dependency chain for ChargeHandler?
  - What changed in the payments area in the last 2 weeks?
  - Which services would degrade if FraudCheckService slows down?
  - Is there a known incident pattern for this error type?

What they typically have access to:
  - Datadog showing elevated error rate
  - Service logs with no upstream context
  - A README last updated 14 months ago
  - Cursor — which has no context about this service

The gap between those two lists is not a tooling gap. It is a context gap. The tools exist. What is missing is the system knowledge that would let those tools answer operational questions accurately.
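To make one of those questions concrete, "which services would degrade if FraudCheckService slows down?" is a reverse traversal of the dependency map. The following is a sketch under assumptions: the map is hand-written, `RefundHandler` is invented so that there is more than one caller, and `affected_by` is a hypothetical helper name.

```python
# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON = {
    "ChargeHandler": ["FraudCheckService"],
    "FraudCheckService": ["AccountLimitService"],
    "AccountLimitService": [],
    "RefundHandler": ["FraudCheckService"],  # invented for illustration
}


def affected_by(service, depends_on):
    """Return the set of services that call `service` directly or indirectly."""
    # Invert the map: callee -> set of direct callers.
    callers = {}
    for caller, callees in depends_on.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    affected, stack = set(), [service]
    while stack:
        for caller in callers.get(stack.pop(), ()):
            if caller not in affected:
                affected.add(caller)
                stack.append(caller)
    return affected


print(sorted(affected_by("FraudCheckService", DEPENDS_ON)))
# ['ChargeHandler', 'RefundHandler']
```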

The knowledge rotation problem

On-call rotates. The engineer paged at 2am on a Tuesday may have never worked directly in the payment service. On a team of eight, each engineer is expected to handle incidents across the full system. That expectation is only realistic if system context is accessible to everyone — not just the two engineers who originally built the service.

In practice, the engineer who built a service carries an enormous amount of operational knowledge that is never written down: which edge cases produce which errors, what the retry assumptions are, which downstream services are most likely to be the actual source of a given failure mode. When that engineer is on vacation or has left the company, the team loses that knowledge.

This is the same knowledge concentration problem that appears in bus factor discussions — but incidents make it acutely painful because the need for that knowledge is immediate and high-stakes. A 2am page is not the right moment to read three interconnected services to understand the dependency chain from scratch.

The time taxonomy of incident response

Most incident post-mortems describe the timeline without analyzing where the time went. Here is where it actually goes.

Time breakdown — context discovery vs. actual root cause work in a typical incident

Typical 36-minute incident timeline:

2:17am  Alert fires
2:18am  Engineer acknowledges, opens dashboard
2:22am  Confirms elevated error rate on POST /v2/charges
2:28am  Starts reading PaymentService code to understand what it calls
2:35am  Finds ChargeHandler calls FraudCheckService — checks its logs
2:40am  Realizes FraudCheckService calls AccountLimitService — didn't know that
2:46am  Checks AccountLimitService — finds elevated latency there
2:51am  Identifies slow query introduced in last deployment
2:53am  Mitigation applied, incident resolved

Time on context discovery (2:22 - 2:46): 24 minutes
Time on actual root cause isolation (2:46 - 2:51): 5 minutes

24 minutes on context discovery. 5 minutes on root cause isolation. That ratio is typical of incidents involving services the on-call engineer does not know intimately. The analysis itself was quick. The slow part was not skilled work. It was archaeology: reading code to reconstruct a dependency chain that should have been queryable in seconds.
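The arithmetic behind those figures is easy to check. The snippet below recomputes the durations from the timestamps in the timeline; the dictionary keys and the `minutes_between` helper are names chosen for this sketch.

```python
from datetime import datetime

# Timestamps copied from the timeline above.
TIMELINE = {
    "alert": "2:17",
    "error_rate_confirmed": "2:22",
    "latency_found": "2:46",
    "slow_query_identified": "2:51",
    "resolved": "2:53",
}


def minutes_between(start, end):
    """Whole minutes between two H:MM timestamps on the same night."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


total = minutes_between(TIMELINE["alert"], TIMELINE["resolved"])
discovery = minutes_between(TIMELINE["error_rate_confirmed"], TIMELINE["latency_found"])
isolation = minutes_between(TIMELINE["latency_found"], TIMELINE["slow_query_identified"])

print(total, discovery, isolation)     # 36 24 5
print(round(100 * discovery / total))  # 67
```

Roughly two thirds of the time between alert and resolution went to context discovery.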

The four questions that consume most of that time — what does this service call, what does it depend on, what changed recently, what else could be affected — are context questions, not analytical ones. The answers exist in the codebase. Getting to them at 2am requires either knowing the system personally or reading it from scratch.

The topology question is the most expensive

Understanding the downstream dependency chain is the step that takes longest because it requires reading multiple services, not just one. The on-call engineer checks PaymentService, finds it calls FraudCheckService, checks FraudCheckService, finds it calls AccountLimitService, checks AccountLimitService, finds the slow query. Each hop requires opening a new service, finding the relevant code, and understanding how the dependency is wired. In a modern microservices architecture with 30 services, this process is genuinely difficult under time pressure.

This is the topology problem. The on-call engineer is not slow for lack of skill. They are slow because the information they need is distributed across services they have never opened, in code that was written by someone else, with no single place to ask "how does this all connect?"

What full-context incident response looks like

The same incident looks different with a managed codebase index available. The engineer gets paged for elevated errors on POST /v2/charges. Instead of opening the service code cold, they query: "what is the charge handler's downstream dependency chain?" They get an accurate answer in seconds: ChargeHandler calls FraudCheckService synchronously, which calls AccountLimitService synchronously, which has a known query hotspot on the account_limits table for large accounts.

They immediately check AccountLimitService. They ask: "what changed in the billing area in the last two weeks?" They see three recent deploys. The one from yesterday includes a query change on exactly the table mentioned. They have root cause in minutes, not after half an hour of code reading.

The difference is not the engineer's skill. It is whether the context they need is accessible without code archaeology.

Kognita for incident response

Kognita's managed codebase index means any on-call engineer can query the system about service dependencies, recent changes, and behavioral architecture. The index is maintained automatically — when services change, the dependency graph updates. When deploys happen, the change timeline stays current. The on-call engineer does not need to know the service personally to get accurate answers about how it fits into the system.

On-call query examples — asking the system instead of reading the code

On-call query examples with managed context:

Query: "What does the payment service call when a charge fails?"
Answer: ChargeHandler → FraudCheckService (sync, p99 500ms on large accounts)
        → AccountLimitService (sync) → NotificationService (async)
        Fallback: RetryScheduler queues for async recovery

Query: "What changed in the billing area in the last 2 weeks?"
Answer: 3 changes in AccountLimitService:
        - Query optimization in limit_check_by_account (deployed May 12)
        - New index on account_limits table (deployed May 13)
        - Rate limit threshold update for enterprise tier (deployed May 14)

Time from query to answer: seconds, not minutes of code reading

The queries above are not hypothetical. They are the exact questions an on-call engineer needs answered when they are paged for a service they do not own. With a managed context layer, those questions take seconds. Without one, they take 20-30 minutes of code reading under time pressure at 2am.

The Jira MCP integration matters here too. Changes deployed in the last two weeks that are connected to Jira tickets carry intent context — not just "a query changed" but "a query changed as part of the AccountLimitService performance work in sprint 47." That helps on-call engineers identify the likely culprit faster, because they can see not just what changed but why.
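Conceptually, attaching intent to a change is a join between deploy records and tickets. The sketch below shows only that idea with in-memory dictionaries. It does not use or represent any Jira or Kognita API, and the ticket key `PAY-101`, the record fields, and the `with_intent` helper are all invented for illustration.

```python
# Hypothetical records; ticket keys and fields are invented for illustration.
DEPLOYS = [
    {"service": "AccountLimitService",
     "change": "query change in limit_check_by_account",
     "ticket": "PAY-101"},
    {"service": "AccountLimitService",
     "change": "rate limit threshold update",
     "ticket": None},
]
TICKETS = {
    "PAY-101": {"summary": "AccountLimitService performance work", "sprint": 47},
}


def with_intent(deploys, tickets):
    """Attach the linked ticket's summary and sprint to each deploy, if any."""
    annotated = []
    for deploy in deploys:
        ticket = tickets.get(deploy["ticket"])
        if ticket:
            why = f'{ticket["summary"]} (sprint {ticket["sprint"]})'
        else:
            why = "no linked ticket"
        annotated.append({**deploy, "why": why})
    return annotated


for row in with_intent(DEPLOYS, TICKETS):
    print(f'{row["service"]}: {row["change"]} -> {row["why"]}')
```

A change with a stated purpose is easier to rank as a suspect than a bare diff, and a change with no linked ticket is itself a signal worth a second look.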

Zero local setup for every engineer

Because Kognita is managed infrastructure rather than a local tool, the on-call engineer does not need to have set up a personal MCP server or configured a local index. The context is available from the first moment they need it — through a dashboard query or a Jira-integrated agent — without any prerequisite setup. This matters specifically for on-call because the engineer reaching for context at 2am is often not the engineer who configured the tooling.

Final take

Incident response is not slow because engineers are slow. It is slow because the context they need to understand the system is in someone else's head, buried in a README that is 14 months stale, or recoverable only by reading three interconnected services in sequence at 2am to reconstruct what should be a simple dependency map.

The hard part of incident response (forming a root cause hypothesis, deciding on a mitigation, reasoning about blast radius) takes minutes when the engineer has the right context. What takes the rest of the time is getting to that context. That is the fixable part.

A managed context layer makes topology queryable, changes searchable, and dependencies visible to every engineer on rotation — not just the ones who happened to build the alerting service. The result is not just faster incidents. It is less anxiety about being paged for a service you have never worked in, because the system is no longer a black box when you need it most.