The On-Call Engineer Is Paged. They've Never Touched That Service.

The alert fires at 2am. The on-call engineer is paged for the data pipeline service. They joined the team three months ago. In those three months, they've worked on the API layer and touched one edge of the notification service. They have never looked at the data pipeline code. The alert says "processing lag above threshold." The runbook says "restart the processor if it hangs." They don't know what "hung" looks like, or whether this is a hang, a backup, or a downstream dependency failure. They wake up the engineer who built it.

Now both engineers are awake at 2am. The engineer who built the service is being pulled into an incident on a system they handed off six months ago. The on-call rotation is functioning correctly: someone is responding. The knowledge system underlying it is not: the person responding lacks the context to respond effectively and has to escalate to the one person who left the rotation precisely so they would stop being paged.

The gap between owning a service and knowing it

On-call rotations assign ownership. They don't create knowledge. Being on-call for a service you've never touched doesn't give you insight into how it behaves, what its failure modes are, or how to distinguish a benign blip from a genuine outage. The rotation distributes the responsibility for responding. It doesn't distribute the knowledge needed to respond well.

Teams that implement broad on-call rotations often discover this gap during incidents. The engineer paged is the right engineer according to the rotation schedule. They're the wrong engineer according to the knowledge required. The incident takes longer, causes more stress, and requires escalation that defeats the purpose of having the rotation in the first place.

What the on-call engineer has vs. what they need at 2am
The on-call engineer at 2am, paged for the data pipeline:

  What they have:
  -> Alert: "data-pipeline processing lag > 10 minutes"
  -> Runbook: "restart the processor if it hangs"
  -> Access to logs (if they know where to look)

  What they need to know:
  -> What does "processing lag" actually mean for this service?
  -> Is 10 minutes lag always bad, or only under certain conditions?
  -> What does "hung" look like vs. "backed up" vs. "dependency failure"?
  -> What are the upstream services that feed this pipeline?
  -> What changed in the last deployment?
  -> Who built this and should be called if it's serious?

  They joined 3 months ago. They've never touched this service.
  They wake up the original engineer. Both are awake at 2am.
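The difference between a hang, a backlog, and a dependency failure can be written down as explicit checks rather than left to intuition. The sketch below is a minimal illustration, not the pipeline's actual logic: the `PipelineSnapshot` fields, the `classify` function, and the order of the rules are all hypothetical, and a real service would need signals and thresholds chosen by the people who know it.

```python
from dataclasses import dataclass
from enum import Enum


class Diagnosis(Enum):
    DEPENDENCY_FAILURE = "dependency failure"
    HUNG = "hung"
    BACKED_UP = "backed up"
    HEALTHY = "healthy"


@dataclass
class PipelineSnapshot:
    # All fields are hypothetical signals, not taken from a real service.
    queue_depth: int            # items waiting to be processed
    processed_last_5m: int      # items completed in the last five minutes
    received_last_5m: int       # items that arrived in the last five minutes
    dependencies_healthy: bool  # result of upstream/downstream health checks


def classify(s: PipelineSnapshot) -> Diagnosis:
    # Check dependencies first: restarting the processor will not fix a broken dependency.
    if not s.dependencies_healthy:
        return Diagnosis.DEPENDENCY_FAILURE
    # Work is waiting but nothing is completing: the processor looks hung.
    if s.queue_depth > 0 and s.processed_last_5m == 0:
        return Diagnosis.HUNG
    # Work is completing, but more slowly than it arrives: a backlog is building.
    if s.queue_depth > 0 and s.processed_last_5m < s.received_last_5m:
        return Diagnosis.BACKED_UP
    return Diagnosis.HEALTHY
```

Under these assumptions, only the `HUNG` case matches the runbook's "restart the processor" instruction. The other two call for different actions, which is exactly what a newcomer cannot infer from the alert alone.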

Runbooks as partial solutions

The canonical response to on-call knowledge gaps is runbooks. Write down the known failure modes, the diagnostic steps, the remediation procedures. Runbooks help. They're not sufficient. They describe the known failure patterns, but novel incidents are by definition not in the runbook. They describe what to do, but not how to tell whether what you're seeing matches the described pattern. And they decay: services change, and runbooks don't always keep pace.

The deeper problem is that runbooks are written by the people who understand the service, and they encode that understanding in a way that assumes the reader already shares some of it. A runbook that says "check the processing queue depth" is only useful if you know where the processing queue is and how to check its depth.

Why runbooks don't close the on-call knowledge gap
Why runbooks don't solve the knowledge problem:
  -> Runbooks describe known failure modes. Novel failures aren't in them.
  -> Runbooks say "restart if hung." They don't say how to tell if it's hung.
  -> Runbooks were written by the person who understood the service.
  -> Runbooks decay. Services change. Runbooks don't always update.
  -> A runbook that says "contact @alice" is a runbook that requires Alice.

  Runbooks are necessary. They are not sufficient for an on-call engineer
  who has never touched the service they're responsible for.
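One way to see how much a runbook assumes is to treat each step as structured data and check which pieces of context it leaves out. The following is a minimal sketch under stated assumptions: the `RunbookStep` fields (`where`, `how`, `expected`, `escalate_to`) and the `audit` helper are hypothetical names invented for illustration, not a standard runbook format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RunbookStep:
    instruction: str       # what to do, e.g. "check the processing queue depth"
    where: str = ""        # where to look: dashboard, host, or file (hypothetical field)
    how: str = ""          # the exact command or query to run (hypothetical field)
    expected: str = ""     # what a normal result looks like (hypothetical field)
    escalate_to: str = ""  # a person or team, as a last resort


def missing_context(step: RunbookStep) -> list[str]:
    """Return the context fields a newcomer would need but the step leaves empty."""
    required = ("where", "how", "expected")
    return [name for name in required if not getattr(step, name).strip()]


def audit(steps: list[RunbookStep]) -> dict[str, list[str]]:
    """Map each incomplete instruction to the context it lacks."""
    report = {}
    for step in steps:
        gaps = missing_context(step)
        if gaps:
            report[step.instruction] = gaps
    return report
```

A step written as only "check the processing queue depth" fails all three checks. A step that fills in only `escalate_to` still fails them, which is the "requires Alice" case in code form.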

Codebase access as an on-call tool

An on-call engineer who can query the codebase while responding to an incident has access to information runbooks don't provide: how the service is structured, what its upstream and downstream dependencies are, what changed in the most recent deployment, and what normal and abnormal behavior look like in the code. This doesn't replace experience with the service, but it provides a navigable starting point for an engineer who's encountering it under pressure for the first time.
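As a concrete illustration of one of these questions, "what changed in the most recent deployment?", here is a minimal sketch that uses plain `git log` rather than any particular product's interface. It assumes deployments are marked in a way git can express as a revision range (the range is left as a parameter) and that the service lives under a directory prefix such as `pipeline/`. Both are assumptions, as are the function names.

```python
from __future__ import annotations

import subprocess

# Marker placed at the start of each commit line so it can be told apart from file paths.
HEADER = "commit\t"


def read_git_log(rev_range: str) -> str:
    """Run git log over a revision range, listing the files each commit touched."""
    result = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:commit%x09%h%x09%s", rev_range],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def changes_touching(log_text: str, path_prefix: str) -> list[tuple[str, str, list[str]]]:
    """Return (short_hash, subject, files) for commits touching files under path_prefix."""
    commits = []
    current = None
    for line in log_text.splitlines():
        if line.startswith(HEADER):
            _, short_hash, subject = line.split("\t", 2)
            current = (short_hash, subject, [])
            commits.append(current)
        elif line.strip() and current is not None:
            # Any other non-empty line is a file path belonging to the current commit.
            current[2].append(line.strip())
    return [
        (h, s, [f for f in files if f.startswith(path_prefix)])
        for h, s, files in commits
        if any(f.startswith(path_prefix) for f in files)
    ]
```

Calling `changes_touching(read_git_log(rev_range), "pipeline/")` with the range between the previous and current deployment would list only the commits that modified the service being paged for. The list tells the engineer where to start reading. It does not tell them which change caused the lag.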

Kognita makes the codebase queryable in plain language during an incident. Asking "What does the data pipeline processor do and what would cause processing lag?" gives the on-call engineer service context before they start investigating blind. The answer doesn't replace the runbook. It gives the engineer enough understanding of the system to use the runbook effectively.

Final take

On-call rotations distribute responsibility. They don't distribute knowledge. The on-call engineer paged for a service they don't know will make slower progress and escalate more than one who has system context. Making the codebase accessible during incidents gives every on-call engineer a starting point beyond "wake up the person who built this."

The on-call rotation says who is responsible. The codebase says what they're responsible for. When the engineer can query the second, the first becomes functional rather than nominal.