AI Agents Don't Work in Story Points — They Work in Token Budgets

The scrum master is staring at a velocity chart that went from 44 points last sprint to 180 this sprint. The team didn't hire anyone. Nothing changed in the process documentation. What changed is that they started running Claude Code agents over the weekend on scaffold-heavy tickets. The number is real — 180 points of work items closed — but it doesn't mean the team delivered four times the value. It means the metric they've been using to plan, forecast, and report progress was designed for humans, and they just handed it to something that isn't one.

Scrum.org put it plainly in 2026: "Story points were not designed for AI agents — they measure human effort, uncertainty, risk, and cognitive load. Teams must transition to measuring agentic capacity: how much parallel compute AI bots can utilize safely, measured against the human team's ability to review and merge generated code." That framing is correct, and most teams haven't internalized it. They are still running sprint planning in story points, still using velocity as a capacity signal, and still wondering why the numbers feel broken.

What story points were actually designed to measure

Story points emerged as a way to capture something that hours couldn't: the uncertainty, risk, and cognitive load that come with unfamiliar technical work. A story that a developer has done before might take two hours. The same story on an unfamiliar codebase, with an unclear API contract and uncertain downstream effects, might take two days. Hours couldn't capture that variance. Story points, relative to a reference story the team knew well, could.

The mechanism works because the thing being measured — human effort with human uncertainty — is reasonably consistent within a stable team. A team's velocity normalizes over time because the humans doing the work have consistent cognitive profiles. They experience fatigue, context-switching costs, learning curves, and review overhead at predictable rates. Story points calibrate to those human characteristics.

AI agents don't have those characteristics. An agent doesn't experience cognitive load about which library to use; it generates code at a rate set by its token budget, regardless of how unfamiliar the pattern is. It doesn't get slower when context-switching between tasks. It doesn't have a learning curve in the traditional sense. The metrics scrum masters rely on are becoming irrelevant precisely because the thing being measured has changed its nature.

Why the velocity number is now fiction

Scrum.org's framing is unambiguous: "If a team's velocity jumps from 50 to 5,000 in one sprint after adopting Copilot or Cursor, they haven't delivered 100x value — they've broken the metric." That jump is not hypothetical. Teams running agents on scaffold-heavy work see velocity numbers that bear no relationship to previous sprints and no relationship to the value actually delivered.

The problem compounds when the inflated velocity number starts being used for forecasting. The engineering manager brings a 180-point velocity to the quarterly planning session. Stakeholders extrapolate. The roadmap is built on a number that reflects how fast agents can generate scaffold code, not how fast the team can deliver working, reviewed, production-ready features. The next sprint hits a batch of coordination-heavy, judgment-intensive work and closes at 38 points. Nobody knows why.

Sprint planning with humans vs. with AI agents — what breaks:

  Planning a human sprint (the old model)
  -> Scrum master pulls historical velocity: team averages 42 points
  -> Product owner presents backlog items
  -> Team estimates in story points: uncertainty, risk, cognitive load
  -> Capacity is 42 points; team commits to 40
  -> The metric is measuring the right thing: human effort and its variance

  Planning a sprint with AI agents (the new problem)
  -> Scrum master pulls velocity: last sprint was 180 points, sprint before was 44
  -> Nobody can explain the jump. Was it 4x the value? No. One sprint had agents
     working all weekend on scaffold work. The metric is now meaningless.
  -> Team estimates new backlog items in story points: but agents don't experience
     cognitive load. A 5-point story that requires nuanced judgment takes a human
     half a day and an agent three hours. A 5-point story that's pure CRUD takes
     a human half a day and an agent eleven minutes.
  -> The same point value maps to wildly different agent execution times
  -> Capacity planning becomes guesswork: how many points can agents complete?
     Nobody knows, because the unit doesn't measure what agents do.

  What the metric was actually designed for
  -> Story points measure human effort: the uncertainty, risk, and cognitive load
     a developer faces when tackling unfamiliar work
  -> Agents don't have cognitive load. They have token budgets and parallel
     compute capacity.
  -> Scrum.org (2026): "Story points were not designed for AI agents."

The engineering manager who reports 180 points and then 38 points in consecutive sprints is not managing a team with volatile productivity. They are managing a team whose measurement system doesn't distinguish between "agent generated scaffold code" and "human solved a hard coordination problem." Those are different kinds of work with different value profiles, and a single story point number collapses them into the same unit.
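
One way to stop the two kinds of work from collapsing into a single number is to report velocity per executor. The sketch below is a minimal illustration, not a prescribed method: the ticket keys, point values, and the `executor` label are all hypothetical, chosen so the total matches the 180-point sprint described above.

```python
from collections import defaultdict

# Hypothetical closed items for one sprint: (ticket, points, executor).
# Keys and point values are illustrative assumptions, not real data.
closed_items = [
    ("SCAFFOLD-1", 60, "agent"),
    ("SCAFFOLD-2", 80, "agent"),
    ("COORD-1", 25, "human"),
    ("COORD-2", 15, "human"),
]


def velocity_by_executor(items):
    """Sum closed story points separately for each executor type."""
    totals = defaultdict(int)
    for _ticket, points, executor in items:
        totals[executor] += points
    return dict(totals)


totals = velocity_by_executor(closed_items)
headline = sum(totals.values())

print(f"headline velocity:   {headline}")
print(f"agent-closed points: {totals['agent']}")
print(f"human-closed points: {totals['human']}")
```

With these assumed numbers the headline is 180, but only 40 of those points came from human work, which is in line with the team's earlier sprints. The split does not make agent-closed points meaningful; it only keeps them from distorting the human figure.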

Token budgets: what agents actually consume

The concept of token budgets as a planning unit is emerging as the replacement for story points in agent-heavy sprints. A token budget is the computational resource an agent consumes to execute a task — the raw material of agentic work, analogous to hours for human work. Scrum.org frames it this way: "Token budget planning is the new capacity planning for AI-augmented teams."

But token budgets have their own problem as a planning metric: they don't translate into business outcomes any more than CPU cycles do. Knowing that an agent consumed 180,000 tokens to implement rate limiting doesn't tell you whether the implementation was correct, whether it matched the ticket scope, or whether the code can be safely merged. Token budgets measure agent activity, not team output.

What agents actually consume — and why it doesn't map to story points:

  Story: "Add rate limiting to the public API endpoints"
  Story point estimate: 5 points (moderate complexity, known pattern)

  What a human does with this story:
  -> Reads existing middleware stack to understand insertion point
  -> Researches rate limiting libraries (time: cognitive load)
  -> Implements, tests, handles edge cases manually
  -> Effort: ~4 hours. Uncertainty was the expensive part.

  What an agent does with this story:
  -> Has no cognitive load about the library choice (picks a known pattern)
  -> Generates the implementation in one pass: ~180,000 tokens
  -> Hits ambiguity: public API includes both authenticated and unauthenticated
     endpoints — the ticket doesn't specify whether rate limits differ by auth state
  -> Agent stalls or makes an assumption. If it assumes wrong, a human has to
     review and correct — that review costs more human time than the original 5 points implied.

  The real capacity constraint
  -> The team's bottleneck isn't how many story points agents can generate
  -> It's how much generated code the human team can review and safely merge
  -> Scrum.org: "Teams must transition to measuring agentic capacity: how much
     parallel compute AI bots can utilize safely, measured against the human
     team's ability to review and merge generated code."
  -> Token budgets tell you what the agent can produce; human review bandwidth
     tells you what the team can actually ship

The real capacity constraint Scrum.org identifies is human review bandwidth — how much generated code the human team can safely evaluate and merge. That's the actual bottleneck in an agent-heavy sprint. An agent can generate code for thirty tickets in a weekend. The team of three engineers can review and merge code for eight tickets before the sprint demo. The planning question is not "how many story points can agents generate?" It is "how much agent output can the team absorb and ship with confidence?"
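
The arithmetic behind that question is simple enough to sketch. The function below is an illustrative assumption rather than anything Scrum.org specifies: it treats shippable work as the smaller of agent output and human review capacity, using the thirty-ticket and eight-ticket figures from the example above.

```python
def shippable_tickets(agent_output, review_capacity):
    """Return (shipped, unreviewed) ticket counts for one sprint.

    Shipped work is capped by whichever is smaller: what the agents
    generated or what the humans can review and merge.
    """
    shipped = min(agent_output, review_capacity)
    unreviewed = agent_output - shipped
    return shipped, unreviewed


# Figures from the example above: agents generate code for 30 tickets,
# the team can review and merge 8 before the sprint demo.
shipped, unreviewed = shippable_tickets(agent_output=30, review_capacity=8)
print(f"shipped: {shipped}, waiting for review: {unreviewed}")
```

Under these assumptions the sprint ships 8 tickets and carries 22 tickets of unreviewed agent output, so adding more agent compute changes the second number and not the first.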

The planning conversation that's missing

Scrum.org describes a new planning discipline that most teams haven't implemented: "During planning, the Product Owner and Developers must slice Product Backlog items into two distinct categories: work suitable for humans and work suitable for AI agents." This slicing is not just about assigning tickets. It requires understanding what the codebase looks like right now — which areas are well-understood and AI-amenable, which require nuanced judgment, which have enough cross-service complexity that an agent will hit ambiguity and produce code that needs significant human review.

That understanding is not available to a scrum master or product owner running planning without system context. The slicing breaks down in the gap between the sprint board and codebase reality: the product owner and scrum master decide which work to route to agents from ticket descriptions, not from what the codebase actually looks like in that area.

A ticket that says "add rate limiting to public API" looks like straightforward AI-amenable work. A scrum master with codebase context might know that the public API layer is currently in the middle of an auth refactor by another squad, making it a poor candidate for agent work this sprint. Without that context, the ticket goes to an agent, the agent generates code against a moving target, and the review burden grows well beyond what the estimate implied.
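
The routing decision in that example can be sketched as a small rule. Everything here is hypothetical: the `Ticket` fields, the ticket keys, the area names, and the idea of an `areas_in_flux` set are assumptions made for illustration, and a real team would need its own source for that codebase context.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    key: str
    title: str
    area: str             # codebase area the ticket touches
    needs_judgment: bool  # True for coordination- or judgment-heavy work


def route(ticket, areas_in_flux):
    """Return 'human' or 'agent' for a ticket, using codebase context."""
    if ticket.area in areas_in_flux:
        # An agent here would generate code against a moving target.
        return "human"
    if ticket.needs_judgment:
        return "human"
    return "agent"


# Hypothetical context: another squad is refactoring auth in the public API.
areas_in_flux = {"public-api"}

backlog = [
    Ticket("PROD-301", "Add rate limiting to public API", "public-api", False),
    Ticket("PROD-302", "Scaffold CRUD endpoints for reports", "reports", False),
]

for ticket in backlog:
    print(ticket.key, "->", route(ticket, areas_in_flux))
```

The rate limiting ticket looks agent-friendly on its description alone; it is the `areas_in_flux` check, which depends on codebase knowledge, that sends it to a human.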

What the retrospective needs to answer — and can't

The sprint retrospective for an agent-heavy sprint has a new set of questions that the old artifacts don't answer. Did the agents build what the tickets said they would build? Did any agents expand scope beyond the ticket boundary? Which stories that closed in Jira actually have corresponding code changes in the codebase? Which code changes happened that no ticket accounts for?

These are not philosophical questions. They are audit questions, and the answers matter for the next sprint's planning. If an agent completed PROD-204 in Jira but the codebase shows the implementation is missing a key component, that's a story that should not have been closed — and it will create a production incident or a sprint collision when the gap is discovered later. Agent output is moving faster than the sprint board can track, and the retrospective is where that gap needs to be surfaced.
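
A rough version of two of those audit questions can be sketched by cross-referencing closed tickets against commit messages. This is a simplified illustration under stated assumptions: it assumes commits mention ticket keys in their messages, the ticket list and commit messages are made up, and it says nothing about how any particular tool performs this check.

```python
import re

# Matches Jira-style keys such as PROD-204.
TICKET_KEY = re.compile(r"\b[A-Z]+-\d+\b")


def reconcile(done_tickets, commit_messages):
    """Cross-check closed tickets against commit messages.

    Returns (closed_without_code, untracked_commits):
    tickets marked done that no commit references, and commits
    that reference no ticket at all.
    """
    referenced = set()
    untracked_commits = []
    for message in commit_messages:
        keys = set(TICKET_KEY.findall(message))
        if keys:
            referenced |= keys
        else:
            untracked_commits.append(message)
    closed_without_code = sorted(set(done_tickets) - referenced)
    return closed_without_code, untracked_commits


# Hypothetical sprint data for illustration only.
done = ["PROD-198", "PROD-201", "PROD-207"]
commits = [
    "PROD-201 add retry logic to payment processing",
    "PROD-198 update auth token expiry",
    "add SessionCleanupJob.ts",
]

closed_without_code, untracked_commits = reconcile(done, commits)
print("closed without code:", closed_without_code)
print("commits with no ticket:", untracked_commits)
```

A check like this only shows that a ticket was mentioned, not that its implementation is complete, so it would not catch a partially built story such as the PROD-204 case.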

What Kognita gives the scrum master and product owner

Neither story points nor token counts give a scrum master or product owner what they actually need to understand an agent-heavy sprint. Story points tell you the estimate; token counts tell you the compute. Neither tells you what changed in the system, whether the changes matched the plan, or what the codebase looks like now versus before the sprint started.

Kognita connects the codebase and Jira state in plain language, giving scrum masters and product owners the system-level view that neither artifact provides on its own. The queries are not technical — they are the questions that every retrospective and planning session should be able to answer but currently can't without pulling an engineer into the conversation.

What Kognita can show a scrum master or product owner after an agent-heavy sprint:

  "What services changed this sprint?"
  -> PaymentService: 4 files modified, 1 new endpoint added (POST /payments/retry)
  -> AuthService: token refresh logic updated across 3 files
  -> NotificationService: no changes
  -> UserService: 2 files touched in PROD-201 scope

  "Did the changes match what was planned?"
  -> PROD-201 (Add retry logic to payment processing): scope matches.
     Changes confined to PaymentService and PaymentRepository.
  -> PROD-198 (Update auth token expiry): scope expanded. AuthService changes
     also touched SessionService (not in original ticket scope). New file:
     SessionCleanupJob.ts — no linked ticket.
  -> PROD-204 (Add push notifications): partially complete. FCM dispatch added.
     DeviceTokenRegistry not found — may be blocked or incomplete.

  "Which tickets moved but have no corresponding code changes?"
  -> PROD-207 moved to Done. No files modified in the last 14 days
     reference PROD-207 or the feature it describes (UserPreferencesService).
     Possible: work was done in a different ticket. Possible: ticket was
     closed without implementation.

  "What's the complexity picture of what shipped?"
  -> 3 services touched, 2 new files created, 1 scope expansion outside
     ticket boundaries, 1 potentially incomplete story

The value of that system picture is not just retrospective. It feeds directly into the next planning session — the scrum master knows which services are actively in flux, the product owner knows which tickets have partial implementations that will create scope collisions, and the decision about which work to route to agents versus humans is grounded in actual codebase state rather than ticket descriptions and memory.

Final take

Story points were a reasonable proxy for human effort in a world where human effort was the only kind of effort a sprint contained. That world is gone for teams running AI agents. The metric didn't evolve because nobody told the scrum master to change the unit — not engineering, not the retrospective artifacts, and certainly not the story point fields in Jira.

The right question for an agent-heavy sprint is not "how many points did we close?" It is "what did the codebase look like before, what does it look like now, and did the sprint deliver what was planned?" Story points can't answer that. Token counts can't answer that. System context can.