Story Point Estimation Broke When Your Team Started Using AI. Here's What to Do About It.
The sprint is at 150 points. The team is five people. Six months ago they averaged 40. The velocity chart looks like it has been hacked. The scrum master has no idea how to explain this to leadership, no idea how to set next sprint's capacity, and no idea whether the team is genuinely delivering more value or just moving faster through the wrong things. The metric that everything was calibrated against — velocity — is now useless, and nobody has a replacement that leadership will accept.
This is not a hypothetical. Teams adopting Cursor, Claude Code, and GitHub Copilot are seeing velocity numbers that bear no relationship to previous sprints. The estimation model broke the moment AI arrived, and most teams are still trying to force the old system to work on the new reality. It does not work. The fix is not re-calibrating story points. The fix requires understanding why estimation was failing before AI arrived and why AI made the failure mode larger.
What happened to velocity
Story points were always a proxy for effort, and effort was always a proxy for time, and time was the only thing that actually mattered for sprint capacity planning. The proxy chain worked reasonably well when the team's effort-to-time conversion was consistent. Three days of work was three days of work. The team's historical velocity gave you a reliable signal for next sprint's capacity.
AI tools broke the proxy chain by collapsing certain categories of effort dramatically. A task that required three days of implementation work (understanding the codebase, writing boilerplate, scaffolding the new service, writing the tests) now takes four hours with Cursor generating the scaffold. The team calibrated that at 3 story points because of the pre-AI effort. Now it is 3 story points and four hours. Velocity per sprint doubles, then triples, then a small team posts 150 points.
The problem compounds because AI acceleration is not uniform across work types. Greenfield scaffolding, CRUD operations, test generation, and boilerplate are highly AI-amenable — AI cuts the time by 70-80%. Cross-service coordination, database schema migrations, async workflow debugging, and architectural decisions are not AI-amenable in the same way — AI helps at the margins, but the coordination overhead is fundamentally human. A sprint that mixes both work types sees wildly inconsistent point-to-time ratios, making any velocity number meaningless as a capacity signal.
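To make the inconsistency concrete, here is a minimal sketch of the arithmetic. It assumes a 75% time reduction for AI-amenable work (the midpoint of the 70-80% range above) and, as a simplification, no reduction at all for coordination work. The `Story` type, the story names, and the hour figures are hypothetical.

```python
from dataclasses import dataclass

# Assumption: midpoint of the 70-80% reduction cited above.
AI_TIME_REDUCTION = 0.75


@dataclass
class Story:
    name: str
    points: int
    pre_ai_hours: float  # hypothetical pre-AI effort, counted in 8-hour days
    ai_amenable: bool


def hours_with_ai(story: Story) -> float:
    """Estimated hours after AI adoption, under the assumptions above."""
    if story.ai_amenable:
        return story.pre_ai_hours * (1 - AI_TIME_REDUCTION)
    # Simplification: coordination-heavy work gets no speed-up in this sketch.
    return story.pre_ai_hours


def hours_per_point(story: Story) -> float:
    return hours_with_ai(story) / story.points


stories = [
    Story("Scaffold new service", points=3, pre_ai_hours=24.0, ai_amenable=True),
    Story("Cross-service schema migration", points=3, pre_ai_hours=24.0, ai_amenable=False),
]

for s in stories:
    print(f"{s.name}: {hours_per_point(s):.1f} hours per point")
```

Both stories carry 3 points, yet one costs four times as many hours per point as the other, so a sprint total expressed in points says little about how many hours the sprint actually contains.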
The three estimation approaches teams try — and why they break
When the velocity numbers stop making sense, teams try one of three things. None of them solve the underlying problem, but they fail in different ways.
Three estimation approaches teams try after AI adoption — and what breaks:
Approach 1: Keep the old point scale
What happens: a 3-point story that used to take 3 days now takes 4 hours.
Velocity doubles. Teams hit 80 points in sprints that were averaging 40.
Leadership thinks productivity doubled. Stakeholders recalibrate expectations
upward. The next sprint doesn't hit 80. Nobody knows why.
What breaks: the measurement is now meaningless. Velocity tells you nothing
about capacity. Forecasts are fiction.
Approach 2: Rescale to AI-assisted effort
What happens: the team recalibrates. 3 points now represents AI-assisted
effort. But some work is AI-heavy (boilerplate, scaffolding, CRUD) and some
isn't (cross-service coordination, schema migrations, debugging async
workflows). The variance within a "3-point" story is enormous.
What breaks: estimation becomes inconsistent. A 3-point ticket might be
45 minutes or it might be 2 days, depending on how AI-amenable the work is.
Burndown charts become noise.
Approach 3: Drop story points entirely
What happens: the team moves to flow metrics such as cycle time and throughput. Reasonable
in theory. In practice, the scrum master loses visibility into sprint load.
Planning becomes harder to communicate to stakeholders. Leadership asks for
a number. "We don't do story points anymore" lands badly in every
quarterly planning session.
What breaks: organizational visibility collapses. The team plans better but
communicates worse.
The team that keeps the old scale ends up with velocity numbers that stakeholders use to set impossible expectations. The team that rescales loses the ability to compare stories against each other because the variance within a point value is too large. The team that drops story points entirely solves the internal planning problem and creates a new organizational communication problem.
All three approaches share the same flaw: they treat estimation as a measurement calibration problem. The real problem is not how you measure effort. The real problem is that you cannot accurately estimate what you cannot clearly see.
The deeper problem: you cannot estimate what you cannot see
Even if you perfectly calibrate story points to AI-assisted effort, estimation still requires knowing what you are building against: the system itself. That means what already exists, what would need to change, what the dependencies are, and what another engineer is currently modifying. AI tools accelerate execution dramatically. They do not improve your estimate of scope at all. If anything, they widen the gap between planning-time scope understanding and implementation-time reality: because delivery moves faster now, there is less time between the commitment and the moment the scope surprise becomes visible.
Scope estimation has always been the harder problem. Teams have always known that "how long will it take" is less about raw effort and more about "how much is there to do, and how much of it did we undercount in planning." AI raised the stakes by removing the implementation time buffer that used to partially absorb scope surprises. When a 3-point story took three days, a scope expansion discovered on day two still left one day to adapt. When a 3-point story takes four hours, a scope expansion discovered at hour three leaves almost no room to adapt: the story was committed against the wrong scope, and the sprint plan built on it is already off.
Where estimates actually fail
The pattern is consistent across teams: a story gets estimated at 3 points based on what the engineer thinks is in scope. Mid-sprint, someone opens a file and discovers the feature touches three services nobody mapped in planning. The 3-point story becomes a 15-pointer. The sprint is overcommitted. Stories carry over. The burndown chart looks like a cliff.
The scope expansion was knowable before the sprint started. The three services were always there. The dependency existed in the code. The interface that needed to change was documented in the service contract. Nobody checked because checking requires someone to look at the current state of the system, and that requires either reading the code or asking an engineer to read it — neither of which happens in a planning session run by a scrum master or product owner.
Here is a concrete example. The story is "Add push notifications for order status updates." The engineer estimates 3 points because a notification service exists and integration seems straightforward. What nobody checks during planning: the notification service handles email only, push notification infrastructure was never built, the order service does not emit status change events, and there is no device token registry. The actual scope includes four separate pieces of infrastructure work. The estimate is wrong before the sprint starts. AI did not cause this estimate to be wrong. But AI accelerated execution enough that the gap between the wrong estimate and the reality becomes visible by day two instead of day five — and by then, the sprint is already off the rails.
What estimation actually needs to work
Accurate estimation requires accurate system state during planning, not just a calibrated effort measurement. The questions that determine whether an estimate holds are: What does the relevant part of the codebase currently look like? What would actually need to change to deliver this story? Does the infrastructure this story assumes already exist? Who else is touching this area this sprint?
These are not difficult questions to answer. The answers exist in the codebase. They are specific, current, and would materially change planning decisions if they were available before commitments were made. The problem is that nobody in the planning room can get to them quickly enough to use in real time. The scrum master cannot read the code. The product owner cannot run a dependency query. The engineering lead's mental model of the system is accurate for areas they touched recently and progressively unreliable for anything they have not looked at in the last few weeks.
The result: estimation runs on memory and gut feeling, and scope surprises happen after commitments are made, when the cost to handle them is highest. AI made teams faster at implementing what they planned. It did not make planning itself more accurate. That is the gap.
Connecting sprint planning to live codebase context
This is where Kognita's Jira MCP integration changes the estimation dynamic. The connection is not about making AI do the estimating — it is about ensuring that planning decisions are based on what the system actually looks like today, not what someone remembers it looking like last sprint.
Before a story gets estimated, a product owner or scrum master can ask: does a notification service exist and what does it currently handle? What services does the checkout flow touch? Is there existing infrastructure for device token management? What has changed in the payment module since the last sprint touched it? These questions get answered from live codebase state, not from engineer memory, before the commitment is made.
Estimation question — without and with system context:
Story: "Add push notifications for order status updates"
Estimate in planning: 3 points
Without system context:
Engineer assumption: notification service exists (it was mentioned in
the Q3 roadmap six months ago). Integration should be straightforward.
Reality discovered day two: NotificationService exists but only handles
email. Push notifications require FCM/APNs device token management.
DeviceTokenRegistry was never built. OrderService doesn't emit status
events — it updates a database field that nothing subscribes to.
Actual scope: build DeviceTokenRegistry, add event emission to
OrderService, wire OrderEventConsumer to NotificationService,
implement FCM/APNs dispatch.
Actual effort: 13 points
Cost: sprint overcommitted, story carries over, two downstream stories
blocked waiting for notification infrastructure
With Kognita during planning:
PO asks: "Does a notification service exist? Does it handle push?"
Kognita returns: NotificationService handles email via SendGrid. No push
implementation. No DeviceTokenRegistry. OrderService writes status
to orders.status_code — no events published on status change.
Last touched: sprint -4, PLAT-382.
Engineer in planning: "This is bigger than 3 points. We need to scope
this properly — it's infrastructure work, not feature work."
Committed: 13 points, sequenced across two sprints. No mid-sprint surprise.
The difference is not about AI doing estimation. It is about the estimation conversation starting from an accurate system picture. When the product owner knows before planning that push notification infrastructure does not exist, the engineer estimates against reality, not against an assumed baseline. The 13-point story gets committed as 13 points and sequenced correctly. No mid-sprint discovery. No carry-over. No post-mortem about why the sprint failed.
The Jira side matters too. Open tickets modifying the same service as a planned story represent sprint collisions — two engineers modifying the same component from different directions. These collisions are knowable before planning if anyone looks across both Jira and the codebase simultaneously. That visibility is exactly what Kognita provides: current codebase state connected to in-progress Jira work, available to anyone in the planning session without requiring them to open an IDE or ask an engineer to look something up.
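The collision check described above can be sketched as a simple overlap test. This is not Kognita's implementation or API. It assumes you already have, from some source, a mapping of ticket keys to the components each ticket touches. The function name, ticket keys, and component lists below are hypothetical.

```python
from collections import defaultdict


def find_collisions(planned, in_progress):
    """Return {component: {"planned": [...], "in_progress": [...]}} for every
    component touched by both a planned story and an in-progress ticket."""
    # Index in-progress tickets by the components they modify.
    touched_by = defaultdict(list)
    for key, components in in_progress.items():
        for component in components:
            touched_by[component].append(key)

    collisions = {}
    for key, components in planned.items():
        for component in components:
            if component in touched_by:
                entry = collisions.setdefault(
                    component,
                    {"planned": [], "in_progress": touched_by[component]},
                )
                entry["planned"].append(key)
    return collisions


# Hypothetical data: ticket key -> components it modifies.
planned = {
    "SHOP-201": ["NotificationService", "OrderService"],
    "SHOP-202": ["CheckoutFlow"],
}
in_progress = {
    "SHOP-180": ["OrderService"],
    "SHOP-185": ["PaymentModule"],
}

for component, tickets in find_collisions(planned, in_progress).items():
    print(component, tickets)
```

Here the only collision is on `OrderService`, which a planned story and an open ticket would both modify. The hard part in practice is producing an accurate ticket-to-component mapping, which this sketch takes as given.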
What to actually do about velocity
For teams that need practical guidance on the velocity problem right now, the answer is not to find the perfect unit of measurement. It is to separate the two distinct things velocity was trying to measure and track them independently.
The first thing is capacity: how much can the team deliver in a sprint? This requires normalizing for AI adoption level. A team that uses AI heavily on implementation-heavy stories has a different capacity profile than a team doing infrastructure and coordination work all sprint. Tracking velocity separately for AI-amenable work versus coordination-heavy work gives you a more honest picture of what a sprint can hold.
The second thing is scope accuracy: are we committing to stories with a correct understanding of what they involve? This is where most teams have the largest gap, and it is not a measurement calibration problem. It is an information problem. Tracking mid-sprint scope changes per sprint — how often did a committed story turn out to be larger than planned? — gives you a direct signal of how well planning-time system knowledge maps to implementation-time reality.
What to track instead — practical velocity recovery:
Old metric: story points per sprint (velocity)
Problem: calibration is broken; comparison to pre-AI baseline is meaningless
Replacement metrics:
1. Expected scope vs. actual scope ratio
Track: did the story deliver what planning said it would?
Target: ratio approaches 1.0 over time
Signal: ratio below 0.7 consistently = planning is regularly underscoping
Useful because: separates the effort question from the scope question
2. Mid-sprint scope changes per sprint
Track: how many stories had scope added after sprint start?
Target: fewer than 2 per sprint
Signal: high count = system knowledge gap at planning time
Useful because: directly measures the cost of planning on memory vs. system truth
3. AI-normalized velocity bands
Instead of a single velocity number, track three bands:
-> AI-amenable work (scaffolding, CRUD, test generation): calibrate separately
-> Coordination work (cross-service, shared schema): calibrate separately
-> Investigation/debugging: estimate in time, not points
Useful because: separates "AI made this faster" from "AI doesn't help here"
and gives planning a realistic capacity picture for mixed sprints
The goal is not to restore the old velocity number. The old velocity number was a proxy that worked in a specific context. That context changed. The goal is to build a planning process that gives the team accurate capacity signals and reduces scope surprises. That requires both a recalibrated measurement approach and an information foundation that makes scope estimates accurate in the first place.
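The three replacement metrics can be computed from a small per-story record. The sketch below is one possible bookkeeping, not a prescribed method. It assumes scope is expressed in points re-estimated when the story closes, and that each story is tagged with a band by hand. The `StoryOutcome` type, the field names, and the sample sprint are hypothetical, apart from the 3-versus-13-point push notification story used earlier.

```python
from dataclasses import dataclass

# Bands tracked in points; investigation/debugging is estimated in time instead.
POINT_BANDS = ("ai_amenable", "coordination")


@dataclass
class StoryOutcome:
    key: str
    planned_points: int
    actual_points: int  # assumption: re-estimated when the story closes
    scope_added_mid_sprint: bool
    band: str


def scope_ratio(stories):
    """Expected scope divided by actual scope; 1.0 means planning was accurate."""
    planned = sum(s.planned_points for s in stories)
    actual = sum(s.actual_points for s in stories)
    return planned / actual


def mid_sprint_scope_changes(stories):
    """Number of stories that had scope added after sprint start."""
    return sum(1 for s in stories if s.scope_added_mid_sprint)


def velocity_by_band(stories):
    """Delivered points per band, for the bands that are tracked in points."""
    totals = {}
    for s in stories:
        if s.band in POINT_BANDS:
            totals[s.band] = totals.get(s.band, 0) + s.actual_points
    return totals


sprint = [
    StoryOutcome("push-notifications", 3, 13, True, "coordination"),
    StoryOutcome("order-history-crud", 3, 3, False, "ai_amenable"),
    StoryOutcome("admin-list-scaffold", 5, 5, False, "ai_amenable"),
]

print(f"scope ratio: {scope_ratio(sprint):.2f}")
print(f"mid-sprint scope changes: {mid_sprint_scope_changes(sprint)}")
print(f"velocity by band: {velocity_by_band(sprint)}")
```

With this sample data the ratio is about 0.52, below the 0.7 signal line, and the shortfall comes entirely from the one story planned at 3 points that turned out to be 13.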
Final take
The story point debate — points vs. hours vs. t-shirts vs. no estimates — is a distraction from the actual problem. Teams that drop story points and move to flow metrics still have unplanned scope expansions mid-sprint. Teams that recalibrate to AI-assisted effort still discover mid-sprint that the infrastructure they assumed existed does not. The unit of measurement is not where estimation fails.
Estimation fails because planning runs on memory, and memory is always a stale picture of the system. AI made teams faster. It did not make scope more visible. The gap between planning-time system knowledge and implementation-time system reality is where sprints slip — and that gap closes when planning has access to live codebase context, not when teams find a better way to count effort. Fix the information problem first. The measurement problem becomes much easier after that.