Your AI Agents Shipped 200 PRs This Month. Which Ones Actually Mattered?

The number on the slide says 200 PRs merged this month. A year ago it was 60. Leadership nods. The engineering org adopted AI tools, and the graph is pointing up. Case closed.

Nobody on that call can answer the next question: which of those 200 PRs shipped something a customer asked for? Which closed a ticket that had been sitting in the support queue for three weeks? Which introduced the regression that cost three engineers a full week to trace? Which were the AI reformatting test files, renaming variables, and splitting functions nobody asked it to split? The number 200 is real. The meaning of 200 is gone.

High PR count has become the vanity metric of the AI coding era. It replaced lines of code, which was the vanity metric of the pre-AI era. Both numbers are easy to generate, easy to report, and nearly impossible to interpret without the context that would make them mean something.

What 200 PRs actually contains

Run the breakdown on a real AI-assisted team operating at that velocity and the distribution is not what leadership's mental model assumes.

Anatomy of 200 merged PRs on an AI-assisted team

200 PRs in a month — what actually happened:

  Customer-reported bug fixes:          12 PRs
  Accepted epic progress:               28 PRs
  Unplanned refactors, no Jira link:    41 PRs
  Test file reformatting:               19 PRs
  Variable renames, function splits:    31 PRs
  Dependency bumps:                     22 PRs
  Regressions fixed (3 engineers, 1wk): 8 PRs
  Unclear / no description:             39 PRs

  PRs leadership counted as evidence of AI ROI: 200
  PRs that moved a customer outcome:    40

The 40 PRs that actually moved a customer outcome are real. They represent a genuine acceleration — a year ago, that number might have been 30 on a slower team. But leadership is celebrating 200 and measuring AI ROI against 200. The 160 PRs that weren't customer-outcome work are not failures — some are necessary maintenance, some are legitimate internal improvements — but they are not evidence of the business value the AI adoption narrative requires.
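The arithmetic behind that split is easy to check. The sketch below tallies the categories from the breakdown above; the dictionary keys and the choice of which categories count as customer-outcome work are illustrative assumptions, chosen to match how the 40 is reached.

```python
# Category counts copied from the breakdown above.
breakdown = {
    "customer_bug_fixes": 12,
    "accepted_epic_progress": 28,
    "unplanned_refactors": 41,
    "test_reformatting": 19,
    "renames_and_splits": 31,
    "dependency_bumps": 22,
    "regression_fixes": 8,
    "unclear": 39,
}

# Assumption: only these two categories count as customer-outcome work,
# which is how the 40 in the breakdown is reached.
CUSTOMER_OUTCOME = {"customer_bug_fixes", "accepted_epic_progress"}

total = sum(breakdown.values())
outcome = sum(n for name, n in breakdown.items() if name in CUSTOMER_OUTCOME)
share = outcome / total

# prints: 40 of 200 PRs (20%) moved a customer outcome
print(f"{outcome} of {total} PRs ({share:.0%}) moved a customer outcome")
```

One in five is the figure the slide never shows.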

The problem is not that those 160 PRs happened. The problem is that nobody has distinguished them from the 40 that mattered. The signal has been averaged into the noise.

Why PR count broke as a metric

Before AI coding tools, PR count tracked feature output reasonably well because PRs required human deliberation. An engineer decided to write the code, scoped the change, made tradeoffs, and opened a PR when it was ready. That friction meant the denominator — number of PRs — was roughly proportional to the numerator — intentional development decisions.

How AI tools broke the PR count signal

Why PR count became a vanity metric:

  Before AI coding tools:
  -> PR count ≈ feature output
  -> each PR required human deliberation
  -> PR volume was gated by how fast humans could write and review changes

  After AI coding tools:
  -> AI generates test suites, reformats files, splits functions unprompted
  -> agents open PRs on tasks nobody requested
  -> review bandwidth becomes the constraint, not generation capacity
  -> high PR volume signals AI activity, not business progress

AI agents change the denominator without changing the numerator. An agent can open 15 PRs in a day: one that implements a feature, and fourteen that reorganize tests, extract helper functions, update imports, and rename things according to its own style preferences. The agent is not wrong to do this — many of those changes are improvements. But they register in the PR count the same way the feature PR does. The count no longer tracks decisions. It tracks generation activity.

This is why the rising volume of AI-generated code strains review bandwidth even when individual PRs look clean. Review bandwidth is the scarce resource, and when PRs that need real review and PRs that are mechanical activity are counted under the same metric, the signal-to-noise ratio is low.

The regression that cost three engineers a week

There was a regression in last month's 200 PRs. There almost always is. On a team shipping at this velocity, finding it after the fact requires answering a question that no GitHub dashboard surfaces: which of these PRs touched the services that customers are currently reporting problems with?

In a team where Jira holds the customer bug reports and GitHub holds the PR history, those two datasets are sitting in separate systems with no structural connection. The engineer assigned to debug the production issue has to manually cross-reference — which PRs merged in the window before the symptoms appeared, which services those PRs touched, which of those services has an open Jira ticket describing the symptoms. That manual cross-referencing is the week's worth of work. It is not debugging. It is archaeology, done against two disconnected systems.
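Once both datasets sit side by side, the cross-reference itself is a small join on service and time window. The following is a minimal sketch, assuming PR and ticket records have already been exported into Python objects; the class names, fields, seven-day window, and sample data are all hypothetical and do not correspond to the GitHub or Jira APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class MergedPR:
    number: int
    merged_at: datetime
    services: frozenset[str]  # services the PR touched (hypothetical field)


@dataclass(frozen=True)
class BugTicket:
    key: str
    service: str
    first_seen: datetime  # when symptoms were first reported


def suspect_prs(prs, ticket, window=timedelta(days=7)):
    """Return PRs that touched the ticket's service and merged within
    `window` before the symptoms first appeared, newest first."""
    start = ticket.first_seen - window
    hits = [
        pr for pr in prs
        if ticket.service in pr.services
        and start <= pr.merged_at <= ticket.first_seen
    ]
    return sorted(hits, key=lambda pr: pr.merged_at, reverse=True)


# Made-up sample data for illustration only.
prs = [
    MergedPR(101, datetime(2024, 5, 1), frozenset({"billing"})),
    MergedPR(102, datetime(2024, 5, 6), frozenset({"billing"})),
    MergedPR(103, datetime(2024, 5, 9), frozenset({"billing", "auth"})),
    MergedPR(104, datetime(2024, 5, 10), frozenset({"search"})),
    MergedPR(105, datetime(2024, 5, 13), frozenset({"billing"})),
]
ticket = BugTicket("SUP-1", "billing", datetime(2024, 5, 12))

suspects = [pr.number for pr in suspect_prs(prs, ticket)]
print(suspects)  # [103, 102]
```

The hard part in practice is not this filter but populating the `services` field, since neither system records which services a PR touched in a form the other can read.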

The regression was not hidden. The information to find it was present in both systems. The connection between them was not.

What PR count cannot tell you — but needs to

The questions that would make the 200-PR stat meaningful are not exotic. They are the questions any product or engineering leader should be able to answer at a sprint retrospective.

Questions PR count cannot answer without Jira integration

Questions PR count cannot answer:

  -> Which PRs this month touched services with open P1 bugs in Jira?
  -> Which merged PRs correspond to accepted epics vs. unplanned changes?
  -> How many PRs this sprint had a linked Jira ticket at merge time?
  -> Which PRs introduced changes that support has reported symptoms of?
  -> What fraction of AI-generated PRs were in services that have active customer escalations?
  -> Which "refactor" PRs modified behavior, and which were purely cosmetic?

None of those questions can be answered from GitHub alone. None can be answered from Jira alone. They all require the relationship between what engineering shipped and what the business intended — which PRs correspond to accepted work, which correspond to customer pain, and which represent autonomous AI activity that was never explicitly planned.
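The simplest of these relationships, whether a PR had a linked ticket at all, can be approximated by scanning PR titles and descriptions for issue keys. This is a rough sketch, assuming keys follow the common Jira shape of an uppercase project key, a hyphen, and a number; the function name and sample titles are hypothetical.

```python
import re

# Assumption: issue keys look like "PAY-123" (uppercase project key,
# hyphen, number). Adjust the pattern if your projects differ.
JIRA_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def linked_keys(title: str, body: str = "") -> list[str]:
    """Return the distinct Jira-style keys mentioned in a PR, in order."""
    found = JIRA_KEY.findall(f"{title}\n{body}")
    return list(dict.fromkeys(found))  # de-duplicate, keep first-seen order


# Hypothetical PR titles.
titles = [
    "PAY-123: fix rounding in invoice totals",
    "Refactor test helpers",
    "Bump lodash (SEC-9, PAY-123)",
]
linked = [t for t in titles if linked_keys(t)]
unlinked = [t for t in titles if not linked_keys(t)]
```

A pattern match like this also catches strings such as "UTF-8", so a real pipeline would likely validate candidates against the list of known project keys before counting a PR as linked.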

The absence of that relationship is not a tooling inconvenience. It is a measurement gap. A team that cannot answer those questions is optimizing for a number that does not connect to the outcomes the business actually cares about. AI code review tools that operate only on diffs do not close this gap — the gap is not in the code, it is between the code and the intent.

Connecting the PR flood to business intent

The fix is not to count PRs differently. It is to stop treating PR count as the primary signal and start asking which PRs corresponded to work that mattered — measurably, against Jira epics, customer tickets, and sprint commitments.

Kognita: PR activity mapped to Jira intent

Kognita: connecting the PR flood to business intent

  "Which PRs this month touched services that have open
   customer-reported bugs in Jira?"
  -> surfaces the 12 that were actual bug fixes
  -> surfaces the 8 that fixed regressions introduced along the way

  "Which of this sprint's merged PRs correspond to
   accepted epics vs. unplanned changes?"
  -> 28 linked to accepted work
  -> 41 with no Jira connection — unplanned drift

  "Which services have had the most PR activity but
   the fewest closed Jira tickets against them?"
  -> identifies AI churn: high output, no product progress

Kognita connects the codebase and Jira in a queryable layer. The question "which PRs this month touched services that have open customer-reported bugs?" becomes a query, not a multi-hour investigation. The question "which of this sprint's merged PRs correspond to accepted epics?" gets answered without manually cross-referencing GitHub and Jira export files.
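To make the third question concrete, here is one way a churn ranking could be computed once PR and ticket data share a service label. This is an illustrative sketch only, not Kognita's implementation or API; the function name, the scoring formula, and the sample data are assumptions.

```python
from collections import Counter


def churn_ranking(pr_services, closed_ticket_services):
    """Rank services by merged PRs per closed ticket, highest first.

    pr_services: one service name per (PR, service touched) pair.
    closed_ticket_services: one service name per closed ticket.
    """
    prs = Counter(pr_services)
    closed = Counter(closed_ticket_services)
    # Add 1 to the ticket count so services with zero closed tickets
    # do not divide by zero and still rank as high churn.
    scores = {svc: n / (closed[svc] + 1) for svc, n in prs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up data for illustration only.
pr_services = ["search"] * 30 + ["billing"] * 12 + ["auth"] * 9
closed_tickets = ["billing"] * 5 + ["auth"] * 2 + ["search"] * 1

ranking = churn_ranking(pr_services, closed_tickets)
print(ranking)  # [('search', 15.0), ('auth', 3.0), ('billing', 2.0)]
```

A high score is a prompt to look closer, not a verdict: a service can attract many PRs and few tickets for legitimate reasons, such as a planned migration.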

PR vanity metrics share a root cause with technical debt that is invisible to leadership: both are symptoms of a measurement layer that was built for a world where engineering output was slower and more legible. AI-speed engineering generates output faster than the measurement layer can interpret it. The output accumulates without the business context that would make it meaningful.

What the metric should actually track

Teams that have moved past PR count as a primary signal are tracking a different set of questions. What fraction of PRs this sprint had a linked Jira ticket at merge time? Of those, what fraction corresponded to accepted epics rather than engineer-initiated tasks? How many customer-reported issues were closed by merged PRs this month, and how does that compare to how many new issues were opened?

These metrics require the connection between codebase activity and business intent. They are harder to compute than a count of merged PRs. They are also the only metrics that tell you whether the AI adoption investment is delivering outcomes or just generating output. An organization that cannot distinguish between the two will keep celebrating the number while the outcomes stay flat, which is the familiar pattern of AI coding velocity that doesn't translate to business results.
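As a rough illustration of how those three questions turn into numbers, the sketch below computes them from per-PR records. The record shape, field names, and sample values are hypothetical, and it assumes the ticket and epic linkage has already been resolved upstream.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PRRecord:
    number: int
    ticket: Optional[str]        # Jira key linked at merge time, if any
    epic_accepted: bool = False  # True if the ticket rolls up to an accepted epic


def sprint_metrics(prs, issues_closed, issues_opened):
    """Compute linkage fractions and the net change in customer issues."""
    linked = [pr for pr in prs if pr.ticket]
    epic = [pr for pr in linked if pr.epic_accepted]
    return {
        "linked_fraction": len(linked) / len(prs) if prs else 0.0,
        "epic_fraction_of_linked": len(epic) / len(linked) if linked else 0.0,
        # Negative means more issues were closed than opened.
        "net_issues": issues_opened - issues_closed,
    }


# Made-up sprint data for illustration only.
sprint = [
    PRRecord(1, "PAY-10", epic_accepted=True),
    PRRecord(2, "PAY-11", epic_accepted=True),
    PRRecord(3, "OPS-4"),
    PRRecord(4, None),
    PRRecord(5, None),
]
metrics = sprint_metrics(sprint, issues_closed=12, issues_opened=9)
```

With this sample, three of five PRs are linked, two of those three map to accepted epics, and the customer issue backlog shrinks by three.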

Final take

200 PRs is a fact. It is not a result. The result is whether customers got what they needed, whether bugs got fixed, whether the system is in better shape than it was 30 days ago. Those results live in the connection between what was merged and what was intended — a connection that requires codebase truth and Jira truth in the same place.

PR count is not a business metric. It never was, and AI coding tools have made it less useful faster than most organizations have noticed. The question isn't how many PRs shipped. The question is which ones moved something that mattered — and answering that requires the link between engineering activity and business intent that most teams are still missing.