We're Running 30 AI Agents. The CFO Wants to Know What We Got for It.

The Anthropic invoice for Q1 came in at $61,400. That number is on a slide in the quarterly business review. The CFO is looking at it. She is not angry — she approved the AI tooling expansion in November and she understood it would cost money. What she is asking, reasonably, is what the company got for it. The CTO says velocity is up. She nods. She asks which features shipped. He lists three. She asks which ones customers are actually using. He says he can follow up.

That follow-up never arrives with a satisfying answer — not because the CTO is disorganized, but because the information does not exist in a place anyone can pull it. The cost is on the invoice. The value is distributed across 312 merged PRs, 189 Jira tickets, and the institutional memory of twenty-eight engineers who ran Claude Code sessions all quarter and could not tell you, without significant effort, which of those sessions produced a feature a customer has seen.

The invoice is precise. The output is not.

Quarterly AI agent spend — Series B SaaS company (Q1 2026):

  Anthropic API (Claude Code, agent pipelines):    $61,400
  Cursor Pro seats × 28 developers:               $8,120
  OpenAI API (legacy agent integrations):          $9,300
  GitHub Copilot seats × 34 engineers:             $7,480
  Total quarterly AI tooling spend:                $86,300

  CFO question: "What did we get for $86,300?"

  CTO answer: "Velocity is up. We shipped a lot this quarter."

  CFO follow-up: "Which things? What did customers actually use?"

  CTO answer: [silence]

This is the shape of the problem at almost every engineering organization running AI agents in 2026. The cost side of the AI ledger is metered, itemized, and delivered monthly. The value side is a narrative — usually delivered as a slide with velocity metrics and a bullet list of features that may or may not be live, may or may not have shipped to all users, and may or may not have addressed anything a customer cared about.

The CFO is not asking an unreasonable question. She is asking the same question she would ask about any other significant expense line: what did this produce? The problem is not the question. The problem is that engineering organizations have built almost no infrastructure to answer it. Multi-agent coding at team scale produces enormous amounts of output — PRs, deploys, closed tickets — and almost none of it is connected, in any automatic way, to the business value that output was supposed to create.

What "shipped" actually means in an agent-heavy org

When an engineering team says they "shipped" something, they mean they merged a PR and deployed to production. That is a real thing. It is also not the same as "a customer can use this" or "this addressed a problem customers were having" or even "this is live for more than a small percentage of the user base."

Feature flags complicate this further. A team running 30 AI agents is producing code fast enough that features go behind flags routinely — not because they are not complete, but because deployment and exposure are managed separately. A Jira epic gets closed when the code ships. Whether the flag is ever turned on for users is a different decision, tracked differently, by a different person. The CFO looking at "11 epics closed this quarter" is not seeing that two of those epics are behind flags that have not been enabled and may not be enabled this quarter either. She is seeing 11 closed epics and drawing the natural inference that 11 things are live.

This is not a new problem, but AI agents have made it dramatically more acute. When a team ships a sprint's worth of features every week because agents are doing the implementation work, the gap between Jira-closed and customer-live widens faster than any manual process can bridge it. The organization's ability to account for its own output does not scale with the speed of agent-driven development.

The gap between Jira reality and codebase reality

The quarter in numbers — with and without codebase ground truth

The quarter in numbers — before and after Kognita:

  WITHOUT VISIBILITY:
  -> PRs merged: 312
  -> Jira tickets closed: 189
  -> Deploys to production: 47
  -> Epics marked "done" in Jira: 11

  Engineering summary: "Strong quarter. Shipped the checkout redesign,
  payment retry improvements, and the new notification center."

  CFO interpretation: unclear. Which of the 11 epics map to the features
  sales is demoing? Which ones are actually live vs. behind a flag?
  Which of the 312 PRs came from agents and which from humans?

  WITH KOGNITA (codebase + Jira ground truth):
  -> 4 of 11 epics shipped to production with customer-visible changes
  -> 2 epics are "done" in Jira but behind feature flags, not live
  -> 5 epics were closed in Jira without any corresponding production deploy
  -> Checkout redesign: live for 100% of users, 6 services changed
  -> Payment retry: live for users on the new billing plan only (23% of base)
  -> Notification center: deployed but no user has the feature enabled yet
  -> Agent-generated PRs: 187 of 312 (60%), touching 14 of 22 services

The disconnect here is not primarily a project management failure. It is an information infrastructure failure. Jira tracks ticket state. It does not track whether the code associated with a ticket is deployed, which services changed, whether the deploy went to all users or only a cohort behind a flag, or whether the feature is reachable from any user-facing surface. That information is in the codebase and the deployment pipeline. It is not in Jira. The people who could connect these two data sources — senior engineers — are not spending their time doing it. They are running agent sessions.
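The cross-reference described above can be sketched in a few lines. This is a minimal illustration, not a description of any particular tool: it assumes records have already been exported from Jira, the deployment pipeline, and the feature flag system, and the `Epic` fields, epic keys, and category labels are all hypothetical names invented for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Epic:
    """Hypothetical record joining three sources for one Jira epic."""
    key: str              # Jira epic key (made up for this example)
    jira_done: bool       # ticket state, as Jira reports it
    deployed: bool        # whether the deployment pipeline shows a production deploy
    flag_exposure: float  # share of users with the flag enabled, 0.0 to 1.0


def classify(epic: Epic) -> str:
    """Label an epic by what customers can actually reach."""
    if not epic.jira_done:
        return "in progress"
    if not epic.deployed:
        return "closed, never deployed"
    if epic.flag_exposure == 0.0:
        return "deployed, behind flag"
    return "live"


# Illustrative data only, shaped like the quarter described in this post.
epics = [
    Epic("CHK-1", jira_done=True, deployed=True, flag_exposure=1.00),
    Epic("PAY-7", jira_done=True, deployed=True, flag_exposure=0.23),
    Epic("NTF-3", jira_done=True, deployed=True, flag_exposure=0.0),
    Epic("SRCH-9", jira_done=True, deployed=False, flag_exposure=0.0),
]

summary = Counter(classify(e) for e in epics)
for label, count in summary.items():
    print(f"{label}: {count}")
```

The point of the sketch is that "done" in Jira is only the first of three checks. An epic counts as live only when the deploy record and the flag exposure agree with the ticket state.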

The result is that when the CFO asks "what did we ship this quarter," the honest answer requires someone to manually cross-reference Jira tickets against git history against deployment logs against feature flag configurations. That is an hour of work per epic, minimum. For eleven epics, that is at least eleven hours, more than a full working day. Most CTOs do not have that kind of time before the quarterly review. They prepare a slide based on what they remember and what they can pull from Jira in fifteen minutes. That slide is not wrong; it is just incomplete in ways that matter for the ROI conversation.

What agents actually touched, and what that cost

Agent-generated vs. human-generated PRs by service — Q1 2026

What agent-generated code actually touched this quarter:

  Service               | Agent PRs | Human PRs | Customer-Visible Change
  ----------------------|-----------|-----------|------------------------
  CheckoutService       |    34     |    12     | Yes — live for all users
  PaymentService        |    28     |     9     | Yes — live for 23% of users
  NotificationService   |    22     |     6     | No — behind flag, 0% exposure
  UserService           |    19     |    14     | No — internal refactor only
  APIGateway            |    17     |     8     | Yes — affects all API consumers
  BillingService        |    15     |     5     | Partial — new plan tier only
  LegacyAdapterLayer    |    12     |     3     | No — maintenance, no new capability
  ReportingService      |    11     |     4     | No — internal tooling only
  EmailService          |     8     |     7     | Yes — transactional templates updated
  SearchService         |     6     |     2     | No — in progress, not deployed

  Without this table: "agents shipped a lot across the codebase"
  With this table: a CFO can ask specific questions. A CTO can answer them.

This table does not exist anywhere in most engineering organizations today. The information that would produce it is in GitHub (which PRs were opened by which author, against which service, and when) and in the deployment pipeline and feature flag system. But no one has stitched it together. The CTO who could show this table to the CFO would have a radically different conversation. Instead of "velocity is up," he can say: "Agents wrote 60% of our PRs this quarter. Of that work, here is what reached customers, here is what is deployed but unexposed, and here is what is still in progress. The $61,400 Anthropic bill bought us the checkout redesign and the payment retry improvement at a pace we could not have matched with human implementation alone."
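As a rough sketch of what stitching the table together buys, the rows above can be loaded and summarized directly. The counts are copied from the table; treating "Partial" as customer-visible is an assumption made for this example. The table lists ten services, so its agent total is lower than the 187 agent PRs reported for the full quarter.

```python
# (service, agent PRs, human PRs, customer-visible change?) copied from the table above.
# "Partial" (BillingService) is counted as visible here; that is a judgment call.
rows = [
    ("CheckoutService", 34, 12, True),
    ("PaymentService", 28, 9, True),
    ("NotificationService", 22, 6, False),
    ("UserService", 19, 14, False),
    ("APIGateway", 17, 8, True),
    ("BillingService", 15, 5, True),
    ("LegacyAdapterLayer", 12, 3, False),
    ("ReportingService", 11, 4, False),
    ("EmailService", 8, 7, True),
    ("SearchService", 6, 2, False),
]

# Per-service share of PRs that came from agents.
for service, agent, human, visible in rows:
    share = agent / (agent + human)
    print(f"{service:<20} agent share {share:5.1%}  visible={visible}")

# How much of the agent work landed in services with a customer-visible change.
agent_total = sum(agent for _, agent, _, _ in rows)
agent_visible = sum(agent for _, agent, _, visible in rows if visible)

print(f"Agent PRs in listed services: {agent_total}")
print(f"Agent PRs in services with a customer-visible change: {agent_visible}")
```

A PR count is a crude proxy for effort, but even this split lets the conversation move from "agents shipped a lot" to how much of the agent work landed somewhere a customer can see.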

That is an ROI conversation. The current version — "velocity is up, trust us" — is a faith conversation. Faith is fine when budgets are small. It does not hold at $86,300 per quarter and growing.

Why governance has not caught up to agent speed

The organizational tooling most teams use to track engineering output was designed for human-paced development. Jira workflows, sprint ceremonies, and quarterly planning all assume a delivery cadence where a product manager and an engineer discuss a feature, the engineer implements it over days or weeks, and someone reviews the output before it ships. In that model, the PM and the engineer share enough context that the output is legible to both of them.

Agent-driven development breaks this model in two ways. First, the speed of output exceeds the speed of human review. Agent pipelines running without rate limits or governance can produce more code in a day than a PM can meaningfully review in a week. Second, the attribution problem becomes acute. When a human engineer writes a feature, there is a clear chain: ticket, PR, deploy, engineer who can explain it. When an agent writes a feature, the chain is: ticket, agent session, 30 PRs from that session, deploy, and no single person who can explain all of it from first principles because they were running five other agent sessions simultaneously.

The CFO is not wrong to be skeptical of "velocity is up." Velocity measured as PRs-per-week or deploys-per-month is a throughput metric, not an outcome metric. A team running 30 agents can have extraordinary throughput and produce very little that customers care about, if the agents are pointed at the wrong things, or if the output is not making it through flags into users' hands.

What the ROI conversation requires

What the CFO sees vs. what she needs to see:

  WHAT THE CFO SEES NOW:
  -> Invoice: $86,300
  -> Engineering update: "Velocity is up ~40% since we rolled out agents"
  -> Jira dashboard: 189 tickets closed
  -> Slide: "We shipped the checkout redesign and payment improvements"

  WHAT THE CFO NEEDS TO SEE:
  -> $86,300 spent on AI agents
  -> 4 epics reached customers: checkout redesign, payment retry,
     notification service refactor, API versioning layer
  -> Checkout redesign: estimated to address the #1 support complaint
     category (12% of all tickets last quarter were checkout friction)
  -> Payment retry: reduced failed payment rate from 4.1% to 2.8%
     on the affected billing tier
  -> 2 epics "done" in Jira, not live in production — cost incurred,
     no customer value delivered yet
  -> 5 epics closed without production deploy — need review

  The gap between these two lists is the ROI problem.

The gap between these two views is the product management and measurement infrastructure problem that most AI-forward engineering organizations have not solved. It is not a hard problem in theory. The information exists. It is in the codebase, in the Jira history, in the deployment logs, in the feature flag system. The problem is that it exists in four different places, none of which talk to each other automatically, and the person who could stitch it together (a senior engineer) is not spending their time doing it.

Kognita connects the output layer (what the codebase actually looks like after the agent sprint, which Jira epics have corresponding production deploys, which services changed and which are customer-facing) to plain-language summaries that a non-technical leader can read and act on. A CTO can ask "what did our agents build in Q1 that customers actually interact with?" and get an answer grounded in codebase and Jira reality, not engineering self-reporting from memory. That answer is the foundation of the ROI conversation. Without it, the CFO is paying $86,300 per quarter on a handshake agreement that something good is happening. A managed runtime for AI agents that surfaces output visibility is not a luxury at this scale; it is the infrastructure that makes the spend defensible.

The conversation that actually needs to happen

The CFO asking "what did we get for this?" is not hostile to AI investment. She approved it. She wants it to work. What she needs is a way to evaluate it against business outcomes rather than engineering throughput metrics. That evaluation requires connecting three things that are currently disconnected: the AI spend (the invoice), the engineering output (PRs, deploys, epics), and the business outcome (what customers can see and use).
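Once the three pieces are connected, even simple arithmetic becomes possible. The sketch below uses only figures quoted earlier in this post. Spreading the entire tooling spend evenly across epics is a deliberate oversimplification for illustration, not a recommended costing method, and the variable names are invented for the example.

```python
# Figures quoted earlier in this post.
total_spend = 86_300  # total quarterly AI tooling spend, in dollars

epic_outcomes = {
    "live with customer-visible changes": 4,
    "done in Jira, behind a flag": 2,
    "closed without a production deploy": 5,
}

closed_epics = sum(epic_outcomes.values())
live_epics = epic_outcomes["live with customer-visible changes"]

# Assumption: the whole spend is attributed evenly to epics, which ignores
# refactors, maintenance, and any work that never belonged to an epic.
cost_per_closed_epic = total_spend / closed_epics
cost_per_live_epic = total_spend / live_epics

print(f"Spend per closed epic: ${cost_per_closed_epic:,.0f}")
print(f"Spend per live epic:   ${cost_per_live_epic:,.0f}")
```

The two numbers differ by almost a factor of three. The first is what the Jira dashboard implies; the second is closer to what the CFO is actually asking about.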

Most engineering organizations have the first, have a partial picture of the second, and have almost nothing on the third. The ROI conversation fails not because AI agents are not producing value, but because the value is invisible to everyone who is not reading the code. A PM who knows that the checkout redesign is live for 100% of users, that it targets the support category behind 12% of all tickets last quarter, and that it was implemented primarily through agent-generated PRs can make a case for the AI spend that will satisfy the CFO. A CTO who says "velocity is up" cannot.

Final take

Running 30 AI agents is not inherently a problem. The cost is real and the output is real. The problem is the measurement gap between the two. When the cost is on an invoice and the value is in engineering folk knowledge, the ROI conversation is structurally broken — not because the AI is failing to produce value, but because no one has built the infrastructure to surface that value in terms a non-technical leader can evaluate.

The quarterly review where the CFO asks what we got for the AI spend is not a communications problem. It is a visibility problem. Solving it requires connecting the codebase, the Jira history, and the deployment state — automatically, without a senior engineer spending a day on it before every review. That infrastructure is what makes the difference between "trust us, velocity is up" and a specific, defensible answer about what the agents actually built and who is using it.