The Cost of Running AI Agents Is Visible. The Value Isn't.
The Anthropic invoice shows $43,000 this month. That number is exact. It was calculated by a meter and delivered to a billing system and it will appear on a finance report with a decimal point. The GitHub stats show 847 PRs merged. That number is also exact — it came from an API call against a database. Both numbers are real. Neither of them tells you whether the $43,000 was worth spending.
What those numbers cannot tell you: which of the 847 PRs shipped something a customer asked for. Which ones reduced support ticket volume. Which features customers actually found and used after they shipped. Which ones introduced the bugs that cost three engineers a week to diagnose and fix in the sprint that followed. The cost side of the AI ledger is precise. The value side is a guess dressed up as a summary in a quarterly slide.
The measurement asymmetry is structural
The AI agent ledger — what is measurable vs. what is not:
COST SIDE (precise, automatic):
-> Anthropic API: $43,000 this month
-> OpenAI API: $7,200 this month
-> Cursor Pro seats: $2,800 this month
-> GitHub Copilot seats: $2,100 this month
-> Total: $55,100
VALUE SIDE (approximate, manual, incomplete):
-> PRs merged: 847
-> Tickets closed: 631
-> Features shipped: "a lot — it was a good quarter"
-> Customer impact: ?
-> Support ticket reduction: ?
-> Revenue correlation: ?
-> Bugs introduced by agent code, cost to fix: ?
The cost side has a decimal point. The value side has a question mark.
That asymmetry is not a communications failure. It is a measurement infrastructure failure.
This asymmetry is not a failure of organizational discipline. It is a structural property of how AI tooling bills and how engineering organizations account for their output. The billing infrastructure for AI is exceptionally well-instrumented: every token is counted, every API call is logged, every seat is tracked. The value infrastructure for engineering output is almost completely absent. Jira tells you a ticket was closed. It does not tell you whether the code behind the ticket is live, whether it reached users, or whether it did what it was supposed to do.
This gap existed before AI agents. Agents make it worse by dramatically increasing throughput without a corresponding increase in the organizational capacity to understand what that throughput produced. A team that shipped twenty features per quarter manually had enough friction that the PM, the CTO, and the engineering leads all knew what each feature was and where it stood. A team shipping eighty features per quarter with agents does not have that friction — and the knowledge it provided has not been replaced by any automated system.
What 847 merged PRs actually tells you
847 PRs merged — what that number does and does not tell you:
WHAT IT TELLS YOU:
-> Code was committed and reviewed (or auto-merged)
-> The CI pipeline passed (or was bypassed)
-> A developer or agent opened a PR that someone approved
WHAT IT DOES NOT TELL YOU:
-> Which PRs shipped a feature a customer asked for
-> Which PRs were maintenance, refactoring, or test coverage (no user value)
-> Which PRs introduced the regression that cost 3 engineers a week to debug
-> Which PRs are deployed to production vs. behind a feature flag vs. reverted
-> Which PRs touched customer-facing services vs. internal tooling
-> How many of the 847 were agent-generated vs. human-written
-> Whether the agent-generated PRs were reviewed with the same rigor
847 merged PRs is a throughput number.
Throughput is not value. Throughput is a precondition for value.
Throughput metrics are seductive because they are easy to pull and impressive to report. 847 PRs is a large number. It sounds like a lot of work happened. And work did happen: agents wrote code, engineers reviewed it, CI passed, code merged. But throughput is a precondition for value, not the same thing as value. A team can merge 847 PRs and ship nothing that a customer cares about, if the PRs are pointed at internal refactoring, test coverage improvements, and features behind flags that were never enabled.
The more interesting question — and the one that connects to ROI — is not how many PRs merged but which services changed, which changes are live, and which live changes addressed something that mattered to customers. That information is not in the PR count. It is in the codebase, the deployment pipeline, and the feature flag system. None of those talk to each other automatically. None of them talk to Jira. The person who could stitch them together is a senior engineer, and that engineer is running another agent session.
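To make the distinction concrete, here is a minimal sketch of splitting a raw PR count into buckets that say something about value. It assumes someone has already joined each merged PR to its deployment state and its service's customer exposure, which is the hard part. The `MergedPR` type, the state labels, and the sample rows are all hypothetical and invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MergedPR:
    number: int            # hypothetical PR number
    service: str
    customer_facing: bool  # does the service serve customers directly?
    state: str             # assumed labels: "live", "flagged_off", "reverted"


def summarize(prs):
    """Bucket merged PRs by whether they are live and who can see them."""
    buckets = Counter()
    for pr in prs:
        if pr.state == "live" and pr.customer_facing:
            buckets["live_customer_facing"] += 1
        elif pr.state == "live":
            buckets["live_internal"] += 1
        else:
            # Anything not live is counted under its own state label.
            buckets[pr.state] += 1
    return dict(buckets)


# Invented sample data, not real measurements.
sample = [
    MergedPR(1, "CheckoutService", True, "live"),
    MergedPR(2, "UserService", False, "live"),
    MergedPR(3, "NotificationService", True, "flagged_off"),
    MergedPR(4, "SearchService", True, "live"),
    MergedPR(5, "CheckoutService", True, "reverted"),
]

summary = summarize(sample)
print(len(sample), "merged;", summary)
```

Five PRs merged in this toy sample, but only two of them are live in a customer-facing service. The first number is throughput; the second is the one the ROI conversation needs.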
The hidden cost that does not appear on the invoice
The hidden cost of agent-generated code at speed:
Month 1: Agents ship 312 PRs across 14 services
Month 2: QA finds a data integrity issue in CheckoutService
-> Root cause: agent PR from month 1 changed validation logic
-> Fix: 3 engineers, 6 days
-> Engineering cost: ~$18,000 (at fully-loaded rate)
-> That cost does not appear on the Anthropic invoice
Month 2: Customer escalates data export failures
-> Root cause: agent PR changed field ordering, broke downstream parser
-> Fix: 1 engineer, 3 days + customer success time
-> Engineering cost: ~$6,000
-> Customer success cost: ~$2,000
-> That cost also does not appear on the Anthropic invoice
The Anthropic invoice for month 1: $43,000
The actual cost of month 1 agent output: $43,000 + $26,000 in fixes
Total: $69,000
The invoice shows $43,000. The true cost is only visible if someone
connects the agent-generated PRs to the incidents they caused.
The Anthropic invoice shows what you paid to generate the code. It does not show what you paid to fix the code that should not have shipped. Agent-generated code at high speed creates a specific risk: the review process that would catch a subtle logic error or a breaking API change gets compressed or skipped, because there is too much volume for careful review of every PR. A team merging 847 PRs per month is not reviewing each one with the same depth they would give to 100 PRs per month. The math does not work.
The bugs that get through never appear on the invoice. They appear as incident tickets, customer escalations, and engineering time spent on fixes that were not in the sprint plan. In the example above, the two regressions cost roughly $18,000 and $8,000, or $26,000 together, so a couple of agent-introduced bugs can erase the efficiency gain the agents were supposed to provide. But because the bug fix cost and the agent generation cost appear in different places (one on the Anthropic invoice, one in engineering salaries), no one is adding them together.
This is not an argument against AI agents. It is an argument for connecting the output to the outcomes, so the true cost of agent-generated code is legible rather than distributed invisibly across multiple budget lines and multiple months. Evaluating AI tools without accounting for the downstream cost of their output produces systematically optimistic ROI estimates.
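The arithmetic in the example above is simple once the link between incidents and agent PRs exists. The sketch below reuses the article's illustrative figures. The `Incident` type and the `caused_by_agent_pr` field are hypothetical, and the sketch assumes someone has already done the root-cause attribution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    description: str
    caused_by_agent_pr: bool  # assumes root cause has been traced to a PR
    engineering_cost: int     # USD, fully loaded
    other_cost: int = 0       # USD, e.g. customer success time


def true_cost(invoice_usd, incidents):
    """Invoice plus the cost of incidents traced back to agent PRs."""
    fixes = sum(
        i.engineering_cost + i.other_cost
        for i in incidents
        if i.caused_by_agent_pr
    )
    return invoice_usd + fixes


# Figures taken from the illustrative month 1 / month 2 example.
incidents = [
    Incident("CheckoutService validation logic change", True, 18_000),
    Incident("Data export field ordering change", True, 6_000, 2_000),
]

total = true_cost(43_000, incidents)
print(f"Invoice: $43,000  True cost: ${total:,}")
```

The function is trivial on purpose. The expensive step is populating `caused_by_agent_pr`, which is exactly the link between output and outcome that most organizations do not record.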
Why engineering self-reporting is not the answer
The default mechanism for translating engineering output into business terms is engineering self-reporting: the CTO writes a quarterly summary, the engineering leads contribute bullet points, the Jira dashboard provides numbers, and the result is a slide that says something like "strong quarter — shipped checkout redesign, payment improvements, search upgrade." Everyone in the room nods. The CFO files the information and moves on.
Self-reporting has two problems that compound in an agent-heavy org. First, it is slow. Producing an accurate quarterly summary requires someone to reconstruct what happened from git history, Jira, deployment logs, and institutional memory. At 847 PRs per month, that reconstruction is not a two-hour project. It is a multi-day project that almost no one does rigorously, so summaries are based on the features people remember rather than the features that actually shipped.
Second, self-reporting conflates what engineering considers done with what is actually live. A feature where the code merged is "done" from an engineering perspective. Whether the flag is enabled, whether the feature is visible to all users or a cohort, whether the A/B test concluded — those are details that may or may not make it into the summary. The people reading the summary have no way to know what they do not know. The checkout redesign is "shipped." Whether it shipped to 100% of users or 15% in an ongoing rollout is a fact that changes the business conversation, and it is frequently absent from engineering summaries because the person writing the summary does not think to include it, or does not know.
What connecting the output layer actually looks like
What codebase + Jira ground truth reveals about Q2 agent output:
SERVICES CHANGED BY AGENTS (by customer exposure):
-> CheckoutService: 89 agent PRs — live for 100% of users
-> PaymentService: 67 agent PRs — live for enterprise tier only (31%)
-> UserService: 54 agent PRs — internal refactor, 0% user exposure
-> NotificationService: 41 agent PRs — behind flag, not enabled
-> SearchService: 38 agent PRs — live for 100% of users
-> LegacyAdapter: 29 agent PRs — maintenance, no new capability
-> ReportingService: 24 agent PRs — internal dashboard only
JIRA EPICS WITH PRODUCTION REALITY:
-> "Checkout Redesign" — live, 100% users, 6 services changed
-> "Payment Retry Improvements" — live, enterprise only, 3 services
-> "Search Relevance Upgrade" — live, 100% users, 2 services
-> "Notification Center" — NOT live, behind flag, engineering done
-> "User Data Portability" — NOT live, Jira closed, deploy pending
-> "Reporting Overhaul" — internal only, no customer-facing change
A CTO who can show this table answers the ROI question.
A CTO who cannot show it is guessing.
The difference between the table above and a standard engineering summary is not the information but the source. A standard engineering summary is produced by a person reconstructing from memory and Jira. This table is produced by connecting codebase state to Jira epic status to deployment records. The person producing it does not need to remember what shipped. They need a system that can answer: for each Jira epic, which services changed, are those changes deployed, are they live for users or behind a flag, and which services are customer-facing?
Kognita connects these data sources and produces exactly that kind of output. A CTO can ask "what did our agents build in Q2 that customers actually interact with?" and get an answer grounded in what the codebase actually looks like after the quarter, not in what someone remembers or what Jira says in a status field that may not reflect deployment reality. At the token cost and throughput volume of multi-agent coding, that answer is the difference between a defensible ROI story and a faith-based one.
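As a rough illustration of the kind of question being answered, and not a description of how Kognita works, here is a minimal sketch that turns per-epic facts into the status strings used in the table above. The `EpicRecord` type and its fields are hypothetical. The rows mirror the article's illustrative epics, and `flag_enabled=True` for the reporting epic is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpicRecord:
    epic: str
    deployed: bool         # from deployment records
    flag_enabled: bool     # from the feature flag system
    customer_facing: bool  # from a service catalog (assumed to exist)
    exposure_pct: int      # share of users who can see the change


def production_status(r):
    """Describe what is actually true in production for one epic."""
    if not r.deployed:
        return "NOT live, deploy pending"
    if not r.flag_enabled:
        return "NOT live, behind flag"
    if not r.customer_facing:
        return "internal only"
    return f"live, {r.exposure_pct}% of users"


records = [
    EpicRecord("Checkout Redesign", True, True, True, 100),
    EpicRecord("Notification Center", True, False, True, 0),
    EpicRecord("User Data Portability", False, False, True, 0),
    EpicRecord("Reporting Overhaul", True, True, False, 0),
]

for r in records:
    print(f"{r.epic}: {production_status(r)}")
```

The order of the checks matters: an epic that is not deployed is reported as pending regardless of its flag state, and a flag that is off hides even a customer-facing change. Each field comes from a different system, which is why the join rarely happens by hand.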
The ROI conversation that is currently impossible
What a CTO wants to be able to say in the quarterly review: "We spent $55,000 on AI tooling this quarter. That produced the checkout redesign, live for 100% of users, which addressed the top support complaint category and reduced friction at checkout. It produced the search relevance upgrade, also live for 100% of users, which improved search result quality in ways the product team had been requesting for two quarters. The payment retry work is live for enterprise customers and has measurably reduced the failed payment rate. Three other projects are complete in engineering and will go live in Q3."
That statement is specific. It connects spend to output to customer exposure. It gives the CFO something to evaluate rather than something to take on faith. It tells the board which of the quarter's work is already producing results and which is deferred.
Most CTOs cannot currently make that statement because the information is not assembled anywhere. The $55,000 is on the invoice. The checkout redesign is in engineering folk memory and a Jira epic. Whether it is live for 100% of users is in a feature flag dashboard that the CTO may or may not have checked. The connection between the spend and the specific customer-visible outcome does not exist as a document anyone can pull before the quarterly review.
Final take
The AI agent cost problem is not that agents are expensive. It is that the expense is precise and the value is invisible. The invoice tells you what you paid. Nothing automatically tells you what you got — which services changed, which changes reached users, which changes addressed what customers cared about, and which changes cost more to clean up than they were worth to ship.
Connecting the output layer — codebase state, Jira epics, deployment reality — to business-readable summaries is not a nice-to-have for AI-forward engineering organizations. It is the infrastructure that makes the investment legible. Without it, the AI spend grows every quarter on the invoice while the value conversation stays stuck at "trust us, velocity is up." The invoice shows what you spent. Kognita shows what you got.