Same Model, 40-Point Gap in Task Completion Rate. The Harness Is Why.
Two engineering teams. Same company, same codebase, same underlying AI model. One completes 73% of real-world engineering tasks end-to-end. The other completes 31%. Nobody upgraded the model between the two measurements. Nobody changed the prompts. The difference was entirely in the harness — the infrastructure layer that wraps the model and governs how it operates.
This is the data point that should reframe how engineering leaders think about AI tool investment. The gap between the best and worst models in controlled benchmarks is typically 1 to 3 percentage points. The gap that harness design creates — between a team using the same model with thoughtful infrastructure versus one running default settings — is 20 to 40 points. Teams optimizing for model selection while neglecting harness design are optimizing the wrong variable.
What "same model, different outcomes" actually looks like
Both teams used Claude Sonnet as their primary agent model. Both ran it through a similar IDE integration. Both worked on roughly the same mix of engineering tasks: bug fixes, feature implementations, refactors, and migrations. The task completion metric was the same: did the agent complete the stated task correctly without requiring significant rework or human correction?
The high-performing team had invested in three things the low-performing team had not: pre-loaded codebase context that started every session grounded in the actual system, scoped tool permissions that contained the blast radius of any wrong decision, and domain-specific tools that returned precise answers for engineering-specific questions rather than generic file reads and web searches.
The low-performing team was not doing anything wrong. They were running the model the way most teams run it — with a system prompt, access to the files the developer had open, and the default tool set that ships with the IDE integration. That configuration works for simple, isolated tasks. It degrades systematically on the kinds of tasks that represent most real engineering work: tasks that span multiple services, require understanding conventions the developer never stated explicitly, and depend on knowing what else in the codebase might be affected.
The three harness variables that explain the gap
What separated the high-performing teams from the low-performing ones
(same model, real-world task completion benchmarks)
DIMENSION              LOW PERFORMERS              HIGH PERFORMERS
────────────────────────────────────────────────────────────────────────────
Context at session     Manual: developer           Automatic: index loads
start                  pastes files into           correct context before
                       the session                 first prompt
Tool availability      Generic: web search,        Specific: codebase graph,
                       file read, shell            service ownership, Jira
                                                   state, API schemas
Permission scope       Unrestricted: agent         Scoped: agent can read
                       can write anywhere          broadly but write only
                       it can read                 within authorized paths
Feedback loops         None: agent proceeds        Active: agent gets tool
                       on failed tool calls        output signals and retries
                       without signal              with adjusted approach
Re-indexing            Manual or absent:           Automatic: context is
                       context goes stale          always current with the
                       immediately                 live codebase
Task completion rate   31%                         73%
Gap to top models      32 points                   1–3 points
Note: the model upgrade path can recover 1–3 points.
The harness upgrade path can recover 30+ points.
Context at session start is the variable with the highest leverage and the most invisible cost. When a developer starts an AI session without pre-loaded codebase context, the first 8 to 12 minutes of the session go to orientation: pasting files, describing the system, explaining conventions that the agent would have known if the harness had been configured to load them. That orientation cost is invisible because it feels like "getting set up." But across a team of ten developers doing two sessions a day, it adds up to 160 to 240 person-minutes of re-briefing overhead every day, and the agent still works from an incomplete picture of the system because no developer can paste everything relevant in under 15 minutes.
The agents in the high-performing group did not spend time on orientation. The context was already loaded. They could ask questions and take actions from the first message, because the harness had pre-selected the relevant codebase context before the session opened. The completion rate improvement is not because the model is smarter — it is because the model starts each task with the information it needs rather than with whatever the developer happened to remember to paste.
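To make the idea of pre-selected context concrete, here is a minimal sketch of what a harness might do before the first prompt. Everything in it is an assumption for illustration: the `IndexEntry` shape, the `build_session_context` function, the keyword-overlap ranking, and the sample paths are hypothetical, and a real index would be far richer than three entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One hypothetical entry in a codebase index."""
    path: str                 # file the entry describes
    summary: str              # short description maintained by the indexer
    keywords: frozenset[str]  # terms used for relevance matching


def build_session_context(index: list[IndexEntry], conventions: str,
                          task: str, max_entries: int = 3) -> str:
    """Assemble the context block a harness would prepend before the first prompt."""
    task_words = set(task.lower().split())
    # Score each entry by keyword overlap with the task description.
    scored = [(len(entry.keywords & task_words), entry) for entry in index]
    # Keep only entries with some overlap, best matches first (ties broken by path).
    relevant = sorted((s for s in scored if s[0] > 0),
                      key=lambda s: (-s[0], s[1].path))[:max_entries]
    lines = ["TEAM CONVENTIONS:", conventions, "", "RELEVANT CODE:"]
    lines += [f"- {entry.path}: {entry.summary}" for _, entry in relevant]
    return "\n".join(lines)


# Hypothetical index contents and task, for illustration only.
index = [
    IndexEntry("services/billing/invoice.py", "Creates and voids invoices",
               frozenset({"billing", "invoice"})),
    IndexEntry("services/auth/tokens.py", "Issues and rotates API tokens",
               frozenset({"auth", "token"})),
    IndexEntry("libs/money.py", "Shared currency rounding helpers",
               frozenset({"billing", "rounding", "currency"})),
]
context = build_session_context(index, "Use Decimal for money; no floats.",
                                "Fix rounding bug in billing invoice totals")
print(context)
```

The point of the sketch is the ordering, not the ranking method: the conventions and the relevant entries are assembled by the harness before the developer types anything, so the unrelated auth module never reaches the session and nobody has to remember to paste the money helpers.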
Why permission scope matters for completion rates
Permission scope sounds like a security concern. It is also a completion rate concern. Here is the mechanism: when an agent has unrestricted write access and makes a wrong decision mid-task, the wrong decision cascades. It writes files the developer did not intend to modify. It creates dependencies the original task did not require. By the time the developer realizes something went wrong, the task has diverged enough that the correct path forward requires reversing changes that were never part of the original scope.
That kind of mid-task cascading failure is one of the most common reasons coding agent tasks fail to complete correctly. Often the immediate action looked reasonable in isolation, but the agent did not know it would have downstream consequences that it was not authorized or equipped to handle. A scoped harness limits the blast radius so that wrong decisions stay contained. The agent can make mistakes without those mistakes invalidating the entire task.
The counterintuitive result: more restricted agents complete more tasks successfully. Not because restriction makes agents smarter, but because it makes their failures smaller and more recoverable. A task where the agent took the wrong path and changed two files is recoverable in minutes. A task where the agent took the wrong path and changed forty files is a partial restart.
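One way a harness might enforce this kind of scope is a path check in front of every write. The sketch below is an illustration under stated assumptions, not any particular product's implementation: the function names, the `WriteNotAuthorized` exception, the repository-relative path convention, and the in-memory `files` store are all hypothetical.

```python
from pathlib import PurePosixPath


class WriteNotAuthorized(Exception):
    """Raised when a write targets a path outside the task's authorized roots."""


def is_write_allowed(target: str, authorized_roots: list[str]) -> bool:
    """Return True if the repo-relative target lies inside an authorized directory."""
    path = PurePosixPath(target)
    # Refuse absolute paths and any ".." segment instead of trying to resolve them.
    if path.is_absolute() or ".." in path.parts:
        return False
    # is_relative_to compares whole path segments, so "services/billing-v2"
    # does not match the root "services/billing".
    return any(path.is_relative_to(root) for root in authorized_roots)


def guarded_write(target: str, content: str, authorized_roots: list[str],
                  files: dict[str, str]) -> None:
    """Write to an in-memory file store only when the target is in scope."""
    if not is_write_allowed(target, authorized_roots):
        raise WriteNotAuthorized(f"write to {target} is outside the task scope")
    files[target] = content


# Hypothetical task scope: the agent may only write under services/billing.
scope = ["services/billing"]
files: dict[str, str] = {}
guarded_write("services/billing/invoice.py", "# patched", scope, files)
try:
    guarded_write("services/auth/tokens.py", "# unintended edit", scope, files)
except WriteNotAuthorized as exc:
    print(f"blocked: {exc}")
```

Reads are deliberately left unchecked here, matching the "read broadly, write narrowly" split described above. A blocked write also gives the agent an explicit signal it can react to, rather than a silent change the developer discovers later.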
Domain-specific tools versus generic tools
The third variable is tool design. Most teams deploy AI agents with generic tools: read a file, search the web, run a shell command. These tools are sufficient in the sense that they can eventually produce any answer the agent needs, but they are inefficient, and efficiency matters for completion rates.
When an agent needs to know which services depend on a shared library, a generic file-read tool requires the agent to search through dozens of files, parse import statements, and synthesize the result from raw text. A domain-specific codebase graph tool returns the answer directly. The difference is not just speed — it is context consumption. Every file the agent reads to reconstruct what a purpose-built tool would have returned directly is context that is not available for the actual task. At scale, generic tool use fills the context window with retrieval scaffolding instead of task reasoning.
Teams using domain-specific tools — codebase graph traversal, service dependency lookup, API schema retrieval — complete tasks more reliably because their agents are spending context on understanding and planning rather than on reconstructing information that the harness could have provided directly.
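The shared-library question is a good example of what such a tool can look like. The sketch below assumes the harness already maintains a map from each service to its direct dependencies; the function names and the sample services are hypothetical. The tool answers "who depends on this?" with a graph traversal instead of asking the agent to read import statements.

```python
from collections import defaultdict


def build_reverse_graph(depends_on: dict[str, list[str]]) -> dict[str, set[str]]:
    """Invert a service -> dependencies map into dependency -> direct dependents."""
    dependents: dict[str, set[str]] = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)
    return dependents


def services_depending_on(library: str, dependents: dict[str, set[str]]) -> list[str]:
    """Return every service that depends on library, directly or transitively."""
    seen: set[str] = set()
    stack = [library]
    while stack:
        current = stack.pop()
        # .get avoids inserting new keys into the defaultdict during traversal.
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)


# Hypothetical dependency data that an indexer would keep current.
depends_on = {
    "billing": ["money-lib", "auth-client"],
    "checkout": ["billing"],
    "auth-client": ["http-lib"],
    "reports": ["money-lib"],
}
reverse = build_reverse_graph(depends_on)
print(services_depending_on("money-lib", reverse))
# prints ['billing', 'checkout', 'reports']
```

The agent receives a three-item list instead of the contents of dozens of files, which is the context saving the paragraphs above describe.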
The harness upgrade path versus the model upgrade path
Three harness improvements that move completion rates the most
IMPROVEMENT 1: Pre-loaded codebase context
  Before: developer spends 8–12 minutes at session start assembling files,
          describing the system, explaining conventions
  After:  session opens with the codebase index already loaded; agent knows
          what exists, what it's for, and what the team conventions are
  Completion rate impact: +12–18 points
  Why it works: eliminates the "agent working from partial information"
          failure mode that causes compounding errors mid-task
IMPROVEMENT 2: Scoped tool permissions
  Before: agent has read/write access to everything the operator can reach;
          a wrong decision mid-task cascades without bounds
  After:  agent reads broadly but writes only to paths explicitly authorized
          for the current task; wrong decisions stay contained
  Completion rate impact: +8–12 points
  Why it works: prevents the "correct local decision, catastrophic global effect"
          failure mode that forces complete task restarts
IMPROVEMENT 3: Domain-specific tools over generic ones
  Before: agent uses file read, web search, shell: generic tools that work
          for any task but are optimal for none in a real codebase
  After:  agent has codebase graph traversal, service dependency lookup,
          API schema access: tools that return precise answers for the
          kinds of questions that appear in real engineering tasks
  Completion rate impact: +9–14 points
  Why it works: reduces tool call count, reduces hallucinated tool outputs,
          reduces context consumed by irrelevant retrieval results
Engineering leaders evaluating AI model upgrades are accustomed to measuring capability improvements in benchmark terms. A new model version scores 2 to 4 percentage points higher on SWE-bench. That translates to a roughly proportional improvement in real task completion rates: meaningful, but modest.
Harness improvements do not work this way. They address failure modes that the model cannot fix because they are not model problems; they are infrastructure problems. The model does not know your codebase because the harness did not load it. The model's writes are unbounded because the harness did not scope them. The model uses generic tools because the harness did not provide domain-specific ones. These are harness configurations, and changing them produces step-change improvements in completion rates rather than marginal ones.
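As a rough sanity check on the step-change claim, the impact ranges quoted for the three improvements can be added up and compared with the observed gap between the two teams. The sketch below assumes the three effects are independent and simply additive, which the figures above do not establish; in practice they may overlap.

```python
# Completion rate impact ranges, in percentage points, as quoted in this article.
improvements = {
    "pre-loaded codebase context": (12, 18),
    "scoped tool permissions": (8, 12),
    "domain-specific tools": (9, 14),
}
model_upgrade = (1, 3)  # range quoted for the model upgrade path

# Assumption: the three harness effects add up without overlapping.
harness_low = sum(low for low, _ in improvements.values())
harness_high = sum(high for _, high in improvements.values())
observed_gap = 73 - 31  # high-performing team minus low-performing team

print(f"harness improvements combined: +{harness_low} to +{harness_high} points")
print(f"model upgrade: +{model_upgrade[0]} to +{model_upgrade[1]} points")
print(f"observed gap: {observed_gap} points")
# prints:
# harness improvements combined: +29 to +44 points
# model upgrade: +1 to +3 points
# observed gap: 42 points
```

Under that additive assumption, the combined harness range brackets the 42-point gap between the two teams, while the model upgrade range covers only a small fraction of it.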
The practical implication: before spending engineering time evaluating the next model generation, most teams would get more from auditing their harness. What context is being loaded at session start? What tools are actually available? What happens when the agent makes a wrong decision mid-task? Those questions often reveal more improvement opportunity than the delta between model generations.
Why teams don't invest in harness engineering
The answer is mostly visibility. Model quality is measurable on a benchmark. Harness quality is harder to measure — it requires tracking task completion rates over time, understanding why specific sessions failed, and attributing failures to infrastructure rather than model capability. Most teams do not instrument their AI usage this way. They see failed sessions and blame the model. They upgrade the model and see modest improvement. They do not see that the harness was the bottleneck all along.
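Instrumenting this does not have to be elaborate. A minimal sketch, assuming each agent session is logged with an outcome and a reviewer-assigned failure cause; the `SessionRecord` fields and the cause labels are hypothetical, not a standard taxonomy.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionRecord:
    """One logged agent session (hypothetical schema)."""
    task_id: str
    completed: bool                       # finished without significant rework
    failure_cause: Optional[str] = None   # e.g. "missing_context", "model_error"


def completion_rate(records: list[SessionRecord]) -> float:
    """Fraction of sessions that completed their task; 0.0 for an empty log."""
    if not records:
        return 0.0
    return sum(r.completed for r in records) / len(records)


def failure_breakdown(records: list[SessionRecord]) -> Counter:
    """Count failed sessions by attributed cause."""
    return Counter(r.failure_cause or "unattributed"
                   for r in records if not r.completed)


# Hypothetical log entries, for illustration only.
log = [
    SessionRecord("T-1", True),
    SessionRecord("T-2", False, "missing_context"),
    SessionRecord("T-3", False, "out_of_scope_write"),
    SessionRecord("T-4", True),
    SessionRecord("T-5", False, "missing_context"),
    SessionRecord("T-6", False),
]
print(f"completion rate: {completion_rate(log):.0%}")
print(failure_breakdown(log).most_common())
```

Even a coarse breakdown like this separates failures a model upgrade could plausibly address from failures that originate in the harness, which is the attribution most teams never make.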
There is also an activation energy problem. Building a production harness — proper context loading, permission scoping, domain-specific tools, re-indexing — is months of infrastructure work. Teams running tight engineering timelines defer it as a future optimization. By the time the team is large enough that the problem is undeniable, the deferred work has compounded into a larger rebuild. At fifty engineers, the harness problem is not deferrable — it is active context fragmentation, inconsistent agent behavior across the team, and completion rates that vary by developer rather than by task.
What a managed harness resolves
The teams with the highest completion rates have one thing in common beyond the three harness variables: they did not build the infrastructure themselves. They used a managed layer that handled context loading, re-indexing, and tool provisioning — and focused their own engineering effort on the business-specific configuration that the generic infrastructure could not provide.
Kognita operates as a managed agent runtime that handles the harness components that are expensive to build and maintain: the always-current codebase index, the MCP server that exposes the index to Claude Code, Cursor, and other agents, and the Jira integration that adds ticket context to codebase context. Every session starts with the right context loaded. Every developer on the team runs from the same index rather than from whatever they happened to paste before starting work.
The completion rate gap closes not because the model changed, but because the infrastructure the model runs on stopped being the bottleneck.
Final take
The roughly 40-point completion rate gap (73% versus 31%) between high- and low-performing teams using the same model is not an edge case. It is the normal outcome when teams treat AI agent deployment as a prompt engineering problem rather than a harness engineering problem. The model is the same. The infrastructure around it is not.
Model upgrades move the needle by 1 to 3 points. Harness improvements move it by 20 to 40. The most actionable thing most teams can do with their AI tool budget is not to evaluate a newer model — it is to audit the infrastructure layer the current model runs on and address the failure modes that no model upgrade can fix.
Your AI agents are probably underperforming not because the model is insufficient, but because the harness around it is incomplete. That is fixable without waiting for the next model generation.