Harness Evolution: When to Simplify Your Agent Architecture
Anthropic's journey from complex sprint-based harnesses to single-session builds reveals a critical competitive intelligence lesson: every architectural component encodes an assumption about model limitations — and those assumptions go stale faster than you think.

Anthropic published their second harness design article in four months. The first, November 2025, introduced sprint-based context resets and structured handoffs for building software with Claude. The second, March 2026, simplified most of that machinery away. The interesting question is not what they built. It is what they removed, and what the removal pattern tells you about how capability improvements propagate through infrastructure decisions.
This is a competitive intelligence dispatch, not a tutorial. The subject is organizational assumption decay — the rate at which engineering decisions that were correct become carrying costs — and the framework for auditing your own systems before the decay compounds.
The Assumption Inventory
Every component in an agent harness encodes an assumption about what the model cannot do. The engineering is compensatory — it exists because the model, at the time of design, had a demonstrated failure mode that the harness mitigated. Mapping the components to their assumptions:
- Context resets encoded the assumption that the model degrades as context fills. This was context anxiety — the empirical observation that Sonnet 4.5 lost coherence past a certain token threshold, requiring periodic flushes to maintain output quality.
- Sprint decomposition encoded the assumption that the model could not hold a full project in working memory. Large tasks had to be broken into bounded sprints with explicit handoff artifacts.
- Contract negotiation encoded the assumption that the model would drift from scope if not formally constrained. A negotiation step between planner and generator locked the deliverable specification before work began.
- Evaluator agents encoded the assumption that the model could not objectively judge its own output. External evaluation ran after every sprint to catch regression.
These were not theoretical concerns. Sonnet 4.5 demonstrably exhibited all four failure modes under load. The harness was the correct engineering response to measured limitations. The word "correct" is doing critical work in that sentence, because it was correct then.
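An inventory like this is easier to audit when it is written down as data rather than carried as team folklore. A minimal sketch, assuming a simple registry — the component names, assumption strings, and flags below are illustrative, not Anthropic's actual code:

```python
from dataclasses import dataclass

@dataclass
class HarnessComponent:
    """One piece of compensatory engineering and the limitation it encodes."""
    name: str
    assumption: str      # the specific, testable limitation this compensates for
    still_binding: bool  # does current testing show the limitation still holds?

# Illustrative inventory mirroring the four components above
INVENTORY = [
    HarnessComponent("context_reset",
                     "model loses coherence past a token threshold", False),
    HarnessComponent("sprint_decomposition",
                     "model cannot hold a full project in working memory", False),
    HarnessComponent("contract_negotiation",
                     "model drifts from scope without formal constraint", False),
    HarnessComponent("evaluator",
                     "model cannot objectively judge its own output", True),
]

# Components whose encoded assumption no longer holds are removal candidates
stale = [c.name for c in INVENTORY if not c.still_binding]
```

Keeping the flag next to the assumption forces the question every audit starts with: what, precisely, does this component think the model cannot do?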
What Changed
When Anthropic moved to Opus 4.6 with its 1M-token context window, the assumption landscape shifted:
- Context resets: removed. Context compaction — the model's ability to manage its own attention across a large window — proved sufficient. The failure mode the resets were designed to prevent no longer manifested at the same threshold.
- Sprint decomposition: removed. A single continuous session could hold the full build. The working memory limitation that necessitated sprints was no longer the binding constraint.
- Contract negotiation: removed. The model held scope without formal constraint mechanisms. Scope drift, previously a reliable failure mode, became rare enough that the overhead of preventing it exceeded the cost of the occasional correction.
- Evaluator: kept, but repositioned. Moved from per-sprint evaluation to end-of-build only. The evaluator's role shifted from continuous quality gating to final adversarial review.
The harness went from a complex multi-sprint orchestration — Planner, Contract Negotiation, Generator (per sprint), Evaluator (per sprint), Context Reset (between sprints) — to a three-stage pipeline: Planner, Generator (one shot), Evaluator (at the end). Architecturally cleaner. Not necessarily cheaper — $125 for a four-hour DAW build is not a cost reduction story. But structurally simpler, with fewer failure modes and less coordination overhead.
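The resulting shape can be sketched in a few lines. The planner, generator, and evaluator here are opaque callables, and the signatures and retry logic are assumptions for illustration, not Anthropic's implementation:

```python
def run_build(task, planner, generator, evaluator, max_retries=1):
    """Simplified three-stage pipeline: plan once, generate in one
    continuous session, run adversarial evaluation only at the end."""
    plan = planner(task)
    output = generator(task, plan)           # one shot: no sprints, no resets
    for _ in range(max_retries):
        verdict = evaluator(task, output)    # end-of-build review, not per-sprint
        if verdict["pass"]:
            break
        # feed the evaluator's findings back for a corrective pass
        output = generator(task, plan, feedback=verdict["issues"])
    return output
```

Compare this with the sprint-based version, which would need reset and handoff logic between every generator call; here the only coordination point is the final review.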
The Competitive Intelligence Lens
Signal: Capability improvements obsolete infrastructure at a rate most organizations fail to track. The same dynamic applies to intelligence collection methods made redundant by new sensors, defensive postures invalidated by new attack surfaces, and market strategies undermined by shifts in the competitive environment. The constant is that the infrastructure built to compensate for a limitation persists long after the limitation is gone.
This pattern — assumption decay — is one that CI analysts should recognize immediately, because it operates in every domain where capability and infrastructure interact.
Organizations that prune complexity faster win. Dead assumptions in your architecture are carrying costs: latency from unnecessary coordination steps, token spend on redundant evaluation passes, maintenance burden on code that solves problems that no longer exist, and cognitive overhead for the team that has to understand why each component is there. The last cost is the most insidious. When an engineer asks "why do we do it this way?" and the answer is "because the model used to fail without it," you have an assumption that needs re-examination.
The meta-skill is assumption auditing. Not building the harness — that is straightforward engineering given known constraints. The difficult skill is maintaining an accurate inventory of which constraints are still binding and which have gone stale. Jeremy Knox has written about this in the context of engineering leadership: the same pattern governs when to add process versus when to remove it. Process that compensates for a team limitation is correct when the limitation exists. When the team has grown past it, the process becomes friction masquerading as discipline.
The audit cadence must match the capability change rate. In a domain where foundation model capabilities shift every few months, a harness designed six months ago is operating on assumptions that may be significantly stale. The organizations that treat their agent architecture as a living system — subject to the same continuous reassessment they apply to competitive positioning — will outperform those that build once and maintain indefinitely.
A Framework for Auditing Your Agent Architecture
Assumption Audit Protocol. For each component in your agent harness, ask the following five questions. Components that fail the audit are candidates for removal or simplification. Components that pass are load-bearing and should be reinforced.
1. What assumption does this encode? Be specific. Not "the model isn't reliable" but "the model loses coherence past 200K tokens when generating code that must maintain internal consistency across files." Vague assumptions cannot be tested.
2. Is the assumption still true? Test empirically, not theoretically. Run the pipeline without the component on representative tasks. Measure the failure rate. If the failure rate is within acceptable bounds, the assumption has decayed.
3. What is the carrying cost? Extra latency per run, additional token spend, code complexity that slows iteration, failure modes introduced by the component itself (coordination bugs, state management issues, timeout cascades). Every component has a cost. The question is whether the cost is still justified by the risk it mitigates.
4. What is the risk of removal? If the assumption turns out to still be partially true and the guardrail is gone, what fails? Catastrophic failure modes justify keeping components even with high carrying costs. Graceful degradation — the model produces slightly worse output but doesn't break — may not.
5. Can you test removal cheaply? A/B test with and without the component on a subset of tasks. Shadow mode — run the simplified pipeline alongside the full pipeline and compare outputs. If testing removal is expensive, that itself is an architectural smell: your system should be decomposable enough to evaluate individual components in isolation.
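Questions 2 and 5 reduce to a cheap shadow test. A sketch, under the assumption that you can run both pipeline variants offline and have a pass/fail judge for their outputs — all names here are hypothetical:

```python
def audit_component(tasks, pipeline_with, pipeline_without, judge, tolerance=0.05):
    """Shadow-test removal of one component: run the pipeline with and
    without it on representative tasks and compare failure rates."""
    fail_with = sum(not judge(pipeline_with(t)) for t in tasks)
    fail_without = sum(not judge(pipeline_without(t)) for t in tasks)
    delta = (fail_without - fail_with) / len(tasks)
    return {
        "delta_failure_rate": delta,
        "safe_to_remove": delta <= tolerance,  # assumption has decayed if True
    }
```

If wiring up `pipeline_without` is hard, that is the architectural smell from question five: the component cannot be evaluated in isolation.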
Anthropic's own progression demonstrates this framework in action. They tested the simplified harness against the sprint-based harness, measured outcomes, and found that the capability improvement had obsoleted three of four components. The evaluator survived because adversarial review at the boundary still caught failure modes that the model's self-assessment missed.
The Evaluator Paradox
The component Anthropic kept — the adversarial evaluator — reveals something about the topology of model limitations. Context management, scope adherence, and working memory all improved with scale. Self-evaluation did not improve at the same rate.
This maps to a known pattern in intelligence analysis: the gap between collection capability and analytical self-awareness. An intelligence service can improve its sensors, expand its coverage, increase its processing speed — and still fail to recognize when its own analytical framework is producing distorted output. The external-facing capabilities scale. The inward-facing discipline does not scale at the same rate, because it requires a different kind of reasoning: adversarial reasoning about your own process.
The practical implication for harness design: evaluator investment should scale with task difficulty, not be a fixed overhead. When the task is within the model's demonstrated wheelhouse, light evaluation is sufficient. When the task stretches the model's limits — novel domains, complex multi-file coordination, ambiguous specifications — heavy evaluation is justified. The InDecision Framework applies the same principle to decision-making: the depth of analysis should match the stakes of the decision. Not every competitor move requires a full intelligence assessment. Not every agent output requires adversarial review.
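One way to operationalize this is a difficulty score that gates evaluator depth. The signals, weights, and thresholds below are illustrative assumptions, not a validated model:

```python
def choose_evaluation(task):
    """Match evaluator investment to task difficulty instead of
    paying a fixed per-task overhead."""
    score = (
        0.4 * task.get("novel_domain", False)                 # outside the wheelhouse?
        + 0.3 * min(task.get("files_touched", 1) / 20, 1.0)   # coordination load
        + 0.3 * task.get("spec_ambiguity", 0.0)               # 0.0 precise .. 1.0 vague
    )
    if score < 0.25:
        return "light"        # self-check only
    if score < 0.6:
        return "standard"     # one external review pass
    return "adversarial"      # full end-of-build adversarial evaluation
```

The point is the shape, not the weights: evaluation cost should track the probability that the model's self-assessment misses something.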
Signal: Self-evaluation capability lags behind task execution capability across model generations. This is not a temporary gap — it reflects a structural asymmetry between doing and judging. Architect your systems with this asymmetry as a persistent feature, not a temporary limitation to be solved.
What This Signals About the Industry
Reading between the lines requires accounting for incentive structures. Anthropic sells API access. Opus 4.6 with a 1M-token context window generates significantly more revenue per request than Sonnet 4.5 with sprint-based chunking. Their recommendation to stop chunking and run single continuous sessions is aligned with their commercial interest. This does not make the recommendation wrong — the architectural simplification is real and reproducible — but it means the claim "you don't need context resets" should be stress-tested against your specific workloads, not accepted on authority.
The broader convergence is more interesting than any single vendor's position. Stripe's Minions, the Ralph Wiggum Loop (generate-evaluate-retry), and spec-driven frameworks are all converging on the same architectural direction: less orchestration, better models, targeted evaluation. The orchestration layer is thinning across the industry because the models are absorbing capabilities that previously required external scaffolding.
The organizations that will struggle most are those with complex agent harnesses that were precisely right six months ago. Their architecture is a fossil record of model limitations that no longer apply, and the maintenance cost of that architecture is compounding while the benefit decays. The longer the audit is deferred, the wider the gap between what the harness assumes and what the model can actually do.
This is the same dynamic that traps intelligence organizations: collection systems designed for a previous threat environment continue operating at full overhead long after the threat has evolved. The infrastructure persists because it was expensive to build, because it has institutional defenders, and because no one has the mandate to audit assumptions that were once obviously correct.
The Architecture That Survives
The best agent architecture is not the most sophisticated one. It is the one that carries exactly the assumptions that are still true — no more, no less. Every component that compensates for a limitation the model no longer has is dead weight: latency you are paying for, complexity your team is maintaining, failure modes you are debugging, all in service of a problem that has already been solved at a lower layer.
Audit yours. Map each component to the assumption it encodes. Test whether the assumption holds against current model capabilities. Measure the carrying cost. Remove what is no longer load-bearing. Reinforce what still is.
The cadence of that audit is not annual. It is not quarterly. In a domain where foundation model capabilities shift materially every few months, assumption decay is measured in weeks. The organizations that internalize this — that treat their agent architecture as a living hypothesis rather than a settled infrastructure decision — will compound the advantage. The ones that build once and maintain indefinitely will find themselves, six months from now, running a sophisticated system that solves yesterday's problems at tomorrow's cost.