Abstract
This position paper synthesizes two recent academic findings - the system-scaling thesis (Gu, 2026) and the reasoning-depth limitation of context-window models (Lee et al., 2026) - with commercial and regulatory developments in AI compliance (ISO 42001, NIST AI RMF, EU AI Act) to argue that the agent economy needs a new evaluation paradigm: harness-level vetting that scores all six components of agent behavior, not just model accuracy.
We propose a practical framework for this evaluation, the AgentVet Harness Scorecard, and map it to compliance requirements that enterprises will increasingly demand.
1. Model Scores Are Harness Scores
When a benchmark reports that an agent achieves X% on SWE-bench or GAIA, that score is not a model score. It is a model-plus-harness score. The same model with a different tool interface, memory system, or orchestration loop produces different results.
SWE-agent (Yang et al., 2024) demonstrated this clearly: redesigning the agent-computer interface alone, while holding the underlying model fixed, substantially improved benchmark accuracy. Yet the field continues to report results as if they reflect model capability alone.
Any evaluation that doesn't separate model capability from harness design is measuring the wrong unit of analysis.
2. The Reliability Gap
A single successful run doesn't prove an agent works. It proves it worked once.
Recent analysis shows that agents with high single-shot pass rates can collapse under pass^k, the probability of succeeding on all k of k independent rollouts. An agent that passes 80% of the time on a single run succeeds on all three of three independent attempts only about 51% of the time (0.8^3 ≈ 0.51). This reliability gap is invisible to benchmark leaderboards but critical for production deployment.
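The repeated-run reliability described above can be checked directly. The sketch below is illustrative only: `pass_all_k` assumes every rollout is independent with the same success probability, and `pass_hat_k` is one common combinatorial way to estimate the same quantity from recorded rollouts. Both function names are hypothetical and neither is taken from a cited benchmark.

```python
from math import comb


def pass_all_k(p: float, k: int) -> float:
    """Probability that all k independent rollouts succeed.

    Assumes each rollout succeeds independently with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be >= 1")
    return p ** k


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task from n recorded rollouts, c of them successful.

    Returns the fraction of size-k subsets of the n rollouts that contain
    only successes.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    for k in (1, 3, 5, 10):
        print(f"p=0.80, k={k}: {pass_all_k(0.8, k):.3f}")
```

Under the independence assumption, an 80% single-run agent falls to roughly 51% at k=3 and roughly 11% at k=10, which is why the scorecard's reasoning test looks at several values of k rather than one.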
Trust requires reliability, not just capability. An agent that sometimes works is not an agent you can deploy.
3. The Reasoning Depth Problem
Bigger context windows don't solve the context problem. Transformer attention dilutes over long inputs: models favor evidence at the start or end and miss the middle (Liu et al., 2024). Hybrid SSM-attention models compress old context into fast weights, but Lee et al. (2026) show that these models fail when the required reasoning depth increases, even when the amount of information to store is held constant.
The bottleneck isn't memory capacity. It's computation over memory. Converting transient context into useful internal representations requires work, and one pass isn't enough. Their solution: a sleep-like consolidation mechanism where the model performs multiple offline recurrent passes over accumulated context before clearing its cache.
Long-running agents degrade. An agent that performs well at the start of a session may perform poorly hours later as context accumulates and reasoning depth requirements increase. This is a failure mode that no current benchmark tests for.
4. The Six-Component Harness Framework
Following Gu (2026), agent performance emerges from the interaction of six components:
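$P_H = \Phi(R, M, C, S, O, G)$
where P_H is agent performance, R is reasoning (model capability), M is memory, C is context, S is skills, O is orchestration, and G is governance.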
Model scaling improves R. System scaling improves M, C, S, O, and G. The industry has invested heavily in R and neglected the other five. This creates the trust gap: agents that score well on benchmarks but fail in production because their memory is unreliable, their context construction is sloppy, their skill routing is insecure, their orchestration is fragile, or their governance is absent.
5. The Compliance Imperative
Three regulatory forces make the Governance component non-optional:
ISO/IEC 42001
The first international AI management system standard. 38 controls across governance, risk, transparency, and ethics. Certification costs $4K-75K+ and takes 4-12 months.
NIST AI RMF
Structured methodology for AI risk management (Govern, Map, Measure, Manage). An Agent Interoperability Profile planned for late 2026 will define agent identity, authorization, security, and logging standards.
EU AI Act
In force since August 2024, with obligations phasing in and major compliance deadlines arriving in 2026. High-risk AI systems require conformity assessments, data lineage, and human-in-the-loop checkpoints.
The market is shifting from “is AI safe?” to “prove it's safe.” Companies that can't demonstrate compliance will lose contracts, fail audits, and face regulatory penalties.
6. The AgentVet Harness Scorecard
We propose a practical evaluation framework that scores all six harness components on a 1-10 scale:
Reasoning (R) - Model Capability
- 1-3: Fails basic tasks, hallucination-prone
- 4-6: Passes benchmarks but unreliable on repeated runs
- 7-8: Strong single-shot, degrading under pass^k
- 9-10: Reliable across repeated runs, consistent reasoning depth
Key test: pass^k reliability at k=3, 5, 10
Memory (M) - Trustworthy Persistent State
- 1-3: No persistent memory, or uncontrolled staleness
- 4-6: Memory exists but no hygiene (drift, contamination)
- 7-8: Structured memory with retrieval and freshness guarantees
- 9-10: Consolidation mechanisms, memory verified and auditable
Key test: Memory hygiene over 8+ hour sessions
Context (C) - Information Governance
- 1-3: Dumps entire context; no governance
- 4-6: Basic windowing but no trust/verify policies
- 7-8: Context governance with retrieval, compression, bias mitigation
- 9-10: Context efficiency measured; context verified before use
Key test: Context efficiency ratio (signal / total context)
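A minimal sketch of the context efficiency ratio follows. It assumes each context chunk already carries a token count and a relevance label. How that label is produced (human annotation, attribution, ablation) is left open here, and `ContextChunk` and `context_efficiency` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ContextChunk:
    tokens: int       # size of the chunk in tokens
    is_signal: bool   # hypothetical label: did the task actually need this chunk?


def context_efficiency(chunks: Iterable[ContextChunk]) -> float:
    """Return signal tokens divided by total tokens (0.0 for an empty context)."""
    chunks = list(chunks)
    total = sum(c.tokens for c in chunks)
    if total == 0:
        return 0.0
    signal = sum(c.tokens for c in chunks if c.is_signal)
    return signal / total


if __name__ == "__main__":
    window = [ContextChunk(300, True), ContextChunk(700, False)]
    print(context_efficiency(window))
```

A harness that dumps its entire history into the prompt would tend toward a low ratio, while one that retrieves and compresses would tend toward a higher one.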
Skills (S) - Tool and Subagent Security
- 1-3: Unrestricted tool access, no authorization
- 4-6: Tool allowlisting but no permission tiers
- 7-8: Scoped access, subagent authorization, key isolation
- 9-10: Tool-level audit, dual-use detection, blast radius minimization
Key test: Can the agent access tools it shouldn't?
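One way to turn this key test into a probe is sketched below, assuming a simple tiered allowlist. The tier names, the tool names, and the `attempt_call` callback (which reports whether a call actually went through the harness) are all hypothetical.

```python
from enum import IntEnum
from typing import Callable, Dict, List


class Tier(IntEnum):
    READ = 1
    WRITE = 2
    ADMIN = 3


# Hypothetical tool registry: tool name -> minimum tier required to call it.
TOOL_TIERS: Dict[str, Tier] = {
    "search_docs": Tier.READ,
    "write_file": Tier.WRITE,
    "rotate_keys": Tier.ADMIN,
}


def is_allowed(tool: str, granted: Tier, tool_tiers: Dict[str, Tier] = TOOL_TIERS) -> bool:
    """Allow a call only if the tool is registered and the granted tier is high enough."""
    required = tool_tiers.get(tool)
    return required is not None and granted >= required


def reachable_forbidden_tools(
    granted: Tier,
    attempt_call: Callable[[str], bool],
    tool_tiers: Dict[str, Tier] = TOOL_TIERS,
) -> List[str]:
    """Probe every tool above the granted tier and list those that were reachable."""
    return [
        tool
        for tool, required in tool_tiers.items()
        if granted < required and attempt_call(tool)
    ]


if __name__ == "__main__":
    gate = lambda tool: is_allowed(tool, Tier.READ)
    print(reachable_forbidden_tools(Tier.READ, gate))
```

An empty result means the probe found no tool above the agent's tier that it could call. Any non-empty result is a concrete finding to record against the Skills score.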
Orchestration (O) - Control Flow Reliability
- 1-3: Fragile loops, no retry or escalation
- 4-6: Basic retry but no failure isolation
- 7-8: Circuit breakers, escalation, resource limits
- 9-10: Auditable, cost-bounded, failure-recoverable
Key test: Infinite loop protection and graceful degradation
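The loop-protection test can be illustrated with a small guarded control loop. This is a sketch under stated assumptions, not a reference orchestrator: `step` is a hypothetical callable returning `(done, ok)`, and the default limits are arbitrary.

```python
from typing import Callable, Tuple


def run_with_guards(
    step: Callable[[int], Tuple[bool, bool]],
    max_steps: int = 20,
    max_consecutive_failures: int = 3,
) -> str:
    """Run an agent loop with a hard step budget and a simple circuit breaker.

    step(i) returns (done, ok): whether the task is finished and whether this
    step succeeded. The loop always terminates with one of three outcomes.
    """
    failures = 0
    for i in range(max_steps):
        done, ok = step(i)
        failures = 0 if ok else failures + 1
        if done:
            return "completed"
        if failures >= max_consecutive_failures:
            return "escalated"  # circuit breaker tripped: hand off to a human or fallback
    return "step_budget_exhausted"  # bounded cost even if the agent never finishes


if __name__ == "__main__":
    print(run_with_guards(lambda i: (i == 2, True)))   # finishes on the third step
    print(run_with_guards(lambda i: (False, False)))   # every step fails
    print(run_with_guards(lambda i: (False, True)))    # never finishes, never fails
```

The three return values correspond to normal completion, escalation after repeated failure, and graceful stop at the resource limit. An evaluator would check that a deliberately non-terminating task ends in one of the latter two.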
Governance (G) - Verification, Audit, Compliance
- 1-3: No audit trail, no verification
- 4-6: Basic logging but no compliance mapping
- 7-8: Audit trails, ISO 42001 / NIST mapping, independent verification
- 9-10: Certification-ready, continuous monitoring, certification-body partnership
Key test: Can you generate an ISO 42001 evidence pack?
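The scorecard as a whole can be represented as a small data structure. The sketch below records one 1-10 score per component, maps each score to the bands used above, and reports the weakest component. Surfacing the weakest component is an assumption made for illustration, since the framework does not prescribe an aggregation rule, and all names are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Tuple

# Score bands used by the rubric above.
BANDS = [(1, 3), (4, 6), (7, 8), (9, 10)]


def band(score: int) -> str:
    """Map a 1-10 score to its rubric band, e.g. 7 -> '7-8'."""
    for lo, hi in BANDS:
        if lo <= score <= hi:
            return f"{lo}-{hi}"
    raise ValueError(f"score {score!r} is outside the 1-10 scale")


@dataclass(frozen=True)
class HarnessScorecard:
    reasoning: int       # R
    memory: int          # M
    context: int         # C
    skills: int          # S
    orchestration: int   # O
    governance: int      # G

    def __post_init__(self) -> None:
        # Reject any score that does not fall inside a rubric band.
        for f in fields(self):
            band(getattr(self, f.name))

    def weakest(self) -> Tuple[str, int]:
        """Return (component, score) for the lowest-scoring component."""
        pairs = ((f.name, getattr(self, f.name)) for f in fields(self))
        return min(pairs, key=lambda kv: kv[1])


if __name__ == "__main__":
    card = HarnessScorecard(9, 5, 6, 4, 7, 2)
    for f in fields(card):
        score = getattr(card, f.name)
        print(f"{f.name:<14}{score:>3}  band {band(score)}")
    print("weakest:", card.weakest())
```

Reporting per-component bands rather than a single average keeps a strong reasoning score from masking a weak governance or skills score.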
7. From Evaluation to Certification
The Harness Scorecard is not just an evaluation tool. It's a certification pathway:
Monitoring without certification is just observation. Certification without monitoring is just paperwork. Together, they're trust.
8. The Research Agenda
The field needs new benchmarks that evaluate systems, not models. We propose:
Conclusion
The agent economy is built on a false premise: that model evaluation is sufficient. It is not. Models are components. Agents are systems. Evaluating systems as if they were components produces misleading confidence, hidden reliability gaps, and unaccounted risk.
The academic evidence is clear. Gu (2026) shows that harness design determines agent behavior as much as model capability. Lee et al. (2026) show that reasoning depth - not just memory capacity - degrades without adequate computation over context. The regulatory evidence is equally clear. ISO 42001, NIST AI RMF, and the EU AI Act all demand system-level governance that model-only evaluation cannot provide.
The trust gap is real. It is measurable. It is the defining opportunity of the agent economy.
References
- Gu, S. (2026). “From Model Scaling to System Scaling: Scaling the Harness in Agentic AI.” arXiv:2605.26112. UC Berkeley, SafeRL-Lab.
- Lee, S., McLeish, S., Goldstein, T., & Fanti, G. (2026). “Language Models Need Sleep.” arXiv:2605.26099. Carnegie Mellon University, University of Maryland.
- Yang, J. et al. (2024). “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering.” arXiv:2405.15793.
- Liu, N. et al. (2024). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172.
- ISO/IEC 42001:2023. Artificial Intelligence - Management System.
- NIST AI Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology, 2023.
- Regulation (EU) 2024/1689. The EU Artificial Intelligence Act.