Position Paper · May 2026

The Trust Gap: Why Model Evaluation Is Not Agent Evaluation

The AI industry evaluates agents as if they were models. They are not. A model is a reasoning substrate. An agent is a system - and evaluating only the model while ignoring the surrounding harness produces misleading scores, hidden reliability gaps, and unaccounted risk.

Abstract

This position paper synthesizes two recent academic findings - the system-scaling thesis (Gu, 2026) and the reasoning-depth limitation of context-window models (Lee et al., 2026) - with commercial and regulatory developments in AI compliance (ISO 42001, NIST AI RMF, EU AI Act) to argue that the agent economy needs a new evaluation paradigm: harness-level vetting that scores all six components of agent behavior, not just model accuracy.

We propose a practical framework for this evaluation, the AgentVet Harness Scorecard, and map it to compliance requirements that enterprises will increasingly demand.

1. Model Scores Are Harness Scores

When a benchmark reports that an agent achieves X% on SWE-bench or GAIA, that score is not a model score. It is a model-plus-harness score. The same model with a different tool interface, memory system, or orchestration loop produces different results.

SWE-agent (Yang et al., 2024) demonstrated this clearly: redesigning the agent-computer interface alone - while holding the underlying model fixed - substantially improved benchmark accuracy. Yet the field continues to report results as if they reflect model capability alone.

Any evaluation that doesn't separate model capability from harness design is measuring the wrong unit of analysis.

2. The Reliability Gap

A single successful run doesn't prove an agent works. It proves it worked once.

Recent analysis shows that agents with high single-shot pass rates can collapse under pass^k - the probability of succeeding on all k of k independent rollouts. If runs are independent, an agent that passes 80% of the time on a single run passes three attempts in a row only about 51% of the time (0.8^3 ≈ 0.51). This reliability gap is invisible to benchmark leaderboards but critical for production deployment.
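The arithmetic behind this metric can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the first function assumes runs are independent with a fixed per-run success rate, and the second uses a common combinatorial estimator (the fraction of size-k subsets of the observed runs in which every run passed).

```python
from math import comb


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent runs succeed, given per-run success rate p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be >= 1")
    return p ** k


def estimate_pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate pass^k for one task from `trials` observed runs.

    Computes C(successes, k) / C(trials, k): the share of size-k subsets
    of the observed runs in which every run passed.
    """
    if not 1 <= k <= trials:
        raise ValueError("need 1 <= k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be in [0, trials]")
    # math.comb returns 0 when successes < k, so the estimate is 0.0 in that case.
    return comb(successes, k) / comb(trials, k)


if __name__ == "__main__":
    # Show how an 80% single-run pass rate decays as k grows.
    for k in (1, 3, 5, 10):
        print(f"k={k:>2}: pass^k = {pass_hat_k(0.8, k):.3f}")
```

In practice the estimate would be averaged over every task in a benchmark; the per-task form is shown here to keep the sketch short.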

Trust requires reliability, not just capability. An agent that sometimes works is not an agent you can deploy.

3. The Reasoning Depth Problem

Bigger context windows don't solve the context problem. Transformer attention dilutes over long inputs - models favor evidence at the start or end of the input and miss what sits in the middle (Liu et al., 2024). Hybrid SSM-attention models compress old context into fast weights, but Lee et al. (2026) show that these models fail when the required reasoning depth increases, even when the amount of information to store is held constant.

The bottleneck isn't memory capacity. It's computation over memory. Converting transient context into useful internal representations requires work, and one pass isn't enough. Their solution: a sleep-like consolidation mechanism where the model performs multiple offline recurrent passes over accumulated context before clearing its cache.

Long-running agents degrade. An agent that performs well at the start of a session may perform poorly hours later as context accumulates and reasoning depth requirements increase. This is a failure mode that no current benchmark tests for.

4. The Six-Component Harness Framework

Following Gu (2026), agent performance emerges from the interaction of six components:

$P_H = \Phi(R, M, C, S, O, G)$

R: Reasoning - The foundation model. Natural language, planning, inference. Over-evaluated by benchmarks.
M: Memory - Persistent storage, retrieval, freshness, durability. Under-evaluated; treated as implementation detail.
C: Context - What to retrieve, compress, order, refresh, trust. Barely evaluated.
S: Skills - Tool dispatch, subagent delegation, API invocation. Minimally evaluated.
O: Orchestration - Control flow, loops, error handling, retry logic. Not evaluated.
G: Governance - Verification, audit, access control, compliance. Not evaluated.

Model scaling improves R. System scaling improves M, C, S, O, and G. The industry has invested heavily in R and neglected the other five. This creates the trust gap: agents that score well on benchmarks but fail in production because their memory is unreliable, their context construction is sloppy, their skill routing is insecure, their orchestration is fragile, or their governance is absent.

5. The Compliance Imperative

Three regulatory forces make the Governance component non-optional:

ISO/IEC 42001

The first international AI management system standard. 38 controls across governance, risk, transparency, and ethics. Certification costs $4K-75K+ and takes 4-12 months.

NIST AI RMF

Structured methodology for AI risk management (Govern, Map, Measure, Manage). An Agent Interoperability Profile planned for late 2026 will define agent identity, authorization, security, and logging standards.

EU AI Act

In force since August 2024, with compliance deadlines hitting in 2026. High-risk AI systems require conformity assessments, data lineage, and human-in-the-loop checkpoints.

The market is shifting from “is AI safe?” to “prove it's safe.” Companies that can't demonstrate compliance will lose contracts, fail audits, and face regulatory penalties.

6. The AgentVet Harness Scorecard

We propose a practical evaluation framework that scores all six harness components on a 1-10 scale:

Reasoning (R) - Model Capability

  • 1-3: Fails basic tasks, hallucination-prone
  • 4-6: Passes benchmarks but unreliable on repeated runs
  • 7-8: Strong single-shot, degrading under pass^k
  • 9-10: Reliable across repeated runs, consistent reasoning depth

Key test: pass^k reliability at k=3, 5, 10

Memory (M) - Trustworthy Persistent State

  • 1-3: No persistent memory, or uncontrolled staleness
  • 4-6: Memory exists but no hygiene (drift, contamination)
  • 7-8: Structured memory with retrieval and freshness guarantees
  • 9-10: Consolidation mechanisms, memory verified and auditable

Key test: Memory hygiene over 8+ hour sessions

Context (C) - Information Governance

  • 1-3: Dumps entire context; no governance
  • 4-6: Basic windowing but no trust/verify policies
  • 7-8: Context governance with retrieval, compression, bias mitigation
  • 9-10: Context efficiency measured; context verified before use

Key test: Context efficiency ratio (signal / total context)
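One way this ratio could be computed is sketched below. Everything here is an assumption made for illustration: the `ContextChunk` type, the `used` flag (which presumes some upstream attribution step has labeled the chunks that actually contributed to the output), and the whitespace token count standing in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextChunk:
    text: str
    used: bool  # hypothetical label: did this chunk contribute to the answer?


def count_tokens(text: str) -> int:
    # Whitespace split as a crude stand-in for a real tokenizer.
    return len(text.split())


def context_efficiency(chunks: list[ContextChunk]) -> float:
    """Return signal tokens divided by total tokens placed in the context."""
    total = sum(count_tokens(c.text) for c in chunks)
    if total == 0:
        return 0.0  # an empty context carries no signal
    signal = sum(count_tokens(c.text) for c in chunks if c.used)
    return signal / total
```

The hard part in a real evaluation is deciding what counts as signal; the ratio itself is simple once that labeling exists.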

Skills (S) - Tool and Subagent Security

  • 1-3: Unrestricted tool access, no authorization
  • 4-6: Tool allowlisting but no permission tiers
  • 7-8: Scoped access, subagent authorization, key isolation
  • 9-10: Tool-level audit, dual-use detection, blast radius minimization

Key test: Can the agent access tools it shouldn't?
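A minimal sketch of how this test might be automated follows. The `ToolGate` class, the `find_scope_violations` helper, and the agent and tool names in the usage note are all hypothetical; the idea is simply to probe every known tool and report any that the harness allows but the intended policy does not.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class ToolGate:
    """Allowlist gate: an agent may call only the tools it was granted."""
    grants: dict[str, frozenset[str]]
    audit_log: list[tuple[str, str, bool]] = field(default_factory=list)

    def can_call(self, agent_id: str, tool: str) -> bool:
        allowed = tool in self.grants.get(agent_id, frozenset())
        # Record every decision, allowed or denied, for later audit.
        self.audit_log.append((agent_id, tool, allowed))
        return allowed


def find_scope_violations(
    intended: set[str],
    all_tools: Iterable[str],
    can_call: Callable[[str], bool],
) -> list[str]:
    """Return tools the harness permits that the intended policy does not."""
    return sorted(t for t in all_tools if t not in intended and can_call(t))
```

For example, if a gate grants a hypothetical `triage-bot` both `read_ticket` and `search_docs` while the intended policy lists only `read_ticket`, the probe reports `search_docs` as a violation.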

Orchestration (O) - Control Flow Reliability

  • 1-3: Fragile loops, no retry or escalation
  • 4-6: Basic retry but no failure isolation
  • 7-8: Circuit breakers, escalation, resource limits
  • 9-10: Auditable, cost-bounded, failure-recoverable

Key test: Infinite loop protection and graceful degradation
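The sketch below shows one simple form these protections could take: a hard step budget plus a circuit breaker on consecutive failures. It is an illustration under stated assumptions, not a prescribed design. The names `run_bounded` and `RunResult`, the default limits, and the convention that a step returns `None` until the task is finished are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    status: str            # "done", "step_limit", or "circuit_open"
    steps: int             # number of steps actually attempted
    output: Optional[str]  # final output, if the agent finished


def run_bounded(
    step: Callable[[int], Optional[str]],
    max_steps: int = 20,
    max_consecutive_failures: int = 3,
) -> RunResult:
    """Drive an agent loop that can neither spin forever nor fail silently."""
    failures = 0
    for i in range(max_steps):
        try:
            out = step(i)
        except Exception:
            failures += 1
            if failures >= max_consecutive_failures:
                # Circuit breaker: stop retrying and hand off for escalation.
                return RunResult("circuit_open", i + 1, None)
            continue
        failures = 0  # any successful step resets the breaker
        if out is not None:
            return RunResult("done", i + 1, out)
    # Step budget exhausted: degrade gracefully instead of looping.
    return RunResult("step_limit", max_steps, None)
```

A harness test can then feed in a step function that never finishes, or one that always raises, and check that the loop returns a bounded, explicit status in both cases.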

Governance (G) - Verification, Audit, Compliance

  • 1-3: No audit trail, no verification
  • 4-6: Basic logging but no compliance mapping
  • 7-8: Audit trails, ISO 42001 / NIST mapping, independent verification
  • 9-10: Certified-ready, continuous monitoring, cert body partnership

Key test: Can you generate an ISO 42001 evidence pack?
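As a rough illustration, the six scores above could be carried in a small validated record like the one below. The class and method names are hypothetical. Note in particular that the paper does not specify how the six scores combine into a single trust score; the weakest-link rule (taking the minimum) used here is an assumption chosen only to make the sketch concrete.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HarnessScorecard:
    reasoning: int
    memory: int
    context: int
    skills: int
    orchestration: int
    governance: int

    def __post_init__(self) -> None:
        # Every component is scored on the 1-10 scale.
        for name, value in asdict(self).items():
            if not isinstance(value, int) or not 1 <= value <= 10:
                raise ValueError(f"{name} must be an integer from 1 to 10, got {value!r}")

    def weakest(self) -> tuple[str, int]:
        """Return the lowest-scoring component and its score."""
        return min(asdict(self).items(), key=lambda kv: kv[1])

    def trust_score(self) -> int:
        # ASSUMPTION: weakest-link aggregation; the aggregation rule is not
        # defined in the paper.
        return self.weakest()[1]
```

Under this rule an agent with a strong model but absent governance still scores low overall, which mirrors the argument that model capability alone does not establish trust. A weighted average would be an equally plausible choice with different trade-offs.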

7. From Evaluation to Certification

The Harness Scorecard is not just an evaluation tool. It's a certification pathway:

1. Vet - Run the agent through the six-component scorecard. Produce a trust score.
2. Map - Map vetting results to compliance controls (ISO 42001 Annex A, NIST AI RMF, EU AI Act).
3. Evidence - Generate audit-ready documentation from vetting artifacts.
4. Monitor - Continuous compliance monitoring post-deployment. Drift detection.
5. Certify - Connect to accredited certification bodies for fast-tracked audits.

Monitoring without certification is just observation. Certification without monitoring is just paperwork. Together, they're trust.

8. The Research Agenda

The field needs new benchmarks that evaluate systems, not models. We propose:

Pass^k Reliability - Test agents across multiple independent runs. Single-shot accuracy is marketing. Reliability is engineering.
Memory Hygiene - Test whether agent memory degrades over 8+ hour sessions with reasoning depth controls.
Context Efficiency - Measure signal-to-noise ratio in agent context windows. Low efficiency = poor context governance.
Verification Cost - Quantify governance overhead as a fraction of total compute. Verification is not free.
Safe Evolution - Test whether self-modifying agents preserve alignment and reliability over time.

Conclusion

The agent economy is built on a false premise: that model evaluation is sufficient. It is not. Models are components. Agents are systems. Evaluating systems as if they were components produces misleading confidence, hidden reliability gaps, and unaccounted risk.

The academic evidence is clear. Gu (2026) shows that harness design determines agent behavior as much as model capability. Lee et al. (2026) show that reasoning depth - not just memory capacity - degrades without adequate computation over context. The regulatory evidence is equally clear. ISO 42001, NIST AI RMF, and the EU AI Act all demand system-level governance that model-only evaluation cannot provide.

The trust gap is real. It is measurable. It is the defining opportunity of the agent economy.

References

  1. Gu, S. (2026). “From Model Scaling to System Scaling: Scaling the Harness in Agentic AI.” arXiv:2605.26112. UC Berkeley, SafeRL-Lab.
  2. Lee, S., McLeish, S., Goldstein, T., & Fanti, G. (2026). “Language Models Need Sleep.” arXiv:2605.26099. Carnegie Mellon University, University of Maryland.
  3. Yang, J. et al. (2024). “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering.” arXiv:2405.15793.
  4. Liu, N. et al. (2024). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172.
  5. ISO/IEC 42001:2023. Artificial Intelligence - Management System.
  6. NIST AI Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology, 2023.
  7. Regulation (EU) 2024/1689. The EU Artificial Intelligence Act.