// Proof of depth

Research

Field notes on making agents reliable - hardening, failure modes, and the patterns behind production-grade work.

Position Paper - May 2026

The Trust Gap: Why Model Evaluation Is Not Agent Evaluation

Model benchmarks measure capability. Agent deployments depend on six harness components (R,M,C,S,O,G) that benchmarks never test. This position paper argues that the trust gap between model scores and agent reliability is the central problem in production AI.

Model evaluation and agent evaluation measure fundamentally different things
Six-component harness framework (R,M,C,S,O,G) reveals risks invisible to model benchmarks
Sleep-like consolidation improves deep reasoning by 12-23% without model changes
AgentVet Harness Scorecard provides the first standardized vetting framework for agent deployments
Research Paper - May 2026

The MCP Security Crisis: 9,400 Servers, 150M Downloads, Zero Guardrails

Drawing on the OX Security disclosure of April 2026, 10+ CVEs from a single architectural root cause, and analysis of the 9,400-server MCP ecosystem, this paper documents the widening gap between MCP adoption and security tooling.

Single STDIO architectural flaw spawned 10+ critical CVEs across the ecosystem
9 of 11 MCP registries were successfully poisoned with malicious test payloads
Ecosystem growing at 18% month-over-month while security tooling lags far behind
Only 3 of 13 tracked CVEs have been patched; root cause unaddressed at protocol level
Research Paper - May 2026

Five Eyes Agentic AI Guidance: What It Says, What It Means, What to Do Next

On May 1, 2026, six national cybersecurity agencies published the first coordinated multinational guidance on securing autonomous AI agents. This paper distills the 30-page guidance into its 5 risk categories, 23 attack surfaces, and an action checklist for organizations deploying agentic AI today.

Five Eyes agencies confirm that agentic AI is already deployed inside critical infrastructure today
87% of organizations deploying AI agents lack adequate governance structures
Prompt injection identified as the most pervasive and difficult-to-mitigate threat
PocketOS incident: agent deleted production database in 9 seconds using an API token it was never meant to access
Security Advisory - May 2026

BadHost: CVE-2026-48710

CVE-2026-48710, discovered by X41 D-Sec during an OSTIF-sponsored audit, is a critical authentication bypass in Starlette that affects FastAPI apps, vLLM servers, LiteLLM proxies, and MCP gateways running a vulnerable Starlette version.

Host header manipulation allows attackers to bypass path-based auth middleware with a single crafted request
Every Starlette version before 1.0.1 is vulnerable - covering virtually all FastAPI, vLLM, and LiteLLM deployments
MCP gateways are disproportionately exposed due to unauthenticated OAuth discovery endpoints
Impact includes unrestricted LLM access, API key extraction, internal tool execution, and compute resource abuse
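
As a rough illustration of the version boundary stated above (every Starlette release before 1.0.1), the following minimal sketch classifies a version string as affected or not. The helper names `parse_release` and `is_affected` are hypothetical, and the parsing is deliberately simplified: it reads only the leading numeric release segment, so a pre-release such as `1.0.1rc1` is treated the same as `1.0.1`. It is not an official detection tool.

```python
import re

# First fixed release, taken from the advisory summary above.
PATCHED = (1, 0, 1)


def parse_release(version: str) -> tuple:
    """Extract the leading numeric release segment, e.g. '0.46.2' -> (0, 46, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_affected(version: str) -> bool:
    """Return True when the release segment sorts before the patched release."""
    release = parse_release(version)
    # Pad short versions so that '1.0' compares as (1, 0, 0).
    padded = release + (0,) * (len(PATCHED) - len(release))
    return padded < PATCHED


if __name__ == "__main__":
    for candidate in ("0.46.2", "1.0.0", "1.0.1", "1.2"):
        print(candidate, "affected" if is_affected(candidate) else "not affected")
```

In a real environment the installed version string could be read with the standard library call `importlib.metadata.version("starlette")` and passed to `is_affected`.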
Research Paper - May 2026

Your Agent Needs a Wiki and a Recording, Not a Bigger Desk

Bigger context windows are the wrong solution to agent memory. This paper argues that agents need two things humans use: a wiki (structured, searchable, curated knowledge) and a recording (raw session history for recall). We analyze the two patterns that work and why stuffing more tokens into context is an anti-pattern.

Context window growth has not solved the agent memory problem
Wiki pattern: structured, curated knowledge that agents can query on demand
Recording pattern: raw session history with semantic search for precise recall
Agents using wiki + recording outperform agents with 2x context window size
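
The two patterns above can be pictured as two small stores with different write and read rules. The sketch below is illustrative only and is not taken from the paper: the class names `Wiki` and `Recording` and their methods are hypothetical, and plain keyword overlap stands in for the semantic search the recording pattern calls for.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Wiki:
    """Curated, structured knowledge that the agent queries on demand."""
    pages: Dict[str, str] = field(default_factory=dict)

    def write(self, topic: str, text: str) -> None:
        # Curation: the latest edit replaces the page instead of piling up.
        self.pages[topic.lower()] = text

    def query(self, topic: str) -> Optional[str]:
        return self.pages.get(topic.lower())


@dataclass
class Recording:
    """Append-only raw session history, searched only when recall is needed."""
    events: List[str] = field(default_factory=list)

    def append(self, event: str) -> None:
        self.events.append(event)

    def search(self, query: str, limit: int = 3) -> List[str]:
        # Assumption: keyword overlap is a stand-in for semantic search.
        terms = set(query.lower().split())
        scored = []
        for index, event in enumerate(self.events):
            overlap = len(terms & set(event.lower().split()))
            if overlap > 0:
                scored.append((-overlap, index, event))
        scored.sort()  # best overlap first, then oldest first
        return [event for _, _, event in scored[:limit]]


if __name__ == "__main__":
    wiki = Wiki()
    wiki.write("Deploy", "Use blue-green releases.")
    recording = Recording()
    recording.append("user asked about deploy window")
    recording.append("agent ran tests")
    print(wiki.query("deploy"))
    print(recording.search("deploy window"))
```

The point of the split is that only the query result or the few matching events enter the context window, rather than the full history.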
Research Paper - April 2026

When Parallel Agents Outperform Single Agents

Drawing on Google Research/MIT's 260-configuration study, Stanford's equal-budget comparison, and 47 production deployments, this paper provides actionable thresholds for when to parallelize and when to run a single agent.

3 parallel agents is the sweet spot (+30-80% improvement)
Decomposability, not complexity, determines value
68% of multi-agent deployments are over-engineered
Orchestrator synthesis is the best merge strategy (+80.8%)