★ 7/10 · AI · 2026-04-29

AI evals are becoming the new compute bottleneck

Summary

AI evaluation is transitioning from a static, compressible task into a significant compute bottleneck. As benchmarks shift from simple text predictions to agentic rollouts and training-in-the-loop protocols, evaluation costs are increasingly surpassing pretraining costs.

Key Points

  • The Holistic Agent Leaderboard (HAL) spent approximately $40,000 executing 21,730 agent rollouts across 9 models and 9 benchmarks.
  • A single run of the GAIA benchmark can reach $2,829 on frontier models before caching is applied.
  • Agentic evaluation costs are highly sensitive to "scaffold" configurations, which can produce a 33× cost spread on identical tasks.
  • In scientific machine learning (SciML), The Well benchmark requires 3,840 H100-hours for a full four-baseline sweep, often exceeding the cost of training the individual architectures it evaluates.
  • API pricing variance strongly affects benchmark feasibility: Claude Opus 4.1 ($15/$75 per 1M input/output tokens) costs two orders of magnitude more than Gemini 2.0 Flash ($0.10/$0.40 per 1M tokens) on input alone (see the cost sketch after this list).
  • Replicating 20 ICML 2024 papers with PaperBench costs approximately $8,000 in API usage with the o1 IterativeAgent rollout.
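
To make the pricing gap concrete, the back-of-the-envelope sketch below compares the two quoted rate cards on a single token-heavy run. The run size (40M input / 4M output tokens) is an illustrative assumption, not a figure reported by any of the benchmarks above.

```python
# Back-of-the-envelope comparison of the two rate cards quoted above.
# The 40M-input / 4M-output run size is an illustrative assumption.
PRICES = {  # (input, output) in USD per 1M tokens
    "claude-opus-4.1": (15.00, 75.00),
    "gemini-2.0-flash": (0.10, 0.40),
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total API cost in USD for one evaluation run."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1e6

for model in PRICES:
    print(f"{model}: ${run_cost(model, 40_000_000, 4_000_000):,.2f}")
# claude-opus-4.1: $900.00
# gemini-2.0-flash: $5.60  (a roughly 160x gap on the same workload)
```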

Technical Details

The transition from static LLM benchmarks to agentic benchmarks has rendered traditional compression techniques, such as anchor points or item response theory for reducing dataset sizes, largely ineffective. While static benchmarks like MMLU or HELM can be shrunk 100× to 200× while preserving rank fidelity, agentic benchmarks are inherently noisy and scaffold-sensitive. In these environments, the evaluation cost is a product of the model, the scaffold, and the token budget. Because each item involves a multi-turn rollout, cost is driven by the length of the agent's trajectory rather than by a single prediction.
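
A minimal sketch of why trajectory length dominates, assuming a scaffold that resends the full conversation history on every turn (a common design, though not necessarily what any benchmark above uses); all parameters are illustrative:

```python
def rollout_cost(turns: int, obs_tokens: int, out_tokens: int,
                 system_tokens: int, price_in: float, price_out: float) -> float:
    """Cost in USD of one agent rollout, assuming the scaffold resends
    the full history every turn. Input grows linearly per turn, so
    total input cost grows roughly quadratically with trajectory length."""
    total_in = total_out = 0
    history = system_tokens
    for _ in range(turns):
        history += obs_tokens    # new observation appended to context
        total_in += history      # entire context billed as input
        total_out += out_tokens  # the agent's action for this turn
        history += out_tokens    # and it joins the history
    return (total_in * price_in + total_out * price_out) / 1e6

# Same task, two scaffold turn limits, Opus-class pricing ($15/$75 per 1M):
short = rollout_cost(10, 2_000, 500, 1_000, 15.0, 75.0)
long_ = rollout_cost(50, 2_000, 500, 1_000, 15.0, 75.0)
print(f"10 turns: ${short:.2f}, 50 turns: ${long_:.2f}, ratio {long_/short:.0f}x")
# 10 turns: $2.51, 50 turns: $50.06, ratio 20x
```

With these made-up numbers, raising the turn limit 5× raises cost roughly 20×, the same kind of scaffold-driven spread as the 33× figure above.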

In specialized domains like Scientific ML, evaluation compute can exceed training compute by two orders of magnitude. For example, the MLE-Bench setup for 75 Kaggle competitions requires 1,800 GPU hours on A10 instances plus significant API usage, totaling roughly $5,500 per seed. Similarly, benchmarks like ResearchGym and PaperBench introduce a "cost floor" determined by wall-clock compute and training time. In these scenarios, a token budget no longer bounds the maximum cost from above, as the evaluation process itself involves executing real-time training and research pipelines.
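
A sketch of that cost floor, treating the GPU/API split and the roughly $2/hour on-demand A10 rate as illustrative guesses rather than MLE-Bench's actual breakdown:

```python
def eval_cost_floor(gpu_hours: float, gpu_rate_usd: float,
                    api_cost_usd: float) -> float:
    """Lower bound on one evaluation seed: wall-clock compute is paid
    in full no matter how tightly the token budget is capped."""
    return gpu_hours * gpu_rate_usd + api_cost_usd

# 1,800 A10 GPU-hours at an assumed ~$2/hr, plus an assumed ~$1,900 of
# API usage, lands near the ~$5,500-per-seed figure cited above.
print(f"${eval_cost_floor(1_800, 2.0, 1_900):,.0f}")  # -> $5,500
```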

Impact / Why It Matters

The escalating cost of high-fidelity evaluation limits the ability of smaller organizations to conduct large-scale benchmarking and necessitates the development of "coarse-to-fine" evaluation strategies. Developers must account for the massive cost variance introduced by agent scaffolds and token-heavy rollouts when designing testing and validation pipelines.
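
One way such a coarse-to-fine pipeline might look in practice: screen every candidate configuration with a cheap proxy (a small task subset or a cheaper judge model), then spend the full benchmark budget only on the survivors. The function and configuration names below are hypothetical.

```python
from typing import Callable, Sequence

def coarse_to_fine(candidates: Sequence[str],
                   cheap_eval: Callable[[str], float],
                   full_eval: Callable[[str], float],
                   keep_top: int = 3) -> dict[str, float]:
    """Screen all candidates with an inexpensive pass, then run the
    expensive full benchmark only on the top performers."""
    screened = sorted(candidates, key=cheap_eval, reverse=True)
    return {c: full_eval(c) for c in screened[:keep_top]}

# Toy usage with stand-in scores; a real pipeline would call the
# evaluation harness on a small task subset for cheap_eval and on
# the full benchmark for full_eval.
proxy = {"cfg-a": 0.61, "cfg-b": 0.74, "cfg-c": 0.58, "cfg-d": 0.70}
print(coarse_to_fine(list(proxy), proxy.get,
                     lambda c: round(proxy[c] - 0.02, 2), keep_top=2))
# {'cfg-b': 0.72, 'cfg-d': 0.68}
```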

AI LLM DevOps