State of Benchmarks
TL;DR: Benchmarks aren't useless--but single scores hide difficulty distributions, top-of-leaderboard differences are mostly noise, and lifetimes are shrinking from years to months. Labs cherry-pick which evals to report while methodology stays opaque. We need benchmarks that report richer metrics, resist saturation longer, and eventually serve as trust infrastructure for governance—not just marketing.
2025 was a year of benchmaxxing. New models came out every day, topping benchmark leaderboards -- yet if you actually used them, the promised capabilities never fully materialized. This isn't because benchmarks are useless, but because of the way we treat them.
Benchmarks and how we use them are flawed
SWE-Bench Verified has become the reference point for coding ability because it feels realistic: it is grounded in real issues from real repositories. It is, however, a deeply problematic benchmark that primarily evaluates bug fixes on 12 Python repositories (very Django-heavy). Even the original SWE-Bench paper acknowledges these limitations. Given that Claude Opus 4.5 and GPT-5.2 still report performance on SWE-Bench Verified, it serves as a nice framework for examining general issues with benchmarks:
- Benchmark scores are lossy. Every benchmark has some task difficulty distribution, yet we mostly see a single score reported. If two models both get 50% on SWE-Bench Verified, but one passed all the easy tasks while the other failed a couple of easy ones and aced some difficult ones, they are not the same model. Pure scores reduce a model's trajectories to a single bit of information that doesn't capture planning ability, backtracking, efficiency, or robustness. Certainly 50% means something more than 30%, but the nuance of how those scores are achieved is hidden.
- Near saturation, top performance is noise. Claude Opus 4.5 scores 80.9%, GPT-5.2 scores 80.0%, and Gemini-3-Flash scores 78.0%, but those exact deltas don't reliably quantify meaningful differences in real SWE work. Near the top, these scores are mostly dominated by scaffolding choices and benchmark noise. Epoch AI found that simply switching scaffolds causes 11-15 percentage point performance swings--larger than the gaps between frontier models. Obsessing over small point differences is a waste of time.
- Some tasks are genuinely broken or missing context (though to be fair, Epoch AI estimates the overall error rate is only 5-10%--the human vetting process filtered out most bad samples):
- django__django-10999 proposes a change to parse negative datetime strings, but the golden patch uses an undisclosed design decision to only support leading negatives.
- pydata__xarray-6992 describes a single isolated issue, but the actual fix is 150+ lines long and addresses 5 different issues. As far as I could tell from the publicly available SWE-Bench Verified results, this task has never been solved across 100+ submissions.
- Benchmarks measure much narrower capabilities than they appear to. SWE-Bench Verified tests the ability to fix well-specified bugs in mature Python codebases with strong test coverage. But leaderboards report "coding ability" and readers fill in the rest. Benchmark scores are lossy in both directions: by most practical measures, Claude Opus 4.5 feels1 like a much better coding agent than its 80% SWE-Bench Verified score would suggest. This isn't to say that the benchmark's creators are being misleading -- the original paper merely claimed to study the frontier of model coding capabilities, which it has. It's more that benchmarks grow much larger than their creators intend. The same could be said for ARC-AGI: it is not an AGI benchmark but merely measures some narrow slice of model generalization.
- Benchmarks become sanity checks before most people realize. SWE-Bench Verified contains 194 problems (of 500 total) solvable by humans in under 15 minutes. Even with a conservative 75% solve rate on just those, ~29% overall becomes a bare minimum for a decent coding model. Below that floor, SWE-Bench Verified does a fine job of filtering out bad coding ability, but above a certain threshold the marketing utility of a good score persists much longer than its discriminative utility.
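To make the lossy-scores point concrete, here's a toy sketch (the models and per-task results are entirely hypothetical) of two models with identical aggregate scores but very different difficulty profiles:

```python
# Toy illustration: a single leaderboard score collapses per-task
# pass/fail results and hides the difficulty distribution underneath.

def aggregate(results):
    """Collapse bucketed pass/fail results into one leaderboard number."""
    total = sum(len(tasks) for tasks in results.values())
    solved = sum(sum(tasks) for tasks in results.values())
    return solved / total

# 1 = solved, 0 = failed, bucketed by (hypothetical) task difficulty
model_a = {"easy": [1] * 10, "medium": [1] * 5 + [0] * 5, "hard": [0] * 10}
model_b = {"easy": [1] * 7 + [0] * 3, "medium": [1] * 6 + [0] * 4, "hard": [1] * 2 + [0] * 8}

print(aggregate(model_a))  # 0.5
print(aggregate(model_b))  # 0.5 -- same score, very different model
for bucket in ("easy", "medium", "hard"):
    print(bucket, sum(model_a[bucket]) / 10, sum(model_b[bucket]) / 10)
```

Model A is a reliable easy-task solver that never cracks a hard problem; model B drops easy tasks but occasionally solves hard ones. The leaderboard cannot tell them apart.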
This does not mean public benchmarks are useless -- they're just better as filters than pure rankings. If you're picking a coding model, take them for a spin instead of choosing arbitrarily from imperfect leaderboards.
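A back-of-envelope calculation also shows why small top-of-leaderboard deltas deserve skepticism. Treating each of ~500 tasks as an independent Bernoulli draw (a simplification -- correlated failures and scaffold variance make the real noise larger, not smaller):

```python
import math

# Sampling noise for a benchmark score of ~80% over ~500 tasks,
# modeled as a binomial proportion.
n = 500   # roughly SWE-Bench Verified's task count
p = 0.80  # approximate frontier score

se = math.sqrt(p * (1 - p) / n)   # binomial standard error
ci95 = 1.96 * se                  # 95% confidence half-width

print(f"standard error:      {se:.3%}")   # ~1.8%
print(f"95% CI half-width:   {ci95:.3%}")  # ~3.5%
# A 0.9-point gap (80.9 vs 80.0) sits comfortably inside this band,
# before even accounting for 11-15 point scaffold-induced swings.
```

Under these assumptions, any gap under roughly 3.5 points between two ~80% scores is statistically indistinguishable from noise.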
Shrinking lifetimes
Even if we fixed benchmark reporting, the lifecycle problem remains: benchmarks are defeated after just a few months.
When a benchmark saturates, the simple solution is to create a new one. The lifecycle is predictable: benchmark appears -> models improve faster than expected -> benchmark saturates -> harder variant appears -> repeat. What's changed is the lifetime of these benchmarks.
- ARC-AGI-1 held for an impressive five years. But GPT-5.2-Pro already hit 54% on ARC-AGI-2 within the year.
- SWE-Bench Verified had ~15 months before frontier models outgrew its ability to discriminate.
- Terminal Bench, targeting agent capabilities in terminal environments, launched version 1.0 in May 2025; by November, version 2.0 was already out.
Why is this accelerating? Models are improving faster. Agent harnesses and architectures are improving. Data contamination spreads faster. Benchmarks are being gamed and optimized against.
New variants constantly show up to address weaknesses. SWE-Bench Pro, for example, adds harder problems across more repositories and languages compared to SWE-Bench Verified and claims to be contamination-proof thanks to "GPL-style copyleft repositories".
Benchmark churn is expected, but it creates an issue where most benchmarks end up as lagging indicators. By the time a capability is standardized enough to eval cleanly, frontier models are saturating it.
Cherry-Picking and Convergence
LLMs are general enough that the proliferation of so many specialized benchmarks is expected; a benchmark zoo, if you will. However, this creates a situation where model creators can cherry-pick exactly which benchmarks they report.
Consider:
- OpenAI's GPT-5.2 release reported SWE-Bench Verified scores. But when GPT-5.2-Codex launched a week later -- a model explicitly optimized for coding -- SWE-Bench Verified was nowhere in the announcement. Did Codex underperform Claude Opus 4.5? We don't know, because methodology (harnesses, test-time scaling, environments) is rarely standardized and the evals that actually drive decisions stay private. Public benchmarks are becoming market theater.
- Meta's Llama 4 went through 27 different variants on LMArena before being announced publicly. A clear-cut case of cherry-picking.
Ambitious, high-quality2, well-designed, well-maintained benchmarks will combat this -- they're hard to ignore and create pressure towards convergence.
Benchmarks Should Be Even More Ambitious
What does an ambitious benchmark look like?
Even the trendiest coding benchmarks don't fully measure what matters for real software engineering:
- Working with large, closed-source codebases
- Gathering context across scattered documentation
- Operating within business constraints
- Making architectural design decisions with tradeoffs
- Problems where unit tests can't capture correctness
- Human-in-the-loop and Agent-in-the-loop collaboration
These types of capabilities are harder to measure and that's exactly why capturing them publicly is important. Public benchmarks matter because they build trust by allowing independent model verification, help guide research directions, and accelerate the field as a whole. In Ofir Press' blog on building good LLM benchmarks, he calls for benchmark creators to "think of benchmarks that would have systems achieving -200% at launch. Find questions that are so hard that even if the models improve 3x they'll still get zero".
Concretely, this means benchmarks that:
- Report score distributions and difficulty breakdowns, not just single numbers
- Capture full trajectories: planning, backtracking, efficiency--not just pass/fail
- Require standardized methodology reporting (harnesses, compute, environments)
- Resist saturation long enough to remain useful
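As a sketch of what richer reporting could look like -- the result format and difficulty buckets here are hypothetical, loosely modeled on SWE-Bench Verified's human-time annotations -- a report might bundle the overall score with per-difficulty breakdowns and a bootstrap confidence interval instead of one number:

```python
import random

def report(results, n_boot=2000, seed=0):
    """Richer benchmark report: overall score, bootstrap 95% CI,
    and per-difficulty-bucket breakdown (sketch, not a standard)."""
    rng = random.Random(seed)
    flat = [r for bucket in results.values() for r in bucket]
    # Bootstrap: resample tasks with replacement, collect the mean each time
    means = sorted(
        sum(rng.choices(flat, k=len(flat))) / len(flat) for _ in range(n_boot)
    )
    lo, hi = means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]
    return {
        "overall": sum(flat) / len(flat),
        "ci95": (round(lo, 3), round(hi, 3)),
        "by_difficulty": {k: sum(v) / len(v) for k, v in results.items()},
    }

# Hypothetical per-task pass/fail results, bucketed by human solve time
results = {
    "<15min": [1] * 160 + [0] * 34,
    "15min-1h": [1] * 90 + [0] * 110,
    "1h+": [1] * 10 + [0] * 96,
}
print(report(results))
```

A report like this makes both failure modes above visible at a glance: two models with the same overall score but different difficulty profiles get different breakdowns, and overlapping confidence intervals flag when a leaderboard gap is noise.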
However, these are genuinely difficult to build. Curating realistic tasks at scale requires expertise. Environments need continual maintenance. Contamination requires active monitoring and mitigation. Tasks also require the right abstractions to measure what the benchmark intends. This creation process is slow, expensive work, and running ever-larger evals grows increasingly expensive, which can disincentivize labs from running them.
Ambitious benchmarks should not only compel labs to adopt them, but also drive research directions with public input.
Afterword: Future of Benchmarks
All that said, when benchmarks eventually shift away from capability discovery3, what is their remaining purpose?
In my opinion, they shift to:
- regression detection
- behavior monitoring
- regulatory audits
- safety and compliance
Leaderboard style evaluations will persist for two main reasons:
- Marketing: Labs will want public numbers to cite
- Governance: regulators and institutions will need standardized evaluation before deploying AI in high-stakes areas. These situations require external verification by definition. Benchmarks become trust infrastructure.
Evals--continuous, task-specific testing--are slightly different. Regression testing, behavior monitoring: these are inherently internal. They reflect specific use cases, users, and failure modes.
As benchmarks move past capability discovery, the shift towards compliance and external verification only works if rigorous benchmarks already exist.
1 This real-world performance could be partially attributed to the Claude Code harness, of course ↩
2 High-quality meaning: tasks are human-validated, environments are reproducible, test cases aren't flaky, and the benchmark is actively maintained against contamination and gaming. ↩
3 As long as LLM capabilities are mostly a subset of human capabilities, it's easy to create contrived "hard" benchmarks but these will matter less and less ↩