★ 6/10 · AI · 2026-02-04

Community Evals: Because we're done trusting black-box leaderboards over the community

Summary

Hugging Face has introduced a decentralized evaluation system designed to aggregate and transparently report benchmark scores across the Hub. This feature allows the community to contribute results via Pull Requests, linking model-specific performance data directly to benchmark datasets.

Key Points

  • Dataset repositories can register as official benchmarks (e.g., MMLU-Pro, GPQA, HLE) by implementing an eval.yaml configuration.
  • Evaluation specifications are standardized based on the Inspect AI format to ensure reproducibility.
  • Model evaluation scores are stored as .eval_results/*.yaml files within model repositories.
  • Community-submitted results via Pull Requests are labeled as "community" and can include links to external sources such as research papers, model cards, or Inspect AI evaluation logs (see the hedged submission sketch after this list).
  • All reported scores are accessible via Hugging Face Hub APIs, enabling the creation of custom dashboards and curated leaderboards.
  • Model authors maintain control over their repositories, including the ability to close score-related Pull Requests and hide specific results.
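
The official result schema isn't reproduced here, but as a hedged sketch of the submission flow (placeholder repo IDs, file name, and YAML fields; only the generic huggingface_hub upload call, which can open Pull Requests, is assumed), a community contribution could look roughly like this:

    # Hedged sketch: propose a community eval score as a Pull Request.
    # Repo IDs, the file name, and the YAML fields are illustrative
    # placeholders, not the official Community Evals schema.
    from huggingface_hub import HfApi

    api = HfApi()  # assumes an HF token is configured, e.g. via huggingface-cli login

    result_yaml = """\
    benchmark: example-org/example-benchmark   # placeholder dataset repo
    score: 0.73                                # placeholder metric value
    source: https://example.com/inspect-logs   # e.g. Inspect AI logs or a paper
    """

    # create_pr=True opens a Pull Request instead of committing to main,
    # so the model author can review, merge, or close the proposed result.
    commit = api.upload_file(
        path_or_fileobj=result_yaml.encode(),
        path_in_repo=".eval_results/example-benchmark.yaml",  # placeholder file name
        repo_id="example-org/example-model",                  # placeholder model repo
        repo_type="model",
        create_pr=True,
        commit_message="Add community eval result (illustrative)",
    )
    print(commit.pr_url)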

Technical Details

The system operates by decentralizing the reporting of scores, moving away from centralized, opaque leaderboards. Benchmark datasets act as aggregators that automatically collect reported results from across the Hub, provided the results align with the task definition in the eval.yaml file. On the model side, scores are hosted as .eval_results/*.yaml files within model repositories; these files are parsed to update model cards and to populate the leaderboards of registered benchmarks. Because the Hub is Git-based, the system maintains a complete version history of all evaluation updates, changes, and contributions.
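
As a minimal read-side sketch (the repo ID is a placeholder, the result schema is not assumed, and only generic huggingface_hub calls are used rather than any Community Evals-specific endpoint), parsing a model's reported scores and their Git history could look like this:

    # Hedged sketch: read whatever eval results a public model repo exposes.
    import yaml
    from huggingface_hub import HfApi, hf_hub_download

    api = HfApi()
    repo_id = "example-org/example-model"  # placeholder model repo

    # Find every result file stored under .eval_results/ in the model repo.
    result_files = [
        f for f in api.list_repo_files(repo_id, repo_type="model")
        if f.startswith(".eval_results/") and f.endswith(".yaml")
    ]

    for path in result_files:
        local = hf_hub_download(repo_id=repo_id, filename=path, repo_type="model")
        with open(local) as fh:
            # Loaded generically; the official field names are defined by the
            # benchmark's eval.yaml, not assumed here.
            print(path, yaml.safe_load(fh))

    # The Hub is Git-based, so the history of evaluation updates is queryable
    # through the ordinary commit log of the repository.
    for commit in api.list_repo_commits(repo_id, repo_type="model")[:5]:
        print(commit.commit_id[:7], commit.title)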

Impact / Why It Matters

Developers can leverage a unified, API-accessible source of truth for model performance, reducing the manual effort required to aggregate fragmented scores from various papers and model cards. This transparency allows for more accurate tracking of model capabilities across evolving benchmarks.
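
For instance, a small curated-leaderboard script could reuse the same generic Hub calls to collect every reported result file from a handful of models (repo IDs are placeholders, and the parsed YAML is kept as-is rather than assuming official field names):

    # Hedged sketch: aggregate reported results across several model repos
    # into one structure that a custom dashboard could render.
    import yaml
    from huggingface_hub import HfApi, hf_hub_download

    api = HfApi()
    models = ["example-org/model-a", "example-org/model-b"]  # placeholder repos

    leaderboard = {}
    for repo_id in models:
        results = []
        for path in api.list_repo_files(repo_id, repo_type="model"):
            if path.startswith(".eval_results/") and path.endswith(".yaml"):
                local = hf_hub_download(repo_id=repo_id, filename=path, repo_type="model")
                with open(local) as fh:
                    results.append(yaml.safe_load(fh))
        leaderboard[repo_id] = results

    for repo_id, results in leaderboard.items():
        print(repo_id, len(results), "reported result file(s)")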

ai llm evaluation

↳ Sources