H-Score Methodology

Rules:
1. Benchmarks are omitted unless completed by all of the top 3 models.
2. The highest score a model has achieved on a benchmark is used.
3. All benchmarks are weighted evenly in the score.
4. Scores are averaged and rounded to the nearest whole number.
5. Scores above 99 are not rounded.
6. New benchmarks will be added only once ALL of the top 3 models complete them.
7. The AI model must be available to the public.
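Rules 3-5 can be sketched as a small Python function (a hypothetical illustration with assumed names; the site does not publish code):

```python
# Hypothetical sketch of the averaging rules above: benchmarks weighted
# evenly (rule 3), rounded to the nearest whole number (rule 4), except
# that averages above 99 are left unrounded (rule 5).

def h_score(benchmark_scores):
    """Evenly averaged score across all benchmarks."""
    avg = sum(benchmark_scores) / len(benchmark_scores)
    return avg if avg > 99 else round(avg)

print(h_score([88, 92, 94]))  # 91
```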



Number | Benchmark | Status
1 | GPQA | Included
2 | MMLU | Included
3 | AIME | Included
4 | LiveBench (Coding) | Included
5 | SWE-bench | Inferred*
6 | MATH | Omitted
7 | Humanity's Last Exam | Inferred*
8 | ARC-AGI | Omitted

*See Scoring Meta Data for inference details

Inference Methodology

To compare performance to a reference model, we first calculate the absolute difference between their scores on each shared benchmark i:

D_i = |S_model,i − S_ref,i|

This measures the raw gap between the two scores without considering which one is higher. Next, we normalize the difference by computing the average of both scores:

M_i = (S_model,i + S_ref,i) / 2

This ensures a fair comparison across benchmarks with different score ranges. Using these values, we calculate the percentage difference:

P_i = (D_i / M_i) × 100

which standardizes the comparison. We then compute the average percentage difference across the n benchmarks both models completed:

P_avg = (P_1 + P_2 + … + P_n) / n

This provides a balanced measure of the inferred model's overall performance gap relative to the reference models. Finally, we adjust the score by scaling down the reference model's score on the missing benchmark using the average percentage difference:

S_inferred = S_ref × (1 − P_avg / 100)

This adjustment proportionally reduces the reference score, so the model's inferred score reflects its performance relative to the reference models. This step-by-step approach provides a consistent, standardized method for evaluating a model's effectiveness across multiple tasks.

Inference scores are essential for obtaining a holistic score across benchmarks. AI companies often exclude benchmarks where their models underperform, which introduces bias into comparisons. The inference score mitigates that bias by providing a more balanced and accurate assessment of overall performance. See this equation in action in the Scoring Meta Data.
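The inference steps described above can be sketched in Python (a minimal illustration; the function name and two-argument interface are assumptions, not published by the site):

```python
# Minimal sketch of the inference method: per-benchmark percentage
# differences against a reference model, averaged, then used to scale
# down the reference model's score on the missing benchmark.

def inferred_score(reference_score, shared_pairs):
    """Estimate a model's missing benchmark score.

    reference_score: the reference model's score on the missing benchmark.
    shared_pairs: (inferred_model, reference_model) score pairs on the
    benchmarks both models completed.
    """
    pct_diffs = []
    for inferred, reference in shared_pairs:
        diff = abs(inferred - reference)        # absolute difference
        mean = (inferred + reference) / 2       # normalizing average
        pct_diffs.append(diff / mean * 100)     # percentage difference
    avg_pct = sum(pct_diffs) / len(pct_diffs)   # average across benchmarks
    return reference_score * (1 - avg_pct / 100)  # scale down the reference
```

If the two models score identically on every shared benchmark, the average percentage difference is zero and the inferred score equals the reference score, as the prose implies.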

Current Benchmark Scores


AI | Benchmark | Score
o3 DR | Humanity's Last Exam | 27
o3 DR | GPQA | 88
o3 DR | MMLU | 92
o3 DR | AIME | 94
o3 DR | SWE-bench | 72
o3 DR | LiveBench | 83
Grok-3 | Humanity's Last Exam | 14*
Grok-3 | GPQA | 85
Grok-3 | MMLU | 93
Grok-3 | AIME | 96
Grok-3 | SWE-bench | 68*
Grok-3 | LiveBench | 67
Claude 3.7 | Humanity's Last Exam | 9
Claude 3.7 | GPQA | 85
Claude 3.7 | MMLU | 90
Claude 3.7 | AIME | 80
Claude 3.7 | SWE-bench | 70
Claude 3.7 | LiveBench | 75

*Inferred score (see Scoring Meta Data).

See Scoring Meta Data for full list of all 5 models.
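As a worked check, assuming the H-Score is simply the even average of the six listed benchmark scores (rules 3-4), o3 DR's row above works out as follows:

```python
# Hypothetical worked example: even average of o3 DR's six listed
# benchmark scores, rounded to the nearest whole number.
o3_dr_scores = [27, 88, 92, 94, 72, 83]  # HLE, GPQA, MMLU, AIME, SWE, LiveBench
average = sum(o3_dr_scores) / len(o3_dr_scores)
print(round(average))  # 76
```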

Benchmarks


Benchmark | Link
GPQA | Research Paper
GPQA | Data Set
MMLU | Research Paper
MMLU | Data Set
AIME | Test Details
LiveBench (Coding) | Website
SWE-bench | Website
MATH | Research Paper
Humanity's Last Exam | Website
ARC-AGI | Website

Reporting Errors

If you believe the ranking is missing data or is incorrect, please direct message me on X.



© 2025 AGI-Race.com - All rights reserved.
