H-Score Methodology
Rules:
1. Benchmarks are omitted unless completed by the top 3 models.
2. The highest reported score for each benchmark is used.
3. The H-Score weighs all benchmarks evenly.
4. Benchmark scores are averaged and rounded to the nearest whole number (illustrated in the sketch below these rules).
5. Scores above 99 are not rounded.
6. New benchmarks will be added only once ALL top 3 models complete them.
7. An AI model must be publicly available to be ranked.
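As a concrete illustration of rules 3-5, here is a minimal sketch of the averaging and rounding logic. Python, the function name `h_score`, and the dictionary input are illustrative assumptions, not the site's actual implementation:

```python
def h_score(benchmark_scores: dict[str, float]) -> float:
    """Illustrative H-Score: evenly weighted average of benchmark scores."""
    if not benchmark_scores:
        raise ValueError("at least one benchmark score is required")
    # Rule 3: every benchmark is weighted evenly.
    average = sum(benchmark_scores.values()) / len(benchmark_scores)
    # Rules 4-5: round to the nearest whole number, unless the average
    # is above 99, in which case it is left unrounded.
    return average if average > 99 else round(average)
```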
Number | Benchmark | Status |
---|---|---|
1 | GPQA | Included |
2 | MMLU | Included |
3 | AIME | Included |
4 | LiveBench (Coding) | Included |
5 | SWE-bench | Inferred* |
6 | Humanity's Last Exam | Inferred* |
7 | MATH | Omitted |
8 | ARC-AGI | Omitted |
*See the Scoring Metadata for inference details.
Inference Methodology
To compare performance to a reference model, we first calculate the absolute difference between their scores on each benchmark $i$, where $S_{\text{model},i}$ and $S_{\text{ref},i}$ denote the two models' scores:

$$d_i = \left| S_{\text{model},i} - S_{\text{ref},i} \right|$$
This measures the raw gap between the two scores without considering which one is higher. Next, we normalize the difference by computing the average of both scores:

$$\bar{S}_i = \frac{S_{\text{model},i} + S_{\text{ref},i}}{2}$$
This ensures a fair comparison across different benchmarks. Using these values, we calculate the percentage difference:

$$p_i = \frac{d_i}{\bar{S}_i} \times 100$$
This standardizes the comparison. We then compute the average percentage difference across the $n$ benchmarks both models completed:

$$\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p_i$$
This provides a balanced measure of the inferred model's overall performance gap relative to the reference models. Finally, we adjust the score by scaling down the reference model's score on the missing benchmark using the average percentage difference:

$$S_{\text{inferred}} = S_{\text{ref}} \times \left( 1 - \frac{\bar{p}}{100} \right)$$
This adjustment proportionally reduces the score, ensuring that the model's inferred score reflects its performance relative to the reference models. This step-by-step approach provides a consistent, standardized way to evaluate a model's effectiveness across multiple tasks.

Inferred scores are essential for obtaining a holistic score across benchmarks. AI companies often exclude benchmarks where their models underperform, which biases comparisons. Inferring the missing scores mitigates that bias and yields a more balanced and accurate assessment of overall performance. See these equations in action in the Scoring Metadata.
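Putting the five equations together, here is a minimal sketch of the inference calculation for a single reference model. The function name `inferred_score` and all variable names are illustrative assumptions; the document applies this relative to multiple reference models, but the single-reference case shows the mechanics:

```python
def inferred_score(
    model_scores: dict[str, float],  # benchmarks the model actually reported
    ref_scores: dict[str, float],    # reference model's scores, including the missing benchmark
    missing: str,                    # name of the benchmark to infer
) -> float:
    """Infer a missing benchmark score from a reference model's results."""
    shared = [b for b in model_scores if b in ref_scores and b != missing]
    percentage_diffs = []
    for b in shared:
        d = abs(model_scores[b] - ref_scores[b])       # absolute difference
        s_bar = (model_scores[b] + ref_scores[b]) / 2  # average of both scores
        percentage_diffs.append(d / s_bar * 100)       # percentage difference
    p_bar = sum(percentage_diffs) / len(percentage_diffs)  # mean across benchmarks
    # Scale the reference score down by the average percentage difference.
    return ref_scores[missing] * (1 - p_bar / 100)
```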
Current Benchmark Scores
AI | Benchmark | Score |
---|---|---|
o3 DR | Humanity's Last Exam | 27 |
o3 DR | GPQA | 88 |
o3 DR | MMLU | 92 |
o3 DR | AIME | 94 |
o3 DR | SWE-bench | 72 |
o3 DR | LiveBench (Coding) | 83 |
Grok-3 | Humanity's Last Exam | 14* |
Grok-3 | GPQA | 85 |
Grok-3 | MMLU | 93 |
Grok-3 | AIME | 96 |
Grok-3 | SWE-bench | 68* |
Grok-3 | LiveBench (Coding) | 67 |
Claude 3.7 | Humanity's Last Exam | 9 |
Claude 3.7 | GPQA | 85 |
Claude 3.7 | MMLU | 90 |
Claude 3.7 | AIME | 80 |
Claude 3.7 | SWE-bench | 70 |
Claude 3.7 | LiveBench (Coding) | 75 |
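As a worked check against the table above, feeding o3 DR's six scores into the hypothetical `h_score` sketch from earlier (assuming these six benchmarks are exactly the ones that enter the average):

```python
o3_dr = {"Humanity's Last Exam": 27, "GPQA": 88, "MMLU": 92,
         "AIME": 94, "SWE-bench": 72, "LiveBench (Coding)": 83}
print(h_score(o3_dr))  # (27 + 88 + 92 + 94 + 72 + 83) / 6 = 76
```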
See the Scoring Metadata for the full list of all 5 models.
Benchmarks
Benchmark | Link |
---|---|
GPQA | Research Paper |
GPQA | Data Set |
MMLU | Research Paper |
MMLU | Data Set |
AIME | Test Details |
LiveBench (Coding) | Website |
SWE-bench | Website |
MATH | Research Paper |
Humanity's Last Exam | Website |
ARC-AGI | Website |
Reporting Errors
If you believe the ranking is missing data or is incorrect, please send me a direct message on X.