H-Score Methodology
Rules:
1. Benchmarks are omitted unless completed by the top 3 models.
2. The highest reported score for each benchmark is used.
3. The H-Score weighs all benchmarks evenly.
4. Benchmark scores are averaged and rounded to the nearest whole number (illustrated in the sketch below these rules).
5. Scores above 99 are not rounded.
6. New benchmarks will be added only once ALL top 3 models complete them.
7. An AI model must be publicly available to be ranked.
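As a concrete illustration of rules 3-5, here is a minimal sketch of the averaging and rounding logic. Python, the function name `h_score`, and the dictionary input are illustrative assumptions, not the site's actual implementation:

```python
def h_score(benchmark_scores: dict[str, float]) -> float:
    """Illustrative H-Score: evenly weighted average of benchmark scores."""
    if not benchmark_scores:
        raise ValueError("at least one benchmark score is required")
    # Rule 3: every benchmark is weighted evenly.
    average = sum(benchmark_scores.values()) / len(benchmark_scores)
    # Rules 4-5: round to the nearest whole number, unless the average
    # is above 99, in which case it is left unrounded.
    return average if average > 99 else round(average)
```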
Number | Benchmark | Status |
---|---|---|
1 | GPQA | Included |
2 | MMLU | Included |
3 | AIME | Included |
4 | LiveBench (Coding) | Included |
5 | SWE-bench | Inferred* |
6 | Humanity's Last Exam | Inferred* |
7 | MATH | Omitted |
8 | ARC-AGI | Omitted |
*See the Scoring Metadata for inference details.
Inference Methodology
To compare performance to a reference model, we first calculate the absolute difference between their scores on each benchmark $i$, where $S_{\text{model},i}$ and $S_{\text{ref},i}$ denote the two models' scores:

$$d_i = \left| S_{\text{model},i} - S_{\text{ref},i} \right|$$
This measures the raw gap between the two scores without considering which one is higher. Next, we normalize the difference by computing the average of both scores:

$$\bar{S}_i = \frac{S_{\text{model},i} + S_{\text{ref},i}}{2}$$
This ensures a fair comparison across different benchmarks. Using these values, we calculate the percentage difference:

$$p_i = \frac{d_i}{\bar{S}_i} \times 100$$
This standardizes the comparison. We then compute the average percentage difference across the $n$ benchmarks both models completed:

$$\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p_i$$
This provides a balanced measure of the inferred model's overall performance gap relative to the reference models. Finally, we adjust the score by scaling down the reference model's score on the missing benchmark using the average percentage difference:

$$S_{\text{inferred}} = S_{\text{ref}} \times \left( 1 - \frac{\bar{p}}{100} \right)$$
This adjustment proportionally reduces the score, ensuring that the model's inferred score reflects its performance relative to the reference models. This step-by-step approach provides a consistent, standardized way to evaluate a model's effectiveness across multiple tasks.

Inferred scores are essential for obtaining a holistic score across benchmarks. AI companies often exclude benchmarks where their models underperform, which biases comparisons. Inferring the missing scores mitigates that bias and yields a more balanced and accurate assessment of overall performance. See these equations in action in the Scoring Metadata.
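Putting the five equations together, here is a minimal sketch of the inference calculation for a single reference model. The function name `inferred_score` and all variable names are illustrative assumptions; the document applies this relative to multiple reference models, but the single-reference case shows the mechanics:

```python
def inferred_score(
    model_scores: dict[str, float],  # benchmarks the model actually reported
    ref_scores: dict[str, float],    # reference model's scores, including the missing benchmark
    missing: str,                    # name of the benchmark to infer
) -> float:
    """Infer a missing benchmark score from a reference model's results."""
    shared = [b for b in model_scores if b in ref_scores and b != missing]
    percentage_diffs = []
    for b in shared:
        d = abs(model_scores[b] - ref_scores[b])       # absolute difference
        s_bar = (model_scores[b] + ref_scores[b]) / 2  # average of both scores
        percentage_diffs.append(d / s_bar * 100)       # percentage difference
    p_bar = sum(percentage_diffs) / len(percentage_diffs)  # mean across benchmarks
    # Scale the reference score down by the average percentage difference.
    return ref_scores[missing] * (1 - p_bar / 100)
```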
Current Benchmark Scores
AI | Benchmark | Score |
---|---|---|
o3 DR | Humanity's Last Exam | 27 |
o3 DR | GPQA | 88 |
o3 DR | MMLU | 92 |
o3 DR | AIME | 94 |
o3 DR | SWE-bench | 72 |
o3 DR | LiveBench (Coding) | 83 |
Grok-3 | Humanity's Last Exam | 14* |
Grok-3 | GPQA | 85 |
Grok-3 | MMLU | 93 |
Grok-3 | AIME | 96 |
Grok-3 | SWE-bench | 68* |
Grok-3 | LiveBench (Coding) | 67 |
Claude 3.7 | Humanity's Last Exam | 9 |
Claude 3.7 | GPQA | 85 |
Claude 3.7 | MMLU | 90 |
Claude 3.7 | AIME | 80 |
Claude 3.7 | SWE-bench | 70 |
Claude 3.7 | LiveBench (Coding) | 75 |
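As a worked check against the table above, feeding o3 DR's six scores into the hypothetical `h_score` sketch from earlier (assuming these six benchmarks are exactly the ones that enter the average):

```python
o3_dr = {"Humanity's Last Exam": 27, "GPQA": 88, "MMLU": 92,
         "AIME": 94, "SWE-bench": 72, "LiveBench (Coding)": 83}
print(h_score(o3_dr))  # (27 + 88 + 92 + 94 + 72 + 83) / 6 = 76
```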
See the Scoring Metadata for the full list of all 5 models.
Benchmarks
Benchmark | Link |
---|---|
GPQA | Research Paper |
GPQA | Data Set |
MMLU | Research Paper |
MMLU | Data Set |
AIME | Test Details |
LiveBench (Coding) | Website |
SWE-bench | Website |
MATH | Research Paper |
Humanity's Last Exam | Website |
ARC-AGI | Website |
Reporting Errors
If you believe the ranking is missing data or is incorrect, please send me a direct message on X.