📊

Benchmarking: Preliminary Results

Scientific evaluation across graduate-level reasoning, instruction-following, and narrative tasks

🧪 GPQA-Diamond

Graduate Physics/Chemistry
MassGen: 87.4% 🏆
Gemini: 85.9%
Grok-4: 85.4%
GPT-5: 84.8%
Claude: 68.2%

📋 IFEval

Instruction Following
MassGen: 88.0% 🏆
GPT-5: 87.4%
Grok-4: 84.7%
Gemini: 66.0%
Claude: 63.6%

📖 MuSR

Narrative Reasoning
Gemini: 69.6% 🏆
GPT-5: 69.2%
MassGen: 68.3%
Grok-4: 67.6%
Claude: 62.8%
🏆 Overall Champion
MassGen: 81.2%
Wins 2/3 benchmarks • Statistically significant
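
The 81.2% figure is consistent with an unweighted mean of MassGen's three benchmark scores. The sketch below assumes that simple mean, since the slide does not state how the overall score was aggregated. The `scores` dictionary transcribes the per-benchmark numbers, and the variable names are illustrative.

```python
# Scores (percent) transcribed from the three benchmark panels.
scores = {
    "MassGen": {"GPQA-Diamond": 87.4, "IFEval": 88.0, "MuSR": 68.3},
    "Gemini":  {"GPQA-Diamond": 85.9, "IFEval": 66.0, "MuSR": 69.6},
    "Grok-4":  {"GPQA-Diamond": 85.4, "IFEval": 84.7, "MuSR": 67.6},
    "GPT-5":   {"GPQA-Diamond": 84.8, "IFEval": 87.4, "MuSR": 69.2},
    "Claude":  {"GPQA-Diamond": 68.2, "IFEval": 63.6, "MuSR": 62.8},
}
benchmarks = ["GPQA-Diamond", "IFEval", "MuSR"]

# Assumption: the overall score is the unweighted mean over the benchmarks.
averages = {model: sum(s.values()) / len(s) for model, s in scores.items()}

# Top-scoring model on each benchmark, then MassGen's win count.
winners = {b: max(scores, key=lambda m: scores[m][b]) for b in benchmarks}
wins = sum(1 for w in winners.values() if w == "MassGen")

for model, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {avg:.1f}%")
print(f"MassGen wins {wins}/{len(benchmarks)} benchmarks")
```

Under this assumption MassGen has the highest mean, and the win count matches the 2/3 stated on the slide.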
✅ Key Results:
• Highest on 2/3 benchmarks (GPQA-Diamond, IFEval)
• Best overall average (81.2%)
• Consistent performance: within 1.3 points of the top score on MuSR
📈 Statistical:
• vs Claude: p = 1.4e-07 ⭐⭐⭐
• vs Gemini: p = 1.1e-28 ⭐⭐⭐
• Differences unlikely to be due to chance
🔬 Research Gap:
• Oracle: 95.5% (GPQA)
• Actual: 87.4%
• Potential: 8.1 points
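
The potential figure is the difference between the oracle and actual GPQA scores. A small sketch of that arithmetic follows; the variable names are illustrative and not taken from the source.

```python
# GPQA scores (percent) as reported in the Research Gap panel.
oracle_gpqa = 95.5
actual_gpqa = 87.4

# Headroom between the oracle score and the achieved score, in percentage points.
gap_points = oracle_gpqa - actual_gpqa
print(f"Potential: {gap_points:.1f} points")
```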