Benchmarking: Preliminary Results
Scientific evaluation across graduate-level reasoning, instruction-following, and narrative tasks
🧪 GPQA-Diamond
Graduate Physics/Chemistry
IFEval
Instruction Following
MuSR
Narrative Reasoning
Overall Champion
MassGen: 81.2%
Wins 2/3 benchmarks • Statistically significant
Key Results:
• Highest on 2/3 benchmarks
• Best overall average
• Consistent performance
Statistical:
• vs Claude: p = 1.4e-07 ⭐⭐⭐
• vs Gemini: p = 1.1e-28 ⭐⭐⭐
• Differences unlikely to be due to chance
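The three-star markers next to the p-values can be read with the following minimal sketch. It assumes the common 0.05 / 0.01 / 0.001 cutoffs; the source does not state which thresholds or which statistical test were used, and `significance_stars` is a hypothetical helper name.

```python
def significance_stars(p_value: float) -> str:
    """Map a p-value to a star rating.

    Assumption: the conventional 0.05 / 0.01 / 0.001 cutoffs.
    The source does not say which thresholds it used.
    """
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "n.s."


# p-values as reported in the Statistical list above.
reported = {"Claude": 1.4e-07, "Gemini": 1.1e-28}

for baseline, p in reported.items():
    print(f"vs {baseline}: p = {p:.1e} {significance_stars(p)}")
```

Under these assumed cutoffs, both reported p-values fall in the strictest band, which is consistent with the three stars shown.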
🔬 Research Gap:
• Oracle: 95.5% (GPQA)
• Actual: 87.4%
• Potential: 8.1 points
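The 8.1-point figure is the difference between the two GPQA scores above. A minimal sketch of that arithmetic follows. It assumes the Oracle score is an upper bound measured on the same GPQA set, which the source does not define; the variable names are illustrative.

```python
# Scores quoted in the Research Gap list above (GPQA, percent).
oracle_accuracy = 95.5
actual_accuracy = 87.4

# Headroom: points separating the actual score from the oracle score.
# Rounded to one decimal to match the precision of the quoted figures.
headroom_points = round(oracle_accuracy - actual_accuracy, 1)

# Share of the oracle score that the actual system reaches.
fraction_of_oracle = actual_accuracy / oracle_accuracy

print(f"Headroom: {headroom_points} points")
print(f"Share of oracle achieved: {fraction_of_oracle:.1%}")
```

The second quantity is not reported in the source. It is included only to show the gap on a relative scale.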