📊 Benchmarking: Preliminary Results

Scientific evaluation across graduate-level reasoning, instruction-following, and narrative tasks

Graduate Physics/Chemistry

MassGen   87.4% 🏆
Gemini    85.9%
Grok-4    85.4%
GPT-5     84.8%
Claude    68.2%

Instruction Following

MassGen   88.0% 🏆
GPT-5     87.4%
Grok-4    84.7%
Gemini    66.0%
Claude    63.6%

Narrative Reasoning

Gemini    69.6% 🏆
GPT-5     69.2%
MassGen   68.3%
Grok-4    67.6%
Claude    62.8%

🏆 Overall Champion

MassGen: 81.2%

Highest score on 2 of 3 benchmarks • Statistically significant vs Claude and Gemini

✅ Key Results:

• Highest on 2/3 benchmarks
• Best overall average
• Consistent performance: within 1.3 points of the top score on Narrative Reasoning
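
The headline numbers can be re-derived from the three benchmark tables. The sketch below assumes the overall figure is an unweighted mean of the three benchmark scores, which the source does not state explicitly but which reproduces 81.2% for MassGen. The names `SCORES`, `overall_average`, and `benchmark_wins` are illustrative only.

```python
# Scores (%) copied from the three benchmark tables in this section.
SCORES = {
    "Graduate Physics/Chemistry": {
        "MassGen": 87.4, "Gemini": 85.9, "Grok-4": 85.4, "GPT-5": 84.8, "Claude": 68.2,
    },
    "Instruction Following": {
        "MassGen": 88.0, "GPT-5": 87.4, "Grok-4": 84.7, "Gemini": 66.0, "Claude": 63.6,
    },
    "Narrative Reasoning": {
        "Gemini": 69.6, "GPT-5": 69.2, "MassGen": 68.3, "Grok-4": 67.6, "Claude": 62.8,
    },
}


def overall_average(scores):
    """Unweighted mean score per system across all benchmarks (an assumption)."""
    systems = next(iter(scores.values())).keys()
    return {s: sum(table[s] for table in scores.values()) / len(scores) for s in systems}


def benchmark_wins(scores):
    """Count how many benchmarks each system tops."""
    wins = {}
    for table in scores.values():
        winner = max(table, key=table.get)
        wins[winner] = wins.get(winner, 0) + 1
    return wins


if __name__ == "__main__":
    averages = overall_average(SCORES)
    for system, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
        print(f"{system:8s} {avg:.1f}%")
    print(benchmark_wins(SCORES))
```

Under that assumption MassGen has the highest mean, with GPT-5 less than one point behind, so the overall ranking is sensitive to how the three benchmarks are weighted.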

📈 Statistical:

• vs Claude: p = 1.4e-07 ⭐⭐⭐
• vs Gemini: p = 1.1e-28 ⭐⭐⭐
• Differences are unlikely to be due to chance
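
The source does not say which test produced these p-values or how many questions were evaluated. As a rough illustration only, one common choice for comparing two systems on the same question set is an exact McNemar test on the discordant pairs. The function name `mcnemar_exact_p` and the counts in the usage line are hypothetical, not taken from the evaluation.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: questions system A answered correctly and system B did not
    c: questions system B answered correctly and system A did not
    Under the null hypothesis, discordant pairs split as Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # HYPOTHETICAL discordant counts, for illustration only.
    print(mcnemar_exact_p(b=40, c=10))
```

A paired test of this kind uses only the questions on which the two systems disagree, so it needs per-question results rather than the aggregate accuracies shown here.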

🔬 Research Gap:

• Oracle: 95.5% (GPQA)
• Actual (MassGen): 87.4%
• Potential gain: 8.1 percentage points
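
The gap above is simple arithmetic: 95.5 − 87.4 = 8.1 points. The source does not define "Oracle"; the sketch below assumes it means the accuracy obtained if the best available answer were always selected, so that a question counts as solved when at least one agent answers it correctly. The function names and the toy data are hypothetical.

```python
def oracle_accuracy(per_agent_correct):
    """Fraction of questions that at least one agent answered correctly.

    per_agent_correct: dict mapping agent name -> list of booleans,
    one entry per question, all lists the same length.
    ASSUMPTION: this is one plausible reading of "Oracle", not the
    source's stated definition.
    """
    columns = list(zip(*per_agent_correct.values()))
    return sum(any(col) for col in columns) / len(columns)


def headroom(oracle_pct, actual_pct):
    """Gap between oracle and actual accuracy, in percentage points."""
    return round(oracle_pct - actual_pct, 1)


if __name__ == "__main__":
    # Toy, HYPOTHETICAL per-question results for two agents.
    toy = {
        "agent_a": [True, False, True, False],
        "agent_b": [False, True, True, False],
    }
    print(oracle_accuracy(toy))
    print(headroom(95.5, 87.4))
```

Read this way, the gap measures how much accuracy is lost in selecting among candidate answers rather than in generating them.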