MassGen: Scaling AI Through Multi-Agent Collaboration

Diagram showing collaborative AI agents working together in parallel threads

🎯 RecSys'25 Tutorial

Agentic Recommendation Systems

📍 Prague, Czech Republic • September 26, 2025

🌐 massgen.ai | GitHub

📈

How Do We Scale Up AI?

⚡

Traditional Scaling Laws Hit Limits

💡

Power Crisis

4-16 GW by 2030
Enough to power entire cities

📚

Data Depletion

Depleted by 2026-2028
Quality text data exhausted

🚧

Performance Plateau

GPT-5 delayed >1 year
Early training runs failed

⚠️ Inference-time scaling

No universal way to leverage improvements & address limitations

The New Paradigm:

Model → Agent → Multi-Agent Systems

🤝

The Promise of Multi-Agent Collaboration

Study Group Dynamics: Like humans collaborating on complex problems
Cross-Ecosystem Integration: Bridge Claude, Gemini, GPT, Grok, and specialized coding agents
Emergent Intelligence: Collective problem-solving beyond individual capabilities
Real-time Intelligence Sharing: Agents learn and adapt from each other

Visual representation of collaborative AI reasoning and cognitive processes

The Promise of Collaborative Reasoning

📖 root.massgen.ai - "Myth of Reasoning"

🏗️ Built on AG2's foundational multi-agent research and community

📈

Proven Performance Gains

Grok-4 Standard

1

Single Agent Processing

38.6%

Humanity's Last Exam (HLE) Score

$30/month

Grok-4 Heavy

A1

A2

A3

Multi-Agent Collaboration

44.4%

Humanity's Last Exam (HLE) Score

$300/month

Gemini 2.5 Deep Think

🏆

🥇

Competition Gold Medals

IMO + ICPC

5/6 IMO Problems (2025)

10/12 ICPC Problems (2025)

First AI Gold Medals

🚀 Multi-Agent Revolution
"Individual AI excellence + Multi-agent coordination = Next frontier of AI capabilities"

🚀

MassGen Orchestrator

Task Distribution & Coordination

↓

🏗️

Agent 1

Anthropic/Claude

👨‍💻

Agent 2

Claude Code

🌟

Agent 3

Google/Gemini

🤖

Agent 4

OpenAI/GPT

⚡

Agent 5

xAI/Grok

↕ Real-time Collaboration ↕

↓

🔄

Shared Collaboration Hub

Real-time Notification & Consensus
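
The architecture above can be read as a fan-out/fan-in loop: the orchestrator dispatches one task to every agent in parallel, and each agent publishes its draft to a shared hub that the others can read. The following is a minimal sketch under that reading, not MassGen's actual API. The names `run_agent` and `orchestrate` are hypothetical, and the agents are stubbed with trivial coroutines instead of real model calls.

```python
import asyncio


async def run_agent(name: str, task: str, hub: dict) -> None:
    """Hypothetical agent: a real one would call a model backend here."""
    await asyncio.sleep(0)  # stand-in for backend latency
    hub[name] = f"{name} draft for: {task}"  # publish to the shared hub


async def orchestrate(task: str, agents: list) -> dict:
    """Fan the task out to all agents concurrently, then return the hub."""
    hub: dict = {}
    await asyncio.gather(*(run_agent(name, task, hub) for name in agents))
    return hub


# Agent labels mirror the five boxes in the diagram (illustrative only).
AGENTS = ["claude", "claude_code", "gemini", "gpt", "grok"]
hub = asyncio.run(orchestrate("demo task", AGENTS))
for name, draft in hub.items():
    print(name, "->", draft)
```

In a real system the hub would also notify agents when a peer posts an update; here it is only a dictionary, to keep the sketch short.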

🚀

Key Features & Capabilities

🔄

Iterative Refinement

The Reality of Reasoning

⚙️

Multi-Backend Support

⚡

Parallel Processing

👥

Intelligence Sharing

🎯

Consensus Building

🔧

Tech Deep Dive: Backend Abstraction Challenges

🎯

Tech Deep Dive: Binary Decision Framework Solution

🎯 Key Innovation: Vote Invalidation Creates Dynamic Consensus
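
One way to read the binary decision framework: on each turn an agent either submits a new answer or votes for an existing one, and any new answer invalidates the votes cast so far, so consensus has to be re-earned against the updated set of answers. The sketch below is an assumption based only on the slide titles. `ConsensusHub`, `new_answer`, `vote`, and the "everyone has voted" stopping rule are all hypothetical, chosen for illustration.

```python
from collections import Counter


class ConsensusHub:
    """Hypothetical hub: each turn is either a new answer or a vote."""

    def __init__(self, agents):
        self.agents = list(agents)
        self.answers = {}  # agent -> latest answer text
        self.votes = {}    # voter -> agent whose answer it backs

    def new_answer(self, agent, answer):
        self.answers[agent] = answer
        self.votes.clear()  # a new answer invalidates every earlier vote

    def vote(self, voter, target):
        if target not in self.answers:
            raise ValueError(f"{target} has not submitted an answer")
        self.votes[voter] = target

    def winner(self):
        """Return the most-voted agent once all agents hold a valid vote."""
        if len(self.votes) < len(self.agents):
            return None
        target, _count = Counter(self.votes.values()).most_common(1)[0]
        return target


hub = ConsensusHub(["a1", "a2", "a3"])
hub.new_answer("a1", "Answer C")
hub.vote("a2", "a1")
hub.vote("a3", "a1")
hub.new_answer("a2", "Answer A")  # wipes the two votes above
votes_after_revision = dict(hub.votes)
for voter in hub.agents:
    hub.vote(voter, "a2")
print(hub.winner())
```

The design point is that voting is cheap and reversible: agents can converge early, yet a late correction still forces everyone to look again.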

🔬

Case Study: Success Through Peer Correction

Graduate-level physics question from GPQA-Diamond benchmark

🌌 The Problem

A quasar's spectrum shows a peak at 790 nm. Given flat Lambda-CDM cosmological parameters (H₀ = 70 km/s/Mpc, Ωₘ = 0.3, ΩΛ = 0.7), what is the comoving distance to the quasar?

Options: A) 8 Gpc B) 7 Gpc C) 6 Gpc D) 9 Gpc

🎯 Final Result

✅

Correct Answer: A (8 Gpc)

Orchestration succeeded where individual agents initially failed

🤖 Round 1: Initial Answers

Claude: "I calculate ~6 Gpc → Answer C"
GPT-5: "I get ~8.95 Gpc → Answer D"
Gemini: "~6.1 Gpc → Answer C"

🔄 Self-Correction Process

Claude observes: "There is significant discrepancy in calculations: Agent1 gets ~6.1 Gpc, Agent2 gets ~8.95 Gpc. Let me re-examine..."

✨ Breakthrough Moment

Claude revises: "Standard cosmological calculators yield 8000-8500 Mpc for z=5.5. This equals 8.0-8.5 Gpc, closest to option A."

Result: 3/4 agents converge on correct answer

💡 Success Mechanism:
Peer observation → Discrepancy detection → Self-correction → Consensus
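
The revised figure in this case study can be checked by direct numerical integration of the flat Lambda-CDM comoving distance, D_C = (c / H₀) ∫₀ᶻ dz' / √(Ωₘ(1+z')³ + ΩΛ). The sketch below assumes the 790 nm peak is the redshifted Lyman-alpha line (rest wavelength about 121.6 nm), which is consistent with the z = 5.5 quoted above but is not stated on the slide. The function names are illustrative.

```python
import math

C_KM_S = 299_792.458       # speed of light, km/s
H0 = 70.0                  # Hubble constant, km/s/Mpc (from the problem)
OMEGA_M, OMEGA_L = 0.3, 0.7
LYMAN_ALPHA_NM = 121.567   # assumption: the peak is redshifted Lyman-alpha


def inv_e(z: float) -> float:
    """1 / E(z) for a flat Lambda-CDM universe (radiation neglected)."""
    return 1.0 / math.sqrt(OMEGA_M * (1.0 + z) ** 3 + OMEGA_L)


def comoving_distance_gpc(z: float, steps: int = 1000) -> float:
    """Composite Simpson integration of c/H0 * 1/E(z); steps must be even."""
    h = z / steps
    total = inv_e(0.0) + inv_e(z)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * inv_e(i * h)
    integral = total * h / 3.0
    return (C_KM_S / H0) * integral / 1000.0  # Mpc -> Gpc


z = 790.0 / LYMAN_ALPHA_NM - 1.0
d = comoving_distance_gpc(z)
options = {"A": 8, "B": 7, "C": 6, "D": 9}
closest = min(options, key=lambda k: abs(options[k] - d))
print(f"z = {z:.2f}, D_C = {d:.2f} Gpc, closest option: {closest}")
```

Under these assumptions the integral gives roughly 8 Gpc, in line with the revised answer.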

📊

Benchmarking: Preliminary Results

Scientific evaluation across graduate-level reasoning, instruction-following, and narrative tasks

🧪 GPQA-Diamond

Graduate Physics/Chemistry

MassGen: 87.4% 🏆
Gemini: 85.9%
Grok-4: 85.4%
GPT-5: 84.8%
Claude: 68.2%

📋 IFEval

Instruction Following

MassGen: 88.0% 🏆
GPT-5: 87.4%
Grok-4: 84.7%
Gemini: 66.0%
Claude: 63.6%

📖 MuSR

Narrative Reasoning

Gemini: 69.6% 🏆
GPT-5: 69.2%
MassGen: 68.3%
Grok-4: 67.6%
Claude: 62.8%

🏆 Overall Champion

MassGen: 81.2%

Wins 2/3 benchmarks • Statistically significant

✅ Key Results:

• Highest on 2/3 benchmarks
• Best overall average
• Consistent performance

📈 Statistical:

• vs Claude: p = 1.4e-07 ⭐⭐⭐
• vs Gemini: p = 1.1e-28 ⭐⭐⭐
• Differences unlikely to be due to chance

🔬 Research Gap:

• Oracle: 95.5% (GPQA)
• Actual: 87.4%
• Potential: 8.1 points
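
The summary figures on this slide follow from the per-benchmark scores by simple arithmetic. The sketch below recomputes them, assuming "overall" means the unweighted mean across the three benchmarks (the slide does not say how the average is weighted).

```python
from collections import Counter

# Scores copied from the three benchmark panels above (percent).
SCORES = {
    "MassGen": {"GPQA-Diamond": 87.4, "IFEval": 88.0, "MuSR": 68.3},
    "Gemini":  {"GPQA-Diamond": 85.9, "IFEval": 66.0, "MuSR": 69.6},
    "Grok-4":  {"GPQA-Diamond": 85.4, "IFEval": 84.7, "MuSR": 67.6},
    "GPT-5":   {"GPQA-Diamond": 84.8, "IFEval": 87.4, "MuSR": 69.2},
    "Claude":  {"GPQA-Diamond": 68.2, "IFEval": 63.6, "MuSR": 62.8},
}
BENCHMARKS = ["GPQA-Diamond", "IFEval", "MuSR"]

# Assumption: the overall score is an unweighted mean over benchmarks.
averages = {
    name: round(sum(s.values()) / len(s), 1) for name, s in SCORES.items()
}

# Count how many benchmarks each system tops.
wins = Counter(
    max(SCORES, key=lambda name: SCORES[name][bench]) for bench in BENCHMARKS
)

# Gap between the oracle figure and MassGen's actual GPQA score.
oracle_gap = round(95.5 - SCORES["MassGen"]["GPQA-Diamond"], 1)

best = max(averages, key=averages.get)
print(averages, dict(wins), oracle_gap, best)
```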

🎯

Agentic Recommendation Applications

Multi-Agent Personalization Pipeline

🔍

Content Analysis

Multiple agents analyze user behavior, content features, and contextual signals simultaneously

⚖️

Preference Fusion

Collaborative filtering, content-based, and deep learning agents debate optimal recommendations

🎯

Dynamic Ranking

Real-time consensus building for personalized item ranking and diversity optimization

Cross-domain expertise, multi-objective optimization, explainable recommendations
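
The fusion and ranking stages can be illustrated with a simple rank-aggregation step. The slides do not say how the recommender agents reach consensus, so a Borda count is swapped in here as a stand-in. The agent names and item IDs are hypothetical.

```python
from collections import defaultdict


def borda_fuse(rankings: dict) -> list:
    """Fuse per-agent ranked lists: an item earns more points the higher
    each agent ranks it; ties are broken alphabetically for determinism."""
    scores = defaultdict(int)
    for ranked in rankings.values():
        n = len(ranked)
        for position, item in enumerate(ranked):
            scores[item] += n - position
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical output of three recommender agents for one user.
rankings = {
    "collaborative_filtering": ["item_a", "item_b", "item_c"],
    "content_based": ["item_b", "item_a", "item_c"],
    "deep_learning": ["item_b", "item_c", "item_a"],
}
fused = borda_fuse(rankings)
print(fused)
```

A debate-style pipeline would let agents revise their lists after seeing the fused ranking and repeat until the order stabilizes; the single pass here shows only the aggregation step.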

🎬

Live Demonstrations

🌐 LLM Fun Facts Website (v0.0.14): Claude Code agents create interactive websites with enhanced logging and workspace isolation

Result: Conflict-free parallel development with comprehensive versioning

📁 Unified Filesystem (v0.0.16): Cross-backend collaboration between Gemini and Claude Code agents creating educational content with shared workspace management

Result: First-time cross-backend coordination producing comprehensive 25-slide presentations

🏆 IMO 2025 Winner Research: Multi-agent fact-checking → unanimous consensus on Google DeepMind victory

Result: Accurate identification despite conflicting information

💰 Technical Analysis: Complex Grok-4 HLE pricing calculation through iterative refinement

Result: Accurate cost estimates through collaborative validation

📚 case.massgen.ai - Complete Case Studies

⚡

Get Started in 60 Seconds

    # 1. Clone and setup
    git clone https://github.com/Leezekun/MassGen
    cd MassGen && pip install uv && uv venv

    # 2. Configure API keys
    cp .env.example .env  # Add your API keys

    # 3. Run single agent (quick test)
    uv run python -m massgen.cli --model gemini-2.5-flash "When is your knowledge up to"

    # 4. Run multi-agent collaboration
    uv run python -m massgen.cli --config three_agents_default.yaml "Summarize latest news of github.com/Leezekun/MassGen"

✅ Supported Models & Providers

🏢 Major Providers:

Anthropic Claude & Claude Code • Google Gemini • OpenAI GPT • xAI Grok • ZAI GLM

🏠 Local & Extended:

Cerebras • Fireworks • Groq • LM Studio • OpenRouter • Together...

🛠️ Advanced Tools

Web Search • Code Execution • MCP Tools • File Operations • Browser Automation • Advanced Permissions

🚀

Build Agentic Recommendation Systems

Multi-Agent Personalization at Scale

⭐ github.massgen.ai 💬 discord.massgen.ai 🐦 x.massgen.ai 🌐 massgen.ai

🚀 v0.0.3→v0.0.23 Latest | 10+ providers; Claude Code CLI; GPT-5 & GPT-OSS; Local Models; MCP Integration; Unified Filesystem

Thank you RecSys'25!
Questions & Discussion