[Figure: MassGen logo; diagram of collaborative AI agents working together in parallel threads]
Scaling AI Through Multi-Agent Collaboration
🎯 RecSys'25 Tutorial
Agentic Recommendation Systems
📍 Prague, Czech Republic • September 26, 2025
🌐 massgen.ai | GitHub

📈 How Do We Scale Up AI?

⚡ Traditional Scaling Laws Hit Limits

💡 Power Crisis
4-16 GW by 2030
Enough to power entire cities

📚 Data Depletion
Depleted by 2026-2028
Quality text data exhausted

🚧 Performance Plateau
GPT-5 delayed >1 year
Early training runs failed

⚠️ Inference-time scaling
No universal way to leverage improvements & address limitations

The New Paradigm:
Model → Agent → Multi-Agent Systems

🤝 The Promise of Multi-Agent Collaboration

  • Study Group Dynamics: Like humans collaborating on complex problems
  • Cross-Ecosystem Integration: Bridge Claude, Gemini, GPT, Grok, and specialized coding agents
  • Emergent Intelligence: Collective problem-solving beyond individual capabilities
  • Real-time Intelligence Sharing: Agents learn and adapt from each other
[Figure: The Promise of Collaborative Reasoning. Visual representation of collaborative AI reasoning and cognitive processes]
🏗️ Built on AG2's foundational multi-agent research and community

📈 Proven Performance Gains

Grok-4 Standard
  • Single Agent Processing (1 agent)
  • 38.6% Humanity's Last Exam (HLE) Score
  • $30/month
Grok-4 Heavy
  • Multi-Agent Collaboration (agents A1, A2, A3)
  • 44.4% Humanity's Last Exam (HLE) Score
  • $300/month
Gemini 2.5 DeepThink
  • 🏆 🥇 Competition Gold Medals: IMO + ICPC
  • 5/6 IMO Problems (2025)
  • 10/12 ICPC Problems (2025)
  • First AI Gold Medals
🚀 Multi-Agent Revolution
"Individual AI excellence + Multi-agent coordination = Next frontier of AI capabilities"

🚀 MassGen Orchestrator
Task Distribution & Coordination
↓
🏗️ Agent 1: Anthropic/Claude
👨‍💻 Agent 2: Claude Code
🌟 Agent 3: Google/Gemini
🤖 Agent 4: OpenAI/GPT
⚡ Agent 5: xAI/Grok
↕ Real-time Collaboration ↕
↓
🔄 Shared Collaboration Hub
Real-time Notification & Consensus

🚀 Key Features & Capabilities

🔄 Iterative Refinement
[Figure: The Reality of Reasoning. Diagram showing iterative refinement cycles in multi-agent collaboration]
⚙️ Multi-Backend Support
⚡ Parallel Processing
👥 Intelligence Sharing
🎯 Consensus Building

🔧 Tech Deep Dive: Backend Abstraction Challenges

🎭 Unified Interface Challenge
  • Claude Code CLI: context sharing across agents
  • Gemini API: can't mix builtin + custom tools
  • GPT-5: API changes (reasoning, streaming)
  • Most backends: unable to autonomously collaborate
🎯 Unified ChatAgent Protocol
  • ⚙️ StreamChunk normalization
  • 🔀 Workarounds: backend-specific
  • 🛠️ Tool integration: MCP, Web, Code
🧠 Binary Decision Framework: intelligent agent coordination and workspace sharing
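
To make the unified ChatAgent protocol and StreamChunk normalization idea concrete, here is a minimal sketch. It is not MassGen's actual code: the field names on `StreamChunk`, the `chat` signature, the two raw event shapes, and the adapter classes are all hypothetical, chosen only to show how differently shaped backend streams could be mapped onto one chunk type.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterator, List, Protocol, Tuple


@dataclass
class StreamChunk:
    """Hypothetical normalized unit of streamed output."""
    type: str          # "content" or "done" in this sketch
    content: str = ""
    source: str = ""   # label of the backend that produced the chunk


class ChatAgent(Protocol):
    """Hypothetical interface that every backend adapter satisfies."""

    def chat(self, messages: List[Dict[str, str]]) -> Iterator[StreamChunk]:
        ...


def normalize_dict_event(event: Dict[str, Any], source: str) -> StreamChunk:
    # Assumed raw shape: {"delta": {"text": "..."}}; no "delta" key ends the stream.
    delta = event.get("delta")
    if delta is None:
        return StreamChunk(type="done", source=source)
    return StreamChunk(type="content", content=delta.get("text", ""), source=source)


def normalize_tuple_event(event: Tuple[str, str], source: str) -> StreamChunk:
    # Assumed raw shape: ("text", "...") for content, ("end", "") to finish.
    kind, text = event
    if kind == "end":
        return StreamChunk(type="done", source=source)
    return StreamChunk(type="content", content=text, source=source)


class DictBackendAgent:
    """Adapter over a made-up backend that streams dict events."""

    def __init__(self, raw_events: List[Dict[str, Any]]) -> None:
        self.raw_events = raw_events

    def chat(self, messages: List[Dict[str, str]]) -> Iterator[StreamChunk]:
        for event in self.raw_events:
            yield normalize_dict_event(event, source="backend_a")


class TupleBackendAgent:
    """Adapter over a made-up backend that streams tuple events."""

    def __init__(self, raw_events: List[Tuple[str, str]]) -> None:
        self.raw_events = raw_events

    def chat(self, messages: List[Dict[str, str]]) -> Iterator[StreamChunk]:
        for event in self.raw_events:
            yield normalize_tuple_event(event, source="backend_b")


def collect_text(agent: ChatAgent, messages: List[Dict[str, str]]) -> str:
    """Orchestrator-side code only ever sees StreamChunk objects."""
    return "".join(c.content for c in agent.chat(messages) if c.type == "content")
```

Because both adapters yield the same chunk type, a caller such as `collect_text` needs no backend-specific branches; any workaround stays inside the adapter that requires it.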

🎯 Tech Deep Dive: Binary Decision Framework Solution

Agent 1: Vote: 4 • Agent 2: Writing • Agent 3: Vote: 4 • Agent 4: Vote: 4
⚖️ Binary Choice: vote OR new_answer
Agent 2 provides new_answer → ALL VOTES RESET
After Reset: Agent 1: Deciding • Agent 2: Deciding • Agent 3: Deciding • Agent 4: Deciding
Async Results: Agent 1: Vote: 2 • Agent 2: Vote: 2 • Agent 3: Writing • Agent 4: Deciding → If new_answer: reset again
🎯 Key Innovation: Vote Invalidation Creates Dynamic Consensus
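
The vote-or-new_answer rule with vote invalidation can be sketched in a few lines. This is an illustration under stated assumptions, not MassGen's implementation: the `Coordinator` class and its method names are hypothetical, and the stopping rule (every agent has voted since the last new answer, winner chosen by plurality) is an assumption the slide does not spell out.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Coordinator:
    """Hypothetical coordinator for the binary vote / new_answer choice."""
    agent_ids: List[str]
    answers: Dict[str, str] = field(default_factory=dict)  # author -> latest answer
    votes: Dict[str, str] = field(default_factory=dict)    # voter -> author voted for

    def new_answer(self, agent_id: str, text: str) -> None:
        # A new answer changes what is on the table, so every
        # previously cast vote is invalidated.
        self.answers[agent_id] = text
        self.votes.clear()

    def vote(self, voter_id: str, target_id: str) -> None:
        if target_id not in self.answers:
            raise ValueError(f"{target_id} has not provided an answer")
        self.votes[voter_id] = target_id

    def consensus(self) -> Optional[str]:
        # Assumed rule: done only when all agents have voted since the
        # last reset; ties are not handled specially in this sketch.
        if len(self.votes) < len(self.agent_ids):
            return None
        return Counter(self.votes.values()).most_common(1)[0][0]


# Replaying the scenario from the slide.
coordinator = Coordinator(agent_ids=["agent1", "agent2", "agent3", "agent4"])
coordinator.new_answer("agent4", "first draft")
for voter in ("agent1", "agent3", "agent4"):
    coordinator.vote(voter, "agent4")
pending = coordinator.consensus()          # agent2 is still writing

coordinator.new_answer("agent2", "improved answer")  # all votes reset
votes_after_reset = dict(coordinator.votes)

for voter in coordinator.agent_ids:
    coordinator.vote(voter, "agent2")
winner = coordinator.consensus()
```

In this replay `pending` is `None` because one agent has not voted, the reset leaves no votes standing, and the second round settles on `agent2`.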

🔬 Case Study: Success Through Peer Correction

Graduate-level physics question from GPQA-Diamond benchmark

🌌 The Problem

A quasar shows a peak at 790 nm wavelength. Given Lambda-CDM cosmological parameters (H₀ = 70 km/s/Mpc, Ωₘ = 0.3, Ω_Λ = 0.7), what is the comoving distance?

Options: A) 8 Gpc B) 7 Gpc C) 6 Gpc D) 9 Gpc

🎯 Final Result

✅ Correct Answer: A (8 Gpc)
Orchestration succeeded where individual agents initially failed

🤖 Round 1: Initial Answers

Claude: "I calculate ~6 Gpc → Answer C"
GPT-5: "I get ~8.95 Gpc → Answer D"
Gemini: "~6.1 Gpc → Answer C"

🔄 Self-Correction Process

Claude observes: "There is significant discrepancy in calculations: Agent1 gets ~6.1 Gpc, Agent2 gets ~8.95 Gpc. Let me re-examine..."

✨ Breakthrough Moment

Claude revises: "Standard cosmological calculators yield 8000-8500 Mpc for z=5.5. This equals 8.0-8.5 Gpc, closest to option A."

Result: 3/4 agents converge on correct answer
💡 Success Mechanism:
Peer observation → Discrepancy detection → Self-correction → Consensus
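
The figure that the agents converged on can be checked numerically. The sketch below assumes the 790 nm peak is the redshifted Lyman-alpha line (rest wavelength about 121.567 nm), which gives the z of roughly 5.5 quoted in the revision. It then integrates the flat Lambda-CDM comoving distance D_C = (c / H₀) ∫₀ᶻ dz' / sqrt(Ωₘ(1+z')³ + Ω_Λ) with Simpson's rule. The function names are illustrative, and radiation density is neglected.

```python
import math

C_KM_S = 299_792.458        # speed of light in km/s
LYMAN_ALPHA_NM = 121.567    # rest wavelength of Lyman-alpha (assumed line identification)


def redshift_from_peak(observed_nm: float, rest_nm: float = LYMAN_ALPHA_NM) -> float:
    """Redshift implied by an observed wavelength for a given rest wavelength."""
    return observed_nm / rest_nm - 1.0


def comoving_distance_gpc(
    z: float,
    h0: float = 70.0,          # km/s/Mpc
    omega_m: float = 0.3,
    omega_lambda: float = 0.7,
    steps: int = 2000,         # must be even for Simpson's rule
) -> float:
    """Line-of-sight comoving distance in Gpc for a flat Lambda-CDM model."""

    def inv_e(zp: float) -> float:
        # 1 / E(z), with E(z) = H(z) / H0; radiation is neglected.
        return 1.0 / math.sqrt(omega_m * (1.0 + zp) ** 3 + omega_lambda)

    h = z / steps
    total = inv_e(0.0) + inv_e(z)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * inv_e(i * h)
    integral = total * h / 3.0

    hubble_distance_mpc = C_KM_S / h0
    return hubble_distance_mpc * integral / 1000.0


z = redshift_from_peak(790.0)
distance_gpc = comoving_distance_gpc(z)
print(f"z = {z:.2f}, comoving distance = {distance_gpc:.2f} Gpc")
```

Under these assumptions the result lands close to 8 Gpc, consistent with option A and with the 8000-8500 Mpc range cited in the revision.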

📊 Benchmarking: Preliminary Results

Scientific evaluation across graduate-level reasoning, instruction-following, and narrative tasks

🧪 GPQA-Diamond (Graduate Physics/Chemistry)
  • MassGen: 87.4% 🏆
  • Gemini: 85.9%
  • Grok-4: 85.4%
  • GPT-5: 84.8%
  • Claude: 68.2%

📋 IFEval (Instruction Following)
  • MassGen: 88.0% 🏆
  • GPT-5: 87.4%
  • Grok-4: 84.7%
  • Gemini: 66.0%
  • Claude: 63.6%

📖 MuSR (Narrative Reasoning)
  • Gemini: 69.6% 🏆
  • GPT-5: 69.2%
  • MassGen: 68.3%
  • Grok-4: 67.6%
  • Claude: 62.8%

🏆 Overall Champion
MassGen: 81.2% (average across the three benchmarks)
Wins 2/3 benchmarks • Statistically significant
✅ Key Results:
  • Highest on 2/3 benchmarks
  • Best overall average
  • Consistent performance
📈 Statistical:
  • vs Claude: p = 1.4e-07 ⭐⭐⭐
  • vs Gemini: p = 1.1e-28 ⭐⭐⭐
  • Not due to chance
🔬 Research Gap:
  • Oracle: 95.5% (GPQA)
  • Actual: 87.4%
  • Potential: 8.1 points
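
The summary figures can be recomputed from the per-benchmark scores listed above. This sketch assumes the overall number is a plain unweighted mean over the three benchmarks, which the slide does not state explicitly but which reproduces the 81.2% figure. The p-values cannot be recomputed here because they depend on per-question results that the slides do not include.

```python
# Scores copied from the benchmark slides (percent).
scores = {
    "MassGen": {"GPQA-Diamond": 87.4, "IFEval": 88.0, "MuSR": 68.3},
    "Gemini":  {"GPQA-Diamond": 85.9, "IFEval": 66.0, "MuSR": 69.6},
    "Grok-4":  {"GPQA-Diamond": 85.4, "IFEval": 84.7, "MuSR": 67.6},
    "GPT-5":   {"GPQA-Diamond": 84.8, "IFEval": 87.4, "MuSR": 69.2},
    "Claude":  {"GPQA-Diamond": 68.2, "IFEval": 63.6, "MuSR": 62.8},
}


def mean_score(per_benchmark: dict) -> float:
    """Unweighted mean over benchmarks (assumed aggregation)."""
    return sum(per_benchmark.values()) / len(per_benchmark)


averages = {name: round(mean_score(s), 1) for name, s in scores.items()}
best_overall = max(averages, key=averages.get)

# Count how many benchmarks each system tops.
benchmarks = ["GPQA-Diamond", "IFEval", "MuSR"]
wins = {name: 0 for name in scores}
for benchmark in benchmarks:
    top = max(scores, key=lambda name: scores[name][benchmark])
    wins[top] += 1

# Headroom between the oracle score and MassGen's actual GPQA score.
oracle_gpqa = 95.5
oracle_gap = round(oracle_gpqa - scores["MassGen"]["GPQA-Diamond"], 1)

for name, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} average {avg:.1f}%  benchmark wins {wins[name]}")
print(f"Oracle gap on GPQA: {oracle_gap} points")
```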

🎯 Agentic Recommendation Applications

Multi-Agent Personalization Pipeline

🔍 Content Analysis
Multiple agents analyze user behavior, content features, and contextual signals simultaneously

⚖️ Preference Fusion
Collaborative filtering, content-based, and deep learning agents debate optimal recommendations

🎯 Dynamic Ranking
Real-time consensus building for personalized item ranking and diversity optimization

Cross-domain expertise, multi-objective optimization, explainable recommendations
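
One simple way to picture the fusion and ranking stages is to merge per-agent ranked lists and then apply a diversity pass. The sketch below swaps in reciprocal rank fusion (RRF), a standard rank-aggregation technique, as a stand-in for the agents' debate. It is not how MassGen reaches consensus, and the agent names, item ids, categories, and the `max_per_category` rule are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: Dict[str, List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each item scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked_items in rankings.values():
        for position, item in enumerate(ranked_items, start=1):
            scores[item] += 1.0 / (k + position)
    # Higher fused score first; item id breaks exact ties deterministically.
    return sorted(scores, key=lambda item: (-scores[item], item))


def diversify(ranked: List[str], category_of: Dict[str, str],
              max_per_category: int = 1) -> List[str]:
    """Greedy pass: cap items per category up front, append the rest after."""
    counts: Dict[str, int] = defaultdict(int)
    head: List[str] = []
    deferred: List[str] = []
    for item in ranked:
        category = category_of[item]
        if counts[category] < max_per_category:
            counts[category] += 1
            head.append(item)
        else:
            deferred.append(item)
    return head + deferred


# Hypothetical output of three recommender agents for one user.
agent_rankings = {
    "collaborative_filtering": ["A", "B", "C"],
    "content_based": ["B", "A", "D"],
    "deep_learning": ["B", "C", "A"],
}
categories = {"A": "drama", "B": "drama", "C": "comedy", "D": "documentary"}

fused = reciprocal_rank_fusion(agent_rankings)
final_ranking = diversify(fused, categories)
print(fused)
print(final_ranking)
```

Here item B leads the fused list because two of the three agents rank it first, and the diversity pass moves the second drama title behind the comedy and documentary items.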

🎬 Live Demonstrations

🌐 LLM Fun Facts Website (v0.0.14): Claude Code agents create interactive websites with enhanced logging and workspace isolation
Result: Conflict-free parallel development with comprehensive versioning
📁 Unified Filesystem (v0.0.16): Cross-backend collaboration between Gemini and Claude Code agents creating educational content with shared workspace management
Result: First-time cross-backend coordination producing comprehensive 25-slide presentations
🏆 IMO 2025 Winner Research: Multi-agent fact-checking → unanimous consensus on Google DeepMind victory
Result: Accurate identification despite conflicting information
💰 Technical Analysis: Complex Grok-4 HLE pricing calculation through iterative refinement
Result: Accurate cost estimates through collaborative validation
📚 case.massgen.ai - Complete Case Studies

⚡ Get Started in 60 Seconds

# 1. Clone and setup
git clone https://github.com/Leezekun/MassGen
cd MassGen && pip install uv && uv venv

# 2. Configure API keys
cp .env.example .env # Add your API keys

# 3. Run single agent (quick test)
uv run python -m massgen.cli --model gemini-2.5-flash "When is your knowledge up to"

# 4. Run multi-agent collaboration
uv run python -m massgen.cli --config three_agents_default.yaml "Summarize latest news of github.com/Leezekun/MassGen"

✅ Supported Models & Providers

🏢 Major Providers:
Anthropic Claude & Claude Code • Google Gemini • OpenAI GPT • xAI Grok • ZAI GLM
🏠 Local & Extended:
Cerebras • Fireworks • Groq • LM Studio • OpenRouter • Together...

🛠️ Advanced Tools

Web Search • Code Execution • MCP Tools • File Operations • Browser Automation • Advanced Permissions

🚀 Build Agentic Recommendation Systems

Multi-Agent Personalization at Scale
🚀 v0.0.3 → v0.0.23 Latest | 10+ providers; Claude Code CLI; GPT-5 & GPT-OSS; Local Models; MCP Integration; Unified Filesystem
Thank you RecSys'25!
Questions & Discussion