Diagram showing collaborative AI agents working together in parallel threads
Scaling AI Through Multi-Agent Collaboration
🎓 M2L Summer School
📍 Split • September 11, 2025
🌐 massgen.ai | GitHub
🚫

The Single-Agent Limitation

  • Siloed Thinking: Single models miss diverse perspectives
  • Limited Context: No peer review or validation
  • Sequential Processing: Linear, not parallel exploration
  • Fixed Approach: Limited mid-task adaptation to new information
Illustration demonstrating the isolation and limitations of single-agent AI systems
From Isolation to Collaboration
🤝

The Promise of Multi-Agent Collaboration

  • Study Group Dynamics: Like humans collaborating on complex problems
  • Cross-Model Synergy: Leverage unique strengths of Claude, Gemini, GPT, Grok
  • Parallel Processing: Multiple perspectives tackle same task simultaneously
  • Real-time Intelligence Sharing: Agents learn and adapt from each other
Visual representation of collaborative AI reasoning and cognitive processes
The Promise of Collaborative Reasoning
🏗️ Built on AG2's foundational multi-agent research and community
🏗️

AG2: The Foundation for Multi-Agent Research

AG2 research foundation and community history
🤝 Community-Driven Innovation
MassGen evolved from AG2's pioneering work in multi-agent conversations
and the vibrant research community it fostered
📈

Proven Performance Gains - Grok Heavy Evidence

Grok-4 Standard: Single Agent Processing (1 agent)
Humanity's Last Exam Score: 38.6%
Price: $30/month
Grok-4 Heavy: Multi-Agent Collaboration (agents A1, A2, A3)
Humanity's Last Exam Score: 44.4%
Price: $300/month
+15% Performance Boost (relative: 38.6% → 44.4%, +5.8 points)
Multi-agent "study group" approach outperforms single agent
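The "+15%" figure is a relative gain rather than a difference in percentage points. A quick check of the arithmetic, using only the two scores and the two prices quoted on this slide (the variable names are illustrative):

```python
# Scores and prices as quoted on the slide.
standard_score, heavy_score = 38.6, 44.4   # benchmark scores, percent
standard_price, heavy_price = 30, 300      # USD per month

absolute_gain = heavy_score - standard_score            # percentage points
relative_gain = absolute_gain / standard_score * 100    # percent of the baseline
price_ratio = heavy_price / standard_price

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.1f}%")
print(f"price ratio:   {price_ratio:.0f}x")
```

The gain is about 5.8 points in absolute terms, which is roughly 15% relative to the single-agent baseline, at ten times the subscription price.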

"The exploration of the art & science of multi-agent collaboration has just started."
🚀
MassGen Orchestrator: Task Distribution & Coordination
↓
🏗️ Agent 1: Anthropic/Claude
👨‍💻 Agent 2: Claude Code
🌟 Agent 3: Google/Gemini
🤖 Agent 4: OpenAI/GPT
⚡ Agent 5: xAI/Grok
↕ Real-time Collaboration ↕
↓
🔄 Shared Collaboration Hub: Real-time Notification & Consensus

Key Features & Capabilities

  • 🤝 Cross-Model Synergy: Harness strengths from diverse models
  • ⚡ Parallel Processing: Multiple agents tackle problems simultaneously
  • 🔄 Iterative Refinement: Non-linear reasoning through cycles of improvement
  • 👥 Intelligence Sharing: Agents share working summaries, tool results, and insights in real time
  • 🎯 Consensus Building: Natural convergence through collaboration
Diagram showing iterative refinement cycles in multi-agent collaboration
Iterative Refinement: The Reality of Reasoning
⚙️

Tech Deep Dive: Async Streaming & Dynamic Scheduling

  • 🔄 AsyncGenerator Pattern: Real-time streaming from 5+ agents simultaneously
  • ⚡ Dynamic Task Management: Agents start/stop based on voting status
  • 🔁 Graceful Restart & Wrap-up: Dynamic wrapping-up as part of scheduling
Diagram showing the orchestrator streaming content, tool_calls, and reasoning from Agents 1-4, with a restart triggered when Agent 2 provides a new_answer and the other agents entering a restarting state
Key Innovation: Dynamic coordination without deadlocks
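The AsyncGenerator pattern can be sketched with the standard library alone. This is a minimal illustration, not MassGen's implementation: `agent_stream`, `merge_streams`, and the chunk strings are hypothetical, and the restart and wrap-up logic is left out. It shows only how one consumer can receive chunks from several agents as soon as any of them produces one.

```python
import asyncio
from typing import AsyncGenerator, Dict, Tuple

_DONE = object()  # sentinel marking an exhausted stream


async def agent_stream(name: str, n_chunks: int) -> AsyncGenerator[str, None]:
    """Hypothetical agent: yields a few text chunks, pausing between them."""
    for i in range(n_chunks):
        await asyncio.sleep(0)  # stand-in for waiting on a model API
        yield f"{name}: chunk {i}"


async def _next_chunk(gen: AsyncGenerator[str, None]):
    """Fetch the next chunk, or the sentinel when the stream has ended."""
    try:
        return await gen.__anext__()
    except StopAsyncIteration:
        return _DONE


async def merge_streams(
    streams: Dict[str, AsyncGenerator[str, None]]
) -> AsyncGenerator[Tuple[str, str], None]:
    """Yield (agent, chunk) pairs in arrival order across all agents."""
    # One pending "next chunk" task per agent.
    tasks = {asyncio.create_task(_next_chunk(g)): n for n, g in streams.items()}
    while tasks:
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            name = tasks.pop(task)
            chunk = task.result()
            if chunk is _DONE:
                continue  # this agent finished; do not reschedule it
            yield name, chunk
            # Ask the same agent for its next chunk.
            tasks[asyncio.create_task(_next_chunk(streams[name]))] = name


async def main():
    streams = {"agent1": agent_stream("agent1", 2), "agent2": agent_stream("agent2", 3)}
    return [pair async for pair in merge_streams(streams)]


results = asyncio.run(main())
for name, chunk in results:
    print(name, "->", chunk)
```

Because each agent has at most one pending task, a scheduler built this way could cancel or replace a single agent's task (for example on a restart) without touching the others.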
🔧

Tech Deep Dive: Backend Abstraction Challenges

  • 🎭 Unified Interface: Standardized ChatAgent protocol for 8+ different backends
  • 🛠️ Tool Integration: Web search, code execution, MCP
  • ⚙️ StreamChunk Normalization: Convert diverse response formats to a common protocol
  • 🔀 Backend-Specific Workarounds: Each provider has unique limitations
Backend Challenges:
  • Claude Code CLI: Context sharing across agents
  • Gemini API: Can't mix built-in and custom tools
  • GPT-5: API changes (reasoning, streaming, etc.)
  • Most Backends: Unable to autonomously collaborate
🎯 Our Solution:
Binary Decision Framework & Advanced Workspace Sharing
Result: Unified interface with backend-specific optimizations
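To make the normalization idea concrete, here is a minimal sketch. Only the name `StreamChunk` comes from the slide. Its fields, the `Backend` protocol, and both raw event formats are assumptions invented for illustration and do not describe any real provider's API.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Protocol


@dataclass
class StreamChunk:
    """Common chunk shape (fields are assumed, not MassGen's actual schema)."""
    type: str      # e.g. "content", "tool_calls", "reasoning"
    content: str
    source: str    # which backend produced the chunk


class Backend(Protocol):
    def stream(self, raw_events: Iterable[dict]) -> Iterator[StreamChunk]: ...


class DeltaStyleBackend:
    """Hypothetical provider whose events look like {"delta": {"text": ...}}."""

    def stream(self, raw_events: Iterable[dict]) -> Iterator[StreamChunk]:
        for event in raw_events:
            yield StreamChunk("content", event["delta"]["text"], "delta_style")


class PartsStyleBackend:
    """Hypothetical provider whose events carry a list of typed parts."""

    def stream(self, raw_events: Iterable[dict]) -> Iterator[StreamChunk]:
        for event in raw_events:
            for part in event["parts"]:
                yield StreamChunk(part["kind"], part["value"], "parts_style")


def collect_text(backend: Backend, raw_events: Iterable[dict]) -> str:
    """Downstream code sees only StreamChunk, whatever the backend."""
    return "".join(c.content for c in backend.stream(raw_events) if c.type == "content")


delta_text = collect_text(DeltaStyleBackend(), [{"delta": {"text": "Hi"}}, {"delta": {"text": "!"}}])
parts_text = collect_text(
    PartsStyleBackend(),
    [{"parts": [{"kind": "reasoning", "value": "thinking"}, {"kind": "content", "value": "ok"}]}],
)
print(delta_text, parts_text)
```

The orchestrator-side code (`collect_text` here) never inspects provider-specific fields, so a backend-specific workaround stays inside the adapter that needs it.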
🎯

Tech Deep Dive: Binary Decision Framework Solution

  • ⚖️ Binary Choice: Each agent must choose: vote OR new_answer
  • 💥 Vote Invalidation: Any new_answer invalidates ALL existing votes
  • 🔄 Reset & Restart: All agents restart with updated answer context
  • 🎭 Anonymous Voting: Agents see "agent1", "agent2" etc.
Diagram showing two rounds: in Round 1, Agents 1, 3, and 4 vote for Agent 4 while Agent 2 has not yet voted; Agent 2 then provides a new_answer, which invalidates the Round 1 votes; in Round 2, all agents restart with 2 available answers and each decides again: vote OR new_answer
Key Innovation: Dynamic equilibrium through vote invalidation
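The vote-invalidation rule can be sketched as a small state object. This is a hedged illustration: the class and method names are hypothetical, and two details are assumptions the slide does not state, namely that coordination ends once every agent holds a valid vote, and that ties go to whichever answer was voted for first.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Coordination:
    answers: Dict[str, str] = field(default_factory=dict)  # agent id -> latest answer
    votes: Dict[str, str] = field(default_factory=dict)    # voter id -> agent id voted for

    def new_answer(self, agent: str, text: str) -> None:
        """Record an answer; any new answer invalidates ALL existing votes."""
        self.answers[agent] = text
        self.votes.clear()

    def vote(self, voter: str, target: str) -> None:
        """Record a vote for an agent that has already provided an answer."""
        if target not in self.answers:
            raise ValueError(f"{target} has no answer to vote for")
        self.votes[voter] = target

    def winner(self, agents: List[str]) -> Optional[str]:
        """Top-voted agent once every agent has a valid vote (assumed rule), else None."""
        if set(self.votes) != set(agents):
            return None
        return Counter(self.votes.values()).most_common(1)[0][0]


agents = ["agent1", "agent2", "agent3", "agent4"]
state = Coordination()

# Round 1: agent4 answers; agents 1, 3, and 4 vote for it; agent2 has not voted yet.
state.new_answer("agent4", "first answer")
for voter in ("agent1", "agent3", "agent4"):
    state.vote(voter, "agent4")
round1_votes = len(state.votes)

# agent2 provides a new answer instead of voting: the Round 1 votes are wiped.
state.new_answer("agent2", "improved answer")
votes_after_new_answer = len(state.votes)

# Round 2: everyone decides again with two answers available.
for voter in agents:
    state.vote(voter, "agent2")
final_winner = state.winner(agents)
print(round1_votes, votes_after_new_answer, final_winner)
```

Clearing the votes on every new answer means a winner can only emerge from votes cast with the full, current set of answers in view.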
⚠️

The Context Sharing Challenge

❌ Naive Approach 1: Share Answers Only

  • Agents only see final text answers
  • Can't verify methodology or data
  • Unable to test or build upon work
  • Lost intermediate context

❌ Naive Approach 2: Share Workspace Paths

  • Agents interfere with each other's work
  • Data corruption from simultaneous edits
  • Loss of original work context
  • Workspace pollution and conflicts
Diagram contrasting the two naive approaches: answer-only sharing (Agent 1 passes only "Answer text" to Agent 2, so no verification is possible) and workspace sharing (Agent 1 and Agent 2 write to one shared workspace, causing conflicts, data corruption, and interference)
The Challenge: How to share context without interference?
✅

Our Context Sharing Solution

  • 📸 Workspace Snapshots: Orchestrator captures agent workspaces after each round
  • 📁 Temporary Directories: Each agent gets a clean temp workspace with all snapshots
  • 🎭 Anonymous Mapping: agent1/, agent2/ folders preserve anonymity
  • 🔒 Clean Separation: Read from temp dir, write to permanent workspace
  • 🔄 Context Preservation: Snapshots linked to coordination rounds
Diagram showing the orchestrator's snapshot storage (agent1/, agent2/) feeding each agent's temp context: each agent keeps its own permanent workspace and receives its own temp copy of agent1/ and agent2/. ✅ Safe Context Sharing + ✅ No Interference + ✅ Full Verification
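The snapshot-and-temp-copy scheme can be sketched with `shutil` and `tempfile`. The function names and directory layout below are hypothetical, and this sketch keeps only the latest snapshot per agent, whereas the slide describes snapshots linked to coordination rounds.

```python
import shutil
import tempfile
from pathlib import Path


def take_snapshot(workspace: Path, snapshot_root: Path, anon_id: str) -> Path:
    """Copy an agent's permanent workspace into snapshot storage under an anonymous name."""
    dest = snapshot_root / anon_id
    if dest.exists():
        shutil.rmtree(dest)  # simplification: keep only the latest snapshot
    shutil.copytree(workspace, dest)
    return dest


def build_temp_context(snapshot_root: Path) -> Path:
    """Give one agent its own private copy of every snapshot (agent1/, agent2/, ...)."""
    temp_dir = Path(tempfile.mkdtemp(prefix="context_"))
    for snap in snapshot_root.iterdir():
        shutil.copytree(snap, temp_dir / snap.name)
    return temp_dir


# Demo: Agent 1 writes a file, the orchestrator snapshots it, Agent 2 gets a copy.
base = Path(tempfile.mkdtemp(prefix="sketch_"))
agent1_ws = base / "agent1_workspace"
agent1_ws.mkdir()
(agent1_ws / "analysis.py").write_text("print('v1')\n")

snapshots = base / "snapshots"
snapshots.mkdir()
take_snapshot(agent1_ws, snapshots, "agent1")

agent2_context = build_temp_context(snapshots)
# Agent 2 experiments on its private copy; Agent 1's original is untouched.
(agent2_context / "agent1" / "analysis.py").write_text("print('edited by agent 2')\n")

original_text = (agent1_ws / "analysis.py").read_text()
edited_text = (agent2_context / "agent1" / "analysis.py").read_text()
print(repr(original_text), repr(edited_text))

# Clean up the demo directories.
shutil.rmtree(base)
shutil.rmtree(agent2_context)
```

Since every agent reads from its own copy, an edit made while testing another agent's code cannot reach the permanent workspace it came from.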
🎬

Context Sharing in Action

🔬 Round 1: Agent 1 (Data Scientist)

  • Creates analysis.py and results.csv
  • Saves to permanent workspace
  • 📸 Snapshot captured

🧪 Round 2: Agent 2 (Code Reviewer)

  • Sees agent1/analysis.py in temp workspace
  • Reads & tests the analysis code
  • Modifications in temp dir don't affect Agent 1
  • Creates improved_analysis.py in own workspace

🎯 Final Presentation

  • Winning agent has full context
  • Can reference both agents' work
  • Snapshots ensure correct version access

🗂️ Workspace Structure

👨‍💼 Agent 1 Permanent Workspace
📄 analysis.py
📊 results.csv
📝 methodology.md
👁️ Agent 2 Temp Workspace (Read-Only Context)
📁 agent1/
  📄 analysis.py ✅ readable & testable
  📊 results.csv
  📝 methodology.md
👨‍💻 Agent 2 Permanent Workspace
📄 improved_analysis.py
📋 code_review.md
🧪 test_results.json
🔑 Key Benefits Illustrated:
✅ Agent 2 can READ & execute Agent 1's work
✅ Temp modifications don't corrupt original
✅ Each agent maintains workspace integrity
✅ Final answer has complete context
📊

Benchmarking: Preliminary Results

Scientific evaluation across graduate-level reasoning, instruction-following, and narrative tasks

🧪 GPQA-Diamond

Graduate Physics/Chemistry
MassGen: 87.4% 🏆
Gemini: 85.9%
Grok 4: 85.4%
GPT-5: 84.8%
Claude: 68.2%

📋 IFEval

Instruction Following
MassGen: 88.0% 🏆
GPT-5: 87.4%
Grok 4: 84.7%
Gemini: 66.0%
Claude: 63.6%

📖 MuSR

Narrative Reasoning
Gemini: 69.6% 🏆
GPT-5: 69.2%
MassGen: 68.3%
Grok 4: 67.6%
Claude: 62.8%
🏆 Overall Champion
MassGen: 81.2%
Wins 2/3 benchmarks • Statistically significant
✅ Key Results:
  • Highest on 2/3 benchmarks
  • Best overall average
  • Consistent performance
📈 Statistical:
  • vs Claude: p = 1.4e-07 ⭐⭐⭐
  • vs Gemini: p = 1.1e-28 ⭐⭐⭐
  • Unlikely to be due to chance
🔬 Research Gap:
  • Oracle: 95.5% (GPQA)
  • Actual: 87.4%
  • Potential: 8.1 points
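The summary figures can be re-derived from the per-benchmark scores on this slide. The sketch below assumes the "overall" number is an unweighted mean of the three benchmarks, which does reproduce the quoted 81.2% for MassGen. The p-values are not recomputed, since the slide gives no per-question data.

```python
# Per-benchmark scores (percent) as quoted on the slide.
scores = {
    "MassGen": {"GPQA-Diamond": 87.4, "IFEval": 88.0, "MuSR": 68.3},
    "Gemini":  {"GPQA-Diamond": 85.9, "IFEval": 66.0, "MuSR": 69.6},
    "Grok 4":  {"GPQA-Diamond": 85.4, "IFEval": 84.7, "MuSR": 67.6},
    "GPT-5":   {"GPQA-Diamond": 84.8, "IFEval": 87.4, "MuSR": 69.2},
    "Claude":  {"GPQA-Diamond": 68.2, "IFEval": 63.6, "MuSR": 62.8},
}

# Unweighted mean across benchmarks (assumed definition of "overall").
averages = {name: sum(s.values()) / len(s) for name, s in scores.items()}
ranking = sorted(averages, key=averages.get, reverse=True)

# Top scorer on each benchmark.
benchmarks = ["GPQA-Diamond", "IFEval", "MuSR"]
wins = {b: max(scores, key=lambda name: scores[name][b]) for b in benchmarks}

# Gap between the oracle score and the actual score on GPQA.
oracle_gap = 95.5 - scores["MassGen"]["GPQA-Diamond"]

for name in ranking:
    print(f"{name:8s} {averages[name]:.1f}%")
print(wins)
print(f"oracle gap: {oracle_gap:.1f} points")
```

MassGen's lead in the overall average is narrow next to the strongest single model, which is worth keeping in mind alongside the 2/3 benchmark wins.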
🔬

Case Study: Success Through Peer Correction

Graduate-level physics question from GPQA-Diamond benchmark

🌌 The Problem

A quasar shows a peak at 790 nm wavelength. Given Lambda-CDM cosmological parameters (H₀ = 70 km/s/Mpc, Ωₘ = 0.3, ΩΛ = 0.7), what is the comoving distance?

Options: A) 8 Gpc B) 7 Gpc C) 6 Gpc D) 9 Gpc

🎯 Final Result

✅
Correct Answer: A (8 Gpc)
Orchestration succeeded where individual agents initially failed

🤖 Round 1: Initial Answers

Claude: "I calculate ~6 Gpc → Answer C"
GPT-5: "I get ~8.95 Gpc → Answer D"
Gemini: "~6.1 Gpc → Answer C"

🔄 Self-Correction Process

Claude observes: "There is significant discrepancy in calculations: Agent1 gets ~6.1 Gpc, Agent2 gets ~8.95 Gpc. Let me re-examine..."

✨ Breakthrough Moment

Claude revises: "Standard cosmological calculators yield 8000-8500 Mpc for z=5.5. This equals 8.0-8.5 Gpc, closest to option A."

Result: 3/4 agents converge on the correct answer
💡 Success Mechanism:
Peer observation → Discrepancy detection → Self-correction → Consensus
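The revised answer can be checked numerically. The sketch below takes z = 5.5 from Claude's revision and the cosmological parameters from the problem statement, then integrates the standard flat Lambda-CDM expression D_C = (c/H₀) ∫₀ᶻ dz' / √(Ωₘ(1+z')³ + ΩΛ) with Simpson's rule. Radiation is neglected, and the function name and step count are illustrative choices.

```python
import math

C_KM_S = 299_792.458  # speed of light in km/s


def comoving_distance_gpc(z, h0=70.0, omega_m=0.3, omega_lambda=0.7, steps=10_000):
    """Line-of-sight comoving distance in flat Lambda-CDM (radiation neglected), in Gpc."""

    def inv_e(zp):
        # 1 / E(z') for a flat universe with matter and a cosmological constant.
        return 1.0 / math.sqrt(omega_m * (1.0 + zp) ** 3 + omega_lambda)

    # Composite Simpson's rule on [0, z]; `steps` must be even.
    h = z / steps
    total = inv_e(0.0) + inv_e(z)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * inv_e(i * h)
    integral = total * h / 3.0

    hubble_distance_mpc = C_KM_S / h0  # c / H0 in Mpc
    return hubble_distance_mpc * integral / 1000.0  # Mpc -> Gpc


d = comoving_distance_gpc(5.5)
options = {"A": 8, "B": 7, "C": 6, "D": 9}
closest = min(options, key=lambda k: abs(options[k] - d))
print(f"{d:.2f} Gpc, closest option: {closest}")
```

Under these assumptions the result lands close to 8 Gpc, in line with the 8.0-8.5 Gpc range quoted in the revision and with option A.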
⚠️

Case Study: When Good Analysis Goes Wrong

Understanding current limitations and improvement opportunities

🔭 The Problem

"How many stars would be detectable using the ESPRESSO spectrograph with S/N ≥ 10 in 1-hour exposure?"

Candidates: Canopus, Polaris, and 4 synthetic stars
Correct Answer: 2 stars (Canopus saturates detector)

❌ Orchestration Failed

✗
Wrong Answer: 3 stars
Despite Claude having the correct reasoning

✅ Claude (Correct)

"Canopus is too bright and would saturate the detector in a 1-hour exposure. Only 2 stars are detectable without saturation."

❌ Others (Detailed but Wrong)

GPT-5, Gemini, Grok: Provided comprehensive magnitude calculations but failed to consider detector saturation → concluded 3 stars

🎯 What Happened

Orchestration selected Gemini for having "most accurate and comprehensive reasoning," prioritizing analysis quality over correctness
🔍 Root Cause:
System confused detailed explanation with correctness
💡 Improvement Opportunity:
Better balance reasoning quality and answer validation
🧠

Coordination Psychology: How Voting Behavior Changes

Controlled experiments reveal how coordination mechanisms shape multi-agent behavior

🎭 Identity Disclosure Effects

Anonymous Voting (Baseline)

GPT-5 Self-Voting: 81.0%
Agents see: "agent1", "agent2", "agent3", "agent4"

Identified Voting

GPT-5 Self-Voting: 88.4%
+7.4 percentage points in self-voting
Consensus ties: 14.1% → 23.2%
Agents see: "GPT-5", "Claude", "Gemini", "Grok"

👁️ Vote Visibility Effects

Hidden Tally (Baseline)

First-voted Selected: 54.1%
Agents vote independently without seeing others' choices

Visible Tally

First-voted Selected: 67.8%
+13.7 percentage points of herding behavior
Especially strong: GPT-5 (+40%)
Agents see ongoing vote counts and reasons

🎯 Key Behavioral Insights

🎭 Identity Matters
Knowing "who said what" increases bias and reduces consensus
Herding Behavior
Visible votes create momentum effects, sometimes premature consensus
🚀

MassGen Evolution

🏗️

Foundation Era

v0.0.1 - v0.0.3
Core framework, basic streaming,
Claude, Gemini, GPT/o, Grok
LATEST
🚀

Rapid Evolution

v0.0.3 → v0.0.17
Claude Code CLI, GPT-5, 10+ providers
MCP integration, browsing, coding
Unified filesystem & enhanced tooling
Industrial & academic adoption
15 Releases • 25+ Days Foundation→Expansion
🌟

Early Adopters & Community Growth

🏢 Industrial Institutions

  • Microsoft Azure
  • Gradient
  • Sparsity
  • Leading Trading Firm
  • AgentWeb

🎓 Academic Institutions

  • Simon Fraser University
  • UC Berkeley & Santa Cruz
  • University of Pennsylvania
  • University of Sydney
  • University of Toronto

🚀 Open Source Community

Active development
Growing contributor base
Global adoption
Research partnerships
MassGen Contributors Wall
🤝 Community Contributors - Thank you!
🎬

Live Demonstrations

🌐 LLM Fun Facts Website (v0.0.14): Claude Code agents create interactive websites with enhanced logging and workspace isolation
Result: Conflict-free parallel development with comprehensive versioning
📁 Unified Filesystem (v0.0.16): Cross-backend collaboration between Gemini and Claude Code agents creating educational content with shared workspace management
Result: First-time cross-backend coordination producing comprehensive 25-slide presentations
🏆 IMO 2025 Winner Research: Multi-agent fact-checking → unanimous consensus on Google DeepMind victory
Result: Accurate identification despite conflicting information
💰 Technical Analysis: Complex Grok-4 HLE pricing calculation through iterative refinement
Result: Accurate cost estimates through collaborative validation
📚 case.massgen.ai - Complete Case Studies
🔮

Multi-Agent Scaling Opportunity

Diversity of Thought Drives Quality

🧠

Complex Analysis

🎨

Creative Tasks

🔄

Coordination

Multiple perspectives, cross-verification, iterative refinement

⚡

Get Started in 60 Seconds

# 1. Clone and setup
git clone https://github.com/Leezekun/MassGen
cd MassGen && pip install uv && uv venv

# 2. Configure API keys
cp .env.example .env # Add your API keys

# 3. Run single agent (quick test)
uv run python -m massgen.cli --model gemini-2.5-flash "When is your knowledge up to"

# 4. Run multi-agent collaboration
uv run python -m massgen.cli --config three_agents_default.yaml "Summarize latest news of github.com/Leezekun/MassGen"

✅ Supported Models & Providers

🏢 Major Providers:
Anthropic Claude & Claude Code • Google Gemini • OpenAI GPT • xAI Grok • ZAI GLM
🏠 Local & Extended:
Cerebras • Fireworks • Groq • LM Studio • OpenRouter • Together...

🛠️ Advanced Tools

Web Search • Code Execution • MCP Tools • File Operations • Browser Automation
🔮

Vision: The Path to Exponential Intelligence

  • Hurdles: Shared memory, context, interoperability
  • Roadmap: More models/agents, web UI
  • Vision: Recursive agents bootstrapping intelligence
Diagram showing three tiers with scale markers 1×, 10×, and 100×: Agents (Grok, Gemini, Claude, GPT, AG2), Systems (Grok Heavy, DeepThink, Claude Code, ChatGPT, AG2), and Orchestrator (MassGen), with Consensus and Shared Context marked as challenges
The Path to Exponential Intelligence
🚀

Join the Multi-Agent Revolution

Build Scalable, Collaborative AI Systems
🚀 v0.0.3→v0.0.17 Latest | 10+ providers; Claude Code CLI; GPT-5 & GPT-OSS; Local Models; MCP Integration; Unified Filesystem
Thank you M2L Summer School!
Questions & Discussion