Scaling AI Through Multi-Agent Collaboration
Columbia University
DAPLab
New York • Fall 2025
The Single-Agent Limitation
Siloed Thinking: Single models miss diverse perspectives
Limited Context: No peer review or validation
Sequential Processing: Linear, not parallel exploration
Fixed Approach: Limited mid-task adaptation to new information
From Isolation to Collaboration
The Promise of Multi-Agent Collaboration
Study Group Dynamics: Like humans collaborating on complex problems
Cross-Model Synergy: Leverage unique strengths of Claude, Gemini, GPT, Grok
Parallel Processing: Multiple perspectives tackle same task simultaneously
Real-time Intelligence Sharing: Agents learn and adapt from each other
The Promise of Collaborative Reasoning
Built on AG2's foundational multi-agent research and community
AG2: The Foundation for Multi-Agent Research
Community-Driven Innovation
MassGen evolved from AG2's pioneering work in multi-agent conversations and the vibrant research community it fostered
Proven Performance Gains - Grok Heavy Evidence
Grok-4 Standard
Single-Agent Processing
38.6%
Humanity's Last Exam Score
$30/month
Grok-4 Heavy
Multi-Agent Collaboration
44.4%
Humanity's Last Exam Score
$300/month
+15% Relative Performance Boost
Multi-agent "study group" approach outperforms single agent
"The exploration of the art & science of multi-agent collaboration has just started."
MassGen Orchestrator: Task Distribution & Coordination
Agent 1 (Anthropic/Claude) | Agent 2 (Claude Code) | Agent 3 (Google/Gemini)
Real-time Collaboration
Shared Collaboration Hub: Real-time Notification & Consensus
Key Features & Capabilities
Cross-Model Synergy: Harness strengths from diverse models
Parallel Processing: Multiple agents tackle problems simultaneously
Iterative Refinement: Non-linear reasoning through cycles of improvement
Intelligence Sharing: Agents share working summaries, tool results, and insights in real time
Consensus Building: Natural convergence through collaboration
Iterative Refinement: The Reality of Reasoning
Tech Deep Dive: Async Streaming & Dynamic Scheduling
AsyncGenerator Pattern: Real-time streaming from 5+ agents simultaneously
Dynamic Task Management: Agents start and stop based on voting status
Graceful Restart & Wrap-up: Dynamic wrap-up as part of scheduling
Orchestrator streams content, tool_calls, and reasoning chunks from Agents 1-4
Restart Trigger: when Agent 2 provides a new_answer, the other agents restart
Key Innovation: Dynamic coordination without deadlocks
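The fan-in streaming pattern above can be sketched in plain asyncio. This is an illustrative sketch, not MassGen's actual code: `agent_stream` stands in for a real backend, and all names here are hypothetical.

```python
import asyncio
from typing import AsyncGenerator

async def agent_stream(name: str, chunks: list[str]) -> AsyncGenerator[tuple[str, str], None]:
    # Stand-in for a model backend: yields (agent, chunk) with simulated latency.
    for chunk in chunks:
        await asyncio.sleep(0.01)
        yield name, chunk

async def merge_streams(*streams) -> AsyncGenerator[tuple[str, str], None]:
    # Fan-in: forward chunks from all agent streams as they arrive.
    queue: asyncio.Queue = asyncio.Queue()
    DONE = object()  # sentinel marking one stream's completion

    async def pump(stream):
        async for item in stream:
            await queue.put(item)
        await queue.put(DONE)

    tasks = [asyncio.create_task(pump(s)) for s in streams]
    finished = 0
    while finished < len(tasks):
        item = await queue.get()
        if item is DONE:
            finished += 1
        else:
            yield item

async def main() -> list[tuple[str, str]]:
    a1 = agent_stream("agent1", ["content", "tool_calls"])
    a2 = agent_stream("agent2", ["reasoning", "new_answer"])
    return [item async for item in merge_streams(a1, a2)]

received = asyncio.run(main())
print(received)
```

Because chunks are forwarded as they arrive, the orchestrator can react mid-stream, e.g. cancel the pump tasks when a `new_answer` chunk triggers a restart.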
Tech Deep Dive: Backend Abstraction Challenges
Unified Interface: Standardized ChatAgent protocol for 8+ different backends
Tool Integration: Web search, code execution, MCP
StreamChunk Normalization: Convert diverse response formats to a common protocol
Backend-Specific Workarounds: Each provider has unique limitations
Backend Challenges:
Claude Code CLI: context sharing across agents
Gemini API: cannot mix built-in and custom tools
GPT-5: API changes (reasoning, streaming, etc.)
Most backends: unable to collaborate autonomously
Our Solution: Binary Decision Framework & Advanced Workspace Sharing
Result: Unified interface with backend-specific optimizations
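A minimal sketch of StreamChunk normalization: per-backend adapters map provider-specific event shapes onto one common chunk type. The event shapes below are simplified approximations of each provider's streaming format, and the class and function names are assumptions for illustration, not MassGen's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class StreamChunk:
    # Common protocol every backend adapter must emit (hypothetical shape).
    type: str     # "content" | "tool_calls" | "reasoning"
    content: Any
    source: str   # which backend produced it

def from_openai_style(event: Dict[str, Any]) -> StreamChunk:
    # OpenAI-style deltas nest text under choices[0].delta.content.
    delta = event["choices"][0]["delta"]
    return StreamChunk(type="content", content=delta.get("content", ""), source="openai")

def from_gemini_style(event: Dict[str, Any]) -> StreamChunk:
    # Gemini-style responses carry text in candidates[0].content.parts.
    parts = event["candidates"][0]["content"]["parts"]
    return StreamChunk(type="content", content="".join(p["text"] for p in parts), source="gemini")

ADAPTERS: Dict[str, Callable[[Dict[str, Any]], StreamChunk]] = {
    "openai": from_openai_style,
    "gemini": from_gemini_style,
}

def normalize(backend: str, event: Dict[str, Any]) -> StreamChunk:
    return ADAPTERS[backend](event)

# Two differently shaped events normalize to the same protocol:
c1 = normalize("openai", {"choices": [{"delta": {"content": "Hello"}}]})
c2 = normalize("gemini", {"candidates": [{"content": {"parts": [{"text": "Hello"}]}}]})
print(c1.content == c2.content)
```

The orchestrator then only ever consumes `StreamChunk`s, so backend quirks stay inside the adapters.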
Tech Deep Dive: Binary Decision Framework Solution
Binary Choice: Each agent must choose: vote OR new_answer
Vote Invalidation: Any new_answer invalidates ALL existing votes
Reset & Restart: All agents restart with updated answer context
Anonymous Voting: Agents see "agent1", "agent2", etc.
Round 1: Agents 1, 3, and 4 vote for Agent 4; Agent 2 has not voted yet
Agent 2 provides a new_answer
Round 2: All agents restart with 2 available answers
Votes from Round 1: INVALID
Each agent decides again: vote OR new_answer
Any new_answer resets ALL votes
Key Innovation: Dynamic equilibrium through vote invalidation
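The vote-or-new_answer loop can be sketched as follows. Class and method names are illustrative assumptions, not MassGen's actual implementation; ties are broken arbitrarily in this sketch.

```python
class BinaryDecisionState:
    """Minimal sketch of the binary decision framework with vote invalidation."""

    def __init__(self, agents):
        self.agents = list(agents)
        self.answers = {}   # agent -> answer text
        self.votes = {}     # voter -> agent voted for

    def new_answer(self, agent, text):
        # Recording a new answer invalidates ALL existing votes,
        # forcing every agent to decide again with the new context.
        self.answers[agent] = text
        self.votes.clear()

    def vote(self, voter, target):
        if target not in self.answers:
            raise ValueError(f"{target} has no answer to vote for")
        self.votes[voter] = target

    def consensus(self):
        # Consensus once every agent has voted; leading answer wins.
        if len(self.votes) < len(self.agents):
            return None
        tally = {}
        for target in self.votes.values():
            tally[target] = tally.get(target, 0) + 1
        return max(tally, key=tally.get)

state = BinaryDecisionState(["agent1", "agent2", "agent3", "agent4"])
state.new_answer("agent4", "draft A")
state.vote("agent1", "agent4")
state.vote("agent3", "agent4")
state.new_answer("agent2", "draft B")   # wipes the two existing votes
print(len(state.votes))                 # 0: Round 1 votes are invalid
```

The invalidation rule is what produces the "dynamic equilibrium": the system converges only when no agent prefers contributing a new answer over voting.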
The Context Sharing Challenge
Naive Approach 1: Share Answers Only
Agents only see final text answers
Can't verify methodology or data
Unable to test or build upon work
Lost intermediate context
Naive Approach 2: Share Workspace Paths
Agents interfere with each other's work
Data corruption from simultaneous edits
Loss of original work context
Workspace pollution and conflicts
Approach 1 (Answer Only): Agent 2 sees only Agent 1's answer text; no verification possible
Approach 2 (Workspace Sharing): Agents 1 and 2 edit a shared workspace; conflicts, data corruption, and interference
The Challenge: How to share context without interference?
Our Context Sharing Solution
Workspace Snapshots: Orchestrator captures agent workspaces after each round
Temporary Directories: Each agent gets a clean temp workspace with all snapshots
Anonymous Mapping: agent1/, agent2/ folders preserve anonymity
Clean Separation: Read from the temp dir, write to the permanent workspace
Context Preservation: Snapshots linked to coordination rounds
Orchestrator snapshot storage holds anonymized agent1/ and agent2/ copies
Each agent keeps a permanent workspace and receives its own temp copy of all snapshots for context sharing
Safe Context Sharing + No Interference + Full Verification
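The snapshot-and-temp-copy mechanism can be sketched with the standard library. This is a hypothetical sketch of the idea (function names and directory layout are assumptions): peers' work is read from a disposable copy, and writes go only to an agent's own workspace.

```python
import shutil
import tempfile
from pathlib import Path

def snapshot(workspace: Path, store: Path, anon_name: str) -> Path:
    # Capture an agent's workspace under an anonymous name (e.g. "agent1").
    dest = store / anon_name
    if dest.exists():
        shutil.rmtree(dest)
    shutil.copytree(workspace, dest)
    return dest

def build_temp_context(store: Path) -> Path:
    # Each agent gets its own disposable copy of ALL snapshots to read/test.
    temp = Path(tempfile.mkdtemp(prefix="massgen_ctx_"))
    for snap in store.iterdir():
        shutil.copytree(snap, temp / snap.name)
    return temp

# Demo with throwaway directories:
root = Path(tempfile.mkdtemp())
ws1, store = root / "ws1", root / "snapshots"
ws1.mkdir()
store.mkdir()
(ws1 / "analysis.py").write_text("print('analysis')\n")

snapshot(ws1, store, "agent1")
ctx = build_temp_context(store)

# Agent 2 can read, run, and even edit its copy without touching the original.
(ctx / "agent1" / "analysis.py").write_text("# modified in temp\n")
print((ws1 / "analysis.py").read_text())  # original is untouched
```

Copy-on-read is what removes the interference problem: verification happens on copies, so no locking or coordination between agents' filesystems is needed.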
Context Sharing in Action
Round 1: Agent 1 (Data Scientist)
• Creates analysis.py and results.csv
• Saves to permanent workspace
• Snapshot captured
Round 2: Agent 2 (Code Reviewer)
• Sees agent1/analysis.py in temp workspace
• Reads and tests the analysis code
• Modifications in the temp dir don't affect Agent 1
• Creates improved_analysis.py in own workspace
Final Presentation
• Winning agent has full context
• Can reference both agents' work
• Snapshots ensure correct version access
Workspace Structure
Agent 1 Permanent Workspace: analysis.py, results.csv, methodology.md
Agent 2 Temp Workspace (Read-Only Context): agent1/ with analysis.py (readable & testable), results.csv, methodology.md
Agent 2 Permanent Workspace: improved_analysis.py, code_review.md, test_results.json
Key Benefits Illustrated:
• Agent 2 can READ and execute Agent 1's work
• Temp modifications don't corrupt the original
• Each agent maintains workspace integrity
• The final answer has complete context
Benchmarking: Preliminary Results
Scientific evaluation across graduate-level reasoning, instruction-following, and narrative tasks
GPQA-Diamond: Graduate Physics/Chemistry
IFEval: Instruction Following
MuSR: Narrative Reasoning
Overall Champion
MassGen: 81.2%
Wins 2/3 benchmarks • Statistically significant
Key Results:
• Highest score on 2/3 benchmarks
• Best overall average
• Consistent performance
Statistical Significance:
• vs Claude: p = 1.4e-07 (***)
• vs Gemini: p = 1.1e-28 (***)
• Not due to chance
Research Gap (GPQA):
• Oracle: 95.5%
• Actual: 87.4%
• Potential gain: 8.1 points
Case Study: Success Through Peer Correction
Graduate-level physics question from GPQA-Diamond benchmark
The Problem
A quasar shows a peak at 790 nm wavelength. Given Lambda-CDM cosmological parameters
(H0 = 70 km/s/Mpc, Ωm = 0.3, ΩΛ = 0.7), what is the comoving distance?
Options: A) 8 Gpc B) 7 Gpc C) 6 Gpc D) 9 Gpc
Final Result
Correct Answer: A (8 Gpc)
Orchestration succeeded where individual agents initially failed
Round 1: Initial Answers
Claude: "I calculate ~6 Gpc → Answer C"
GPT-5: "I get ~8.95 Gpc → Answer D"
Gemini: "~6.1 Gpc → Answer C"
Self-Correction Process
Claude observes: "There is significant discrepancy in calculations: Agent1 gets ~6.1 Gpc, Agent2 gets ~8.95 Gpc. Let me re-examine..."
Breakthrough Moment
Claude revises: "Standard cosmological calculators yield 8000-8500 Mpc for z=5.5. This equals 8.0-8.5 Gpc, closest to option A."
Result: 3/4 agents converge on the correct answer
Success Mechanism: Peer observation → Discrepancy detection → Self-correction → Consensus
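The disagreement can be checked directly. Here is a minimal numerical sketch of the comoving-distance integral, assuming (as Claude's z=5.5 implies) that the 790 nm peak is the redshifted Lyman-alpha line at rest wavelength 121.567 nm.

```python
import math

# Assumption: 790 nm peak = redshifted Lyman-alpha (121.567 nm) -> z ~ 5.5
c = 299792.458          # speed of light, km/s
H0 = 70.0               # Hubble constant, km/s/Mpc
Om, OL = 0.3, 0.7       # flat Lambda-CDM density parameters

z = 790.0 / 121.567 - 1.0

def E(zp):
    # Dimensionless Hubble parameter for a flat Lambda-CDM universe.
    return math.sqrt(Om * (1 + zp) ** 3 + OL)

# Comoving distance D_C = (c/H0) * integral_0^z dz'/E(z'), via Simpson's rule.
n = 10000               # even number of sub-intervals
h = z / n
s = 1.0 / E(0) + 1.0 / E(z)
for i in range(1, n):
    s += (4 if i % 2 else 2) / E(i * h)
integral = s * h / 3

d_gpc = (c / H0) * integral / 1000.0   # Mpc -> Gpc
print(round(d_gpc, 2))                 # ~8 Gpc: option A
```

The integral lands near 8 Gpc, matching the consensus answer A and the ~6 Gpc initial estimates being off.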
Case Study: When Good Analysis Goes Wrong
Understanding current limitations and improvement opportunities
The Problem
"How many stars would be detectable using the ESPRESSO spectrograph with S/N ≥ 10 in a 1-hour exposure?"
Candidates: Canopus, Polaris, and 4 synthetic stars
Correct Answer: 2 stars (Canopus saturates the detector)
Orchestration Failed
Wrong Answer: 3 stars
Despite Claude having the correct reasoning
Claude (Correct)
"Canopus is too bright and would saturate the detector in a 1-hour exposure. Only 2 stars are detectable without saturation."
Others (Detailed but Wrong)
GPT-5, Gemini, Grok: Provided comprehensive magnitude calculations but failed to consider detector saturation → concluded 3 stars
What Happened
Orchestration selected Gemini for having the "most accurate and comprehensive reasoning," prioritizing analysis quality over correctness
Root Cause: The system confused detailed explanation with correctness
Improvement Opportunity: Better balance between reasoning quality and answer validation
Coordination Psychology: How Voting Behavior Changes
Controlled experiments reveal how coordination mechanisms shape multi-agent behavior
Identity Disclosure Effects
Anonymous Voting (Baseline): Agents see "agent1", "agent2", "agent3", "agent4"
Identified Voting: Agents see "GPT-5", "Claude", "Gemini", "Grok"
Effect: +7.4% increase in self-voting; consensus ties rise from 14.1% to 23.2%
Vote Visibility Effects
Hidden Tally (Baseline): Agents vote independently without seeing others' choices; first-voted answer selected 54.1% of the time
Visible Tally: Agents see ongoing vote counts and reasons; first-voted answer selected 67.8% of the time
Effect: +13.7% herding behavior, especially strong for GPT-5 (+40%)
Key Behavioral Insights
Identity Matters: Knowing "who said what" increases bias and reduces consensus
Herding Behavior: Visible votes create momentum effects and sometimes premature consensus
MassGen Evolution: v0.0.1 → v0.0.16
Foundation Era (v0.0.1 - v0.0.3)
Core framework, basic streaming
Claude, Gemini, GPT/o, Grok
Rapid Evolution (v0.0.3 → v0.0.16, latest)
Claude Code CLI, GPT-5, 10+ providers
MCP integration, browsing, coding
Unified filesystem & enhanced tooling
Industrial & academic adoption
15 releases • 25+ days from Foundation to Expansion
Early Adopters & Community Growth
Industrial Institutions
Microsoft Azure
Gradient
Sparsity
Leading Trading Firm
AgentWeb
Academic Institutions
Simon Fraser University
UC Berkeley & Santa Cruz
University of Pennsylvania
University of Sydney
University of Toronto
Open Source Community
Active development
Growing contributor base
Global adoption
Research partnerships
Community Contributors - Thank you!
Live Demonstrations
LLM Fun Facts Website (v0.0.14): Claude Code agents create interactive websites with enhanced logging and workspace isolation
Result: Conflict-free parallel development with comprehensive versioning
Notion MCP Integration (v0.0.15): Gemini agents generate and store todo lists via an external API with real-time verification
Result: Seamless external tool integration and persistent output management
IMO 2025 Winner Research: Multi-agent fact-checking → unanimous consensus on Google DeepMind's victory
Result: Accurate identification despite conflicting information
Technical Analysis: Complex Grok-4 HLE pricing calculation through iterative refinement
Result: Accurate cost estimates through collaborative validation
Columbia Research Applications
Computational Biology: Multi-agent protein folding prediction, drug discovery optimization, and genomics research acceleration
Digital Humanities: Collaborative text analysis, historical document processing, and linguistic research
Engineering: Distributed system design, infrastructure optimization, and smart city planning
Business School: Market analysis, financial modeling, and strategic decision-making through AI collaboration
Research Collaboration Possibilities: Ready to explore multi-agent AI research collaborations
Get Started in 60 Seconds
# 1. Clone and setup
git clone https://github.com/Leezekun/MassGen
cd MassGen && pip install uv && uv venv
# 2. Configure API keys
cp .env.example .env # Add your API keys
# 3. Run single agent (quick test)
uv run python -m massgen.cli --model gemini-2.5-flash "When is your knowledge up to"
# 4. Run multi-agent collaboration
uv run python -m massgen.cli --config three_agents_default.yaml "Summarize latest news of github.com/Leezekun/MassGen"
Supported Models & Providers
Major Providers:
Anthropic Claude & Claude Code • Google Gemini • OpenAI GPT • xAI Grok • ZAI GLM
Local & Extended:
Cerebras • Fireworks • Groq • LM Studio • OpenRouter • Together...
Advanced Tools:
Web Search • Code Execution • MCP Tools • File Operations • Browser Automation
Vision: The Path to Exponential Intelligence
Hurdles: Shared memory, context, interoperability
Roadmap: More models/agents, web UI
Vision: Recursive agents bootstrapping intelligence
Agents (1×): Grok, Gemini, Claude, GPT, AG2
Systems (10×): Grok Heavy, DeepThink, Claude Code, ChatGPT, AG2
Orchestrator (100×): MassGen
Challenges: Consensus, Shared Context
The Path to Exponential Intelligence
Join the Multi-Agent Revolution
Build Scalable, Collaborative AI Systems
v0.0.3 → v0.0.16 Latest | 10+ providers; Claude Code CLI; GPT-5 & GPT-OSS; Local Models; MCP Integration; Unified Filesystem
Thank you Columbia DAPLab!
Questions & Discussion