Scaling AI Through Multi-Agent Collaboration
Columbia University
DAPLab
New York • Fall 2025
The Single-Agent Limitation
Siloed Thinking: Single models miss diverse perspectives
Limited Context: No peer review or validation
Sequential Processing: Linear, not parallel exploration
Fixed Approach: Limited mid-task adaptation to new information
From Isolation to Collaboration
The Promise of Multi-Agent Collaboration
Study Group Dynamics: Like humans collaborating on complex problems
Cross-Model Synergy: Leverage unique strengths of Claude, Gemini, GPT, Grok
Parallel Processing: Multiple perspectives tackle same task simultaneously
Real-time Intelligence Sharing: Agents learn and adapt from each other
The Promise of Collaborative Reasoning
Built on AG2's foundational multi-agent research and community
AG2: The Foundation for Multi-Agent Research
Community-Driven Innovation
MassGen evolved from AG2's pioneering work in multi-agent conversations and the vibrant research community it fostered
Proven Performance Gains - Grok Heavy Evidence
Grok-4 Standard
Single-Agent Processing
38.6%
Humanity's Last Exam Score
$30/month
Grok-4 Heavy
Multi-Agent Collaboration
44.4%
Humanity's Last Exam Score
$300/month
+15% Relative Performance Boost
Multi-agent "study group" approach outperforms single agent
"The exploration of the art & science of multi-agent collaboration has just started."
MassGen Orchestrator: Task Distribution & Coordination
Agent 1 (Anthropic/Claude) | Agent 2 (Claude Code) | Agent 3 (Google/Gemini)
Real-time Collaboration
Shared Collaboration Hub: Real-time Notification & Consensus
Key Features & Capabilities
Cross-Model Synergy: Harness strengths from diverse models
Parallel Processing: Multiple agents tackle problems simultaneously
Iterative Refinement: Non-linear reasoning through cycles of improvement
Intelligence Sharing: Agents share working summaries, tool results, and insights in real time
Consensus Building: Natural convergence through collaboration
Iterative Refinement: The Reality of Reasoning
Tech Deep Dive: Async Streaming & Dynamic Scheduling
AsyncGenerator Pattern: Real-time streaming from 5+ agents simultaneously
Dynamic Task Management: Agents start and stop based on voting status
Graceful Restart & Wrap-up: Dynamic wrap-up as part of scheduling
Orchestrator streams content, tool_calls, and reasoning chunks from Agents 1-4
Restart Trigger: when Agent 2 provides a new_answer, the other agents restart
Key Innovation: Dynamic coordination without deadlocks
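The fan-in streaming pattern above can be sketched in plain asyncio. This is an illustrative sketch, not MassGen's actual code: `agent_stream` stands in for a real backend, and all names here are hypothetical.

```python
import asyncio
from typing import AsyncGenerator

async def agent_stream(name: str, chunks: list[str]) -> AsyncGenerator[tuple[str, str], None]:
    # Stand-in for a model backend: yields (agent, chunk) with simulated latency.
    for chunk in chunks:
        await asyncio.sleep(0.01)
        yield name, chunk

async def merge_streams(*streams) -> AsyncGenerator[tuple[str, str], None]:
    # Fan-in: forward chunks from all agent streams as they arrive.
    queue: asyncio.Queue = asyncio.Queue()
    DONE = object()  # sentinel marking one stream's completion

    async def pump(stream):
        async for item in stream:
            await queue.put(item)
        await queue.put(DONE)

    tasks = [asyncio.create_task(pump(s)) for s in streams]
    finished = 0
    while finished < len(tasks):
        item = await queue.get()
        if item is DONE:
            finished += 1
        else:
            yield item

async def main() -> list[tuple[str, str]]:
    a1 = agent_stream("agent1", ["content", "tool_calls"])
    a2 = agent_stream("agent2", ["reasoning", "new_answer"])
    return [item async for item in merge_streams(a1, a2)]

received = asyncio.run(main())
print(received)
```

Because chunks are forwarded as they arrive, the orchestrator can react mid-stream, e.g. cancel the pump tasks when a `new_answer` chunk triggers a restart.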
Tech Deep Dive: Backend Abstraction Challenges
Unified Interface: Standardized ChatAgent protocol for 8+ different backends
Tool Integration: Web search, code execution, MCP
StreamChunk Normalization: Convert diverse response formats to a common protocol
Backend-Specific Workarounds: Each provider has unique limitations
Backend Challenges:
Claude Code CLI: context sharing across agents
Gemini API: cannot mix built-in and custom tools
GPT-5: API changes (reasoning, streaming, etc.)
Most backends: unable to collaborate autonomously
Our Solution: Binary Decision Framework & Advanced Workspace Sharing
Result: Unified interface with backend-specific optimizations
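A minimal sketch of StreamChunk normalization: per-backend adapters map provider-specific event shapes onto one common chunk type. The event shapes below are simplified approximations of each provider's streaming format, and the class and function names are assumptions for illustration, not MassGen's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class StreamChunk:
    # Common protocol every backend adapter must emit (hypothetical shape).
    type: str     # "content" | "tool_calls" | "reasoning"
    content: Any
    source: str   # which backend produced it

def from_openai_style(event: Dict[str, Any]) -> StreamChunk:
    # OpenAI-style deltas nest text under choices[0].delta.content.
    delta = event["choices"][0]["delta"]
    return StreamChunk(type="content", content=delta.get("content", ""), source="openai")

def from_gemini_style(event: Dict[str, Any]) -> StreamChunk:
    # Gemini-style responses carry text in candidates[0].content.parts.
    parts = event["candidates"][0]["content"]["parts"]
    return StreamChunk(type="content", content="".join(p["text"] for p in parts), source="gemini")

ADAPTERS: Dict[str, Callable[[Dict[str, Any]], StreamChunk]] = {
    "openai": from_openai_style,
    "gemini": from_gemini_style,
}

def normalize(backend: str, event: Dict[str, Any]) -> StreamChunk:
    return ADAPTERS[backend](event)

# Two differently shaped events normalize to the same protocol:
c1 = normalize("openai", {"choices": [{"delta": {"content": "Hello"}}]})
c2 = normalize("gemini", {"candidates": [{"content": {"parts": [{"text": "Hello"}]}}]})
print(c1.content == c2.content)
```

The orchestrator then only ever consumes `StreamChunk`s, so backend quirks stay inside the adapters.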
Tech Deep Dive: Binary Decision Framework Solution
Binary Choice: Each agent must choose: vote OR new_answer
Vote Invalidation: Any new_answer invalidates ALL existing votes
Reset & Restart: All agents restart with updated answer context
Anonymous Voting: Agents see "agent1", "agent2", etc.
Round 1: Agents 1, 3, and 4 vote for Agent 4; Agent 2 has not voted yet
Agent 2 provides a new_answer
Round 2: All agents restart with 2 available answers
Votes from Round 1: INVALID
Each agent decides again: vote OR new_answer
Any new_answer resets ALL votes
Key Innovation: Dynamic equilibrium through vote invalidation
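The vote-or-new_answer loop can be sketched as follows. Class and method names are illustrative assumptions, not MassGen's actual implementation; ties are broken arbitrarily in this sketch.

```python
class BinaryDecisionState:
    """Minimal sketch of the binary decision framework with vote invalidation."""

    def __init__(self, agents):
        self.agents = list(agents)
        self.answers = {}   # agent -> answer text
        self.votes = {}     # voter -> agent voted for

    def new_answer(self, agent, text):
        # Recording a new answer invalidates ALL existing votes,
        # forcing every agent to decide again with the new context.
        self.answers[agent] = text
        self.votes.clear()

    def vote(self, voter, target):
        if target not in self.answers:
            raise ValueError(f"{target} has no answer to vote for")
        self.votes[voter] = target

    def consensus(self):
        # Consensus once every agent has voted; leading answer wins.
        if len(self.votes) < len(self.agents):
            return None
        tally = {}
        for target in self.votes.values():
            tally[target] = tally.get(target, 0) + 1
        return max(tally, key=tally.get)

state = BinaryDecisionState(["agent1", "agent2", "agent3", "agent4"])
state.new_answer("agent4", "draft A")
state.vote("agent1", "agent4")
state.vote("agent3", "agent4")
state.new_answer("agent2", "draft B")   # wipes the two existing votes
print(len(state.votes))                 # 0: Round 1 votes are invalid
```

The invalidation rule is what produces the "dynamic equilibrium": the system converges only when no agent prefers contributing a new answer over voting.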
The Context Sharing Challenge
Naive Approach 1: Share Answers Only
Agents only see final text answers
Can't verify methodology or data
Unable to test or build upon work
Lost intermediate context
Naive Approach 2: Share Workspace Paths
Agents interfere with each other's work
Data corruption from simultaneous edits
Loss of original work context
Workspace pollution and conflicts
Approach 1 (Answer Only): Agent 2 sees only Agent 1's answer text; no verification possible
Approach 2 (Workspace Sharing): Agents 1 and 2 edit a shared workspace; conflicts, data corruption, and interference
The Challenge: How to share context without interference?
Our Context Sharing Solution
Workspace Snapshots: Orchestrator captures agent workspaces after each round
Temporary Directories: Each agent gets a clean temp workspace with all snapshots
Anonymous Mapping: agent1/, agent2/ folders preserve anonymity
Clean Separation: Read from the temp dir, write to the permanent workspace
Context Preservation: Snapshots linked to coordination rounds
Orchestrator snapshot storage holds anonymized agent1/ and agent2/ copies
Each agent keeps a permanent workspace and receives its own temp copy of all snapshots for context sharing
Safe Context Sharing + No Interference + Full Verification
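The snapshot-and-temp-copy mechanism can be sketched with the standard library. This is a hypothetical sketch of the idea (function names and directory layout are assumptions): peers' work is read from a disposable copy, and writes go only to an agent's own workspace.

```python
import shutil
import tempfile
from pathlib import Path

def snapshot(workspace: Path, store: Path, anon_name: str) -> Path:
    # Capture an agent's workspace under an anonymous name (e.g. "agent1").
    dest = store / anon_name
    if dest.exists():
        shutil.rmtree(dest)
    shutil.copytree(workspace, dest)
    return dest

def build_temp_context(store: Path) -> Path:
    # Each agent gets its own disposable copy of ALL snapshots to read/test.
    temp = Path(tempfile.mkdtemp(prefix="massgen_ctx_"))
    for snap in store.iterdir():
        shutil.copytree(snap, temp / snap.name)
    return temp

# Demo with throwaway directories:
root = Path(tempfile.mkdtemp())
ws1, store = root / "ws1", root / "snapshots"
ws1.mkdir()
store.mkdir()
(ws1 / "analysis.py").write_text("print('analysis')\n")

snapshot(ws1, store, "agent1")
ctx = build_temp_context(store)

# Agent 2 can read, run, and even edit its copy without touching the original.
(ctx / "agent1" / "analysis.py").write_text("# modified in temp\n")
print((ws1 / "analysis.py").read_text())  # original is untouched
```

Copy-on-read is what removes the interference problem: verification happens on copies, so no locking or coordination between agents' filesystems is needed.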
Context Sharing in Action
Round 1: Agent 1 (Data Scientist)
• Creates analysis.py and results.csv
• Saves to permanent workspace
• Snapshot captured
Round 2: Agent 2 (Code Reviewer)
• Sees agent1/analysis.py in temp workspace
• Reads and tests the analysis code
• Modifications in the temp dir don't affect Agent 1
• Creates improved_analysis.py in own workspace
Final Presentation
• Winning agent has full context
• Can reference both agents' work
• Snapshots ensure correct version access
Workspace Structure
Agent 1 Permanent Workspace: analysis.py, results.csv, methodology.md
Agent 2 Temp Workspace (Read-Only Context): agent1/ with analysis.py (readable & testable), results.csv, methodology.md
Agent 2 Permanent Workspace: improved_analysis.py, code_review.md, test_results.json
Key Benefits Illustrated:
• Agent 2 can READ and execute Agent 1's work
• Temp modifications don't corrupt the original
• Each agent maintains workspace integrity
• The final answer has complete context
Benchmarking: Preliminary Results
Scientific evaluation across graduate-level reasoning, instruction-following, and narrative tasks
GPQA-Diamond: Graduate Physics/Chemistry
IFEval: Instruction Following
MuSR: Narrative Reasoning
Overall Champion
MassGen: 81.2%
Wins 2/3 benchmarks • Statistically significant
Key Results:
• Highest score on 2/3 benchmarks
• Best overall average
• Consistent performance
Statistical Significance:
• vs Claude: p = 1.4e-07 (***)
• vs Gemini: p = 1.1e-28 (***)
• Not due to chance
Research Gap (GPQA):
• Oracle: 95.5%
• Actual: 87.4%
• Potential gain: 8.1 points
Case Study: Success Through Peer Correction
Graduate-level physics question from GPQA-Diamond benchmark
The Problem
A quasar shows a peak at 790 nm wavelength. Given Lambda-CDM cosmological parameters
(H0 = 70 km/s/Mpc, Ωm = 0.3, ΩΛ = 0.7), what is the comoving distance?
Options: A) 8 Gpc B) 7 Gpc C) 6 Gpc D) 9 Gpc
Final Result
Correct Answer: A (8 Gpc)
Orchestration succeeded where individual agents initially failed
Round 1: Initial Answers
Claude: "I calculate ~6 Gpc → Answer C"
GPT-5: "I get ~8.95 Gpc → Answer D"
Gemini: "~6.1 Gpc → Answer C"
Self-Correction Process
Claude observes: "There is significant discrepancy in calculations: Agent1 gets ~6.1 Gpc, Agent2 gets ~8.95 Gpc. Let me re-examine..."
Breakthrough Moment
Claude revises: "Standard cosmological calculators yield 8000-8500 Mpc for z=5.5. This equals 8.0-8.5 Gpc, closest to option A."
Result: 3/4 agents converge on the correct answer
Success Mechanism: Peer observation → Discrepancy detection → Self-correction → Consensus
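The disagreement can be checked directly. Here is a minimal numerical sketch of the comoving-distance integral, assuming (as Claude's z=5.5 implies) that the 790 nm peak is the redshifted Lyman-alpha line at rest wavelength 121.567 nm.

```python
import math

# Assumption: 790 nm peak = redshifted Lyman-alpha (121.567 nm) -> z ~ 5.5
c = 299792.458          # speed of light, km/s
H0 = 70.0               # Hubble constant, km/s/Mpc
Om, OL = 0.3, 0.7       # flat Lambda-CDM density parameters

z = 790.0 / 121.567 - 1.0

def E(zp):
    # Dimensionless Hubble parameter for a flat Lambda-CDM universe.
    return math.sqrt(Om * (1 + zp) ** 3 + OL)

# Comoving distance D_C = (c/H0) * integral_0^z dz'/E(z'), via Simpson's rule.
n = 10000               # even number of sub-intervals
h = z / n
s = 1.0 / E(0) + 1.0 / E(z)
for i in range(1, n):
    s += (4 if i % 2 else 2) / E(i * h)
integral = s * h / 3

d_gpc = (c / H0) * integral / 1000.0   # Mpc -> Gpc
print(round(d_gpc, 2))                 # ~8 Gpc: option A
```

The integral lands near 8 Gpc, matching the consensus answer A and the ~6 Gpc initial estimates being off.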
Case Study: When Good Analysis Goes Wrong
Understanding current limitations and improvement opportunities
The Problem
"How many stars would be detectable using the ESPRESSO spectrograph with S/N ≥ 10 in a 1-hour exposure?"
Candidates: Canopus, Polaris, and 4 synthetic stars
Correct Answer: 2 stars (Canopus saturates the detector)
Orchestration Failed
Wrong Answer: 3 stars
Despite Claude having the correct reasoning
Claude (Correct)
"Canopus is too bright and would saturate the detector in a 1-hour exposure. Only 2 stars are detectable without saturation."
Others (Detailed but Wrong)
GPT-5, Gemini, Grok: Provided comprehensive magnitude calculations but failed to consider detector saturation → concluded 3 stars
What Happened
Orchestration selected Gemini for having the "most accurate and comprehensive reasoning," prioritizing analysis quality over correctness
Root Cause: The system confused detailed explanation with correctness
Improvement Opportunity: Better balance between reasoning quality and answer validation
Coordination Psychology: How Voting Behavior Changes
Controlled experiments reveal how coordination mechanisms shape multi-agent behavior
Identity Disclosure Effects
Anonymous Voting (Baseline): Agents see "agent1", "agent2", "agent3", "agent4"
Identified Voting: Agents see "GPT-5", "Claude", "Gemini", "Grok"
Effect: +7.4% increase in self-voting; consensus ties rise from 14.1% to 23.2%
Vote Visibility Effects
Hidden Tally (Baseline): Agents vote independently without seeing others' choices; first-voted answer selected 54.1% of the time
Visible Tally: Agents see ongoing vote counts and reasons; first-voted answer selected 67.8% of the time
Effect: +13.7% herding behavior, especially strong for GPT-5 (+40%)
Key Behavioral Insights
Identity Matters: Knowing "who said what" increases bias and reduces consensus
Herding Behavior: Visible votes create momentum effects and sometimes premature consensus
MassGen Evolution: v0.0.1 → v0.0.16
Foundation Era (v0.0.1 - v0.0.3)
Core framework, basic streaming
Claude, Gemini, GPT/o, Grok
Rapid Evolution (v0.0.3 → v0.0.16, latest)
Claude Code CLI, GPT-5, 10+ providers
MCP integration, browsing, coding
Unified filesystem & enhanced tooling
Industrial & academic adoption
15 releases • 25+ days from Foundation to Expansion
Early Adopters & Community Growth
Industrial Institutions
Microsoft Azure
Gradient
Sparsity
Leading Trading Firm
AgentWeb
Academic Institutions
Simon Fraser University
UC Berkeley & Santa Cruz
University of Pennsylvania
University of Sydney
University of Toronto
Open Source Community
Active development
Growing contributor base
Global adoption
Research partnerships
Community Contributors - Thank you!
Live Demonstrations
LLM Fun Facts Website (v0.0.14): Claude Code agents create interactive websites with enhanced logging and workspace isolation
Result: Conflict-free parallel development with comprehensive versioning
Notion MCP Integration (v0.0.15): Gemini agents generate and store todo lists via an external API with real-time verification
Result: Seamless external tool integration and persistent output management
IMO 2025 Winner Research: Multi-agent fact-checking → unanimous consensus on Google DeepMind's victory
Result: Accurate identification despite conflicting information
Technical Analysis: Complex Grok-4 HLE pricing calculation through iterative refinement
Result: Accurate cost estimates through collaborative validation
Columbia Research Applications
Computational Biology: Multi-agent protein folding prediction, drug discovery optimization, and genomics research acceleration
Digital Humanities: Collaborative text analysis, historical document processing, and linguistic research
Engineering: Distributed system design, infrastructure optimization, and smart city planning
Business School: Market analysis, financial modeling, and strategic decision-making through AI collaboration
Research Collaboration Possibilities: Ready to explore multi-agent AI research collaborations
Get Started in 60 Seconds
# 1. Clone and setup
git clone https://github.com/Leezekun/MassGen
cd MassGen && pip install uv && uv venv
# 2. Configure API keys
cp .env.example .env # Add your API keys
# 3. Run single agent (quick test)
uv run python -m massgen.cli --model gemini-2.5-flash "When is your knowledge up to"
# 4. Run multi-agent collaboration
uv run python -m massgen.cli --config three_agents_default.yaml "Summarize latest news of github.com/Leezekun/MassGen"
Supported Models & Providers
Major Providers:
Anthropic Claude & Claude Code • Google Gemini • OpenAI GPT • xAI Grok • ZAI GLM
Local & Extended:
Cerebras • Fireworks • Groq • LM Studio • OpenRouter • Together...
Advanced Tools:
Web Search • Code Execution • MCP Tools • File Operations • Browser Automation
Vision: The Path to Exponential Intelligence
Hurdles: Shared memory, context, interoperability
Roadmap: More models/agents, web UI
Vision: Recursive agents bootstrapping intelligence
Agents (1×): Grok, Gemini, Claude, GPT, AG2
Systems (10×): Grok Heavy, DeepThink, Claude Code, ChatGPT, AG2
Orchestrator (100×): MassGen
Challenges: Consensus, Shared Context
The Path to Exponential Intelligence
Join the Multi-Agent Revolution
Build Scalable, Collaborative AI Systems
v0.0.3 → v0.0.16 Latest | 10+ providers; Claude Code CLI; GPT-5 & GPT-OSS; Local Models; MCP Integration; Unified Filesystem
Thank you Columbia DAPLab!
Questions & Discussion