MassGen logo featuring multi-agent collaboration design
Diagram showing collaborative AI agents working together in parallel threads
Scaling AI Through Multi-Agent Collaboration
🎓 University of Cambridge
Guest Lecture - October 27, 2025
🇬🇧 Cambridge, UK
🌐 massgen.ai | GitHub
📈

How Do We Scale Up AI?

⚡

Traditional Scaling Laws Hit Limits

💡

Power Crisis

4-16 GW by 2030
Enough to power entire cities
📚

Data Depletion

Depleted by 2026-2028
Quality text data exhausted
🚧

Performance Plateau

GPT-5 delayed >1 year
Early training runs failed

⚠️ Inference-time scaling

No universal way to leverage improvements & address limitations
The New Paradigm:
Model → Agent → Multi-Agent Systems
🤝

The Promise of Multi-Agent Collaboration

  • Study Group Dynamics: Like humans collaborating on complex problems
  • Cross-Ecosystem Integration: Bridge Claude, Gemini, GPT, Grok, and specialized coding agents
  • Emergent Intelligence: Collective problem-solving beyond individual capabilities
  • Real-time Intelligence Sharing: Agents learn and adapt from each other
Visual representation of collaborative AI reasoning and cognitive processes
The Promise of Collaborative Reasoning
🏗️ Built on AG2's foundational multi-agent research and community
🏗️

AG2: The Foundation for Multi-Agent Research

AG2 research foundation and community history
🤝 Community-Driven Innovation
MassGen evolved from AG2's pioneering work in multi-agent conversations
and the vibrant research community it fostered
📈

Proven Performance Gains

Grok-4 Standard
1
Single Agent Processing
38.6%
Humanity's Last Exam score
$30/month
Grok-4 Heavy
A1
A2
A3
Multi-Agent Collaboration
44.4%
Humanity's Last Exam score
$300/month
Gemini 2.5 Deep Think
🏆
🥇
Competition Gold Medals
IMO + ICPC
5/6 IMO Problems (2025)
10/12 ICPC Problems (2025)
First AI Gold Medals
🚀 Multi-Agent Revolution
"Individual AI excellence + Multi-agent coordination = Next frontier of AI capabilities"
🚀
MassGen Orchestrator
Task Distribution & Coordination
↓
🏗️
Agent 1
Anthropic/Claude
👨‍💻
Agent 2
Claude Code
🌟
Agent 3
Google/Gemini
🤖
Agent 4
OpenAI/GPT
⚡
Agent 5
xAI/Grok
↕ Real-time Collaboration ↕
↓
🔄
Shared Collaboration Hub
Real-time Notification & Consensus
🚀

Key Features & Capabilities

🔄
Iterative Refinement
Diagram showing iterative refinement cycles in multi-agent collaboration
The Reality of Reasoning
⚙️
Multi-Backend Support
⚡
Parallel Processing
👥
Intelligence Sharing
🎯
Consensus Building
⚙️

Tech Deep Dive: Async Streaming & Dynamic Scheduling

[Timeline diagram: the Orchestrator streams from Agents 1-5 in parallel. Chunk types shown: content (t=1.2s), new_answer (t=2.1s), reasoning (t=0.8s), content (t=1.5s), voting (t=3.0s). Agent states shown: paused, stopping, deciding, restarting.]
🔁 Dynamic Restart: a new_answer triggers a restart of the other agents
Key Innovation: Dynamic coordination without deadlocks
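The restart behavior in this diagram can be sketched with `asyncio`. This is a minimal illustration, not MassGen's implementation: `Chunk`, `agent`, and `orchestrate` are hypothetical names, each agent emits a single chunk instead of a stream, and a `new_answer` restarts every agent (including its author) so that all of them decide against the same context.

```python
import asyncio
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    agent: str
    kind: str   # "new_answer" or "vote"
    text: str   # answer text, or the id of the agent being voted for


async def agent(name, delay, answers, queue):
    """Hypothetical agent: 'think' for `delay` seconds, then answer or vote."""
    await asyncio.sleep(delay)  # stands in for streaming model output
    if answers:
        # At least one answer exists: vote for the most recent one.
        await queue.put(Chunk(name, "vote", list(answers)[-1]))
    else:
        # Nothing to vote on yet: propose a new answer.
        await queue.put(Chunk(name, "new_answer", f"draft from {name}"))


async def orchestrate(delays):
    """Run all agents concurrently; restart everyone when a new answer lands."""
    queue = asyncio.Queue()
    answers, votes, restarts = {}, {}, 0

    def start_all():
        # Each run gets a copy, i.e. a fixed snapshot of the answers so far.
        return [asyncio.create_task(agent(n, d, dict(answers), queue))
                for n, d in delays.items()]

    tasks = start_all()
    while len(votes) < len(delays):
        chunk = await queue.get()
        if chunk.kind == "vote":
            votes[chunk.agent] = chunk.text
            continue
        # new_answer: record it, invalidate votes, cancel and restart all agents.
        answers[chunk.agent] = chunk.text
        votes.clear()
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        while not queue.empty():
            queue.get_nowait()  # drop chunks produced against stale context
        tasks = start_all()
        restarts += 1
    winner = Counter(votes.values()).most_common(1)[0][0]
    return winner, restarts


if __name__ == "__main__":
    print(asyncio.run(orchestrate({"agent1": 0.01, "agent2": 0.1, "agent3": 0.1})))
```

Because the orchestrator only ever waits on one shared queue, no agent blocks on another agent directly. One simplification to note: a second `new_answer` that arrives during a restart is discarded here.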
🔧

Tech Deep Dive: Backend Abstraction Challenges

🎭 Unified Interface Challenge
  • Claude Code CLI: Context sharing across agents
  • Gemini API: Can't mix builtin + custom tools
  • GPT-5: API changes (reasoning, streaming)
  • Most Backends: Unable to autonomously collaborate
🎯 Unified ChatAgent Protocol
  • ⚙️ StreamChunk: Normalization
  • 🔀 Workarounds: Backend-specific
  • 🛠️ Tool Integration: MCP, Web, Code
🧠 Binary Decision Framework: Intelligent agent coordination and workspace sharing
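A rough sketch of what chunk normalization could look like follows. Only the name `StreamChunk` comes from the slide. Its fields, the two backend event formats, and the helper functions are assumptions made up for illustration, not MassGen's actual protocol.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class StreamChunk:
    type: str       # assumed vocabulary: "content", "reasoning", "done"
    text: str = ""


def normalize_backend_a(events: Iterable[dict]) -> Iterator[StreamChunk]:
    """Hypothetical backend A: dict events with 'delta', 'thinking' or 'finish'."""
    for event in events:
        if event.get("finish"):
            yield StreamChunk("done")
        elif "thinking" in event:
            yield StreamChunk("reasoning", event["thinking"])
        else:
            yield StreamChunk("content", event["delta"])


def normalize_backend_b(lines: Iterable[str]) -> Iterator[StreamChunk]:
    """Hypothetical backend B: plain text lines terminated by '[END]'."""
    for line in lines:
        if line == "[END]":
            yield StreamChunk("done")
        else:
            yield StreamChunk("content", line)


def collect_text(chunks: Iterable[StreamChunk]) -> str:
    """Orchestrator-side code only ever sees StreamChunk objects."""
    return "".join(c.text for c in chunks if c.type == "content")


if __name__ == "__main__":
    a = normalize_backend_a([{"thinking": "hmm"}, {"delta": "Hel"}, {"delta": "lo"}, {"finish": True}])
    b = normalize_backend_b(["Hel", "lo", "[END]"])
    print(collect_text(a), collect_text(b))
```

The point of the design is that backend-specific workarounds stay inside each adapter, while coordination logic is written once against the common chunk type.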
🎯

Tech Deep Dive: Binary Decision Framework Solution

[Diagram: Agent 1: Vote: 4 · Agent 2: Writing · Agent 3: Vote: 4 · Agent 4: Vote: 4]
⚖️ Binary Choice: vote OR new_answer
Agent 2 provides new_answer → ALL VOTES RESET
After Reset: Deciding, Deciding, Deciding, Deciding
Async Results: Vote: 2, Vote: 2, Writing, Deciding → If new_answer: reset again
🎯 Key Innovation: Vote Invalidation Creates Dynamic Consensus
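The vote-or-new_answer rule can be expressed as a small state machine. The sketch below is an assumption-laden illustration: the `Coordinator` class and its method names are hypothetical, and tie-breaking is not handled.

```python
from collections import Counter


class Coordinator:
    """Tracks answers and votes; any new answer invalidates all votes."""

    def __init__(self, agent_ids):
        self.agent_ids = list(agent_ids)
        self.answers = {}   # agent id -> latest answer text
        self.votes = {}     # voter id -> id of the agent voted for

    def submit(self, agent_id, action, payload):
        if action == "new_answer":
            self.answers[agent_id] = payload
            self.votes.clear()          # ALL VOTES RESET
        elif action == "vote":
            if payload not in self.answers:
                raise ValueError(f"no answer from {payload!r} to vote for")
            self.votes[agent_id] = payload
        else:
            raise ValueError("action must be 'vote' or 'new_answer'")

    def winner(self):
        """Return the winning agent id once every agent has voted, else None."""
        if set(self.votes) != set(self.agent_ids):
            return None
        return Counter(self.votes.values()).most_common(1)[0][0]


if __name__ == "__main__":
    c = Coordinator(["agent1", "agent2", "agent3", "agent4"])
    c.submit("agent4", "new_answer", "first draft")
    for voter in ("agent1", "agent3", "agent4"):
        c.submit(voter, "vote", "agent4")
    c.submit("agent2", "new_answer", "better draft")   # resets the three votes
    for voter in ("agent1", "agent2", "agent3", "agent4"):
        c.submit(voter, "vote", "agent2")
    print(c.winner())
```

Consensus is only declared when all agents have voted since the most recent answer, which is what makes the consensus dynamic rather than fixed at the first tally.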
⚠️

The Context Sharing Challenge

❌ Naive Approach 1: Share Answers Only

  • Agents only see final text answers
  • Can't verify methodology or data
  • Unable to test or build upon work
  • Lost intermediate context

❌ Naive Approach 2: Share Workspace Paths

  • Agents interfere with each other's work
  • Data corruption from simultaneous edits
  • Loss of original work context
  • Workspace pollution and conflicts
[Diagram, Approach 1 (Answer Only): Agent 1 passes "Answer text" to Agent 2. ❌ No verification possible]
[Diagram, Approach 2 (Workspace Sharing): Agent 1 and Agent 2 both use one Shared Workspace. ⚠️ Conflicts! ❌ Data corruption & interference]
The Challenge: How to share context without interference?
[Diagram: the Orchestrator keeps Snapshot Storage (agent1/, agent2/). Agent 1 and Agent 2 each have a Permanent workspace and a Temp Context workspace holding agent1/ and agent2/.]
✅ Safe Sharing + ✅ No Interference + ✅ Full Verification
🎬

Context Sharing in Action

🔬 Agent 1 Finished First

• Creates analysis.py and results.csv
• Saves to permanent workspace
• 📸 Snapshot captured

🔄 Agent 2 Restarted

• Sees agent1/analysis.py in temp workspace
• Reads & tests the analysis code
• Modifications in temp dir don't affect Agent 1
• Creates improved_analysis.py in own permanent workspace
• 📸 Snapshot captured
...

🎯 Final Presentation

• Winning agent has full context
• Can reference both agents' work
• Snapshots ensure correct version access

🗂️ Workspace Structure

👨‍💼 Agent 1 Permanent Workspace
📄 analysis.py
📊 results.csv
📝 methodology.md
👁️ Agent 2 Temp Workspace (Read-Only Context)
📁 agent1/
  📄 analysis.py ✅ readable & testable
  📊 results.csv
  📝 methodology.md
👨‍💻 Agent 2 Permanent Workspace
📄 improved_analysis.py
📋 code_review.md
🧪 test_results.json
🔑 Key Benefits Illustrated:
✅ Agent 2 can READ & execute Agent 1's work
✅ Temp modifications don't corrupt original
✅ Each agent maintains workspace integrity
✅ Final answer has complete context
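The snapshot-and-copy idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not MassGen's code: `snapshot`, `build_temp_context`, and the directory names are hypothetical, and the copies are plain directory copies rather than versioned snapshots.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot(workspace: Path, store: Path, agent_id: str) -> Path:
    """Copy an agent's permanent workspace into snapshot storage."""
    dest = store / agent_id
    if dest.exists():
        shutil.rmtree(dest)           # keep only the latest snapshot
    shutil.copytree(workspace, dest)
    return dest


def build_temp_context(store: Path, temp_root: Path) -> Path:
    """Give an agent its own copy of every snapshot (agent1/, agent2/, ...)."""
    if temp_root.exists():
        shutil.rmtree(temp_root)
    shutil.copytree(store, temp_root)
    return temp_root


def demo():
    with tempfile.TemporaryDirectory() as root_name:
        root = Path(root_name)
        ws1 = root / "agent1_workspace"
        ws1.mkdir()
        (ws1 / "analysis.py").write_text("print('v1')\n")

        store = root / "snapshots"
        store.mkdir()
        snapshot(ws1, store, "agent1")

        # Agent 2 receives a private copy and is free to edit or run it.
        temp2 = build_temp_context(store, root / "agent2_temp")
        (temp2 / "agent1" / "analysis.py").write_text("print('edited')\n")

        original = (ws1 / "analysis.py").read_text()
        edited = (temp2 / "agent1" / "analysis.py").read_text()
        return original, edited


if __name__ == "__main__":
    print(demo())
```

Since Agent 2 works on a copy, its edits never touch Agent 1's permanent workspace, which is the no-interference property the slide illustrates.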
📊

Benchmarking: Preliminary Results

Scientific evaluation across graduate-level reasoning, instruction-following, and narrative tasks

🧪 GPQA-Diamond

Graduate Physics/Chemistry
MassGen
87.4% 🏆
Gemini
85.9%
Grok-4
85.4%
GPT-5
84.8%
Claude
68.2%

📋 IFEval

Instruction Following
MassGen
88.0% 🏆
GPT-5
87.4%
Grok-4
84.7%
Gemini
66.0%
Claude
63.6%

📖 MuSR

Narrative Reasoning
Gemini
69.6% 🏆
GPT-5
69.2%
MassGen
68.3%
Grok-4
67.6%
Claude
62.8%
🏆 Overall Champion
MassGen: 81.2%
Wins 2/3 benchmarks • Statistically significant
✅ Key Results:
• Highest on 2/3 benchmarks
• Best overall average
• Consistent performance
📈 Statistical:
• vs Claude: p = 1.4e-07 ⭐⭐⭐
• vs Gemini: p = 1.1e-28 ⭐⭐⭐
• Differences unlikely to be due to chance
🔬 Research Gap:
• Oracle: 95.5% (GPQA)
• Actual: 87.4%
• Potential: 8.1 points
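The summary figures can be re-derived from the per-benchmark scores on this slide. The short check below assumes the overall figure is an unweighted mean of the three benchmarks, which is consistent with the reported 81.2%. The variable names are illustrative.

```python
# Scores copied from the three benchmark panels on this slide (percent).
scores = {
    "MassGen": {"GPQA-Diamond": 87.4, "IFEval": 88.0, "MuSR": 68.3},
    "Gemini":  {"GPQA-Diamond": 85.9, "IFEval": 66.0, "MuSR": 69.6},
    "Grok-4":  {"GPQA-Diamond": 85.4, "IFEval": 84.7, "MuSR": 67.6},
    "GPT-5":   {"GPQA-Diamond": 84.8, "IFEval": 87.4, "MuSR": 69.2},
    "Claude":  {"GPQA-Diamond": 68.2, "IFEval": 63.6, "MuSR": 62.8},
}

# Unweighted mean per system (assumption: this is how the overall score is formed).
averages = {name: round(sum(s.values()) / len(s), 1) for name, s in scores.items()}
best_overall = max(averages, key=averages.get)

# Which system tops each benchmark.
benchmarks = ["GPQA-Diamond", "IFEval", "MuSR"]
leaders = {b: max(scores, key=lambda name: scores[name][b]) for b in benchmarks}
massgen_wins = sum(1 for leader in leaders.values() if leader == "MassGen")

# Gap between the oracle score and the actual GPQA score.
oracle_gap = round(95.5 - scores["MassGen"]["GPQA-Diamond"], 1)

if __name__ == "__main__":
    print(averages, best_overall, leaders, massgen_wins, oracle_gap)
```

The oracle gap is the headroom available if the orchestrator always picked a correct answer whenever at least one agent produced it.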
🔬

Case Study: Success Through Peer Correction

Graduate-level physics question from GPQA-Diamond benchmark

🌌 The Problem

A quasar shows a peak at 790 nm wavelength. Given Lambda-CDM cosmological parameters (H₀ = 70 km/s/Mpc, Ωₘ = 0.3, ΩΛ = 0.7), what is the comoving distance?

Options: A) 8 Gpc B) 7 Gpc C) 6 Gpc D) 9 Gpc

🎯 Final Result

✅
Correct Answer: A (8 Gpc)
Orchestration succeeded where individual agents initially failed

🤖 Initial Answers

Agent 1: "I calculate ~6 Gpc → Answer C"
Agent 2: "I get ~8.95 Gpc → Answer D"
Agent 3: "~6.1 Gpc → Answer C"

🔄 Self-Correction Process

Agent 1 observes: "There is significant discrepancy in calculations: Agent1 gets ~6.1 Gpc, Agent2 gets ~8.95 Gpc. Let me re-examine..."

✨ Breakthrough Moment

Agent 1 revises: "Standard cosmological calculators yield 8000-8500 Mpc for z=5.5. This equals 8.0-8.5 Gpc, closest to option A."

Result: 3/4 agents converge on correct answer
💡 Success Mechanism:
Peer observation → Discrepancy detection → Self-correction → Consensus
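The physics behind the correct answer can be checked numerically. The sketch below assumes the 790 nm peak is the Lyman-alpha line (rest wavelength about 121.567 nm), which gives the z ≈ 5.5 quoted by Agent 1, and a flat universe with no radiation term. It integrates c/H₀ · ∫ dz / √(Ωₘ(1+z)³ + ΩΛ) with the trapezoidal rule, and the function name is illustrative.

```python
import math

C_KM_S = 299_792.458        # speed of light in km/s
H0 = 70.0                   # Hubble constant in km/s/Mpc (from the problem)
OMEGA_M, OMEGA_L = 0.3, 0.7 # matter and dark-energy densities (from the problem)
LYMAN_ALPHA_NM = 121.567    # assumption: the observed peak is Lyman-alpha


def comoving_distance_gpc(z: float, steps: int = 10_000) -> float:
    """Line-of-sight comoving distance in Gpc for flat Lambda-CDM."""
    def inv_e(zp: float) -> float:
        return 1.0 / math.sqrt(OMEGA_M * (1.0 + zp) ** 3 + OMEGA_L)

    h = z / steps
    # Trapezoidal rule over [0, z].
    total = 0.5 * (inv_e(0.0) + inv_e(z))
    total += sum(inv_e(i * h) for i in range(1, steps))
    distance_mpc = (C_KM_S / H0) * total * h
    return distance_mpc / 1000.0


redshift = 790.0 / LYMAN_ALPHA_NM - 1.0
distance = comoving_distance_gpc(redshift)

if __name__ == "__main__":
    print(f"z = {redshift:.2f}, comoving distance = {distance:.2f} Gpc")
```

Under these assumptions the result lands near 8 Gpc, in line with option A and with the range Agent 1 cites from standard calculators.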
⚠️

Case Study: When Good Analysis Goes Wrong

Understanding current limitations and improvement opportunities

🔭 The Problem

"How many stars would be detectable using the ESPRESSO spectrograph with S/N ≥ 10 in 1-hour exposure?"

Candidates: Canopus, Polaris, and 4 synthetic stars
Correct Answer: 2 stars (Canopus saturates detector)

❌ Orchestration Failed

✗
Wrong Answer: 3 stars
Despite Claude having the correct reasoning

✅ Claude (Correct)

"Canopus is too bright and would saturate the detector in a 1-hour exposure. Only 2 stars are detectable without saturation."

❌ Others (Detailed but Wrong)

GPT-5, Gemini, Grok: Provided comprehensive magnitude calculations but failed to consider detector saturation → concluded 3 stars

🎯 What Happened

Orchestration selected Gemini for having "most accurate and comprehensive reasoning," prioritizing analysis quality over correctness
🔍 Root Cause:
System confused detailed explanation with correctness
💡 Improvement Opportunity:
Better balance reasoning quality and answer validation
🧠

Coordination Psychology: How Voting Behavior Changes

Controlled experiments reveal how coordination mechanisms shape multi-agent behavior

🎭 Identity Disclosure Effects

Anonymous Voting (Baseline)

GPT-5 Self-Voting
81.0%
Agents see: "agent1", "agent2", "agent3", "agent4"

Identified Voting

GPT-5 Self-Voting
88.4%
+7.4 percentage points in self-voting
Consensus ties: 14.1% → 23.2%
Agents see: "GPT-5", "Claude", "Gemini", "Grok"

👁️ Vote Visibility Effects

Hidden Tally (Baseline)

First-voted Selected
54.1%
Agents vote independently without seeing others' choices

Visible Tally

First-voted Selected
67.8%
+13.7 percentage points (herding behavior)
Especially strong: GPT-5 (+40%)
Agents see ongoing vote counts and reasons

🎯 Key Behavioral Insights

🎭
Identity Matters
Knowing "who said what" increases bias and reduces consensus
👁
Herding Behavior
Visible votes create momentum effects, sometimes premature consensus
🚀

MassGen Evolution

🏗️

Foundation Era

v0.0.1 - v0.0.3
Core framework, basic streaming,
Claude, Gemini, GPT/o, Grok
LATEST
🚀

Rapid Evolution

v0.0.3 → v0.1.3
12+ providers, AG2, Claude Code
MCP, browsing, coding, multimodal
Filesystem, Docker, restart
34 Releases
•
Foundation→Expansion
🌟

Early Adopters & Community Growth

🏢 Industrial Institutions

  • Microsoft Azure
  • Gradient
  • Sparsity
  • Leading Trading Firm
  • AgentWeb

🎓 Academic Institutions

  • Simon Fraser University
  • UC Berkeley & Santa Cruz
  • University of Pennsylvania
  • University of Sydney
  • University of Toronto

🚀 Open Source Community

Active development
Growing contributor base
Global adoption
Research partnerships
MassGen Contributors Wall
🤝 Community Contributors - Thank you!
🎬

Live Demonstrations

🎥 Multimodal Video Analysis (v0.1.3): Agents autonomously discover YouTube URLs in documentation, download videos, and analyze content to create automation scripts
Result: End-to-end multimedia workflow - discovered and analyzed 17 case study videos without human intervention
🛠️ Custom Tools & Self-Evolution (v0.1.1): Agents use custom Python tools to analyze GitHub issues and market trends, driving autonomous feature prioritization
Result: AI systems that improve themselves through user feedback analysis and market research
🛡️ MCP Planning Mode (v0.0.29): Agents plan Discord messaging without execution during coordination, preventing duplicate notifications
Result: Safe MCP tool coordination - only the winner executes planned actions
🌐 Claude Code Workspace (v0.0.14): Multiple Claude Code agents develop websites in parallel with enhanced logging and workspace isolation
Result: Conflict-free development with per-agent versioning and final workspace snapshots
📚 case.massgen.ai - Complete Case Studies
🔮

Multi-Agent Scaling Opportunity

Diversity of Thought Drives Quality

🧠

Complex Analysis

🎨

Creative Tasks

🔄

Coordination

Multiple perspectives, cross-verification, iterative refinement

⚡

Get Started in 60 Seconds

# 1. Install via PyPI
pip install massgen

# 2. Interactive setup (first time only)
massgen --setup # Guided API key configuration

# 3. Run single agent (quick test)
massgen --model gemini-2.5-flash "When is your knowledge up to"

# 4. Run multi-agent collaboration
massgen --config @examples/basic/multi/three_agents_default "Summarize latest news of github.com/Leezekun/MassGen"

✅ Supported Models & Providers

🏢 Major Providers:
Anthropic Claude • Google Gemini • OpenAI GPT • xAI Grok • ZAI GLM
🏠 Local & Extended:
Cerebras • Fireworks • Groq • LM Studio • OpenRouter • SGLang • Together • vLLM...

🤖 Agents & Frameworks

AG2 • Claude Code CLI

🛠️ Advanced Tools

Browser Automation • Code Execution • File Operations • MCP • Multimodal • Web Search
🔮

Vision: The Path to Exponential Intelligence

  • Hurdles: Shared memory, context, interoperability
  • Next: General interoperability, submit/restart tools
  • Future: DSPy integration, memory module, advanced multimodal
  • Vision: Recursive agents bootstrapping intelligence
[Diagram: Agents: Grok, Gemini, Claude, GPT, AG2. Systems: Grok Heavy, DeepThink, Claude Code, ChatGPT, AG2. Orchestrator: MassGen. Scale markers: 1×, 10×, 100×. Challenges: Consensus, Shared Context, Interoperability.]
The Path to Exponential Intelligence
🚀

Join the Multi-Agent Revolution
