⚠️

Case Study: When Good Analysis Goes Wrong

Understanding current limitations and improvement opportunities

🔭 The Problem

"How many stars would be detectable using the ESPRESSO spectrograph with S/N ≥ 10 in a 1-hour exposure?"

Candidates: Canopus, Polaris, and 4 synthetic stars
Correct Answer: 2 stars (Canopus is bright enough to saturate the detector, so it does not count)
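The gap between the two answers comes down to one extra condition. The sketch below is a minimal illustration, not the actual ESPRESSO exposure calculation: the `Candidate` fields, the star names, every numeric value, and the `saturation_limit` default are hypothetical placeholders chosen only to reproduce the 3-versus-2 pattern described here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    snr: float          # predicted S/N for the exposure (placeholder value)
    peak_counts: float  # predicted peak counts per pixel (placeholder value)


def count_snr_only(candidates, snr_min=10.0):
    """Count stars that reach the S/N threshold, ignoring saturation."""
    return sum(1 for c in candidates if c.snr >= snr_min)


def count_detectable(candidates, snr_min=10.0, saturation_limit=60000.0):
    """Count stars that reach the S/N threshold without saturating.

    saturation_limit is an assumed placeholder, not a real detector spec.
    """
    return sum(
        1
        for c in candidates
        if c.snr >= snr_min and c.peak_counts < saturation_limit
    )


# Six hypothetical candidates: one far too bright, two usable, three too faint.
candidates = [
    Candidate("too_bright", snr=5000.0, peak_counts=1e9),
    Candidate("ok_1", snr=300.0, peak_counts=2e4),
    Candidate("ok_2", snr=15.0, peak_counts=50.0),
    Candidate("faint_1", snr=4.0, peak_counts=5.0),
    Candidate("faint_2", snr=1.0, peak_counts=1.0),
    Candidate("faint_3", snr=0.3, peak_counts=0.1),
]

print(count_snr_only(candidates))    # S/N check alone
print(count_detectable(candidates))  # S/N check plus saturation check
```

With these placeholder values, the S/N-only count includes the saturating star, while the combined check excludes it.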

❌ Orchestration Failed

Wrong Answer: 3 stars

Despite Claude having the correct reasoning

✅ Claude (Correct)

"Canopus is too bright and would saturate the detector in a 1-hour exposure. Only 2 stars are detectable without saturation."

❌ Others (Detailed but Wrong)

GPT-5, Gemini, and Grok each provided comprehensive magnitude calculations but did not consider detector saturation, so each concluded 3 stars

🎯 What Happened

The orchestration layer selected Gemini's response for having the "most accurate and comprehensive reasoning," rewarding the depth of the analysis without checking whether its conclusion was correct

🔍 Root Cause:

The system mistook a detailed explanation for a correct one

💡 Improvement Opportunity:

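Balance reasoning quality against answer validation, so that a response's conclusion is checked (here, against detector saturation) before its level of detail is rewarded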
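One way to picture this improvement is a selection step that validates answers first and only then ranks by reasoning quality. The following is a rough sketch under stated assumptions, not the orchestration system's actual logic: `Response`, `select_response`, the `reasoning_score` values, and the `max_unsaturated` bound are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Response:
    model: str
    answer: int
    reasoning_score: float  # hypothetical judge score for depth of reasoning


def select_response(
    responses: List[Response],
    is_valid: Callable[[Response], bool],
) -> Optional[Response]:
    """Pick the best-scored response among those that pass validation.

    Returns None when no response passes, so the caller can escalate
    instead of silently falling back to the most detailed answer.
    """
    valid = [r for r in responses if is_valid(r)]
    if not valid:
        return None
    return max(valid, key=lambda r: r.reasoning_score)


# Hypothetical scores; only the answers (2 vs. 3) come from the case study.
responses = [
    Response("Claude", answer=2, reasoning_score=0.70),
    Response("GPT-5", answer=3, reasoning_score=0.90),
    Response("Gemini", answer=3, reasoning_score=0.95),
    Response("Grok", answer=3, reasoning_score=0.85),
]

# Assumed independent check: the count cannot exceed the number of
# candidates that stay below the detector's saturation limit.
max_unsaturated = 2


def passes_saturation_check(r: Response) -> bool:
    return r.answer <= max_unsaturated


score_only = max(responses, key=lambda r: r.reasoning_score)
validated = select_response(responses, passes_saturation_check)

print(score_only.model)  # ranking by reasoning score alone
print(validated.model)   # ranking after the validation filter
```

The design choice is to treat validation as a hard filter and not as one more weighted score, so a thorough but physically invalid answer cannot outrank a brief correct one.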