A minimal Streamlit application that hosts a debate between two instances of Claude, arguing opposite sides of a topic, with a human judge evaluating the result.

This project was conceived and implemented in two hours at the First Ever Claude Speedrun Hackathon @ Berkeley.
- Select from predefined debate topics or create your own
- Watch two Claude instances debate opposite sides of an issue
- Control debate progression with a turn-based system
- Vote for the debater you thought presented better arguments
- Simple and intuitive UI built with Streamlit
- Real-time research capabilities using Perplexity Sonar API
- Citation support for factual claims in debates
```mermaid
graph TD
    %% Main Components
    User[User/Judge] --> |Selects Topic & Settings| App[Streamlit App]
    App --> |Renders UI| UI[UI Components]
    App --> |Manages Debate| DE[Debate Engine]
    DE --> |API Calls| CAPI[Claude API]
    DE --> |Research Queries| RC[Research Component]

    %% Research Flow
    RC --> |Direct API| PAPI[Perplexity API Client]
    RC --> |MCP Fallback| PMCP[Perplexity MCP Integration]
    PAPI --> |Web Search| Web((Internet))
    PMCP --> |Web Search| Web

    %% Debate Flow
    subgraph Debate Flow
        DE --> |Preparation| Stage1[Preparation Phase]
        Stage1 --> |Plans Approved| Stage2[Opening Statements]
        Stage2 --> Stage3[First Rebuttal]
        Stage3 --> Stage4[Second Rebuttal]
        Stage4 --> Stage5[Closing Statements]
        Stage5 --> |Complete| Vote[User Voting]
    end

    %% Two Claude Instances
    CAPI --> |Pro Arguments| Claude1[Claude Instance 1<br/>PRO Position]
    CAPI --> |Con Arguments| Claude2[Claude Instance 2<br/>CON Position]

    %% Research Integration
    subgraph Research Integration
        RC --> |Pro Research| ProResearch[Pro Side Research]
        RC --> |Con Research| ConResearch[Con Side Research]
        ProResearch --> |Citations| Claude1
        ConResearch --> |Citations| Claude2
    end

    %% Data Storage
    Config[Configuration<br/>settings.py] --> DE
    Config --> CAPI
    Config --> RC

    %% UI Components
    UI --> |Displays| DebateUI[Debate Content]
    UI --> |Shows| ResearchUI[Research Data]
    UI --> |Controls| ProgressUI[Debate Progress]

    %% Styling
    classDef core fill:#f9f,stroke:#333,stroke-width:2px,color:#333
    classDef api fill:#bbf,stroke:#333,stroke-width:2px,color:#333
    classDef ui fill:#bfb,stroke:#333,stroke-width:2px,color:#333
    classDef flow fill:#fbb,stroke:#333,stroke-width:2px,color:#333

    class App,DE,RC core
    class CAPI,PAPI,PMCP,Claude1,Claude2 api
    class UI,DebateUI,ResearchUI,ProgressUI ui
    class Stage1,Stage2,Stage3,Stage4,Stage5 flow
```
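The Debate Flow subgraph maps to a simple linear state machine. A minimal sketch with hypothetical names (the actual implementation in `modules/debate_engine.py` may differ):

```python
from enum import Enum, auto

class DebateStage(Enum):
    """Hypothetical stages mirroring the diagram above; the real
    stage names in modules/debate_engine.py may differ."""
    PREPARATION = auto()
    OPENING = auto()
    FIRST_REBUTTAL = auto()
    SECOND_REBUTTAL = auto()
    CLOSING = auto()
    VOTING = auto()

# Enum members iterate in definition order, giving the debate sequence.
STAGE_ORDER = list(DebateStage)

def next_stage(stage: DebateStage) -> DebateStage:
    """Advance one stage; voting is terminal."""
    idx = STAGE_ORDER.index(stage)
    return STAGE_ORDER[min(idx + 1, len(STAGE_ORDER) - 1)]
```

Keeping progression in one place like this makes the turn-based UI a matter of rendering the current stage and calling `next_stage` when the user clicks through.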
- Python 3.8+
- Anthropic API key
- Perplexity Sonar API key (optional, for research capabilities)
- Clone the repository:

```bash
git clone https://github.com/yourusername/claude-debate.git
cd claude-debate
```

- Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up your environment variables:

```bash
cp .env.example .env
```

Then edit `.env` to add your Anthropic API key and Perplexity API key (if using research features).
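For reference, a `.env` file of `KEY=VALUE` lines can be loaded with stdlib Python alone; this is a minimal sketch (the app itself may rely on `python-dotenv` or another mechanism):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: read KEY=VALUE lines into os.environ,
    skipping blanks and comments, without overriding existing values."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```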
```bash
streamlit run app.py
```

The app will be available at http://localhost:8501.
```
claude-debate/
├── app.py                 # Main Streamlit entry point
├── requirements.txt       # Project dependencies
├── .env.example           # Example environment variables
├── .gitignore             # Git ignore file
├── modules/
│   ├── claude_api.py      # Claude API wrapper
│   └── debate_engine.py   # Core debate logic
├── ui/
│   └── components.py      # UI components
└── config/
    └── settings.py        # App settings and configurations
```
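The `claude_api.py` wrapper can be imagined along these lines. This is a hedged sketch using the official `anthropic` SDK; the prompt wording, model name, and function names are invented for illustration, not the app's actual code:

```python
import os
from typing import Dict, List

def build_system_prompt(topic: str, side: str) -> str:
    """Debater persona prompt. Wording here is illustrative; the real
    prompt lives in modules/claude_api.py or config/settings.py."""
    return (
        f"You are a skilled debater arguing the {side.upper()} side of the "
        f"topic: {topic}. Make concise, well-supported arguments and respond "
        "directly to your opponent's points."
    )

def get_argument(topic: str, side: str, transcript: List[Dict[str, str]]) -> str:
    """One debate turn: send the running transcript to Claude.
    Model name and max_tokens are assumptions, not the app's settings."""
    import anthropic  # official SDK; imported lazily so the prompt helper works without it

    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=build_system_prompt(topic, side),
        messages=transcript,
    )
    return response.content[0].text
```

Running two such calls with opposite `side` values against a shared transcript is all the "two Claude instances" pattern requires.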
The project uses the following branching strategy for collaborative development:

- `main`: Production-ready code
- `develop`: Integration branch for features
- Feature branches: Individual components (branched from `develop`)

To contribute:

- Create a feature branch from `develop`
- Implement your changes
- Submit a pull request to merge back into `develop`
- After testing and review, changes will be merged into `main`
The Claude vs Claude Debate System can be more than a demo application: by creating structured, argumentative discourse between AI systems with traceable reasoning paths, the framework offers a testbed for studying and improving both current and frontier AI systems across several critical research domains.
The debate format provides a powerful mechanism for knowledge distillation:

- **Cross-model Distillation**: Debates between different model versions (e.g., Claude-3-Sonnet vs Claude-3-Opus) can identify where the larger model's superior reasoning appears, allowing targeted capture of these capabilities.
- **Synthetic Data Generation**: Debate transcripts create a rich corpus of high-quality reasoning chains with built-in critique and improvement cycles. This synthetic data can train smaller, specialized models that retain sophisticated reasoning capabilities at a fraction of the computational cost.
- **Reasoning Template Extraction**: The stage-based progression (preparation, opening, rebuttals, closing) provides explicit templates for different phases of analytical thinking that can be distilled into more compact models.
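The synthetic-data idea can be sketched concretely: flatten a finished debate into JSONL training examples. The field names (`stage`, `side`, `text`) are assumptions about the transcript structure, not the app's actual schema:

```python
import json
from typing import Dict, List

def transcript_to_examples(topic: str, turns: List[Dict[str, str]]) -> List[str]:
    """Convert each debate turn into a JSONL prompt/completion record.
    Each turn dict is assumed to hold 'stage', 'side', and 'text' keys."""
    examples = []
    for turn in turns:
        record = {
            "prompt": (
                f"Debate topic: {topic}\nStage: {turn['stage']}\n"
                f"Argue the {turn['side']} side."
            ),
            "completion": turn["text"],
        }
        examples.append(json.dumps(record))
    return examples
```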
The debate framework offers unique advantages for post-training:

- **Adversarial Improvement**: By having models critique each other, the system naturally identifies weaknesses in reasoning, creating a targeted dataset of "hard cases" for fine-tuning.
- **Factuality Enhancement**: The research integration with Perplexity provides a mechanism for generating training data that couples claims with citations, teaching models to ground assertions in verifiable sources.
- **Multi-step Reasoning**: Debates naturally involve complex chains of reasoning, with rebuttals addressing potential flaws, creating ideal training examples of thorough multi-step reasoning.
- **Balance Calibration**: Exposure to multiple perspectives on contentious topics helps calibrate models to recognize the legitimate arguments on different sides, improving epistemic humility.
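On the factuality side, Perplexity's Sonar models are exposed through an OpenAI-compatible chat-completions endpoint. A sketch of building a research query payload (the model name and prompt wording are assumptions, and the app's real query construction lives in its research component):

```python
from typing import Dict

PERPLEXITY_URL = "https://api.perplexity.ai/chat/completions"

def build_research_request(claim: str) -> Dict:
    """Payload for Perplexity's chat-completions endpoint, asking for
    citation-backed evidence about a debater's claim."""
    return {
        "model": "sonar",
        "messages": [
            {"role": "system",
             "content": "Return concise, citation-backed facts."},
            {"role": "user",
             "content": f"Find evidence for or against: {claim}"},
        ],
    }
```

Posting this payload with an `Authorization: Bearer <PERPLEXITY_API_KEY>` header returns a completion whose citations can be attached to the relevant debate turn.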
Perhaps the most promising applications are in alignment and safety:

- **Value Pluralism Exploration**: Debates on complex ethical and philosophical questions can map out different value systems and how they interact, helping researchers understand how models reason about normative questions.
- **Deception Detection**: Debates with strategic incentives can reveal how models might attempt to persuade through subtle rhetorical tactics rather than honest reasoning, allowing researchers to identify and mitigate such behaviors.
- **Red-teaming Through Opposition**: By setting up debates on sensitive topics, researchers can observe how models formulate arguments that might be concerning from a safety perspective, even when not explicitly prompted to produce harmful content.
- **Preference Learning**: Human judging of debates provides rich signals about what constitutes high-quality reasoning from a human perspective, offering nuanced feedback data for aligning models with human values.
- **Constitutional Principles Testing**: Debates can probe how models apply constitutional principles or axiomatic thinking when arguing positions that test emotional boundaries, revealing edge cases and ambiguities in constitutional AI approaches.
The system architecture enables sophisticated data collection:

- **Fine-grained Instrumentation**: Each debate generates structured, stage-specific data on model outputs, enabling detailed analysis of reasoning patterns across topics and debate phases.
- **Comparative Evaluation**: Direct comparison between opposing positions on the same topic enables nuanced evaluation of model capabilities, going beyond simple benchmarks.
- **Human Feedback Integration**: The voting mechanism at the end of the workflow provides a natural channel for human feedback, creating a reinforcement learning from human feedback (RLHF) pipeline for model improvement.
- **Longitudinal Studies**: Running debates with successive model versions enables tracking of capability evolution and alignment drift on consistent scenarios.
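The human-feedback channel maps naturally onto preference-pair data. A sketch, assuming a judged-debate record with hypothetical field names:

```python
from typing import Dict

def vote_to_preference_pair(debate: Dict) -> Dict:
    """Turn one judged debate into an RLHF-style preference record.
    Assumed fields: 'topic', 'pro_closing', 'con_closing',
    and 'winner' ('pro' or 'con')."""
    if debate["winner"] == "pro":
        chosen, rejected = debate["pro_closing"], debate["con_closing"]
    else:
        chosen, rejected = debate["con_closing"], debate["pro_closing"]
    return {
        "prompt": f"Present the strongest case on: {debate['topic']}",
        "chosen": chosen,
        "rejected": rejected,
    }
```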
To fully realize this vision, several research-oriented features could be implemented:

- **Model Trace Visualization** (stretch goal for labs): Tools to visualize attention patterns and activation values during key reasoning steps, especially when models change their stance or concede points.
- **Automated Logical Analysis**: Formal verification of argument structures to identify fallacies, contradictions, and strong inferential patterns.
- **Multi-model Tournaments** (similar to LMArena): Expanding beyond Claude to run tournaments between different models (Claude, GPT, Gemini, etc.) to identify relative strengths across reasoning domains.
- **Specialized Debate Formats**: Structured formats, such as the Gricean Scorecard or Bayesian updating frameworks, that enforce particular reasoning norms.
- **Cognitive Science Research**: Partnering with cognitive scientists to compare AI debate behaviors with human debate patterns, identifying areas where model reasoning diverges from human reasoning.
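Multi-model tournaments would need a rating scheme; LMArena-style leaderboards use Elo, which takes only a few lines:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update for one pairing.
    score_a is 1.0 for a win, 0.5 for a tie, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

Feeding each human vote through `elo_update` with the winner's score as 1.0 yields a running leaderboard across models and topics.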
By developing these capabilities, the Claude vs Claude Debate System could evolve from a hackathon demonstration into critical research infrastructure for understanding and improving AI systems through dialectical methods.
This project is licensed under the MIT License - see the LICENSE file for details.