Claude vs Claude Debate System

A minimal Streamlit application that hosts a debate between two instances of Claude, arguing opposite sides of a topic with human judge evaluation.

This project was conceived and implemented in two hours at the First Ever Claude Speedrun Hackathon @ Berkeley.

DEMO LINKS

Features

  • Select from predefined debate topics or create your own
  • Watch two Claude instances debate opposite sides of an issue
  • Control debate progression with a turn-based system
  • Vote for the debater you thought presented better arguments
  • Simple and intuitive UI built with Streamlit
  • Real-time research capabilities using Perplexity Sonar API
  • Citation support for factual claims in debates
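The research integration described above uses the Perplexity API directly and falls back to an MCP integration when the direct call fails (see the architecture diagram below). That fallback boils down to a simple try/except pattern; the function names here are illustrative stand-ins, not the repository's actual module interface:

```python
def research(query, primary, fallback):
    """Try the direct API client first; fall back to the MCP integration."""
    try:
        return primary(query)
    except Exception as exc:  # e.g. network error, missing API key
        print(f"direct API failed ({exc}); using MCP fallback")
        return fallback(query)

# Stub clients standing in for the real Perplexity integrations
def direct_api(query):
    raise RuntimeError("no PERPLEXITY_API_KEY set")

def mcp_client(query):
    return {"query": query, "citations": ["https://example.com"]}

result = research("Is nuclear energy safe?", direct_api, mcp_client)
print(result["citations"])
```

Keeping both clients behind one `research()` entry point means the debate engine never needs to know which backend supplied the citations.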

System Architecture

graph TD
    %% Main Components
    User[User/Judge] --> |Selects Topic & Settings| App[Streamlit App]
    App --> |Renders UI| UI[UI Components]
    App --> |Manages Debate| DE[Debate Engine]
    DE --> |API Calls| CAPI[Claude API]
    DE --> |Research Queries| RC[Research Component]
    
    %% Research Flow
    RC --> |Direct API| PAPI[Perplexity API Client]
    RC --> |MCP Fallback| PMCP[Perplexity MCP Integration]
    PAPI --> |Web Search| Web((Internet))
    PMCP --> |Web Search| Web

    %% Debate Flow
    subgraph Debate Flow
        DE --> |Preparation| Stage1[Preparation Phase]
        Stage1 --> |Plans Approved| Stage2[Opening Statements]
        Stage2 --> Stage3[First Rebuttal]
        Stage3 --> Stage4[Second Rebuttal]
        Stage4 --> Stage5[Closing Statements]
        Stage5 --> |Complete| Vote[User Voting]
    end

    %% Two Claude Instances
    CAPI --> |Pro Arguments| Claude1[Claude Instance 1\nPRO Position]
    CAPI --> |Con Arguments| Claude2[Claude Instance 2\nCON Position]
    
    %% Research Integration
    subgraph Research Integration
        RC --> |Pro Research| ProResearch[Pro Side Research]
        RC --> |Con Research| ConResearch[Con Side Research]
        ProResearch --> |Citations| Claude1
        ConResearch --> |Citations| Claude2
    end

    %% Data Storage
    Config[Configuration\nSettings.py] --> DE
    Config --> CAPI
    Config --> RC
    
    %% UI Components
    UI --> |Displays| DebateUI[Debate Content]
    UI --> |Shows| ResearchUI[Research Data]
    UI --> |Controls| ProgressUI[Debate Progress]
    
    %% Styling
    classDef core fill:#f9f,stroke:#333,stroke-width:2px,color:#333
    classDef api fill:#bbf,stroke:#333,stroke-width:2px,color:#333
    classDef ui fill:#bfb,stroke:#333,stroke-width:2px,color:#333
    classDef flow fill:#fbb,stroke:#333,stroke-width:2px,color:#333
    
    class App,DE,RC core
    class CAPI,PAPI,PMCP,Claude1,Claude2 api
    class UI,DebateUI,ResearchUI,ProgressUI ui
    class Stage1,Stage2,Stage3,Stage4,Stage5 flow
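The Debate Flow subgraph above is a linear, turn-based progression, which can be sketched as a minimal state machine. The class and method names here are illustrative, not the actual API in modules/debate_engine.py:

```python
from enum import Enum


class Stage(Enum):
    PREPARATION = "Preparation Phase"
    OPENING = "Opening Statements"
    FIRST_REBUTTAL = "First Rebuttal"
    SECOND_REBUTTAL = "Second Rebuttal"
    CLOSING = "Closing Statements"
    VOTING = "User Voting"


# Linear progression matching the Debate Flow subgraph
ORDER = list(Stage)


class DebateEngine:
    """Illustrative turn-based engine: each side speaks once per stage,
    then the judge advances the debate."""

    def __init__(self):
        self.stage = Stage.PREPARATION
        self.transcript = []  # (stage, side, text) tuples

    def record_turn(self, side: str, text: str) -> None:
        assert side in ("PRO", "CON")
        self.transcript.append((self.stage, side, text))

    def advance(self) -> Stage:
        """Move to the next stage, stopping at VOTING."""
        idx = ORDER.index(self.stage)
        if idx < len(ORDER) - 1:
            self.stage = ORDER[idx + 1]
        return self.stage


engine = DebateEngine()
engine.record_turn("PRO", "Opening plan...")
engine.advance()
print(engine.stage.value)  # Opening Statements
```

In the Streamlit app, the judge's button clicks would drive `advance()`, while each Claude instance's response is appended via `record_turn()`.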

Getting Started

Prerequisites

  • Python 3.8+
  • Anthropic API key
  • Perplexity Sonar API key (optional, for research capabilities)

Installation

  1. Clone the repository
git clone https://github.com/yourusername/claude-debate.git
cd claude-debate
  2. Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies
pip install -r requirements.txt
  4. Set up your environment variables
cp .env.example .env

Then edit .env to add your Anthropic API key and Perplexity API key (if using research features).
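A filled-in .env might look like the following; the exact variable names depend on what .env.example defines, so treat these as placeholders:

```shell
# Placeholder values -- replace with your own credentials
ANTHROPIC_API_KEY=your-anthropic-key-here
PERPLEXITY_API_KEY=your-perplexity-key-here
```

Leave the Perplexity key unset if you are not using the research features.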

Running the App

streamlit run app.py

The app will be available at http://localhost:8501.

Project Structure

/claude-debate/
├── app.py                   # Main Streamlit entry point
├── requirements.txt         # Project dependencies
├── .env.example             # Example environment variables
├── .gitignore               # Git ignore file
├── modules/
│   ├── claude_api.py        # Claude API wrapper
│   └── debate_engine.py     # Core debate logic
├── ui/
│   └── components.py        # UI components
└── config/
    └── settings.py          # App settings and configurations

Branching Strategy

The project uses the following branching strategy for collaborative development:

  • main: Production-ready code
  • develop: Integration branch for features
  • Feature branches: Individual components (branched from develop)

Contributing

  1. Create a feature branch from develop
  2. Implement your changes
  3. Submit a pull request to merge back into develop
  4. After testing and review, changes will be merged into main

A Vision for Deep Applications - Distillation, Post-training, Alignment, and Safety Research

The Claude vs Claude Debate System can be more than a demo application: by staging structured, argumentative discourse between AI systems with traceable reasoning paths, it provides a framework for studying and improving both current and frontier models across several critical research domains.

Knowledge Distillation and Model Compression

The debate format provides a powerful mechanism for knowledge distillation:

  1. Cross-model Distillation: Debates between different model versions (e.g., Claude-3-Sonnet vs Claude-3-Opus) can identify where the larger model's superior reasoning appears, allowing targeted capture of these capabilities.

  2. Synthetic Data Generation: Debate transcripts create a rich corpus of high-quality reasoning chains with built-in critique and improvement cycles. This synthetic data can train smaller, specialized models that retain sophisticated reasoning capabilities at a fraction of the computational cost.

  3. Reasoning Template Extraction: The stage-based progression (preparation, opening, rebuttals, closing) provides explicit templates for different phases of analytical thinking that can be distilled into more compact models.

Post-training and Supervised Fine-tuning

The debate framework offers unique advantages for post-training:

  1. Adversarial Improvement: By having models critique each other, the system naturally identifies weaknesses in reasoning, creating a targeted dataset of "hard cases" for fine-tuning.

  2. Factuality Enhancement: The research integration with Perplexity creates a powerful mechanism for generating training data that couples claims with citations, teaching models to ground assertions in verifiable sources.

  3. Multi-step Reasoning: Debates naturally involve complex chains of reasoning with rebuttals addressing potential flaws, creating ideal training examples of thorough multi-step reasoning processes.

  4. Balance Calibration: Exposure to multiple perspectives on contentious topics helps calibrate models to recognize the legitimate arguments on different sides, improving epistemological humility.

Alignment and Safety Research

Perhaps the most promising applications are in alignment and safety:

  1. Value Pluralism Exploration: Debates on complex ethical and philosophical ideas can map out different value systems and how they interact, helping researchers understand how models reason about normative questions.

  2. Deception Detection: Debates with strategic incentives can reveal how models might attempt to persuade through subtle rhetorical tactics rather than honest reasoning, allowing researchers to identify and mitigate such behaviors.

  3. Red-teaming Through Opposition: By setting up debates on sensitive topics, researchers can observe how models formulate arguments that might be concerning from a safety perspective, even when not explicitly prompted to produce harmful content.

  4. Preference Learning: Human judging of debates can provide rich signals about what constitutes high-quality reasoning from a human perspective, offering nuanced feedback data for aligning models with human values.

  5. Constitutional Principles Testing: Debates can test how models apply constitutional principles or axiomatic thinking when arguing for positions that might test emotional boundaries, revealing edge cases and ambiguities in constitutional AI approaches.

Research Data Collection and Analysis

The system architecture enables sophisticated data collection:

  1. Fine-grained Instrumentation: Each debate generates structured, stage-specific data on model outputs, enabling detailed analysis of reasoning patterns across different topics and debate phases.

  2. Comparative Evaluation: Direct comparison between positions on the same topic can facilitate nuanced evaluation of model capabilities, going beyond simple benchmarks.

  3. Human Feedback Integration: The voting mechanism at the tail end of the workflow provides a natural channel for human feedback, creating a reinforcement learning from human feedback (RLHF) pipeline for model improvement.

  4. Longitudinal Studies: Running debates with model versions over time enables tracking of capability evolution and alignment drift on consistent scenarios.

Future Research Integrations

To fully realize this vision, several research-oriented features can and will be implemented:

  1. Stretch Goal for Labs - Model Trace Visualization: Adding tools to visualize attention patterns and activation values during key reasoning steps, especially when models change their stance or concede points.

  2. Automated Logical Analysis: Implementing formal verification of argument structures to identify fallacies, contradictions, and strong inferential patterns.

  3. Similar to LLMArena: Multi-model Tournaments: Expanding beyond Claude to create tournaments between different models (Claude, GPT, Gemini, etc.) to identify relative strengths in reasoning domains.

  4. Interaction Structures - Specialized Debate Formats: Implementing structured debate formats like the Gricean Scorecard or Bayesian updating frameworks that enforce particular reasoning norms.

  5. Cognitive Science Research: Partnering with cognitive scientists to compare AI debate behaviors with human debate patterns, identifying areas where alignment diverges from human reasoning.

By developing these capabilities, the Claude vs Claude Debate System could evolve from a demonstration into critical research infrastructure for understanding and improving AI systems through dialectical methods.

License

This project is licensed under the MIT License - see the LICENSE file for details.
