Video-to-RAG: Universal Audio Transcription & AI Interaction

A comprehensive system for transcribing audio content into text and creating interactive AI personas that can engage in conversations about the content. Originally designed for spiritual and philosophical lectures, this system now supports multiple content domains including business, education, medical, legal, and general-purpose transcription.

🎯 Project Overview

This project provides both a web interface for individual file processing and a robust batch processing system for large collections of audio/video files. With domain-specific optimizations, it excels at processing specialized content including Vedanta lectures, business meetings, educational content, medical dictation, and more.

Key Components

  1. Web Application (VIdeo-Transcription/): Streamlit-based interface for uploading, transcribing, and chatting with AI personas
  2. Batch Transcription (batch_transcription/): Python workflow for processing entire directories of audio files
  3. Video Collection (videos/): Organized spiritual lecture content with existing transcriptions
  4. Output Examples (example_output_formats/): Sample outputs in various formats (TXT, JSON, SRT, VTT, TSV)

✨ Features

🎬 Video Transcription

  • Multi-format support: MP4, AVI, MOV, MKV, M4A
  • Large file handling: Up to 2GB with chunked processing
  • Optimized for spiritual content: Custom prompts for Sanskrit/Vedanta terminology
  • Multiple output formats: Plain text, timestamped, SRT subtitles, JSON

🌐 Translation & Localization

  • 130+ languages supported via Google Translate
  • Preserves formatting and timestamps during translation
  • Context-aware translation for mixed-language content

🤖 AI Persona Generation

  • Domain-aware analysis: Specialized persona creation for different content types
  • Analyzes speech patterns and personality traits from transcripts
  • Creates contextual personas that mimic speaker characteristics
  • Interactive chat interface with generated personas
  • Powered by Ollama for local AI processing

⚑ Batch Processing

  • Directory-based workflow: Point at a folder, get transcriptions back
  • Domain-specific optimization: 8 preset domains (vedanta, medical, business, etc.)
  • Real-time progress tracking: Live updates every 2 seconds with processing status
  • Parallel processing: Configurable worker threads with per-worker status
  • Resume capability: Skip already processed files
  • Pluggable engines: Local Whisper or external API
  • Comprehensive logging: Detailed progress tracking and completion statistics
  • Smart domain detection: Auto-suggest optimal domain based on content

🚀 Quick Start

Prerequisites

  1. Python 3.10+
  2. FFmpeg for audio/video processing
  3. Ollama for AI persona features
  4. OpenAI Whisper for transcription

# Install FFmpeg
# macOS: brew install ffmpeg
# Ubuntu: sudo apt-get install ffmpeg
# Windows: Download from https://ffmpeg.org/

# Install and start Ollama
# Visit: https://ollama.ai/
ollama serve
ollama pull mistral:instruct

Installation

# Clone the repository
git clone <repository-url>
cd video-to-rag

# Set up web application
cd VIdeo-Transcription
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

# Set up batch transcription
cd ../batch_transcription
pip install -r requirements.txt

💻 Usage

Web Application

cd VIdeo-Transcription
streamlit run main.py

Navigate to http://localhost:8501 for the web interface.

Features:

  • Upload video/audio files
  • Real-time transcription with progress tracking
  • Optional translation to 130+ languages
  • AI persona generation and chat
  • Client management and transcription history

Batch Transcription

General Content

# Basic usage - auto-detects content type
python main.py /path/to/audio/files

# Explicit general domain
python main.py /path/to/audio/files --domain general

🕉️ Vedanta & Spiritual Content

# Optimized for Sanskrit terminology and spiritual discourse
python main.py /path/to/vedanta/lectures --domain vedanta

# High-accuracy for important teachings
python main.py /path/to/spiritual/content \
    --domain vedanta --model large

# Vedanta with custom settings
python main.py /path/to/lectures \
    --domain vedanta --workers 4 --temperature 0.0

Other Domains

# Medical transcription (high accuracy)
python main.py /path/to/medical/files --domain medical

# Business meetings
python main.py /path/to/meetings --domain business

# Educational lectures  
python main.py /path/to/classes --domain education

# Legal proceedings
python main.py /path/to/legal/files --domain legal

Advanced Usage

# List available domains with descriptions
python main.py --list-domains

# Get domain suggestion based on folder name
python main.py --suggest-domain /path/to/files

# Use external API
python main.py /path/to/files \
    --engine api --api-url https://api.example.com --api-key your-key

# Custom processing settings with real-time monitoring
python main.py /path/to/files \
    --workers 8 --model large --temperature 0.1 --no-resume

# Test with sample file (quick validation)
python main.py ./videos/testing --domain general --workers 2

Testing & Development

# Test environment variable loading
python test_environment.py

# Test audio chunking functionality
python test_chunking.py

# Test realistic chunking with large files
python test_realistic_chunking.py

# Quick test with provided sample file (20MB, ~2 minutes)
python main.py ./videos/testing --domain general

Test Data: The repository includes a test audio file (videos/testing/Three_Ai_agents_realize...wav) that's perfect for:

  • Testing chunking algorithms (20MB file)
  • Validating transcription engines
  • Quick functionality verification
  • Development workflow testing

Real-Time Progress Features

When processing large files (which can take 1+ hours), you'll see:

Progress: 2/10 (20.0%) | ✓ 2 ✗ 0 | Time: 45.3 minutes | Processing: W1: lecture_file.wav...

  • Live progress: Updates every 2 seconds
  • Success/failure counts: Real-time completion statistics
  • Per-worker status: See which files each worker is processing
  • Time tracking: Elapsed time and estimated completion
  • Processing rate: Files per minute for time estimation

Output structure:

your_audio_directory/
├── audio_file1.wav
├── audio_file2.wav
├── transcriptions/
│   ├── audio_file1.txt
│   └── audio_file2.txt
└── logs/
    ├── process.log
    ├── completed.txt
    └── failed.txt

Programmatic Usage

from pathlib import Path
from config import BatchTranscriptionConfig
from main import transcribe_directory

# Simple usage (general domain)
result = transcribe_directory(Path("/path/to/audio/files"))

# Domain-specific processing
config = BatchTranscriptionConfig.from_domain("vedanta")
result = transcribe_directory(Path("/path/to/vedanta/lectures"), config)

# Vedanta with custom overrides
config = BatchTranscriptionConfig.from_domain("vedanta", 
    whisper_model="large",
    processing_max_workers=8)
result = transcribe_directory(Path("/path/to/lectures"), config)

# External API configuration
config = BatchTranscriptionConfig.from_domain("business")
config.engine = "api"
config.api_base_url = "https://api.example.com"
config.api_key = "your-api-key"
result = transcribe_directory(Path("/path/to/meetings"), config)

# Using environment variables
config = BatchTranscriptionConfig.from_env("vedanta")
result = transcribe_directory(Path("/path/to/lectures"), config)

print(f"Successfully transcribed {result['completed']} files")
print(f"Failed: {result['failed']} files")

⚙️ Configuration

Environment Variables

# Domain and processing settings
export TRANSCRIPTION_DOMAIN=vedanta        # or 'general', 'medical', 'business', etc.
export TRANSCRIPTION_ENGINE=local          # or 'api'
export TRANSCRIPTION_MAX_WORKERS=4

# Whisper settings
export WHISPER_MODEL=turbo                  # or 'large' for higher accuracy
export WHISPER_API_BASE_URL=https://api.example.com
export WHISPER_API_KEY=your-api-key
export WHISPER_INITIAL_PROMPT="Custom prompt for specialized content"

# Ollama settings (for web app)
export OLLAMA_API_BASE=http://localhost:11434
export DEFAULT_MODEL=mistral:instruct
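
For scripted runs, the same settings can also come from a .env file. A minimal sketch, assuming the python-dotenv package and the from_env helper shown under Programmatic Usage:

# Hedged sketch: load a .env file, then build a domain config from the environment
from pathlib import Path
from dotenv import load_dotenv
from config import BatchTranscriptionConfig
from main import transcribe_directory

load_dotenv()  # copies TRANSCRIPTION_* / WHISPER_* entries from .env into os.environ
config = BatchTranscriptionConfig.from_env("vedanta")
result = transcribe_directory(Path("/path/to/lectures"), config)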

Domain-Optimized Settings

The system provides optimized configurations for different content types:

Vedanta Domain (--domain vedanta)

  • Model: turbo (fast) or large (accuracy)
  • Language: Forced English with Sanskrit term recognition
  • Initial Prompt: Enhanced for Sanskrit terminology and spiritual discourse
  • Temperature: 0.0 (deterministic for consistent Sanskrit terms)
  • Beam Size: 5 (quality vs speed balance)
  • Specialized Vocabulary: 40+ Sanskrit/Vedanta terms

Medical Domain (--domain medical)

  • Model: large (higher accuracy for medical terms)
  • Temperature: 0.0 (deterministic for medical precision)
  • Beam Size: 7 (higher accuracy)
  • Specialized Vocabulary: Medical terminology

Business Domain (--domain business)

  • Model: turbo (efficient for meetings)
  • Initial Prompt: Optimized for business terminology and decisions
  • Specialized Vocabulary: Business and corporate terms

General Domain (--domain general)

  • Model: turbo (balanced performance)
  • Initial Prompt: Generic high-quality transcription
  • Language: Auto-detect
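
Under the hood, presets like these correspond to openai-whisper's standard transcribe() options. The sketch below shows how the documented vedanta settings might map onto a raw Whisper call; the preset dictionary is illustrative, not the project's actual domains.py.

# Illustrative mapping from a domain preset to openai-whisper options
import whisper

vedanta_preset = {                      # hypothetical preset mirroring the settings above
    "model": "turbo",
    "language": "en",                   # forced English with Sanskrit term recognition
    "initial_prompt": "A Vedanta lecture discussing Brahman, Atman, dharma, and moksha.",
    "temperature": 0.0,                 # deterministic for consistent Sanskrit terms
    "beam_size": 5,                     # quality vs. speed balance
}

model = whisper.load_model(vedanta_preset["model"])
result = model.transcribe(
    "lecture.wav",
    language=vedanta_preset["language"],
    initial_prompt=vedanta_preset["initial_prompt"],
    temperature=vedanta_preset["temperature"],
    beam_size=vedanta_preset["beam_size"],  # forwarded to the beam-search decoder
)
print(result["text"])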

📊 Project Structure

video-to-rag/
├── main.py                       # CLI entry point
├── batch_transcriber.py          # Main orchestrator with real-time progress
├── transcription_engine.py       # Pluggable engines (local/API)
├── config.py                     # Configuration with environment variable support
├── domains.py                    # Domain-specific presets and optimizations
├── audio_chunker.py              # Large file chunking for API compatibility
├── utils.py                      # Helper functions
├── vedanta_utils.py              # Vedanta-specific content enhancement
├── requirements.txt              # Package dependencies
├── videos/                       # Organized spiritual lecture content
│   ├── testing/                  # Test files with transcriptions
│   ├── COMPLETE - batch_1/       # Processed Bhagavad Gita Chapter 1
│   ├── COMPLETE - batch_2a/      # Processed spiritual lectures
│   ├── batch_2/                  # Additional Gita and Dharma lectures
│   ├── batch_3_onwards/          # Bhagavad Gita Chapter 2 content
│   └── *.wav                     # Individual spiritual lecture files
├── .env.example                  # Environment configuration template
├── .gitignore                    # Git exclusions for sensitive data
├── test_*.py                     # Test scripts for functionality validation
├── CLAUDE.md                     # Developer guidance and project context
└── README.md                     # Comprehensive project documentation

🔄 Engine Architecture

The system supports pluggable transcription engines:

Local Whisper Engine

  • Uses locally installed OpenAI Whisper
  • No external dependencies or API costs
  • Full control over processing
  • Optimized for batch processing

External API Engine

  • Ready for hosted Whisper services
  • Same parameter support as local engine
  • API key authentication
  • Suitable for cloud deployment
  • Large File Chunking: Automatically splits files >100MB for API compatibility
  • Asynchronous Processing: Non-blocking workflow with webhook callbacks
  • Progress Persistence: Resume interrupted jobs across sessions

Switching engines:

# Local processing
python main.py /path --engine local

# External API processing  
python main.py /path --engine api --api-url https://api.example.com

# WhisperAPI.com integration with environment variables
export WHISPER_API_BASE_URL=https://whisperapi.com
export WHISPER_API_KEY=your-api-key
python main.py /path --engine api

🚀 WhisperAPI Integration Roadmap

The system is being enhanced with comprehensive external API support for processing large-scale transcription jobs. This roadmap outlines the implementation plan for WhisperAPI.com integration and similar services.

Current Status

  • ✅ Basic API Engine: Foundation for external Whisper API calls
  • ✅ Domain-Optimized Configuration: 8 specialized presets (vedanta, medical, business, etc.)
  • ✅ Real-time Progress Tracking: Multi-worker status monitoring
  • ⚠️ Large File Limitation: Current implementation requires chunking for files >100MB

Implementation Increments

🔧 Increment 1: Environment & Chunking Foundation

Objective: Establish robust foundation for large file processing and API integration

Key Features:

  • Environment Variable Management: Secure API credential handling with python-dotenv
  • Audio Chunking System: FFmpeg-based splitting with configurable overlap for seamless reconstruction
  • File Size Detection: Automatic chunking trigger for files exceeding API limits
  • Chunk Metadata: Preserve timing and sequence information for accurate reassembly

Technical Implementation:

  • Add .env support with python-dotenv for secure credential management
  • Create AudioChunker class using FFmpeg for precise audio splitting
  • Implement overlap handling to prevent word cutoffs at chunk boundaries
  • Add chunk size configuration (default: 80MB with 10-second overlap)
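
A minimal sketch of that chunking approach, assuming ffmpeg and ffprobe are on PATH; the function names and defaults are illustrative rather than the project's actual audio_chunker.py:

# Time-based chunking with overlap via ffmpeg (illustrative sketch)
import json
import subprocess
from pathlib import Path

def probe_duration(path: Path) -> float:
    """Return audio duration in seconds using ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", str(path)],
        capture_output=True, text=True, check=True)
    return float(json.loads(out.stdout)["format"]["duration"])

def chunk_audio(path: Path, target_mb: int = 80, overlap_s: float = 10.0) -> list[Path]:
    """Split audio into ~target_mb pieces that overlap by overlap_s seconds."""
    duration = probe_duration(path)
    size_mb = path.stat().st_size / 1024 / 1024
    chunk_s = duration * (target_mb / size_mb)   # seconds that fit in ~target_mb
    step = max(chunk_s - overlap_s, 1.0)         # guard against a non-advancing loop
    chunks, start, index = [], 0.0, 0
    while start < duration:
        out = path.with_name(f"{path.stem}_chunk{index:03d}{path.suffix}")
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(path), "-ss", str(start), "-t", str(chunk_s),
             "-c", "copy", str(out)],            # stream copy is fast; re-encode for
            check=True, capture_output=True)     # sample-accurate boundaries
        chunks.append(out)
        start += step
        index += 1
    return chunks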

🌐 Increment 2: WhisperAPI Client Integration

Objective: Complete WhisperAPI.com integration with parameter compatibility

Key Features:

  • API Parameter Mapping: Handle WhisperAPI limitations (no temperature, beam_size)
  • Chunked Upload Workflow: Sequential processing of large file chunks
  • Response Reconstruction: Seamless text reassembly from multiple API calls
  • Error Handling: Retry logic and graceful degradation

Technical Implementation:

  • Update ExternalWhisperEngine with WhisperAPI.com endpoint specifics
  • Implement chunk upload queue with progress tracking per chunk
  • Create text reconstruction pipeline with overlap detection
  • Add API-specific error handling and retry mechanisms
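
One simple form of overlap detection for the reconstruction step, sketched under the assumption that each chunk transcript repeats a few words of its neighbor: compare the tail of one transcript with the head of the next and drop the longest matching run.

# Naive overlap resolution between consecutive chunk transcripts (sketch)
def merge_transcripts(chunks: list[str], max_overlap_words: int = 30) -> str:
    """Join chunk transcripts, dropping words duplicated at chunk boundaries."""
    merged = chunks[0].split()
    for text in chunks[1:]:
        words = text.split()
        drop = 0
        # longest suffix of `merged` that equals a prefix of `words`
        for n in range(min(max_overlap_words, len(merged), len(words)), 0, -1):
            if merged[-n:] == words[:n]:
                drop = n
                break
        merged.extend(words[drop:])
    return " ".join(merged)

print(merge_transcripts(["the self is not the body", "not the body but pure awareness"]))
# -> "the self is not the body but pure awareness"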

📊 Increment 3: Multi-file Progress Visualization

Objective: Enhanced progress tracking for chunked and multi-file processing

Key Features:

  • Hierarchical Progress: File-level and chunk-level progress display
  • Real-time Updates: Live status for chunked file processing
  • Processing Analytics: Chunk processing rates and time estimates
  • Visual Indicators: Clear distinction between chunked and regular files

Technical Implementation:

  • Extend progress tracking to handle chunk-level granularity
  • Create nested progress bars for chunked files
  • Add processing analytics and rate calculations
  • Implement visual indicators for different processing states
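
A hypothetical shape for that chunk-aware tracking; the class names and rendering below are assumptions, not the project's actual progress code:

# Hierarchical file/chunk progress model (illustrative sketch)
from dataclasses import dataclass, field

@dataclass
class FileProgress:
    name: str
    total_chunks: int = 1            # 1 for regular (unchunked) files
    done_chunks: int = 0

@dataclass
class BatchProgress:
    files: list = field(default_factory=list)

    def render(self) -> str:
        done = sum(f.done_chunks == f.total_chunks for f in self.files)
        lines = [f"Files: {done}/{len(self.files)} complete"]
        for f in self.files:
            kind = "chunked" if f.total_chunks > 1 else "regular"
            lines.append(f"  {f.name} [{kind}]: {f.done_chunks}/{f.total_chunks} chunks")
        return "\n".join(lines)

batch = BatchProgress([FileProgress("lecture.wav", total_chunks=12, done_chunks=5),
                       FileProgress("qna.wav")])
print(batch.render())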

⚑ Increment 4: Asynchronous Processing Infrastructure

Objective: Non-blocking workflow with webhook support for long-running jobs

Key Features:

  • Async Job Submission: Non-blocking API calls with job tracking
  • Webhook Integration: Callback system for job completion notifications
  • Status Polling: Automatic job status checking with exponential backoff
  • Concurrent Chunk Processing: Parallel chunk uploads when supported

Technical Implementation:

  • Implement async/await patterns for API interactions
  • Create webhook server for job completion callbacks
  • Add job status polling with intelligent retry logic
  • Enable concurrent chunk processing with rate limiting
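
A sketch of the polling piece using aiohttp; the /jobs/{id} endpoint and the response fields are assumptions, since hosted Whisper services differ:

# Job status polling with exponential backoff (endpoint layout is hypothetical)
import asyncio
import aiohttp

async def poll_job(session: aiohttp.ClientSession, job_id: str,
                   base_url: str, max_wait: float = 60.0) -> dict:
    """Poll a job status endpoint, doubling the wait after each pending response."""
    delay = 2.0
    while True:
        async with session.get(f"{base_url}/jobs/{job_id}") as resp:
            job = await resp.json()
        if job["status"] in ("completed", "failed"):
            return job
        await asyncio.sleep(delay)
        delay = min(delay * 2, max_wait)   # exponential backoff, capped

async def main():
    async with aiohttp.ClientSession() as session:
        job = await poll_job(session, "job-123", "https://api.example.com")
        print(job["status"])

asyncio.run(main())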

💾 Increment 5: Job Management & Persistence

Objective: Robust job tracking and recovery across sessions

Key Features:

  • Job Persistence: SQLite-based job state tracking across restarts
  • Resume Capability: Continue interrupted chunked file processing
  • Job History: Complete audit trail of transcription jobs
  • Failure Recovery: Automatic retry of failed chunks

Technical Implementation:

  • Create job management database schema
  • Implement job state persistence and recovery
  • Add chunk-level failure tracking and retry logic
  • Create job history and analytics dashboard
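
A possible shape for that persistence layer, sketched with the standard-library sqlite3 module; table and column names are assumptions:

# Illustrative job/chunk persistence schema for resume and retry
import sqlite3

conn = sqlite3.connect("jobs.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS jobs (
    id          INTEGER PRIMARY KEY,
    source_file TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',  -- pending/running/completed/failed
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS chunks (
    id       INTEGER PRIMARY KEY,
    job_id   INTEGER NOT NULL REFERENCES jobs(id),
    sequence INTEGER NOT NULL,                    -- ordering for reassembly
    status   TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0           -- supports chunk-level retry
);
""")
conn.commit()

# On restart, resume by selecting unfinished chunks:
pending = conn.execute(
    "SELECT id, job_id, sequence FROM chunks WHERE status != 'completed'").fetchall()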

✨ Increment 6: Polish & Optimization

Objective: Production-ready features and performance optimization

Key Features:

  • Smart Chunking: Silence-aware chunk boundaries to preserve speech flow
  • Compression Optimization: Automatic audio format optimization for API efficiency
  • Cost Analytics: API usage tracking and cost estimation
  • Advanced Retry Logic: Intelligent failure categorization and retry strategies

Technical Implementation:

  • Add silence detection for optimal chunk boundaries
  • Implement audio compression with quality preservation
  • Create API usage analytics and cost tracking
  • Add advanced error categorization and retry logic
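
For silence-aware boundaries, ffmpeg's silencedetect filter can supply candidate cut points. A minimal sketch; the noise threshold and minimum duration are assumptions to tune per corpus:

# Find timestamps where silence begins, to use as chunk boundaries (sketch)
import re
import subprocess

def silence_starts(path: str, noise_db: int = -30, min_s: float = 0.5) -> list[float]:
    """Return timestamps (seconds) where silence begins, via ffmpeg silencedetect."""
    proc = subprocess.run(
        ["ffmpeg", "-i", path,
         "-af", f"silencedetect=noise={noise_db}dB:d={min_s}",
         "-f", "null", "-"],
        capture_output=True, text=True)
    # silencedetect logs lines like "silence_start: 123.45" to stderr
    return [float(m) for m in re.findall(r"silence_start: ([\d.]+)", proc.stderr)]

# Snap each nominal chunk boundary to the nearest silence point instead of a hard cut
points = silence_starts("lecture.wav")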

Configuration Examples

WhisperAPI.com Integration

# .env file configuration
WHISPER_API_BASE_URL=https://whisperapi.com
WHISPER_API_KEY=your-api-key-here
TRANSCRIPTION_ENGINE=api
CHUNK_SIZE_MB=80
CHUNK_OVERLAP_SECONDS=10

# Command line usage
python main.py /path/to/large/files \
    --domain vedanta \
    --engine api \
    --workers 4

Large File Processing

# Process files with automatic chunking
python main.py /path/to/1gb/files \
    --engine api \
    --domain medical \
    --chunk-size 80 \
    --chunk-overlap 10

Benefits for Different Use Cases

πŸ•‰οΈ Vedanta & Spiritual Content

  • Preserve Sanskrit Terms: Domain-specific prompts maintained across chunks
  • Long Discourse Processing: Handle 2+ hour lectures seamlessly
  • Contextual Accuracy: Overlap ensures Sanskrit pronunciation consistency

πŸ₯ Medical & Professional

  • HIPAA Compliance: Secure API processing with audit trails
  • Large Case Files: Process extensive patient consultations
  • Terminology Preservation: Medical domain optimization across chunks

📚 Educational & Research

  • Lecture Archives: Batch process semester-long course recordings
  • Research Interviews: Handle extensive qualitative data collection
  • Multi-language Support: Consistent language detection across chunks

Technical Architecture

Large File (1.5GB) → AudioChunker → [80MB chunks with 10s overlap]
                                           ↓
Each Chunk → WhisperAPI → Text Response → TextReconstructor
                                           ↓
All Chunks → Overlap Resolution → Final Transcript → Domain Post-processing

This roadmap transforms the existing domain-optimized transcription system into a cloud-ready, large-scale processing platform while preserving the specialized Vedanta and domain-specific features that make it unique.

🗄️ Database Schema

The web application uses SQLite with these tables:

  • clients: Client management (id, name, email, created_at)
  • transcriptions: Video transcriptions (id, client_id, filename, original_text, translated_text, target_language, created_at)
  • persona_prompts: AI personas (id, transcription_id, persona_name, system_prompt, created_at)
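
Expressed as SQL, the documented columns might look like the sketch below; the types and constraints are assumptions, since only the column names are documented:

# The documented web-app tables expressed as SQLite DDL (illustrative)
import sqlite3

conn = sqlite3.connect("transcriptions.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS clients (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    email      TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS transcriptions (
    id              INTEGER PRIMARY KEY,
    client_id       INTEGER REFERENCES clients(id),
    filename        TEXT NOT NULL,
    original_text   TEXT,
    translated_text TEXT,
    target_language TEXT,
    created_at      TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS persona_prompts (
    id               INTEGER PRIMARY KEY,
    transcription_id INTEGER REFERENCES transcriptions(id),
    persona_name     TEXT,
    system_prompt    TEXT,
    created_at       TEXT DEFAULT CURRENT_TIMESTAMP
);
""")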

🎯 Use Cases

πŸ•‰οΈ Spiritual & Educational Organizations

  • Vedanta Centers: Process multi-language spiritual discourses with Sanskrit term preservation
  • Universities: Transcribe philosophy and religious studies lectures
  • Ashrams: Create searchable archives of teacher discourses
  • Spiritual Publishers: Generate accurate transcripts for book production

πŸ₯ Healthcare & Medical

  • Medical Practices: Transcribe patient consultations and dictations
  • Hospitals: Process medical rounds and case discussions
  • Research: Analyze clinical interview data
  • Telemedicine: Create accurate consultation records

💼 Business & Corporate

  • Meetings: Generate searchable meeting minutes and action items
  • Interviews: Process HR interviews and performance reviews
  • Training: Transcribe corporate training sessions
  • Customer Service: Analyze support call recordings

🎓 Education & Research

  • Universities: Transcribe lectures for accessibility and archives
  • Researchers: Process qualitative interview data
  • Online Learning: Create subtitles for educational videos
  • Language Studies: Analyze speech patterns across languages

⚖️ Legal & Professional Services

  • Law Firms: Transcribe depositions and client meetings
  • Courts: Process hearing recordings (where permitted)
  • Consultants: Document client strategy sessions
  • Compliance: Create audit trails of important discussions

🎬 Media & Content Creation

  • Podcasters: Generate show transcripts and searchable content
  • YouTubers: Create accurate subtitles and show notes
  • Journalists: Transcribe interviews for articles
  • Documentary Makers: Process interview footage

🔧 Development

Running Tests

# Test batch transcription setup with domain functionality
python test_batch_transcription.py

# Test with short audio file (faster)
python main.py ./videos/testing --domain general --workers 2

# Test domain listing and suggestions
python main.py --list-domains
python main.py --suggest-domain ./videos/testing

# Test comprehensive functionality
python test_environment.py
python test_chunking.py
python test_realistic_chunking.py

Quick Setup Commands

# Install dependencies
pip install -r requirements.txt

# Verify Whisper installation
python -c "import whisper; print('Whisper available')"

# Check available domains
python main.py --list-domains

Adding New Transcription Engines

  1. Inherit from TranscriptionEngine in transcription_engine.py
  2. Implement the transcribe() method
  3. Add to the factory function create_engine()
  4. Update CLI arguments if needed
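
A minimal sketch of such an engine; the exact transcribe() signature and the create_engine() registration are assumptions based on the steps above:

# Toy engine that returns a stub transcript, useful only for wiring tests
from transcription_engine import TranscriptionEngine

class EchoEngine(TranscriptionEngine):
    def transcribe(self, audio_path, **options):
        return {"text": f"[stub transcript for {audio_path}]"}

# Then register it in create_engine() in transcription_engine.py, e.g.:
# if name == "echo":
#     return EchoEngine()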

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • OpenAI Whisper for state-of-the-art transcription
  • Ollama for local AI processing
  • Streamlit for the web interface
  • FFmpeg for audio/video processing
  • Spiritual teachers whose lectures comprise the test dataset

🆘 Support

Common Issues

Whisper Installation Problems:

pip install --upgrade pip setuptools wheel
pip install openai-whisper

FFmpeg Not Found:

  • macOS: brew install ffmpeg
  • Ubuntu: sudo apt-get install ffmpeg
  • Windows: Download from https://ffmpeg.org/

Ollama Connection Issues:

# Check if Ollama is running
curl http://localhost:11434/api/version

# Start Ollama
ollama serve

Large File Processing:

  • Ensure sufficient disk space for temporary files
  • Increase timeout settings for very long recordings
  • Use chunked processing for files over 1GB

Getting Help

  1. Check the troubleshooting guide
  2. Review existing issues
  3. Create a detailed issue report with:
    • System information
    • Error messages
    • Steps to reproduce

Ready to transform your spiritual and educational content into interactive AI-powered experiences! 🚀
