Video-to-RAG: Universal Audio Transcription & AI Interaction

A comprehensive system for transcribing audio content into text and creating interactive AI personas that can engage in conversations about the content. Originally designed for spiritual and philosophical lectures, this system now supports multiple content domains including business, education, medical, legal, and general-purpose transcription.

🎯 Project Overview

This project provides both a web interface for individual file processing and a robust batch processing system for large collections of audio/video files. With domain-specific optimizations, it excels at processing specialized content including Vedanta lectures, business meetings, educational content, medical dictation, and more.

Key Components

  1. Web Application (VIdeo-Transcription/): Streamlit-based interface for uploading, transcribing, and chatting with AI personas
  2. Batch Transcription (batch_transcription/): Python workflow for processing entire directories of audio files
  3. Video Collection (videos/): Organized spiritual lecture content with existing transcriptions
  4. Output Examples (example_output_formats/): Sample outputs in various formats (TXT, JSON, SRT, VTT, TSV)

✨ Features

🎬 Video Transcription

  • Multi-format support: MP4, AVI, MOV, MKV, M4A
  • Large file handling: Up to 2GB with chunked processing
  • Optimized for spiritual content: Custom prompts for Sanskrit/Vedanta terminology
  • Multiple output formats: Plain text, timestamped, SRT subtitles, JSON

🌐 Translation & Localization

  • 130+ languages supported via Google Translate
  • Preserves formatting and timestamps during translation
  • Context-aware translation for mixed-language content

🤖 AI Persona Generation

  • Domain-aware analysis: Specialized persona creation for different content types
  • Analyzes speech patterns and personality traits from transcripts
  • Creates contextual personas that mimic speaker characteristics
  • Interactive chat interface with generated personas
  • Powered by Ollama for local AI processing

⚑ Batch Processing

  • Directory-based workflow: Point at a folder, get transcriptions back
  • Domain-specific optimization: 8 preset domains (vedanta, medical, business, etc.)
  • Real-time progress tracking: Live updates every 2 seconds with processing status
  • Parallel processing: Configurable worker threads with per-worker status
  • Resume capability: Skip already processed files
  • Pluggable engines: Local Whisper or external API
  • Comprehensive logging: Detailed progress tracking and completion statistics
  • Smart domain detection: Auto-suggest optimal domain based on content

🚀 Quick Start

Prerequisites

  1. Python 3.10+
  2. FFmpeg for audio/video processing
  3. Ollama for AI persona features
  4. OpenAI Whisper for transcription

# Install FFmpeg
# macOS: brew install ffmpeg
# Ubuntu: sudo apt-get install ffmpeg
# Windows: Download from https://ffmpeg.org/

# Install and start Ollama
# Visit: https://ollama.ai/
ollama serve
ollama pull mistral:instruct

Installation

# Clone the repository
git clone <repository-url>
cd video-to-rag

# Set up web application
cd VIdeo-Transcription
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

# Set up batch transcription
cd ../batch_transcription
pip install -r requirements.txt

💻 Usage

Web Application

cd VIdeo-Transcription
streamlit run main.py

Navigate to http://localhost:8501 for the web interface.

Features:

  • Upload video/audio files
  • Real-time transcription with progress tracking
  • Optional translation to 130+ languages
  • AI persona generation and chat
  • Client management and transcription history

Batch Transcription

General Content

# Basic usage - auto-detects content type
python main.py /path/to/audio/files

# Explicit general domain
python main.py /path/to/audio/files --domain general

🕉️ Vedanta & Spiritual Content

# Optimized for Sanskrit terminology and spiritual discourse
python main.py /path/to/vedanta/lectures --domain vedanta

# High-accuracy for important teachings
python main.py /path/to/spiritual/content \
    --domain vedanta --model large

# Vedanta with custom settings
python main.py /path/to/lectures \
    --domain vedanta --workers 4 --temperature 0.0

Other Domains

# Medical transcription (high accuracy)
python main.py /path/to/medical/files --domain medical

# Business meetings
python main.py /path/to/meetings --domain business

# Educational lectures  
python main.py /path/to/classes --domain education

# Legal proceedings
python main.py /path/to/legal/files --domain legal

Advanced Usage

# List available domains with descriptions
python main.py --list-domains

# Get domain suggestion based on folder name
python main.py --suggest-domain /path/to/files

# Use external API
python main.py /path/to/files \
    --engine api --api-url https://api.example.com --api-key your-key

# Custom processing settings with real-time monitoring
python main.py /path/to/files \
    --workers 8 --model large --temperature 0.1 --no-resume

# Test with sample file (quick validation)
python main.py ./videos/testing --domain general --workers 2

Testing & Development

# Test environment variable loading
python test_environment.py

# Test audio chunking functionality
python test_chunking.py

# Test realistic chunking with large files
python test_realistic_chunking.py

# Quick test with provided sample file (20MB, ~2 minutes)
python main.py ./videos/testing --domain general

Test Data: The repository includes a test audio file (videos/testing/Three_Ai_agents_realize...wav) that's perfect for:

  • Testing chunking algorithms (20MB file)
  • Validating transcription engines
  • Quick functionality verification
  • Development workflow testing

Real-Time Progress Features

When processing large files (which can take 1+ hours), you'll see:

Progress: 2/10 (20.0%) | ✓ 2 ✗ 0 | Time: 45.3 minutes | Processing: W1: lecture_file.wav...

  • Live progress: Updates every 2 seconds
  • Success/failure counts: Real-time completion statistics
  • Per-worker status: See which files each worker is processing
  • Time tracking: Elapsed time and estimated completion
  • Processing rate: Files per minute for time estimation

Output structure:

your_audio_directory/
├── audio_file1.wav
├── audio_file2.wav
├── transcriptions/
│   ├── audio_file1.txt
│   └── audio_file2.txt
└── logs/
    ├── process.log
    ├── completed.txt
    └── failed.txt

Programmatic Usage

from pathlib import Path
from config import BatchTranscriptionConfig
from main import transcribe_directory

# Simple usage (general domain)
result = transcribe_directory(Path("/path/to/audio/files"))

# Domain-specific processing
config = BatchTranscriptionConfig.from_domain("vedanta")
result = transcribe_directory(Path("/path/to/vedanta/lectures"), config)

# Vedanta with custom overrides
config = BatchTranscriptionConfig.from_domain("vedanta", 
    whisper_model="large",
    processing_max_workers=8)
result = transcribe_directory(Path("/path/to/lectures"), config)

# External API configuration
config = BatchTranscriptionConfig.from_domain("business")
config.engine = "api"
config.api_base_url = "https://api.example.com"
config.api_key = "your-api-key"
result = transcribe_directory(Path("/path/to/meetings"), config)

# Using environment variables
config = BatchTranscriptionConfig.from_env("vedanta")
result = transcribe_directory(Path("/path/to/lectures"), config)

print(f"Successfully transcribed {result['completed']} files")
print(f"Failed: {result['failed']} files")

⚙️ Configuration

Environment Variables

# Domain and processing settings
export TRANSCRIPTION_DOMAIN=vedanta        # or 'general', 'medical', 'business', etc.
export TRANSCRIPTION_ENGINE=local          # or 'api'
export TRANSCRIPTION_MAX_WORKERS=4

# Whisper settings
export WHISPER_MODEL=turbo                  # or 'large' for higher accuracy
export WHISPER_API_BASE_URL=https://api.example.com
export WHISPER_API_KEY=your-api-key
export WHISPER_INITIAL_PROMPT="Custom prompt for specialized content"

# Ollama settings (for web app)
export OLLAMA_API_BASE=http://localhost:11434
export DEFAULT_MODEL=mistral:instruct
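
For scripted runs, the same settings can also come from a .env file. A minimal sketch, assuming the python-dotenv package and the from_env helper shown under Programmatic Usage:

# Hedged sketch: load a .env file, then build a domain config from the environment
from pathlib import Path
from dotenv import load_dotenv
from config import BatchTranscriptionConfig
from main import transcribe_directory

load_dotenv()  # copies TRANSCRIPTION_* / WHISPER_* entries from .env into os.environ
config = BatchTranscriptionConfig.from_env("vedanta")
result = transcribe_directory(Path("/path/to/lectures"), config)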

Domain-Optimized Settings

The system provides optimized configurations for different content types:

Vedanta Domain (--domain vedanta)

  • Model: turbo (fast) or large (accuracy)
  • Language: Forced English with Sanskrit term recognition
  • Initial Prompt: Enhanced for Sanskrit terminology and spiritual discourse
  • Temperature: 0.0 (deterministic for consistent Sanskrit terms)
  • Beam Size: 5 (quality vs speed balance)
  • Specialized Vocabulary: 40+ Sanskrit/Vedanta terms

Medical Domain (--domain medical)

  • Model: large (higher accuracy for medical terms)
  • Temperature: 0.0 (deterministic for medical precision)
  • Beam Size: 7 (higher accuracy)
  • Specialized Vocabulary: Medical terminology

Business Domain (--domain business)

  • Model: turbo (efficient for meetings)
  • Initial Prompt: Optimized for business terminology and decisions
  • Specialized Vocabulary: Business and corporate terms

General Domain (--domain general)

  • Model: turbo (balanced performance)
  • Initial Prompt: Generic high-quality transcription
  • Language: Auto-detect
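
Under the hood, presets like these correspond to openai-whisper's standard transcribe() options. The sketch below shows how the documented vedanta settings might map onto a raw Whisper call; the preset dictionary is illustrative, not the project's actual domains.py.

# Illustrative mapping from a domain preset to openai-whisper options
import whisper

vedanta_preset = {                      # hypothetical preset mirroring the settings above
    "model": "turbo",
    "language": "en",                   # forced English with Sanskrit term recognition
    "initial_prompt": "A Vedanta lecture discussing Brahman, Atman, dharma, and moksha.",
    "temperature": 0.0,                 # deterministic for consistent Sanskrit terms
    "beam_size": 5,                     # quality vs. speed balance
}

model = whisper.load_model(vedanta_preset["model"])
result = model.transcribe(
    "lecture.wav",
    language=vedanta_preset["language"],
    initial_prompt=vedanta_preset["initial_prompt"],
    temperature=vedanta_preset["temperature"],
    beam_size=vedanta_preset["beam_size"],  # forwarded to the beam-search decoder
)
print(result["text"])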

📊 Project Structure

video-to-rag/
├── main.py                       # CLI entry point
├── batch_transcriber.py          # Main orchestrator with real-time progress
├── transcription_engine.py       # Pluggable engines (local/API)
├── config.py                     # Configuration with environment variable support
├── domains.py                    # Domain-specific presets and optimizations
├── audio_chunker.py              # Large file chunking for API compatibility
├── utils.py                      # Helper functions
├── vedanta_utils.py              # Vedanta-specific content enhancement
├── requirements.txt              # Package dependencies
├── videos/                       # Organized spiritual lecture content
│   ├── testing/                  # Test files with transcriptions
│   ├── COMPLETE - batch_1/       # Processed Bhagavad Gita Chapter 1
│   ├── COMPLETE - batch_2a/      # Processed spiritual lectures
│   ├── batch_2/                  # Additional Gita and Dharma lectures
│   ├── batch_3_onwards/          # Bhagavad Gita Chapter 2 content
│   └── *.wav                     # Individual spiritual lecture files
├── .env.example                  # Environment configuration template
├── .gitignore                    # Git exclusions for sensitive data
├── test_*.py                     # Test scripts for functionality validation
├── CLAUDE.md                     # Developer guidance and project context
└── README.md                     # Comprehensive project documentation

🔄 Engine Architecture

The system supports pluggable transcription engines:

Local Whisper Engine

  • Uses locally installed OpenAI Whisper
  • No external dependencies or API costs
  • Full control over processing
  • Optimized for batch processing

External API Engine

  • Ready for hosted Whisper services
  • Same parameter support as local engine
  • API key authentication
  • Suitable for cloud deployment
  • Large File Chunking: Automatically splits files >100MB for API compatibility
  • Asynchronous Processing: Non-blocking workflow with webhook callbacks
  • Progress Persistence: Resume interrupted jobs across sessions

Switching engines:

# Local processing
python main.py /path --engine local

# External API processing  
python main.py /path --engine api --api-url https://api.example.com

# WhisperAPI.com integration with environment variables
export WHISPER_API_BASE_URL=https://whisperapi.com
export WHISPER_API_KEY=your-api-key
python main.py /path --engine api

🚀 WhisperAPI Integration Roadmap

The system is being enhanced with comprehensive external API support for processing large-scale transcription jobs. This roadmap outlines the implementation plan for WhisperAPI.com integration and similar services.

Current Status

  • ✅ Basic API Engine: Foundation for external Whisper API calls
  • ✅ Domain-Optimized Configuration: 8 specialized presets (vedanta, medical, business, etc.)
  • ✅ Real-time Progress Tracking: Multi-worker status monitoring
  • ⚠️ Large File Limitation: Current implementation requires chunking for files >100MB

Implementation Increments

🔧 Increment 1: Environment & Chunking Foundation

Objective: Establish robust foundation for large file processing and API integration

Key Features:

  • Environment Variable Management: Secure API credential handling with python-dotenv
  • Audio Chunking System: FFmpeg-based splitting with configurable overlap for seamless reconstruction
  • File Size Detection: Automatic chunking trigger for files exceeding API limits
  • Chunk Metadata: Preserve timing and sequence information for accurate reassembly

Technical Implementation:

  • Add .env support with python-dotenv for secure credential management
  • Create AudioChunker class using FFmpeg for precise audio splitting
  • Implement overlap handling to prevent word cutoffs at chunk boundaries
  • Add chunk size configuration (default: 80MB with 10-second overlap)
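
A minimal sketch of that chunking approach, assuming ffmpeg and ffprobe are on PATH; the function names and defaults are illustrative rather than the project's actual audio_chunker.py:

# Time-based chunking with overlap via ffmpeg (illustrative sketch)
import json
import subprocess
from pathlib import Path

def probe_duration(path: Path) -> float:
    """Return audio duration in seconds using ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", str(path)],
        capture_output=True, text=True, check=True)
    return float(json.loads(out.stdout)["format"]["duration"])

def chunk_audio(path: Path, target_mb: int = 80, overlap_s: float = 10.0) -> list[Path]:
    """Split audio into ~target_mb pieces that overlap by overlap_s seconds."""
    duration = probe_duration(path)
    size_mb = path.stat().st_size / 1024 / 1024
    chunk_s = duration * (target_mb / size_mb)   # seconds that fit in ~target_mb
    step = max(chunk_s - overlap_s, 1.0)         # guard against a non-advancing loop
    chunks, start, index = [], 0.0, 0
    while start < duration:
        out = path.with_name(f"{path.stem}_chunk{index:03d}{path.suffix}")
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(path), "-ss", str(start), "-t", str(chunk_s),
             "-c", "copy", str(out)],            # stream copy is fast; re-encode for
            check=True, capture_output=True)     # sample-accurate boundaries
        chunks.append(out)
        start += step
        index += 1
    return chunks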

🌐 Increment 2: WhisperAPI Client Integration

Objective: Complete WhisperAPI.com integration with parameter compatibility

Key Features:

  • API Parameter Mapping: Handle WhisperAPI limitations (no temperature, beam_size)
  • Chunked Upload Workflow: Sequential processing of large file chunks
  • Response Reconstruction: Seamless text reassembly from multiple API calls
  • Error Handling: Retry logic and graceful degradation

Technical Implementation:

  • Update ExternalWhisperEngine with WhisperAPI.com endpoint specifics
  • Implement chunk upload queue with progress tracking per chunk
  • Create text reconstruction pipeline with overlap detection
  • Add API-specific error handling and retry mechanisms
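
One simple form of overlap detection for the reconstruction step, sketched under the assumption that each chunk transcript repeats a few words of its neighbor: compare the tail of one transcript with the head of the next and drop the longest matching run.

# Naive overlap resolution between consecutive chunk transcripts (sketch)
def merge_transcripts(chunks: list[str], max_overlap_words: int = 30) -> str:
    """Join chunk transcripts, dropping words duplicated at chunk boundaries."""
    merged = chunks[0].split()
    for text in chunks[1:]:
        words = text.split()
        drop = 0
        # longest suffix of `merged` that equals a prefix of `words`
        for n in range(min(max_overlap_words, len(merged), len(words)), 0, -1):
            if merged[-n:] == words[:n]:
                drop = n
                break
        merged.extend(words[drop:])
    return " ".join(merged)

print(merge_transcripts(["the self is not the body", "not the body but pure awareness"]))
# -> "the self is not the body but pure awareness"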

📊 Increment 3: Multi-file Progress Visualization

Objective: Enhanced progress tracking for chunked and multi-file processing

Key Features:

  • Hierarchical Progress: File-level and chunk-level progress display
  • Real-time Updates: Live status for chunked file processing
  • Processing Analytics: Chunk processing rates and time estimates
  • Visual Indicators: Clear distinction between chunked and regular files

Technical Implementation:

  • Extend progress tracking to handle chunk-level granularity
  • Create nested progress bars for chunked files
  • Add processing analytics and rate calculations
  • Implement visual indicators for different processing states
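
A hypothetical shape for that chunk-aware tracking; the class names and rendering below are assumptions, not the project's actual progress code:

# Hierarchical file/chunk progress model (illustrative sketch)
from dataclasses import dataclass, field

@dataclass
class FileProgress:
    name: str
    total_chunks: int = 1            # 1 for regular (unchunked) files
    done_chunks: int = 0

@dataclass
class BatchProgress:
    files: list = field(default_factory=list)

    def render(self) -> str:
        done = sum(f.done_chunks == f.total_chunks for f in self.files)
        lines = [f"Files: {done}/{len(self.files)} complete"]
        for f in self.files:
            kind = "chunked" if f.total_chunks > 1 else "regular"
            lines.append(f"  {f.name} [{kind}]: {f.done_chunks}/{f.total_chunks} chunks")
        return "\n".join(lines)

batch = BatchProgress([FileProgress("lecture.wav", total_chunks=12, done_chunks=5),
                       FileProgress("qna.wav")])
print(batch.render())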

⚑ Increment 4: Asynchronous Processing Infrastructure

Objective: Non-blocking workflow with webhook support for long-running jobs

Key Features:

  • Async Job Submission: Non-blocking API calls with job tracking
  • Webhook Integration: Callback system for job completion notifications
  • Status Polling: Automatic job status checking with exponential backoff
  • Concurrent Chunk Processing: Parallel chunk uploads when supported

Technical Implementation:

  • Implement async/await patterns for API interactions
  • Create webhook server for job completion callbacks
  • Add job status polling with intelligent retry logic
  • Enable concurrent chunk processing with rate limiting
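
A sketch of the polling piece using aiohttp; the /jobs/{id} endpoint and the response fields are assumptions, since hosted Whisper services differ:

# Job status polling with exponential backoff (endpoint layout is hypothetical)
import asyncio
import aiohttp

async def poll_job(session: aiohttp.ClientSession, job_id: str,
                   base_url: str, max_wait: float = 60.0) -> dict:
    """Poll a job status endpoint, doubling the wait after each pending response."""
    delay = 2.0
    while True:
        async with session.get(f"{base_url}/jobs/{job_id}") as resp:
            job = await resp.json()
        if job["status"] in ("completed", "failed"):
            return job
        await asyncio.sleep(delay)
        delay = min(delay * 2, max_wait)   # exponential backoff, capped

async def main():
    async with aiohttp.ClientSession() as session:
        job = await poll_job(session, "job-123", "https://api.example.com")
        print(job["status"])

asyncio.run(main())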

💾 Increment 5: Job Management & Persistence

Objective: Robust job tracking and recovery across sessions

Key Features:

  • Job Persistence: SQLite-based job state tracking across restarts
  • Resume Capability: Continue interrupted chunked file processing
  • Job History: Complete audit trail of transcription jobs
  • Failure Recovery: Automatic retry of failed chunks

Technical Implementation:

  • Create job management database schema
  • Implement job state persistence and recovery
  • Add chunk-level failure tracking and retry logic
  • Create job history and analytics dashboard
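
A possible shape for that persistence layer, sketched with the standard-library sqlite3 module; table and column names are assumptions:

# Illustrative job/chunk persistence schema for resume and retry
import sqlite3

conn = sqlite3.connect("jobs.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS jobs (
    id          INTEGER PRIMARY KEY,
    source_file TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',  -- pending/running/completed/failed
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS chunks (
    id       INTEGER PRIMARY KEY,
    job_id   INTEGER NOT NULL REFERENCES jobs(id),
    sequence INTEGER NOT NULL,                    -- ordering for reassembly
    status   TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0           -- supports chunk-level retry
);
""")
conn.commit()

# On restart, resume by selecting unfinished chunks:
pending = conn.execute(
    "SELECT id, job_id, sequence FROM chunks WHERE status != 'completed'").fetchall()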

✨ Increment 6: Polish & Optimization

Objective: Production-ready features and performance optimization

Key Features:

  • Smart Chunking: Silence-aware chunk boundaries to preserve speech flow
  • Compression Optimization: Automatic audio format optimization for API efficiency
  • Cost Analytics: API usage tracking and cost estimation
  • Advanced Retry Logic: Intelligent failure categorization and retry strategies

Technical Implementation:

  • Add silence detection for optimal chunk boundaries
  • Implement audio compression with quality preservation
  • Create API usage analytics and cost tracking
  • Add advanced error categorization and retry logic
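
For silence-aware boundaries, ffmpeg's silencedetect filter can supply candidate cut points. A minimal sketch; the noise threshold and minimum duration are assumptions to tune per corpus:

# Find timestamps where silence begins, to use as chunk boundaries (sketch)
import re
import subprocess

def silence_starts(path: str, noise_db: int = -30, min_s: float = 0.5) -> list[float]:
    """Return timestamps (seconds) where silence begins, via ffmpeg silencedetect."""
    proc = subprocess.run(
        ["ffmpeg", "-i", path,
         "-af", f"silencedetect=noise={noise_db}dB:d={min_s}",
         "-f", "null", "-"],
        capture_output=True, text=True)
    # silencedetect logs lines like "silence_start: 123.45" to stderr
    return [float(m) for m in re.findall(r"silence_start: ([\d.]+)", proc.stderr)]

# Snap each nominal chunk boundary to the nearest silence point instead of a hard cut
points = silence_starts("lecture.wav")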

Configuration Examples

WhisperAPI.com Integration

# .env file configuration
WHISPER_API_BASE_URL=https://whisperapi.com
WHISPER_API_KEY=your-api-key-here
TRANSCRIPTION_ENGINE=api
CHUNK_SIZE_MB=80
CHUNK_OVERLAP_SECONDS=10

# Command line usage
python main.py /path/to/large/files \
    --domain vedanta \
    --engine api \
    --workers 4

Large File Processing

# Process files with automatic chunking
python main.py /path/to/1gb/files \
    --engine api \
    --domain medical \
    --chunk-size 80 \
    --chunk-overlap 10

Benefits for Different Use Cases

πŸ•‰οΈ Vedanta & Spiritual Content

  • Preserve Sanskrit Terms: Domain-specific prompts maintained across chunks
  • Long Discourse Processing: Handle 2+ hour lectures seamlessly
  • Contextual Accuracy: Overlap ensures Sanskrit pronunciation consistency

πŸ₯ Medical & Professional

  • HIPAA Compliance: Secure API processing with audit trails
  • Large Case Files: Process extensive patient consultations
  • Terminology Preservation: Medical domain optimization across chunks

📚 Educational & Research

  • Lecture Archives: Batch process semester-long course recordings
  • Research Interviews: Handle extensive qualitative data collection
  • Multi-language Support: Consistent language detection across chunks

Technical Architecture

Large File (1.5GB) → AudioChunker → [80MB chunks with 10s overlap]
                                           ↓
Each Chunk → WhisperAPI → Text Response → TextReconstructor
                                           ↓
All Chunks → Overlap Resolution → Final Transcript → Domain Post-processing

This roadmap transforms the existing domain-optimized transcription system into a cloud-ready, large-scale processing platform while preserving the specialized Vedanta and domain-specific features that make it unique.

🗄️ Database Schema

The web application uses SQLite with these tables:

  • clients: Client management (id, name, email, created_at)
  • transcriptions: Video transcriptions (id, client_id, filename, original_text, translated_text, target_language, created_at)
  • persona_prompts: AI personas (id, transcription_id, persona_name, system_prompt, created_at)
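
Expressed as SQL, the documented columns might look like the sketch below; the types and constraints are assumptions, since only the column names are documented:

# The documented web-app tables expressed as SQLite DDL (illustrative)
import sqlite3

conn = sqlite3.connect("transcriptions.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS clients (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    email      TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS transcriptions (
    id              INTEGER PRIMARY KEY,
    client_id       INTEGER REFERENCES clients(id),
    filename        TEXT NOT NULL,
    original_text   TEXT,
    translated_text TEXT,
    target_language TEXT,
    created_at      TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS persona_prompts (
    id               INTEGER PRIMARY KEY,
    transcription_id INTEGER REFERENCES transcriptions(id),
    persona_name     TEXT,
    system_prompt    TEXT,
    created_at       TEXT DEFAULT CURRENT_TIMESTAMP
);
""")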

🎯 Use Cases

πŸ•‰οΈ Spiritual & Educational Organizations

  • Vedanta Centers: Process multi-language spiritual discourses with Sanskrit term preservation
  • Universities: Transcribe philosophy and religious studies lectures
  • Ashrams: Create searchable archives of teacher discourses
  • Spiritual Publishers: Generate accurate transcripts for book production

πŸ₯ Healthcare & Medical

  • Medical Practices: Transcribe patient consultations and dictations
  • Hospitals: Process medical rounds and case discussions
  • Research: Analyze clinical interview data
  • Telemedicine: Create accurate consultation records

💼 Business & Corporate

  • Meetings: Generate searchable meeting minutes and action items
  • Interviews: Process HR interviews and performance reviews
  • Training: Transcribe corporate training sessions
  • Customer Service: Analyze support call recordings

🎓 Education & Research

  • Universities: Transcribe lectures for accessibility and archives
  • Researchers: Process qualitative interview data
  • Online Learning: Create subtitles for educational videos
  • Language Studies: Analyze speech patterns across languages

⚖️ Legal & Professional Services

  • Law Firms: Transcribe depositions and client meetings
  • Courts: Process hearing recordings (where permitted)
  • Consultants: Document client strategy sessions
  • Compliance: Create audit trails of important discussions

🎬 Media & Content Creation

  • Podcasters: Generate show transcripts and searchable content
  • YouTubers: Create accurate subtitles and show notes
  • Journalists: Transcribe interviews for articles
  • Documentary Makers: Process interview footage

🔧 Development

Running Tests

# Test batch transcription setup with domain functionality
python test_batch_transcription.py

# Test with short audio file (faster)
python main.py ./videos/testing --domain general --workers 2

# Test domain listing and suggestions
python main.py --list-domains
python main.py --suggest-domain ./videos/testing

# Test comprehensive functionality
python test_environment.py
python test_chunking.py
python test_realistic_chunking.py

Quick Setup Commands

# Install dependencies
pip install -r requirements.txt

# Verify Whisper installation
python -c "import whisper; print('Whisper available')"

# Check available domains
python main.py --list-domains

Adding New Transcription Engines

  1. Inherit from TranscriptionEngine in transcription_engine.py
  2. Implement the transcribe() method
  3. Add to the factory function create_engine()
  4. Update CLI arguments if needed
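
A minimal sketch of such an engine; the exact transcribe() signature and the create_engine() registration are assumptions based on the steps above:

# Toy engine that returns a stub transcript, useful only for wiring tests
from transcription_engine import TranscriptionEngine

class EchoEngine(TranscriptionEngine):
    def transcribe(self, audio_path, **options):
        return {"text": f"[stub transcript for {audio_path}]"}

# Then register it in create_engine() in transcription_engine.py, e.g.:
# if name == "echo":
#     return EchoEngine()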

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • OpenAI Whisper for state-of-the-art transcription
  • Ollama for local AI processing
  • Streamlit for the web interface
  • FFmpeg for audio/video processing
  • Spiritual teachers whose lectures comprise the test dataset

🆘 Support

Common Issues

Whisper Installation Problems:

pip install --upgrade pip setuptools wheel
pip install openai-whisper

FFmpeg Not Found:

  • macOS: brew install ffmpeg
  • Ubuntu: sudo apt-get install ffmpeg
  • Windows: Download from https://ffmpeg.org/

Ollama Connection Issues:

# Check if Ollama is running
curl http://localhost:11434/api/version

# Start Ollama
ollama serve

Large File Processing:

  • Ensure sufficient disk space for temporary files
  • Increase timeout settings for very long recordings
  • Use chunked processing for files over 1GB

Getting Help

  1. Check the troubleshooting guide
  2. Review existing issues
  3. Create a detailed issue report with:
    • System information
    • Error messages
    • Steps to reproduce

Ready to transform your spiritual and educational content into interactive AI-powered experiences! 🚀
