A comprehensive system for transcribing audio content into text and creating interactive AI personas that can engage in conversations about the content. Originally designed for spiritual and philosophical lectures, this system now supports multiple content domains including business, education, medical, legal, and general-purpose transcription.
This project provides both a web interface for individual file processing and a robust batch processing system for large collections of audio/video files. With domain-specific optimizations, it excels at processing specialized content including Vedanta lectures, business meetings, educational content, medical dictation, and more.
- Web Application (`VIdeo-Transcription/`): Streamlit-based interface for uploading, transcribing, and chatting with AI personas
- Batch Transcription (`batch_transcription/`): Python workflow for processing entire directories of audio files
- Video Collection (`videos/`): Organized spiritual lecture content with existing transcriptions
- Output Examples (`example_output_formats/`): Sample outputs in various formats (TXT, JSON, SRT, VTT, TSV)
- Multi-format support: MP4, AVI, MOV, MKV, M4A
- Large file handling: Up to 2GB with chunked processing
- Optimized for spiritual content: Custom prompts for Sanskrit/Vedanta terminology
- Multiple output formats: Plain text, timestamped, SRT subtitles, JSON
- 130+ languages supported via Google Translate
- Preserves formatting and timestamps during translation
- Context-aware translation for mixed-language content
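Timestamps can be preserved by translating each segment's text while leaving its timing untouched. A minimal sketch, assuming the third-party `deep-translator` package as the Google Translate client (the project may use a different library):

```python
# Sketch: timestamp-preserving translation of Whisper-style segments.
# Assumes `pip install deep-translator`; segments look like
# {"start": 0.0, "end": 4.2, "text": "..."}.
from deep_translator import GoogleTranslator

def translate_segments(segments: list[dict], target_language: str) -> list[dict]:
    """Translate each segment's text, keeping start/end timestamps intact."""
    translator = GoogleTranslator(source="auto", target=target_language)
    return [{**seg, "text": translator.translate(seg["text"])} for seg in segments]
```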
- Domain-aware analysis: Specialized persona creation for different content types
- Analyzes speech patterns and personality traits from transcripts
- Creates contextual personas that mimic speaker characteristics
- Interactive chat interface with generated personas
- Powered by Ollama for local AI processing
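Persona chat runs against Ollama's local HTTP API. A minimal sketch, assuming the standard `/api/chat` endpoint and a `system_prompt` already generated from a transcript:

```python
# Sketch: one chat turn with a generated persona via a local Ollama server.
# Assumes `ollama serve` is running and mistral:instruct has been pulled.
import requests

OLLAMA_API_BASE = "http://localhost:11434"

def chat_with_persona(system_prompt: str, user_message: str,
                      model: str = "mistral:instruct") -> str:
    """Send one conversation turn to the persona and return its reply."""
    response = requests.post(
        f"{OLLAMA_API_BASE}/api/chat",
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            "stream": False,  # return the full reply as a single JSON object
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["message"]["content"]
```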
- Directory-based workflow: Point at a folder, get transcriptions
- Domain-specific optimization: 8 preset domains (vedanta, medical, business, etc.)
- Real-time progress tracking: Live updates every 2 seconds with processing status
- Parallel processing: Configurable worker threads with per-worker status
- Resume capability: Skip already processed files
- Pluggable engines: Local Whisper or external API
- Comprehensive logging: Detailed progress tracking and completion statistics
- Smart domain detection: Auto-suggest optimal domain based on content
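For illustration, domain suggestion can be approximated by keyword matching on the folder name. The keyword sets below are hypothetical; the real presets live in `domains.py`:

```python
# Sketch: folder-name-based domain suggestion (illustrative keyword sets).
from pathlib import Path

DOMAIN_KEYWORDS = {
    "vedanta": {"vedanta", "gita", "upanishad", "swami", "dharma"},
    "medical": {"medical", "patient", "clinic", "dictation"},
    "business": {"meeting", "standup", "quarterly", "sales"},
    "legal": {"deposition", "hearing", "court", "legal"},
}

def suggest_domain(directory: Path) -> str:
    """Return the first domain whose keywords appear in the folder name."""
    name = directory.name.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(word in name for word in keywords):
            return domain
    return "general"
```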
- Python 3.10+
- FFmpeg for audio/video processing
- Ollama for AI persona features
- OpenAI Whisper for transcription
# Install FFmpeg
# macOS: brew install ffmpeg
# Ubuntu: sudo apt-get install ffmpeg
# Windows: Download from https://ffmpeg.org/
# Install and start Ollama
# Visit: https://ollama.ai/
ollama serve
ollama pull mistral:instruct
# Clone the repository
git clone <repository-url>
cd video-to-rag
# Set up web application
cd VIdeo-Transcription
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
# Set up batch transcription
cd ../batch_transcription
pip install -r requirements.txt
cd VIdeo-Transcription
streamlit run main.py
Navigate to http://localhost:8501 for the web interface.
Features:
- Upload video/audio files
- Real-time transcription with progress tracking
- Optional translation to 130+ languages
- AI persona generation and chat
- Client management and transcription history
# Basic usage - auto-detects content type
python main.py /path/to/audio/files
# Explicit general domain
python main.py /path/to/audio/files --domain general
# Optimized for Sanskrit terminology and spiritual discourse
python main.py /path/to/vedanta/lectures --domain vedanta
# High-accuracy for important teachings
python main.py /path/to/spiritual/content \
--domain vedanta --model large
# Vedanta with custom settings
python main.py /path/to/lectures \
--domain vedanta --workers 4 --temperature 0.0
# Medical transcription (high accuracy)
python main.py /path/to/medical/files --domain medical
# Business meetings
python main.py /path/to/meetings --domain business
# Educational lectures
python main.py /path/to/classes --domain education
# Legal proceedings
python main.py /path/to/legal/files --domain legal
# List available domains with descriptions
python main.py --list-domains
# Get domain suggestion based on folder name
python main.py --suggest-domain /path/to/files
# Use external API
python main.py /path/to/files \
--engine api --api-url https://api.example.com --api-key your-key
# Custom processing settings with real-time monitoring
python main.py /path/to/files \
--workers 8 --model large --temperature 0.1 --no-resume
# Test with sample file (quick validation)
python main.py ./videos/testing --domain general --workers 2
# Test environment variable loading
python test_environment.py
# Test audio chunking functionality
python test_chunking.py
# Test realistic chunking with large files
python test_realistic_chunking.py
# Quick test with provided sample file (20MB, ~2 minutes)
python main.py ./videos/testing --domain general
Test Data: The repository includes a test audio file (`videos/testing/Three_Ai_agents_realize...wav`) that's perfect for:
- Testing chunking algorithms (20MB file)
- Validating transcription engines
- Quick functionality verification
- Development workflow testing
When processing large files (which can take 1+ hours), you'll see:
Progress: 2/10 (20.0%) | ✓ 2 ✗ 0 | Time: 45.3 minutes | Processing: W1: lecture_file.wav...
- Live progress: Updates every 2 seconds
- Success/failure counts: Real-time completion statistics
- Per-worker status: See which files each worker is processing
- Time tracking: Elapsed time and estimated completion
- Processing rate: Files per minute for time estimation
Output structure:
your_audio_directory/
├── audio_file1.wav
├── audio_file2.wav
├── transcriptions/
│   ├── audio_file1.txt
│   └── audio_file2.txt
└── logs/
    ├── process.log
    ├── completed.txt
    └── failed.txt
from pathlib import Path
from config import BatchTranscriptionConfig
from main import transcribe_directory
# Simple usage (general domain)
result = transcribe_directory(Path("/path/to/audio/files"))
# Domain-specific processing
config = BatchTranscriptionConfig.from_domain("vedanta")
result = transcribe_directory(Path("/path/to/vedanta/lectures"), config)
# Vedanta with custom overrides
config = BatchTranscriptionConfig.from_domain(
    "vedanta",
    whisper_model="large",
    processing_max_workers=8,
)
result = transcribe_directory(Path("/path/to/lectures"), config)
# External API configuration
config = BatchTranscriptionConfig.from_domain("business")
config.engine = "api"
config.api_base_url = "https://api.example.com"
config.api_key = "your-api-key"
result = transcribe_directory(Path("/path/to/meetings"), config)
# Using environment variables
config = BatchTranscriptionConfig.from_env("vedanta")
result = transcribe_directory(Path("/path/to/lectures"), config)
print(f"Successfully transcribed {result['completed']} files")
print(f"Failed: {result['failed']} files")
# Domain and processing settings
export TRANSCRIPTION_DOMAIN=vedanta # or 'general', 'medical', 'business', etc.
export TRANSCRIPTION_ENGINE=local # or 'api'
export TRANSCRIPTION_MAX_WORKERS=4
# Whisper settings
export WHISPER_MODEL=turbo # or 'large' for higher accuracy
export WHISPER_API_BASE_URL=https://api.example.com
export WHISPER_API_KEY=your-api-key
export WHISPER_INITIAL_PROMPT="Custom prompt for specialized content"
# Ollama settings (for web app)
export OLLAMA_API_BASE=http://localhost:11434
export DEFAULT_MODEL=mistral:instruct
The system provides optimized configurations for different content types:
- Model: `turbo` (fast) or `large` (accuracy)
- Language: Forced English with Sanskrit term recognition
- Initial Prompt: Enhanced for Sanskrit terminology and spiritual discourse
- Temperature: 0.0 (deterministic for consistent Sanskrit terms)
- Beam Size: 5 (quality vs speed balance)
- Specialized Vocabulary: 40+ Sanskrit/Vedanta terms
- Model: `large` (higher accuracy for medical terms)
- Temperature: 0.0 (deterministic for medical precision)
- Beam Size: 7 (higher accuracy)
- Specialized Vocabulary: Medical terminology
- Model: `turbo` (efficient for meetings)
- Initial Prompt: Optimized for business terminology and decisions
- Specialized Vocabulary: Business and corporate terms
- Model: `turbo` (balanced performance)
- Initial Prompt: Generic high-quality transcription
- Language: Auto-detect
video-to-rag/
├── main.py                  # CLI entry point
├── batch_transcriber.py     # Main orchestrator with real-time progress
├── transcription_engine.py  # Pluggable engines (local/API)
├── config.py                # Configuration with environment variable support
├── domains.py               # Domain-specific presets and optimizations
├── audio_chunker.py         # Large file chunking for API compatibility
├── utils.py                 # Helper functions
├── vedanta_utils.py         # Vedanta-specific content enhancement
├── requirements.txt         # Package dependencies
├── videos/                  # Organized spiritual lecture content
│   ├── testing/             # Test files with transcriptions
│   ├── COMPLETE - batch_1/  # Processed Bhagavad Gita Chapter 1
│   ├── COMPLETE - batch_2a/ # Processed spiritual lectures
│   ├── batch_2/             # Additional Gita and Dharma lectures
│   ├── batch_3_onwards/     # Bhagavad Gita Chapter 2 content
│   └── *.wav                # Individual spiritual lecture files
├── .env.example             # Environment configuration template
├── .gitignore               # Git exclusions for sensitive data
├── test_*.py                # Test scripts for functionality validation
├── CLAUDE.md                # Developer guidance and project context
└── README.md                # Comprehensive project documentation
The system supports pluggable transcription engines:
- Uses locally installed OpenAI Whisper
- No external dependencies or API costs
- Full control over processing
- Optimized for batch processing
- Ready for hosted Whisper services
- Same parameter support as local engine
- API key authentication
- Suitable for cloud deployment
- Large File Chunking: Automatically splits files >100MB for API compatibility
- Asynchronous Processing: Non-blocking workflow with webhook callbacks
- Progress Persistence: Resume interrupted jobs across sessions
Switching engines:
# Local processing
python main.py /path --engine local
# External API processing
python main.py /path --engine api --api-url https://api.example.com
# WhisperAPI.com integration with environment variables
export WHISPER_API_BASE_URL=https://whisperapi.com
export WHISPER_API_KEY=your-api-key
python main.py /path --engine api
The system is being enhanced with comprehensive external API support for processing large-scale transcription jobs. This roadmap outlines the implementation plan for WhisperAPI.com integration and similar services.
- ✅ Basic API Engine: Foundation for external Whisper API calls
- ✅ Domain-Optimized Configuration: 8 specialized presets (vedanta, medical, business, etc.)
- ✅ Real-time Progress Tracking: Multi-worker status monitoring
- ⚠️ Large File Limitation: Current implementation requires chunking for files >100MB
Objective: Establish robust foundation for large file processing and API integration
Key Features:
- Environment Variable Management: Secure API credential handling with python-dotenv
- Audio Chunking System: FFmpeg-based splitting with configurable overlap for seamless reconstruction
- File Size Detection: Automatic chunking trigger for files exceeding API limits
- Chunk Metadata: Preserve timing and sequence information for accurate reassembly
Technical Implementation:
- Add `.env` support with `python-dotenv` for secure credential management
- Create an `AudioChunker` class using FFmpeg for precise audio splitting (sketched below)
- Implement overlap handling to prevent word cutoffs at chunk boundaries
- Add chunk size configuration (default: 80MB with 10-second overlap)
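A minimal sketch of that chunking approach, assuming `ffmpeg` and `ffprobe` on the PATH; the helper and its size heuristic are illustrative, not the project's actual `AudioChunker`:

```python
# Sketch: overlap-aware audio chunking with FFmpeg. Chunk length is derived
# from the file's average byte rate so each piece stays under the target size.
import subprocess
from pathlib import Path

def chunk_audio(src: Path, out_dir: Path, max_mb: int = 80,
                overlap_s: float = 10.0) -> list[Path]:
    """Split src into chunks under max_mb, each overlapping the previous by overlap_s."""
    total_s = float(subprocess.check_output(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(src)], text=True).strip())
    bytes_per_s = src.stat().st_size / total_s
    chunk_s = (max_mb * 1024 * 1024) / bytes_per_s
    step = max(chunk_s - overlap_s, 1.0)  # step back by the overlap so words aren't cut
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks, start, index = [], 0.0, 0
    while start < total_s:
        out = out_dir / f"{src.stem}_chunk{index:03d}{src.suffix}"
        subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-t", str(chunk_s),
                        "-i", str(src), "-c", "copy", str(out)], check=True)
        chunks.append(out)
        start += step
        index += 1
    return chunks
```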
Objective: Complete WhisperAPI.com integration with parameter compatibility
Key Features:
- API Parameter Mapping: Handle WhisperAPI limitations (no temperature, beam_size)
- Chunked Upload Workflow: Sequential processing of large file chunks
- Response Reconstruction: Seamless text reassembly from multiple API calls
- Error Handling: Retry logic and graceful degradation
Technical Implementation:
- Update `ExternalWhisperEngine` with WhisperAPI.com endpoint specifics
- Implement chunk upload queue with progress tracking per chunk
- Create text reconstruction pipeline with overlap detection
- Add API-specific error handling and retry mechanisms
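One way to resolve overlaps when reassembling text is to match the longest word-level suffix/prefix between consecutive chunk transcripts; a sketch (the real pipeline may use chunk timestamps instead):

```python
# Sketch: stitch chunk transcripts, dropping words duplicated by audio overlap.
def stitch(chunk_texts: list[str], max_overlap_words: int = 30) -> str:
    """Join chunk texts, removing the longest repeated word run at each seam."""
    if not chunk_texts:
        return ""
    result = chunk_texts[0].split()
    for text in chunk_texts[1:]:
        words = text.split()
        drop = 0
        # Longest suffix of `result` that matches a prefix of `words`.
        for n in range(min(max_overlap_words, len(result), len(words)), 0, -1):
            if result[-n:] == words[:n]:
                drop = n
                break
        result.extend(words[drop:])
    return " ".join(result)
```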
Objective: Enhanced progress tracking for chunked and multi-file processing
Key Features:
- Hierarchical Progress: File-level and chunk-level progress display
- Real-time Updates: Live status for chunked file processing
- Processing Analytics: Chunk processing rates and time estimates
- Visual Indicators: Clear distinction between chunked and regular files
Technical Implementation:
- Extend progress tracking to handle chunk-level granularity
- Create nested progress bars for chunked files
- Add processing analytics and rate calculations
- Implement visual indicators for different processing states
Objective: Non-blocking workflow with webhook support for long-running jobs
Key Features:
- Async Job Submission: Non-blocking API calls with job tracking
- Webhook Integration: Callback system for job completion notifications
- Status Polling: Automatic job status checking with exponential backoff
- Concurrent Chunk Processing: Parallel chunk uploads when supported
Technical Implementation:
- Implement async/await patterns for API interactions
- Create webhook server for job completion callbacks
- Add job status polling with intelligent retry logic
- Enable concurrent chunk processing with rate limiting
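A minimal sketch of status polling with exponential backoff using `aiohttp`; the job URL and response fields are assumptions, since real services differ:

```python
# Sketch: poll a job-status URL, doubling the delay after each check.
import asyncio
import aiohttp

async def wait_for_job(session: aiohttp.ClientSession, job_url: str,
                       max_delay: float = 60.0) -> dict:
    """Return the job payload once its status is terminal."""
    delay = 2.0
    while True:
        async with session.get(job_url) as resp:
            job = await resp.json()
        if job.get("status") in ("completed", "failed"):
            return job
        await asyncio.sleep(delay)
        delay = min(delay * 2, max_delay)  # exponential backoff with a ceiling
```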
Objective: Robust job tracking and recovery across sessions
Key Features:
- Job Persistence: SQLite-based job state tracking across restarts
- Resume Capability: Continue interrupted chunked file processing
- Job History: Complete audit trail of transcription jobs
- Failure Recovery: Automatic retry of failed chunks
Technical Implementation:
- Create job management database schema
- Implement job state persistence and recovery
- Add chunk-level failure tracking and retry logic
- Create job history and analytics dashboard
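A sketch of the kind of SQLite schema this implies; table and column names are illustrative:

```python
# Sketch: job/chunk persistence tables for resume and retry tracking.
import sqlite3

def init_job_db(path: str = "jobs.db") -> sqlite3.Connection:
    """Create the job and chunk tracking tables if they don't exist."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS jobs (
            id INTEGER PRIMARY KEY,
            source_file TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS chunks (
            id INTEGER PRIMARY KEY,
            job_id INTEGER REFERENCES jobs(id),
            sequence INTEGER NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',
            retries INTEGER NOT NULL DEFAULT 0,
            transcript TEXT
        );
    """)
    return conn
```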
Objective: Production-ready features and performance optimization
Key Features:
- Smart Chunking: Silence-aware chunk boundaries to preserve speech flow
- Compression Optimization: Automatic audio format optimization for API efficiency
- Cost Analytics: API usage tracking and cost estimation
- Advanced Retry Logic: Intelligent failure categorization and retry strategies
Technical Implementation:
- Add silence detection for optimal chunk boundaries
- Implement audio compression with quality preservation
- Create API usage analytics and cost tracking
- Add advanced error categorization and retry logic
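Silence-aware boundaries can be found with FFmpeg's `silencedetect` filter, which logs silence timestamps to stderr. A minimal sketch (the noise and duration thresholds are illustrative):

```python
# Sketch: collect silence start times as candidate chunk boundaries.
import re
import subprocess

def detect_silences(path: str, noise_db: int = -30, min_s: float = 0.5) -> list[float]:
    """Return silence start times (seconds) reported by silencedetect."""
    proc = subprocess.run(
        ["ffmpeg", "-i", path, "-af",
         f"silencedetect=noise={noise_db}dB:d={min_s}", "-f", "null", "-"],
        capture_output=True, text=True)
    return [float(m) for m in re.findall(r"silence_start: ([\d.]+)", proc.stderr)]
```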
# .env file configuration
WHISPER_API_BASE_URL=https://whisperapi.com
WHISPER_API_KEY=your-api-key-here
TRANSCRIPTION_ENGINE=api
CHUNK_SIZE_MB=80
CHUNK_OVERLAP_SECONDS=10
# Command line usage
python main.py /path/to/large/files \
--domain vedanta \
--engine api \
--workers 4
# Process files with automatic chunking
python main.py /path/to/1gb/files \
--engine api \
--domain medical \
--chunk-size 80 \
--chunk-overlap 10
- Preserve Sanskrit Terms: Domain-specific prompts maintained across chunks
- Long Discourse Processing: Handle 2+ hour lectures seamlessly
- Contextual Accuracy: Overlap ensures Sanskrit pronunciation consistency
- HIPAA Compliance: Secure API processing with audit trails
- Large Case Files: Process extensive patient consultations
- Terminology Preservation: Medical domain optimization across chunks
- Lecture Archives: Batch process semester-long course recordings
- Research Interviews: Handle extensive qualitative data collection
- Multi-language Support: Consistent language detection across chunks
Large File (1.5GB) → AudioChunker → [80MB chunks with 10s overlap]
        ↓
Each Chunk → WhisperAPI → Text Response → TextReconstructor
        ↓
All Chunks → Overlap Resolution → Final Transcript → Domain Post-processing
This roadmap transforms the existing domain-optimized transcription system into a cloud-ready, large-scale processing platform while preserving the specialized Vedanta and domain-specific features that make it unique.
The web application uses SQLite with these tables:
- clients: Client management (id, name, email, created_at)
- transcriptions: Video transcriptions (id, client_id, filename, original_text, translated_text, target_language, created_at)
- persona_prompts: AI personas (id, transcription_id, persona_name, system_prompt, created_at)
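For reference, illustrative DDL matching the columns described above (column types are assumptions):

```python
# Sketch: the web app's SQLite schema as described, with assumed types.
SCHEMA = """
CREATE TABLE IF NOT EXISTS clients (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS transcriptions (
    id INTEGER PRIMARY KEY,
    client_id INTEGER REFERENCES clients(id),
    filename TEXT NOT NULL,
    original_text TEXT,
    translated_text TEXT,
    target_language TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS persona_prompts (
    id INTEGER PRIMARY KEY,
    transcription_id INTEGER REFERENCES transcriptions(id),
    persona_name TEXT,
    system_prompt TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""
```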
- Vedanta Centers: Process multi-language spiritual discourses with Sanskrit term preservation
- Universities: Transcribe philosophy and religious studies lectures
- Ashrams: Create searchable archives of teacher discourses
- Spiritual Publishers: Generate accurate transcripts for book production
- Medical Practices: Transcribe patient consultations and dictations
- Hospitals: Process medical rounds and case discussions
- Research: Analyze clinical interview data
- Telemedicine: Create accurate consultation records
- Meetings: Generate searchable meeting minutes and action items
- Interviews: Process HR interviews and performance reviews
- Training: Transcribe corporate training sessions
- Customer Service: Analyze support call recordings
- Universities: Transcribe lectures for accessibility and archives
- Researchers: Process qualitative interview data
- Online Learning: Create subtitles for educational videos
- Language Studies: Analyze speech patterns across languages
- Law Firms: Transcribe depositions and client meetings
- Courts: Process hearing recordings (where permitted)
- Consultants: Document client strategy sessions
- Compliance: Create audit trails of important discussions
- Podcasters: Generate show transcripts and searchable content
- YouTubers: Create accurate subtitles and show notes
- Journalists: Transcribe interviews for articles
- Documentary Makers: Process interview footage
# Test batch transcription setup with domain functionality
python test_batch_transcription.py
# Test with short audio file (faster)
python main.py ./videos/testing --domain general --workers 2
# Test domain listing and suggestions
python main.py --list-domains
python main.py --suggest-domain ./videos/testing
# Test comprehensive functionality
python test_environment.py
python test_chunking.py
python test_realistic_chunking.py
# Install dependencies
pip install -r requirements.txt
# Verify Whisper installation
python -c "import whisper; print('Whisper available')"
# Check available domains
python main.py --list-domains
- Inherit from `TranscriptionEngine` in `transcription_engine.py`
- Implement the `transcribe()` method
- Add the new engine to the factory function `create_engine()`
- Update CLI arguments if needed (a sketch follows below)
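A sketch of what those steps look like; the method signature is an assumption, so check `transcription_engine.py` for the actual interface:

```python
# Sketch: a custom engine wired into the pluggable-engine pattern.
from pathlib import Path

from transcription_engine import TranscriptionEngine

class MyCloudEngine(TranscriptionEngine):
    """Example engine that would call a hosted transcription service."""

    def __init__(self, api_url: str, api_key: str):
        self.api_url = api_url
        self.api_key = api_key

    def transcribe(self, audio_path: Path, **options) -> str:
        # Upload audio_path to the service and return its transcript text.
        raise NotImplementedError("wire up your service's upload call here")

# Then register the class in create_engine() so a CLI flag can select it.
```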
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI Whisper for state-of-the-art transcription
- Ollama for local AI processing
- Streamlit for the web interface
- FFmpeg for audio/video processing
- Spiritual teachers whose lectures comprise the test dataset
Whisper Installation Problems:
pip install --upgrade pip setuptools wheel
pip install openai-whisper
FFmpeg Not Found:
- macOS: `brew install ffmpeg`
- Ubuntu: `sudo apt-get install ffmpeg`
- Windows: Download from https://ffmpeg.org/
Ollama Connection Issues:
# Check if Ollama is running
curl http://localhost:11434/api/version
# Start Ollama
ollama serve
Large File Processing:
- Ensure sufficient disk space for temporary files
- Increase timeout settings for very long recordings
- Use chunked processing for files over 1GB
- Check the troubleshooting guide
- Review existing issues
- Create a detailed issue report with:
- System information
- Error messages
- Steps to reproduce
Ready to transform your spiritual and educational content into interactive AI-powered experiences! 🚀