Starred repositories
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Generative models for conditional audio generation
A family of state-of-the-art Transformer-based audio codecs for low-bitrate high-quality audio coding.
An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
YuE: Open Full-song Music Generation Foundation Model, something similar to Suno.ai but open
Ming - facilitating advanced multimodal understanding and generation capabilities built upon the Ling LLM.
Codec for paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
Codebase for 'Scaling Rich Style-Prompted Text-to-Speech Datasets'
A collection of datasets for the purpose of emotion recognition/detection in speech.
InspireMusic: A toolkit designed for music, song, and audio generation
TTSAudioNormalizer is a specialized tool for TTS data production, featuring descriptive statistical analysis of audio loudness and loudness normalization operations.
hexisyztem / CosyVoice
Forked from FunAudioLLM/CosyVoice
Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Speech-To-Text forced-alignment
Speech processing Universal PERformance Benchmark
Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice
Omni SenseVoice: High-Speed Speech Recognition with word timestamps 🗣️🎯
An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.
The repo provides information about KeSpeech dataset.
Noise suppression using deep filtering
AI powered speech denoising and enhancement
Daily tracking of awesome audio papers, including music generation, zero-shot TTS, ASR, and audio generation
A generative speech model for daily dialogue.
SALMONN family: A suite of advanced multi-modal LLMs