Stars
Voice Activity Detector (VAD) from TEN: low-latency, high-performance, and lightweight
ACE-Step: A Step Towards Music Generation Foundation Model
Multi-Scale Neural Audio Codec (SNAC) compresses audio into discrete codes at a low bitrate
A TTS model capable of generating ultra-realistic dialogue in one pass.
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
MAGI-1: Autoregressive Video Generation at Scale
Train your AI self, amplify you, bridge the world
[ICML 2025] Gaussian Mixture Flow Matching Models (GMFlow)
Official PyTorch implementation of "DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion" (AAAI 2024)
[ICASSP 2025] "FLowHigh: Towards efficient and high-quality audio super-resolution with single-step flow matching"
[ICASSP 2024] TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models
Generate high-definition short videos with one click using large AI models.
The official repo of NBC & SpatialNet for multichannel speech separation, denoising, and dereverberation
Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS …
Transcription, forced alignment, and audio indexing with OpenAI's Whisper
YuE: Open Full-song Music Generation Foundation Model, something similar to Suno.ai but open
Open-source industrial-grade ASR models supporting Mandarin, Chinese dialects and English, achieving a new SOTA on public Mandarin ASR benchmarks, while also offering outstanding singing lyrics rec…
Cantonese Grapheme-to-Phoneme Converter based on GitYCC/g2pW
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
[INTERSPEECH 2024] EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark
Text to speech alignment using CTC forced alignment
Repository for training models for music source separation.
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"