Starred repositories
Ke-Omni-R is an advanced audio reasoning model that achieved SOTA on MMAU
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and of performing real-time speech generation.
A Conversational Speech Generation Model
UniCodec: a unified audio codec with a single codebook to support multi-domain audio data, including speech, music, and sound
A low-bitrate single-codebook 16 kHz speech codec based on focal modulation
Unified automatic quality assessment for speech, music, and sound.
YuE: Open Full-song Music Generation Foundation Model, similar to Suno.ai but open
RAGEN leverages reinforcement learning to train LLM reasoning agents in interactive, stochastic environments.
Fully open reproduction of DeepSeek-R1
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
Codec for paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
Align Anything: Training All-modality Model with Feedback
Automatically Update Text-to-speech (TTS) Papers Daily using GitHub Actions (Update Every 12 hours)
PyTorch Implementation of AudioLCM (ACM-MM'24): efficient and high-quality text-to-audio generation with a latent consistency model.
LibriSpeech-Long is a benchmark dataset for long-form speech generation and processing. Released as part of "Long-Form Speech Generation with Spoken Language Models" (arXiv 2024).
Official implementation of the paper "BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec"
An Open-Sourced LLM-empowered Foundation TTS System
[ACL 2024] Generative Pre-Trained Speech Language Model with Efficient Hierarchical Transformer
Ultra-low-bitrate Speech Codec for Speech Language Modeling Applications
Awesome Neural Codec Models, Text-to-Speech Synthesizers & Speech Language Models
SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
✨✨Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM