Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, and of generating speech in real time.

Jupyter Notebook 2,933 221 Updated May 15, 2025

SesameAILabs / csm

A Conversational Speech Generation Model

Python 13,229 1,253 Updated Mar 27, 2025

pipecat-ai / smart-turn

Python 722 32 Updated Apr 18, 2025

Jiang-Yidi / UniCodec

UniCodec: a unified audio codec with a single codebook to support multi-domain audio data, including speech, music, and sound

118 2 Updated Feb 28, 2025

lmxue / Audio-FLAN

Audio-FLAN

150 4 Updated Mar 6, 2025

lucadellalib / focalcodec

A low-bitrate single-codebook 16 kHz speech codec based on focal modulation

Python 86 10 Updated Feb 12, 2025

hkust-nlp / simpleRL-reason

Simple RL training for reasoning

Python 3,562 265 Updated Apr 10, 2025

facebookresearch / audiobox-aesthetics

Unified automatic quality assessment for speech, music, and sound.

Python 484 31 Updated May 1, 2025

Unakar / Logic-RL

Reproduce R1 Zero on Logic Puzzle

Python 2,337 155 Updated Mar 20, 2025

multimodal-art-projection / YuE

YuE: Open Full-song Music Generation Foundation Model, similar to Suno.ai but open

Python 4,969 544 Updated May 15, 2025

RAGEN-AI / RAGEN

RAGEN leverages reinforcement learning to train LLM reasoning agents in interactive, stochastic environments.

Python 1,814 129 Updated May 13, 2025

huggingface / open-r1

Fully open reproduction of DeepSeek-R1

Python 24,421 2,250 Updated May 15, 2025

OpenBMB / MiniCPM-o

MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone

Python 19,428 1,402 Updated May 15, 2025

zhenye234 / X-Codec-2.0

Codec for paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis

Python 265 32 Updated Mar 12, 2025

PKU-Alignment / align-anything

Align Anything: Training All-modality Model with Feedback

Jupyter Notebook 3,675 430 Updated May 1, 2025

liutaocode / TTS-arxiv-daily

Automatically Update Text-to-speech (TTS) Papers Daily using GitHub Actions (Updates Every 12 Hours)

Python 433 24 Updated May 15, 2025

Text-to-Audio / AudioLCM

PyTorch Implementation of AudioLCM (ACM-MM'24): an efficient and high-quality text-to-audio generation with latent consistency model.

Python 1,097 153 Updated Apr 3, 2025

google-deepmind / librispeech-long

LibriSpeech-Long is a benchmark dataset for long-form speech generation and processing. Released as part of "Long-Form Speech Generation with Spoken Language Models" (arXiv 2024).

65 1 Updated Dec 28, 2024