Stars
[CVPR 2025] Official implementation for "Empowering LLMs to Understand and Generate Complex Vector Graphics" https://arxiv.org/abs/2412.11102
A song aesthetic evaluation toolkit trained on SongEval.
ACE-Step: A Step Towards Music Generation Foundation Model
Qwen3 is the large language model series developed by the Qwen team at Alibaba Cloud.
verl: Volcano Engine Reinforcement Learning for LLMs
Finetune Qwen3, Llama 4, TTS, DeepSeek-R1 & Gemma 3 LLMs 2x faster with 70% less memory! 🦥
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
Official implementation of the paper "ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification"
✨✨Latest Advances on Multimodal Large Language Models
[ACL 2025 Main] UniCodec: a unified audio codec with a single codebook to support multi-domain audio data, including speech, music, and sound
Open-Sora: Democratizing Efficient Video Production for All
VideoSys: An easy and efficient system for video generation
TensorZero creates a feedback loop for optimizing LLM applications — turning production data into smarter, faster, and cheaper models.
InspireMusic: Music, Song, Audio Generation.
An AI-powered speech processing toolkit with open-source SOTA pretrained models, supporting speech enhancement, separation, target speaker extraction, etc.
The official repo of Qwen (通义千问), the chat & pretrained large language model proposed by Alibaba Cloud.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
Multilingual Voice Understanding Model
Multilingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
Speech, Language, Audio, Music Processing with Large Language Model
Audio synthesis, processing, & analysis platform for iOS, macOS and tvOS
INSPIRE: Instruction-based Multi-Task Speech and Audio Processing Benchmark