- The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen)
- https://www.zhangxueyao.com/
Stars
Use PEFT or full-parameter training to run CPT/SFT/DPO/GRPO on 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen2.5-VL, Qwen2.5-Omni, Qwen2-Audio, Ovis2, InternVL3, Llava, GLM4… A minimal PEFT (LoRA) sketch appears after this list.
Ming - facilitating advanced multimodal understanding and generation capabilities built upon the Ling LLM.
The TTSDS benchmark evaluates synthetic speech quality across factors such as prosody, speaker identity, and intelligibility, comparing each factor against reference datasets of real speech and noise.
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
Janus-Series: Unified Multimodal Understanding and Generation Models
A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
Parallel Scaling Law for Language Models: Beyond Parameter and Inference-Time Scaling
Benchmark data and code for MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
A song aesthetic evaluation toolkit trained on SongEval.
SSR-Speech: Towards Stable, Safe and Robust Zero-shot Speech Editing and Synthesis
ACE-Step: A Step Towards Music Generation Foundation Model
TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching
A collection of high-quality public recordings of Bach's sonatas and partitas for solo violin (BWV 1001–1006)
PyTorch Implementation of StyleSinger (AAAI 2024): Style Transfer for Out-of-Domain Singing Voice Synthesis
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
[CVPR 2025 Oral] Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Qwen2.5-Omni is an end-to-end multimodal model from the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and of generating speech in real time.
PyTorch Implementation of TCSinger (EMNLP 2024): Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
Di♪♪Rhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
A Conversational Speech Generation Model
Understanding R1-Zero-Like Training: A Critical Perspective
A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5
Ongoing research training transformer models at scale
A Singing Style Conversion Framework Based On Audio Infilling
Measuring Massive Multitask Language Understanding | ICLR 2021
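
The ms-swift entry above distinguishes PEFT from full-parameter training. As a minimal sketch of what PEFT means in practice, here is a LoRA setup using the Hugging Face peft library rather than ms-swift itself; the model name and target modules are illustrative assumptions:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative model choice; any causal LM from the Hub works the same way.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# LoRA: freeze the base weights and train small low-rank adapters instead.
# target_modules are an assumption; projection names vary by architecture.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

# Typically well under 1% of parameters remain trainable.
model.print_trainable_parameters()
```

Full-parameter training simply skips the adapter step and fine-tunes every weight; toolkits like ms-swift expose both modes behind a single interface.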
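
For the MMLU entry above, here is a minimal sketch of loading and formatting one question with the Hugging Face datasets library, assuming the commonly used cais/mmlu mirror on the Hub; the subject name and prompt format are illustrative:

```python
from datasets import load_dataset

# MMLU: 57 subjects, four-way multiple choice.
# "cais/mmlu" is a community mirror (an assumption; the official release
# is distributed by the paper's authors).
ds = load_dataset("cais/mmlu", "abstract_algebra", split="test")

def format_question(example):
    """Render one item as an A/B/C/D prompt and return the gold letter."""
    letters = "ABCD"
    lines = [example["question"]]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(example["choices"])]
    lines.append("Answer:")
    return "\n".join(lines), letters[example["answer"]]

prompt, gold = format_question(ds[0])
print(prompt)
print("Gold answer:", gold)
```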