- Shanghai, China
- https://jiaxin-ye.github.io/
Stars
A curated list of video-to-audio generation resources
Hugging Face implementation of AV-HuBERT on the MuAViC dataset
Famous Vision Language Models and Their Architectures
Qwen3 is the large language model series developed by the Qwen team at Alibaba Cloud.
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
MAGI-1: Autoregressive Video Generation at Scale
ICML 2024 "From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation"
📰 Must-read papers and blogs on LLM-based long-context modeling 🔥
[CVPR 2025] The First Investigation of CoT Reasoning in Image Generation
Unified automatic quality assessment for speech, music, and sound.
VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling
The official repository of SpeechCraft dataset, a large-scale expressive bilingual speech dataset with natural language descriptions.
Official PyTorch implementation of MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation (ICLR 2025)
Ego4D dataset repository: download the dataset, visualize it, extract features, and see example usage
Code, dataset, and samples for the NeurIPS paper “Tell What You Hear From What You See -- Video to Audio Generation Through Text”
Official Code for "Rethinking Diffusion Model in High Dimension"
A very simple GRPO implementation for reproducing R1-like LLM thinking.
A set of functions for supervised feature learning/classification of mental states from EEG, based on the "EEG images" idea.
An optimized speech-to-text pipeline for the Whisper model supporting multiple inference engines
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
PyTorch port of Google Research's VGGish model, used for extracting audio features.
LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning
MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
Explore the Multimodal “Aha Moment” on a 2B Model
The first Large Audio Language Model that enables native in-depth thinking, trained on large-scale audio Chain-of-Thought data.