Starred repositories
TEMPURA enables video-language models to reason about causal event relationships and generate fine-grained, timestamped descriptions of untrimmed videos.
Re-implementation of pi0 vision-language-action (VLA) model from Physical Intelligence
Speech-to-text transcription VST3/ARA plugin
SAPIEN Manipulation Skill Framework, an open source GPU parallelized robotics simulator and benchmark, led by Hillbot, Inc.
[Lumina Embodied AI Community] Embodied AI Technical Guide (Embodied-AI-Guide)
Implementation of π₀, the robotic foundation model architecture proposed by Physical Intelligence
Audio Dataset for training CLAP and other models
adefossez / demucs
Forked from facebookresearch/demucs
Code for the paper Hybrid Spectrogram and Waveform Source Separation
An easy-to-use, fast, and easily integrable tool for evaluating audio LLMs
Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice
An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
🤗 LeRobot: Making AI for Robotics more accessible with end-to-end learning
Production First and Production Ready End-to-End Speech Recognition Toolkit
ACL 2025: Synthetic data generation pipelines for text-rich images.
SimVQ: Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
The official repository of SpeechCraft dataset, a large-scale expressive bilingual speech dataset with natural language descriptions.
My learning notes and code for ML systems.
LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
An aggregation of human motion understanding research.
A spoken question answering dataset based on SQuAD
Official PyTorch Implementation for Paper "No More Adam: Learning Rate Scaling at Initialization is All You Need"