Stars
[CVPR 2025 Best Paper Nomination] FoundationStereo: Zero-Shot Stereo Matching
RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning
Training a transformer to generate cursive handwriting
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Depth Any Video with Scalable Synthetic Data (ICLR 2025)
DIAMOND (DIffusion As a Model Of eNvironment Dreams) is a reinforcement learning agent trained in a diffusion world model. NeurIPS 2024 Spotlight.
Python Computer Vision & Video Analytics Framework With Batteries Included
rtop, a performance monitor for the Rockchip RK3566/68/88
Lightweight framework for AI-based object detection with live cameras (USB/IP) and Telegram-bot notifications. Use YOLO or adapt it for your own AI models and catch the best shot!
Theia: Distilling Diverse Vision Foundation Models for Robot Learning
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
Fast and flexible image augmentation library. Paper about the library: https://www.mdpi.com/2078-2489/11/2/125
Benchmarking Generalized Out-of-Distribution Detection
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data (NeurIPS 2023 Spotlight) / When Does Perceptual Alignment Benefit Vision Representations? (NeurIPS 2024)
Official Implementation of CVPR24 highlight paper: Matching Anything by Segmenting Anything
Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
A simple demo of yolov5s running on rk3588/3588s using Python (about 72 FPS).
A simple demo of yolov5s running on rk3588/3588s using C++ (about 142 FPS).
Official Code for DragGAN (SIGGRAPH 2023)
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable…
Easily train or fine-tune SOTA computer vision models with one open-source training library. The home of YOLO-NAS.
DAMO-YOLO: a fast and accurate object detection method with some new techs, including NAS backbones, efficient RepGFPN, ZeroHead, AlignedOTA, and distillation enhancement.
The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.