Stars
This repository provides data for the VAW dataset as described in the CVPR 2021 paper titled "Learning to Predict Visual Attributes in the Wild" and the ECCV 2022 paper titled "Improving Closed and…
A benchmark dataset for GRES and GREC [CVPR2023 Highlight]
The first attempt to replicate o3-like visual clue-tracking reasoning capabilities.
An Open-source RL System from ByteDance Seed and Tsinghua AIR
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
[ACL2025 Findings] Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
verl: Volcano Engine Reinforcement Learning for LLMs
Witness the aha moment of VLM with less than $3.
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen2.5-VL, Qwen2.5-Omni, Qwen2-Audio, Ovis2, InternVL3, Llava, GLM4…
ICCV 2023 (Oral) Open-domain Visual Entity Recognition Towards Recognizing Millions of Wikipedia Entities
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
A curated list of papers and resources related to Described Object Detection, Open-Vocabulary/Open-World Object Detection and Referring Expression Comprehension. Updated frequently and pull request…
GPT4V-level open-source multi-modal model based on Llama3-8B
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Label Studio is a multi-type data labeling and annotation tool with standardized output format
A state-of-the-art-level open visual language model | Multimodal pretrained model
⚡️An Easy-to-use and Fast Deep Learning Model Deployment Toolkit for ☁️Cloud 📱Mobile and 📹Edge. Including Image, Video, Text and Audio 20+ main stream scenarios and 150+ SOTA models with end-to-end…
VGGFace implementation with Keras Framework
Pretrained PyTorch face detection (MTCNN) and facial recognition (InceptionResnet) models
A clean version (wash list) of MS-Celeb-1M face dataset, containing 6,464,018 face images of 94,682 celebrities
GIPHY's Open-Source Celebrity Detection Deep Learning Model
Dataset with 5 million images depicting human-made and natural landmarks spanning 200 thousand classes.
Project Page for "LISA: Reasoning Segmentation via Large Language Model"
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Official repo for VGen: a holistic video generation ecosystem for video generation building on diffusion models
Official code and data of "3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset"
Implementation of Imagen, Google's Text-to-Image Neural Network, in PyTorch
Official implementation for "Multimodal Chain-of-Thought Reasoning in Language Models" (stay tuned and more will be updated)