Stars
FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding. (WACV2025)
(CVPR 2025 highlight✨) Official repository of paper "LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models"
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability
OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning
The official implementation of "VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning"
This repo contains the code for a 1D tokenizer and generator
Official PyTorch implementation for "Large Language Diffusion Models"
PyTorch implementation of MaskGIT: Masked Generative Image Transformer (https://arxiv.org/pdf/2202.04200.pdf)
(AAAI 2025) Official PyTorch implementation of paper "SAUGE: Taming SAM for Uncertainty-Aligned Multi-Granularity Edge Detection".
Denoising Diffusion Probabilistic Models
Image-to-Image Translation in PyTorch
A PyTorch library for implementing flow matching algorithms, featuring continuous and discrete flow matching implementations. It includes practical examples for both text and image modalities.
This is a repository for listing papers on scene graph generation and application.
[ICLR'25] Official code for "Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models"
[CVPR 2025 🔥] DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
Official implementation of Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement.
Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
UniMD: Towards Unifying Moment retrieval and temporal action Detection
Uncertainty-aware Fine-tuning of Segmentation Foundation Models (NeurIPS 2024).
Tips for releasing research code in Machine Learning (with official NeurIPS 2020 recommendations)
[CVPR2024] GSVA: Generalized Segmentation via Multimodal Large Language Models
This repository is for the first survey on SAM & SAM2 for Videos.
[EMNLP 2022] Official PyTorch code for "Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval"
[2021 MultiMedia] CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval