The University of Hong Kong
- liheyoung.github.io
Stars
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
[CVPR'25 Highlight] Official repository of Sonata: Self-Supervised Learning of Reliable Point Representations
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
A curated list of awesome papers on visual reconstructions from brain activity.
Zero-Shot Monocular Depth Completion with Guided Diffusion
[CVPR 2025] Video Depth without Video Models
[NeurIPS 2024] official code release for our paper "Revisiting the Integration of Convolution and Attention for Vision Backbone".
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
[TPAMI 2025] UniMatch V2: Pushing the Limit of Semi-Supervised Semantic Segmentation
[CVPR 2024 Extension] 160K volumes (42M slices) datasets, new segmentation datasets, 31M-1.2B pre-trained models, various pre-training recipes, 50+ downstream tasks implementation
[CVPR 2025 Highlight] DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2022"
Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs
Official repository for "AM-RADIO: Reduce All Domains Into One"
An open source implementation of CLIP.
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Upgraded repo with more capabilities: the cmd .py scripts converted to function more intuitively, 147 different depth output colour map methods added, batch image as well as video pr…
VLM Evaluation: Benchmark for VLMs, spanning text generation tasks from VQA to Captioning
A flexible and efficient codebase for training visually-conditioned language models (VLMs)
[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLM). It covers datasets, tuning techniques, in-contex…
🌊 Images to → 3D Parallax effect video. A free and open source ImmersityAI alternative
[CVPR 2024] Probing the 3D Awareness of Visual Foundation Models
Muggled DPT: Depth estimation without the magic
[CVPR 2024] VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis
[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
[IEEE TKDE] Open-Domain Semi-Supervised Learning via Glocal Cluster Structure Exploitation
(ECCV 2024) Code for V-IRL: Grounding Virtual Intelligence in Real Life