Starred repositories
Recipes for scaling the inference-time compute of open models
Efficient Triton Kernels for LLM Training
Image transformations designed for Scene Text Recognition (STR) data augmentation. Published at ICCV 2021 Workshop on Interactive Labeling and Data Augmentation for Vision.
Image augmentation for machine learning experiments.
[CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
[ICLR 2024] Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
📖 A curated list of resources dedicated to hallucination in multimodal large language models (MLLMs).
Accelerating the development of large multimodal models (LMMs) with lmms-eval, a one-click evaluation module.
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
[ICLR 2024 🔥] Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
Recent LLM-based computer vision and related works. Comments and contributions welcome!
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
🔥🔥🔥 Latest papers, code, and datasets on Vid-LLMs.
Mora: More like Sora for Generalist Video Generation
Implementation of paper - YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
OpenMMLab YOLO series toolbox and benchmark. Implements RTMDet, RTMDet-Rotated, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOX, PPYOLOE, etc.
YOLOX is a high-performance anchor-free YOLO detector that outperforms YOLOv3–v5, with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO support. Documentation: https://yolox.readthedocs.io/
[CVPR 2024] Real-Time Open-Vocabulary Object Detection
[CVPR 2024 Highlight] Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models
[CVPR 2024] Generative Region-Language Pretraining for Open-Ended Object Detection
This project shares the technical principles behind large models, along with hands-on experience (LLM engineering and real-world LLM application deployment).
[ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
[ECCV 2024] Official GitHub repository for the paper "LingoQA: Visual Question Answering for Autonomous Driving"