Stars
The official implementation of "Learning Compact Vision Tokens for Efficient Large Multimodal Models"
Official code for paper: [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster.
Paper, Code and Resources for Speech Language Model and End2End Speech Dialogue System.
Janus-Series: Unified Multimodal Understanding and Generation Models
The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.
A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
Baichuan-Omni: Towards Capable Open-source Omni-modal LLM 🌊
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
This is a Phi Family of SLMs book for getting started with Phi models. Phi is a family of open-source AI models developed by Microsoft. Phi models are the most capable and cost-effective small langua…
A Survey on Benchmarks of Multimodal Large Language Models
MuCR is a benchmark designed to evaluate Multimodal Large Language Models' (MLLMs) ability to discern causal links across modalities
Official Implementation of "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining"
ReLE Chinese LLM capability evaluation (continuously updated): currently covers 257 large models, including commercial models such as chatgpt, gpt-4.1, o4-mini, Google gemini-2.5, Claude, Zhipu GLM-Z1, ERNIE Bot, qwen-max, Baichuan, iFlytek Spark, SenseTime senseChat, and minimax, as well as DeepSeek-R1-0528, qwq-32b, deepseek-v3, qwen3, llama4, phi-4, glm…
FinRobot: An Open-Source AI Agent Platform for Financial Analysis using LLMs 🚀 🚀 🚀
Dataset and Code for our ACL 2024 paper: "Multimodal Table Understanding". We propose the first large-scale Multimodal IFT and Pre-Train Dataset for table understanding and develop a generalist tab…
An open-source implementation for training LLaVA-NeXT.
Pytorch implementation of Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
A PyTorch native platform for training generative AI models
An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
Codes and Datasets for the Paper: Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction
Vary-tiny codebase built upon LAVIS (for training from scratch) and a PDF image-text pairs dataset (about 600k pairs, English/Chinese)
A flexible and efficient codebase for training visually-conditioned language models (VLMs)
[ACM'MM 2024 Oral] Official code for "OneChart: Purify the Chart Structural Extraction via One Auxiliary Token"
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer