Computer Science > Computer Vision and Pattern Recognition

arXiv:2409.04429 (cs)

[Submitted on 6 Sep 2024 (v1), last revised 23 Oct 2024 (this version, v2)]

Title: VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Authors: Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu

Abstract: VILA-U is a Unified foundation model that integrates Video, Image, and Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components such as diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: first, a unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception; second, the observation that autoregressive image generation can achieve quality similar to that of diffusion models when trained on a high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.
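
As a rough illustration of what a single next-token objective over text and discrete visual tokens can look like, here is a minimal Python sketch. It places text ids and visual codebook indices in one shared id space and derives the shifted (input, target) pairs an autoregressive model would be trained on. The vocabulary sizes, the offset scheme, the special-token ids and the function names are all assumptions made for illustration; none of them is taken from the paper or its code release.

```python
from typing import List, Tuple

# All constants below are hypothetical values chosen for illustration.
TEXT_VOCAB_SIZE = 32000        # assumed number of text tokens
VISUAL_CODEBOOK_SIZE = 16384   # assumed number of discrete visual codes
IMAGE_START = TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE  # hypothetical marker id
IMAGE_END = IMAGE_START + 1                           # hypothetical marker id


def build_sequence(text_ids: List[int], visual_codes: List[int]) -> List[int]:
    """Merge text ids and visual codes into one sequence over a shared id space."""
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_ids):
        raise ValueError("text id out of range")
    if any(not 0 <= c < VISUAL_CODEBOOK_SIZE for c in visual_codes):
        raise ValueError("visual code out of range")
    # Shift visual codes past the text vocabulary so the two ranges never collide.
    shifted = [TEXT_VOCAB_SIZE + c for c in visual_codes]
    return list(text_ids) + [IMAGE_START] + shifted + [IMAGE_END]


def next_token_pairs(sequence: List[int]) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets), where targets[i] is the token following inputs[i]."""
    return sequence[:-1], sequence[1:]


sequence = build_sequence(text_ids=[5, 7], visual_codes=[0, 3])
inputs, targets = next_token_pairs(sequence)
```

Because text and image tokens share one sequence, the same shifted-target loss would cover both understanding (image tokens first, text predicted) and generation (text first, image tokens predicted) in a setup like this.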

Comments:	Code: this https URL. The first two authors contributed equally to this work
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2409.04429 [cs.CV]
	(or arXiv:2409.04429v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2409.04429

Submission history

From: Yecheng Wu
[v1] Fri, 6 Sep 2024 17:49:56 UTC (18,115 KB)
[v2] Wed, 23 Oct 2024 16:42:06 UTC (18,115 KB)
