Computer Science > Computation and Language

arXiv:2401.14688 (cs)

[Submitted on 26 Jan 2024 (v1), last revised 18 Jun 2024 (this version, v3)]

Title:Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support

Authors:Xiaojun Wu, Dixiang Zhang, Ruyi Gan, Junyu Lu, Ziwei Wu, Renliang Sun, Jiaxing Zhang, Pingjian Zhang, Yan Song

Abstract:Recent advancements in text-to-image models have significantly enhanced image generation capabilities, yet a notable gap of open-source models persists in bilingual or Chinese language support. To address this need, we present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model which is developed by extending the capabilities of CLIP and Stable-Diffusion-XL through a process of bilingual continuous pre-training. This approach includes the efficient expansion of vocabulary by integrating the most frequently used Chinese characters into CLIP's tokenizer and embedding layers, coupled with an absolute position encoding expansion. Additionally, we enrich text prompts by large vision-language model, leading to better images captions and possess higher visual quality. These enhancements are subsequently applied to downstream text-to-image models. Our empirical results indicate that the developed CLIP model excels in bilingual image-text this http URL, the bilingual image generation capabilities of Taiyi-Diffusion-XL surpass previous models. This research leads to the development and open-sourcing of the Taiyi-Diffusion-XL model, representing a notable advancement in the field of image generation, particularly for Chinese language applications. This contribution is a step forward in addressing the need for more diverse language support in multimodal research. The model and demonstration are made publicly available at \href{this https URL}, fostering further research and collaboration in this domain.

Comments:	Taiyi-Diffusion-XL Tech Report
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2401.14688 [cs.CL]
	(or arXiv:2401.14688v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2401.14688

Submission history

From: Xiaojun Wu [view email]
[v1] Fri, 26 Jan 2024 07:17:50 UTC (46,890 KB)
[v2] Mon, 27 May 2024 03:13:52 UTC (47,057 KB)
[v3] Tue, 18 Jun 2024 03:17:18 UTC (36,821 KB)

Computer Science > Computation and Language

Title:Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computation and Language

Title:Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators