Abstract: With the development of generative adversarial networks, synthesizing images from text descriptions has become an active research area. However, the text descriptions used for image generation are usually in English, and the generated objects are mostly faces, flowers, birds, etc. Few studies have addressed the generation of Chinese paintings from Chinese descriptions. The text-to-image task typically requires a large number of labeled image--text pairs, which are expensive and labor-intensive to collect. Advances in vision-language pre-training enable the image generation process to be guided in an optimization-based manner, which significantly reduces the demand for annotated datasets and computational resources. In this paper, a multi-domain VQGAN model is proposed to generate Chinese paintings in multiple domains. Furthermore, the vision-language pre-training model WenLan is used to compute a distance loss between the generated images and the text descriptions. Semantic consistency between images and text is achieved by optimizing the latent-space variables that serve as the input of the multi-domain VQGAN. An ablation study is conducted to compare different variants of our multi-domain VQGAN in terms of the FID and R-precision metrics, and a user study is conducted to further demonstrate the effectiveness of the proposed model. Extensive results demonstrate that the proposed multi-domain VQGAN model outperforms all competitors in terms of image quality and text-image semantic consistency.
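To make the optimization-based guidance concrete, the sketch below illustrates the general idea of steering a generator by optimizing its latent input against a text-image distance loss. It is a minimal, self-contained PyTorch example: `DummyDecoder` and `DummyImageEncoder` are hypothetical stand-ins for the multi-domain VQGAN decoder and the WenLan image encoder, and the text embedding is a random vector; none of this reflects the actual APIs of VQGAN, WenLan, or the method's exact loss.

```python
# Minimal sketch of text-guided latent optimization (illustrative only).
# DummyDecoder / DummyImageEncoder are placeholders, NOT the real
# multi-domain VQGAN or WenLan interfaces.
import torch
import torch.nn.functional as F


class DummyDecoder(torch.nn.Module):
    """Stands in for the multi-domain VQGAN decoder: latent -> image."""
    def __init__(self, latent_dim=256, image_pixels=3 * 64 * 64):
        super().__init__()
        self.net = torch.nn.Linear(latent_dim, image_pixels)

    def forward(self, z):
        return torch.tanh(self.net(z)).view(-1, 3, 64, 64)


class DummyImageEncoder(torch.nn.Module):
    """Stands in for the WenLan image encoder: image -> joint embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = torch.nn.Linear(3 * 64 * 64, embed_dim)

    def forward(self, img):
        return self.net(img.flatten(1))


def optimize_latent(text_embedding, steps=200, latent_dim=256, lr=0.05):
    """Optimize the latent code so the decoded image moves toward the
    text embedding (1 - cosine similarity as the distance loss)."""
    decoder, image_encoder = DummyDecoder(latent_dim), DummyImageEncoder()
    z = torch.randn(1, latent_dim, requires_grad=True)  # latent to optimize
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = decoder(z)                  # generate an image from the latent
        img_emb = image_encoder(image)      # embed the image in the joint space
        loss = 1.0 - F.cosine_similarity(img_emb, text_embedding).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return decoder(z).detach()


# In practice the text embedding would come from the WenLan text encoder;
# a random vector is used here purely to keep the sketch executable.
generated = optimize_latent(torch.randn(1, 128))
```

The key design point the abstract alludes to is that only the latent input is updated during generation; the generator and the vision-language encoders stay frozen, which is why no paired image-text training data is required at this stage.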