Abstract
Font generation presents a significant challenge due to the intricate details it requires, especially for languages with complex ideograms and numerous characters, such as Chinese and Korean. Although various few-shot (or even one-shot) font generation methods have been introduced, most rely on GAN-based image-to-image translation frameworks that still suffer from (i) unstable training, (ii) limited fidelity in replicating font styles, and (iii) imprecise generation of complex characters. To tackle these problems, we propose a unified one-shot font generation framework called Diff-Font, built on the diffusion model. In particular, we cast font generation as a conditional generation task, in which the content of a character is controlled by a predefined embedding token and the desired font style is extracted from a one-shot reference image. For glyph-rich scripts such as Chinese and Korean, we incorporate additional stroke or component inputs as fine-grained conditions. Owing to the proposed diffusion training process, these three types of information can be effectively modeled, resulting in stable training; at the same time, the integrity of character structures is learned and preserved. To the best of our knowledge, Diff-Font is the first work to apply a diffusion model to font generation. Comprehensive experiments demonstrate that Diff-Font outperforms prior font generation methods in both high-fidelity font style replication and the generation of intricate characters, achieving state-of-the-art results both qualitatively and quantitatively.
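The abstract describes a conditional diffusion denoiser that fuses a content token, a one-shot style reference, and stroke/component conditions. The following PyTorch sketch is only an illustration of how such a conditional DDPM training step could look, not the authors' released implementation; all module and parameter names (ConditionalDenoiser, num_strokes, the toy network, etc.) are assumptions introduced for clarity.

```python
# Hypothetical sketch of a Diff-Font-style conditional DDPM training step (illustrative names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDenoiser(nn.Module):
    """Toy stand-in for the UNet noise predictor, conditioned on content, style, and strokes."""
    def __init__(self, num_chars=1000, num_strokes=32, img_ch=1, dim=128):
        super().__init__()
        self.content_emb = nn.Embedding(num_chars, dim)        # predefined content (character) tokens
        self.style_enc = nn.Sequential(                        # encodes the one-shot style reference image
            nn.Conv2d(img_ch, dim, 4, 4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.stroke_proj = nn.Linear(num_strokes, dim)         # fine-grained stroke/component condition
        self.time_emb = nn.Embedding(1000, dim)                # diffusion timestep embedding
        self.net = nn.Sequential(                              # stands in for the real UNet backbone
            nn.Conv2d(img_ch + 4, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, img_ch, 3, padding=1))

    def forward(self, x_t, t, char_id, style_ref, strokes):
        cond = (self.content_emb(char_id) + self.style_enc(style_ref)
                + self.stroke_proj(strokes) + self.time_emb(t))     # fused condition vector
        b, _, h, w = x_t.shape
        cond_map = cond[:, :4].view(b, 4, 1, 1).expand(b, 4, h, w)  # broadcast a slice as extra channels
        return self.net(torch.cat([x_t, cond_map], dim=1))          # predict the added noise

def ddpm_loss(model, x0, char_id, style_ref, strokes, T=1000):
    """Standard DDPM objective: predict the Gaussian noise added at a random timestep."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise                    # forward diffusion sample
    return F.mse_loss(model(x_t, t, char_id, style_ref, strokes), noise)
```

In this sketch the three conditions are simply summed into one vector; the actual model may fuse them differently (e.g., via cross-attention or adaptive normalization), so the fusion shown here should be read as a placeholder for that design choice.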
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China under Grant 2023YFC2705700, in part by the National Natural Science Foundation of China under Grants U23B2048, 62076186, 62225113, and 62102150, and in part by the Innovative Research Group Project of Hubei Province under Grant 2024AFA017. The numerical calculations in this paper were performed on the supercomputing system at the Supercomputing Center of Wuhan University.
Additional information
Communicated by Seon Joo Kim.
About this article
Cite this article
He, H., Chen, X., Wang, C. et al. Diff-Font: Diffusion Model for Robust One-Shot Font Generation. Int J Comput Vis 132, 5372–5386 (2024). https://doi.org/10.1007/s11263-024-02137-0