
Diff-Font: Diffusion Model for Robust One-Shot Font Generation

Published in: International Journal of Computer Vision

Abstract

Font generation presents a significant challenge due to the intricate details needed, especially for languages with complex ideograms and numerous characters, such as Chinese and Korean. Although various few-shot (or even one-shot) font generation methods have been introduced, most of them rely on GAN-based image-to-image translation frameworks that still face (i) unstable training issues, (ii) limited fidelity in replicating font styles, and (iii) imprecise generation of complex characters. To tackle these problems, we propose a unified one-shot font generation framework called Diff-Font, based on the diffusion model. In particular, we approach font generation as a conditional generation task, where the content of characters is managed through predefined embedding tokens and the desired font style is extracted from a one-shot reference image. For glyph-rich scripts such as Chinese and Korean, we incorporate additional inputs for strokes or components as fine-grained conditions. Owing to the proposed diffusion training process, these three types of information can be effectively modeled, resulting in stable training. Simultaneously, the integrity of character structures can be learned and preserved. To the best of our knowledge, Diff-Font is the first work to utilize a diffusion model for font generation tasks. Comprehensive experiments demonstrate that Diff-Font outperforms prior font generation methods in both high-fidelity font style replication and the generation of intricate characters. Our method achieves state-of-the-art results in both qualitative and quantitative evaluations.
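To make the conditioning described above concrete, the following is a minimal, hypothetical PyTorch sketch of how the three signals (a predefined content embedding token, a style feature taken from one reference glyph, and a stroke/component vector) could be fused to condition a single denoising step. This is not the authors' implementation; the class names, dimensions, the stand-in denoiser, and the toy noise schedule are illustrative assumptions only.

    # Minimal sketch (not the released Diff-Font code) of combining the three
    # conditions for a DDPM-style training step. Everything here is assumed.
    import torch
    import torch.nn as nn

    class ConditionEncoder(nn.Module):
        def __init__(self, num_chars=6625, num_strokes=32, style_dim=128, cond_dim=256):
            super().__init__()
            # Content: one predefined embedding token per character class.
            self.char_embed = nn.Embedding(num_chars, cond_dim)
            # Fine-grained condition: a stroke/component count vector.
            self.stroke_proj = nn.Linear(num_strokes, cond_dim)
            # Style: assumed to be a feature extracted from the one-shot
            # reference image by some style encoder; here just projected.
            self.style_proj = nn.Linear(style_dim, cond_dim)

        def forward(self, char_id, stroke_vec, style_feat):
            return (self.char_embed(char_id)
                    + self.stroke_proj(stroke_vec)
                    + self.style_proj(style_feat))

    class ToyDenoiser(nn.Module):
        """Stand-in for the conditional UNet that predicts the added noise."""
        def __init__(self, img_size=80, cond_dim=256):
            super().__init__()
            self.img_size = img_size
            self.net = nn.Sequential(
                nn.Linear(img_size * img_size + cond_dim + 1, 512),
                nn.SiLU(),
                nn.Linear(512, img_size * img_size),
            )

        def forward(self, x_t, t, cond):
            b = x_t.shape[0]
            inp = torch.cat([x_t.flatten(1), cond, t.float().unsqueeze(1)], dim=1)
            return self.net(inp).view(b, 1, self.img_size, self.img_size)

    # One diffusion training step: noise a glyph image, then regress the noise.
    enc, eps_model = ConditionEncoder(), ToyDenoiser()
    x0 = torch.randn(4, 1, 80, 80)                # target glyph images
    char_id = torch.randint(0, 6625, (4,))        # character (content) ids
    stroke = torch.rand(4, 32)                    # stroke/component condition
    style = torch.randn(4, 128)                   # feature of one reference glyph
    t = torch.randint(0, 1000, (4,))
    alpha_bar = torch.cos(t.float() / 1000 * 1.5) ** 2   # toy noise schedule
    noise = torch.randn_like(x0)
    x_t = (alpha_bar.view(-1, 1, 1, 1).sqrt() * x0
           + (1 - alpha_bar).view(-1, 1, 1, 1).sqrt() * noise)
    loss = ((eps_model(x_t, t, enc(char_id, stroke, style)) - noise) ** 2).mean()
    loss.backward()

In the paper's actual setting the denoiser would be a conditional UNet operating on glyph images and the style feature would come from a learned style encoder applied to the reference image; the sketch only illustrates the additive fusion of content, stroke, and style conditions before noise prediction.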





Acknowledgements

This work was supported in part by the National Key Research and Development Program of China under Grant 2023YFC2705700, in part by the National Natural Science Foundation of China under Grants U23B2048, 62076186, 62225113, and 62102150, and in part by the Innovative Research Group Project of Hubei Province under Grant 2024AFA017. The numerical calculations in this paper were performed on the supercomputing system at the Supercomputing Center of Wuhan University.

Author information

Correspondence to Chaoyue Wang or Juhua Liu.

Additional information

Communicated by Seon Joo Kim.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

He, H., Chen, X., Wang, C. et al. Diff-Font: Diffusion Model for Robust One-Shot Font Generation. Int J Comput Vis 132, 5372–5386 (2024). https://doi.org/10.1007/s11263-024-02137-0
