LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer

Ning Yu¹³,
Chia-Chih Chen¹³,
Zeyuan Chen¹³,
Rui Meng¹³,
Gang Wu¹³,
Paul Josel¹³,
Juan Carlos Niebles¹³,
Caiming Xiong¹³ &
Ran Xu¹³

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15078)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Graphic layout designs play an essential role in visual communication. Yet handcrafting layout designs is skill-demanding, time-consuming, and non-scalable to batch production. Generative models emerge to make design automation scalable but it remains non-trivial to produce designs that comply with designers’ multimodal desires, i.e., constrained by background images and driven by foreground content. We propose LayoutDETR that inherits the high quality and realism from generative modeling, while reformulating content-aware requirements as a detection problem: we learn to detect in a background image the reasonable locations, scales, and spatial relations for multimodal foreground elements in a layout. Our solution sets a new state-of-the-art performance for layout generation on public benchmarks and on our newly-curated ad banner dataset. We integrate our solution into a graphical system that facilitates user studies, and show that users prefer our designs over baselines by significant margins. Code, models, dataset, and demos are available at GitHub.

Acknowledgments

We thank Shu Zhang, Silvio Savarese, Abigail Kutruff, Brian Brechbuhl, Elham Etemad, and Amrutha Krishnan from Salesforce for constructive advice.

Author information

Authors and Affiliations

Salesforce Research, San Francisco, USA
Ning Yu, Chia-Chih Chen, Zeyuan Chen, Rui Meng, Gang Wu, Paul Josel, Juan Carlos Niebles, Caiming Xiong & Ran Xu

Corresponding author

Correspondence to Ning Yu.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 18049 KB)

Supplementary material 2 (mov 82611 KB)

About this paper

Cite this paper

Yu, N. et al. (2025). LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15078. Springer, Cham. https://doi.org/10.1007/978-3-031-72661-3_10

DOI: https://doi.org/10.1007/978-3-031-72661-3_10
Published: 27 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72660-6
Online ISBN: 978-3-031-72661-3
eBook Packages: Computer Science, Computer Science (R0)
