Abstract
Graphic layout designs play an essential role in visual communication. Yet handcrafting layout designs is skill-demanding, time-consuming, and does not scale to batch production. Generative models have emerged to make design automation scalable, but it remains non-trivial to produce designs that comply with designers’ multimodal desires, i.e., designs that are constrained by background images and driven by foreground content. We propose LayoutDETR, which inherits the high quality and realism of generative modeling while reformulating content-aware requirements as a detection problem: we learn to detect, in a background image, the reasonable locations, scales, and spatial relations for multimodal foreground elements in a layout. Our solution sets a new state of the art for layout generation on public benchmarks and on our newly curated ad banner dataset. We integrate our solution into a graphical system that facilitates user studies, and show that users prefer our designs over baselines by significant margins. Code, models, dataset, and demos are available on GitHub.
Acknowledgments
We thank Shu Zhang, Silvio Savarese, Abigail Kutruff, Brian Brechbuhl, Elham Etemad, and Amrutha Krishnan from Salesforce for constructive advice.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Yu, N. et al. (2025). LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15078. Springer, Cham. https://doi.org/10.1007/978-3-031-72661-3_10
DOI: https://doi.org/10.1007/978-3-031-72661-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72660-6
Online ISBN: 978-3-031-72661-3
eBook Packages: Computer Science; Computer Science (R0)