Computer Science > Computer Vision and Pattern Recognition

arXiv:2408.05710 (cs)

[Submitted on 11 Aug 2024]

Title:Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

Authors:Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li

View PDF HTML (experimental)

Abstract:This paper identifies significant redundancy in the query-key interactions within self-attention mechanisms of diffusion transformer models, particularly during the early stages of denoising diffusion steps. In response to this observation, we present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately. By modulating the number of mediator tokens during the denoising generation phases, our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail. Concurrently, integrating mediator tokens simplifies the attention module's complexity to a linear scale, enhancing the efficiency of global attention processes. Additionally, we propose a time-step dynamic mediator token adjustment mechanism that further decreases the required computational FLOPs for generation, simultaneously facilitating the generation of high-quality images within the constraints of varied inference budgets. Extensive experiments demonstrate that the proposed method can improve the generated image quality while also reducing the inference cost of diffusion transformers. When integrated with the recent work SiT, our method achieves a state-of-the-art FID score of 2.01. The source code is available at this https URL.

Comments:	ECCV 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2408.05710 [cs.CV]
	(or arXiv:2408.05710v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.05710

Submission history

From: Yifan Pu [view email]
[v1] Sun, 11 Aug 2024 07:01:39 UTC (2,498 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators