
A Task Is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15116)

Abstract

Advancing image inpainting is challenging, as it requires filling user-specified regions for various intents, such as background filling and object synthesis. Existing approaches focus on either context-aware filling or object synthesis using text descriptions. However, handling both tasks in a single model is difficult because they call for differing training strategies. To overcome this challenge, we introduce PowerPaint, the first high-quality and versatile inpainting model that excels at multiple inpainting tasks. First, we introduce learnable task prompts along with tailored fine-tuning strategies that explicitly guide the model's focus toward different inpainting targets. This enables PowerPaint to accomplish various inpainting tasks simply by switching task prompts, achieving state-of-the-art performance. Second, we demonstrate the versatility of the task prompts by showing that they are effective as negative prompts for object removal. Moreover, we leverage prompt interpolation to enable controllable shape-guided object inpainting. Finally, we verify the effectiveness of PowerPaint through extensive experiments and applications. Our code and models are available on the project page: https://powerpaint.github.io/.
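
The abstract bundles three mechanisms: learnable task prompts trained with tailored fine-tuning, reuse of a task prompt as a negative prompt for object removal, and prompt interpolation for shape-guided control. The PyTorch sketch below illustrates these ideas under stated assumptions; the names (TaskPrompts, removal_guidance), the embedding size, and the initialization are illustrative choices, not the released PowerPaint implementation, which is available on the project page.

    import torch
    import torch.nn as nn

    class TaskPrompts(nn.Module):
        """Learnable task-prompt embeddings, one per inpainting task.

        A textual-inversion-style sketch: each task is represented by a
        single trainable token embedding. The 768-dim size (typical for a
        CLIP text encoder) and initialization scale are assumptions.
        """

        def __init__(self, embed_dim: int = 768):
            super().__init__()
            self.p_ctxt = nn.Parameter(0.02 * torch.randn(embed_dim))   # context-aware background filling
            self.p_obj = nn.Parameter(0.02 * torch.randn(embed_dim))    # text-guided object synthesis
            self.p_shape = nn.Parameter(0.02 * torch.randn(embed_dim))  # shape-guided object inpainting

        def interpolate(self, alpha: float) -> torch.Tensor:
            # Prompt interpolation for controllable shape-guided inpainting:
            # alpha = 1.0 adheres closely to the mask shape, alpha = 0.0
            # synthesizes an object freely inside the masked region.
            return alpha * self.p_shape + (1.0 - alpha) * self.p_obj

    def removal_guidance(eps_cond: torch.Tensor,
                         eps_obj: torch.Tensor,
                         w: float = 7.5) -> torch.Tensor:
        # Classifier-free guidance with a task prompt as the *negative*
        # prompt. For object removal, eps_obj is the denoiser's noise
        # prediction conditioned on the object-synthesis prompt; guiding
        # away from it discourages hallucinating a new object in the mask.
        return eps_obj + w * (eps_cond - eps_obj)

As the abstract indicates, each prompt embedding would condition a diffusion inpainting backbone through its text encoder and be optimized with a fine-tuning strategy tailored to its task, which is what lets a single model switch behavior on a one-word cue.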

J. Zhuang: Work done during an internship at Shanghai Artificial Intelligence Laboratory.

Acknowledgments

This work is supported by the National Key R&D Program of China (No. 2022YFB4701400/4701402, No. 2022ZD0161600), SSTIC Grant (KJZD20230923115106012, KJZD20230923114916032), and the Beijing Key Lab of Networked Multimedia.

Author information

Correspondence to Chun Yuan or Kai Chen.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 12336 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhuang, J., Zeng, Y., Liu, W., Yuan, C., Chen, K. (2025). A Task Is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15116. Springer, Cham. https://doi.org/10.1007/978-3-031-73636-0_12

  • DOI: https://doi.org/10.1007/978-3-031-73636-0_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73635-3

  • Online ISBN: 978-3-031-73636-0

  • eBook Packages: Computer Science, Computer Science (R0)
