
Parrot: Pareto-Optimal Multi-reward Reinforcement Learning Framework for Text-to-Image Generation

Published: 01 October 2024

Abstract

Recent works have demonstrated that reinforcement learning (RL) with multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting the reward weights is challenging and may lead to over-optimization of certain metrics. To address this, we propose Parrot, which casts the problem as multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate Pareto-optimal solutions. Using batch-wise Pareto-optimal selection, Parrot automatically identifies the best trade-off among the different rewards. We apply this multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, yielding significant improvements in image quality and allowing the trade-off between rewards to be controlled at inference time through a reward-related prompt. Furthermore, we introduce original-prompt-centered guidance at inference time to ensure fidelity to the user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.
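
The batch-wise Pareto-optimal selection mentioned in the abstract can be viewed as a non-dominated filter over per-sample reward vectors within a batch. The snippet below is a minimal, generic sketch of that selection principle (assuming NumPy), not the paper's actual training pipeline; the reward names and score values are hypothetical.

```python
import numpy as np

def pareto_optimal_mask(rewards: np.ndarray) -> np.ndarray:
    """Return a boolean mask marking the non-dominated samples in a batch.

    rewards: array of shape (batch_size, num_rewards); higher is better.
    A sample is non-dominated if no other sample is at least as good on
    every reward and strictly better on at least one.
    """
    n = rewards.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Check whether any other sample dominates sample i.
        dominated = np.all(rewards >= rewards[i], axis=1) & np.any(rewards > rewards[i], axis=1)
        if dominated.any():
            mask[i] = False
    return mask

# Hypothetical scores for 4 generated images under 3 rewards
# (e.g. aesthetics, human preference, text-image alignment).
scores = np.array([
    [0.8, 0.6, 0.7],
    [0.9, 0.5, 0.7],
    [0.7, 0.7, 0.6],
    [0.6, 0.4, 0.5],  # dominated by the first sample on every reward
])
print(pareto_optimal_mask(scores))  # [ True  True  True False]
```

In a multi-reward RL setting, such a mask could be used to restrict policy updates to the non-dominated samples of each batch, avoiding a hand-tuned weighting of the individual rewards.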



Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXXVIII
September 2024, 583 pages
ISBN: 978-3-031-72919-5
DOI: 10.1007/978-3-031-72920-1
  • Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 October 2024

Qualifiers

  • Article
