
Parrot: Pareto-Optimal Multi-reward Reinforcement Learning Framework for Text-to-Image Generation

Published: 01 October 2024

Abstract

Recent works have demonstrated that reinforcement learning (RL) with multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting the reward weights is challenging and may lead to over-optimization of certain metrics. To address this, we propose Parrot, which casts the problem as multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate Pareto-optimal solutions. Using batch-wise Pareto-optimal selection, Parrot automatically identifies the best trade-off among the different rewards. We apply this multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, yielding significant improvements in image quality and allowing the trade-off between rewards to be controlled at inference time through a reward-related prompt. Furthermore, we introduce original-prompt-centered guidance at inference time to ensure fidelity to the user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.
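
The batch-wise Pareto-optimal selection mentioned in the abstract can be viewed as a non-dominated filter over per-sample reward vectors within a batch. The snippet below is a minimal, generic sketch of that selection principle (assuming NumPy), not the paper's actual training pipeline; the reward names and score values are hypothetical.

```python
import numpy as np

def pareto_optimal_mask(rewards: np.ndarray) -> np.ndarray:
    """Return a boolean mask marking the non-dominated samples in a batch.

    rewards: array of shape (batch_size, num_rewards); higher is better.
    A sample is non-dominated if no other sample is at least as good on
    every reward and strictly better on at least one.
    """
    n = rewards.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Check whether any other sample dominates sample i.
        dominated = np.all(rewards >= rewards[i], axis=1) & np.any(rewards > rewards[i], axis=1)
        if dominated.any():
            mask[i] = False
    return mask

# Hypothetical scores for 4 generated images under 3 rewards
# (e.g. aesthetics, human preference, text-image alignment).
scores = np.array([
    [0.8, 0.6, 0.7],
    [0.9, 0.5, 0.7],
    [0.7, 0.7, 0.6],
    [0.6, 0.4, 0.5],  # dominated by the first sample on every reward
])
print(pareto_optimal_mask(scores))  # [ True  True  True False]
```

In a multi-reward RL setting, such a mask could be used to restrict policy updates to the non-dominated samples of each batch, avoiding a hand-tuned weighting of the individual rewards.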



Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXXVIII
September 2024, 583 pages
ISBN: 978-3-031-72919-5
DOI: 10.1007/978-3-031-72920-1
  • Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 October 2024

Qualifiers

  • Article
