
PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation

Published: 01 January 2024

Abstract

Generative text-to-image models have gained wide popularity for their ability to generate high-quality images from natural language prompts. However, developing effective prompts for desired images can be challenging due to the complexity and ambiguity of natural language. This research proposes PromptMagician, a visual analysis system that helps users explore image results and refine input prompts. The backbone of our system is a prompt recommendation model that takes user prompts as input, retrieves similar prompt-image pairs from DiffusionDB, and identifies special (important and relevant) prompt keywords. To facilitate interactive prompt refinement, PromptMagician introduces a multi-level visualization for the cross-modal embedding of the retrieved images and recommended keywords, and supports users in specifying multiple criteria for personalized exploration. Two usage scenarios, a user study, and expert interviews demonstrate the effectiveness and usability of our system, suggesting that it facilitates prompt engineering and improves the creativity support of generative text-to-image models.
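
The abstract describes a retrieve-then-recommend pipeline: embed the user prompt, fetch similar prompt-image pairs from DiffusionDB, and surface keywords that are over-represented in the retrieved prompts. The Python sketch below illustrates that general idea only; the hashed bag-of-words embed() stand-in, the recommend() helper, and the TF-IDF-style keyword score are illustrative assumptions, not the paper's implementation, which relies on a learned cross-modal embedding.

# Minimal sketch (not the authors' code) of the retrieve-then-recommend idea:
# embed the user prompt, rank a DiffusionDB-like prompt collection by
# similarity, and score keywords that are frequent in the retrieved prompts
# but rare in the collection overall.
import math
from collections import Counter

def embed(text: str, dim: int = 256) -> list[float]:
    # Stand-in embedding (hashed bag of words); a real system would use a
    # cross-modal text/image encoder such as CLIP.
    v = [0.0] * dim
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-9)

def recommend(user_prompt: str, corpus: list[str], k: int = 50, top_terms: int = 10):
    # Retrieve the k prompts most similar to the user prompt.
    q = embed(user_prompt)
    retrieved = sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

    # Keyword importance: term frequency within the retrieved prompts weighted
    # by inverse document frequency over the whole collection (TF-IDF style).
    df = Counter(t for p in corpus for t in set(p.lower().split()))
    tf = Counter(t for p in retrieved for t in p.lower().split())
    n = len(corpus)
    scores = {t: tf[t] * math.log(n / (1 + df[t])) for t in tf}
    keywords = sorted(scores, key=scores.get, reverse=True)[:top_terms]
    return retrieved, keywords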

References

[1]
P. Achlioptas, M. Ovsjanikov, K. Haydarov, M. Elhoseiny, and L. J. Guibas. ArtEmis: Affective language for visual art. In Proc. CVPR, pp. 11569–11579. IEEE/CVF, Piscataway, 2021.
[2]
D. Bertucci et al. DendroMap: Visual exploration of large-scale image datasets for machine learning with treemaps. IEEE Transactions on Visualization and Computer Graphics, 29 (1): pp. 320–330, 2022.
[3]
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al. Language models are few-shot learners. In Proc. NeurIPS, vol. 33, pp. 1877–1901, 2020.
[4]
T. Büring, J. Gerken, and H. Reiterer. User interaction with scatterplots on small screens - A comparative evaluation of geometric-semantic zoom and fisheye distortion. IEEE Transactions on Visualization and Computer Graphics, 12 (5): pp. 829–836, 2006.
[5]
P. J. Chambon, C. Bluethgen, C. P. Langlotz, and A. Chaudhari. Adapting pretrained vision-language foundational models to medical imaging domains. arXiv, 2022.
[6]
C. Chen, J. Wu, X. Wang, S. Xiang, S.-H. Zhang, Q. Tang, and S. Liu. Towards better caption supervision for object detection. IEEE Transactions on Visualization and Computer Graphics, 28 (4): pp. 1941–1954, 2022.
[7]
Z. Chen and H. Xia. CrossData: Leveraging text-data connections for authoring data documents. In Proc. CHI, pp. 95:1–95:15. ACM, NY, 2022.
[8]
M. Cheon, S. Yoon, B. Kang, and J. Lee. Perceptual image quality assessment with transformers. In Proc. CVPR, pp. 433–442. IEEE/CVF, Piscataway, 2021.
[9]
E. Cherry and C. Latulipe. Quantifying the creativity support of digital tools through the creativity support index. ACM TOCHI, 21 (4): pp. 1–25, 2014.
[10]
J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT, pp. 4171–4186. ACL, Stroudsburg, 2019.
[11]
Y. Feng, J. Chen, K. Huang, J. K. Wong, H. Ye, W. Zhang, R. Zhu, X. Luo, and W. Chen. iPoet: Interactive painting poetry creation with visual multimodal analysis. Journal of Visualization, 25 (3): pp. 671–685, Jun 2022.
[12]
Y. Feng, X. Wang, B. Pan, K. K. Wong, Y. Ren, S. Liu, Z. Yan, Y. Ma, H. Qu, and W. Chen. XNLI: Explaining and diagnosing NLI-based visual data analysis. IEEE Transactions on Visualization and Computer Graphics, pp. 1–14, 2023.
[13]
T. Gao, M. Dontcheva, E. Adar, Z. Liu, and K. G. Karahalios. DataTone: Managing ambiguity in natural language interfaces for data visualization. In Proc. UIST, pp. 489–500. ACM, NY, 2015.
[14]
T. Gao, A. Fisch, and D. Chen. Making pre-trained language models better few-shot learners. In Proc. ACL/IJCNLP, pp. 3816–3830. ACL, Stroudsburg, 2021.
[15]
Y. Hao, Z. Chi, L. Dong, and F. Wei. Optimizing prompts for text-to-image generation. arXiv, 2022.
[16]
J. He, X. Wang, K. K. Wong, X. Huang, C. Chen, Z. Chen, F. Wang, M. Zhu, and H. Qu. VideoPro: A visual analytics approach for interactive video programming. arXiv, 2023.
[17]
J. Ho et al. Imagen Video: High definition video generation with diffusion models. arXiv, 2022.
[18]
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Proc. NeurIPS, vol. 33, pp. 6840–6851, 2020.
[19]
Z. Jiang, F. F. Xu, J. Araki, and G. Neubig. How can we know what language models know? Transactions of the ACL, 8: pp. 423–438, 2020.
[20]
B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. In Proc. EMNLP, pp. 3045–3059. ACL, Stroudsburg, 2021.
[21]
P. P. Liang, Y. Lyu, G. Chhablani, N. Jain, Z. Deng, X. Wang, L.-P. Morency, and R. Salakhutdinov. MultiViz: Towards visualizing and understanding multimodal models. In Proc. ICLR, 2022.
[22]
Y. Lin, K. Wong, Y. Wang, R. Zhang, B. Dong, H. Qu, and Q. Zheng. TaxThemis: Interactive mining and exploration of suspicious tax evasion groups. IEEE Transactions on Visualization and Computer Graphics, 27 (2): pp. 849–859, 2021.
[23]
V. Liu and L. B. Chilton. Design guidelines for prompt engineering text-to-image generative models. In Proc. CHI, pp. 1–23, 2022.
[24]
V. Liu, H. Qiao, and L. Chilton. Opal: Multimodal image generation for news illustration. In Proc. UIST, pp. 384:1–384:23. ACM, NY, 2022.
[25]
W. J. Longabaugh. Combing the hairball with BioFabric: A new approach for visualization of large networks. BMC Bioinformatics, 13 (1): pp. 1–16, 2012.
[26]
E. Loper and S. Bird. NLTK: The Natural Language Toolkit. arXiv, cs.CL/0205028, 2002.
[27]
F. D. Luca, M. I. Hossain, S. G. Kobourov, and K. Börner. Multi-level tree based approach for interactive graph visualization with semantic zoom. arXiv, 2019.
[28]
S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Proc. NeurIPS, 30: pp. 4765–4774, 2017.
[29]
S. L'Yi, Q. Wang, F. Lekschas, and N. Gehlenborg. Gosling: A grammar-based toolkit for scalable and interactive genomics data visualization. IEEE Transactions on Visualization and Computer Graphics, 28 (1): pp. 140–150, 2022.
[30]
C. Ma, C. Yang, X. Yang, and M. Yang. Learning a no-reference quality metric for single-image super-resolution. Computer Vision and Image Understanding, 158: pp. 1–16, 2017.
[31]
J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press, 1967.
[32]
M. Mori, K. F. MacDorman, and N. Kageki. The uncanny valley [from the field]. IEEE Robotics & Automation Magazine, 19 (2): pp. 98–100, 2012.
[33]
A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proc. ICML, vol. 162, pp. 16784–16804. PMLR, 2022.
[34]
OpenAI. GPT-4 technical report. arXiv, 2023.
[35]
J. Oppenlaender. Prompt engineering for text-based generative art. arXiv, 2022.
[36]
J. Oppenlaender. A taxonomy of prompt modifiers for text-to-image generation. arXiv, 2022.
[37]
L. Ouyang et al. Training language models to follow instructions with human feedback. arXiv, 2022.
[38]
X. Pan, A. Tewari, T. Leimkühler, L. Liu, A. Meka, and C. Theobalt. Drag Your GAN: Interactive point-based manipulation on the generative image manifold. In Proc. SIGGRAPH. ACM, NY, 2023.
[39]
R. Panda, J. Zhang, H. Li, J. Lee, X. Lu, and A. K. Roy-Chowdhury. Contemplating visual emotions: Understanding and overcoming dataset bias. In Proc. ECCV, pp. 594–612. Springer, Berlin, 2018.
[40]
K. Perlin and D. Fox. Pad: An alternative approach to the computer interface. In Proc. SIGGRAPH, pp. 57–64. ACM, NY, 1993.
[41]
A. Radford et al. Learning transferable visual models from natural language supervision. In Proc. ICML, vol. 139, pp. 8748–8763. PMLR, 2021.
[42]
A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv, 2022.
[43]
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, pp. 10684–10695. IEEE, Piscataway, 2022.
[44]
T. L. Scao and A. M. Rush. How many data points is a prompt worth? In Proc. NAACL-HLT, pp. 2627–2636. ACL, Stroudsburg, 2021.
[45]
T. Schick and H. Schütze. It's not just size that matters: Small language models are also few-shot learners. In Proc. NAACL-HLT, pp. 2339–2352. ACL, Stroudsburg, 2021.
[46]
T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proc. EMNLP, pp. 4222–4235. ACL, Stroudsburg, 2020.
[47]
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proc. ICML, vol. 37, pp. 2256–2265. PMLR, 2015.
[48]
J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In Proc. ICLR, 2021.
[49]
K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28 (1): pp. 11–21, 1972.
[50]
A. Srinivasan and J. Stasko. How to ask what to say?: Strategies for evaluating natural language interfaces for data visualization. IEEE Computer Graphics and Applications, 40 (4): pp. 96–103, 2020.
[51]
H. Strobelt, A. Webson, V. Sanh, B. Hoover, J. Beyer, H. Pfister, and A. M. Rush. Interactive and visual prompt engineering for ad-hoc task adaptation with large language models. IEEE Transactions on Visualization and Computer Graphics, 29 (1): pp. 1146–1156, 2022.
[52]
L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9 (11): pp. 2579–2605, 2008.
[53]
J. Wang, K. C. Chan, and C. C. Loy. Exploring CLIP for assessing the look and feel of images. In Proc. AAAI. AAAI Press, Menlo Park, CA, 2022.
[54]
X. Wang, F. Cheng, Y. Wang, K. Xu, J. Long, H. Lu, and H. Qu. Interactive data analysis with next-step natural language query recommendation. arXiv, 2022.
[55]
X. Wang, J. He, Z. Jin, M. Yang, Y. Wang, and H. Qu. M2Lens: Visualizing and explaining multimodal models for sentiment analysis. IEEE Transactions on Visualization and Computer Graphics, 28 (1): pp. 802–812, 2021.
[56]
X. Wang, Z. Wu, W. Huang, Y. Wei, Z. Huang, M. Xu, and W. Chen. Vis+AI: Integrating visualization with artificial intelligence for efficient data analysis. Frontiers of Computer Science, 17 (6): pp. 1–12, 2023.
[57]
Y. Wang, S. Shen, and B. Y. Lim. RePrompt: Automatic prompt editing to refine AI-generative art towards precise expressions. arXiv, 2023.
[58]
Z. J. Wang, E. Montoya, D. Munechika, H. Yang, B. Hoover, and D. H. Chau. DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models. arXiv, 2022.
[59]
L. Weng, M. Zhu, K. K. Wong, S. Liu, J. Sun, H. Zhu, D. Han, and W. Chen. Towards an understanding and explanation for mixed-initiative artificial scientific text detection. arXiv, 2023.
[60]
K. K. Wong, X. Wang, Y. Wang, J. He, R. Zhang, and H. Qu. Anchorage: Visual analysis of satisfaction in customer service videos via anchor events. IEEE Transactions on Visualization and Computer Graphics, pp. 1–13, 2023.
[61]
T. Wu, E. Jiang, A. Donsbach, J. Gray, A. Molina, M. Terry, and C. J. Cai. PromptChainer: Chaining large language model prompts through visual programming. In Proc. CHI, pp. 1–10. ACM, NY, 2022.
[62]
T. Wu, M. Terry, and C. J. Cai. AI Chains: Transparent and controllable human-AI interaction by chaining large language model prompts. In Proc. CHI, pp. 385:1–385:22. ACM, NY, 2022.
[63]
J. Xia, L. Huang, W. Lin, X. Zhao, J. Wu, Y. Chen, Y. Zhao, and W. Chen. Interactive visual cluster analysis by contrastive dimensionality reduction. IEEE Transactions on Visualization and Computer Graphics, 29 (1): pp. 734–744, 2022.
[64]
X. Xie, X. Cai, J. Zhou, N. Cao, and Y. Wu. A semantic-based method for visualizing large image collections. IEEE Transactions on Visualization and Computer Graphics, 25 (7): pp. 2362–2377, 2018.
[65]
J. Yang, J. Fan, D. Hubball, Y. Gao, H. Luo, W. Ribarsky, and M. Ward. Semantic image browser: Bridging information visualization with automated intelligent image analysis. In Proc. VAST, pp. 191–198. IEEE Computer Society, Los Alamitos, 2006.
[66]
L. Yang, C. Xiong, J. K. Wong, A. Wu, and H. Qu. Explaining with examples: Lessons learned from crowdsourced introductory description of information visualizations. IEEE Transactions on Visualization and Computer Graphics, 29 (3): pp. 1638–1650, 2023.
[67]
P. Ye, J. Kumar, L. Kang, and D. S. Doermann. Unsupervised feature learning framework for no-reference image quality assessment. In Proc. CVPR, pp. 1098–1105. IEEE/CVF, Piscataway, 2012.
[68]
H. Zeng, X. Wang, Y. Wang, A. Wu, T.-C. Pong, and H. Qu. GestureLens: Visual analysis of gestures in presentation videos. IEEE Transactions on Visualization and Computer Graphics, 2022.
[69]
H. Zeng, X. Wang, A. Wu, Y. Wang, Q. Li, A. Endert, and H. Qu. EmoCo: Visual analysis of emotion coherence in presentation videos. IEEE Transactions on Visualization and Computer Graphics, 26 (1): pp. 927–937, 2019.
[70]
H. Zhang et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proc. ICCV. IEEE, Piscataway, 2017.
[71]
L. Zhang and M. Agrawala. Adding conditional control to text-to-image diffusion models. arXiv, 2023.
[72]
L. Zhang, L. Zhang, X. Mou, and D. Zhang. FSIM: A feature similarity index for image quality assessment. IEEE Transactions on Image Processing, 20 (8): pp. 2378–2386, 2011.
[73]
W. Zhang, J. K. Wong, Y. Chen, A. Jia, L. Wang, J.-W. Zhang, L. Cheng, and W. Chen. ScrollTimes: Tracing the provenance of paintings as a window into history. arXiv, 2023.
[74]
W. Zhang, J. K. Wong, X. Wang, Y. Gong, R. Zhu, K. Liu, Z. Yan, S. Tan, H. Qu, S. Chen, and W. Chen. CohortVA: A visual analytic system for interactive exploration of cohorts based on historical data. IEEE Transactions on Visualization and Computer Graphics, 29 (1): pp. 756–766, 2023.
[75]
W. Zhang, J.-W. Zhang, K. K. Wong, Y. Wang, Y. Feng, L. Wang, and W. Chen. Computational approaches for traditional Chinese painting: From the “six principles of painting” perspective. arXiv, 2023.
[76]
J. Zhou, X. Wang, J. K. Wong, H. Wang, Z. Wang, X. Yang, X. Yan, H. Feng, H. Qu, H. Ying, and W. Chen. DPVisCreator: Incorporating pattern constraints to privacy-preserving visualizations via differential privacy. IEEE Transactions on Visualization and Computer Graphics, 29 (1): pp. 809–819, 2023.
[77]
H. Zhu, M. Zhu, Y. Feng, D. Cai, Y. Hu, S. Wu, X. Wu, and W. Chen. Visualizing large-scale high-dimensional data via hierarchical embedding of kNN graphs. Visual Informatics, 5 (2): pp. 51–59, 2021.
[78]
M. Zhu, P. Pan, W. Chen, and Y. Yang. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proc. CVPR, pp. 5802–5810. IEEE/CVF, Piscataway, 2019.



Published In

IEEE Transactions on Visualization and Computer Graphics, Volume 30, Issue 1, January 2024, 1456 pages

Publisher: IEEE Educational Activities Department, United States

Cited By

• (2024) MemoVis: A GenAI-Powered Tool for Creating Companion Reference Images for 3D Design Feedback. ACM Transactions on Computer-Human Interaction, 31(5): 1–41. https://doi.org/10.1145/3694681. Online publication date: 4-Sep-2024.
• (2024) StyleFactory: Towards Better Style Alignment in Image Creation through Style-Strength-Based Control and Evaluation. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, pp. 1–15. https://doi.org/10.1145/3654777.3676370. Online publication date: 13-Oct-2024.
• (2024) PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1–21. https://doi.org/10.1145/3613904.3642803. Online publication date: 11-May-2024.
• (2024) IntentTuner: An Interactive Framework for Integrating Human Intentions in Fine-tuning Text-to-Image Generative Models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1–18. https://doi.org/10.1145/3613904.3642165. Online publication date: 11-May-2024.
• (2024) ModalChorus: Visual Probing and Alignment of Multi-Modal Embeddings via Modal Fusion Map. IEEE Transactions on Visualization and Computer Graphics, 31(1): 294–304. https://doi.org/10.1109/TVCG.2024.3456387. Online publication date: 9-Sep-2024.
• (2024) An Empirical Evaluation of the GPT-4 Multimodal Language Model on Visualization Literacy Tasks. IEEE Transactions on Visualization and Computer Graphics, 31(1): 1105–1115. https://doi.org/10.1109/TVCG.2024.3456155. Online publication date: 10-Sep-2024.
• (2024) Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning. IEEE Transactions on Visualization and Computer Graphics, 30(6): 2875–2887. https://doi.org/10.1109/TVCG.2024.3388514. Online publication date: 16-Apr-2024.
• (2024) Visualization for Trust in Machine Learning Revisited: The State of the Field in 2023. IEEE Computer Graphics and Applications, 44(3): 99–113. https://doi.org/10.1109/MCG.2024.3360881. Online publication date: 1-May-2024.
