Abstract
Humankind is entering a novel creative era in which anybody can synthesise digital information using generative artificial intelligence (AI). Text-to-image generation, in particular, has become vastly popular, and millions of practitioners produce AI-generated images and AI art online. This chapter first gives an overview of the key developments that enabled a healthy co-creative online ecosystem around text-to-image generation to rapidly emerge, followed by a high-level description of key elements in this ecosystem. A particular focus is placed on prompt engineering, a creative practice that has been embraced by the AI art community. It is then argued that the emerging co-creative ecosystem constitutes an intelligent system on its own, a system that both supports human creativity and potentially entraps future generations and limits future development efforts in AI. The chapter discusses the potential risks and dangers of cultivating this co-creative ecosystem, such as the bias inherent in today’s training data, potential quality degradation in future image generation systems due to synthetic data becoming commonplace, and the potential long-term effects of text-to-image generation on people’s imagination, ambitions, and development.
Introduction
Generative artificial intelligence (AI) has taken the world by storm. Using deep generative models, anybody can conjure up digital information from short descriptive text prompts. Text-to-image synthesis, in particular, has become a popular means for generating digital images (Crowson et al., 2022; Rombach et al., 2022). Millions of people use generative systems and text-to-image services available online, such as Midjourney, Stable Diffusion (Rombach et al., 2021), and DALL-E 2 (Ramesh et al., 2022), for both professional and recreational uses. With this powerful generative technology at our fingertips, humankind is entering a new era, an era in which visual imagery no longer necessarily reflects the effort put into creating it (Oppenlaender, 2022).
This chapter first gives an overview of the key technical developments that enabled a co-creative ecosystem around text-to-image generation to rapidly emerge and expand in 2021 and 2022. This is followed by a high-level description of key elements in the ecosystem and its practices. Focus is placed on prompt engineering, a method and creative practice that has proven useful in a broad set of application areas, but has been particularly embraced by the community of text-to-image generation practitioners. It is then argued that the creative online ecosystem constitutes an intelligent system of its own—a system that both enables yet also potentially limits the creative potential of future generations of humans and machine learning (ML) systems. In the chapter, the author discusses some potential risks and dangers of cultivating this co-creative ecosystem. Risks include: the threat of bias due to Western ways of seeing that are encoded in training data; quality degradation due to synthetic data being used for training future generative systems; and the potential long-term effects of text-to-image generation on people’s creativity, imagination, and development.
Background on Text-to-Image Generation
The history of computer-generated art and “generative art” (Boden & Edmonds, 2009; Galanter, 2016) goes back to the first experiments with AI (Cohen, 1979). Looking back, the first attempts to synthesise images from text were humble, but they already showed great promise. The synthetic images presented by Mansimov et al. (2016), for instance, were tiny in size (e.g., a 32 × 32 pixel resolution image of a “green school bus”). Today, text-guided synthesis of images has made a giant leap towards becoming a mainstream phenomenon (Olson, 2022). Within less than a year, Midjourney’s Discord community has grown to over 10 million users, making Midjourney the largest Discord community to date. Besides more powerful graphics processing units (GPUs), a few particularly important inventions advanced the field of text-to-image generation. This section gives a brief overview of the recent technical developments that enabled and fuelled the meteoric rise of text-to-image generation.
The invention of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) was a watershed moment in advancing image generation. GANs are a type of deep learning architecture consisting of two antagonistic parts: a generator and a discriminator. During training, the generator presents the discriminator with synthetic images. The discriminator judges these images, and the process continues iteratively until the discriminator can no longer tell the synthetic images apart from real images, such as those in the training data. Using a text-conditioned GAN architecture, Reed et al. (2016) pioneered the approach of synthesising images from text. The approach was extended in January 2021 with OpenAI’s DALL-E (Ramesh et al., 2022), a neural network trained on text-image pairs that was able to synthesise images from text captions for a wide range of concepts expressible in natural language. In parallel, OpenAI presented CLIP (Radford et al., 2021), a contrastive language-vision model originally conceived for the task of classifying images. CLIP was trained on a large corpus of image-text pairs scraped from the World Wide Web. Due to the large size of its training data, the CLIP model learned a wide variety of visual concepts from natural language supervision, which proved useful for tasks that visually associate language with images. OpenAI released the CLIP model weights, but not its training corpus, which spurred efforts to replicate both CLIP and its training data.
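To make the adversarial training dynamic concrete, the following minimal sketch (in PyTorch) trains a toy generator and discriminator against each other on stand-in data. It is an illustration of the GAN principle only; the network sizes, data, and hyperparameters are invented for the example and do not correspond to any system discussed in this chapter.

```python
# Minimal GAN training loop (illustrative sketch; sizes and data are invented).
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64

generator = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
discriminator = nn.Sequential(
    nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(32, data_dim) + 2.0         # stand-in for real training images
    fake = generator(torch.randn(32, latent_dim))  # synthetic images from noise

    # The discriminator judges real vs. synthetic samples.
    d_loss = (bce(discriminator(real), torch.ones(32, 1))
              + bce(discriminator(fake.detach()), torch.zeros(32, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # The generator tries to fool the discriminator into labelling fakes as real.
    g_loss = bce(discriminator(fake), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```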
It was the release of the CLIP model weights in January 2021 that resulted in immense technical progress in the field of AI-generated imagery. The weights found their first significant application in an image generation system called “The Big Sleep” by Ryan Murdock (Colton et al., 2021; Murdock & Wang, 2021). In Murdock’s architecture, the generator is a model called BigGAN, and CLIP guides the generation process with text. This inspired Katherine Crowson to connect a more powerful neural network (VQGAN) with CLIP (Crowson et al., 2022). The VQGAN–CLIP architecture became very popular in 2021 and was instrumental in advancing the emerging field of text-to-image generation (Crowson et al., 2022). The source code of VQGAN–CLIP was available online, and many generative architectures for synthesising digital images and artworks have since been developed based on the work by Murdock and Crowson. GANs were later superseded by diffusion-based systems (Dhariwal & Nichol, 2021). Diffusion models are a class of ML models trained by incrementally adding noise to the training data, with the objective of learning to reverse the noising process and restore the original image. Once trained, these models use the learned denoising to synthesise novel, noise-free images from random input.
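The noising process described above can be stated compactly. The sketch below implements the standard DDPM-style forward process, in which a clean image is progressively mixed with Gaussian noise according to a variance schedule; a diffusion model is trained to undo exactly this corruption. The schedule values and image shape are illustrative.

```python
# Forward (noising) process of a diffusion model (DDPM-style sketch).
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # variance schedule (illustrative values)
alpha_bars = np.cumprod(1.0 - betas)  # cumulative signal-retention factors

def noise_image(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0): mix the clean image x0 with Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps  # a diffusion model is trained to predict eps from (x_t, t)

x0 = np.zeros((64, 64, 3))            # stand-in for a clean training image
x_mid, _ = noise_image(x0, t=500)     # partially noised
x_end, _ = noise_image(x0, t=T - 1)   # almost pure noise; generation reverses this
```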
Today, practitioners can choose from a large variety of diffusion-based generative systems. Some of these systems are available as open source, such as Stable Diffusion (Rombach et al., 2021), while others are available as online services, such as Midjourney and DALL-E 2. Due to their low barrier to entry and ease of use, Google Colaboratory (Colab) notebooks contributed to a democratisation of digital art production. Anybody can create digital images and artworks with text-to-image generation systems (Oppenlaender, 2022), which draws a parallel between this novel technology and photography.
Text-to-Image Generation as the New Photography
As a novel phenomenon and emerging technology, text-to-image generation—and generative AI in general—can be compared to past disruptive and transformative technologies. For instance, the invention of Gutenberg’s printing press reduced the cost of printing, revolutionised the spread of knowledge, and had a profound impact on human development (Eisenstein, 1980). Generative AI could have similar transformative effects on society. This section briefly discusses the parallels between the invention of photography and text-to-image generation, followed by some current criticisms of text-to-image generation.
Parallels Between Photography and Text-to-Image Generation
When photography was invented, critics argued against the new technology. Photography’s early critics saw the novel technology not as a new medium for creative expression, but as a direct threat to the livelihood of artists. As Hertzmann (2018) points out, “[m]any artists were dismissive of photography, and saw it as a threat to ‘real art’”. For instance, upon seeing a demonstration of the daguerreotype technique, the painter Paul Delaroche declared: “From today, painting is dead!” (Hertzmann, 2018). Photography, as a mechanical way of capturing reality, was seen as standing in direct competition with realism, an art style that aims to depict nature accurately and in great detail. But over the years, photography evolved into a medium for artistic expression, a medium that allows for the creation of unique images that were previously unimaginable. Ultimately, the invention of photography caused painters to innovate their craft, moving from realism to abstraction (Hertzmann, 2018, 2020).
Text-to-image generation, as a novel artistic medium (Marche, 2022), offers equally exciting new possibilities for artistic expression and creativity. The emerging technology has the potential to revolutionise the way we work creatively and may fundamentally impact our relationship to digital media. In many ways, text-to-image generation is the new photography, democratising access to creative expression that was previously limited to highly talented creative individuals. Text-to-image generation, however, also poses many complex socio-technical challenges with no easy solution. While it is undeniable that AI algorithms can generate fascinating images, it is important to consider the potential criticisms of this technology.
Criticisms About Text-to-Image Generation
Today’s concerns and criticisms about text-to-image generation resemble those raised when photography was invented (Hertzmann, 2018). This chapter summarises some of the key concerns and criticisms about generative AI, and text-to-image generation in particular. As we continue to develop and explore generative AI, it will be crucial to address these concerns and ensure that the technology is used in a responsible and ethical way.
One of the main criticisms about text-to-image generation is that it threatens to automate and replace human cognitive and creative work. Generative AI is much faster and cheaper than human work, and it does not require much skill and effort to prompt generative systems (Oppenlaender, 2022). This could lead companies to stop hiring or contracting knowledge workers and creatives, such as illustrators, designers, and artists (Mok, 2023; Oppenlaender et al., 2023).
Another concern brought forth against the technology of text-to-image generation is that it is built on top of centuries of human-made creative works. Generative AI is data hungry (Goldberg, 2023). Whether it is literature used for training large language models or images used for training text-to-image models, generative models learn patterns from vast “Web-scale” datasets scraped from the World Wide Web. Critics of the technology argue that there are insurmountable legal issues concerning the use of this training data. Proponents, on the other hand, argue that using scraped training data falls under the U.S. fair use doctrine (Justia, 2022). The fair use doctrine has opened a loophole in which commercial organisations fund development efforts in non-profit organisations, which are allowed to scrape data and train models under fair and academic use.
These commercial organisations are essentially using non-profit organisations as a shield from litigation and have been accused of data laundering (Baio, 2022) and deceptive trade practices (Justia, 2023). For instance, Microsoft has received criticism for training its Copilot system on code hosted on GitHub (Kuhn, 2022). GitHub is a large repository of software licensed under various terms, including copyleft licenses, which mandate that derivatives of the software be distributed under the same license, and permissive licenses such as the MIT license, which require attribution. Copilot cannot, by design, guarantee that these license terms are adhered to, and due to the opaqueness of the neural network, it is not possible to trace the generated information to its source. Similar concerns were raised about MegaFace, a large database of facial images scraped from Flickr (Nech & Kemelmacher-Shlizerman, 2017). The database was used by numerous commercial organisations to advance the state of facial recognition for various purposes, including surveillance (Harvey & LaPlace, 2021). Legal battles between generative AI firms and content creators, such as the class action lawsuit against Microsoft and OpenAI (Vincent, 2022b) and the lawsuit of Getty Images against Stability AI (Justia, 2023), are also battles about the future direction of the creative industries.
The emergence of generative AI as a novel and impactful technology is expected to significantly disrupt the current status quo, creating frictions and challenges in relation to existing societal norms and policy frameworks (Toews, 2022). Generative AI is a direct threat to some business models. For instance, Google’s search engine business may be affected if consumer preferences shift from search engines to query-answering language models. Stack Overflow is another company that may be heavily impacted by generative AI. Answers provided by generative AI undermine Stack Overflow’s community, which consists of human volunteers providing human-written answers to human-written questions. The company has banned generative AI from its websites (Stackoverflow.com, 2022). A similar ban was instituted, and legal action taken, by the stock photography website Getty Images (Justia, 2023; Vincent, 2022a, n.d.). The above examples highlight the cross-sectional and complex impact of generative AI on a variety of businesses and entire industries.
Another criticism of generative AI is that it may violate people’s privacy. For instance, diffusion models have been shown to memorise and replicate training data (Carlini et al., 2023; Somepalli et al., 2022). That means diffusion models could reproduce near-exact matches of instances found in the training data. This could not only have copyright implications, but also lead to the unintentional release of private data. For instance, LAION, the large dataset used to train Stable Diffusion, was shown to contain medical images of patients that were included without the patients’ consent (Edwards, 2022a). Memorisation is a fundamental issue in diffusion models and may even be necessary for generative models to generalise (Feldman, 2019).
On a less technical level, another criticism about text-to-image generation is that it could erode human creativity and artistic expression. Critics argue that by relying on algorithms and ML to generate images, we are losing touch with the human elements that make art so special. Critics worry that AI art may become a sterile and impersonal medium, devoid of the emotional depth and individuality that characterise great human art (Oppenlaender et al., 2023). Another related concern is that AI art could lead to the homogenisation of artistic styles and forms. Because generative models are trained on vast datasets of images, they tend to produce art that is similar to what has come before. Midjourney, for instance, produces images with a recognisable style. Current generative systems “lack a concept of novelty regarding how their product differs from previously created ones” (Zammit et al., 2022, p. 1). This could lead to a situation where all AI art begins to look the same, with little variation or originality.
Another concern is that text-to-image generation could be used to create false or misleading images (Oppenlaender et al., 2023). With the ability to generate highly realistic images, AI algorithms could be used to create fake photographs or other forms of visual media for the purpose of spreading misinformation. This could have serious implications for the reliability of information and the trustworthiness of digital media that we encounter online.
Even proponents of text-to-image generation have voiced concerns about the technology. Given the improvements in the latest versions of text-to-image models, some practitioners of AI art have complained that it is becoming “too easy” to conjure up images from text prompts (Edwards, 2022b; Oppenlaender, 2022). This raises an interesting question about the optimal skill requirement for text-to-image generation and AI-generated art. If prompting is too difficult, users will not be able to communicate their intent to the model and may become frustrated. On the other hand, if prompting is too easy, users will not have a strong sense of ownership and will feel that the generated images do not reflect their intent; rather, the sentiment will be that the images are merely retrieved from a pre-made collection of images. Current generative systems vary in this respect. Stable Diffusion, for instance, requires more effort to be put into writing prompts than Midjourney. This effort put into writing prompts is part of the novel practice of prompt engineering, which is discussed in the following section.
The Creative Practice of Prompt Engineering
Prompt engineering (Liu & Chilton, 2022; Oppenlaender, 2023), or prompting for short, is an interaction pattern in which ML models are given text as input (Brown et al., 2020). Prompt engineering represents a paradigm shift in how ML models are adapted to various downstream tasks: instead of retraining or fine-tuning the model, the model is prompted with context. In zero-shot prompting (Kojima et al., 2022), the user directly prompts the generative model, whereas in few-shot prompting (Brown et al., 2020), the user first provides a few examples to give context to the model. Zero-shot prompting, in particular, has found an ideal application ground in AI-generated art. In the context of text-to-image generation, prompt engineering means that “carefully selected and composed sentences are used to achieve a certain visual style in the synthesized image” (Rombach et al., 2022, p. 2).
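The difference between zero-shot and few-shot prompting is easiest to see in the prompts themselves. In the sketch below, `generate` is a hypothetical stand-in for a call to any generative model; only the structure of the two prompts matters. The translation example follows the style of the few-shot demonstrations in Brown et al. (2020).

```python
# Zero-shot vs. few-shot prompting, shown at the level of prompt structure.
# `generate` is a hypothetical stand-in for a call to a generative model.

def generate(prompt: str) -> str:
    return f"<model output for: {prompt!r}>"

# Zero-shot: the user prompts the model directly, without examples.
zero_shot_answer = generate("Translate English to French: cheese =>")

# Few-shot: a few demonstrations give the model context before the actual query.
few_shot_answer = generate(
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)
```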
Prompt engineering is not an exact engineering science of the kind found in science, technology, engineering, and mathematics (STEM) disciplines. Rather, its origins lie within the online community of text-to-image generation practitioners and AI artists, who practice prompt engineering to exercise their craft. Prompt engineering is iterative and resembles a conversation with the text-to-image system: a practitioner will typically type a prompt, observe the outcome, and adapt the prompt to improve the outcome.
The online community around text-to-image generation found that the aesthetic qualities and subjective attractiveness of images can be modified by adding certain keywords, so-called prompt modifiers, to prompts (Oppenlaender, 2023). By adding such modifiers to an input prompt, one seeks to direct the text-to-image model to produce images in a certain style or with a certain quality. Knowing which prompt modifiers work best for a given subject term is often the result of the practitioner’s iterative and persistent experimentation (Kim, 2022; Liu & Chilton, 2022). Community-based resources have been created as educational materials about the practice of prompt engineering (Oppenlaender, 2022, 2023).
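In practice, prompt modifiers are often appended to a subject term as comma-separated keywords. The following sketch composes such a prompt; the subject and modifier list are invented for illustration, and `text_to_image` stands in for whichever generation system is used.

```python
# Composing a text-to-image prompt from a subject term and prompt modifiers.
# The subject and modifiers are invented; `text_to_image` is a hypothetical
# stand-in for whichever generation system is used.

subject = "a lighthouse on a cliff at dusk"
modifiers = [
    "oil painting",            # style modifier
    "trending on artstation",  # quality modifier often seen in community prompts
    "highly detailed",
    "dramatic lighting",
]
prompt = ", ".join([subject] + modifiers)
print(prompt)
# a lighthouse on a cliff at dusk, oil painting, trending on artstation, ...

# image = text_to_image(prompt)  # hypothetical call to the generative system
```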
Supported by these numerous guidelines and learning resources, anybody can materialise seemingly creative artifacts with generative AI, whether textual (e.g., poems and essays written with OpenAI’s GPT language models) or visual (e.g., photographic images created with Midjourney and Stable Diffusion), with little to no understanding of the underlying technologies. As generative AI matures and gains more prominence in our daily lives, it will become more important for us to be able to communicate effectively with the AI without having to resort to technical jargon and complicated keywords, as is currently the case with prompt engineering. Critics of prompt engineering, therefore, point out its many limitations:
- Some text-to-image generation models accept only a limited number of tokens (e.g., 75 tokens). Tokens are pieces of text that roughly correspond to words in the prompt (a token-counting sketch follows this list). Unless one picks a subject well represented in the training data (e.g., Leonardo da Vinci’s Mona Lisa), it is difficult to describe a subject in detail to the generative AI with only a limited number of tokens.
- Even if there were an unlimited number of tokens available to describe an imagined image, it is not humanly possible to describe a subject in every minute detail. In practice, the generative model will need to fill in the blanks. This, in turn, can lead the practitioner to either settle for an image that is “good enough” or abandon the pursuit of the originally envisioned image (Oppenlaender, 2022). On the other hand, the mismatch between the imagined output and the generated image can also spark creativity in the human user (Epstein et al., 2022).
- The keywords that humans use to describe a subject may not correspond to the concepts that the neural network has learned during its training. Certain tokens in the prompt can also have unintended side effects, for instance, on the style of an image, and concepts can “leak” from the prompt into the image (Rassin et al., 2022). It is, for example, not uncommon for parts of the prompt to surface in generated images as written text, even if the user did not intend this.
- Each time a generative model is updated, the practice of prompt engineering is heavily affected, and practitioners have to relearn their craft (Wang, 2022).
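Regarding the first limitation above, prompt length can be checked with the tokeniser that ships with a model. The sketch below uses the CLIP tokeniser from the Hugging Face transformers library; CLIP’s context window of 77 tokens, including the start and end markers, underlies the roughly 75-token budget mentioned in the list. The example prompt is illustrative.

```python
# Counting prompt tokens with the CLIP tokeniser (example prompt is invented).
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a lighthouse on a cliff at dusk, oil painting, highly detailed"
ids = tokenizer(prompt).input_ids  # includes start-of-text and end-of-text tokens

print(len(ids))                    # tokens consumed by this prompt
print(tokenizer.model_max_length)  # 77 for CLIP, i.e. ~75 usable prompt tokens
```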
This continual relearning is aided by the wealth of learning resources that have emerged online to help practitioners of text-to-image generation master the craft of prompt engineering. Together with communities and tools and services, these resources are the main pillars of the co-creative ecosystem of text-to-image generation.
Human–Computer Co-creative Ecosystem of Text-to-Image Generation
Prompt engineering is a practice embedded in the greater online creative ecosystem of text-to-image generation (Oppenlaender, 2022). In this ecosystem, communities, learning resources, and tools and services come together to form a dynamic and interactive environment where creativity and technology intersect. This convergence empowers users to explore and push the boundaries of art and expression through the innovative use of text-to-image AI technologies.
Communities Dedicated to Text-to-Image Generation
Online communities have emerged around text-guided generative art. These communities provide a platform for individuals to share their creative prompts and works of art, or even to collaborate on generating new images collectively (Kantosalo & Takala, 2020). The Midjourney community has emerged as a cultural hub contributing to the proliferation of prompt engineering, owing to its accessible service and its ear for the needs of its community members. Further examples of online communities include /r/MediaSynthesis on Reddit and numerous communities hosted on Discord. Discord, in particular, has proven to be an effective tool for community formation due to its support for chat-based online communities and the possibility for community members to interact directly with the image generation systems via chat.
Online communities act as a fertile ground for learning the skills necessary to use text-to-image generation systems creatively. The online communities of enthusiastic creators who share their prompts and practices serve as a rich environment for novice practitioners to gain knowledge from other members of the community in order to surmount the challenges of prompt engineering. However, the distributed and sequential nature of communication within these communities presents a challenge for practitioners seeking to attain specific learning objectives, such as the study of prompt modifiers. As a result, there is a growing trend among practitioners and online communities to establish dedicated learning resources that cater to the learning needs of novices.
Dedicated Learning Resources
Community members have created various learning resources related to prompt engineering, including resources aimed at specific learning objectives, such as style experimentation (Durant, 2021; Gabha, 2022a) and teaching prompt engineering (e.g., the DALL-E prompt book by Parsons (2022)). Some of these resources adopt an approach similar to the systematic experimentation conducted by Liu and Chilton (2022), presenting the results of their experiments in tabular format. Examples of these resources include the artist studies by Durant (2021), “MisterRuffian’s Latent Artist and Modifier Encyclopedia” (Saini, 2022), and the list of Disco Diffusion modifiers by Gabha (2022b). Hub pages are another type of resource that has been established to improve discoverability in the rapidly growing field of text-to-image generation. These hub pages function as indexes, compiling links to Colab notebooks and improving accessibility to the quickly growing number of resources. Two examples of hub pages are Miranda’s list of VQGAN-CLIP implementations and pharmapsychotic’s “Tools and Resources for AI Art”.
Dedicated Tools and Services
A growing number of interfaces, tools, and services are emerging to support practitioners in practicing text-to-image generation. For instance, a wide variety of text-to-image generation systems are available as open source in executable notebooks. Colab, in particular, has proven instrumental to the early growth and popularity of text-to-image generation. Colab is an online service that allows anybody to execute Python-based code and ML models for free. A growing ecosystem of tools and services assists in making text-to-image generation more accessible to non-technically minded practitioners. This ecosystem also acts as a catalyst that draws in new practitioners and advances the field as a whole. It is the specific combination of people, technology, services, tools, and resources that has formed a healthy co-creative ecosystem in which the text-to-image art community can thrive.
The Risks and Dangers of Cultivating the Co-creative Ecosystem
As the generative AI revolution advances, generative AI is becoming infused into many software applications and creative tools. For instance, users of Adobe Photoshop can create images from text using extensions based on Stable Diffusion (Alfaraj, 2022; Stability AI, 2022), and GitHub’s Copilot (GitHub Inc., 2021) has become an indispensable tool for many software developers. Generative AI is also being explored in the fields of generative design (Matejka et al., 2018; Mountstephens & Teo, 2020) and architecture (Paananen et al., 2023). Given the emerging ubiquity of generative models, such as large language models (Brown et al., 2020; Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2022) and other foundation-scale models (Bommasani et al., 2021), it is foreseeable that even more creative work will be completed with the support of generative AI in the future.
These generative models, and thus the tools and applications built with them, are based on deep learning. The results of deep neural networks are difficult to interpret and understand for both laypeople and experts (Lipton, 2018; Poursabzi-Sangdeh et al., 2021). With the emerging ubiquity of generative technologies, we are at risk of creating “systems of opaque systems” ingrained with difficult-to-understand, potentially flawed, and biased logic. This section discusses three potential dangers of cultivating the emerging co-creative ecosystem.
Bias in Training Data and Generative Models
Current approaches to training generative AI rely on vast datasets collected from the World Wide Web. For example, “The Pile” (Gao et al., 2020) is a popular data source for training language models. LAION (Schuhmann et al., 2021, 2022) is a dataset based on Common Crawl, a large dataset released by a non-profit organisation that periodically scrapes the World Wide Web. LAION-5B contains over 5 billion text-image pairs, with the texts taken from the “alt” attributes of HTML image elements (Schuhmann et al., 2022). These big datasets are used for training multi-modal generative models.
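How text-image pairs arise from “alt” attributes can be illustrated with a few lines of HTML parsing. The toy sketch below extracts (image URL, alt text) pairs from an HTML snippet in the spirit of LAION-style pipelines; the HTML is invented for the example, and real pipelines apply substantial additional filtering, such as CLIP-similarity checks (Schuhmann et al., 2021).

```python
# Deriving (image URL, alt text) pairs from crawled HTML (toy sketch).
from html.parser import HTMLParser

class ImgAltPairs(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            if a.get("src") and a.get("alt"):  # keep only captioned images
                self.pairs.append((a["src"], a["alt"]))

html = '<p><img src="/cat.jpg" alt="a grey cat sleeping on a sofa"></p>'
parser = ImgAltPairs()
parser.feed(html)
print(parser.pairs)  # [('/cat.jpg', 'a grey cat sleeping on a sofa')]
```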
It is known that data on the Web is biased (Baeza-Yates, 2018). Web-based data may contain content that violates human preferences (Korbak et al., 2023). Web-based datasets may also encode Western ways of seeing due to the over-representation of certain viewpoints in the data, creating intersectional issues that reflect discrimination and privilege. For instance, the English language is over-represented in LAION-5B: 2.3 billion of the dataset’s 5.85 billion image-text pairs are in English, with the rest representing more than 100 other languages or texts that cannot be assigned to a language (Schuhmann et al., 2022).
Generative models trained on this biased data may repeat, and in some cases, amplify undesirable biases, such as demographic biases (Bender et al., 2021; Danks & London, 2017; Salminen et al., 2020; van der Wal et al., 2022; Williams et al., 2018). Birhane et al. (2021) found that LAION contains many troublesome and explicit images and text pairs, including “rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content” (p. 1). Qu et al. (2023) found that text-to-image generators create a substantial number of unsafe images, including sexually explicit, violent, disturbing, hateful, and political images, which could be used to spread hateful memes. This problematic content may surface in downstream applications and cause harm (Monroe, 2021). That is one reason why deep neural networks are difficult to audit (Perez et al., 2022) and harm is often only discovered during everyday use (Shen et al., 2021).
A Flood of Synthetic Data
Synthetic images are being shared en masse on social media. This flood of synthetic data (Olson, 2022) raises concerns that synthetic imagery could taint the training data of future generations of text-to-image models and perpetuate the weaknesses of current text-to-image systems. For instance, current text-to-image models struggle with human anatomy, in particular the accurate depiction of human hands. Data quality has been shown to be as important as the size of the training data (Hoffmann et al., 2022). Using low-quality synthetic images as training data could therefore result in a further degradation in the quality of future generative models. Shumailov et al. (2023) call this phenomenon the “curse of recursion”.
Models trained on data generated by prior generative models may degenerate and forget the underlying data distribution, leading to learned behaviour with limited variance (Shumailov et al., 2023). Some proponents of generative AI argue that future generative models will simply learn to treat synthetic imagery as a new category of images, and that the generative model can then be prompted not to produce images that look as though they were generated by AI. However, this assumes that a way can be found to prompt the generative model to avoid this specific class of AI-generated images, which may or may not be possible. The challenge lies in finding the right keywords to denote the class of quality-degraded images, as discussed in the section on prompt engineering. Even if there were a clear label for the new class of synthetic imagery, the synthetic look could become inseparably associated with other classes.
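The degeneration can be demonstrated with a toy simulation: if each “model” is simply a Gaussian fitted to a finite sample drawn from the previous model, the fitted variance tends to shrink over generations, and the tails of the original distribution are forgotten. This is a deliberately simplified illustration of the effect described by Shumailov et al. (2023), not a reproduction of their experiments.

```python
# Toy illustration of recursive training on synthetic data ("model collapse").
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0   # the original "real data" distribution
n = 20                 # finite sample available to each generation

for generation in range(1, 301):
    samples = rng.normal(mu, sigma, n)         # train on the previous model's output
    mu, sigma = samples.mean(), samples.std()  # the fit becomes the next "model"
    if generation % 50 == 0:
        print(f"generation {generation:3d}: sigma = {sigma:.4f}")

# Over many generations, sigma tends to collapse toward zero: each model sees
# only what the previous one generated, and the tails of the original
# distribution are progressively forgotten.
```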
A possible solution to this problem could be invisible watermarks applied to AI-generated images. This embedded metadata would allow future generative models, and the pipelines that train them, to correctly distinguish the class of AI-generated imagery from human creations. Another solution could be technological progress in image generation: by training generative AI on smaller and more tightly controlled datasets, many of these problems could be avoided. Such sparse learning (Mishra et al., 2021) would also better mimic how humans learn from sparse cues in their environment.
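As a toy illustration of the watermarking idea, the snippet below hides a one-byte provenance tag in the least significant bits of an image array and reads it back. Production watermarking schemes are more robust (e.g., operating in the frequency domain), and the tag value here is arbitrary; the sketch only demonstrates the principle of machine-readable provenance metadata.

```python
# Toy least-significant-bit (LSB) watermark: embed and recover a provenance tag.
import numpy as np

TAG = 0b10100110  # an arbitrary 8-bit "AI-generated" marker (illustrative)

def embed_tag(img, tag=TAG):
    """Write the tag's 8 bits into the least significant bits of 8 pixels."""
    out = img.copy()
    bits = np.array([(tag >> i) & 1 for i in range(8)], dtype=np.uint8)
    flat = out.reshape(-1)
    flat[:8] = (flat[:8] & 0xFE) | bits  # changes pixel values by at most 1/255
    return out

def read_tag(img):
    flat = img.reshape(-1)
    return sum(int(flat[i] & 1) << i for i in range(8))

image = np.random.randint(0, 256, (64, 64), dtype=np.uint8)  # stand-in image
marked = embed_tag(image)
assert read_tag(marked) == TAG  # a training pipeline could filter on this tag
```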
Long-term Effects on Individuals and Human Culture
By cultivating generative systems and adopting them in our creative practices, we risk becoming dependent on them. Final-year students who finished their studies in 2020 are the last cohort of students to have completed their education without the support of generative AI. Future generations will be born into a co-creative ecosystem of ubiquitous generative AI. Generative AI, in this regard, is similar to the Internet, which has had a profound impact on our lives, especially on Generation Z, who grew up never knowing a time without it. The Internet has affected our cognition (Carr, 2011; Firth et al., 2019). It has reconfigured our society and brought great benefits, but it has also presented drawbacks and risks. With generative AI and text-to-image generation, there could be unforeseen direct and indirect negative consequences for society and individuals.
For one, generative AI could negatively affect people’s imagination and cognitive functioning in the long term. The negative and lasting effects of habitual mobile phone use on people’s cognition, memory, and attention span can already be witnessed (Wilmer et al., 2017). One could argue that habituated practices of image generation could, in the long term, also negatively affect people’s career ambitions. For instance, the widespread use of AI could discourage would-be artists from pursuing a creative career (Mok, 2023). Habitual use of generative AI could also negatively affect people’s cognitive abilities, with lasting effects on imagination and child development. Aphantasia refers to the inability to visualise mental images (Milton et al., 2021). The ability to visualise, “to create a quasi-perceptual visual picture in the mind’s eye” (Dance et al., 2022, p. 1), is important for daydreaming, imagination, and creativity (Zeman et al., 2020). Dance et al. (2022) found that about 4% of a sample of about 1,000 people had a weakness of, or inability to create, mental imagery. Generative AI could potentially contribute to a rise in the prevalence of aphantasia in the general population.
Another effect could be that people become accustomed to lowering their expectations and settling for second-best options. As discussed in the section on prompt engineering, the results of text-to-image generation are often random and do not match the mental image of the person writing the prompt. Retrieving an exact image from the generative model’s “infinite index” (Deckers et al., 2023) can be an arduous task. With each newly generated image, the practitioner’s efforts to “retrieve” the envisioned image from the generative model may become derailed. The generative system may, for instance, present the user with interesting results that do not reflect the initial prompt but are worth pursuing further. This repeated settling for “good enough” results could lead to long-term changes in our ambitions and could contribute to cultivating a culture of prototyping.
Another effect on culture could be a change in communication patterns. Generative AI could lead to a shift in how individuals communicate with each other. For instance, intelligent agents could write and summarise e-mails for people, a feat that OpenAI’s ChatGPT already accomplishes quite well today. Generative AI could also more directly support communication, for instance, in the form of real-time translation of spoken words in face-to-face communications (Kirkpatrick, 2020). Text-to-image generation could contribute to the proliferation of memes on the Internet and fuel a meme-driven culture. AI-generated media may allow creators to express feelings that could not be expressed through words.
However, the proliferation of synthetic media could lead to some human-created media becoming harder to find on the Web, with knock-on effects on humanity’s knowledge and culture. With the flood of synthetic media, the long tail of information on the Web (i.e., the vast array of less popular or niche information and content that is not mainstream but cumulatively significant) expands and accrues noise, which in turn makes information in the long tail more difficult to retrieve. Kandpal et al. (2022) found that large language models struggle to learn the long tail of knowledge. If interaction with large language models becomes our primary way of answering queries, as opposed to searching the Web, long-tail knowledge could be lost. This presents unique challenges to augmented language models (Mialon et al., 2023), a class of generative models equipped with the ability to use tools and access external knowledge bases. Such augmented models can, for instance, query APIs, execute functions, and retrieve information from search engines. Synthetic media could negatively affect the operation of augmented models because factual information becomes harder to retrieve via search engines.
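To make the notion of an augmented model concrete, the sketch below shows the bare control loop: the model emits a tool call as text, a harness executes it against an external source, and the retrieved evidence is appended to the context before the model answers. Everything here (the SEARCH syntax and the `model` and `search` stand-ins) is hypothetical; real systems, such as those surveyed by Mialon et al. (2023), are far more elaborate.

```python
# Bare control loop of an "augmented" model (everything here is hypothetical).

def model(context: str) -> str:
    # A real language model would generate this text; hard-coded for illustration.
    if "Tool result:" in context:
        return "CLIP was released in January 2021."
    return 'SEARCH("CLIP release date")'

def search(query: str) -> str:
    # Stand-in for a call to a search engine API or external knowledge base.
    return "OpenAI released CLIP in January 2021."

context = "Question: When was CLIP released?"
output = model(context)
if output.startswith('SEARCH("') and output.endswith('")'):
    query = output[len('SEARCH("'):-2]            # parse the emitted tool call
    context += f"\nTool result: {search(query)}"  # append retrieved evidence
    output = model(context)                       # answer grounded in the result
print(output)
```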
Text-to-image generation could also fundamentally change people’s relation to visual media and how individuals appreciate art. AI-generated media is becoming ubiquitous, and it is not clear whether pervasive synthetic media will make us appreciate human-made art more or less. If anybody can create digital artworks that look as though they were created by a master painter, will people still appreciate real paintings, whether digital or on a physical canvas? The proliferation of AI-generated media also makes it harder for aesthetic trends to stick among the many meaningless, short-lived fads (Townsend, 2023). The model of attractive quality by Kano et al. (1984) posits that products can contain “exciting” factors that contribute to the appreciation of the product. Over time, exciting factors become expected. The iPhone’s touchscreen-based user experience, for instance, was exciting when it was first introduced, but later became the expected standard in mobile phones. Perhaps, once the novelty of AI-generated media wears off, text-to-image generation will turn from an exciting factor into an expected one. Then, generative AI’s true potential will emerge. Generative AI has the potential to become part of the fundamental layer on which future human society is based.
Conclusion
Text-to-image generation technology has emerged as an exciting new area of creative practice, drawing parallels with photography in terms of its ability to visualise the surrounding world. However, as with any new technology, there are also concerns. The co-creative ecosystem of text-to-image generation, which involves both human and computer participants, raises issues related to bias in training data, a potential flood of synthetic data, and the long-term impact on individuals and human culture.
Despite these challenges, the creative practice of prompt engineering has already produced remarkable results, thanks in part to a flourishing online ecosystem of dedicated communities, learning resources, tools, and services. While these resources offer great potential for creativity and innovation, they also come with risks. Therefore, it is crucial to carefully consider the ethical implications of cultivating a co-creative ecosystem and to take steps to mitigate any potential negative effects. Text-to-image generation has the potential to revolutionise the way new works of art are visualised and created, but it is essential to approach this technology with caution and responsibility. As this exciting new field continues to be explored, vigilance must be maintained in identifying and addressing potential risks and dangers, to ensure that generative AI is aligned with human values and that its benefits are shared by all.
References
Alfaraj, A. (2022). Auto photoshop Stablediffusion plugin. https://github.com/AbdullahAlfaraj/Auto-Photoshop-StableDiffusion-Plugin
Baeza-Yates, R. (2018). Bias on the web. Communications of the ACM, 61(6), 54–61. https://doi.org/10.1145/3209581
Baio, A. (2022). AI data laundering: How academic and nonprofit researchers shield tech companies from accountability. https://waxy.org/2022/09/ai-data-laundering-how-academic-and-nonprofit-researchers-shield-tech-companies-from-accountability/
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. FAccT ’21. Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922
Birhane, A., Prabhu, V. U., & Kahembwe, E. (2021). Multimodal datasets: Misogyny, pornography, and malignant stereotypes. arXiv. https://doi.org/10.48550/ARXIV.2110.01963
Boden, M. A., & Edmonds, E. A. (2009). What is generative art? Digital Creativity, 20(1–2), 21–46. https://doi.org/10.1080/14626260902867915
Bommasani, R., Drew, A. H., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., et al. (2021). On the opportunities and risks of foundation models. CoRR abs/2108.07258. http://arxiv.org/abs/2108.07258
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., et al. (2020). Language models are few-shot learners. arXiv. https://doi.org/10.48550/ARXIV.2005.14165
Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramèr, F., Balle, B., Ippolito, D., & Wallace, E. (2023). Extracting training data from diffusion models. arXiv. https://doi.org/10.48550/ARXIV.2301.13188
Carr, N. (2011). The Shallows: What the Internet is doing to our brains. W. W. Norton & Company Inc.
Cohen, H. (1979). What is an image?
Colton, S., Smith, M., Berns, S., Murdock, R., & Cook, M. (2021). Generative search engines: Initial experiments. In Proceedings of the 12th International Conference on Computational Creativity, 237–246. ICCC ’21. Association for Computational Creativity.
Crowson, K., Biderman, S., Kornis, D., Stander, D., Hallahan, E., Castricato, L., & Raff, E. (2022). VQGAN-CLIP: Open domain image generation and editing with natural language guidance. arXiv. https://doi.org/10.48550/ARXIV.2204.08583
Dance, C. J., Ipser, A., & Simner, J. (2022). The prevalence of aphantasia (imagery weakness) in the general population. Consciousness and Cognition, 97, 103243. https://doi.org/10.1016/j.concog.2021.103243
Danks, D., & London, A. J. (2017). Algorithmic bias in autonomous systems. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, (IJCAI-17), 4691–4697. https://doi.org/10.24963/ijcai.2017/654
Deckers, N., Fröbe, M., Kiesel, J., Pandolfo, G., Schröder, C., Stein, B., & Potthast, M. (2023). The infinite index: Information retrieval on generative text-to-image models. In ACM SIGIR Conference on Human Information Interaction and Retrieval. CHIIR ’23.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
Dhariwal, P., & Nichol, A. (2021). Diffusion models beat GANs on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, & J. Wortman Vaughan (Eds.), Advances in neural information processing systems (Vol. 34, pp. 8780–8794). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf
Durant, R. (2021). Artist Studies by @remi_durant. https://remidurant.com/artists/
Edwards, B. (2022a). Artist finds private medical record photos in popular AI training data set. https://arstechnica.com/information-technology/2022/09/artist-finds-private-medical-record-photos-in-popular-ai-training-data-set
Edwards, B. (2022b). ‘Too Easy’—Midjourney tests dramatic new version of its AI image generator. https://arstechnica.com/information-technology/2022/11/midjourney-turns-heads-with-quality-leap-in-new-ai-image-generator-version/
Eisenstein, E. L. (1980). The printing press as an agent of change. Cambridge University Press.
Epstein, Z., Schroeder, H., & Newman, D. (2022). When happy accidents spark creativity: Bringing collaborative speculation to life with generative AI. In International Conference on Computational Creativity. ICCC ’22. arXiv. https://doi.org/10.48550/ARXIV.2206.00533
Feldman, V. (2019). Does learning require memorization? A short tale about a long tail. arXiv. https://doi.org/10.48550/ARXIV.1906.05271
Firth, J., Torous, J., Stubbs, B., Firth, J. A., Steiner, G. Z., Smith, L., Alvarez-Jimenez, M., et al. (2019). The ‘Online Brain’: How the Internet may be changing our cognition. World Psychiatry, 18(2), 119–129.
Gabha, H. (2022a). Disco (Diffusion) modifiers. https://weirdwonderfulai.art/resources/disco-diffusion-modifiers/
Galanter, P. (2016). Generative art theory. In C. Paul (Ed.), A companion to digital art (pp. 146–180). John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118475249.ch5
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J. et al. (2020). The Pile: An 800GB dataset of diverse text for language modeling. https://doi.org/10.48550/ARXIV.2101.00027
GitHub Inc. (2021). GitHub copilot—Your AI pair programmer. https://copilot.github.com
Goldberg, Y. (2023). Some remarks on large language models. https://gist.github.com/yoavg/59d174608e92e845c8994ac2e234c8a9
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.) Advances in neural information processing systems (Vol. 27). Curran Associates, Inc.
Harvey, A., & LaPlace, J. (2021). Megaface. https://exposing.ai/megaface/
Hertzmann, A. (2018). Can computers create art? Arts, 7(2). https://doi.org/10.3390/arts7020018
Hertzmann, A. (2020). Computers do not make art, people do. Communications of the ACM, 63(5), 45–48. https://doi.org/10.1145/3347092
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., et al. (2022). Training compute-optimal large language models. arXiv. https://doi.org/10.48550/ARXIV.2203.15556
Justia. (2022). hiQ Labs, Inc. v. LinkedIn Corporation. https://law.justia.com/cases/federal/appellate-courts/ca9/17-16783/17-16783-2022-04-18.html
Justia. (2023). Getty Images (US), Inc. v. Stability AI, Inc. https://docs.justia.com/cases/federal/district-courts/delaware/dedce/1:2023cv00135/81407/1
Kandpal, N., Deng, H., Roberts, A., Wallace, E., & Raffel, C. (2022). Large language models struggle to learn long-tail knowledge. arXiv. https://doi.org/10.48550/ARXIV.2211.08411
Kano, N., Seraku, N., Takahashi, F., & Tsuji, S.-I. (1984). Attractive quality and must-be quality. Journal of the Japanese Society for Quality Control, 14(2), 147–156.
Kantosalo, A., & Takala, T. (2020). Five C’s for human–computer co-creativity: An update on classical creativity perspectives. In Proceedings of the 11th International Conference on Computational Creativity. Association for Computational Creativity.
Kim, J. (2022). Keynote on interaction-centric AI. In NeurIPS 2022. https://slideslive.com/38996064/interactioncentric-ai
Kirkpatrick, K. (2020). Across the language barrier. Communications of the ACM, 63(3), 15–17. https://doi.org/10.1145/3379495
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. arXiv. https://doi.org/10.48550/ARXIV.2205.11916
Korbak, T., Shi, K., Chen, A., Bhalerao, R., Buckley, C. L., Phang, J., Bowman, S. R., & Perez, E. (2023). Pretraining language models with human preferences. arXiv https://doi.org/10.48550/ARXIV.2302.08582
Kuhn, B. M. (2022). If software is my copilot, who programmed my software? software freedom conservancy. https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/
Lipton, Z. C. (2018). The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3), 31–57. https://doi.org/10.1145/3236386.3241340
Liu, V., & Chilton, L. B. (2022). Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. CHI ’22. Association for Computing Machinery. https://doi.org/10.1145/3491102.3501825
Mansimov, E., Parisotto, E., Ba, J., & Salakhutdinov, R. (2016). Generating images from captions with attention. In International Conference on Learning Representations. ICLR ’16.
Marche, S. (2022). We’re witnessing the birth of a new artistic medium. The Atlantic. https://www.theatlantic.com/technology/archive/2022/09/ai-art-generators-future/671568/
Matejka, J., Glueck, M., Bradner, E., Hashemi, A., Grossman, T., & Fitzmaurice, G. (2018). Dream lens: Exploration and visualization of large-scale generative design datasets. In Proceedings of the 2018 Chi Conference on Human Factors in Computing Systems, 1–12. CHI ’18. Association for Computing Machinery. https://doi.org/10.1145/3173574.3173943
Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B. et al. (2023). Augmented language models: A survey. arXiv. https://doi.org/10.48550/ARXIV.2302.07842
Milton, F., Fulford, J., Dance, C., Gaddum, J., Heuerman-Williamson, B., Jones, K., Knight, K. F., MacKisack, M., Winlove, C., & Zeman, A. (2021). Behavioral and neural signatures of visual imagery vividness extremes: Aphantasia versus Hyperphantasia. Cerebral Cortex Communications, 2(2). https://doi.org/10.1093/texcom/tgab035
Mishra, A., Albericio Latorre, J., Pool, J., Stosic, D., Stosic, D., Venkatesh, D., Yu, C., & Micikevicius, P. (2021). Accelerating sparse deep neural networks. arXiv. https://doi.org/10.48550/ARXIV.2104.08378
Mok, K. (2023). The power and ethical dilemma of AI image generation models. https://thenewstack.io/the-power-and-ethical-dilemma-of-ai-image-generation-models/
Monroe, D. (2021). Trouble at the source. Communications of the ACM, 64(12), 17–19. https://doi.org/10.1145/3490155
Mountstephens, J., & Teo, J. (2020). Progress and challenges in generative product design: A review of systems. Computers, 9(4). https://doi.org/10.3390/computers9040080
Murdock, R., & Wang, P. (2021). Big sleep. https://github.com/lucidrains/big-sleep
Nech, A., & Kemelmacher-Shlizerman, I. (2017). Level playing field for million scale face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. https://doi.org/10.48550/arXiv.1705.00393
Olson, P. (2022). Creative AI is generating some messy problems. Bloomberg. https://www.washingtonpost.com/business/creative-ai-is-generating-some-messy-problems/2022/11/28/be2b2efc-6ee2-11ed-867c-8ec695e4afcd_story.html
Oppenlaender, J. (2022). The creativity of text-to-image generation. In 25th International Academic Mindtrek Conference, 192–202. Academic Mindtrek 2022. Association for Computing Machinery. https://doi.org/10.1145/3569219.3569352
Oppenlaender, J. (2023). A taxonomy of prompt modifiers for text-to-image generation. Behaviour & Information Technology. Taylor & Francis. https://doi.org/10.1080/0144929X.2023.2286532
Oppenlaender, J., Silvennoinen, J., Paananen, V., & Visuri, A. (2023). Perceptions and realities of text-to-image generation. In 26th International Academic Mindtrek Conference, 279–288. Academic Mindtrek 2023. Association for Computing Machinery. https://doi.org/10.1145/3616961.3616978
Paananen, V., Oppenlaender, J., & Visuri, A. (2023). Using text-to-image generation for architectural design ideation. International Journal of Architectural Computing. SAGE. https://doi.org/10.1177/14780771231222783
Parsons, G. (2022). The DALL·E 2 prompt book. https://dallery.gallery/the-dalle-2-prompt-book/
Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., et al. (2022). Discovering language model behaviors with model-written evaluations. arXiv. https://doi.org/10.48550/ARXIV.2212.09251
Poursabzi-Sangdeh, F., Goldstein, D. G., Hofman, J. M., Wortman Vaughan, J., & Wallach, H. (2021). Manipulating and measuring model interpretability. In Proceedings of the 2021 Chi Conference on Human Factors in Computing Systems. CHI ’21. Association for Computing Machinery. https://doi.org/10.1145/3411764.3445315
Qu, Y., Shen, X., He, X., Backes, M., Zannettou, S., & Zhang, Y. (2023). Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models. CCS ’23. ACM. https://doi.org/10.1145/3576915.3616679
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., et al. (2021). Learning transferable visual models from natural language supervision. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, 139, 8748–8763. ICML. PMLR. https://proceedings.mlr.press/v139/radford21a.html
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. & Liu, P. J. (2022). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1). https://jmlr.org/papers/volume21/20-074/20-074.pdf
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. arXiv. https://doi.org/10.48550/ARXIV.2204.06125
Rassin, R., Ravfogel, S., & Goldberg, Y. (2022). DALLE-2 is seeing double: Flaws in word-to-concept mapping in Text2Image models. In Proceedings of the Fifth Blackboxnlp Workshop on Analyzing and Interpreting Neural Networks for Nlp, 335–45. Association for Computational Linguistics. https://aclanthology.org/2022.blackboxnlp-1.28
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., & Lee, H. (2016). Generative adversarial text to image synthesis. ICML 2016. arXiv. https://doi.org/10.48550/ARXIV.1605.05396
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2021). High-resolution image synthesis with latent diffusion models. http://arxiv.org/abs/2112.10752
Rombach, R., Blattmann, A., & Ommer, B. (2022). Text-guided synthesis of artistic images with retrieval-augmented diffusion models. arXiv. https://doi.org/10.48550/ARXIV.2207.13038
Saini, L. (2022). Mister Ruffian's latent artist & modifier encyclopedia. https://docs.google.com/spreadsheets/d/1_jgQ9SyvUaBNP1mHHEzZ6HhL_Es1KwBKQtnpnmWW82I
Salminen, J., Jung, S.-G., Chowdhury, F., & Jansen, B. J. (2020). Analyzing demographic bias in artificially generated facial pictures. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, 1–8. CHI Ea ’20. Association for Computing Machinery. https://doi.org/10.1145/3334480.3382791
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C. W., Wightman, R., Cherti, M., Coombes, T., et al. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-Sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://arxiv.org/abs/2210.08402
Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., & Komatsuzaki, A. (2021). LAION-400M: Open dataset of CLIP-Filtered 400 million image-text pairs. arXiv. https://doi.org/10.48550/ARXIV.2111.02114
Shen, H., DeVos, A., Eslami, M., & Holstein, K. (2021). Everyday algorithm auditing: Understanding the power of everyday users in surfacing harmful algorithmic behaviors. Proceedings of ACM Human-Computer Interaction, 5 (CSCW2). https://doi.org/10.1145/3479577
Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., & Anderson, R. (2023). The curse of recursion: Training on generated data makes models forget. http://arxiv.org/abs/2305.17493
Somepalli, G., Singla, V., Goldblum, M., Geiping, J., & Goldstein, T. (2022). Diffusion art or digital forgery? investigating data replication in diffusion models. arXiv. https://doi.org/10.48550/ARXIV.2212.03860
Stability AI. (2022). Stability photoshop plugin. https://exchange.adobe.com/apps/cc/114117da/stable-diffusion
Stackoverflow.com. (2022). Temporary policy: ChatGPT is banned. https://meta.stackoverflow.com/questions/421831/temporary-policy-chatgpt-is-banned
Toews, R. (2022). 4 Predictions about the wild new world of text-to-image AI. Forbes. https://www.forbes.com/sites/robtoews/2022/09/11/4-hot-takes-about-the-wild-new-world-of-generative-ai/
Townsend, C. (2023). Explaining corecore: How TikTok’s newest trend may be a genuine Gen Z art form. Mashable. https://mashable.com/article/explaining-corecore-tiktok
Vincent, J. (2022a). Getty images bans AI-generated content over fears of legal challenges. The Verge. https://www.theverge.com/2022/9/21/23364696/getty-images-ai-ban-generated-artwork-illustration-copyright
Vincent, J. (2022b). The lawsuit that could rewrite the rules of AI copyright. The Verge. https://www.theverge.com/2022/11/8/23446821/microsoft-openai-github-copilot-class-action-lawsuit-ai-copyright-violation-training-data
Vincent, J. (n.d.). Getty images sues AI art generator stable diffusion in the US for copyright infringement. The Verge. https://www.theverge.com/2023/2/6/23587393/ai-art-copyright-lawsuit-getty-images-stable-diffusion
van der Wal, O., Jumelet, J., Schulz, K., & Zuidema, W. (2022). The birth of bias: A case study on the evolution of gender bias in an English language model. NAACL ’22. arXiv. https://doi.org/10.48550/ARXIV.2207.10245
Wang, S. (2022). Why ‘prompt engineering’ and ‘generative AI’ are overhyped. https://lspace.swyx.io/p/why-prompt-engineering-and-generative
Williams, B. A., Brooks, C. F., & Shmargad, Y. (2018). How algorithms discriminate based on data they lack: Challenges, solutions, and policy implications. Journal of Information Policy, 8, 78–115. https://doi.org/10.5325/jinfopoli.8.2018.0078
Wilmer, H. H., Sherman, L. E., & Chein, J. M. (2017). Smartphones and cognition: A review of research exploring the links between mobile technology habits and cognitive functioning. Frontiers in Psychology, 8. https://doi.org/10.3389/fpsyg.2017.00605
Zammit, M., Liapis, A., & Yannakakis, G. (2022). Seeding diversity into AI art. In Proceedings of the 13th International Conference on Computational Creativity. Association for Computational Creativity.
Zeman, A., Milton, F., Sala, S. D., Dewar, M., Frayling, T., Gaddum, J., Hattersley, A., et al. (2020). Phantasia—The psychological significance of lifelong visual imagery vividness extremes. Cortex, 130, 426–440. https://doi.org/10.1016/j.cortex.2020.04.003
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
About this chapter
Cite this chapter
Oppenlaender, J. (2024). The Cultivated Practices of Text-to-Image Generation. In: Rousi, R., von Koskull, C., Roto, V. (eds) Humane Autonomous Technology. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-031-66528-8_14
DOI: https://doi.org/10.1007/978-3-031-66528-8_14
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-031-66527-1
Online ISBN: 978-3-031-66528-8