research-article

Open access

Farsight: Fostering Responsible AI Awareness During AI Application Prototyping

Authors:

Michael MadaioAuthors Info & Claims

CHI '24: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems

Article No.: 976, Pages 1 - 40

https://doi.org/10.1145/3613904.3642335

Published: 11 May 2024 Publication History

All formats PDF

Abstract

Prompt-based interfaces for Large Language Models (LLMs) have made prototyping and building AI-powered applications easier than ever before. However, identifying potential harms that may arise from AI applications remains a challenge, particularly during prompt-based prototyping. To address this, we present Farsight, a novel in situ interactive tool that helps people identify potential harms from the AI applications they are prototyping. Based on a user’s prompt, Farsight highlights news articles about relevant AI incidents and allows users to explore and edit LLM-generated use cases, stakeholders, and harms. We report design insights from a co-design study with 10 AI prototypers and findings from a user study with 42 AI prototypers. After using Farsight, AI prototypers in our user study are better able to independently identify potential harms associated with a prompt and find our tool more useful and usable than existing resources. Their qualitative feedback also highlights that Farsight encourages them to focus on end-users and think beyond immediate harms. We discuss these findings and reflect on their implications for designing AI prototyping experiences that meaningfully engage with AI harms. Farsight is publicly accessible at: https://pair-code.github.io/farsight.

Fig. 1:

1 Introduction

Fig. 2:

Fig. 3:

As artificial intelligence (AI) becomes increasingly integrated into our everyday lives, mitigating the societal harms posed by AI technologies has never been more important. In response to the demand for accountable and safe AI, there have been growing efforts from both industry and academia towards responsible design and development of AI [143, 183]. The majority of these endeavors focus on machine learning (ML) experts, such as ML developers and other AI practitioners. For example, researchers have introduced techniques that help ML developers interpret ML models [102, 128, 150] and assess model fairness [30, 90, 189]. Additionally, researchers have also proposed frameworks that target ML developers’ workflows, such as improving data collection and annotation practices [14, 118, 123], documenting training data and models [43, 63, 122], and anticipating an ML product’s potentials for harms [46, 120].

However, more recently, we have witnessed a rapid advancement of large language models (LLMs) such as Gemini [178] and GPT-4 [132], alongside the emergence of prompt-based interfaces like Google AI Studio [70], GPT Playground [133], AI Chains [204], and Wordflow [184] (Figure 2B). These general-purpose models and easy-to-use interfaces have significantly increased access to the process of prototyping and building diverse AI-powered applications—leading to a paradigm shift in AI development workflows that poses unique challenges to responsible AI, including introducing new potential harms to avoid [190], as well as challenges applying existing responsible AI practices [98].

Many people who use prompts to create AI applications now encompass a broader spectrum of roles beyond traditional ML experts (Figure 2A), such as designers, writers, lawyers, and everyday users [55, 84, 193, 207], whereas existing responsible AI research often targets ML experts such as ML engineers and data scientists [78, 198]. Many users of AI prompt-based prototyping interfaces [e.g., 70, 204, 133, 184], or “AI prototypers” [cf. 84] do not have experience in AI or computer science, which can lead to challenges in anticipating the consequences of their AI applications [143]—a difficult task even for computer science faculty and AI researchers [20, 45]. Furthermore, LLMs demonstrate a wide range of capabilities that are continually being discovered across various contexts, including tasks such as summarization, classification, and translation [18, 174]. This characteristic of LLMs gives rise to complex and uncertain impacts of LLM-powered applications [61], presenting a significant departure from the classical ML models targeted by existing responsible AI endeavors [98, 190] and introducing a new layer of complexity for responsible AI researchers to help AI developers anticipate downstream consequences.

To help address these challenges in applying responsible AI practices to LLM-powered AI applications, we present Farsight (Figure 1, Figure 2B), an interactive tool to help AI prototypers identify potential harms of their LLM-powered applications—a key early step in harm prevention and mitigation [104, 120, 131, 176, 177]—during the prototyping stage. Using Farsight as a probe, we conduct multiple mixed-method user studies to investigate (1) how an early-stage intervention changes AI prototypers’ awareness of and approach to identifying harms, (2) the effectiveness of our tool in helping people envision harms, and (3) the challenges AI prototypers face during this harm envisioning process. We contribute:

•

Farsight, the first in situ interactive harm envisioning tool that empowers AI prototypers to identify potential harms that may arise from their prompt-based AI applications, directly within their prompting environments (Figure 1, Figure 2). Inspired by prior harm envisioning frameworks [24, 46, 120] and in situ security alert tools [109, 125, 147], Farsight overcomes unique design challenges identified from a literature review (section 2) and a co-design user study with 10 AI prototypers (section 3).

•

Novel techniques and interactive system designs to foster responsible AI awareness among AI prototypers. Given a user’s prompt, Farsight leverages embedding similarities to surface news articles about relevant AI incidents from the AI Incident Database [111] and uses LLMs to generate potential use cases, impacted stakeholders, and harms for AI prototypers to review, edit, and add to. Applying a progressive disclosure design [129], our tool fits into users’ diverse prompting workflows. With a novel adaptation of node-link diagrams [146], Farsight enables users to interactively visualize, generate, and edit use cases, stakeholders, and harms (section 4).

•

Empirical findings about harm envisioning processes from a co-design study and an evaluation study. During our design of Farsight, we conducted a co-design study with 10 AI prototypers to evaluate our design ideas and generate new ideas (section 3). After developing Farsight, we conducted an evaluation user study with 42 AI prototypers to examine the effectiveness of Farsight in aiding users to brainstorm harms and improving their ability to independently identify harms. Our mixed-method analysis highlights that, after using Farsight, AI prototypers are better able to independently identify potential harms that might arise from an application developed with a given prompt, and participants report that our tool is more useful and usable than existing resources. In particular, Farsight encourages users to shift their focus from the AI model to the end-users, providing them with a broader perspective to consider indirect stakeholders and cascading harms (section 6).

•

An open-source, web-based implementation that lowers the barrier to applying responsible AI practices. We develop Farsight with cutting-edge web technologies, such as Web Components [115] and WebGL [114], so that it can be easily integrated into any web-based prompt development environments, such as Google AI Studio and Jupyter Notebook (Figure 3). We open source¹ Farsight as a collection of reusable interactive components that future researchers and designers can easily adopt (section 4.4). To see a demo video of Farsight, visit https://youtu.be/BlSFbGkOlHk.

2 Related Work

2.1 Anticipating Technology’s Negative Impacts

Various design methods and approaches have been developed to support ideation about potential downstream impacts of technology, including anticipatory tech ethics [22, 126], speculative design [10, 49, 197], and value-sensitive design [57, 59, 60] among others. To support designers with this, prior research has developed design toolkits [e.g., 29] and resources, such as Envisioning Cards [58], Value Cards [165], Timelines [199], and the Black Mirror Writers’ Room [89], among others [e.g., 11, 46]. Such resources are intended to be used by designers of technology early in the design process, but they may not fit neatly into existing product design and development processes, particularly for AI-powered application design paradigms, where large pre-trained models are used for many downstream tasks [183].

In addition to technology designers, computing researchers have called for the computer science field to consider the negative impacts of their work in addition to the positive impacts [76]. In AI research, conferences such as NeurIPS have begun requiring that researchers articulate potential negative broader impacts of their work in statements at the ends of their papers [140] to avoid the “failures of imagination” [20] that may lead to downstream harms. Prior work analyzed these broader impacts statements, finding convergence around a set of topics such as risks to privacy and bias, but often lacking concrete specifics or strategies for mitigation [8, 99, 127, 167]. However, prior work suggests that many CS researchers may not have the training, resources, or inclination to engage in this type of anticipatory work [45, 175], suggesting that new tools, training, and processes, are needed to support researchers and developers in engaging in anticipatory work in ways that are integrated into their research practices. More recently, researchers have proposed a framework that uses LLMs to anticipate harms for classifiers by generating stakeholders and vignettes for a given scenario [24], evaluating this framework through interviews with responsible AI researchers. Farsight builds upon this framework and extends it to (1) target an early prototyping stage through in situ and interactive interfaces that promote user engagement in the harm envisioning process, (2) support LLM-powered applications with diverse tasks beyond classification, and (3) evaluates its effectiveness through a user study with 42 AI prototypers.

2.2 Identifying and Mitigating LLM Harms

More recently, there has been a growing body of research that specifically focuses on identifying and mitigating the harms of LLMs. Researchers have introduced harm taxonomies specifically for LLMs, which identify known risks (i.e., informed by observed instances of harm) [18, 100, 190] and emerging risks of LLMs (anticipated risks based on foreseeable capabilities of LLMs) [108, 166]. Since LLMs can be used for a wide range of tasks associated with many different categories of harms, researchers have presented frameworks and evaluation methods to assess a particular type of LLM harm, including misinformation [74, 135], representation and toxicity [42, 64], human autonomy [65, 168], malicious use [38, 154], and data privacy [87, 97]. The popular methods to identify these harms include benchmarking [27, 28], user research [101, 106], and adversarial testing [41, 137]. Based on existing benchmarks and harm taxonomies of LLM risks, Weidinger et al. [191] introduce a sociotechnical evaluation framework that identifies three AI actors with LLM safety responsibilities: AI model developers, AI application developers, and third-party stakeholders.

The mitigation strategies for these harms depend on the use cases and context. Popular strategies include algorithmic and sociotechnical approaches [192], such as improving the training data to mitigate social stereotypes and biases [173]; fine-tuning LLM models on curated datasets [64]; filtering LLM outputs [194, 205]; employing special decoding techniques [93, 158], adding instructions in prompts [9], monitoring the use of LLMs [192]; as well as inclusive product design and development from the beginning [34, 36, 75, 83]. Building on this prior work, Farsight introduces a novel framework that leverages human-AI collaboration to help AI prototypers identify the potential harms of LLMs. Specifically targeting AI prototypers as one subset of AI application developers [183, 191], Farsight introduces novel techniques and in situ interfaces to foster responsible AI awareness during AI prototyping, although the current version of Farsight does not assist AI prototypers in mitigating potential LLM harms.

2.3 Responsible AI Tools and Practices

Despite the increasing emphasis on responsible AI from the technology industry [4, 7, 134, 201], academia [44, 96], and policymakers [81, 105, 177], incorporating responsible AI practices into AI product development remains a challenge [e.g., 183, 181, 2]— in part due to practitioners’ insufficient knowledge of responsible AI [e.g., 159, 141, 62], lack of engagement with direct stakeholders or domain experts [80, 103], and organizational culture and structure [104, 143].

To address these challenges and facilitate the adoption of responsible AI practices, researchers have proposed several approaches. These include incorporating ethics into AI education [56, 165, 172], providing engaging playbooks or design activities [11, 79, 206], and fostering ethical norms in AI research and development [99, 142, 171]. In addition, researchers have also proposed a wide range of tools to operationalize responsible AI practices [95, 198]. These tools encompass libraries and frameworks that cover various dimensions of responsible AI, including fairness [13, 155, 189], explainability [128, 150], testing and error analysis [119, 151, 182], and model development documentation [63, 122, 142].

Moreover, alongside these advancements, there has been a rise in the research and development of easy-to-use interactive visualization tools to further facilitate the operationalization of responsible AI. For example, tools like What-If Tool [195], FairVis [25], and Visual Auditor [124] enable ML developers to visually assess the fairness of ML models across a diverse range of inputs. Visual analytics systems such as Summit [77], LIT [179], and GAM Changer [187] empower ML developers to interpret their models and fix problematic behaviors. Interactive visual testing tools like Errudite [203], Angler [153], and AdaTest [149] help ML developers surface weaknesses in their models.

Inspired by these tools, Farsight joins the body of research of interactive visualization tools for responsible AI by visualizing use cases, stakeholders, and harms (section 4.3). In contrast to existing tools that target traditional ML models after they have been trained, Farsight focuses on diverse LLM-powered applications in an early prototyping stage. During this stage, AI prototypers have greater flexibility to iterate on the design and objectives of their applications and implement early mitigation strategies such as engaging with stakeholders and improving data collection [3].

2.4 In Situ Alerting Tools

Although in situ responsible AI tools are relatively nascent, there is a large body of research in designing in-context warning tools and interfaces. For example, security and HCI researchers study how to best present warnings to raise people’s online security awareness [e.g., 147, 109, 125] and protect people from malware and phishing attacks [e.g., 51, 145, 53]. The key challenges when designing effective warning interfaces include the presentation of comprehensible messages and supporting evidence [15, 54], engaging users [50, 202], and preventing alert fatigue and habituation [5, 17]. To address these challenges, researchers recommend designing simple interfaces [66, 67], considering the trade-off between blocking and non-blocking warnings [50], varying interfaces [5], and requiring user input [23].

Using in-context warnings to improve users’ safety awareness and encourage users to take protection measures can be considered a form of “digital nudging” [26, 160]. More recently, researchers have also adapted in-context security warnings to nudge social media users to recognize and avoid online disinformation [85, 163] and reflect before posting potentially harmful content [88, 169, 200]. Beyond platform-initiated integration of warnings, end-users also voluntarily seek in-context alert interfaces for productivity improvement. For example, writers use grammar checker tools like Grammarly [73], which offer in-context warnings and scores to improve their writing. Similarly, software developers use accessibility developer tools [40, 69] to detect potential accessibility issues during the development process. However, there has been little work in designing and evaluating in situ warnings for developing AI applications, particularly for responsible AI. Farsight’s design draws inspiration from many existing warning interfaces (section 3). Our work advances the landscape of in situ alerting research by addressing responsible AI for modern AI application development.

3 Formative Study & Design Goals

To identify the needs and potential challenges faced by users in envisioning harms, we conducted a formative co-design study to investigate (1) how AI prototypers envision harms (if they do), (2) what design ideas are most helpful for them, and (3) how to motivate users to think about potential risks when prototyping an AI application. In this section, we report our findings from the formative co-design study, and in section 6, we report on our findings from a subsequent evaluation user study.

3.1 Co-design Study

Participants. To inform our tool’s design, we conducted a co-design user study with 10 AI prototypers based in the United States. These participants were recruited from Google through internal mailing lists. Our recruitment criteria required participants to have experience using an internal prompt-crafting tool, PromptMaker [84], which is similar to Google AI Studio [70] and GPT Playground [133]. Each session was 60 minutes, and each participant received an average of $50 USD in their choice of a gift card or a donation to their preferred charity. Among the 10 participants (U1–U10), 6 identified as men, 3 identified as women, and 1 identified as non-binary. Four participants self-reported having expertise in responsible AI. Information about participants’ job roles is listed in Table 1. All participants are our targeted users (AI prototypers).

Table 1:

Participant Roles	Participant IDs
	1*, 4, 5, 6, 7, 10
Research Scientist	3, 8
Technical Writer	2
Program Manager	9*

Table 1: The co-design user study includes 10 participants with diverse roles. All participants have experience in prompting LLMs. Four participants who self-reported having expertise in responsible AI are marked with asterisks (*).

Procedure. We structured our study as a “during-design co-design study” [156]. Participants were asked to bring a recent prompt that they had written to the study. The study started with a semi-structured interview regarding participants’ prompting workflows and their experience in thinking about potential harms linked to their applications (section A.2). Then participants were asked to use our very early-stage design prototypes (section A.2) to envision potential harms associated with their application while thinking aloud. Participants were also presented with low-fidelity sketches for our other design ideas. These prototypes and sketches can be found in Figure S1. Finally, we asked participants to rate and provide feedback on all of our design ideas and generate their own design suggestions (section A.3).

Fig. 4:

Design feedback. Interestingly, although perhaps not surprisingly [cf. 78], none of the 6 participants without expertise in responsible AI reported that they typically considered the potential harms of their AI prototypes when writing prompts, while 3 of the 4 participants with expertise in responsible AI did report typically anticipating harms during the prototyping process. Participants’ ratings were shown in Figure 4. Overall, participants favored using AI to generate use cases of their AI prototypes, potential stakeholders, and potential harms. Many participants also highlighted the importance of being able to edit AI-generated content and control the generation direction (U4, U8). On the other hand, participants were less in favor of more distracting design ideas (e.g., an anthropomorphized assistant tool similar to Microsoft Office’s Clippy) or irrelevant content (e.g., the latest, rather than the most relevant AI incidents). Participants also provided us with helpful usability feedback that we integrated into our final design of Farsight (section 4).

New design ideas. Participants generated many interesting design ideas to help raise responsible awareness among AI prototypers. For example, participants recommended categorizing AI-generated harms (U1, U5), allowing users to rate the severity of harms (U6), and using users’ input to steer AI generation (U10). We integrated these design ideas into the final design of Farsight (section 4). Some other interesting design ideas include designing a game-like reward system to incentivize users to anticipate harms (U5), building online communities to allow users to share their envisioned harms using Farsight and seek support (U2), allowing real-time collaborative harm envisioning similar to Google Slides (U1, U4), and automatically revising a user’s prompt to address identified harms (U4). We discuss the implications of these design ideas in user motivation (section 7.1) and mitigation strategies (section 7.3).

3.2 Design Goals

Based on our literature review (section 2) and findings and early feedback from the co-design user study, we identify the following five design goals (G1–G5) for Farsight.

G1.

Guide users in imagining use cases. Existing research highlights the challenges faced by ML practitioners when attempting to anticipate the uses of their ML-powered applications and how different individuals or groups may be affected [20, 45, 103, 171]. Confirming this, software engineer U6 noted “You don’t really know how your tool could be used, so it’s really hard to envision what harms would be.” The availability of LLMs and prompt-crafting tools has broadened the spectrum of AI prototypers to include people without prior technology development experience [55, 84], which can further magnify the challenges associated with envisioning diverse use cases for AI applications. Therefore, we design Farsight to help AI prototypers with diverse backgrounds to brainstorm a wide array of use cases for their AI applications.

G2.

Help users understand, organize, and prioritize harms. Depending on an AI application’s goal, implementation, and context, some harms are more salient than others [24, 121]. To help AI prototypers assess harms, Farsight should first help them understand where and how harms might occur and who might be impacted, by connecting harms to use cases and stakeholders [12, 103, 120]. Participants expressed a desire for the ability to categorize (U1, U5) and rate the severity (U6) of harms. To meet these needs, we aim to design an easy-to-use interface that empowers users to navigate, comprehend, and label harms within diverse potential harm scenarios.

G3.

Fit into current workflows and mitigate habituation. In our co-design study, none of the 6 participants without expertise in responsible AI had previously thought about harms when writing prompts. We also found some participants were not incentivized to anticipate harm on their own; for example, U6 explained “To be honest, as a software engineer, I don’t use policy tools [compliance tools like checklists] unless I have to.” Thus, to make Farsight easy to adopt, we aim to take inspiration from in situ warning tools [e.g., 51, 145, 53] to design it in a way that fits into AI prototypers’ existing workflows instead of introducing new workflows. In addition, we aim to apply strategies like varying content [5] and promoting user input [23] to mitigate habituation—a common pitfall of in-context warning designs [5, 17].

G4.

Promote user engagement and provide compelling examples. Prior research highlights that the effectiveness of warning tools depends on their clarity and persuasiveness [15, 54]. As we are targeting AI prototypers with diverse experience in AI and responsible AI, Farsight should be easy to use and understand. When asked what would help them envision potential harms for their AI applications, many participants mentioned referring to prior examples of AI harms (U1, U2, U8). For instance, U2 said “Giving some specific real [harm] examples for different types of seemingly innocuous applications would help alert people [to consider harms].” Therefore, we aimed to integrate real examples in Farsight to motivate and help users understand the potential risk of their applications. Participants like being able to control the harm envisioning process (Figure 4), and active participation is a key factor in learning [92]—essential to foster AI prototypers’ ability to independently identify harms. Thus, Farsight is designed to provide users with human agency and encourage users to actively and critically think about harms.

G5.

Open-source and adaptable implementation. Given the ever-expanding array of LLMs and prompt-crafting tools [31], our approach in designing Farsight is to ensure it remains adaptable to this dynamically evolving landscape. We aimed to design Farsight to be model-agnostic and environment-agnostic, thereby making it accessible to users of different LLM models (e.g., Gemini [178], GPT-4 [132], Llama 2 [180]) and prompt-crafting interfaces (e.g., GPT Playground [133], Google AI Studio [70], Wordflow [184]). Furthermore, we open source our implementation to foster future advancements in the design, research, and development of responsible AI tools.

4 User Interface

Following the five design goals (G1–G5), we present Farsight, the first in situ interactive tool that aims to foster responsible AI awareness among AI prototypers during the AI prototyping process. Farsight is designed to be a plugin of any web-based prompt-crafting tools. Farsight’s interface employs progressive disclosure [129], enabling users to smoothly transition across three main components, with each phase increasing the level of user engagement. The Alert Symbol (section 4.1) presents an always-on symbol that shows the approximated alert level of a user’s current prompt; the Awareness Sidebar (section 4.2) highlights news articles about related AI incidents and LLM-generated use cases and harms; and the Harm Envisioner (section 4.3) visualizes LLM-generated harms and allows users to edit, add, and share harms. Examples in this section use PaLM 2 model through its APIs; we chose this model because it provided free API access to the public during our design process. Researchers and designers can easily replace PaLM 2 model with other LLMs by changing the API endpoints in Farsight.

4.1 Alert Symbol

Fig. 5:

The Alert Symbol is an always-on display on top of the AI prototyping tool, displaying the alert level of a user’s prompt (Figure 5). Every time the user runs their prompt, the Alert Symbol updates the alert level using the new prompt. Based on the computed alert level, there are three modes (Figure 5), each characterized by a progressively more attention-grabbing style. Thus, Farsight only disrupts AI prototypers’ flow when their prompts require more caution (G3).

Calculating the Alert Level. Auditing and quantifying the societal risk of LLM-powered applications is an open research problem [144]. To categorize the potential harms that might arise from users’ prompts, we propose a novel technique that uses the similarity between the prompt and previously documented AI incident reports as a proxy for the prompt’s alert level. First, we use an LLM to extract high-dimensional latent representations (embeddings) of all AI incident reports indexed in the AI Incident Database [111], which includes more than 3k community-curated news reports about AI failures and harms. Then, we extract the embedding of the user’s prompt and compute pairwise cosine distances between the prompt embedding and AI incident report embeddings. We label each incident report as

based on two distance thresholds 0.69 and 0.75. We determine these two thresholds from an experiment with 1k random prompts (see section B.1 and Figure S2 for details). Researchers can easily adjust these two thresholds (between 0 and 1) to calibrate an article’s relevancy.

Finally, we show the numbers of AI incidents that are classified as

in an orange circle and

in a red circle (Figure 5) as a proxy of the prompt’s potential risk. In other words, we consider a prompt to have a higher risk if many AI incident reports are semantically and syntactically similar to it.

4.2 Awareness Sidebar

Fig. 6:

After a user clicks the Alert Symbol, the Awareness Sidebar (Figure 6) expands from one side edge of the AI prototyping tool (G3), highlighting potential consequences of AI applications or features that are based on the user’s current prompt. We use a real prompt from Awesome ChatGPT Prompts [1] in the example in Figure 6.

Incident Panel. To encourage users to consider potential risks associated with their prompts (Figure 6A), the Incident Panel highlights news headlines of AI incidents that are relevant to the user’s prompt (Figure 6-B2). These incidents comprise the top 30 incident reports that are classified as

, sorted in reverse order based on their embedding’s cosine distances to the embedding of the user’s prompt. The thumbnails are color-coded based on the incident’s relevancy level. Users can click the headline or the thumbnail to open the original incident report in a new tab. These real AI incidents can function as cautionary tales [103, 199] reminding users of potential AI harms (G4).

Use Case Panel. To help users imagine how their AI prototype may be used in AI applications or features (G1), the Use Case Panel (Figure 6C) presents a diverse set of potential use cases that are generated by an LLM. Each use case is shown as a sentence describing how a particular group of end-users could use this AI application in a specific context. For example, for a writing tutor prompt, a potential use case can be “teachers use it to provide feedback on student writing.” (Figure 6-C1). We also use an LLM to generate a potential harm that could occur within that use case, shown below the use case sentence. For example, a harm for the teacher feedback use case can be “students may feel like they are not getting personalized feedback from their teachers.” We use few-shot learning to prompt the LLM to generate use cases and harms, whereas we generate use cases, stakeholders, and harms in Harm Envisioner (section 4.3). We open-source all of our prompts.

To help users assess and organize use cases and harms (G2), we also leverage an LLM to categorize each use case as

, or

, although we acknowledge that these may vary by use cases, development and deployment contexts, as well as relevant policies or regulatory frameworks in various jurisdictions. These three categories are introduced by responsible AI researchers to help ML developers structure their harm envisioning process [121]. The

use cases are those that align with the development target use cases. The

use cases encompass those that may arise in high-stakes domains, such as medicine, finance, and the law. The

category includes scenarios where malicious actors exploit the AI application to cause harms. The Use Case Panel organizes use cases and harms into three tabs (Figure 6-C1–3) based on their categories. The first tab, “mix”, provides an overview by showing one use case and its corresponding harm from each of the other tabs.

4.3 Harm Envisioner

Fig. 7:

Both the Alert Symbol and the Awareness Sidebar provide easy-to-understand in-context reminders to help users reflect on potential harms associated with their prompts. However, instead of passively reading AI incident reports and LLM-generated content, users desire to actively edit and add new use cases, stakeholders, and harms (Figure 4). Also, active participation—a key factor in learning—may help foster AI prototypers’ ability to independently identify harms. Therefore, we design Harm Envisioner (Figure 7) to support users in actively envisioning potential harms associated with their prompts (G4). We use a real prompt from Awesome ChatGPT Prompts [1] in the example in Figure 7.

Interactive Node-link Tree Visualization. After clicking the “Envision Consequences & Harms” button in the Awareness Sidebar, Harm Envisioner appears as a pop-up window on top of the prompt-crafting tool (Figure 7). It begins with a text box filled with an LLM-generated summary of a user’s prompt (Figure 7B). The user is prompted to revise the summary to align with the target task in their prompt. Next, the window transitions into an interactive node-link tree visualization [146], where the user can pan and zoom to navigate the view (Figure 7C). First, the window shows the user’s prompt summary as the root of the tree which is visualized as a text box. The user can click the root node and the LLM will generate potential use cases of an AI application based on the user’s prompt, and the use cases are visualized as the root’s children nodes. Similarly, users can click a generated node and the LLM will generate its children nodes (stakeholders and then harms). There is a max of four layers, following an order of the user’s prompt summary → use cases → stakeholders → harms. This layer order reflects the recommended harm envisioning workflow in responsible AI literature [12, 46, 103, 120, 121] and helps users to comprehend and organize diverse harms across different contexts (G2). For additional examples of LLM-generated use cases, stakeholders, and harms in Farsight, see Table S1.

Fig. 8:

Human-AI Collaboration in Harm Envisioning. Our goal is to use AI-generated harms to encourage users to reflect on potential downstream harms and inspire them to add, edit, or curate potential harms (G4). To do that, Harm Envisioner allows users to edit any tree nodes by clicking a button in the toolbar (Figure 7-C1) or entering new text in the tree node. In addition, users can delete (Figure 7-C2) and use the LLM to regenerate all of an edited node’s children nodes, to effectively steer the harm envisioning direction by offering feedback to the LLM (G4). To meet users’ needs of categorizing harms (G2), we use an LLM to classify each harm into a harm type based on a systematic review and taxonomy of AI harms [164]. Users can use the dropdown menu to change the harm’s category (Figure 8). To help users prioritize and take notes about harms, the Harm Envisioner allows users to rate the severity of each harm by clicking

in the toolbar. Finally, users can click

to export all content (e.g., use cases, stakeholders, and harms) in the Harm Envisioner as a Markdown file.

4.4 Open-source and Reusable Implementation

To make Farsight easily adoptable by both AI prototypers and AI companies (G5), we implement Farsight to be model-agnostic and environment-agnostic, and we open-source our implementation. Farsight uses LLMs by calling their public APIs so that users can use their preferred LLMs by easily replacing the API endpoints. To help AI companies and researchers integrate Farsight into AI prototyping tools, we leverage Web Components [115] and Lit [68] to implement Farsight as reusable modules, which can be easily integrated into any web-based interfaces regardless of their development stacks (e.g., React, Vue, Svelte). To help AI prototypers use our tool, we present a Chrome extension² that integrates Farsight into Google AI Studio and a Python package³ that brings Farsight to computational notebooks. We implement the interactive tree visualization using D3.js [19] and embedding similarity computation using TensorFlow.js [170] with WebGL [114] acceleration. Computational notebook support is implemented using NOVA [188].

5 Usage Scenario

We present a hypothetical usage scenario to illustrate how Farsight fosters responsible awareness among AI prototypers. Rosa is a native English speaker from the United States who recently traveled to Vietnam to teach English. She is the only English teacher at an under-resourced high school. Overwhelmed with grading English writing assignments for all students in the school, Rosa tries to develop an LLM-powered AI application that provides writing feedback based on a student’s essay. She writes her prompt (Figure 6A) in an AI prototyping tool with Farsight integrated. After running the prompt, Rosa notices the alarming Alert Symbol (Figure 6A), so she clicks on it, which expands the Awareness Sidebar (Figure 6-BC). Rosa reads a few related articles shown in the Incident Panel (Figure 6-B2). She finds these articles are indeed related to AI in education and are helpful, but they mainly focus on students using AI to cheat rather than teachers using AI to grade assignments. Rosa skims through the LLM-generated potential use cases and finds the use case “teachers use it to provide feedback on student writing” very relatable (Figure 6-C1). Intrigued by its associated harm “students may feel like they are not getting personalized feedback from their teachers”, Rosa clicks the Envision Consequences button and wishes to learn more about this use case and its associated potential harms.

Harm envisioner. Next, Farsight shows a pop-up window asking Rosa to revise and confirm an LLM-generated summary of her prompt (Figure 7-B). After clicking

, Rosa sees the Harm Envisioner presenting an interactive tree visualization showing the functionality of her AI application as a root node and multiple use cases as its children nodes (Figure 7-C). With a map-like interface, Rosa quickly uses zoom-and-pan to zoom into the teaching use case. After clicking

, the Harm Envisioner quickly generates the stakeholders associated with the use case and the harms associated with each stakeholder. Rosa takes some time to reflect on the LLM-generated harm of students not getting personalized feedback (Figure 7-Harm-1). She has never thought about this consequence before, but she thinks it makes sense—AI does not have background knowledge about each student, so its feedback would not be tailored to students’ individual needs. After rating it as very severe

by clicking

, Rosa continues reading other LLM-generated harms. She does not think the harm of teachers losing jobs to her AI tutor is relevant, so she deletes it (Figure 7-C2).

Human-AI collaboration. After seeing the random question “increased labor?” next to teacher (Figure 7-C3), Rosa thinks maybe it will be more time-consuming to review AI-generated feedback than grading students’ assignments herself, so she enters that harm into the Harm Envisioner. Next, Rosa is not sure about the legal liability of her school (Figure 7-Harm-3), but it might be worth discussing with other teachers. Finally, reflecting on her experience with the Harm Envisioner and AI incident articles, Rosa thinks the potential harms of her writing tutor AI application outweigh the potential convenience for her. Therefore, Rosa decides to stop prototyping this application. However, Rosa still sees value in leveraging LLMs in education, so she bookmarks related AI incident articles and clicks

to download all the content in the Harm Envisioner as a Markdown file. She will bring these resources to discuss with her colleagues the next day.

6 Evaluation User Study

We conducted a user study to evaluate Farsight’s effectiveness in aiding AI prototypers to anticipate the potential harms associated with AI features. In addition, we investigate how AI prototypers use Farsight during an early prototyping stage. To investigate the effect of user engagement in AI-assisted harm envisioning, we tested two variants of our tool: Farsight, including all components, and Farsight Lite, including only the Alert Symbol (Figure 1-B) and the Awareness Sidebar (Figure 1-C). In other words, Farsight Lite is a “subset” of Farsight. Farsight Lite only shows one direct stakeholder for each use case in the Use Case Panel, while Farsight allows users to interactively add more stakeholders, use cases, and harms in the Harm Envisioner (Figure 1-A). The study included 42 AI prototypers with diverse roles who were recruited from a large technology company based in the United States. In this user study, we aimed to investigate the following three research questions:

RQ1.

How do Farsight and Farsight Lite affect users’ ability for and approach to identifying potential harms?

RQ2.

How effective and useful are Farsight and Farsight Lite in assisting users in envisioning harms in comparison to existing commonly-used resources?

RQ3.

What challenges do AI prototypers face when envisioning potential harms during the AI prototyping stage? How do Farsight and Farsight Lite help AI prototypers address these challenges?

6.1 Participants

Table 2:

Participant Roles	Participant IDs
	3, 4, 5, 6, 7, 12, 13, 15, 16, 17, 19, 23, 25, 26, 28, 29, 33, 34, 35, 41, 42
Product Manager	1, 8, 10, 11, 14, 20, 24, 27, 36
Linguist	2, 21, 30, 31
AI Researcher	9, 18, 39, 40
UX Researcher	22
Data Scientist	32
Test Engineer	37
Marketing Specialist	38

Table 2: The evaluation user study included 42 participants with diverse roles and experience in prompting LLMs.

We recruited 45 voluntary participants from both internal mailing lists related to AI and snowball sampling at Google, based in the United States. The recruitment required participants to have experience in writing prompts for LLMs. In total, we received 61 responses, and we selected 45 participants based on their schedule availability. We conducted pilot studies using the first three study sessions, which were not included in our data analysis. As a result, we had a total of 42 participants. Each study session was either 90 minutes (n=28 sessions) or 60 minutes (n=14 sessions), depending on the participants’ availability. During the 90-minute sessions (or 60-minute sessions), each participant received an average of $62 USD (or $41) compensation in their preferred form such as gift cards and charity credits.

Among the 42 participants, 26 identified as men, 14 as women, and 2 preferred not to disclose their gender. Information about their job roles is listed in Table 2. Recruited participants self-reported an average score of 2.55 for their knowledge and experience with responsible AI on a 5-point Likert scale (Figure 10-top), where 1 represents “No experience” and 5 represents “Expert (I have helped others apply responsible AI practices).” In addition, participants self-reported an average score of 2.81 for experience with LLM prompting on a 5-point Likert scale (Figure 10-bottom), where 1 represents “Beginner” and 5 represents “Expert.” All participants are Farsight’s targeted users, AI prototypers.

Fig. 9:

Fig. 10:

6.2 Study Design

We conducted this study with participants one-on-one. Out of 42 sessions, 2 were conducted in-person, and 40 were through video conferencing software due to office locations and participants’ scheduling constraints. With the permission of all participants, we recorded the participants’ audio and computer screen for subsequent analysis. To start, each participant signed a consent form and filled out a survey regarding their familiarity with responsible AI and LLM prompting (Figure 10). Then, participants were randomly assigned to one of six conditions taking into account their time availability: C_FG, C_F, C_LG, C_L, C_GF, C_GL (Figure 9). C stands for the study condition, C_FG means that participants used Farsight first and then Envisioning Guide, and C_L means that participants only used Farsight Lite—the other acronyms follow this same pattern. Sessions of C_F and C_L were scheduled for 60 minutes each, while the remaining sessions were allotted 90 minutes each. We assigned 7 participants to each condition, as this was the maximum number that allowed for an equal distribution of participants across all conditions, given the time constraints and the availability of the 61 individuals who signed up for the study.

Our study followed a mixed design that combines both between-subjects and within-subjects designs [161]. Each session included three or four harm-envisioning activities, denoted as H1, H2, H3, and H4 (section 6.2.2), as well as one or two semi-structured interviews to collect participants’ feedback (section 6.2.3). In each harm-envisioning activity, participants were asked to envision potential harms associated with a particular AI feature while thinking aloud (Figure 9). In H1 and H3, participants envisioned harms on their own, whereas in H2 and H4, they could use a harm envisioning tool we assigned them based on their study condition (e.g., Farsight, Farsight Lite, or Envisioning Guide). All collected harms were rated by seven researchers with experience with responsible AI evaluations, who assigned each potential harm numeric scores in terms of their likelihood and severity (section 6.2.4). We compared the envisioned harms in H1 and H3 (between-subjects) to investigate how different tools affect users’ ability and approach to anticipating harms (RQ1). We compared the envisioned harms in H2 and H4 (within-subjects) to assess the effectiveness of different tools in helping users envision harms (RQ2). Besides the quantitative data on the number and ratings of potential harms, we also collected qualitative data from think-aloud and two interviews (RQ1–RQ3). We incorporated 60-minute sessions (C_F and C_L) into our study design due to challenges in recruiting participants available for a 90-minute duration.

Fig. 11:

6.2.1 Baseline Harm Envisioning Tool.

To compare our work against current responsible AI workflows, we created a baseline intervention Envisioning Guide: a combination of Microsoft’s Harm Modeling Practice [120] and the Harm Taxonomy from Shelby et al. [164]. These two resources are the latest and the most representative resources designed to help practitioners envision harms. We combined them because (1) we aim to simulate the current practice where AI prototypers can choose from various existing harm envisioning tools, and (2) we do not intend to study the causal effects of any specific resource. We administered this intervention by providing a Google Doc containing a detailed table and information from these resources (Figure 11). Both resources were designed to help technology developers and researchers anticipate and prevent negative societal impacts of their technology innovations.

6.2.2 Harm Envisioning Activities.

Depending on the conditions, the study included three or four harm envisioning activities (H1–H4). Within each harm envisioning activity, participants were presented with a description of an AI feature and the prompt that generated that feature. We chose the four AI features (Figure 9) based on a qualitative analysis of 100 randomly sampled internal prompts written by real AI prototypers. These four features are representative of popular LLM tasks (e.g., summarization, classification, and question answering) and comprehensible to participants with diverse roles. In H1 and H3, participants independently envisioned harms, whereas in H2 and H4, they were provided with a harm envisioning assistance tool (e.g., Farsight, Farsight Lite, or Envisioning Guide). To emulate AI prototyping workflows, we asked participants to perform simple prompt engineering tasks in H2 and H4 before envisioning potential harms of presented AI features.

For each harm, participants were instructed to describe who would be affected (i.e., the stakeholders) and how the stakeholder might be harmed. We provided a harm example for a code generation AI feature: “App end-users might face financial loss due to AI-introduced software vulnerabilities.” During the process, participants were asked to share their screens and verbalize their thoughts. They were also asked to enter their envisioned harms into a Google Doc table featuring a who column and a how column. Moreover, participants had the option to articulate the harm verbally, and we transcribed it into the table. At the end of each harm envisioning activity, we reviewed the table together with the participants to ensure the accuracy of the harm descriptions. Participants were instructed to achieve three objectives: (1) envision as many harms as possible; (2) envision the most likely harms; and (3) envision the most severe harms.

H1: Pre-task. To understand how participants independently envision potential harms before using the tool, as a baseline for RQ1, participants were asked to anticipate potential harms concerning an LLM-powered email summarizer on their own (Figure 9). They received information about the AI functionality: “Shorten and improve a user’s email”, a development context, and a prompt that enables this functionality (see details in section C.1). The duration of this activity was limited to 10 minutes.

H2: Intervention. In the second harm envisioning activity, we asked participants to use different harm envisioning assistant tools. Depending on the assigned condition, a participant could use Farsight (C_FG, C_F), Farsight Lite (C_LG, C_L), or Envisioning Guide (C_GF, C_GL) to help them anticipate harms. The activity began with a tutorial on the designated tool. The AI feature used in this activity was an LLM-powered toxicity classifier (Figure 9). Participants received information regarding the AI functionality “Detect toxic text content,” a development context, and a prompt that enables this AI functionality (section C.2). To emulate AI prototyping workflows, we also tasked participants with a simple prompt engineering assignment (section C.2).

After completing prompt engineering, participants envisioned harms linked to the toxicity classifier. They were instructed to freely use the assigned tools while sharing their screens and thinking aloud. For participants assigned with Envisioning Guide (C_GF, C_GL), the process of entering envisioned harms was the same as H1. Participants assigned with Farsight (C_FG, C_F) or Farsight Lite (C_LG, C_L) could click a button in the tools to export all harms as a text file. The export included both AI-generated harms and harms added or modified by participants. Participants were asked to copy the harms into the Google Doc. As a significant portion of these harms were generated by AI, we asked participants to select harms that (1) they agreed with and (2) would report to their colleagues and managers. Also, participants were welcome to add more harms to the table. For our analysis, we only included the exported harms that participants had selected and added to the table. The duration of this activity was limited to 25 minutes.

H3: Post-task. To understand how the intervention may have affected participants’ ability to independently envision harms (RQ1), we asked participants to envision harms associated with an LLM-powered article summarizer on their own (Figure 9). To ensure a valid comparison between the envisioned harms and participants’ approaches to the pre-task (H1), we introduced a parallel AI summarizer feature in this activity that was isomorphic to the pre-task [139]. In particular, to deter participants from directly reusing their envisioned harms from H1, we replaced the email summarizer in H1 with an article summarizer. The AI functionality was described as “Summarize an article in one sentence”. The development context and prompt are available in section C.3. The duration of this activity was limited to 10 minutes.

H4: Alternative. To assess the effectiveness and usefulness of Farsight and Farsight Lite in comparison to Envisioning Guide (RQ2) and study the usage patterns of different tools (RQ3), n = 28 participants engaged in 90-minute sessions (C_FG, C_LG, C_GF, and C_GL) to envision harms using a tool different from the one used in H2 (Figure 9). Participants were asked to envision potential harms associated with an LLM-powered math tutor app with the AI functionality “Answer math-related questions”, a development context, and a prompt (section C.4). The procedure for this activity paralleled H2, including a tutorial, prompt engineering exercise (section C.4), and harm envisioning. This activity’s duration was limited to 25 minutes.

6.2.3 Semi-structured Interviews.

This study included two semi-structured interview sessions (Figure 9). The first interview took place after the post-task activity (H3), where we asked participants to reflect on their process for anticipating potential harms during the LLM prototyping process, and how their approach may have changed after the intervention (RQ1). We also asked participants about their challenges in harm anticipation, their experiences of using harm envisioning tools, and potential actions they would take to address the identified harms (RQ3, section D.1). After participants in 90-minute sessions (C_FG, C_LG, C_GF, and C_GL) finished H4, we asked them to compare and rate the usefulness and usability of the two tools they had used in this study (RQ2). We also asked them to rate the helpfulness of different components in the tools on a 5-point Likert scale, as elucidated in section D.2.

6.2.4 Harm Rating.

After completing all 42 study sessions, we recruited seven raters to rate all 989 harms collected in H1–H4 to evaluate participants’ ability to envision harms. These seven raters included four of the paper authors and three industry researchers; all raters had experience with responsible AI (unlike many of the participants)—either as responsible AI researchers, developers of responsible AI tools or playbooks, or in a consultant role on responsible AI for product teams. Ideally, evaluations of identified harms would involve both domain experts for the domain in question (e.g., education) and/or stakeholders from demographic groups or communities who may be likely to experience those harms. For this preliminary study, due to timing and resource constraints, we recruited responsible AI researchers as raters instead of specific domain experts or people impacted by AI applications. The limitations of this approach are further discussed in section 6.7 and 7.2

Our collected harms were either (1) directly envisioned by participants or (2) exported from Farsight or Farsight Lite and subsequently curated by participants during H2 and H4. Each harm included the impacted stakeholders and a description of the harm. After removing duplicates and random shuffling, we randomly and evenly assigned harms to raters via spreadsheet format. Raters had access to the details of the intended AI feature of each harm, including the prompt and the context of the AI feature (e.g., the prompt and context in section C.1). To prevent the raters from being influenced by our hypotheses, we did not include the experimental conditions in the rating sheet. In other words, raters did not know if a harm was from a Farsight user, a Farsight Lite user, or a Envisioning Guide user. To mitigate rating noise, we designated three raters for each harm. As identifying likely and severe harms is often an objective in AI harm envisioning exercises [120, 142], we asked raters to rate each harm’s likelihood and severity on a 4-point Likert scale (strongly agree, agree, disagree, and strongly disagree to statements “This harm is likely to occur for this stakeholder” and “This harm will severely impact this stakeholder”). Raters could also choose an N/A option if they perceived a rating was not applicable for that feature or use case. During data analysis, we numericalized these four categories as ordinal scores: 1, 2, 3, 4 and removed N/As. See Table S2 for a random subset of harms that were collected from participants and their corresponding ratings.

6.3 Data Analysis

We applied a mixed-methods approach for data analysis. First, we conducted a quantitative analysis (section 6.3.1) on the changes in participants’ ability to envision harms by comparing pre-task H1 to post-task H3 responses (RQ1). We also quantitatively assess three different tools’ effectiveness in helping users anticipate harms by comparing H2 and H4 responses (RQ2). The quantitative analyses involved metrics such as the total number of envisioned harms, as well as the average likelihood and severity ratings of envisioned harms across 3 raters. Next, we performed a qualitative analysis (section 6.3.2) on transcripts from think-aloud sessions and interviews to further investigate participants’ strategies and challenges in envisioning harms, and usage patterns of different tools (RQ1–RQ3).

Fig. 12:

6.3.1 Quantitative Analysis.

We first conducted quantitative analyses on the count, likelihood, and severity of harms across different conditions to evaluate the effectiveness of our tools (RQ1, RQ2). We measured the likelihood and severity for each harm using the average of ratings from three raters after removing any N/As. The average pairwise weighted Cohen’s kappas [32, 112] for likelihood and severity ratings are 0.14 and 0.09 (see Figure S3 and section E for details). These values fall within the range of slight agreement [94]. We discuss this relatively low inter-rater agreement in 7.2 162] show all measures, except for the changes of harm count between H1 and H3 with Envisioning Guide, follow a normal distribution. We used t-tests with Bonferroni corrections for multiple hypothesis testing.

We also analyzed participants’ ratings of the tools’ usefulness and usability when comparing the two tools used in the study (RQ2, section 6.5.3). We converted the 5-point Likert scale ratings into numerical values and assessed the difference between ratings of our tools and Envisioning Guide using Mann-Whitney U tests [107]. Considering that most of the ratings did not exhibit a normal distribution, we chose to use Mann-Whitney U tests, as these tests do not assume normality in the data. See section 6.5.3 for discussion of the findings from these questions about usefulness and usability.

6.3.2 Qualitative Analysis.

We conducted a qualitative analysis on the screen recordings and transcripts of the study sessions that include participants’ verbalized thoughts during the harm envisioning activities (H1–H4) and interviews. All study sessions were screen-recorded and audio-recorded, with the audio subsequently transcribed by the video conferencing software. We adopted an inductive thematic analysis approach [21, 116] and open coded the 56-hour-long transcripts using the qualitative analysis software Dovetail [47]. After generating a codebook, we applied deductive coding [116] to assign harm envisioning patterns to each participant during H1 and H3 (RQ1, 6.4.2

6.4 Findings: Changes in Users’ Envisioning Ability and Approach (RQ1)

In the study, participants were asked to independently envision harms associated with an email summarizer (H1) and an article summarizer (H3) before and after using a harm envisioning tool (Farsight, Farsight Lite, or Envisioning Guide) to anticipate harms for a toxicity classifier (H2). We quantitatively and qualitatively compared participants’ envisioned harms and approaches in H1 and H3 across different conditions in H2.

6.4.1 Farsight and Farsight Lite Improved Users’ Ability to Envision Harms.

For each participant, we compared the count, average likelihood, and average severity of their independently envisioned harms before (H1) and after (H3) the intervention (Figure 12). Using paired sample t-tests with Bonferroni correction [48], we found that after using Farsight and Farsight Lite, users could envision significantly more harms on their own (p = 0.0028, p = 0.0003), showing an average increase of 2.42 and 3.00 harms, respectively. The effect sizes, as measured by Cohen’s d [33], were d = 1.21 and d = 1.27, indicating a very large effect [157]. On the contrary, for participants using Envisioning Guide, the average count of identified harms experienced a marginal decrease ( − 0.14). We hypothesize that the observation of three participants identifying fewer harms after using Envisioning Guide (see the outliers in Figure 12 A) is because Envisioning Guide had a high cognitive load. The high cognitive load may have resulted in these three participants having less energy to envision harms in H3 compared to H1. Changes in the average likelihood and average severity, on the other hand, were not statistically significant for any of the interventions (Figure 12-BC). Our finding implies that after using Farsight and Farsight Lite, users could anticipate a greater number of harms linked to AI features independently, while the average likelihood and severity of identified harms remained unaltered.

Table 3:

Harm Envisioning Pattern	Description
Failure-mode-driven envisioning	Participants envisioned harm by initially considering the AI feature’s failure modes (e.g., wrong summarization), limitations of LLMs (e.g., hallucination), or vulnerabilities within system implementation (e.g., data storage). This pattern is similar to a Failure Mode and Effects Analysis [152].
Usage-driven envisioning	Participants envisioned harm by initially considering who may be impacted through this feature and in what usage scenario, such as students using the article summarizer for completing assignments. Then, participants envisioned potential harms that might impact the stakeholders within the identified scenario.
Consider high-stakes uses	Participants deliberately thought about high-stakes use cases of the AI feature, such as being used in medical, financial, and legal domains.
Consider misuses	Participants deliberately envisioned potential misuse of the AI feature, where malicious actors like scammers and hackers could exploit this AI feature to cause harm.
Consider indirect stakeholders	Participants deliberately brainstormed stakeholders indirectly impacted by the AI feature, such as people who did not use the AI tools, individuals mentioned in the input text, and society at large.
Consider cascading harms	Participants deliberately considered (1) harms that could result from other harms, such as productivity loss due to AI errors can lead to economic loss; or (2) harms that might occur even when the AI feature operated as expected, such as students using AI to cheat in homework.

Table 3: We identified six non-exclusive common patterns in independent harm envisioning by analyzing transcripts of participants’ think-aloud process during the harm envisioning activities in H1 and H3.

6.4.2 Changes in Harm Envisioning Approaches.

We also investigated the impacts of different tools on participants’ approaches to harm envisioning by analyzing their self-reports in interview 1 and the think-aloud data in H1 and H3.

Self-reported changes after using Farsight and Farsight Lite. The major themes of self-reported changes are similar between Farsight and Farsight Lite. A large number of participants noted that while they initially considered the AI feature and its potential harms in a general sense during H1, they shifted towards a more focused consideration of specific use cases and stakeholders in H3 (e.g., P23, P34, P38). Some participants highlighted they started to brainstorm potential misuses in H3 (P25, P32). For stakeholders, participants broadened their consideration to people and organizations not initially considered during H1. P40 acknowledged a transition from a focus on “protecting the AI company” in H1 to considering end-users in H3. Similarly, P17 reported a focus on end-users after using Farsight:

“Earlier maybe I was coming towards it from a very engineering or a very broad feature perspective. The third time, I was thinking more about people who were actually using the product and getting affected. So I was thinking more with respect to the people using it, rather than that being a feature in some application.” (P17)

Many participants also highlighted that they began to adopt the frameworks presented in Farsight and Farsight Lite (e.g., P9, P10, P32) to structure their harm envisioning procedures. For example, P10 and P32 appreciated the categorization of use cases, and they reported considering intended uses, high-stakes uses, and misuses in H3. After using Farsight, P9 said they followed the sequence of layers in the tree visualization to conceptualize use cases, stakeholders, and harms:

“I found that sort of flow from identifying potential use cases, then identifying stakeholders of those use cases, then identifying potential harms for each of the stakeholders to be really valuable. That’s a great way to scaffold it and think through the flow rather than just sort of bouncing around, which is what I had been doing [in H1]. So yeah, I found that super valuable that has changed the way that I think about it. And that’s the framework that I’ll use in the future.” (P21)

Self-reported changes after using Envisioning Guide. Many participants using Envisioning Guide in H2 (C_GF, C_GL) also noted shifts in their approaches to envisioning harms. Several participants noted that they started to follow the structure outlined in the Harm Modeling Guide to envision harms (P8, P40, P42). Some participants started thinking more about under-represented social groups in H3 (P8, P31). Furthermore, many participants described the harm taxonomy as a “mental checklist” that provided them with a language to articulate and think about harms (e.g., P6, P14, P31).

Fig. 13:

Observed changes in envisioning approaches. By analyzing transcripts of participants’ think-aloud process during the harm envisioning activities in H1 and H3, we identified six non-exclusive common patterns in harm anticipation (Table 3). Then, we examined the effects of different interventions on participants’ envisioning patterns by comparing the number of participants who applied and did not apply these six patterns in H1 and H3 across interventions (Figure 13). The intervention assignment is random.

Interestingly, the counts of participants who applied each pattern in H1 were consistent across interventions, with the exception of Farsight Lite where notably more participants considered indirect stakeholders in H1 (Figure 13-5). Before the interventions, the majority of participants relied on failure-mode-driven envisioning when anticipating harms (Figure 13-1), focusing on the AI feature’s limitation, failure modes, and technical implementation details. This observation corroborates participants’ self-reported envisioning approaches, where participants like P17 acknowledged having a “very engineering or a very broad feature perspective” in H1.

After the intervention, we observed that all three harm envisioning tools (Farsight, Farsight Lite, and Envisioning Guide) influenced participants to adopt a usage-driven envisioning approach when independently envisioning harms (Figure 13-2). Particularly, Farsight had the most pronounced effect, followed by Farsight Lite and then Envisioning Guide. All these tools encouraged participants to think more about high-stakes uses (Figure 13-3) and indirect stakeholders (Figure 13-5). Both Farsight and Farsight Lite exerted a stronger influence on considering misuses (Figure 13-4) and cascading harms (Figure 13-6) compared to Envisioning Guide. However, Envisioning Guide had slightly more impact than Farsight Lite in encouraging consideration of high-stakes uses (Figure 13-3) and indirect stakeholders (Figure 13-5).

Interestingly, Farsight had a notably more pronounced effect in leading participants to consider indirect stakeholders (Figure 13-5) and cascading harms (Figure 13-6) than the other tools. For indirect stakeholders, a possible explanation is that during H2, many participants encountered unexpected indirect stakeholders revealed by Farsight (6.5.2 Farsight Lite in fostering consideration of indirect stakeholders, as Farsight Lite had only identified one direct stakeholder for each use case, and participants could not use AI to generate more stakeholders.

For cascading harms, we hypothesize two potential explanations. First, many participants applied a reviewing approach when engaging with AI-generated harms in Farsight and Farsight Lite, where they tried to understand and make sense of these harms. In H2, reviewing existing harms prompted participants to consider cascading harms that might arise from other harms (6.5.2 unexpected AI-generated cascading harms in H2 (6.5.2

6.5 Findings: Farsight’s Effectiveness in Assisting Harm Envisioning (RQ2)

In addition to assessing the impacts of different harm envisioning tools on users’ ability to independently envision harms, we also evaluated the tools’ effectiveness in aiding users to anticipate harms. Specifically, we quantitatively compared participants’ envisioned harms when using different harm envisioning tools in H2 and H4. Furthermore, we qualitatively analyzed participants’ usage patterns, interview responses, and survey data.

Fig. 14:

6.5.1 Farsight and Farsight Lite helped users envision more harms.

We compared the count, average likelihood, and average severity of harms collected in H2 and H4 using our tools, Farsight and Farsight Lite, against the baseline Envisioning Guide (Figure 14). These harms were identified by participants using different harm envisioning tools or generated by AI and selected by the participants. This analysis followed a within-subjects approach, including 28 participants from C_FG, C_GF, C_LG, and C_GL. In each comparison, such as Farsight vs Envisioning Guide, a total of 14 participants used both tools, with 7 of them starting with Farsight in H2 (C_FG), and the remaining 7 beginning with Envisioning Guide (C_GF). Results from paired t-tests, adjusted with Bonferroni correction, highlighted that participants using Farsight and Farsight Lite resulted in a significantly higher number of harms compared to those using Envisioning Guide (p = 0.0018, p = 0.0034), with an average difference in the count of 4 (Figure 14A). The effect sizes, as measured by Cohen’s d, were d = 1.57 and d = 1.48, indicating a very large effect. However, no significant differences were observed regarding the likelihood and severity of identified harms between our tools and Envisioning Guide (Figure 14-BC). Our findings suggest that our tools are effective in assisting users to identify a greater number of harms compared to existing resources, while the quality of the identified harms remains consistent.

6.5.2 Usage patterns.

We summarized how participants use Farsight and Farsight Lite in H2 and H4.

Trying to understand (unexpected) AI-generated content. Upon encountering AI-generated content (e.g., use cases, stakeholders, and harms), participants first sought to (1) understand why AI had generated it and then (2) assess its likelihood and relevance to their AI application. For example, for the toxicity classifier in H2, Farsight and Farsight Lite sometimes would generate a use case “HR departments use it to screen job applicants for toxic behaviors.” This use case was usually unexpected to participants, and it provoked them to think how an HR department could employ a toxicity classifier. Some participants imagined that the HR could use this classifier on applicants’ social media to identify red flags (e.g., P10, P11, P29), while others could only see it being used on applicants’ cover letters (P4). Participants then assessed how likely and relevant is this scenario before diving into related harms.

Subjectivity in apprehending auto-generated content. We observed that based on participants’ prior experiences, they could have very different views on auto-generated content in Farsight. For example, participants had different perceptions of how their companies’ HR division might use a toxicity classifier (e.g., applying the classifier to job applicants’ social media content or their application material). Also, for the toxicity classifier in H2, the Incident Panel would often show an incident report on biases in sentiment analysis tools. While some participants could quickly make the connection between sentiment analysis and toxicity classification and reflect on biases in toxicity classifiers (P10, P36), others would overlook this incident (P19, P38).

In some cases, participants’ disagreement came from their different definitions of harm. For example, in both H2 and H4, our tools would generate potential harms for people who do not use the AI applications, such as “students who do not use the math tutoring app may feel left behind.” Some participants perceived these harms as crucial considerations for assessing the impacts of AI applications (e.g., P6, P18, P30), while others argue against considering harms when an AI feature is absent (e.g., P4, P9, P13). We discuss the implications of subjectivity and rater disagreement in harm envisioning in 7.2

Sparked to brainstorm new harms. The content in Farsight and Farsight Lite often inspired participants to brainstorm new use cases, stakeholders, and harms. After seeing an AI-generated stakeholder, many participants could quickly identify potential harms for that stakeholder. For instance, seeing the stakeholder teachers in the math tutoring app in H4, P22 added a new harm that teachers may struggle to integrate this tool into their existing teaching workflows. Many participants also came up with new harms by making connections across different AI-generated use cases, stakeholders, and harms. For example, Farsight anticipated two use cases for the toxicity classifier: (1) online moderators using it to identify toxic content, and (2) hate groups using it to recruit people. P2 connected both use cases and added a new harm: “online moderators could face death threats from hate groups who feel their speech is censored.”

Thinking beyond immediate harms. Instead of starting with a blank slate, our tools provided participants with initial materials that prompted them to think beyond the immediate harms and envision cascading repercussions. For example, after seeing the AI-generated harm “job applicants might be unfairly rejected” within the context of HR using a toxicity classifier to screen job applicants, P38 quickly thought of a cascading harm—the company’s diversity hiring effort could be harmed, as the toxicity classifier was more likely to misclassify and reject under-represented social groups. Similarly, P18 recognized in the long run, the hiring company could lose money due to the exclusion of qualified candidates caused by a biased toxicity classifier. This usage pattern might explain the increase of participants, who used Farsight and Farsight Lite in H2, independently envisioning cascading harms in H3 (Figure 13-6).

Fig. 15:

Fig. 16:

Thinking about mitigation strategies. Interestingly, after seeing AI-generated harms, many participants voluntarily considered actions and strategies to take after envisioning harms. For example, after seeing AI-generated harms for the toxicity classifier, P15 and P16 noted that it was important to allow impacted stakeholders to appeal if their content was removed because of the classifier. Similarly, P27 and P40 noted that people should implement a human review process if the toxicity classifier was used to remove social media content. Interacting with Farsight and Farsight Lite also encouraged participants to reflect on their prompting workflows. For example, P29 and P37 mentioned that the AI prototypers should start collecting good and diverse toxicity examples to improve the prompt through few-shot prompting. P2 noted that they would like to add additional instructions in their prompt to safeguard against biased output and potential data leakage. Finally, after envisioning more harms, P2 mentioned that they would rethink if it was worth continuing to prototype or develop this AI feature.

6.5.3 Our tools were usable, useful, and preferred by users.

We asked participants who had used one of our tools and Envisioning Guide (C_FG, C_GF, C_LG, C_GL) to compare and rate the usefulness and usability of the tools they had used on a 5-point Likert-scale. By comparing their ratings, we found both Farsight and Farsight Lite were preferred and considered as more helpful, easier to use, and more enjoyable compared to Envisioning Guide (Figure 15). Both tools had significantly higher ratings on “easy to use” than the baseline (p = 0.0384, p = 0.0260). In addition, Farsight was rated significantly more helpful than the baseline (Figure 15A), while Farsight Lite was more enjoyable (Figure 15B). The effect sizes of significant results, as measured by the common language effect size [110], were all above 0.7, indicating a large effect.

Usefulness of different features. Besides comparing the two tools, participants also rated the usefulness of specific features in each tool. The average ratings are shown in Figure 16. All features in our tools were rated favorably (Figure 16-AB). For Farsight, participants especially liked the interactive tree visualization. For example, P6 commented, “This tree makes a lot of sense. This is how I think about it in my brain as well.” Similarly, P16 appreciated the progressive disclosure in the visualization: “I’m able to not get overwhelmed by everything all at once.” The rating for the AI incident panel (in both Farsight and Farsight Lite) is relatively lower than other features. Participants explained that the surfaced incidents were not very relevant to their prompts (P39, P41), and the feature would require them to take time to read external articles (P24, P39).

6.6 Findings: Farsight’s Role in Overcoming Harm Envisioning Challenges (RQ3)

After completing the post-task (H3), participants were asked to reflect on the biggest challenges encountered in envisioning harms associated with AI features. We examined the major themes that emerged from these challenges. In addition, by analyzing participants’ usage patterns of Farsight and Farsight Lite, coupled with their interview feedback, we explored how our tools mitigate certain challenges and also identified our tools’ limitations.

6.6.1 Challenges in envisioning harms.

We summarized three major challenges that participants encountered.

C1.

Envisioning use cases. The most prevalent challenge in envisioning harms is to anticipate different use cases for an AI feature. Multiple participants noted that it was most challenging to imagine how different people would use technology, and it was particularly difficult to “put myself in someone’s shoes” (P27, P37, P39) and “empathize with different groups of people” (P11). Participants also underscored the vast space of possible use cases (P31, P33, P36), and “often you don’t find out the edge cases until you actually work with it” (P2). Some participants also emphasized that it sometimes required creativity to imagine how an AI feature would be used and especially misused (e.g., P5, P22, P23).

C2.

Bias and subjectivity in harm envisioning. Interestingly, several participants recognized their own biases in envisioning harms (e.g., P6, P21, P31). For example, P21 noted the challenge in overcoming their biases in anticipating the impacts of AI features: “I had been coming at it from a very American-centric point of view at first. To talk about bias, I hadn’t even conceived of the government using this to monitor my phone, but that could happen in other places.” Moreover, some participants acknowledged the subjectivity in the definition of harms, as well as in the assessment of harms’ likelihood and severity. For example, while envisioning harms and selecting harms to report (H2 and H4), some participants were conscious of whether other people would agree with their identification and assessment of harms (P19, P38).

C3.

Inexperience and discomfort in harm envisioning. Many participants mentioned that our study was their first time to envision harms for AI features (e.g., P17, P26, P28). For example, P26 noted “I have never envisioned harm before. This is not something I would think of when developing AI products.” Similarly, P18 said “I’m familiar with technical issues but not their social influence”. Also, P30 pointed out that there were few incentives for developers to envision harms. In addition to unfamiliarity, some participants also noted that it was uncomfortable and sad to think about harms (P3, P12). For example, P3 said “It’s not comfortable thinking through all the bad things that can happen. I think in general people don’t like thinking about bad things too much.”

6.6.2 Farsight and Farsight Lite address major challenges.

Our tools could help users address identified challenges.

A co-pilot for brainstorming diverse use cases. Many participants appreciated that our tools provided them with a starting point to predict use cases (e.g., P8, P29, P41). For example, after seeing a few AI-generated use cases, P8 found it much easier to envision other use cases, and similarly, P24 felt empowered to “have a wider net to cast” (C1). Also, P14 noted that even seeing far-fetched AI-generated content helped them brainstorm new use cases. On the other hand, P21 appreciated that Farsight had identified many unexpected and thought-provoking use cases that provided a different perspective in anticipating harms (C2).

In situ guide that promotes user agency. Participants especially liked that our tools were directly integrated into existing AI prototyping tools and contextualized based on the prompt (e.g., P19, P31, P37), where Farsight and Farsight Lite required minimal effort to get started envisioning harms (C3). Participants also thought the Incident Panel and Use Case Panel as a good reminder for potential harms for the AI feature that one is prototyping (e.g., P12, P41, P42). For example, P12 commented that “Even if it’s just sitting there, it would be educational.” Many participants also liked the interactivity of our tools and found it engaging for adding new use cases, stakeholders, and harms (e.g., P9, P19, P24)—many of them noted that Farsight was so intriguing that they would like to continue using it to explore potential harms (C3). Participants felt they had agency in harm envisioning when using Farsight. For example, P21 commented “If you think something [AI-generated content] is totally bonkers, whatever, just delete it.” Similarly, P4 and P5 compared the Harm Envisioner to a mind map, as they appreciate that the interface allows them to freely organize and revise their thoughts in harm envisioning.

6.6.3 Limitations of Farsight and Farsight Lite.

Our findings showed that, in comparison to Envisioning Guide, Farsight and Farsight Lite did not show significant differences in participants’ ability to envision more likely or more severe harms (section 6.4.1), nor did they assist participants in envisioning more likely or more severe harms (section 6.5.1). Additionally, participants’ feedback revealed two major limitations of our tools.

Varied quality of LLM-generated content. Depending on participants’ prompts, the related AI incidents in the Incident Panel, and LLM-generated use cases, stakeholders, and harms were different across participants. Sometimes, participants found a few LLM-generated content confusing and unhelpful. For example, when using our tools on the math tutor prompt (H4), the incidents in the Incident Panel feature articles about hallucination in chat-based LLM models. Some participants found these articles too generic and not relevant to the math tutor app (P39, P41).

Also, some LLM-generated use cases could be too far-fetched. For example, for the math tutor prompt, Farsight sometimes showed a use case: “Scammers use it to explain complex investment schemes to potential victims.” While some participants found it interesting and relevant (e.g., P14, P26), others found it unrealistic and not useful (e.g., P6, P12). This disagreement highlights the subjectivity in identifying and assessing harms (7.2 Even if it’s wrong [LLM-generated use case], it is still kind of helpful to think beyond the immediate use case and who else can use this tool.” Similarly, P21 said “Some of these feel more of a stretch but it’s interesting because I could see how it gives me ideas for things to watch out for which I still appreciate.”

Lack of actionability. Another limitation is that our tools did not provide users with actions to prevent or mitigate identified harms (P13, P22, P34). P13 also commented that increasing awareness without providing actions to address responsible AI issues could be harmful, because “People have an empathy quota, and it might just be displacing more impactful efforts.” Related to the discomfort that some participants had experienced when envisioning harms (C3), P40 mentioned that they felt scared and overwhelmed because there were so many possible harms and they did not know how to address them. Similarly, P29 noted that the lack of actionability made them feel anxious and disappointed:

“I’m glad that I got to know about them [potential misuses]. But I feel I’m vulnerable, probably because I can’t do much about stopping them. So that’s something that really makes me feel very disappointed. Because unless we do case-by-case analysis, this [preventing misuses] can be very tricky. I feel like it’s kind of adding anxiety to me. It’s good to know, but I feel like I can’t do much about it.” (P29)

We did not incorporate harm mitigation into our tools, because mitigating harms associated with LLM-powered applications remains an open research question (see more discussion in section 7.3). After the evaluation study, we improved Farsight by providing pointers to existing LLM safety resources [e.g., 6, 117, 71, 164] when users exported their harms.

6.7 Limitations of Study Design

We acknowledge limitations in our tool and study designs. First, we recruited participants from a single large technology company. This was because we needed to require participants to have prior experience in prototyping LLM-powered applications using a particular prompt-crafting tool, into which we integrated Farsight and Farsight Lite in the study. Consequently, all 42 participants had backgrounds in the technology industry in varying roles, such as software engineers, product managers, UX researchers, and linguists⁴ as shown in Table 2. Our participants have a wide range of familiarity with responsible AI and prompting (Figure 10), and they use LLMs for diverse tasks, including prototyping AI features with LLMs—much like the intended users of Farsight. Therefore, findings from our study may be generalizable to AI prototypers who have worked in the technology industry, and who are using LLMs to prototype AI-based applications. Nevertheless, to understand how usable or effective Farsight may be for a broader spectrum of AI prototypers, particularly those with limited background or knowledge of AI, such as creative writers, teachers, students, and more, further research involving individuals with more diverse backgrounds is needed. Second, we administered only one post-task (H3) immediately following the intervention (H2). To evaluate the long-term impact of our tools on users’ ability to envision harms, a more extended longitudinal study is needed.

Finally, an inter-rater reliability test showed that, on average, the seven raters (i.e., of the likelihood and severity of the identified harms) only had a slight agreement (section E). The ratings of the likelihood and severity of participants’ identified harms should thus be taken as an initial step in evaluating identified harms, and not as the sole evidence demonstrating the value of this approach. The relatively low inter-rater reliability may be due to the fact that perceptions of severity and likelihood may be highly influenced by the raters’ personal experiences, backgrounds, knowledge, and their positionality as a whole. Indeed, substantial prior work on annotations of offensive language, hate speech, and other linguistic phenomena [35, 39, 123, 136, 138] suggest that disagreements between raters with different subjectivities (i.e., personal backgrounds and experiences) is an inherent challenge to sociotechnical evaluations, and not one that can be solved with more or better raters. We further discuss the challenges regarding subjectivity in identifying and assessing harms in 7.2

7 Discussion

7.1 Motivation & Engagement in Responsible AI

Potentials of in situ and early intervention in motivating responsible AI practices. Existing research suggests that many AI developers may not have incentives to consider potential harms related to their AI applications [143]—or may be actively disincentivized to identify such harms [104]. Our co-design user study validates this finding among an emerging community involved in AI development—AI prototypers who use LLMs to rapidly iterate on potential AI-based applications (section 3.1). With the rapidly increasing access to LLMs and easy-to-use prototyping tools, it is crucial to motivate AI prototypers to consider AI risks when prototyping their AI applications or features (G3). To tackle this challenge, we propose an in situ system design that integrates our tool into the AI prototyper’s existing workflows and employs different design strategies to draw users’ attention without causing significant interruption to their flow. Our evaluation study shows that users appreciate our design, and find this in-context warning tool easy to adopt and engaging (6.6.2 Farsight piques users’ interests (6.5.2 in situ design and early intervention for future responsible AI works. Therefore, future designers of AI development tools (e.g., Google AI Studio, computational notebooks, and VSCode) can natively integrate in situ interfaces to promote responsible AI practices. In addition, future researchers can adopt our design strategies to foster other responsible AI practices, such as illustrating bias in LLMs and encouraging development documentation at an early AI development stage.

Tension between automation and human agency.Farsight’s seamless integration into AI prototypers’ workflows helps motivate AI prototypers to engage with harm envisioning. In addition, rather than asking users to anticipate harms entirely from scratch, Farsight leverages LLMs to generate the initial set of use cases, stakeholders, and harms, providing users with inspiration and a foundation to build upon (section 6.6.3). However, this seamless and automated design might deter users from fully engaging in and contemplating the limitations and potential risks associated with LLMs. Prior research in responsible AI has proposed the value of a seamful design [e.g., 52, 86], where the designers strategically reveal seams or introduce frictions or “productive restraint” [85, 104] to support increased reflection on responsible AI during development. To explore this tension and tradeoffs between a seamfully-designed workflow that is easy to use by prototypers, and a seamful design that prompts reflection-in-action [52], we (1) designed the Harm Envisioner to encourage users to edit LLM-generated content and steer the harm envisioning direction (section 4.3, G4), and (2) evaluated two variants of our tool in the evaluation study—Farsight and Farsight Lite, where Farsight Lite omits the Harm Envisioner.

Our study results highlight that participants feel they have agency (6.6.2 4). Our quantitative results also show that Farsight, with higher human agency, is more effective than Farsight Lite across all measures (section 6.4.1, section 6.5.1). On the other hand, when engaging with AI-generated content, some participants also report discomfort (C3) and even anxiety (6.6.3 in situ AI automation) and seamful design (promoting user reflection) are complementary to each other—tradeoffs and a balance between the two should be considered during the design of responsible AI tools [cf. 198]. For future responsible AI work, researchers should engage with potential users and other impacted stakeholders throughout the design process and adjust their design ideas to ensure the responsible AI tools they are designing are both easily adoptable and capable of eliciting active and critical reflection.

7.2 Subjectivity in Harm Envisioning

In our evaluation user study, many participants report challenges overcoming the limitations of their own experiences and perspectives when envisioning harms (C2). In addition, we also observed the seven RAI raters of participants’ harms disagreed about which harms were more or less severe or likely, resulting in a low inter-rater reliability for these two dimensions (section E, Table S2). Our empirical findings contribute to prior research that highlights the role of subjectivity and positionality in anticipating harms [20, 99] and in data annotation, particularly for annotations of toxicity or hate speech [e.g., 123, 43, 39, 35, 138]. What constitutes harm and the assessment of harm severity are often influenced by the individual’s background, lived experiences, or even the organizational culture they are working in [130, 196]. For example, for the article summarizer (H3), one participant envisioned a harm scenario: “If the summary is wrong, journalists’ reputation might be harmed.” (Table S2). This harm scenario received likelihood ratings of 1, 4, and 3, and severity ratings of 1, 3, and 4 from three randomly assigned raters. It is possible that the rater who assigned the ratings of 3 and 4 possessed specific knowledge about the harms of journalists using LLMs to write article summaries, which led them to rate this harm scenario as more likely and more severe.

A need for new methods to assess harms. Emerging research is beginning to develop methods for measuring and resolving disagreements among annotators in cases where there may in fact be no ground truth [e.g., 39, 138, 35, 123, 72]. Our findings in this paper—including the low inter-rater reliability of the responsible AI raters—suggest that new methods are needed in responsible AI to account for different perspectives on the severity and likelihood of potential downstream harms. This may ideally involve recruiting participants from communities or populations who may be impacted by a given AI application (e.g., the stakeholders generated by Farsight, for instance, as well as other stakeholders identified by members of the communities themselves [37]). Moreover, with the rapidly increasing access to LLMs and easy-to-use AI prototyping tools, AI prototypers may encompass a broader spectrum of roles beyond traditional AI practitioners [e.g., 78, 183]. Thus, they may lack either the experience or the resources to recognize the limitations of their own subjectivity when anticipating harms of their AI applications, and may lack the means to identify and engage with diverse stakeholders as part of harm envisioning.

Benefits and challenges of using LLM to envision AI harms. Our evaluation study highlights that diverse and unexpected AI-generated use cases, stakeholders, and harms in Farsight help some participants overcome their own failures of imagination [20] in order to think from a broader perspective when independently envisioning harms (6.5.2 Farsight than with existing harm envisioning processes [120] (Figure 12). There are two implications of these findings. First, LLMs can be a promising tool to help AI prototypers think outside of their own perspectives, and future researchers can adapt our approach to other responsible AI practices. Second, LLMs may encode biases from their training data [e.g., 190], and Farsight may also reflect the biases of its creators, as expressed in the underlying prompts used in Farsight’s LLM, which raises a critical question: to what extent can LLMs be helpful as part of a harm envisioning process, without over-indexing on particular harms or leading AI prototypers to overlook other types of harms?

Our research provides an initial starting point into investigating these questions, as well as opening new questions into the role of subjectivity in harm envisioning. Future research can further investigate the factors influencing users’ ability to envision harms of AI applications, develop new ways to model and resolve disagreement among AI prototypers or other evaluators about the severity and likelihood of envisioned harms, and integrate such implications into LLM-powered responsible AI tools for AI prototypers or other AI practitioners. Future research can also explore tradeoffs between semi-automated harm envisioning processes (like Farsight) and more traditional processes like value-sensitive design [e.g., 57], participatory design [e.g., 16, 130, 37], and more.

7.3 Mitigating Harms during AI Prototyping

A limitation of Farsight is its focus on harm identification rather than harm mitigation. Participants from our co-design study (section 3.1) and evaluation study (6.6.3 Farsight to provide actionable items to help them prevent and mitigate identified harms. Some participants also suggested we develop an in situ prompt editing tool to address harms identified from Farsight (section 3.1). Interestingly, while using Farsight, some participants voluntarily thought about actions and strategies to take after envisioning harms, such as implementing an appeal process, collecting better data, and revising the prompts (6.5.2 2.2).

Looking ahead, we argue that it is crucial for future designers to provide users with harm mitigation suggestions and resources in systems similar to Farsight. Some participants in our study complained that Farsight is exploiting users’ “empathy quota” and potentially desensitizing them about LLM harms, because Farsight only warns users about harms without providing mitigation suggestions (6.6.3 2.4) and monitoring alarms in healthcare. Alarm fatigue occurs “when non-actionable alarms are in the majority, and clinicians develop decreased reactivity, causing them to ‘tune out’ or ignore the alarms” [82]. Therefore, to combat alarm fatigue and effectively promote responsible AI practices, future designers should make responsible AI alerts actionable and prioritize actionable warnings in their systems.

Our findings highlight that Farsight users have a great appetite for mitigation strategies during AI prototyping. We have two hypotheses for this observation. First, as Farsight promotes human agency, it might also give participants a feeling of ownership of their identified harms. Prior research shows that triggering a feeling of ownership motivates users’ actions [26]. Another hypothesis is that Farsight elicits fear by exposing participants to diverse potential harms of their AI applications, evidenced by participant-reported discomfort (C3) and anxiety (6.6.3 fear appeals as a design strategy to motivate users to take security actions [148]. Therefore, our empirical findings highlight promising research opportunities in (1) providing in situ mitigation strategies during the early AI prototyping stage, and (2) investigating if in situ tools can increase users’ adoption of harm mitigation strategies.

8 Conclusion

We introduce Farsight, the first in situ interactive tool to address the challenges in anticipating potential harms in LLM-powered applications during prototyping. By highlighting relevant AI incident reports and enabling AI prototypers to curate and modify LLM-generated use cases, stakeholders, and harms, Farsight improves users’ ability to independently anticipate potential risks associated with their prompts. A user study with 42 AI prototypers shows that our tool is useful and usable. Farsight fosters a user-centric approach, encouraging creators to consider end-users, and cascading harms, and extend their awareness beyond immediate harms. Our tool is open-source and readily adoptable. We hope our work will inspire future research and development of responsible AI tools that target the early stages of the AI development process.

Acknowledgments

We express our gratitude to all anonymous participants who took part in our co-design and evaluation studies. A special thank you to Jaemarie Solyst and Savvas Petridis for piloting our studies. We are deeply thankful to our three anonymous raters for rating the harms collected during the evaluation study. We are immensely grateful for the invaluable feedback provided by Parker Barnes, Carrie Cai, Alex Fiannaca, Tesh Goyal, Ellen Jiang, Minsuk Kahng, Shaun Kane, Donald Martin, Alicia Parrish, Adam Pearce, Savvas Petridis, Mahima Pushkarna, Dheeraj Rajagopal, Kevin Robinson, Taylor Roper, Negar Rostamzadeh, Renee Shelby, Jaemarie Solyst, Vivian Tsai, James Wexler, Ann Yuan, and Andrew Zaldivar. Our gratitude also extends to the anonymous Google employees who generously allowed us to use their prompts to design and prototype Farsight. Furthermore, we are grateful to James Wexler, Paul Yang, Tulsee Doshi, and Marian Croak for their assistance in open-sourcing Farsight. Lastly, we would like to acknowledge the anonymous reviewers for their detailed and helpful feedback.

A Co-design User Study Interviews

A.1 Co-design Prototypes and Sketches

Fig. S1:

A.2 Interview 1 Questions

•

Why do you use a prompt crafting tool?

•

How have you used it most recently? Can you walk me through one of your example prompts?

•

For your previous prompts, did you ever think about the potential societal impacts of your AI application / feature?

It’s OK if this wasn’t part of your process.

If yes, how did you think through or envision those potential impacts of your AI prototypes / ideas?

If yes, can you share some examples of the types of impacts of your AI application that you considered? Including both positive and negative impacts.

If no, how would you envision potential impacts of your AI prototypes / ideas?

•

What would motivate you to think more about the potential negative impacts of the applications you were prototyping?

A.3 Interview 2 Rating Forms and Questions

•

How do you think our design might fit into your prompting workflow?

How do you think our design might fit into your typical AI application prototyping workflow?

•

What would these suggested use cases, stakeholders, and harms prompt you to think or do differently? (if not answered already)

•

What would prevent you from using this tool?

•

Other design ideas?

What would encourage more use/engagement with the tool?

What are other ways you could imagine raising awareness of potential responsible AI and safety issues?

B Interface Details

B.1 Determine the Thresholds for Relevancy of AI Incident Reports

We collect 1000 random internal prompts written by real AI prototypers. Then, we compute the embedding similarity between these prompts and all AI incident reports [111]. We use the 20% and 70% cumulative density function cut-offs (0.69 and 0.75) of the max prompt-incident embedding distance as our thresholds for

. Researchers can easily adjust these two thresholds (bounded between 0 and 1) to calibrate an article’s relevancy.

Fig. S2:

B.2 Example Farsight Output

Table S1:


Use Cases	Stakeholders	Harms
Online moderators use it to remove toxic content from social media.	Online moderators	Online moderators may be exposed to toxic content while using the AI tool.
		Online moderators may feel frustrated or anxious about the accuracy of the AI tool.
		Online moderators may lose their jobs due to AI-generated content being offensive.
	Social media users	Social media users may be unfairly censored by the AI tool.
		Social media users may feel frustrated or anxious about the accuracy of the AI tool.
		Social media users may feel like they are not being heard by the moderators.
	The company	The company may be sued for defamation if the AI tool mislabels a post as toxic.
		The company may lose customers if the AI tool is not accurate enough.
		The company may be accused of bias if the AI tool is not fair in its classification of toxic content.
	Social media users who do not use this AI product	Social media users who do not use this AI product may be exposed to toxic content.
		Social media users who do not use this AI product may feel like they are not being heard by the moderators.
		Social media users who do not use this AI product may feel frustrated or anxious about the amount of toxic content on social media.
	Social media advertisers	Social media advertisers may lose revenue due to toxic content being removed from social media.
		Social media advertisers may have to spend more time and money on creating non-toxic content.
		Social media advertisers may feel frustrated or anxious about the accuracy of the AI tool.
Customer service agents use it to identify abusive customers.	Customer service agents	Customer service agents may feel stressed or anxious about the accuracy of the AI tool.
		Customer service agents may be accused of bias if the AI tool misidentifies abusive customers.
		Customer service agents may lose the skill to identify abusive customers independently.
	Abusive customers	Abusive customers may be denied service due to being labeled as toxic.
		Abusive customers may feel frustrated or anxious about being labeled as toxic.
		Abusive customers may be more likely to engage in abusive behavior in the future.
	The company	The company may be sued for defamation if the AI tool mislabels a customer as abusive.
		The company may face increased regulatory scrutiny if the AI tool is found to be inaccurate or biased.
		The company may lose customers if the AI tool is perceived as biased against certain groups.
	Customers who are not abusive	Customers who are not abusive may be misidentified as abusive and treated poorly by customer service agents.
		Customers who are not abusive may feel frustrated or anxious about being misidentified as abusive.
		Customers who are not abusive may lose trust in customer service agents.
	Customers who are misidentified as abusive	Customers who are misidentified as abusive may be denied service or treated poorly by customer service agents.
		Customers who are misidentified as abusive may feel frustrated or anxious about being treated poorly.
		Customers who are misidentified as abusive may lose trust in customer service agents.
	Law enforcement	Law enforcement may make false arrests or detentions of innocent people due to AI-generated toxicity labels.
		Law enforcement may lose trust in AI tools due to false positives and negatives.
		Law enforcement may be biased against certain groups of people due to AI-generated toxicity labels.
Law enforcement uses it to identify potential terrorists.	Potential terrorists	Potential terrorists may be unfairly targeted by law enforcement due to AI-generated misclassifications.
		Potential terrorists may feel frustrated or anxious about being unfairly targeted by law enforcement.
		Potential terrorists may be denied opportunities due to AI-generated misclassifications.
	Victims of terrorism	Victims of terrorism may be re-traumatized by AI-generated content that glorifies or incites violence against them.
		Victims of terrorism may feel unsafe or threatened by AI-generated content that glorifies or incites violence against them.
		Victims of terrorism may feel like their voices are not being heard by society.
	Civil rights groups	Civil rights groups may be unfairly targeted by law enforcement due to AI-generated false positives.
		Civil rights groups may lose trust in law enforcement due to AI-generated false positives.
		Civil rights groups may have to spend more time and resources fighting against AI-generated false positives.
	People who are falsely identified as potential terrorists	People who are falsely identified as potential terrorists may be subject to increased surveillance, harassment, and discrimination.
		People who are falsely identified as potential terrorists may lose their jobs, housing, or other opportunities.
		People who are falsely identified as potential terrorists may be detained or imprisoned without due process.
Military uses it to identify potential insurgents.	Military	Military may use AI to target and harm innocent civilians.
		Military may lose trust from the public due to its use of AI.
		Military may be pressured to use AI to stay competitive with other militaries.
	Insurgency	Insurgency may be able to evade detection by using AI-generated text that is not classified as toxic.
		Insurgency may be able to recruit more members by using AI-generated text that is not classified as toxic.
		Insurgency may be able to spread propaganda more effectively by using AI-generated text that is not classified as toxic.
	Governments	Governments may use AI to target and surveil innocent people.
		Governments may use AI to justify violence against innocent people.
		Governments may lose control over how AI is used to target and surveil people.
	Civilians who are wrongly identified	Civilians who are wrongly identified may be subject to violence or discrimination.
		Civilians who are wrongly identified may lose their freedom or property.
		Civilians who are wrongly identified may experience emotional distress.
	Civilians who are not identified as insurgents	Civilians who are not identified as insurgents may be subject to violence or discrimination.
		Civilians who are not identified as insurgents may lose their freedom or property.
		Civilians who are not identified as insurgents may experience emotional distress.
	Hate groups	Hate groups may be able to recruit more people due to the AI tool.
		Hate groups may be able to spread their ideology more effectively due to the AI tool.
		Hate groups may be able to avoid detection by law enforcement due to the AI tool.
	Potential recruits	Potential recruits may be exposed to toxic content that could lead to radicalization.
		Potential recruits may feel alienated and isolated from society due to their exposure to toxic content.
		Potential recruits may lose the opportunity to develop healthy relationships with people outside of the hate group.
Hate groups use it to identify potential recruits.	The company	The company may be held liable for hate speech generated by its AI product.
		The company may face increased regulation due to its AI product’s misuse.
		The company may lose customers due to negative publicity.
	People who are targeted by hate groups	People who are targeted by hate groups may be harassed or threatened by hate groups.
		People who are targeted by hate groups may feel anxious or fearful of being targeted again.
		People who are targeted by hate groups may lose their sense of safety and belonging.
	People who are not targeted by hate groups	People who are not targeted by hate groups may be exposed to toxic content.
		People who are not targeted by hate groups may be harassed or threatened by hate groups.
		People who are not targeted by hate groups may feel anxious or fearful about the rise of hate groups.
Scammers use it to identify potential victims.	Victims of scams	Victims of scams may lose money or personal information.
		Victims of scams may experience emotional distress.
		Victims of scams may lose trust in others.
	Scammers	Scammers may be able to target more vulnerable people with their scams.
		Scammers may be able to avoid detection by using AI-generated text that is less likely to be flagged as toxic.
		Scammers may be able to increase their profits by using AI-generated text to target more people.
	The company	The company may be held liable for the harm caused by scammers using the AI tool.
		The company may lose customers due to negative publicity.
		The company may face increased regulatory scrutiny.
	Victim’s financial institutions	Victim’s financial institutions may lose money due to scammers using AI to identify potential victims.
		Victim’s financial institutions may face increased legal liability due to scammers using AI to identify potential victims.
		Victim’s financial institutions may lose trust from their customers due to scammers using AI to identify potential victims.
	Family and friends of the victim	Family and friends of the victim may be harmed by the scammer’s actions.
		Family and friends of the victim may be harmed by the scammer’s emotional abuse.
		Family and friends of the victim may be harmed by the scammer’s financial abuse.

Table S1: Farsight uses LLMs to generate use cases, stakeholders, and harms associated with AI apps based on a given prompt. This table shows the outputs of Farsight for a toxicity classifier prompt (§ C.2). By default, the Harm Envisioner displays six use cases (arranged by two

, two

, and two

), five stakeholders (arranged by three direct stakeholders and two indirect stakeholders) for each use case, and one harm for each stakeholder. Users can modify or delete existing outputs and ask Farsight to generate additional content (e.g., generating harms for a newly added stakeholder).

C Evaluation User Study Prompts

C.1 Pre-task: Email Summarizer (H1)

AI feature functionality: Shorten and improve a user’s email

Context: Your team is adding a new AI feature to an email client

Text prompt:

C.2 Intervention: Toxicity Classifier (H2)

AI feature functionality: Detect toxic text content

Context: Your team is developing a social media content moderation tool (e.g., API service)

Text prompt:

Prompt engineering: Participants were asked to modify the prompt to alter the model’s prediction on the test input sentence: “He is not living proof that gay conversion therapy works.” from toxic to non-toxic. We intentionally formulated the initial prompt to make the LLM predict the given test input sentence as toxic, as we exclusively included toxic examples in the prompt.

C.3 Post-task: Article Summarizer (H3)

AI feature functionality: Summarize an article in one sentence

Context: Your team is adding an AI feature to a text editor software

Text prompt:

C.4 Alternative: Math Tutor (H4)

AI feature functionality: Answer math-related questions.

Context: Your team is developing a math tutoring app (e.g., mobile app)

Text prompt

Prompt Engineering: During prompt engineering, participants were asked to modify the prompt so that when a user posed non-math-related questions, the math tutor app’s response would be “Sorry, I’m not sure about the answer to this question. Try a different question.”

D Evaluation User Study Interviews

D.1 Interview 1 Questions

•

How did the harm envisioning tool (Farsight, Farsight Lite, or Envisioning Guide) influence your strategy for envisioning harms?

•

Is envisioning potential harms something you would typically do when writing prompts? If so, how do you typically do it?

•

What are the most relevant or important harms you have identified? Why?

•

What is the biggest challenge in envisioning harms?

•

What other tools or other types of support would you like to have to help you envision harms when prototyping LLM-powered applications?

•

What do you think you would do about those identified harms?

•

Do you have any feedback on the harm envisioning tool (Farsight, Farsight Lite, or Envisioning Guide)?

D.2 Interview 2 Rating Forms and Questions

[ For participants in C_FG and C_GF ]

How would you rate Farsight ?

Participants could select strongly agree, agree, neutral, disagree, or strongly disagree. The default selection is neutral.

•

Help me envision harms.

•

Easy to use.

•

Enjoyable to use.

•

I would use this tool in the future.

How would you rate the helpfulness of Farsight ’s different components?

Participants could select strongly agree, agree, neutral, disagree, or strongly disagree. The default selection is neutral.

•

AI incidents.

•

Use case sidebar.

•

AI assistance in harm envisioning.

•

Interactive tree visualization.

•

Customizability (add, edit, delete content).

[ For participants in C_LG and C_GL ]

How would you rate Farsight Lite ?

Participants could select strongly agree, agree, neutral, disagree, or strongly disagree. The default selection is neutral.

•

Help me envision harms.

•

Easy to use.

•

Enjoyable to use.

•

I would use this tool in the future.

How would you rate the helpfulness of Farsight Lite ’s different components?

Participants could select strongly agree, agree, neutral, disagree, or strongly disagree. The default selection is neutral.

•

AI incidents.

•

Use case sidebar.

•

AI assistance in harm envisioning.

[ For participants in C_FG, C_GF, C_LG, and C_GL ]

How would you rate Envisioning Guide ?

Participants could select strongly agree, agree, neutral, disagree, or strongly disagree. The default selection is neutral.

•

Help me envision harms.

•

Easy to use.

•

Enjoyable to use.

•

I would use this tool in the future.

How would you rate the helpfulness of Envisioning Guide ’s different components?

Participants could select strongly agree, agree, neutral, disagree, or strongly disagree. The default selection is neutral.

•

Harm modeling workflow table.

•

Text prompts to think about use cases, harms, and stakeholders.

•

Harm taxonomy.

[ For participants in C_FG, C_GF, C_LG, and C_GL ]

Overall, which tool do you prefer? Why?

Participants in C_FG, C_GF can choose Farsight or Envisioning Guide. Participants in C_LG, and C_GL can choose Farsight Lite or Envisioning Guide.

E Harm Rating Processing and Inter-rater Reliability

Fig. S3:

In our evaluation user study, we recruited seven raters to rate the likelihood and severity of each collected harm (section 6.2.4). In total, we collected 989 harms with 895 unique harms (Table S2). We randomly assign each unique harm to three raters to rate its likelihood and severity on a 4-point Likert scale (strongly agree, agree, disagree, and strongly disagree to statements “this harm is likely to happen” and “this harm is severe”). Raters could also choose an N/A option if they perceived a rating was not applicable. After collecting all ratings, we dropped all N/As and numericalized four rating categories as ordinal scores: 1, 2, 3, 4.

To measure the inter-rater reliability, we computed Cohen’s kappa [112] for each pair of raters. As the rating scores are ordinal, we used quadratically weighed kappa [32], so that the level of agreement between score 1 and score 2 is higher than score 1 and score 3. The pair-wise kappas and the count of common harms are shown in Figure S3. The average kappa for likelihood ratings is 0.14, and the average kappa for severity ratings is 0.09. Both scores can be interpreted as “slight agreement” [94].

E.1 Example Harms Collected from Participants

Table S2:


AI Feature	Who (could be harmed)	How (the harm could happen)	Likelihood	Severity
Email summarizer (H1)	AI company	AI company’s reputation may be harmed due to inaccurate summary	3, 3, 3	2, 2, 3
	Author	The AI summary might lose context and information, critical context. It could cut out some crucial part of the email.	4, 3, 3	3, 3, 3
	Owner of the email	Rewrite could leak data from other users’ email.	3, 1, 2	4, 4, 2
	People whose names / personal information is lost in the summarization	Potential data loss. The email could be very long, and the rewrite could miss some context (e.g., full name). The rewrite can cause confusion to the receivers.	3, 3, 3	2, 3, 3
	Product team, company	Prompt injection attack.	3, 1, 3	2, 1, 3
	Receiver	Omission of vital information and context. The feature can miss some very important information.	3, 3, 3	3, 2, 3
	Receiver	The initial rewrite is a bit open ended, which might cause anxiety.	2, 3, 2	1, 3, 2
	Recipients	If users can control the temperature of the LLM, they can generate more inaccurate emails.	1, 3, 1	4, 2, 1
	Recipients due to loss of productivity and insight into their orgs metrics	The email may miss some metrics from the email.	3, 3, 3	2, 2, 3
	Scam victims	Scam emails are very common. Mass emailing a lot of people with easy automation can cause harm to victims.	2, 3, 3	4, 3, 3
	Sender	In a professional setting, if the feature is not sufficient to capture the keywords, the summary can sound unprofessional. It can harm the reputation of the sender.	2, 3, 3	2, 2, 2
	Sender	The summary is inaccurate, e.g., having the wrong time, names in the rewrite. It can harm the sender’s reputation.	3, 3, 3	2, 3, 3
	Sender, Recipient	The feature introduces incorrect information that is not in the original email.	3, 3, 3	3, 2, 2
	Sender, recipient	Quality of the communication. If the tone of the rewrite is different from the original email, it can cause harm to both sender and recipient. Miscommunication.	3, 2, 3	2, 2, 3
	Sender, recipient	The delicacy in the original email may be removed in the shorter email. The tone, emotion, and human elements may be changed.	4, 3, 3	3, 2, 3
	Sender, recipient	The summary can have a different voice / style from the email (more directly, less soft). It can be perceived as negative.	4, 3, 4	3, 2, 4
	Sender, recipient, company relationships	AI can be culturally blind. It might not follow some typical email norm in the rewrite.	3, 3, 3	2, 2, 3
	Senders, receivers	The AI feature puts words on other people’s behalf. It can have different viewpoints from the email. It can have misrepresented views.	2, 3, 4	3, 3, 4
	Senders, recipient	The summary can lose some tone, voice of the email, becoming less personal.	3, 3, 4	2, 3, 3
	Stakeholders of the decision	If the communication is about decision making and the rewrite contains error, it can harm the stakeholders of the decision.	1, 4, 3	1, 4, 3
	Team building the feature	Damage to reputation, credibility to have a feature fail this way.	3, 2, 3	3, 1, 2
Toxicity classifier (H2)	Advertisers	If the AI feature fails, ads might be present next to toxic content, losing revenues.	3, 3, 3	1, 2, 3
	Bullies	Teachers use it to identify students being bullied. The AI may make the teachers be biased against the bullies.	2, 3, 2	3, 3, 2
	Company of the AI product	The company may lose users due to the presence of more moderation.	3, 2, 1	4, 2, 1
	Customer	Customer service agents use it to identify abusive customers. Customers may lose the opportunity to get help from customer service agents.	3, 3, 3	3, 2, 4
	Customer service agents	If customer service uses it to screen people, false positives: exposing customer service agents to toxic customers.	3, 1, 3	3, 1, 3
	Employees of the company	HR departments use it to screen job applicants for toxic behavior. Employees may be unfairly rejected for jobs due to AI-generated toxicity labels.	1, 3, 3	1, 3, 4
	End user (children & parents)	Mislabeled outputs will not be blocked by parental controls.	3, 3, 3	3, 4, 4
	Government	Military uses it to identify potential insurgents. Governments may use AI to target and surveil innocent people, leading to human rights violations.	3, 2, 3	4, 3, 3
	Hate group targets	Hate groups use it to identify potential recruits. Hate groups may be able to recruit more people due to the AI tool.	3, 2, 3	3, 3, 3
	Job applicants	Reddit has a karma score. Similarly, if a social media uses this feature to prioritize non-toxic content, misclassification can lose job opportunities for social media users (e.g, tweet not seen by companies).	3, 4, 1	3, 3, 1
	Online moderators	Online moderators may lose their jobs because the AI tool can already distinguish toxic vs non-toxic so there is no need for a moderator.	2, 1, 3	3, 1, 4
	Online moderators	Online moderators may miss some toxic content and post it on social media.	3, 3, 3	3, 4, 3
	People being targeted by law enforcement	Law enforcement uses it to identify potential terrorists. Potential terrorists ([different name]) may be unfairly targeted by law enforcement due to AI-generated misclassifications.	3, 1, 3	3, 1, 4
	People who are not members of hate groups	Hate groups use it to identify potential recruits. People who are not members of hate groups may be affected by increased hate group activity.	3, 3, 4	3, 3, 4
	Sender, Receivers	False positive. Flagging non-toxic content as toxic, causing harm to legitimate receivers of the information (filtered). Sender cannot post due to the filter.	3, 3, 4	3, 3, 4
	Social media users	Social media users may be exposed to toxic content that was mislabeled by the AI as non-toxic.	3, 3, 3	3, 4, 3
	Social media users	Social media users may feel frustrated or anxious about the accuracy of the AI tool.	3, 3, 3	2, 4, 2
	Social media users	Social media users who do not use this AI might be exposed to some distorted message.	2, 1, 3	3, 1, 3
	Social media users	Unfairly flagged as toxic (false positive trigger).	3, 3, 3	3, 2, 2
	Student	Teachers use it to help students identify toxic language. Students may feel frustrated or anxious about the accuracy of the AI tool.	2, 1, 3	2, 1, 4
	Victims	Scammers use it to identify potential victims. Victims of scams may lose money or personal information.	2, 2, 1	4, 3, 1
Article Sumamrizer (H3)	Anyone reading the document	If the document is sent to anyone else, the summary may not well-represent the document. The readers of the summary might mis-understand the original document.	4, 3, 3	3, 2, 3
	Business	Write marketing content, memo. Bad summary can lead to miscommunication, causing financial loss.	4, 3, 3	3, 3, 4
	Company publishing the content (news paper)	If the summary is wrong, people would lose confidence in the company.	3, 4, 4	2, 4, 3
	Economic loss	The summary may contain misinformation about some critical article, causing economic loss.	3, 3, 1	3, 3, 1
	Employees	Employees can use it to summarize documents in a workplace. Wrong summary can harm the stakeholders.	3, 3, 3	4, 3, 3
	End user	Change the meaning of the article.	3, 4, 4	3, 4, 4
	Journalists / writers	If the summary is wrong, journalists’ reputation might be harmed.	1, 4, 3	1, 3, 4
	Kids	Kids use it cheat in school for assignments.	3, 2, 3	2, 2, 3
	Readers	AI fails to pick up key information (nuances) in the article => the readers miss the points from the article, miscommunication.	3, 3, 3	2, 2, 2
	Readers, content providers, new	If there are two facts in the article that are true independently, then it’s possible that the summary combines them where the statement is no longer true.	3, 2, 3	2, 2, 3
	Readers, society	A lot of content is missed, making readers less educated overtime, causing societal loss (some events being lost in history).	3, 2, 1	3, 3, 1
	Readers, writers	It would lose a lot of context and details.	3, 3, 3	2, 3, 2
	Social groups	Representational harm. It could stereotype some social groups in the output.	3, 3, 3	3, 3, 3
	Students	Students can use this feature to summarize articles instead of reading for assignments.	4, 3, 3	3, 3, 3
	Students	Students can use this tool to cheat and lose opportunities to learn.	4, 1, 3	3, 1, 4
	Students	Students misuse this feature to cheat on school assignments. They might lose learning opportunities.	4, 2, 3	3, 2, 3
	Users	The context can be manipulated.	2, 4, 3	2, 4, 3
	Users	The summary can lose some information and context from the article.	3, 4, 3	3, 4, 2
	Users	The tool makes mistakes (e.g., wrong summary). It can harm the users of the tool by missing information in the article.	3, 3, 3	3, 3, 3
	Users	Users may lose trust if the summary is not right. Users would become more stressful as well.	3, 3, 4	2, 2, 3
	Writers	If someone uses the tool to write, it may change people’s perception about how you write.	3, 1, 1	2, 1, 2
Math tutor (H4)	Company of the AI product	Engineers use it to explain to non-technical stakeholders. The company may face increased legal liability due to AI-generated explanations being inaccurate or misleading.	2, 1, 2	1, 1, 2
	Editors	Journal readers use it to get explanations of equations in inferential statistics sections. Journal editors may worry that the quality of their journal declines because the AI feature makes too many errors. Readers get frustrated over time.	2, 1, 2	2, 1, 3
	Marginalized population	People with less resources are less likely to access this tool, losing opportunity to learn.	4, 3, 3	3, 3, 3
	Minority social groups	If you ask the math tool about important social groups, it can refuse to answer => ignoring important questions, marginalizing some social groups.	3, 3, 1	3, 3, 1
	Minority social groups	There are different ways to phrase things differently based on some social groups. If the user asks a question in non-profession English or non-English, it can cause alienation.	4, 3, 3	3, 3, 3
	Non-technical stakeholder	Engineers use it to explain complex math concepts to non-technical stakeholders. Non-technical stakeholders may be misled by AI-generated explanations of complex mathematical models.	3, 2, 3	3, 2, 3
	Parents of students	Students use it to learn math. Parents may feel frustrated or anxious about their child’s education.	2, 2, 3	2, 2, 3
	Public	The public may be misled by AI-generated explanations of scientific concepts.	3, 3, 3	3, 3, 3
	Rocket engineers	Wrong answers can harm the task.	3, 3, 3	3, 2, 3
	Student	Tutors use it to help students understand math concepts. Students may lose the opportunity to learn the material in a way that is tailored to their individual needs.	4, 2, 4	4, 2, 4
	Students	If the students use this tool to cheat, they would have economic loss (not performing well on their jobs).	2, 1, 3	2, 1, 3
	Students	Incorrect math answers harm the user’s learning process.	3, 3, 3	2, 3, 3
	Students	Students use it to cheat, losing opportunities to learn.	4, 1, 3	3, 1, 3
	Students	Students use it to learn math. Students may feel like they are losing their ability to learn about math problems.	2, 1, 2	2, 1, 2
	Students	Students use it to learn math. Students who do not use this AI product may feel like they are not getting the same quality of education as their peers.	3, 3, 3	2, 3, 3
	Students	Wrong solution, less optimized solution, wrong information, students learn the wrong thing.	3, 3, 3	3, 2, 3
	Students and teachers	Students can use this app to cheat. Students do not learn.	4, 3, 3	3, 2, 4
	Students who do not use this AI product	Students use it to learn math. Students who do not use this AI product may feel left behind by their peers who do.	3, 3, 4	3, 2, 4
	Students, society	If the answer is based on some context of the math problem, it can harm minority students.	2, 3, 1	3, 3, 1
	Users	If you ask tax rates / vote count / interest rate / wages, the wrong answer can cause political harm and social harm.	3, 2, 3	3, 2, 3
	Users	Not all questions are about numbers. Sometimes the app might refuse to answer hard math questions, and uses would feel upset.	3, 3, 3	2, 2, 3

Table S2: A random subset (n=84) of 895 unique harms collected from 42 participants in our evaluation user study (§ 6). This random subset includes 14 harms for each of the four AI features (four prompts used in H1–H4). Depending on the experimental conditions, each harm was envisioned by a participant (H1–H4) or generated by Farsight and curated by a participant (H2, H4). Each harm was rated by three random raters out of seven raters in terms of likelihood and severity on a 4-point Likert scale (strongly agree, agree, disagree, and strongly disagree to statements “this harm is likely to happen” and “this harm is severe”).

Footnotes

Farsight code: https://github.com/PAIR-code/farsight

Farsight Chrome extension: https://github.com/PAIR-code/farsight/releases

Farsight Python package: https://pypi.org/project/farsight/

The linguists in our study had roles involved in consulting on language-based data used by AI product teams.

Supplemental Material

MP4 File - Video Preview

Video Preview

Download
4.03 MB

MP4 File - Video Presentation

Video Presentation

Transcript for: Video Presentation

MP4 File - Demo video

A 5-minute long video to showcase Farsight's functionalities.

Download
35.42 MB

CRX File - Farsight Chrome extension

Farsight Chrome extension to integrate Farsight into Google AI Studio. It is also available at https://github.com/PAIR-code/farsight/releases.

Download
1.13 MB

GZ File - Farsight Python Library

Farsight Python library that integrates Farsight into computational notebooks (e.g., Jupyter Notebook, JupyterLab, Google Colab, and VS Code Notebook). It is also available at https://pypi.org/project/farsight/.

Download
855.16 KB

ZIP File - Farsight source code

Farsight source code. It is also available at https://github.com/PAIR-code/farsight.

Download
6.47 MB

References

[1]

Fatih Kadir Akın. 2022. Awesome ChatGPT Prompts. https://github.com/f/awesome-chatgpt-prompts

Abstract

1 Introduction

2 Related Work

2.1 Anticipating Technology’s Negative Impacts

2.2 Identifying and Mitigating LLM Harms

2.3 Responsible AI Tools and Practices

2.4 In Situ Alerting Tools

3 Formative Study & Design Goals

3.1 Co-design Study

3.2 Design Goals

4 User Interface

4.1 Alert Symbol

4.2 Awareness Sidebar

4.3 Harm Envisioner

4.4 Open-source and Reusable Implementation

5 Usage Scenario

6 Evaluation User Study

6.1 Participants

6.2 Study Design

6.2.1 Baseline Harm Envisioning Tool.

6.2.2 Harm Envisioning Activities.

6.2.3 Semi-structured Interviews.

6.2.4 Harm Rating.

6.3 Data Analysis

6.3.1 Quantitative Analysis.

6.3.2 Qualitative Analysis.

6.4 Findings: Changes in Users’ Envisioning Ability and Approach (RQ1)

6.4.1 Farsight and Farsight Lite Improved Users’ Ability to Envision Harms.

6.4.2 Changes in Harm Envisioning Approaches.

6.5 Findings: Farsight’s Effectiveness in Assisting Harm Envisioning (RQ2)

6.5.1 Farsight and Farsight Lite helped users envision more harms.

6.5.2 Usage patterns.

6.5.3 Our tools were usable, useful, and preferred by users.

6.6 Findings: Farsight’s Role in Overcoming Harm Envisioning Challenges (RQ3)

6.6.1 Challenges in envisioning harms.

6.6.2 Farsight and Farsight Lite address major challenges.

6.6.3 Limitations of Farsight and Farsight Lite.

6.7 Limitations of Study Design

7 Discussion

7.1 Motivation & Engagement in Responsible AI

7.2 Subjectivity in Harm Envisioning

7.3 Mitigating Harms during AI Prototyping

8 Conclusion

Acknowledgments

A Co-design User Study Interviews

A.1 Co-design Prototypes and Sketches

A.2 Interview 1 Questions

A.3 Interview 2 Rating Forms and Questions

B Interface Details

B.1 Determine the Thresholds for Relevancy of AI Incident Reports

B.2 Example Farsight Output

C Evaluation User Study Prompts

C.1 Pre-task: Email Summarizer (H1)

C.2 Intervention: Toxicity Classifier (H2)

C.3 Post-task: Article Summarizer (H3)

C.4 Alternative: Math Tutor (H4)

D Evaluation User Study Interviews

D.1 Interview 1 Questions

D.2 Interview 2 Rating Forms and Questions

E Harm Rating Processing and Inter-rater Reliability

E.1 Example Harms Collected from Participants

Footnotes

Supplemental Material

References

Cited By

Index Terms

Recommendations

Implications of Regulations on the Use of AI and Generative AI for Human-Centered Responsible Artificial Intelligence

Explainable Artificial Intelligence (XAI) 2.0: A manifesto of open challenges and interdisciplinary research directions

Toward Responsible Artificial Intelligence Systems: Safety and Trustworthiness

Comments

Information

Published In

Sponsors

Publisher

Publication History

Check for updates

Badges

Author Tags

Qualifiers