Research article | Open access | DOI: 10.1145/3613904.3642335

Farsight: Fostering Responsible AI Awareness During AI Application Prototyping

Published: 11 May 2024

Abstract

Prompt-based interfaces for Large Language Models (LLMs) have made prototyping and building AI-powered applications easier than ever before. However, identifying potential harms that may arise from AI applications remains a challenge, particularly during prompt-based prototyping. To address this, we present Farsight, a novel in situ interactive tool that helps people identify potential harms from the AI applications they are prototyping. Based on a user’s prompt, Farsight highlights news articles about relevant AI incidents and allows users to explore and edit LLM-generated use cases, stakeholders, and harms. We report design insights from a co-design study with 10 AI prototypers and findings from a user study with 42 AI prototypers. After using Farsight, AI prototypers in our user study are better able to independently identify potential harms associated with a prompt and find our tool more useful and usable than existing resources. Their qualitative feedback also highlights that Farsight encourages them to focus on end-users and think beyond immediate harms. We discuss these findings and reflect on their implications for designing AI prototyping experiences that meaningfully engage with AI harms. Farsight is publicly accessible at: https://pair-code.github.io/farsight.
Fig. 1:
Fig. 1: With in situ interfaces and novel techniques, Farsight empowers AI prototypers to envision potential harms that may arise from their large language models (LLMs)-powered AI applications during early prototyping. (A) In this example, an AI prototyper is creating a prompt for an English-to-French translator in a web-based AI prototyping tool. (B) The Alert Symbol from Farsight warns the user of potential risks associated with their AI application. (C) Clicking the symbol expands the Awareness Sidebar, highlighting news articles relevant to the user’s prompt (top), and LLM-generated potential use cases and harms (bottom). (D) Clicking the blue button opens the Harm Envisioner that allows the user to interactively envision, assess, and reflect on the potential use cases, stakeholders, and harms of their AI application with the assistance of an LLM.

1 Introduction

Fig. 2:
Fig. 2: (A) Many AI prototypers from diverse backgrounds and roles use (B) prompting tools to prototype AI applications. Farsight provides a range of in situ widgets for these tools, helping AI prototypers envision the potential harms of their AI applications during an early prototyping stage.
Fig. 3:
Fig. 3: Farsight fits into AI prototypers' diverse prompting workflows, including prompting GUIs and computational notebooks. For example, (A) when an AI prototyper writes prompts for a therapy chatbot in Google AI Studio [70], Farsight's Chrome extension alerts the user about related AI incidents and potential harms. (B) When an AI prototyper writes prompts for a toxicity classifier in Jupyter Notebook [91, 185], Farsight's Python library shows potential negative consequences of this classifier.
As artificial intelligence (AI) becomes increasingly integrated into our everyday lives, mitigating the societal harms posed by AI technologies has never been more important. In response to the demand for accountable and safe AI, there have been growing efforts from both industry and academia towards responsible design and development of AI [143, 183]. The majority of these endeavors focus on machine learning (ML) experts, such as ML developers and other AI practitioners. For example, researchers have introduced techniques that help ML developers interpret ML models [102, 128, 150] and assess model fairness [30, 90, 189]. Additionally, researchers have also proposed frameworks that target ML developers' workflows, such as improving data collection and annotation practices [14, 118, 123], documenting training data and models [43, 63, 122], and anticipating an ML product's potential for harm [46, 120].
However, more recently, we have witnessed a rapid advancement of large language models (LLMs) such as Gemini [178] and GPT-4 [132], alongside the emergence of prompt-based interfaces like Google AI Studio [70], GPT Playground [133], AI Chains [204], and Wordflow [184] (Figure 2B). These general-purpose models and easy-to-use interfaces have significantly increased access to the process of prototyping and building diverse AI-powered applications—leading to a paradigm shift in AI development workflows that poses unique challenges to responsible AI, including introducing new potential harms to avoid [190], as well as challenges applying existing responsible AI practices [98].
People who use prompts to create AI applications now span a broader spectrum of roles beyond traditional ML experts (Figure 2A), such as designers, writers, lawyers, and everyday users [55, 84, 193, 207], whereas existing responsible AI research often targets ML experts such as ML engineers and data scientists [78, 198]. Many users of AI prompt-based prototyping interfaces [e.g., 70, 204, 133, 184], or "AI prototypers" [cf. 84], do not have experience in AI or computer science, which can lead to challenges in anticipating the consequences of their AI applications [143]—a difficult task even for computer science faculty and AI researchers [20, 45]. Furthermore, LLMs demonstrate a wide range of capabilities that are continually being discovered across various contexts, including tasks such as summarization, classification, and translation [18, 174]. This characteristic of LLMs gives rise to complex and uncertain impacts of LLM-powered applications [61], presenting a significant departure from the classical ML models targeted by existing responsible AI endeavors [98, 190] and introducing a new layer of complexity for responsible AI researchers to help AI developers anticipate downstream consequences.
To help address these challenges in applying responsible AI practices to LLM-powered AI applications, we present Farsight (Figure 1, Figure 2B), an interactive tool to help AI prototypers identify potential harms of their LLM-powered applications—a key early step in harm prevention and mitigation [104, 120, 131, 176, 177]—during the prototyping stage. Using Farsight as a probe, we conduct multiple mixed-method user studies to investigate (1) how an early-stage intervention changes AI prototypers’ awareness of and approach to identifying harms, (2) the effectiveness of our tool in helping people envision harms, and (3) the challenges AI prototypers face during this harm envisioning process. We contribute:
Farsight, the first in situ interactive harm envisioning tool that empowers AI prototypers to identify potential harms that may arise from their prompt-based AI applications, directly within their prompting environments (Figure 1, Figure 2). Inspired by prior harm envisioning frameworks [24, 46, 120] and in situ security alert tools [109, 125, 147], Farsight overcomes unique design challenges identified from a literature review (section 2) and a co-design user study with 10 AI prototypers (section 3).
Novel techniques and interactive system designs to foster responsible AI awareness among AI prototypers. Given a user’s prompt, Farsight leverages embedding similarities to surface news articles about relevant AI incidents from the AI Incident Database [111] and uses LLMs to generate potential use cases, impacted stakeholders, and harms for AI prototypers to review, edit, and add to. Applying a progressive disclosure design [129], our tool fits into users’ diverse prompting workflows. With a novel adaptation of node-link diagrams [146], Farsight enables users to interactively visualize, generate, and edit use cases, stakeholders, and harms (section 4).
Empirical findings about harm envisioning processes from a co-design study and an evaluation study. During our design of Farsight, we conducted a co-design study with 10 AI prototypers to evaluate our design ideas and generate new ideas (section 3). After developing Farsight, we conducted an evaluation user study with 42 AI prototypers to examine the effectiveness of Farsight in aiding users to brainstorm harms and improving their ability to independently identify harms. Our mixed-method analysis highlights that, after using Farsight, AI prototypers are better able to independently identify potential harms that might arise from an application developed with a given prompt, and participants report that our tool is more useful and usable than existing resources. In particular, Farsight encourages users to shift their focus from the AI model to the end-users, providing them with a broader perspective to consider indirect stakeholders and cascading harms (section 6).
An open-source, web-based implementation that lowers the barrier to applying responsible AI practices. We develop Farsight with cutting-edge web technologies, such as Web Components [115] and WebGL [114], so that it can be easily integrated into any web-based prompt development environments, such as Google AI Studio and Jupyter Notebook (Figure 3). We open source1 Farsight as a collection of reusable interactive components that future researchers and designers can easily adopt (section 4.4). To see a demo video of Farsight, visit https://youtu.be/BlSFbGkOlHk.

2 Related Work

2.1 Anticipating Technology’s Negative Impacts

Various design methods and approaches have been developed to support ideation about potential downstream impacts of technology, including anticipatory tech ethics [22, 126], speculative design [10, 49, 197], and value-sensitive design [57, 59, 60] among others. To support designers with this, prior research has developed design toolkits [e.g., 29] and resources, such as Envisioning Cards [58], Value Cards [165], Timelines [199], and the Black Mirror Writers’ Room [89], among others [e.g., 11, 46]. Such resources are intended to be used by designers of technology early in the design process, but they may not fit neatly into existing product design and development processes, particularly for AI-powered application design paradigms, where large pre-trained models are used for many downstream tasks [183].
In addition to technology designers, computing researchers have called for the computer science field to consider the negative impacts of their work in addition to the positive impacts [76]. In AI research, conferences such as NeurIPS have begun requiring that researchers articulate potential negative broader impacts of their work in statements at the ends of their papers [140] to avoid the “failures of imagination” [20] that may lead to downstream harms. Prior work analyzed these broader impacts statements, finding convergence around a set of topics such as risks to privacy and bias, but often lacking concrete specifics or strategies for mitigation [8, 99, 127, 167]. However, prior work suggests that many CS researchers may not have the training, resources, or inclination to engage in this type of anticipatory work [45, 175], suggesting that new tools, training, and processes are needed to support researchers and developers in engaging in anticipatory work in ways that are integrated into their research practices. More recently, researchers have proposed a framework that uses LLMs to anticipate harms for classifiers by generating stakeholders and vignettes for a given scenario [24], evaluating this framework through interviews with responsible AI researchers. Farsight builds upon this framework and extends it to (1) target an early prototyping stage through in situ and interactive interfaces that promote user engagement in the harm envisioning process, (2) support LLM-powered applications with diverse tasks beyond classification, and (3) evaluate its effectiveness through a user study with 42 AI prototypers.

2.2 Identifying and Mitigating LLM Harms

More recently, there has been a growing body of research that specifically focuses on identifying and mitigating the harms of LLMs. Researchers have introduced harm taxonomies specifically for LLMs, which identify known risks (i.e., informed by observed instances of harm) [18, 100, 190] and emerging risks of LLMs (anticipated risks based on foreseeable capabilities of LLMs) [108, 166]. Since LLMs can be used for a wide range of tasks associated with many different categories of harms, researchers have presented frameworks and evaluation methods to assess a particular type of LLM harm, including misinformation [74, 135], representation and toxicity [42, 64], human autonomy [65, 168], malicious use [38, 154], and data privacy [87, 97]. The popular methods to identify these harms include benchmarking [27, 28], user research [101, 106], and adversarial testing [41, 137]. Based on existing benchmarks and harm taxonomies of LLM risks, Weidinger et al. [191] introduce a sociotechnical evaluation framework that identifies three AI actors with LLM safety responsibilities: AI model developers, AI application developers, and third-party stakeholders.
The mitigation strategies for these harms depend on the use cases and context. Popular strategies include algorithmic and sociotechnical approaches [192], such as improving the training data to mitigate social stereotypes and biases [173]; fine-tuning LLMs on curated datasets [64]; filtering LLM outputs [194, 205]; employing special decoding techniques [93, 158]; adding instructions in prompts [9]; monitoring the use of LLMs [192]; as well as inclusive product design and development from the beginning [34, 36, 75, 83]. Building on this prior work, Farsight introduces a novel framework that leverages human-AI collaboration to help AI prototypers identify the potential harms of LLMs. Specifically targeting AI prototypers as one subset of AI application developers [183, 191], Farsight introduces novel techniques and in situ interfaces to foster responsible AI awareness during AI prototyping, although the current version of Farsight does not assist AI prototypers in mitigating potential LLM harms.

2.3 Responsible AI Tools and Practices

Despite the increasing emphasis on responsible AI from the technology industry [4, 7, 134, 201], academia [44, 96], and policymakers [81, 105, 177], incorporating responsible AI practices into AI product development remains a challenge [e.g., 183, 181, 2]— in part due to practitioners’ insufficient knowledge of responsible AI [e.g., 159, 141, 62], lack of engagement with direct stakeholders or domain experts [80, 103], and organizational culture and structure [104, 143].
To address these challenges and facilitate the adoption of responsible AI practices, researchers have proposed several approaches. These include incorporating ethics into AI education [56, 165, 172], providing engaging playbooks or design activities [11, 79, 206], and fostering ethical norms in AI research and development [99, 142, 171]. In addition, researchers have also proposed a wide range of tools to operationalize responsible AI practices [95, 198]. These tools encompass libraries and frameworks that cover various dimensions of responsible AI, including fairness [13, 155, 189], explainability [128, 150], testing and error analysis [119, 151, 182], and model development documentation [63, 122, 142].
Moreover, alongside these advancements, there has been a rise in the research and development of easy-to-use interactive visualization tools to further facilitate the operationalization of responsible AI. For example, tools like What-If Tool [195], FairVis [25], and Visual Auditor [124] enable ML developers to visually assess the fairness of ML models across a diverse range of inputs. Visual analytics systems such as Summit [77], LIT [179], and GAM Changer [187] empower ML developers to interpret their models and fix problematic behaviors. Interactive visual testing tools like Errudite [203], Angler [153], and AdaTest [149] help ML developers surface weaknesses in their models.
Inspired by these tools, Farsight joins the body of research of interactive visualization tools for responsible AI by visualizing use cases, stakeholders, and harms (section 4.3). In contrast to existing tools that target traditional ML models after they have been trained, Farsight focuses on diverse LLM-powered applications in an early prototyping stage. During this stage, AI prototypers have greater flexibility to iterate on the design and objectives of their applications and implement early mitigation strategies such as engaging with stakeholders and improving data collection [3].

2.4 In Situ Alerting Tools

Although in situ responsible AI tools are relatively nascent, there is a large body of research in designing in-context warning tools and interfaces. For example, security and HCI researchers study how to best present warnings to raise people’s online security awareness [e.g., 147, 109, 125] and protect people from malware and phishing attacks [e.g., 51, 145, 53]. The key challenges when designing effective warning interfaces include the presentation of comprehensible messages and supporting evidence [15, 54], engaging users [50, 202], and preventing alert fatigue and habituation [5, 17]. To address these challenges, researchers recommend designing simple interfaces [66, 67], considering the trade-off between blocking and non-blocking warnings [50], varying interfaces [5], and requiring user input [23].
Using in-context warnings to improve users’ safety awareness and encourage users to take protection measures can be considered a form of “digital nudging” [26, 160]. More recently, researchers have also adapted in-context security warnings to nudge social media users to recognize and avoid online disinformation [85, 163] and reflect before posting potentially harmful content [88, 169, 200]. Beyond platform-initiated integration of warnings, end-users also voluntarily seek in-context alert interfaces for productivity improvement. For example, writers use grammar checker tools like Grammarly [73], which offer in-context warnings and scores to improve their writing. Similarly, software developers use accessibility developer tools [40, 69] to detect potential accessibility issues during the development process. However, there has been little work in designing and evaluating in situ warnings for developing AI applications, particularly for responsible AI. Farsight’s design draws inspiration from many existing warning interfaces (section 3). Our work advances the landscape of in situ alerting research by addressing responsible AI for modern AI application development.

3 Formative Study & Design Goals

To identify the needs and potential challenges faced by users in envisioning harms, we conducted a formative co-design study to investigate (1) how AI prototypers envision harms (if they do), (2) what design ideas are most helpful for them, and (3) how to motivate users to think about potential risks when prototyping an AI application. In this section, we report our findings from the formative co-design study, and in section 6, we report on our findings from a subsequent evaluation user study.

3.1 Co-design Study

Participants. To inform our tool’s design, we conducted a co-design user study with 10 AI prototypers based in the United States. These participants were recruited from Google through internal mailing lists. Our recruitment criteria required participants to have experience using an internal prompt-crafting tool, PromptMaker [84], which is similar to Google AI Studio [70] and GPT Playground [133]. Each session was 60 minutes, and each participant received an average of $50 USD in their choice of a gift card or a donation to their preferred charity. Among the 10 participants (U1–U10), 6 identified as men, 3 identified as women, and 1 identified as non-binary. Four participants self-reported having expertise in responsible AI. Information about participants’ job roles is listed in Table 1. All participants are our targeted users (AI prototypers).
Table 1:
Participant Roles: Participant IDs
Software Engineer: 1*, 4, 5, 6, 7, 10
Research Scientist: 3*, 8*
Technical Writer: 2
Program Manager: 9*
Table 1: The co-design user study includes 10 participants with diverse roles. All participants have experience in prompting LLMs. Four participants who self-reported having expertise in responsible AI are marked with asterisks (*).
Procedure. We structured our study as a “during-design co-design study” [156]. Participants were asked to bring a recent prompt that they had written to the study. The study started with a semi-structured interview regarding participants’ prompting workflows and their experience in thinking about potential harms linked to their applications (section A.2). Then participants were asked to use our very early-stage design prototypes (section A.2) to envision potential harms associated with their application while thinking aloud. Participants were also presented with low-fidelity sketches for our other design ideas. These prototypes and sketches can be found in Figure S1. Finally, we asked participants to rate and provide feedback on all of our design ideas and generate their own design suggestions (section A.3).
Fig. 4:
Fig. 4: Average ratings of our design ideas from 10 AI prototypers, sorted from highest to lowest rated: add/edit use cases, export and share harms, AI-generated use cases, tree visualization, add/edit stakeholders, AI-generated stakeholders, add/edit harms, AI-generated harms, harm summary, use case sidebar, related AI incidents, risk symbol, environment impact, Clippy-like assistant, latest AI incidents. The corresponding design prototypes and low-fidelity sketches are shown in Figure S1.
Design feedback. Interestingly, although perhaps not surprisingly [cf. 78], none of the 6 participants without expertise in responsible AI reported that they typically considered the potential harms of their AI prototypes when writing prompts, while 3 of the 4 participants with expertise in responsible AI did report typically anticipating harms during the prototyping process. Participants’ ratings are shown in Figure 4. Overall, participants favored using AI to generate use cases of their AI prototypes, potential stakeholders, and potential harms. Many participants also highlighted the importance of being able to edit AI-generated content and control the generation direction (U4, U8). On the other hand, participants were less in favor of more distracting design ideas (e.g., an anthropomorphized assistant tool similar to Microsoft Office’s Clippy) or irrelevant content (e.g., the latest, rather than the most relevant AI incidents). Participants also provided us with helpful usability feedback that we integrated into our final design of Farsight (section 4).
New design ideas. Participants generated many interesting design ideas to help raise responsible AI awareness among AI prototypers. For example, participants recommended categorizing AI-generated harms (U1, U5), allowing users to rate the severity of harms (U6), and using users’ input to steer AI generation (U10). We integrated these design ideas into the final design of Farsight (section 4). Some other interesting design ideas include designing a game-like reward system to incentivize users to anticipate harms (U5), building online communities to allow users to share their envisioned harms using Farsight and seek support (U2), allowing real-time collaborative harm envisioning similar to Google Slides (U1, U4), and automatically revising a user’s prompt to address identified harms (U4). We discuss the implications of these design ideas for user motivation (section 7.1) and mitigation strategies (section 7.3).

3.2 Design Goals

Based on our literature review (section 2) and findings and early feedback from the co-design user study, we identify the following five design goals (G1–G5) for Farsight.
G1.
Guide users in imagining use cases. Existing research highlights the challenges faced by ML practitioners when attempting to anticipate the uses of their ML-powered applications and how different individuals or groups may be affected [20, 45, 103, 171]. Confirming this, software engineer U6 noted “You don’t really know how your tool could be used, so it’s really hard to envision what harms would be.” The availability of LLMs and prompt-crafting tools has broadened the spectrum of AI prototypers to include people without prior technology development experience [55, 84], which can further magnify the challenges associated with envisioning diverse use cases for AI applications. Therefore, we design Farsight to help AI prototypers with diverse backgrounds to brainstorm a wide array of use cases for their AI applications.
G2.
Help users understand, organize, and prioritize harms. Depending on an AI application’s goal, implementation, and context, some harms are more salient than others [24, 121]. To help AI prototypers assess harms, Farsight should first help them understand where and how harms might occur and who might be impacted, by connecting harms to use cases and stakeholders [12, 103, 120]. Participants expressed a desire for the ability to categorize (U1, U5) and rate the severity (U6) of harms. To meet these needs, we aim to design an easy-to-use interface that empowers users to navigate, comprehend, and label harms within diverse potential harm scenarios.
G3.
Fit into current workflows and mitigate habituation. In our co-design study, none of the 6 participants without expertise in responsible AI had previously thought about harms when writing prompts. We also found some participants were not incentivized to anticipate harm on their own; for example, U6 explained “To be honest, as a software engineer, I don’t use policy tools [compliance tools like checklists] unless I have to.” Thus, to make Farsight easy to adopt, we aim to take inspiration from in situ warning tools [e.g., 51, 145, 53] to design it in a way that fits into AI prototypers’ existing workflows instead of introducing new workflows. In addition, we aim to apply strategies like varying content [5] and promoting user input [23] to mitigate habituation—a common pitfall of in-context warning designs [5, 17].
G4.
Promote user engagement and provide compelling examples. Prior research highlights that the effectiveness of warning tools depends on their clarity and persuasiveness [15, 54]. As we are targeting AI prototypers with diverse experience in AI and responsible AI, Farsight should be easy to use and understand. When asked what would help them envision potential harms for their AI applications, many participants mentioned referring to prior examples of AI harms (U1, U2, U8). For instance, U2 said “Giving some specific real [harm] examples for different types of seemingly innocuous applications would help alert people [to consider harms].” Therefore, we aimed to integrate real examples in Farsight to motivate users and help them understand the potential risks of their applications. Participants liked being able to control the harm envisioning process (Figure 4), and active participation is a key factor in learning [92], which is essential to fostering AI prototypers’ ability to independently identify harms. Thus, Farsight is designed to provide users with agency and encourage them to actively and critically think about harms.
G5.
Open-source and adaptable implementation. Given the ever-expanding array of LLMs and prompt-crafting tools [31], our approach in designing Farsight is to ensure it remains adaptable to this dynamically evolving landscape. We aimed to design Farsight to be model-agnostic and environment-agnostic, thereby making it accessible to users of different LLM models (e.g., Gemini [178], GPT-4 [132], Llama 2 [180]) and prompt-crafting interfaces (e.g., GPT Playground [133], Google AI Studio [70], Wordflow [184]). Furthermore, we open source our implementation to foster future advancements in the design, research, and development of responsible AI tools.

4 User Interface

Following the five design goals (G1–G5), we present Farsight, the first in situ interactive tool that aims to foster responsible AI awareness among AI prototypers during the AI prototyping process. Farsight is designed to be a plugin for any web-based prompt-crafting tool. Farsight’s interface employs progressive disclosure [129], enabling users to smoothly transition across three main components, with each phase increasing the level of user engagement. The Alert Symbol (section 4.1) presents an always-on symbol that shows the approximated alert level of a user’s current prompt; the Awareness Sidebar (section 4.2) highlights news articles about related AI incidents and LLM-generated use cases and harms; and the Harm Envisioner (section 4.3) visualizes LLM-generated harms and allows users to edit, add, and share harms. Examples in this section use the PaLM 2 model through its API; we chose this model because it provided free API access to the public during our design process. Researchers and designers can easily replace PaLM 2 with other LLMs by changing the API endpoints in Farsight.

4.1 Alert Symbol

Fig. 5:
Fig. 5: Three alert modes of the Alert Symbol.
The Alert Symbol is an always-on indicator overlaid on the AI prototyping tool that shows the alert level of a user’s prompt (Figure 5). Every time the user runs their prompt, the Alert Symbol updates the alert level using the new prompt. Based on the computed alert level, the symbol has three modes (Figure 5), each characterized by a progressively more attention-grabbing style. Thus, Farsight only disrupts AI prototypers’ flow when their prompts require more caution (G3).
Calculating the Alert Level. Auditing and quantifying the societal risk of LLM-powered applications is an open research problem [144]. To categorize the potential harms that might arise from users’ prompts, we propose a novel technique that uses the similarity between the prompt and previously documented AI incident reports as a proxy for the prompt’s alert level. First, we use an LLM to extract high-dimensional latent representations (embeddings) of all AI incident reports indexed in the AI Incident Database [111], which includes more than 3k community-curated news reports about AI failures and harms. Then, we extract the embedding of the user’s prompt and compute pairwise cosine distances between the prompt embedding and the AI incident report embeddings. We label each incident report as highly relevant, relevant, or not relevant based on two distance thresholds, 0.69 and 0.75. We determine these two thresholds from an experiment with 1k random prompts (see section B.1 and Figure S2 for details). Researchers can easily adjust these two thresholds (between 0 and 1) to calibrate an article’s relevancy.
Finally, we show the number of AI incidents classified as relevant in an orange circle and the number classified as highly relevant in a red circle (Figure 5) as a proxy for the prompt’s potential risk. In other words, we consider a prompt to have a higher risk if many AI incident reports are semantically and syntactically similar to it.
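For illustration, the sketch below shows how this distance-based bucketing could be implemented; it is not Farsight’s shipped code, and it assumes that the prompt embedding and the incident-report embeddings have already been computed with an embedding model. The label and function names are illustrative.

```typescript
// A sketch of the alert-level heuristic: bucket AI incident reports by their
// cosine distance to the prompt embedding using the two thresholds above.
type Relevance = 'not relevant' | 'relevant' | 'highly relevant';

// Thresholds determined from the experiment with 1k random prompts (section B.1).
const HIGHLY_RELEVANT_THRESHOLD = 0.69;
const RELEVANT_THRESHOLD = 0.75;

function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function classifyReport(promptEmbedding: number[], reportEmbedding: number[]): Relevance {
  const d = cosineDistance(promptEmbedding, reportEmbedding);
  if (d <= HIGHLY_RELEVANT_THRESHOLD) return 'highly relevant';
  if (d <= RELEVANT_THRESHOLD) return 'relevant';
  return 'not relevant';
}

// The Alert Symbol then displays the counts of relevant (orange circle) and
// highly relevant (red circle) reports as a proxy for the prompt's risk.
function alertCounts(promptEmbedding: number[], reportEmbeddings: number[][]) {
  const labels = reportEmbeddings.map((e) => classifyReport(promptEmbedding, e));
  return {
    relevant: labels.filter((l) => l === 'relevant').length,
    highlyRelevant: labels.filter((l) => l === 'highly relevant').length,
  };
}
```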

4.2 Awareness Sidebar

Fig. 6:
Fig. 6: The Awareness Sidebar provides in situ information to remind AI prototypers of potential risks. (A) Given a user’s current prompt, (B) the Incident Panel shows the (B1) latest and (B2) related AI incident reports sampled from the AI Incident Database [111]. (B2) The related AI incident tab is the default view, which uses text embedding similarities between the user’s prompt and all AI incident reports to surface relevant reports. (C) The Use Case Panel leverages an LLM to generate potential use cases and harms. Each use case is classified by an LLM and organized into (C1) intended, (C2) high-stakes, and (C3) misuse tabs.
After a user clicks the Alert Symbol, the Awareness Sidebar (Figure 6) expands from one side edge of the AI prototyping tool (G3), highlighting potential consequences of AI applications or features that are based on the user’s current prompt. We use a real prompt from Awesome ChatGPT Prompts [1] in the example in Figure 6.
Incident Panel. To encourage users to consider potential risks associated with their prompts (Figure 6A), the Incident Panel highlights news headlines of AI incidents that are relevant to the user’s prompt (Figure 6-B2). These incidents comprise the top 30 incident reports that are classified as relevant or highly relevant, sorted by their embeddings’ cosine distances to the embedding of the user’s prompt so that the most relevant reports appear first. The thumbnails are color-coded based on the incident’s relevancy level. Users can click the headline or the thumbnail to open the original incident report in a new tab. These real AI incidents can function as cautionary tales [103, 199], reminding users of potential AI harms (G4).
Use Case Panel. To help users imagine how their AI prototype may be used in AI applications or features (G1), the Use Case Panel (Figure 6C) presents a diverse set of potential use cases generated by an LLM. Each use case is shown as a sentence describing how a particular group of end-users could use this AI application in a specific context. For example, for a writing tutor prompt, a potential use case can be “teachers use it to provide feedback on student writing” (Figure 6-C1). We also use an LLM to generate a potential harm that could occur within that use case, shown below the use case sentence. For example, a harm for the teacher feedback use case can be “students may feel like they are not getting personalized feedback from their teachers.” Here, we use few-shot learning to prompt the LLM to generate only use cases and harms, whereas the Harm Envisioner (section 4.3) generates use cases, stakeholders, and harms. We open-source all of our prompts.
To help users assess and organize use cases and harms (G2), we also leverage an LLM to categorize each use case as intended, high-stakes, or misuse, although we acknowledge that these categories may vary by use case, development and deployment context, as well as relevant policies or regulatory frameworks in various jurisdictions. These three categories were introduced by responsible AI researchers to help ML developers structure their harm envisioning process [121]. The intended use cases are those that align with the developer’s target use cases. The high-stakes use cases encompass those that may arise in high-stakes domains, such as medicine, finance, and law. The misuse category includes scenarios where malicious actors exploit the AI application to cause harm. The Use Case Panel organizes use cases and harms into three tabs (Figure 6-C1–3) based on their categories. The first tab, “mix”, provides an overview by showing one use case and its corresponding harm from each of the other tabs.
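A few-shot generation prompt of this kind could resemble the sketch below; the wording, example, and function name are hypothetical, and Farsight’s actual prompts are included in its open-source release.

```typescript
// A hypothetical few-shot prompt template in the spirit of the Use Case Panel:
// it asks the LLM for one use case, one of the three categories, and one harm.
const buildUseCasePrompt = (promptSummary: string): string => `
You are helping an AI prototyper anticipate downstream consequences of an AI application.

Example
AI functionality: Translate English text to French.
Use case: Travelers use it to translate restaurant menus while abroad.
Category: intended
Harm: Mistranslated allergy information could endanger a traveler's health.

Now generate one use case, its category (intended, high-stakes, or misuse),
and one potential harm for the functionality below.

AI functionality: ${promptSummary}
Use case:`;

// Example: build the LLM input for a writing-tutor prompt summary.
const llmInput = buildUseCasePrompt('Provide feedback on a student essay.');
console.log(llmInput);
```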

4.3 Harm Envisioner

Fig. 7:
Fig. 7: The Harm Envisioner helps AI prototypers envision harms associated with their AI applications through human-AI collaboration. (A) Given a prompt, (B) Farsight uses an LLM to generate a summary of the prompt and asks users to revise it. (C) Then, the Harm Envisioner presents an interactive node-link diagram to visualize use cases, stakeholders, and harms. Initially, the Harm Envisioner only shows up to the Use Cases layer. (C1) Users can edit a node’s content before asking the AI to generate its children nodes. (C2) Users can delete unhelpful nodes. (C3) This view encourages users to think of and add more harms by intermittently and randomly alternating the harm categories shown in empty harm nodes, such as “increased labor?”
Both the Alert Symbol and the Awareness Sidebar provide easy-to-understand in-context reminders to help users reflect on potential harms associated with their prompts. However, instead of passively reading AI incident reports and LLM-generated content, users desire to actively edit and add new use cases, stakeholders, and harms (Figure 4). Also, active participation—a key factor in learning—may help foster AI prototypers’ ability to independently identify harms. Therefore, we design Harm Envisioner (Figure 7) to support users in actively envisioning potential harms associated with their prompts (G4). We use a real prompt from Awesome ChatGPT Prompts [1] in the example in Figure 7.
Interactive Node-link Tree Visualization. After clicking the “Envision Consequences & Harms” button in the Awareness Sidebar, the Harm Envisioner appears as a pop-up window on top of the prompt-crafting tool (Figure 7). It begins with a text box filled with an LLM-generated summary of a user’s prompt (Figure 7B). The user is prompted to revise the summary to align with the target task in their prompt. Next, the window transitions into an interactive node-link tree visualization [146], where the user can pan and zoom to navigate the view (Figure 7C). First, the window shows the user’s prompt summary as the root of the tree, visualized as a text box. When the user clicks the root node, the LLM generates potential use cases of an AI application based on the user’s prompt, which are visualized as the root’s children nodes. Similarly, users can click a generated node and the LLM will generate its children nodes (stakeholders and then harms). There is a maximum of four layers, following the order of the user’s prompt summary → use cases → stakeholders → harms. This layer order reflects the recommended harm envisioning workflow in responsible AI literature [12, 46, 103, 120, 121] and helps users comprehend and organize diverse harms across different contexts (G2). For additional examples of LLM-generated use cases, stakeholders, and harms in Farsight, see Table S1.
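For illustration, this four-layer tree could be represented with a simple recursive node type such as the hypothetical schema below; Farsight’s internal data model may differ.

```typescript
// A hypothetical schema for the four-layer tree: prompt summary → use cases →
// stakeholders → harms. Clicking a node asks the LLM to populate the next
// layer as that node's children.
type NodeType = 'summary' | 'use-case' | 'stakeholder' | 'harm';

interface EnvisionNode {
  type: NodeType;
  text: string;           // LLM-generated or user-edited content
  harmCategory?: string;  // harm nodes only: a label from the harm taxonomy [164]
  severity?: number;      // harm nodes only: optional user-assigned severity rating
  children: EnvisionNode[];
}

// The root holds the user-confirmed prompt summary; its children stay empty
// until the user asks the LLM to generate use cases.
const root: EnvisionNode = {
  type: 'summary',
  text: 'Provide writing feedback on a student essay.',
  children: [],
};
```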
Fig. 8:
Fig. 8: Icons used to represent different harm types.
Human-AI Collaboration in Harm Envisioning. Our goal is to use AI-generated harms to encourage users to reflect on potential downstream harms and inspire them to add, edit, or curate potential harms (G4). To do that, the Harm Envisioner allows users to edit any tree node by clicking an edit button in the toolbar (Figure 7-C1) or entering new text in the tree node. In addition, users can delete nodes (Figure 7-C2) and ask the LLM to regenerate all of an edited node’s children nodes, effectively steering the harm envisioning direction by offering feedback to the LLM (G4). To meet users’ needs of categorizing harms (G2), we use an LLM to classify each harm into a harm type based on a systematic review and taxonomy of AI harms [164]. Users can use the dropdown menu to change the harm’s category (Figure 8). To help users prioritize and take notes about harms, the Harm Envisioner allows users to rate the severity of each harm using a rating button in the toolbar. Finally, users can click the export button to export all content (e.g., use cases, stakeholders, and harms) in the Harm Envisioner as a Markdown file.
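The Markdown export could be implemented as a depth-first traversal of that tree, as sketched below under the assumed node shape from the earlier schema; the exact output format Farsight produces may differ.

```typescript
// A sketch of the Markdown export: a depth-first traversal that nests each
// tree layer one bullet deeper. The node shape mirrors the EnvisionNode sketch
// above; the formatting here is illustrative, not Farsight's actual output.
interface ExportNode {
  type: string;
  text: string;
  severity?: number;
  children: ExportNode[];
}

function toMarkdown(node: ExportNode, depth = 0): string {
  const indent = '  '.repeat(depth);
  const label =
    node.type === 'harm' && node.severity !== undefined
      ? `${node.text} (severity: ${node.severity})`
      : node.text;
  const lines = [`${indent}- [${node.type}] ${label}`];
  for (const child of node.children) {
    lines.push(toMarkdown(child, depth + 1));
  }
  return lines.join('\n');
}
```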

4.4 Open-source and Reusable Implementation

To make Farsight easily adoptable by both AI prototypers and AI companies (G5), we implement Farsight to be model-agnostic and environment-agnostic, and we open-source our implementation. Farsight uses LLMs by calling their public APIs so that users can use their preferred LLMs by easily replacing the API endpoints. To help AI companies and researchers integrate Farsight into AI prototyping tools, we leverage Web Components [115] and Lit [68] to implement Farsight as reusable modules, which can be easily integrated into any web-based interface regardless of its development stack (e.g., React, Vue, Svelte). To help AI prototypers use our tool, we present a Chrome extension2 that integrates Farsight into Google AI Studio and a Python package3 that brings Farsight to computational notebooks. We implement the interactive tree visualization using D3.js [19] and embedding similarity computation using TensorFlow.js [170] with WebGL [114] acceleration. Computational notebook support is implemented using NOVA [188].
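Because Farsight’s widgets are standard Web Components, a host application could mount them with plain DOM calls as sketched below; the element tag and attribute names are placeholders rather than Farsight’s actual component API, which is documented in the open-source repository.

```typescript
// A hypothetical integration sketch: once the Farsight bundle is loaded and its
// custom elements are registered, any web-based prompting tool can mount a
// widget with standard DOM APIs, regardless of whether it uses React, Vue, or Svelte.
const toolbar = document.querySelector('#prompt-toolbar');
if (toolbar) {
  const alertSymbol = document.createElement('farsight-alert-symbol'); // placeholder tag
  alertSymbol.setAttribute('prompt', 'Translate the following English text to French.');
  toolbar.appendChild(alertSymbol);
}
```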

5 Usage Scenario

We present a hypothetical usage scenario to illustrate how Farsight fosters responsible awareness among AI prototypers. Rosa is a native English speaker from the United States who recently traveled to Vietnam to teach English. She is the only English teacher at an under-resourced high school. Overwhelmed with grading English writing assignments for all students in the school, Rosa tries to develop an LLM-powered AI application that provides writing feedback based on a student’s essay. She writes her prompt (Figure 6A) in an AI prototyping tool with Farsight integrated. After running the prompt, Rosa notices the alarming Alert Symbol (Figure 6A), so she clicks on it, which expands the Awareness Sidebar (Figure 6-BC). Rosa reads a few related articles shown in the Incident Panel (Figure 6-B2). She finds that these articles are indeed related to AI in education and are helpful, but they mainly focus on students using AI to cheat rather than teachers using AI to grade assignments. Rosa skims through the LLM-generated potential use cases and finds the use case “teachers use it to provide feedback on student writing” very relatable (Figure 6-C1). Intrigued by its associated harm “students may feel like they are not getting personalized feedback from their teachers”, Rosa clicks the Envision Consequences button to learn more about this use case and its associated potential harms.
Harm envisioner. Next, Farsight shows a pop-up window asking Rosa to revise and confirm an LLM-generated summary of her prompt (Figure 7-B). After confirming the summary, Rosa sees the Harm Envisioner presenting an interactive tree visualization that shows the functionality of her AI application as a root node and multiple use cases as its children nodes (Figure 7-C). With a map-like interface, Rosa quickly uses zoom-and-pan to zoom into the teaching use case. After clicking the use case node, the Harm Envisioner quickly generates the stakeholders associated with the use case and the harms associated with each stakeholder. Rosa takes some time to reflect on the LLM-generated harm of students not getting personalized feedback (Figure 7-Harm-1). She has never thought about this consequence before, but she thinks it makes sense—AI does not have background knowledge about each student, so its feedback would not be tailored to students’ individual needs. After rating this harm as very severe using the rating button, Rosa continues reading other LLM-generated harms. She does not think the harm of teachers losing jobs to her AI tutor is relevant, so she deletes it (Figure 7-C2).
Human-AI collaboration. After seeing the random question “increased labor?” next to teacher (Figure 7-C3), Rosa thinks it may be more time-consuming to review AI-generated feedback than to grade students’ assignments herself, so she enters that harm into the Harm Envisioner. Next, Rosa is not sure about the legal liability of her school (Figure 7-Harm-3), but she thinks it might be worth discussing with other teachers. Finally, reflecting on her experience with the Harm Envisioner and the AI incident articles, Rosa concludes that the potential harms of her writing tutor AI application outweigh its potential convenience for her. Therefore, Rosa decides to stop prototyping this application. However, Rosa still sees value in leveraging LLMs in education, so she bookmarks related AI incident articles and clicks the export button to download all the content in the Harm Envisioner as a Markdown file. She will bring these resources to discuss with her colleagues the next day.

6 Evaluation User Study

We conducted a user study to evaluate Farsight’s effectiveness in aiding AI prototypers to anticipate the potential harms associated with AI features. In addition, we investigate how AI prototypers use Farsight during an early prototyping stage. To investigate the effect of user engagement in AI-assisted harm envisioning, we tested two variants of our tool: Farsight, which includes all components, and Farsight Lite, which includes only the Alert Symbol (Figure 1-B) and the Awareness Sidebar (Figure 1-C). In other words, Farsight Lite is a “subset” of Farsight. Farsight Lite only shows one direct stakeholder for each use case in the Use Case Panel, while Farsight allows users to interactively add more stakeholders, use cases, and harms in the Harm Envisioner (Figure 1-D). The study included 42 AI prototypers with diverse roles who were recruited from a large technology company based in the United States. In this user study, we aimed to investigate the following three research questions:
RQ1.
How do Farsight and Farsight Lite affect users’ ability for and approach to identifying potential harms?
RQ2.
How effective and useful are Farsight and Farsight Lite in assisting users in envisioning harms in comparison to existing commonly-used resources?
RQ3.
What challenges do AI prototypers face when envisioning potential harms during the AI prototyping stage? How do Farsight and Farsight Lite help AI prototypers address these challenges?

6.1 Participants

Table 2:
Participant Roles: Participant IDs
Software Engineer: 3, 4, 5, 6, 7, 12, 13, 15, 16, 17, 19, 23, 25, 26, 28, 29, 33, 34, 35, 41, 42
Product Manager: 1, 8, 10, 11, 14, 20, 24, 27, 36
Linguist: 2, 21, 30, 31
AI Researcher: 9, 18, 39, 40
UX Researcher: 22
Data Scientist: 32
Test Engineer: 37
Marketing Specialist: 38
Table 2: The evaluation user study included 42 participants with diverse roles and experience in prompting LLMs.
We recruited 45 voluntary participants through internal AI-related mailing lists and snowball sampling at Google, based in the United States. Recruitment required participants to have experience writing prompts for LLMs. In total, we received 61 responses and selected 45 participants based on their schedule availability. We used the first three study sessions as pilot studies, which were not included in our data analysis, resulting in a total of 42 participants. Each study session was either 90 minutes (n=28 sessions) or 60 minutes (n=14 sessions), depending on the participants’ availability. Participants in 90-minute sessions received an average of $62 USD in compensation, and participants in 60-minute sessions received an average of $41 USD, in their preferred form, such as gift cards or charity donations.
Among the 42 participants, 26 identified as men, 14 as women, and 2 preferred not to disclose their gender. Information about their job roles is listed in Table 2. Recruited participants self-reported an average score of 2.55 for their knowledge and experience with responsible AI on a 5-point Likert scale (Figure 10-top), where 1 represents “No experience” and 5 represents “Expert (I have helped others apply responsible AI practices).” In addition, participants self-reported an average score of 2.81 for experience with LLM prompting on a 5-point Likert scale (Figure 10-bottom), where 1 represents “Beginner” and 5 represents “Expert.” All participants are Farsight’s targeted users, AI prototypers.
Fig. 9:
Fig. 9: The evaluation study included six conditions with different variations of harm envisioning tools (Farsight, Farsight Lite, and the baseline Envisioning Guide). Participants were asked to envision potential harms associated with an AI feature (e.g., email summarizer) in each harm-envisioning activity (H1, H2, H3, and H4). Participants had access to a harm envisioning tool in H2 and H4. The duration of sessions involving H4 and interview 2 was 90 minutes, while all other sessions lasted 60 minutes. Participants were randomly assigned to a condition, taking into account their availability for study session duration.
Fig. 10:
Fig. 10: Participants reported diverse levels of familiarity with responsible AI (top, average=2.55) and LLM prompting (bottom, average=2.81) on 5-point Likert scales.

6.2 Study Design

We conducted this study with participants one-on-one. Out of 42 sessions, 2 were conducted in person and 40 through video conferencing software due to office locations and participants’ scheduling constraints. With the permission of all participants, we recorded the participants’ audio and computer screens for subsequent analysis. To start, each participant signed a consent form and filled out a survey regarding their familiarity with responsible AI and LLM prompting (Figure 10). Then, participants were randomly assigned to one of six conditions, taking into account their time availability: CFG, CF, CLG, CL, CGF, CGL (Figure 9). C stands for the study condition, and the letters that follow denote the tools used, in order: CFG means that participants used Farsight first and then the Envisioning Guide, and CL means that participants only used Farsight Lite; the other acronyms follow the same pattern. Sessions of CF and CL were scheduled for 60 minutes each, while the remaining sessions were allotted 90 minutes each. We assigned 7 participants to each condition, as this was the maximum number that allowed for an equal distribution of participants across all conditions, given the time constraints and the availability of the 61 individuals who signed up for the study.
Our study followed a mixed design that combines both between-subjects and within-subjects designs [161]. Each session included three or four harm-envisioning activities, denoted as H1, H2, H3, and H4 (section 6.2.2), as well as one or two semi-structured interviews to collect participants’ feedback (section 6.2.3). In each harm-envisioning activity, participants were asked to envision potential harms associated with a particular AI feature while thinking aloud (Figure 9). In H1 and H3, participants envisioned harms on their own, whereas in H2 and H4, they could use a harm envisioning tool we assigned them based on their study condition (e.g., Farsight, Farsight Lite, or Envisioning Guide). All collected harms were rated by seven researchers with experience in responsible AI evaluations, who assigned each potential harm numeric scores for likelihood and severity (section 6.2.4). We compared the envisioned harms in H1 and H3 (between-subjects) to investigate how different tools affect users’ ability and approach to anticipating harms (RQ1). We compared the envisioned harms in H2 and H4 (within-subjects) to assess the effectiveness of different tools in helping users envision harms (RQ2). Besides the quantitative data on the number and ratings of potential harms, we also collected qualitative data from think-aloud sessions and two interviews (RQ1–RQ3). We incorporated 60-minute sessions (CF and CL) into our study design due to challenges in recruiting participants available for a 90-minute duration.
Fig. 11:
Fig. 11: In the evaluation user study, we compared our tools against Envisioning Guide, a combination of existing harm envisioning resources. This Envisioning Guide was presented to participants as a Google Doc with three sections. (A) The harm modeling workflow table comes from Microsoft’s Harm Modeling Practice [120], providing a four-step process to envision harms. (B) The harm modeling prompts from the Harm Modeling Practice [120] offer templates and questions to help users envision different stakeholders and use cases (not all content is displayed here). (C) The harm taxonomy [164] helps participants explore the space of potential harms by providing a comprehensive list of 20 harm categories organized into five themes (not all content is displayed here).

6.2.1 Baseline Harm Envisioning Tool.

To compare our work against current responsible AI workflows, we created a baseline intervention, the Envisioning Guide: a combination of Microsoft’s Harm Modeling Practice [120] and the Harm Taxonomy from Shelby et al. [164]. These two resources are the latest and most representative resources designed to help practitioners envision harms. We combined them because (1) we aim to simulate the current practice where AI prototypers can choose from various existing harm envisioning tools, and (2) we do not intend to study the causal effects of any specific resource. We administered this intervention by providing a Google Doc containing a detailed table and information from these resources (Figure 11). Both resources were designed to help technology developers and researchers anticipate and prevent negative societal impacts of their technology innovations.

6.2.2 Harm Envisioning Activities.

Depending on the conditions, the study included three or four harm envisioning activities (H1–H4). Within each harm envisioning activity, participants were presented with a description of an AI feature and the prompt that generated that feature. We chose the four AI features (Figure 9) based on a qualitative analysis of 100 randomly sampled internal prompts written by real AI prototypers. These four features are representative of popular LLM tasks (e.g., summarization, classification, and question answering) and comprehensible to participants with diverse roles. In H1 and H3, participants independently envisioned harms, whereas in H2 and H4, they were provided with a harm envisioning assistance tool (e.g., Farsight, Farsight Lite, or Envisioning Guide). To emulate AI prototyping workflows, we asked participants to perform simple prompt engineering tasks in H2 and H4 before envisioning potential harms of presented AI features.
For each harm, participants were instructed to describe who would be affected (i.e., the stakeholders) and how the stakeholder might be harmed. We provided a harm example for a code generation AI feature: “App end-users might face financial loss due to AI-introduced software vulnerabilities.” During the process, participants were asked to share their screens and verbalize their thoughts. They were also asked to enter their envisioned harms into a Google Doc table featuring a who column and a how column. Moreover, participants had the option to articulate the harm verbally, and we transcribed it into the table. At the end of each harm envisioning activity, we reviewed the table together with the participants to ensure the accuracy of the harm descriptions. Participants were instructed to achieve three objectives: (1) envision as many harms as possible; (2) envision the most likely harms; and (3) envision the most severe harms.
H1: Pre-task. To understand how participants independently envision potential harms before using the tool, as a baseline for RQ1, participants were asked to anticipate potential harms concerning an LLM-powered email summarizer on their own (Figure 9). They received information about the AI functionality: “Shorten and improve a user’s email”, a development context, and a prompt that enables this functionality (see details in section C.1). The duration of this activity was limited to 10 minutes.
H2: Intervention. In the second harm envisioning activity, we asked participants to use different harm envisioning assistant tools. Depending on the assigned condition, a participant could use Farsight (CFG, CF), Farsight Lite (CLG, CL), or Envisioning Guide (CGF, CGL) to help them anticipate harms. The activity began with a tutorial on the designated tool. The AI feature used in this activity was an LLM-powered toxicity classifier (Figure 9). Participants received information regarding the AI functionality “Detect toxic text content,” a development context, and a prompt that enables this AI functionality (section C.2). To emulate AI prototyping workflows, we also tasked participants with a simple prompt engineering assignment (section C.2).
After completing prompt engineering, participants envisioned harms linked to the toxicity classifier. They were instructed to freely use the assigned tools while sharing their screens and thinking aloud. For participants assigned the Envisioning Guide (CGF, CGL), the process of entering envisioned harms was the same as H1. Participants assigned Farsight (CFG, CF) or Farsight Lite (CLG, CL) could click a button in the tools to export all harms as a text file. The export included both AI-generated harms and harms added or modified by participants. Participants were asked to copy the harms into the Google Doc. As a significant portion of these harms were generated by AI, we asked participants to select harms that (1) they agreed with and (2) would report to their colleagues and managers. Also, participants were welcome to add more harms to the table. For our analysis, we only included the exported harms that participants had selected and added to the table. The duration of this activity was limited to 25 minutes.
H3: Post-task. To understand how the intervention may have affected participants’ ability to independently envision harms (RQ1), we asked participants to envision harms associated with an LLM-powered article summarizer on their own (Figure 9). To ensure a valid comparison between the envisioned harms and participants’ approaches to the pre-task (H1), we introduced a parallel AI summarizer feature in this activity that was isomorphic to the pre-task [139]. In particular, to deter participants from directly reusing their envisioned harms from H1, we replaced the email summarizer in H1 with an article summarizer. The AI functionality was described as “Summarize an article in one sentence”. The development context and prompt are available in section C.3. The duration of this activity was limited to 10 minutes.
H4: Alternative. To assess the effectiveness and usefulness of Farsight and Farsight Lite in comparison to Envisioning Guide (RQ2) and study the usage patterns of different tools (RQ3), n = 28 participants engaged in 90-minute sessions (CFG, CLG, CGF, and CGL) to envision harms using a tool different from the one used in H2 (Figure 9). Participants were asked to envision potential harms associated with an LLM-powered math tutor app with the AI functionality “Answer math-related questions”, a development context, and a prompt (section C.4). The procedure for this activity paralleled H2, including a tutorial, prompt engineering exercise (section C.4), and harm envisioning. This activity’s duration was limited to 25 minutes.

6.2.3 Semi-structured Interviews.

This study included two semi-structured interview sessions (Figure 9). The first interview took place after the post-task activity (H3), where we asked participants to reflect on their process for anticipating potential harms during the LLM prototyping process, and how their approach may have changed after the intervention (RQ1). We also asked participants about their challenges in harm anticipation, their experiences of using harm envisioning tools, and potential actions they would take to address the identified harms (RQ3, section D.1). After participants in 90-minute sessions (CFG, CLG, CGF, and CGL) finished H4, we asked them to compare and rate the usefulness and usability of the two tools they had used in this study (RQ2). We also asked them to rate the helpfulness of different components in the tools on a 5-point Likert scale, as elucidated in section D.2.

6.2.4 Harm Rating.

After completing all 42 study sessions, we recruited seven raters to rate all 989 harms collected in H1–H4 to evaluate participants’ ability to envision harms. These seven raters included four of the paper authors and three industry researchers; all raters had experience with responsible AI (unlike many of the participants), either as responsible AI researchers, developers of responsible AI tools or playbooks, or in a consultant role on responsible AI for product teams. Ideally, evaluations of identified harms would involve domain experts for the domain in question (e.g., education) and/or stakeholders from demographic groups or communities who may be likely to experience those harms. For this preliminary study, due to timing and resource constraints, we recruited responsible AI researchers as raters instead of specific domain experts or people impacted by AI applications. The limitations of this approach are further discussed in sections 6.7 and 7.2.
Our collected harms were either (1) directly envisioned by participants or (2) exported from Farsight or Farsight Lite and subsequently curated by participants during H2 and H4. Each harm included the impacted stakeholders and a description of the harm. After removing duplicates and randomly shuffling the harms, we evenly assigned them to raters in spreadsheets. Raters had access to the details of the intended AI feature of each harm, including the prompt and the context of the AI feature (e.g., the prompt and context in section C.1). To prevent the raters from being influenced by our hypotheses, we did not include the experimental conditions in the rating sheet. In other words, raters did not know if a harm was from a Farsight user, a Farsight Lite user, or an Envisioning Guide user. To mitigate rating noise, we designated three raters for each harm. As identifying likely and severe harms is often an objective in AI harm envisioning exercises [120, 142], we asked raters to rate each harm’s likelihood and severity on a 4-point Likert scale (strongly agree, agree, disagree, and strongly disagree to the statements “This harm is likely to occur for this stakeholder” and “This harm will severely impact this stakeholder”). Raters could also choose an N/A option if they perceived a rating was not applicable for that feature or use case. During data analysis, we numericalized these four categories as ordinal scores from 1 to 4 and removed N/As. See Table S2 for a random subset of harms that were collected from participants and their corresponding ratings.
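To make the aggregation concrete, below is a minimal Python sketch of how per-harm scores could be computed; the data layout and variable names are hypothetical, and the direction of the 1–4 mapping is our assumption based on the example ratings discussed later (section 7.2, Table S2).

```python
import numpy as np

# A minimal sketch (not the study's actual code) of the rating aggregation,
# assuming a hypothetical list-of-dicts layout for the collected ratings.
# The numeric mapping (strongly disagree = 1 ... strongly agree = 4) is an
# assumption consistent with the example ratings in Table S2.
SCALE = {"strongly disagree": 1, "disagree": 2, "agree": 3, "strongly agree": 4}

ratings = [
    {"harm_id": 12,
     "likelihood": ["agree", "strongly agree", None],   # None marks an N/A
     "severity": ["disagree", "agree", "agree"]},
]

def average_rating(labels):
    """Numericalize the 4-point scale and average, dropping N/A (None) responses."""
    scores = [SCALE[label] for label in labels if label is not None]
    return float(np.mean(scores)) if scores else None

for harm in ratings:
    harm["avg_likelihood"] = average_rating(harm["likelihood"])   # 3.5
    harm["avg_severity"] = average_rating(harm["severity"])       # ~2.67
```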

6.3 Data Analysis

We applied a mixed-methods approach for data analysis. First, we conducted a quantitative analysis (section 6.3.1) on the changes in participants’ ability to envision harms by comparing pre-task H1 to post-task H3 responses (RQ1). We also quantitatively assessed the three tools’ effectiveness in helping users anticipate harms by comparing H2 and H4 responses (RQ2). The quantitative analyses involved metrics such as the total number of envisioned harms, as well as the average likelihood and severity ratings of envisioned harms across 3 raters. Next, we performed a qualitative analysis (section 6.3.2) on transcripts from think-aloud sessions and interviews to further investigate participants’ strategies and challenges in envisioning harms, and their usage patterns of different tools (RQ1–RQ3).
Fig. 12:
Fig. 12: To evaluate how different interventions (Farsight, Farsight Lite, Envisioning Guide) affect users’ ability to envision harms independently, we conducted one-sample t-tests with Bonferroni correction to examine the difference in the (A) count, (B) average likelihood, and (C) average severity of participant-identified harms between H3 and H1. Each intervention had n = 14 participants, represented by 14 points on the chart. The charts also indicated the 95% confidence intervals, adjusted with Bonferroni correction. The results highlighted that after using Farsight and Farsight Lite, users could anticipate a significantly higher number of harms, while the average likelihood and severity of identified harms remained the same.

6.3.1 Quantitative Analysis.

We first conducted quantitative analyses on the count, likelihood, and severity of harms across different conditions to evaluate the effectiveness of our tools (RQ1, RQ2). We measured the likelihood and severity of each harm using the average of ratings from three raters after removing any N/As. The average pairwise weighted Cohen’s kappas [32, 112] for likelihood and severity ratings are 0.14 and 0.09 (see Figure S3 and section E for details). These values fall within the range of slight agreement [94]. We discuss this relatively low inter-rater agreement in section 7.2. Normality tests [162] show that all measures, except for the change in harm count between H1 and H3 with Envisioning Guide, follow a normal distribution. We therefore used t-tests with Bonferroni corrections for multiple hypothesis testing.
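As an illustration, the following sketch shows this style of analysis with SciPy on made-up per-participant harm counts; the Bonferroni family size of nine is our assumption based on the three measures and three interventions shown in Figure 12, not a detail stated in the paper.

```python
import numpy as np
from scipy import stats

# Made-up per-participant harm counts for one intervention (n = 14),
# in the pre-task (H1) and post-task (H3); not the study's data.
h1_counts = np.array([3, 2, 4, 1, 3, 2, 5, 2, 3, 4, 2, 3, 1, 4])
h3_counts = np.array([6, 4, 7, 3, 5, 5, 8, 4, 6, 7, 4, 6, 3, 7])
diff = h3_counts - h1_counts

# Check whether the paired differences look normally distributed.
print(stats.shapiro(diff))

# Paired t-test on H3 vs. H1 (equivalent to a one-sample t-test on diff).
t_stat, p_value = stats.ttest_rel(h3_counts, h1_counts)

# Bonferroni correction: multiply by the number of tests in the family
# (assumed here to be 3 measures x 3 interventions = 9).
p_adjusted = min(p_value * 9, 1.0)
print(t_stat, p_adjusted)
```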
We also analyzed participants’ ratings of the tools’ usefulness and usability when comparing the two tools used in the study (RQ2, section 6.5.3). We converted the 5-point Likert scale ratings into numerical values and assessed the difference between ratings of our tools and Envisioning Guide using Mann-Whitney U tests [107]; because most of the ratings did not exhibit a normal distribution, we chose this test, which does not assume normality in the data. See section 6.5.3 for a discussion of the findings from these questions about usefulness and usability.
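A corresponding sketch for the Likert-scale comparisons, again with made-up ratings, might look like the following; the Mann-Whitney U test is used because it does not assume normally distributed data.

```python
import numpy as np
from scipy import stats

# Made-up 5-point Likert ratings of "easy to use" (1 = strongly disagree,
# 5 = strongly agree) for a Farsight group and an Envisioning Guide group.
farsight_ratings = np.array([5, 4, 5, 4, 5, 3, 4, 5, 4, 5, 4, 4, 5, 4])
guide_ratings = np.array([3, 3, 4, 2, 3, 4, 3, 2, 3, 4, 3, 3, 2, 3])

# Two-sided Mann-Whitney U test; suitable for ordinal, non-normal data.
u_stat, p_value = stats.mannwhitneyu(
    farsight_ratings, guide_ratings, alternative="two-sided"
)
print(u_stat, p_value)
```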

6.3.2 Qualitative Analysis.

We conducted a qualitative analysis on the screen recordings and transcripts of the study sessions, which included participants’ verbalized thoughts during the harm envisioning activities (H1–H4) and interviews. All study sessions were screen-recorded and audio-recorded, with the audio subsequently transcribed by the video conferencing software. We adopted an inductive thematic analysis approach [21, 116] and open coded the 56 hours of transcripts using the qualitative analysis software Dovetail [47]. After generating a codebook, we applied deductive coding [116] to assign harm envisioning patterns to each participant during H1 and H3 (RQ1, section 6.4.2).

6.4 Findings: Changes in Users’ Envisioning Ability and Approach (RQ1)

In the study, participants were asked to independently envision harms associated with an email summarizer (H1) and an article summarizer (H3) before and after using a harm envisioning tool (Farsight, Farsight Lite, or Envisioning Guide) to anticipate harms for a toxicity classifier (H2). We quantitatively and qualitatively compared participants’ envisioned harms and approaches in H1 and H3 across different conditions in H2.

6.4.1 Farsight and Farsight Lite Improved Users’ Ability to Envision Harms.

For each participant, we compared the count, average likelihood, and average severity of their independently envisioned harms before (H1) and after (H3) the intervention (Figure 12). Using paired sample t-tests with Bonferroni correction [48], we found that after using Farsight and Farsight Lite, users could envision significantly more harms on their own (p = 0.0028, p = 0.0003), showing an average increase of 2.42 and 3.00 harms, respectively. The effect sizes, as measured by Cohen’s d [33], were d = 1.21 and d = 1.27, indicating a very large effect [157]. In contrast, for participants using Envisioning Guide, the average count of identified harms decreased marginally (−0.14). We hypothesize that the three participants who identified fewer harms after using Envisioning Guide (see the outliers in Figure 12-A) did so because Envisioning Guide imposed a high cognitive load, which may have left them with less energy to envision harms in H3 compared to H1. Changes in the average likelihood and average severity, on the other hand, were not statistically significant for any of the interventions (Figure 12-BC). Our finding implies that after using Farsight and Farsight Lite, users could independently anticipate a greater number of harms linked to AI features, while the average likelihood and severity of identified harms remained unchanged.
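For reference, a common convention for Cohen’s d in paired comparisons is the mean of the H3 − H1 differences divided by their standard deviation; the paper does not state which variant it used, so the helper below is only one plausible reading, shown on made-up counts.

```python
import numpy as np

def paired_cohens_d(before, after):
    """Cohen's d for paired samples: mean difference over the SD of differences."""
    diff = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    return diff.mean() / diff.std(ddof=1)

# Example with made-up harm counts (not the study data):
print(paired_cohens_d([2, 3, 1, 4, 2], [4, 5, 2, 6, 3]))
```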
Table 3:
Harm Envisioning Pattern: Description
Failure-mode-driven envisioning: Participants envisioned harm by initially considering the AI feature’s failure modes (e.g., wrong summarization), limitations of LLMs (e.g., hallucination), or vulnerabilities within system implementation (e.g., data storage). This pattern is similar to a Failure Mode and Effects Analysis [152].
Usage-driven envisioning: Participants envisioned harm by initially considering who may be impacted by this feature and in what usage scenario, such as students using the article summarizer for completing assignments. Then, participants envisioned potential harms that might impact the stakeholders within the identified scenario.
Consider high-stakes uses: Participants deliberately thought about high-stakes use cases of the AI feature, such as being used in medical, financial, and legal domains.
Consider misuses: Participants deliberately envisioned potential misuse of the AI feature, where malicious actors like scammers and hackers could exploit this AI feature to cause harm.
Consider indirect stakeholders: Participants deliberately brainstormed stakeholders indirectly impacted by the AI feature, such as people who did not use the AI tools, individuals mentioned in the input text, and society at large.
Consider cascading harms: Participants deliberately considered (1) harms that could result from other harms, such as productivity loss due to AI errors leading to economic loss; or (2) harms that might occur even when the AI feature operated as expected, such as students using AI to cheat on homework.
Table 3: We identified six non-exclusive common patterns in independent harm envisioning by analyzing transcripts of participants’ think-aloud process during the harm envisioning activities in H1 and H3.

6.4.2 Changes in Harm Envisioning Approaches.

We also investigated the impacts of different tools on participants’ approaches to harm envisioning by analyzing their self-reports in interview 1 and the think-aloud data in H1 and H3.
Self-reported changes after using Farsight and Farsight Lite. The major themes of self-reported changes are similar between Farsight and Farsight Lite. A large number of participants noted that while they initially considered the AI feature and its potential harms in a general sense during H1, they shifted towards a more focused consideration of specific use cases and stakeholders in H3 (e.g., P23, P34, P38). Some participants highlighted they started to brainstorm potential misuses in H3 (P25, P32). For stakeholders, participants broadened their consideration to people and organizations not initially considered during H1. P40 acknowledged a transition from a focus on “protecting the AI company” in H1 to considering end-users in H3. Similarly, P17 reported a focus on end-users after using Farsight:
“Earlier maybe I was coming towards it from a very engineering or a very broad feature perspective. The third time, I was thinking more about people who were actually using the product and getting affected. So I was thinking more with respect to the people using it, rather than that being a feature in some application.” (P17)
Many participants also highlighted that they began to adopt the frameworks presented in Farsight and Farsight Lite (e.g., P9, P10, P32) to structure their harm envisioning procedures. For example, P10 and P32 appreciated the categorization of use cases, and they reported considering intended uses, high-stakes uses, and misuses in H3. After using Farsight, P9 said they followed the sequence of layers in the tree visualization to conceptualize use cases, stakeholders, and harms:
“I found that sort of flow from identifying potential use cases, then identifying stakeholders of those use cases, then identifying potential harms for each of the stakeholders to be really valuable. That’s a great way to scaffold it and think through the flow rather than just sort of bouncing around, which is what I had been doing [in H1]. So yeah, I found that super valuable that has changed the way that I think about it. And that’s the framework that I’ll use in the future.” (P21)
Self-reported changes after using Envisioning Guide. Many participants using Envisioning Guide in H2 (CGF, CGL) also noted shifts in their approaches to envisioning harms. Several participants noted that they started to follow the structure outlined in the Harm Modeling Guide to envision harms (P8, P40, P42). Some participants started thinking more about under-represented social groups in H3 (P8, P31). Furthermore, many participants described the harm taxonomy as a “mental checklist” that provided them with a language to articulate and think about harms (e.g., P6, P14, P31).
Fig. 13:
Fig. 13: By analyzing transcripts of 42 participants during the pre-task (H1) and post-task (H3) harm envisioning activities, we identified six non-exclusive common patterns in envisioning harms. This bar chart compares the number of participants who applied and did not apply these patterns before and after the three interventions. Note that there were 14 random participants for each intervention, and the initial number of participants applying certain patterns could differ. The chart highlights that both Farsight and Farsight Lite encouraged participants to consider how the AI feature would be used. Notably, the use of Farsight particularly influenced participants to think more about indirect stakeholders and cascading harms.
Observed changes in envisioning approaches. By analyzing transcripts of participants’ think-aloud process during the harm envisioning activities in H1 and H3, we identified six non-exclusive common patterns in harm anticipation (Table 3). Then, we examined the effects of different interventions on participants’ envisioning patterns by comparing the number of participants who applied and did not apply these six patterns in H1 and H3 across interventions (Figure 13). The intervention assignment was random.
Interestingly, the counts of participants who applied each pattern in H1 were consistent across interventions, with the exception of Farsight Lite, where notably more participants considered indirect stakeholders in H1 (Figure 13-5). Before the interventions, the majority of participants relied on failure-mode-driven envisioning when anticipating harms (Figure 13-1), focusing on the AI feature’s limitations, failure modes, and technical implementation details. This observation corroborates participants’ self-reported envisioning approaches, where participants like P17 acknowledged having a “very engineering or a very broad feature perspective” in H1.
After the intervention, we observed that all three harm envisioning tools (Farsight, Farsight Lite, and Envisioning Guide) influenced participants to adopt a usage-driven envisioning approach when independently envisioning harms (Figure 13-2). Particularly, Farsight had the most pronounced effect, followed by Farsight Lite and then Envisioning Guide. All these tools encouraged participants to think more about high-stakes uses (Figure 13-3) and indirect stakeholders (Figure 13-5). Both Farsight and Farsight Lite exerted a stronger influence on considering misuses (Figure 13-4) and cascading harms (Figure 13-6) compared to Envisioning Guide. However, Envisioning Guide had slightly more impact than Farsight Lite in encouraging consideration of high-stakes uses (Figure 13-3) and indirect stakeholders (Figure 13-5).
Interestingly, Farsight had a notably more pronounced effect in leading participants to consider indirect stakeholders (Figure 13-5) and cascading harms (Figure 13-6) than the other tools. For indirect stakeholders, a possible explanation is that during H2, many participants encountered unexpected indirect stakeholders revealed by Farsight (section 6.5.2). In contrast, Farsight Lite was less effective in fostering consideration of indirect stakeholders, as it had only identified one direct stakeholder for each use case, and participants could not use AI to generate more stakeholders.
For cascading harms, we hypothesize two potential explanations. First, many participants applied a reviewing approach when engaging with AI-generated harms in Farsight and Farsight Lite, where they tried to understand and make sense of these harms. In H2, reviewing existing harms prompted participants to consider cascading harms that might arise from other harms (section 6.5.2). Second, some participants encountered unexpected AI-generated cascading harms in H2 (section 6.5.2).

6.5 Findings: Farsight’s Effectiveness in Assisting Harm Envisioning (RQ2)

In addition to assessing the impacts of different harm envisioning tools on users’ ability to independently envision harms, we also evaluated the tools’ effectiveness in aiding users to anticipate harms. Specifically, we quantitatively compared participants’ envisioned harms when using different harm envisioning tools in H2 and H4. Furthermore, we qualitatively analyzed participants’ usage patterns, interview responses, and survey data.
Fig. 14:
Fig. 14: To evaluate the effectiveness of our tools in helping users anticipate harms, we conducted paired t-tests with Bonferroni correction to compare our tools (Farsight, Farsight Lite) against the baseline Envisioning Guide based on the (A) count, (B) average likelihood, and (C) average severity of harms collected in H2 and H4. In each comparison, such as Farsight vs Envisioning Guide, n = 14 participants (each shown as two connected dots) used both tools: 7 of them started with Farsight in H2, and the remaining 7 began with Envisioning Guide. The charts also highlighted the mean and standard deviation of all measures. The results showed that Farsight and Farsight Lite were effective in assisting users to anticipate a significantly greater number of harms compared to existing resources, while the quality of the identified harms remained consistent.

6.5.1 Farsight and Farsight Lite helped users envision more harms.

We compared the count, average likelihood, and average severity of harms collected in H2 and H4 using our tools, Farsight and Farsight Lite, against the baseline Envisioning Guide (Figure 14). These harms were identified by participants using different harm envisioning tools or generated by AI and selected by the participants. This analysis followed a within-subjects approach, including 28 participants from CFG, CGF, CLG, and CGL. In each comparison, such as Farsight vs. Envisioning Guide, a total of 14 participants used both tools, with 7 of them starting with Farsight in H2 (CFG) and the remaining 7 beginning with Envisioning Guide (CGF). Results from paired t-tests, adjusted with Bonferroni correction, highlighted that participants using Farsight and Farsight Lite reported a significantly higher number of harms than those using Envisioning Guide (p = 0.0018, p = 0.0034), with an average difference of 4 harms (Figure 14-A). The effect sizes, as measured by Cohen’s d, were d = 1.57 and d = 1.48, indicating a very large effect. However, no significant differences were observed regarding the likelihood and severity of identified harms between our tools and Envisioning Guide (Figure 14-BC). Our findings suggest that our tools are effective in assisting users to identify a greater number of harms compared to existing resources, while the quality of the identified harms remains consistent.

6.5.2 Usage patterns.

We summarized how participants use Farsight and Farsight Lite in H2 and H4.
Trying to understand (unexpected) AI-generated content. Upon encountering AI-generated content (e.g., use cases, stakeholders, and harms), participants first sought to (1) understand why the AI had generated it and then (2) assess its likelihood and relevance to their AI application. For example, for the toxicity classifier in H2, Farsight and Farsight Lite would sometimes generate a use case “HR departments use it to screen job applicants for toxic behaviors.” This use case was usually unexpected to participants, and it prompted them to think about how an HR department could employ a toxicity classifier. Some participants imagined that HR could use this classifier on applicants’ social media to identify red flags (e.g., P10, P11, P29), while others could only see it being used on applicants’ cover letters (P4). Participants then assessed how likely and relevant this scenario was before diving into related harms.
Subjectivity in apprehending auto-generated content. We observed that based on participants’ prior experiences, they could have very different views on auto-generated content in Farsight. For example, participants had different perceptions of how their companies’ HR division might use a toxicity classifier (e.g., applying the classifier to job applicants’ social media content or their application material). Also, for the toxicity classifier in H2, the Incident Panel would often show an incident report on biases in sentiment analysis tools. While some participants could quickly make the connection between sentiment analysis and toxicity classification and reflect on biases in toxicity classifiers (P10, P36), others would overlook this incident (P19, P38).
In some cases, participants’ disagreement came from their different definitions of harm. For example, in both H2 and H4, our tools would generate potential harms for people who do not use the AI applications, such as “students who do not use the math tutoring app may feel left behind.” Some participants perceived these harms as crucial considerations for assessing the impacts of AI applications (e.g., P6, P18, P30), while others argued against considering harms when an AI feature is absent (e.g., P4, P9, P13). We discuss the implications of subjectivity and rater disagreement in harm envisioning in section 7.2.
Sparked to brainstorm new harms. The content in Farsight and Farsight Lite often inspired participants to brainstorm new use cases, stakeholders, and harms. After seeing an AI-generated stakeholder, many participants could quickly identify potential harms for that stakeholder. For instance, seeing the stakeholder teachers in the math tutoring app in H4, P22 added a new harm that teachers may struggle to integrate this tool into their existing teaching workflows. Many participants also came up with new harms by making connections across different AI-generated use cases, stakeholders, and harms. For example, Farsight anticipated two use cases for the toxicity classifier: (1) online moderators using it to identify toxic content, and (2) hate groups using it to recruit people. P2 connected both use cases and added a new harm: “online moderators could face death threats from hate groups who feel their speech is censored.”
Thinking beyond immediate harms. Instead of starting with a blank slate, our tools provided participants with initial materials that prompted them to think beyond the immediate harms and envision cascading repercussions. For example, after seeing the AI-generated harm “job applicants might be unfairly rejected” within the context of HR using a toxicity classifier to screen job applicants, P38 quickly thought of a cascading harm: the company’s diversity hiring effort could be harmed, as the toxicity classifier was more likely to misclassify and reject under-represented social groups. Similarly, P18 recognized that, in the long run, the hiring company could lose money due to the exclusion of qualified candidates caused by a biased toxicity classifier. This usage pattern might explain why more participants who used Farsight and Farsight Lite in H2 independently envisioned cascading harms in H3 (Figure 13-6).
Fig. 15:
Fig. 15: Average ratings from 28 participants, comparing the usefulness and usability of Farsight and Farsight Lite to Envisioning Guide. Both of our tools were preferred and perceived as more helpful, easier to use, and more enjoyable than the existing resources. Each comparison involved 14 participants who used one of our tools and Envisioning Guide in random order. We use an asterisk (*) to denote statistically significant rating differences, determined by Mann-Whitney U tests with Bonferroni correction. We used Mann-Whitney U tests instead of t-tests due to the non-normal distribution of many ratings.
Fig. 16:
Fig. 16: Average ratings of envisioning tool features.
Thinking about mitigation strategies. Interestingly, after seeing AI-generated harms, many participants voluntarily considered actions and strategies to address the harms they had envisioned. For example, after seeing AI-generated harms for the toxicity classifier, P15 and P16 noted that it was important to allow impacted stakeholders to appeal if their content was removed because of the classifier. Similarly, P27 and P40 noted that people should implement a human review process if the toxicity classifier was used to remove social media content. Interacting with Farsight and Farsight Lite also encouraged participants to reflect on their prompting workflows. For example, P29 and P37 mentioned that AI prototypers should start collecting good and diverse toxicity examples to improve the prompt through few-shot prompting. P2 noted that they would like to add additional instructions to their prompt to safeguard against biased output and potential data leakage. Finally, after envisioning more harms, P2 mentioned that they would rethink whether it was worth continuing to prototype or develop this AI feature.

6.5.3 Our tools were usable, useful, and preferred by users.

We asked participants who had used one of our tools and Envisioning Guide (CFG, CGF, CLG, CGL) to compare and rate the usefulness and usability of the tools they had used on a 5-point Likert scale. By comparing their ratings, we found that both Farsight and Farsight Lite were preferred and considered more helpful, easier to use, and more enjoyable than Envisioning Guide (Figure 15). Both tools had significantly higher ratings on “easy to use” than the baseline (p = 0.0384, p = 0.0260). In addition, Farsight was rated significantly more helpful than the baseline (Figure 15-A), while Farsight Lite was more enjoyable (Figure 15-B). The effect sizes of significant results, as measured by the common language effect size [110], were all above 0.7, indicating a large effect.
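The common language effect size used here can be read as the probability that a randomly chosen rating for one tool is higher than a randomly chosen rating for the other, with ties counted as one half; a small sketch of that computation on made-up ratings is shown below.

```python
import numpy as np

def common_language_effect_size(x, y):
    """P(random value from x > random value from y), with ties counted as 0.5."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    greater = (x[:, None] > y[None, :]).mean()
    ties = (x[:, None] == y[None, :]).mean()
    return greater + 0.5 * ties

# Made-up 5-point ratings (not the study data):
print(common_language_effect_size([5, 4, 5, 4], [3, 3, 4, 2]))  # 0.9375
```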
Usefulness of different features. Besides comparing the two tools, participants also rated the usefulness of specific features in each tool. The average ratings are shown in Figure 16. All features in our tools were rated favorably (Figure 16-AB). For Farsight, participants especially liked the interactive tree visualization. For example, P6 commented, “This tree makes a lot of sense. This is how I think about it in my brain as well.” Similarly, P16 appreciated the progressive disclosure in the visualization: “I’m able to not get overwhelmed by everything all at once.” The rating for the AI incident panel (in both Farsight and Farsight Lite) was relatively lower than for other features. Participants explained that the surfaced incidents were not very relevant to their prompts (P39, P41) and that the feature required them to take time to read external articles (P24, P39).

6.6 Findings: Farsight’s Role in Overcoming Harm Envisioning Challenges (RQ3)

After completing the post-task (H3), participants were asked to reflect on the biggest challenges encountered in envisioning harms associated with AI features. We examined the major themes that emerged from these challenges. In addition, by analyzing participants’ usage patterns of Farsight and Farsight Lite, coupled with their interview feedback, we explored how our tools mitigate certain challenges and also identified our tools’ limitations.

6.6.1 Challenges in envisioning harms.

We summarized three major challenges that participants encountered.
C1.
Envisioning use cases. The most prevalent challenge in envisioning harms is to anticipate different use cases for an AI feature. Multiple participants noted that it was most challenging to imagine how different people would use technology, and it was particularly difficult to “put myself in someone’s shoes” (P27, P37, P39) and “empathize with different groups of people” (P11). Participants also underscored the vast space of possible use cases (P31, P33, P36), and “often you don’t find out the edge cases until you actually work with it” (P2). Some participants also emphasized that it sometimes required creativity to imagine how an AI feature would be used and especially misused (e.g., P5, P22, P23).
C2.
Bias and subjectivity in harm envisioning. Interestingly, several participants recognized their own biases in envisioning harms (e.g., P6, P21, P31). For example, P21 noted the challenge in overcoming their biases in anticipating the impacts of AI features: “I had been coming at it from a very American-centric point of view at first. To talk about bias, I hadn’t even conceived of the government using this to monitor my phone, but that could happen in other places.” Moreover, some participants acknowledged the subjectivity in the definition of harms, as well as in the assessment of harms’ likelihood and severity. For example, while envisioning harms and selecting harms to report (H2 and H4), some participants were conscious of whether other people would agree with their identification and assessment of harms (P19, P38).
C3.
Inexperience and discomfort in harm envisioning. Many participants mentioned that our study was the first time they had envisioned harms for AI features (e.g., P17, P26, P28). For example, P26 noted, “I have never envisioned harm before. This is not something I would think of when developing AI products.” Similarly, P18 said, “I’m familiar with technical issues but not their social influence.” Also, P30 pointed out that there were few incentives for developers to envision harms. In addition to unfamiliarity, some participants also noted that it was uncomfortable and sad to think about harms (P3, P12). For example, P3 said, “It’s not comfortable thinking through all the bad things that can happen. I think in general people don’t like thinking about bad things too much.”

6.6.2 Farsight and Farsight Lite address major challenges.

Our tools could help users address identified challenges.
A co-pilot for brainstorming diverse use cases. Many participants appreciated that our tools provided them with a starting point to predict use cases (e.g., P8, P29, P41). For example, after seeing a few AI-generated use cases, P8 found it much easier to envision other use cases, and similarly, P24 felt empowered to “have a wider net to cast” (C1). Also, P14 noted that even seeing far-fetched AI-generated content helped them brainstorm new use cases. On the other hand, P21 appreciated that Farsight had identified many unexpected and thought-provoking use cases that provided a different perspective in anticipating harms (C2).
In situ guide that promotes user agency. Participants especially liked that our tools were directly integrated into existing AI prototyping tools and contextualized based on the prompt (e.g., P19, P31, P37), so that Farsight and Farsight Lite required minimal effort to get started with harm envisioning (C3). Participants also saw the Incident Panel and Use Case Panel as good reminders of potential harms for the AI feature being prototyped (e.g., P12, P41, P42). For example, P12 commented that “Even if it’s just sitting there, it would be educational.” Many participants also liked the interactivity of our tools and found it engaging to add new use cases, stakeholders, and harms (e.g., P9, P19, P24)—many of them noted that Farsight was so intriguing that they would like to continue using it to explore potential harms (C3). Participants felt they had agency in harm envisioning when using Farsight. For example, P21 commented, “If you think something [AI-generated content] is totally bonkers, whatever, just delete it.” Similarly, P4 and P5 compared the Harm Envisioner to a mind map, as they appreciated that the interface allowed them to freely organize and revise their thoughts during harm envisioning.

6.6.3 Limitations of Farsight and Farsight Lite.

Our findings showed that, in comparison to Envisioning Guide, Farsight and Farsight Lite did not show significant differences in participants’ ability to envision more likely or more severe harms (section 6.4.1), nor did they assist participants in envisioning more likely or more severe harms (section 6.5.1). Additionally, participants’ feedback revealed two major limitations of our tools.
Varied quality of LLM-generated content. Depending on participants’ prompts, the related AI incidents in the Incident Panel and the LLM-generated use cases, stakeholders, and harms differed across participants. Sometimes, participants found some of the LLM-generated content confusing and unhelpful. For example, when using our tools on the math tutor prompt (H4), the incidents in the Incident Panel featured articles about hallucination in chat-based LLMs. Some participants found these articles too generic and not relevant to the math tutor app (P39, P41).
Also, some LLM-generated use cases could be too far-fetched. For example, for the math tutor prompt, Farsight sometimes showed a use case: “Scammers use it to explain complex investment schemes to potential victims.” While some participants found it interesting and relevant (e.g., P14, P26), others found it unrealistic and not useful (e.g., P6, P12). This disagreement highlights the subjectivity in identifying and assessing harms (section 7.2). Still, one participant remarked, “Even if it’s wrong [LLM-generated use case], it is still kind of helpful to think beyond the immediate use case and who else can use this tool.” Similarly, P21 said, “Some of these feel more of a stretch but it’s interesting because I could see how it gives me ideas for things to watch out for which I still appreciate.”
Lack of actionability. Another limitation is that our tools did not provide users with actions to prevent or mitigate identified harms (P13, P22, P34). P13 also commented that increasing awareness without providing actions to address responsible AI issues could be harmful, because “People have an empathy quota, and it might just be displacing more impactful efforts.” Related to the discomfort that some participants had experienced when envisioning harms (C3), P40 mentioned that they felt scared and overwhelmed because there were so many possible harms and they did not know how to address them. Similarly, P29 noted that the lack of actionability made them feel anxious and disappointed:
“I’m glad that I got to know about them [potential misuses]. But I feel I’m vulnerable, probably because I can’t do much about stopping them. So that’s something that really makes me feel very disappointed. Because unless we do case-by-case analysis, this [preventing misuses] can be very tricky. I feel like it’s kind of adding anxiety to me. It’s good to know, but I feel like I can’t do much about it.” (P29)
We did not incorporate harm mitigation into our tools, because mitigating harms associated with LLM-powered applications remains an open research question (see more discussion in section 7.3). After the evaluation study, we improved Farsight by providing pointers to existing LLM safety resources [e.g., 6, 117, 71, 164] when users exported their harms.

6.7 Limitations of Study Design

We acknowledge limitations in our tool and study designs. First, we recruited participants from a single large technology company. This was because we required participants to have prior experience in prototyping LLM-powered applications using a particular prompt-crafting tool, into which we integrated Farsight and Farsight Lite for the study. Consequently, all 42 participants had backgrounds in the technology industry in varying roles, such as software engineers, product managers, UX researchers, and linguists, as shown in Table 2. Our participants have a wide range of familiarity with responsible AI and prompting (Figure 10), and they use LLMs for diverse tasks, including prototyping AI features with LLMs—much like the intended users of Farsight. Therefore, findings from our study may be generalizable to AI prototypers who have worked in the technology industry and who are using LLMs to prototype AI-based applications. Nevertheless, to understand how usable or effective Farsight may be for a broader spectrum of AI prototypers, particularly those with limited background or knowledge of AI, such as creative writers, teachers, students, and more, further research involving individuals with more diverse backgrounds is needed. Second, we administered only one post-task (H3) immediately following the intervention (H2). To evaluate the long-term impact of our tools on users’ ability to envision harms, a more extended longitudinal study is needed.
Finally, an inter-rater reliability test showed that, on average, the seven raters (i.e., of the likelihood and severity of the identified harms) only had slight agreement (section E). The ratings of the likelihood and severity of participants’ identified harms should thus be taken as an initial step in evaluating identified harms, and not as the sole evidence demonstrating the value of this approach. The relatively low inter-rater reliability may be due to the fact that perceptions of severity and likelihood may be highly influenced by the raters’ personal experiences, backgrounds, knowledge, and their positionality as a whole. Indeed, substantial prior work on annotations of offensive language, hate speech, and other linguistic phenomena [35, 39, 123, 136, 138] suggests that disagreement between raters with different subjectivities (i.e., personal backgrounds and experiences) is an inherent challenge for sociotechnical evaluations, and not one that can be solved with more or better raters. We further discuss the challenges regarding subjectivity in identifying and assessing harms in section 7.2.

7 Discussion

7.1 Motivation & Engagement in Responsible AI

Potentials of in situ and early intervention in motivating responsible AI practices. Existing research suggests that many AI developers may not have incentives to consider potential harms related to their AI applications [143]—or may be actively disincentivized to identify such harms [104]. Our co-design study validates this finding among an emerging community involved in AI development—AI prototypers who use LLMs to rapidly iterate on potential AI-based applications (section 3.1). With the rapidly increasing access to LLMs and easy-to-use prototyping tools, it is crucial to motivate AI prototypers to consider AI risks when prototyping their AI applications or features (G3). To tackle this challenge, we propose an in situ system design that integrates our tool into the AI prototyper’s existing workflows and employs different design strategies to draw users’ attention without causing significant interruption to their flow. Our evaluation study shows that users appreciate our design and find this in-context warning tool easy to adopt and engaging (section 6.6.2), and that Farsight piques users’ interest (section 6.5.2). These findings highlight the promise of in situ design and early intervention for future responsible AI work. Therefore, future designers of AI development tools (e.g., Google AI Studio, computational notebooks, and VSCode) can natively integrate in situ interfaces to promote responsible AI practices. In addition, future researchers can adopt our design strategies to foster other responsible AI practices, such as illustrating bias in LLMs and encouraging development documentation at an early AI development stage.
Tension between automation and human agency. Farsight’s seamless integration into AI prototypers’ workflows helps motivate AI prototypers to engage with harm envisioning. In addition, rather than asking users to anticipate harms entirely from scratch, Farsight leverages LLMs to generate the initial set of use cases, stakeholders, and harms, providing users with inspiration and a foundation to build upon (section 6.6.3). However, this seamless and automated design might deter users from fully engaging in and contemplating the limitations and potential risks associated with LLMs. Prior research in responsible AI has proposed the value of a seamful design [e.g., 52, 86], where the designers strategically reveal seams or introduce frictions or “productive restraint” [85, 104] to support increased reflection on responsible AI during development. To explore this tension and the tradeoffs between a seamlessly designed workflow that is easy for prototypers to use and a seamful design that prompts reflection-in-action [52], we (1) designed the Harm Envisioner to encourage users to edit LLM-generated content and steer the harm envisioning direction (section 4.3, G4), and (2) evaluated two variants of our tool in the evaluation study—Farsight and Farsight Lite, where Farsight Lite omits the Harm Envisioner.
Our study results highlight that participants feel they have agency when using the Harm Envisioner (section 6.6.2, G4). Our quantitative results also show that Farsight, with higher human agency, is more effective than Farsight Lite across all measures (section 6.4.1, section 6.5.1). On the other hand, when engaging with AI-generated content, some participants also report discomfort (C3) and even anxiety (section 6.6.3). These results suggest that seamless design (easy-to-adopt in situ AI automation) and seamful design (promoting user reflection) are complementary to each other; tradeoffs and a balance between the two should be considered during the design of responsible AI tools [cf. 198]. For future responsible AI work, researchers should engage with potential users and other impacted stakeholders throughout the design process and adjust their design ideas to ensure the responsible AI tools they are designing are both easily adoptable and capable of eliciting active and critical reflection.

7.2 Subjectivity in Harm Envisioning

In our evaluation user study, many participants report challenges overcoming the limitations of their own experiences and perspectives when envisioning harms (C2). In addition, we also observed that the seven responsible AI raters of participants’ harms disagreed about which harms were more or less severe or likely, resulting in a low inter-rater reliability for these two dimensions (section E, Table S2). Our empirical findings contribute to prior research that highlights the role of subjectivity and positionality in anticipating harms [20, 99] and in data annotation, particularly for annotations of toxicity or hate speech [e.g., 123, 43, 39, 35, 138]. What constitutes harm and the assessment of harm severity are often influenced by an individual’s background, lived experiences, or even the organizational culture they are working in [130, 196]. For example, for the article summarizer (H3), one participant envisioned a harm scenario: “If the summary is wrong, journalists’ reputation might be harmed.” (Table S2). This harm scenario received likelihood ratings of 1, 4, and 3, and severity ratings of 1, 3, and 4 from three randomly assigned raters. It is possible that the rater who assigned the ratings of 3 and 4 possessed specific knowledge about the harms of journalists using LLMs to write article summaries, which led them to rate this harm scenario as more likely and more severe.
A need for new methods to assess harms. Emerging research is beginning to develop methods for measuring and resolving disagreements among annotators in cases where there may in fact be no ground truth [e.g., 39, 138, 35, 123, 72]. Our findings in this paper—including the low inter-rater reliability of the responsible AI raters—suggest that new methods are needed in responsible AI to account for different perspectives on the severity and likelihood of potential downstream harms. This may ideally involve recruiting participants from communities or populations who may be impacted by a given AI application (e.g., the stakeholders generated by Farsight, as well as other stakeholders identified by members of the communities themselves [37]). Moreover, with the rapidly increasing access to LLMs and easy-to-use AI prototyping tools, AI prototypers may encompass a broader spectrum of roles beyond traditional AI practitioners [e.g., 78, 183]. Thus, they may lack either the experience or the resources to recognize the limitations of their own subjectivity when anticipating harms of their AI applications, and may lack the means to identify and engage with diverse stakeholders as part of harm envisioning.
Benefits and challenges of using LLMs to envision AI harms. Our evaluation study highlights that diverse and unexpected AI-generated use cases, stakeholders, and harms in Farsight help some participants overcome their own failures of imagination [20] in order to think from a broader perspective when independently envisioning harms (section 6.5.2). Participants also envisioned significantly more harms on their own after using Farsight than after using existing harm envisioning processes [120] (Figure 12). There are two implications of these findings. First, LLMs can be a promising tool to help AI prototypers think outside of their own perspectives, and future researchers can adapt our approach to other responsible AI practices. Second, LLMs may encode biases from their training data [e.g., 190], and Farsight may also reflect the biases of its creators, as expressed in the underlying prompts used in Farsight’s LLM, which raises a critical question: to what extent can LLMs be helpful as part of a harm envisioning process, without over-indexing on particular harms or leading AI prototypers to overlook other types of harms?
Our research provides an initial starting point into investigating these questions, as well as opening new questions into the role of subjectivity in harm envisioning. Future research can further investigate the factors influencing users’ ability to envision harms of AI applications, develop new ways to model and resolve disagreement among AI prototypers or other evaluators about the severity and likelihood of envisioned harms, and integrate such implications into LLM-powered responsible AI tools for AI prototypers or other AI practitioners. Future research can also explore tradeoffs between semi-automated harm envisioning processes (like Farsight) and more traditional processes like value-sensitive design [e.g., 57], participatory design [e.g., 16, 130, 37], and more.

7.3 Mitigating Harms during AI Prototyping

A limitation of Farsight is its focus on harm identification rather than harm mitigation. Participants from our co-design study (section 3.1) and evaluation study (section 6.6.3) wanted Farsight to provide actionable items to help them prevent and mitigate identified harms. Some participants also suggested we develop an in situ prompt editing tool to address harms identified from Farsight (section 3.1). Interestingly, while using Farsight, some participants voluntarily thought about actions and strategies to take after envisioning harms, such as implementing an appeal process, collecting better data, and revising the prompts (sections 6.5.2 and 2.2).
Looking ahead, we argue that it is crucial for future designers to provide users with harm mitigation suggestions and resources in systems similar to Farsight. Some participants in our study complained that Farsight is exploiting users’ “empathy quota” and potentially desensitizing them to LLM harms, because Farsight only warns users about harms without providing mitigation suggestions (section 6.6.3). This desensitization concern parallels the alarm fatigue discussed in prior work (section 2.4) and observed with monitoring alarms in healthcare. Alarm fatigue occurs “when non-actionable alarms are in the majority, and clinicians develop decreased reactivity, causing them to ‘tune out’ or ignore the alarms” [82]. Therefore, to combat alarm fatigue and effectively promote responsible AI practices, future designers should make responsible AI alerts actionable and prioritize actionable warnings in their systems.
Our findings highlight that Farsight users have a great appetite for mitigation strategies during AI prototyping. We have two hypotheses for this observation. First, as Farsight promotes human agency, it might also give participants a feeling of ownership of their identified harms. Prior research shows that triggering a feeling of ownership motivates users’ actions [26]. Another hypothesis is that Farsight elicits fear by exposing participants to diverse potential harms of their AI applications, evidenced by participant-reported discomfort (C3) and anxiety (section 6.6.3). Prior research has similarly used fear appeals as a design strategy to motivate users to take security actions [148]. Therefore, our empirical findings highlight promising research opportunities in (1) providing in situ mitigation strategies during the early AI prototyping stage, and (2) investigating if in situ tools can increase users’ adoption of harm mitigation strategies.

8 Conclusion

We introduce Farsight, the first in situ interactive tool to address the challenges in anticipating potential harms in LLM-powered applications during prototyping. By highlighting relevant AI incident reports and enabling AI prototypers to curate and modify LLM-generated use cases, stakeholders, and harms, Farsight improves users’ ability to independently anticipate potential risks associated with their prompts. A user study with 42 AI prototypers shows that our tool is useful and usable. Farsight fosters a user-centric approach, encouraging creators to consider end-users and cascading harms and to extend their awareness beyond immediate harms. Our tool is open-source and readily adoptable. We hope our work will inspire future research and development of responsible AI tools that target the early stages of the AI development process.

Acknowledgments

We express our gratitude to all anonymous participants who took part in our co-design and evaluation studies. A special thank you to Jaemarie Solyst and Savvas Petridis for piloting our studies. We are deeply thankful to our three anonymous raters for rating the harms collected during the evaluation study. We are immensely grateful for the invaluable feedback provided by Parker Barnes, Carrie Cai, Alex Fiannaca, Tesh Goyal, Ellen Jiang, Minsuk Kahng, Shaun Kane, Donald Martin, Alicia Parrish, Adam Pearce, Savvas Petridis, Mahima Pushkarna, Dheeraj Rajagopal, Kevin Robinson, Taylor Roper, Negar Rostamzadeh, Renee Shelby, Jaemarie Solyst, Vivian Tsai, James Wexler, Ann Yuan, and Andrew Zaldivar. Our gratitude also extends to the anonymous Google employees who generously allowed us to use their prompts to design and prototype Farsight. Furthermore, we are grateful to James Wexler, Paul Yang, Tulsee Doshi, and Marian Croak for their assistance in open-sourcing Farsight. Lastly, we would like to acknowledge the anonymous reviewers for their detailed and helpful feedback.

A Co-design User Study Interviews

A.1 Co-design Prototypes and Sketches

Fig. S1:
Fig. S1: To evaluate our early Farsight designs and generate more design ideas, we conducted a co-design study (section 3.1) with 10 AI prototypers. Participants were asked to use and comment on our very early-stage design prototypes (shown in the labeled cells), and their feedback informed the final design of Farsight (section 4).

A.2 Interview 1 Questions

Why do you use a prompt crafting tool?
How have you used it most recently? Can you walk me through one of your example prompts?
For your previous prompts, did you ever think about the potential societal impacts of your AI application / feature?
- It’s OK if this wasn’t part of your process.
- If yes, how did you think through or envision those potential impacts of your AI prototypes / ideas?
- If yes, can you share some examples of the types of impacts of your AI application that you considered? Including both positive and negative impacts.
- If no, how would you envision potential impacts of your AI prototypes / ideas?
What would motivate you to think more about the potential negative impacts of the applications you were prototyping?

A.3 Interview 2 Rating Forms and Questions

How do you think our design might fit into your prompting workflow?
- How do you think our design might fit into your typical AI application prototyping workflow?
What would these suggested use cases, stakeholders, and harms prompt you to think or do differently? (if not answered already)
What would prevent you from using this tool?
Other design ideas?
- What would encourage more use/engagement with the tool?
- What are other ways you could imagine raising awareness of potential responsible AI and safety issues?

B Interface Details

B.1 Determine the Thresholds for Relevancy of AI Incident Reports

We collect 1,000 random internal prompts written by real AI prototypers. Then, we compute the embedding similarity between these prompts and all AI incident reports [111]. We use the 20% and 70% cumulative distribution function cut-offs (0.69 and 0.75) of the maximum prompt-incident embedding similarity as our thresholds for the three alert levels. Researchers can easily adjust these two thresholds (bounded between 0 and 1) to calibrate an article’s relevancy.
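Below is a minimal sketch of how these cut-offs could be derived; the file names are hypothetical placeholders for precomputed embedding matrices, and cosine similarity is our assumption for the embedding-similarity measure.

```python
import numpy as np

# Hypothetical precomputed embeddings: rows are vectors (e.g., 768-d) for
# the 1,000 sample prompts and for all AI incident reports.
prompt_emb = np.load("prompt_embeddings.npy")      # shape (1000, 768)
incident_emb = np.load("incident_embeddings.npy")  # shape (num_incidents, 768)

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two embedding matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# For each prompt, keep its most similar incident report.
max_sims = cosine_similarity(prompt_emb, incident_emb).max(axis=1)

# Use the 20% and 70% points of the empirical CDF as the two relevancy
# thresholds separating the three alert levels (the paper reports 0.69
# and 0.75 for its internal prompts).
low_threshold, high_threshold = np.quantile(max_sims, [0.20, 0.70])
```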
Fig. S2:
Fig. S2: A visualization of the PaLM 2 embeddings of 3,474 AI incident reports [111] (blue dots and contour) and 153 Awesome ChatGPT Prompts [1] (red dots and contour). The embeddings’ dimensions were reduced from 768 to 2 using UMAP [113] with default parameters. The rectangles and labels show the summaries of AI incident reports in high-density embedding neighborhoods. The summaries are automatically generated by WizMap [186]. The visualization reveals different clusters in the AI incident reports, such as incidents related to autonomous driving cars in the bottom left and machine translation on the right. The overlap between the red and blue contours indicates that user prompts can be in close proximity to AI incident reports in the 2D embedding space. This observation inspires us to use high-dimensional embedding similarities to calculate the alert levels in Farsight (section 4.1). Note that in this example, the 153 user prompts form a cluster due to the primary focus of AwesomeGPT prompts on conversational agents. The distribution of our 1,000 internal prompts (featuring classification, translation, code generation, etc.) is more spread out. For an interactive version of this visualization, visit WizMap.
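For readers who want to recreate a projection like Figure S2, the sketch below shows the dimensionality-reduction step with the umap-learn package; the embedding files are hypothetical placeholders, and plotting and the WizMap summaries are omitted.

```python
import numpy as np
import umap  # umap-learn package

# Hypothetical precomputed PaLM 2 text embeddings (768-d).
incident_emb = np.load("incident_embeddings.npy")  # shape (3474, 768)
prompt_emb = np.load("prompt_embeddings.npy")      # shape (153, 768)

# Project the combined embeddings into 2D with default UMAP parameters,
# as described in the Figure S2 caption.
all_emb = np.vstack([incident_emb, prompt_emb])
coords = umap.UMAP().fit_transform(all_emb)        # default n_components=2

incident_2d = coords[: len(incident_emb)]
prompt_2d = coords[len(incident_emb):]
```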

B.2 Example Farsight Output

Table S1:
Use Cases | Stakeholders | Harms
Online moderators use it to remove toxic content from social media. | Online moderators | Online moderators may be exposed to toxic content while using the AI tool.
 | | Online moderators may feel frustrated or anxious about the accuracy of the AI tool.
 | | Online moderators may lose their jobs due to AI-generated content being offensive.
 | Social media users | Social media users may be unfairly censored by the AI tool.
 | | Social media users may feel frustrated or anxious about the accuracy of the AI tool.
 | | Social media users may feel like they are not being heard by the moderators.
 | The company | The company may be sued for defamation if the AI tool mislabels a post as toxic.
 | | The company may lose customers if the AI tool is not accurate enough.
 | | The company may be accused of bias if the AI tool is not fair in its classification of toxic content.
 | Social media users who do not use this AI product | Social media users who do not use this AI product may be exposed to toxic content.
 | | Social media users who do not use this AI product may feel like they are not being heard by the moderators.
 | | Social media users who do not use this AI product may feel frustrated or anxious about the amount of toxic content on social media.
 | Social media advertisers | Social media advertisers may lose revenue due to toxic content being removed from social media.
 | | Social media advertisers may have to spend more time and money on creating non-toxic content.
 | | Social media advertisers may feel frustrated or anxious about the accuracy of the AI tool.
Customer service agents use it to identify abusive customers. | Customer service agents | Customer service agents may feel stressed or anxious about the accuracy of the AI tool.
 | | Customer service agents may be accused of bias if the AI tool misidentifies abusive customers.
 | | Customer service agents may lose the skill to identify abusive customers independently.
 | Abusive customers | Abusive customers may be denied service due to being labeled as toxic.
 | | Abusive customers may feel frustrated or anxious about being labeled as toxic.
 | | Abusive customers may be more likely to engage in abusive behavior in the future.
 | The company | The company may be sued for defamation if the AI tool mislabels a customer as abusive.
 | | The company may face increased regulatory scrutiny if the AI tool is found to be inaccurate or biased.
 | | The company may lose customers if the AI tool is perceived as biased against certain groups.
 | Customers who are not abusive | Customers who are not abusive may be misidentified as abusive and treated poorly by customer service agents.
 | | Customers who are not abusive may feel frustrated or anxious about being misidentified as abusive.
 | | Customers who are not abusive may lose trust in customer service agents.
 | Customers who are misidentified as abusive | Customers who are misidentified as abusive may be denied service or treated poorly by customer service agents.
 | | Customers who are misidentified as abusive may feel frustrated or anxious about being treated poorly.
 | | Customers who are misidentified as abusive may lose trust in customer service agents.
Law enforcement uses it to identify potential terrorists. | Law enforcement | Law enforcement may make false arrests or detentions of innocent people due to AI-generated toxicity labels.
 | | Law enforcement may lose trust in AI tools due to false positives and negatives.
 | | Law enforcement may be biased against certain groups of people due to AI-generated toxicity labels.
 | Potential terrorists | Potential terrorists may be unfairly targeted by law enforcement due to AI-generated misclassifications.
 | | Potential terrorists may feel frustrated or anxious about being unfairly targeted by law enforcement.
 | | Potential terrorists may be denied opportunities due to AI-generated misclassifications.
 | Victims of terrorism | Victims of terrorism may be re-traumatized by AI-generated content that glorifies or incites violence against them.
 | | Victims of terrorism may feel unsafe or threatened by AI-generated content that glorifies or incites violence against them.
 | | Victims of terrorism may feel like their voices are not being heard by society.
 | Civil rights groups | Civil rights groups may be unfairly targeted by law enforcement due to AI-generated false positives.
 | | Civil rights groups may lose trust in law enforcement due to AI-generated false positives.
 | | Civil rights groups may have to spend more time and resources fighting against AI-generated false positives.
 | People who are falsely identified as potential terrorists | People who are falsely identified as potential terrorists may be subject to increased surveillance, harassment, and discrimination.
 | | People who are falsely identified as potential terrorists may lose their jobs, housing, or other opportunities.
 | | People who are falsely identified as potential terrorists may be detained or imprisoned without due process.
Military uses it to identify potential insurgents. | Military | Military may use AI to target and harm innocent civilians.
 | | Military may lose trust from the public due to its use of AI.
 | | Military may be pressured to use AI to stay competitive with other militaries.
 | Insurgency | Insurgency may be able to evade detection by using AI-generated text that is not classified as toxic.
 | | Insurgency may be able to recruit more members by using AI-generated text that is not classified as toxic.
 | | Insurgency may be able to spread propaganda more effectively by using AI-generated text that is not classified as toxic.
 | Governments | Governments may use AI to target and surveil innocent people.
 | | Governments may use AI to justify violence against innocent people.
 | | Governments may lose control over how AI is used to target and surveil people.
 | Civilians who are wrongly identified | Civilians who are wrongly identified may be subject to violence or discrimination.
 | | Civilians who are wrongly identified may lose their freedom or property.
 | | Civilians who are wrongly identified may experience emotional distress.
 | Civilians who are not identified as insurgents | Civilians who are not identified as insurgents may be subject to violence or discrimination.
 | | Civilians who are not identified as insurgents may lose their freedom or property.
 | | Civilians who are not identified as insurgents may experience emotional distress.
Hate groups use it to identify potential recruits. | Hate groups | Hate groups may be able to recruit more people due to the AI tool.
 | | Hate groups may be able to spread their ideology more effectively due to the AI tool.
 | | Hate groups may be able to avoid detection by law enforcement due to the AI tool.
 | Potential recruits | Potential recruits may be exposed to toxic content that could lead to radicalization.
 | | Potential recruits may feel alienated and isolated from society due to their exposure to toxic content.
 | | Potential recruits may lose the opportunity to develop healthy relationships with people outside of the hate group.
 | The company | The company may be held liable for hate speech generated by its AI product.
 | | The company may face increased regulation due to its AI product’s misuse.
 | | The company may lose customers due to negative publicity.
 | People who are targeted by hate groups | People who are targeted by hate groups may be harassed or threatened by hate groups.
 | | People who are targeted by hate groups may feel anxious or fearful of being targeted again.
 | | People who are targeted by hate groups may lose their sense of safety and belonging.
 | People who are not targeted by hate groups | People who are not targeted by hate groups may be exposed to toxic content.
 | | People who are not targeted by hate groups may be harassed or threatened by hate groups.
 | | People who are not targeted by hate groups may feel anxious or fearful about the rise of hate groups.
Scammers use it to identify potential victims. | Victims of scams | Victims of scams may lose money or personal information.
 | | Victims of scams may experience emotional distress.
 | | Victims of scams may lose trust in others.
 | Scammers | Scammers may be able to target more vulnerable people with their scams.
 | | Scammers may be able to avoid detection by using AI-generated text that is less likely to be flagged as toxic.
 | | Scammers may be able to increase their profits by using AI-generated text to target more people.
 | The company | The company may be held liable for the harm caused by scammers using the AI tool.
 | | The company may lose customers due to negative publicity.
 | | The company may face increased regulatory scrutiny.
 | Victim’s financial institutions | Victim’s financial institutions may lose money due to scammers using AI to identify potential victims.
 | | Victim’s financial institutions may face increased legal liability due to scammers using AI to identify potential victims.
 | | Victim’s financial institutions may lose trust from their customers due to scammers using AI to identify potential victims.
 | Family and friends of the victim | Family and friends of the victim may be harmed by the scammer’s actions.
 | | Family and friends of the victim may be harmed by the scammer’s emotional abuse.
 | | Family and friends of the victim may be harmed by the scammer’s financial abuse.
Table S1: Farsight uses LLMs to generate use cases, stakeholders, and harms associated with AI applications based on a given prompt. This table shows the outputs of Farsight for a toxicity classifier prompt (§ C.2). By default, the Harm Envisioner displays six use cases (two from each of its three use-case categories), five stakeholders (three direct stakeholders and two indirect stakeholders) for each use case, and one harm for each stakeholder. Users can modify or delete existing outputs and ask Farsight to generate additional content (e.g., generating harms for a newly added stakeholder).
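To make the generation step concrete, the sketch below illustrates how an LLM could be asked to extend the harm tree for a newly added stakeholder. The prompt wording, the call_llm placeholder, and the example inputs are assumptions for exposition and do not reproduce Farsight's actual prompt templates.

```python
# Illustrative only: the prompt wording, the `call_llm` placeholder, and the inputs
# below are assumptions for exposition, not Farsight's actual prompt template.
def draft_harm_prompt(ai_feature: str, use_case: str, stakeholder: str, n_harms: int = 1) -> str:
    """Build a prompt asking an LLM to list plausible harms for one stakeholder."""
    return (
        f"An AI feature does the following: {ai_feature}\n"
        f"Use case: {use_case}\n"
        f"Stakeholder: {stakeholder}\n"
        f"List {n_harms} plausible harm(s) this stakeholder could experience "
        "in this use case, one harm per line."
    )


prompt = draft_harm_prompt(
    ai_feature="Classify whether a piece of text is toxic.",
    use_case="Online moderators use it to remove toxic content from social media.",
    stakeholder="Volunteer community moderators",
)
# harms = call_llm(prompt).splitlines()  # `call_llm` stands in for any LLM API client
```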

C Evaluation User Study Prompts

C.1 Pre-task: Email Summarizer (H1)

AI feature functionality: Shorten and improve a user’s email
Context: Your team is adding a new AI feature to an email client
Text prompt:

C.2 Intervention: Toxicity Classifier (H2)

AI feature functionality: Detect toxic text content
Context: Your team is developing a social media content moderation tool (e.g., API service)
Text prompt:
Prompt engineering: Participants were asked to modify the prompt to alter the model’s prediction on the test input sentence: “He is not living proof that gay conversion therapy works.” from toxic to non-toxic. We intentionally formulated the initial prompt to make the LLM predict the given test input sentence as toxic, as we exclusively included toxic examples in the prompt.

C.3 Post-task: Article Summarizer (H3)

AI feature functionality: Summarize an article in one sentence
Context: Your team is adding an AI feature to a text editor software
Text prompt:

C.4 Alternative: Math Tutor (H4)

AI feature functionality: Answer math-related questions
Context: Your team is developing a math tutoring app (e.g., mobile app)
Text prompt:
Prompt engineering: During prompt engineering, participants were asked to modify the prompt so that when a user posed non-math-related questions, the math tutor app’s response would be “Sorry, I’m not sure about the answer to this question. Try a different question.”

D Evaluation User Study Interviews

D.1 Interview 1 Questions

How did the harm envisioning tool (Farsight, Farsight Lite, or Envisioning Guide) influence your strategy for envisioning harms?
Is envisioning potential harms something you would typically do when writing prompts? If so, how do you typically do it?
What are the most relevant or important harms you have identified? Why?
What is the biggest challenge in envisioning harms?
What other tools or other types of support would you like to have to help you envision harms when prototyping LLM-powered applications?
What do you think you would do about those identified harms?
Do you have any feedback on the harm envisioning tool (Farsight, Farsight Lite, or Envisioning Guide)?

D.2 Interview 2 Rating Forms and Questions

[ For participants in CFG and CGF ]
How would you rate Farsight?
Participants could select strongly agree, agree, neutral, disagree, or strongly disagree. The default selection is neutral.
Help me envision harms.
Easy to use.
Enjoyable to use.
I would use this tool in the future.
How would you rate the helpfulness of Farsight’s different components?
Participants could select strongly agree, agree, neutral, disagree, or strongly disagree. The default selection is neutral.
AI incidents.
Use case sidebar.
AI assistance in harm envisioning.
Interactive tree visualization.
Customizability (add, edit, delete content).
[ For participants in CLG and CGL ]
How would you rate Farsight Lite?
Participants could select strongly agree, agree, neutral, disagree, or strongly disagree. The default selection is neutral.
Help me envision harms.
Easy to use.
Enjoyable to use.
I would use this tool in the future.
How would you rate the helpfulness of Farsight Lite’s different components?
Participants could select strongly agree, agree, neutral, disagree, or strongly disagree. The default selection is neutral.
AI incidents.
Use case sidebar.
AI assistance in harm envisioning.
[ For participants in CFG, CGF, CLG, and CGL ]
How would you rate Envisioning Guide?
Participants could select strongly agree, agree, neutral, disagree, or strongly disagree. The default selection is neutral.
Help me envision harms.
Easy to use.
Enjoyable to use.
I would use this tool in the future.
How would you rate the helpfulness of Envisioning Guide’s different components?
Participants could select strongly agree, agree, neutral, disagree, or strongly disagree. The default selection is neutral.
Harm modeling workflow table.
Text prompts to think about use cases, harms, and stakeholders.
Harm taxonomy.
[ For participants in CFG, CGF, CLG, and CGL ]
Overall, which tool do you prefer? Why?
Participants in CFG and CGF could choose Farsight or Envisioning Guide; participants in CLG and CGL could choose Farsight Lite or Envisioning Guide.

E Harm Rating Processing and Inter-rater Reliability

Fig. S3:
Fig. S3: We compute weighted pairwise Cohen’s kappa to measure inter-rater reliability for the ratings of harm likelihood (left) and harm severity (right). Raters rate each dimension on a 4-point Likert scale (strongly agree, agree, disagree, and strongly disagree). We numericalized these four categories as ordinal data: 1, 2, 3, 4. Within each cell, the top number is the kappa score, and the bottom number is the count of common harms between the corresponding two raters. The average kappas for likelihood and severity ratings are 0.11 and 0.10, which can be interpreted as slight agreement.
In our evaluation user study, we recruited seven raters to rate the likelihood and severity of each collected harm (section 6.2.4). In total, we collected 989 harms, 895 of which were unique (Table S2). We randomly assigned each unique harm to three raters, who rated its likelihood and severity on a 4-point Likert scale (strongly agree, agree, disagree, and strongly disagree with the statements “this harm is likely to happen” and “this harm is severe”). Raters could also choose an N/A option if they perceived that a rating was not applicable. After collecting all ratings, we dropped all N/As and numericalized the four rating categories as ordinal scores: 1, 2, 3, 4.
To measure inter-rater reliability, we computed Cohen’s kappa [112] for each pair of raters. Because the rating scores are ordinal, we used quadratically weighted kappa [32], so that a disagreement between scores 1 and 2 is penalized less than a disagreement between scores 1 and 3. The pairwise kappas and the counts of common harms are shown in Figure S3. The average kappa for likelihood ratings is 0.14, and the average kappa for severity ratings is 0.09. Both scores can be interpreted as “slight agreement” [94].
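For reference, the sketch below shows this computation for a single rater pair using scikit-learn's quadratically weighted Cohen's kappa; the rating vectors are illustrative rather than our study data.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative ordinal ratings (1-4) from two raters on the same set of harms,
# after dropping harms where either rater selected N/A.
rater_a = [1, 2, 2, 3, 4, 3, 2, 1]
rater_b = [1, 3, 2, 3, 3, 4, 2, 2]

# Quadratic weights penalize a 1-vs-3 disagreement more than a 1-vs-2 disagreement,
# which matches treating the ratings as ordinal data.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Quadratically weighted kappa: {kappa:.2f}")
```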

E.1 Example Harms Collected from Participants

Table S2:
AI Feature | Who (could be harmed) | How (the harm could happen) | Likelihood | Severity
Email summarizer (H1) | AI company | AI company’s reputation may be harmed due to inaccurate summary | 3, 3, 3 | 2, 2, 3
 | Author | The AI summary might lose context and information, critical context. It could cut out some crucial part of the email. | 4, 3, 3 | 3, 3, 3
 | Owner of the email | Rewrite could leak data from other users’ email. | 3, 1, 2 | 4, 4, 2
 | People whose names / personal information is lost in the summarization | Potential data loss. The email could be very long, and the rewrite could miss some context (e.g., full name). The rewrite can cause confusion to the receivers. | 3, 3, 3 | 2, 3, 3
 | Product team, company | Prompt injection attack. | 3, 1, 3 | 2, 1, 3
 | Receiver | Omission of vital information and context. The feature can miss some very important information. | 3, 3, 3 | 3, 2, 3
 | Receiver | The initial rewrite is a bit open ended, which might cause anxiety. | 2, 3, 2 | 1, 3, 2
 | Recipients | If users can control the temperature of the LLM, they can generate more inaccurate emails. | 1, 3, 1 | 4, 2, 1
 | Recipients due to loss of productivity and insight into their orgs metrics | The email may miss some metrics from the email. | 3, 3, 3 | 2, 2, 3
 | Scam victims | Scam emails are very common. Mass emailing a lot of people with easy automation can cause harm to victims. | 2, 3, 3 | 4, 3, 3
 | Sender | In a professional setting, if the feature is not sufficient to capture the keywords, the summary can sound unprofessional. It can harm the reputation of the sender. | 2, 3, 3 | 2, 2, 2
 | Sender | The summary is inaccurate, e.g., having the wrong time, names in the rewrite. It can harm the sender’s reputation. | 3, 3, 3 | 2, 3, 3
 | Sender, Recipient | The feature introduces incorrect information that is not in the original email. | 3, 3, 3 | 3, 2, 2
 | Sender, recipient | Quality of the communication. If the tone of the rewrite is different from the original email, it can cause harm to both sender and recipient. Miscommunication. | 3, 2, 3 | 2, 2, 3
 | Sender, recipient | The delicacy in the original email may be removed in the shorter email. The tone, emotion, and human elements may be changed. | 4, 3, 3 | 3, 2, 3
 | Sender, recipient | The summary can have a different voice / style from the email (more directly, less soft). It can be perceived as negative. | 4, 3, 4 | 3, 2, 4
 | Sender, recipient, company relationships | AI can be culturally blind. It might not follow some typical email norm in the rewrite. | 3, 3, 3 | 2, 2, 3
 | Senders, receivers | The AI feature puts words on other people’s behalf. It can have different viewpoints from the email. It can have misrepresented views. | 2, 3, 4 | 3, 3, 4
 | Senders, recipient | The summary can lose some tone, voice of the email, becoming less personal. | 3, 3, 4 | 2, 3, 3
 | Stakeholders of the decision | If the communication is about decision making and the rewrite contains error, it can harm the stakeholders of the decision. | 1, 4, 3 | 1, 4, 3
 | Team building the feature | Damage to reputation, credibility to have a feature fail this way. | 3, 2, 3 | 3, 1, 2
Toxicity classifier (H2) | Advertisers | If the AI feature fails, ads might be present next to toxic content, losing revenues. | 3, 3, 3 | 1, 2, 3
 | Bullies | Teachers use it to identify students being bullied. The AI may make the teachers be biased against the bullies. | 2, 3, 2 | 3, 3, 2
 | Company of the AI product | The company may lose users due to the presence of more moderation. | 3, 2, 1 | 4, 2, 1
 | Customer | Customer service agents use it to identify abusive customers. Customers may lose the opportunity to get help from customer service agents. | 3, 3, 3 | 3, 2, 4
 | Customer service agents | If customer service uses it to screen people, false positives: exposing customer service agents to toxic customers. | 3, 1, 3 | 3, 1, 3
 | Employees of the company | HR departments use it to screen job applicants for toxic behavior. Employees may be unfairly rejected for jobs due to AI-generated toxicity labels. | 1, 3, 3 | 1, 3, 4
 | End user (children & parents) | Mislabeled outputs will not be blocked by parental controls. | 3, 3, 3 | 3, 4, 4
 | Government | Military uses it to identify potential insurgents. Governments may use AI to target and surveil innocent people, leading to human rights violations. | 3, 2, 3 | 4, 3, 3
 | Hate group targets | Hate groups use it to identify potential recruits. Hate groups may be able to recruit more people due to the AI tool. | 3, 2, 3 | 3, 3, 3
 | Job applicants | Reddit has a karma score. Similarly, if a social media uses this feature to prioritize non-toxic content, misclassification can lose job opportunities for social media users (e.g, tweet not seen by companies). | 3, 4, 1 | 3, 3, 1
 | Online moderators | Online moderators may lose their jobs because the AI tool can already distinguish toxic vs non-toxic so there is no need for a moderator. | 2, 1, 3 | 3, 1, 4
 | Online moderators | Online moderators may miss some toxic content and post it on social media. | 3, 3, 3 | 3, 4, 3
 | People being targeted by law enforcement | Law enforcement uses it to identify potential terrorists. Potential terrorists ([different name]) may be unfairly targeted by law enforcement due to AI-generated misclassifications. | 3, 1, 3 | 3, 1, 4
 | People who are not members of hate groups | Hate groups use it to identify potential recruits. People who are not members of hate groups may be affected by increased hate group activity. | 3, 3, 4 | 3, 3, 4
 | Sender, Receivers | False positive. Flagging non-toxic content as toxic, causing harm to legitimate receivers of the information (filtered). Sender cannot post due to the filter. | 3, 3, 4 | 3, 3, 4
 | Social media users | Social media users may be exposed to toxic content that was mislabeled by the AI as non-toxic. | 3, 3, 3 | 3, 4, 3
 | Social media users | Social media users may feel frustrated or anxious about the accuracy of the AI tool. | 3, 3, 3 | 2, 4, 2
 | Social media users | Social media users who do not use this AI might be exposed to some distorted message. | 2, 1, 3 | 3, 1, 3
 | Social media users | Unfairly flagged as toxic (false positive trigger). | 3, 3, 3 | 3, 2, 2
 | Student | Teachers use it to help students identify toxic language. Students may feel frustrated or anxious about the accuracy of the AI tool. | 2, 1, 3 | 2, 1, 4
 | Victims | Scammers use it to identify potential victims. Victims of scams may lose money or personal information. | 2, 2, 1 | 4, 3, 1
Article summarizer (H3) | Anyone reading the document | If the document is sent to anyone else, the summary may not well-represent the document. The readers of the summary might mis-understand the original document. | 4, 3, 3 | 3, 2, 3
 | Business | Write marketing content, memo. Bad summary can lead to miscommunication, causing financial loss. | 4, 3, 3 | 3, 3, 4
 | Company publishing the content (news paper) | If the summary is wrong, people would lose confidence in the company. | 3, 4, 4 | 2, 4, 3
 | Economic loss | The summary may contain misinformation about some critical article, causing economic loss. | 3, 3, 1 | 3, 3, 1
 | Employees | Employees can use it to summarize documents in a workplace. Wrong summary can harm the stakeholders. | 3, 3, 3 | 4, 3, 3
 | End user | Change the meaning of the article. | 3, 4, 4 | 3, 4, 4
 | Journalists / writers | If the summary is wrong, journalists’ reputation might be harmed. | 1, 4, 3 | 1, 3, 4
 | Kids | Kids use it cheat in school for assignments. | 3, 2, 3 | 2, 2, 3
 | Readers | AI fails to pick up key information (nuances) in the article => the readers miss the points from the article, miscommunication. | 3, 3, 3 | 2, 2, 2
 | Readers, content providers, new | If there are two facts in the article that are true independently, then it’s possible that the summary combines them where the statement is no longer true. | 3, 2, 3 | 2, 2, 3
 | Readers, society | A lot of content is missed, making readers less educated overtime, causing societal loss (some events being lost in history). | 3, 2, 1 | 3, 3, 1
 | Readers, writers | It would lose a lot of context and details. | 3, 3, 3 | 2, 3, 2
 | Social groups | Representational harm. It could stereotype some social groups in the output. | 3, 3, 3 | 3, 3, 3
 | Students | Students can use this feature to summarize articles instead of reading for assignments. | 4, 3, 3 | 3, 3, 3
 | Students | Students can use this tool to cheat and lose opportunities to learn. | 4, 1, 3 | 3, 1, 4
 | Students | Students misuse this feature to cheat on school assignments. They might lose learning opportunities. | 4, 2, 3 | 3, 2, 3
 | Users | The context can be manipulated. | 2, 4, 3 | 2, 4, 3
 | Users | The summary can lose some information and context from the article. | 3, 4, 3 | 3, 4, 2
 | Users | The tool makes mistakes (e.g., wrong summary). It can harm the users of the tool by missing information in the article. | 3, 3, 3 | 3, 3, 3
 | Users | Users may lose trust if the summary is not right. Users would become more stressful as well. | 3, 3, 4 | 2, 2, 3
 | Writers | If someone uses the tool to write, it may change people’s perception about how you write. | 3, 1, 1 | 2, 1, 2
Math tutor (H4) | Company of the AI product | Engineers use it to explain to non-technical stakeholders. The company may face increased legal liability due to AI-generated explanations being inaccurate or misleading. | 2, 1, 2 | 1, 1, 2
 | Editors | Journal readers use it to get explanations of equations in inferential statistics sections. Journal editors may worry that the quality of their journal declines because the AI feature makes too many errors. Readers get frustrated over time. | 2, 1, 2 | 2, 1, 3
 | Marginalized population | People with less resources are less likely to access this tool, losing opportunity to learn. | 4, 3, 3 | 3, 3, 3
 | Minority social groups | If you ask the math tool about important social groups, it can refuse to answer => ignoring important questions, marginalizing some social groups. | 3, 3, 1 | 3, 3, 1
 | Minority social groups | There are different ways to phrase things differently based on some social groups. If the user asks a question in non-profession English or non-English, it can cause alienation. | 4, 3, 3 | 3, 3, 3
 | Non-technical stakeholder | Engineers use it to explain complex math concepts to non-technical stakeholders. Non-technical stakeholders may be misled by AI-generated explanations of complex mathematical models. | 3, 2, 3 | 3, 2, 3
 | Parents of students | Students use it to learn math. Parents may feel frustrated or anxious about their child’s education. | 2, 2, 3 | 2, 2, 3
 | Public | The public may be misled by AI-generated explanations of scientific concepts. | 3, 3, 3 | 3, 3, 3
 | Rocket engineers | Wrong answers can harm the task. | 3, 3, 3 | 3, 2, 3
 | Student | Tutors use it to help students understand math concepts. Students may lose the opportunity to learn the material in a way that is tailored to their individual needs. | 4, 2, 4 | 4, 2, 4
 | Students | If the students use this tool to cheat, they would have economic loss (not performing well on their jobs). | 2, 1, 3 | 2, 1, 3
 | Students | Incorrect math answers harm the user’s learning process. | 3, 3, 3 | 2, 3, 3
 | Students | Students use it to cheat, losing opportunities to learn. | 4, 1, 3 | 3, 1, 3
 | Students | Students use it to learn math. Students may feel like they are losing their ability to learn about math problems. | 2, 1, 2 | 2, 1, 2
 | Students | Students use it to learn math. Students who do not use this AI product may feel like they are not getting the same quality of education as their peers. | 3, 3, 3 | 2, 3, 3
 | Students | Wrong solution, less optimized solution, wrong information, students learn the wrong thing. | 3, 3, 3 | 3, 2, 3
 | Students and teachers | Students can use this app to cheat. Students do not learn. | 4, 3, 3 | 3, 2, 4
 | Students who do not use this AI product | Students use it to learn math. Students who do not use this AI product may feel left behind by their peers who do. | 3, 3, 4 | 3, 2, 4
 | Students, society | If the answer is based on some context of the math problem, it can harm minority students. | 2, 3, 1 | 3, 3, 1
 | Users | If you ask tax rates / vote count / interest rate / wages, the wrong answer can cause political harm and social harm. | 3, 2, 3 | 3, 2, 3
 | Users | Not all questions are about numbers. Sometimes the app might refuse to answer hard math questions, and uses would feel upset. | 3, 3, 3 | 2, 2, 3
Table S2: A random subset (n=84) of 895 unique harms collected from 42 participants in our evaluation user study (§ 6). This random subset includes 21 harms for each of the four AI features (the four prompts used in H1–H4). Depending on the experimental condition, each harm was either envisioned by a participant (H1–H4) or generated by Farsight and curated by a participant (H2, H4). Each harm was rated by three randomly assigned raters out of seven in terms of likelihood and severity on a 4-point Likert scale (strongly agree, agree, disagree, and strongly disagree with the statements “this harm is likely to happen” and “this harm is severe”).

Footnotes

3
Farsight Python package: https://pypi.org/project/farsight/
4
The linguists in our study had roles involved in consulting on language-based data used by AI product teams.

Supplemental Material

MP4 File - Video Preview
Video Preview
MP4 File - Video Presentation
Video Presentation
Transcript for: Video Presentation
MP4 File - Demo video
A 5-minute long video to showcase Farsight's functionalities.
CRX File - Farsight Chrome extension
Farsight Chrome extension to integrate Farsight into Google AI Studio. It is also available at https://github.com/PAIR-code/farsight/releases.
GZ File - Farsight Python Library
Farsight Python library that integrates Farsight into computational notebooks (e.g., Jupyter Notebook, JupyterLab, Google Colab, and VS Code Notebook). It is also available at https://pypi.org/project/farsight/.
ZIP File - Farsight source code
Farsight source code. It is also available at https://github.com/PAIR-code/farsight.

References

[1]
Fatih Kadir Akın. 2022. Awesome ChatGPT Prompts. https://github.com/f/awesome-chatgpt-prompts
[2]
Sanna J Ali, Angèle Christin, Andrew Smart, and Riitta Katila. 2023. Walking the Walk of AI Ethics: Organizational Challenges and the Individualization of Risk among Ethics Entrepreneurs. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency.
[3]
Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann. 2019. Software Engineering for Machine Learning: A Case Study. ICSE (2019). https://doi.org/10.1109/ICSE-SEIP.2019.00042
[4]
Saleema Amershi, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, Eric Horvitz, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, and Paul N. Bennett. 2019. Guidelines for Human-AI Interaction. CHI (2019). https://doi.org/10.1145/3290605.3300233
[5]
Bonnie Brinton Anderson, C. Brock Kirwan, Jeffrey L. Jenkins, David Eargle, Seth Howard, and Anthony Vance. 2015. How Polymorphic Warnings Reduce Habituation in the Brain: Insights from an fMRI Study. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. https://doi.org/10.1145/2702123.2702322
[6]
Anthropic. 2023. Core Views on AI Safety: When, Why, What, and How. https://www.anthropic.com/index/core-views-on-ai-safety
[7]
Apple. 2023. Human Interface Guidelines: Machine Learning. https://developer.apple.com/design/human-interface-guidelines/machine-learning
[8]
Carolyn Ashurst, Emmie Hine, Paul Sedille, and Alexis Carlier. 2022. Ai Ethics Statements: Analysis and Lessons Learnt from Neurips Broader Impact Statements. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency.
[9]
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. 2021. A General Language Assistant as a Laboratory for Alignment. arXiv 2112.00861 (2021). http://arxiv.org/abs/2112.00861
[10]
James Auger. 2013. Speculative Design: Crafting the Speculation. Digital Creativity 24 (2013).
[11]
Stephanie Ballard, Karen M. Chappell, and Kristen Kennedy. 2019. Judgment Call the Game: Using Value Sensitive Design and Design Fiction to Surface Ethical Concerns Related to Technology. In Proceedings of the 2019 on Designing Interactive Systems Conference. https://doi.org/10.1145/3322276.3323697
[12]
Solon Barocas, Anhong Guo, Ece Kamar, Jacquelyn Krones, Meredith Ringel Morris, Jennifer Wortman Vaughan, W. Duncan Wadsworth, and Hanna Wallach. 2021. Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society(AIES ’21). https://doi.org/10.1145/3461702.3462610
[13]
Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang. 2018. AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias. arXiv 1810.01943 (2018). http://arxiv.org/abs/1810.01943
[14]
Elena Beretta, Antonio Vetrò, Bruno Lepri, and Juan Carlos De Martin. 2021. Detecting Discriminatory Risk through Data Annotation Based on Bayesian Inferences. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3442188.3445940
[15]
Robert Biddle, P. C. Van Oorschot, Andrew S. Patrick, Jennifer Sobey, and Tara Whalen. 2009. Browser Interfaces and Extended Validation SSL Certificates: An Empirical Study. In Proceedings of the 2009 ACM Workshop on Cloud Computing Security. https://doi.org/10.1145/1655008.1655012
[16]
Abeba Birhane, William Isaac, Vinodkumar Prabhakaran, Mark Diaz, Madeleine Clare Elish, Iason Gabriel, and Shakir Mohamed. 2022. Power to the People? Opportunities and Challenges for Participatory AI. In Equity and Access in Algorithms, Mechanisms, and Optimization. https://doi.org/10.1145/3551624.3555290
[17]
Rainer Böhme and Stefan Köpsell. 2010. Trained to Accept?: A Field Experiment on Consent Dialogs. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/1753326.1753689
[18]
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2022. On the Opportunities and Risks of Foundation Models. arXiv 2108.07258 (2022). http://arxiv.org/abs/2108.07258
[19]
Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. 2011. D3 Data-Driven Documents. IEEE TVCG 17 (2011). https://doi.org/10.1109/TVCG.2011.185
[20]
Margarita Boyarskaya, Alexandra Olteanu, and Kate Crawford. 2020. Overcoming Failures of Imagination in AI Infused System Development and Deployment. arXiv 2011.13416 (2020). http://arxiv.org/abs/2011.13416
[21]
Virginia Braun and Victoria Clarke. 2006. Using Thematic Analysis in Psychology. Qualitative Research in Psychology 3 (2006). https://doi.org/10.1191/1478088706qp063oa
[22]
Philip AE Brey. 2012. Anticipatory Ethics for Emerging Technologies. Nanoethics 6 (2012).
[23]
José Carlos Brustoloni and Ricardo Villamarín-Salomón. 2007. Improving Security Decisions with Polymorphic and Audited Dialogs. In Proceedings of the 3rd Symposium on Usable Privacy and Security. https://doi.org/10.1145/1280680.1280691
[24]
Zana Buçinca, Chau Minh Pham, Maurice Jakesch, Marco Tulio Ribeiro, Alexandra Olteanu, and Saleema Amershi. 2023. AHA!: Facilitating AI Impact Assessment by Generating Examples of Harms. arXiv 2306.03280 (2023). http://arxiv.org/abs/2306.03280
[25]
Angel Alexander Cabrera, Will Epperson, Fred Hohman, Minsuk Kahng, Jamie Morgenstern, and Duen Horng Chau. 2019. FAIRVIS: Visual Analytics for Discovering Intersectional Bias in Machine Learning. In 2019 IEEE Conference on Visual Analytics Science and Technology (VAST). https://doi.org/10.1109/VAST47406.2019.8986948
[26]
Ana Caraban, Evangelos Karapanos, Daniel Gonçalves, and Pedro Campos. 2019. 23 Ways to Nudge: A Review of Technology-Mediated Nudging in Human-Computer Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3290605.3300733
[27]
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2023. Quantifying Memorization Across Neural Language Models. arXiv 2202.07646 (2023). http://arxiv.org/abs/2202.07646
[28]
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting Training Data from Large Language Models. arXiv 2012.07805 (2021). http://arxiv.org/abs/2012.07805
[29]
Shruthi Sai Chivukula, Ziqing Li, Anne C Pivonka, Jingning Chen, and Colin M Gray. 2021. Surveying the Landscape of Ethics-Focused Design Methods. arXiv preprint arXiv:2102.08909 (2021).
[30]
Alexandra Chouldechova. 2017. Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data 5 (2017).
[31]
Andrew R Chow and Billy Perrigo. 2023. The AI Arms Race Is Changing Everything. https://time.com/6255952/ai-impact-chatgpt-microsoft-google/
[32]
Jacob Cohen. 1968. Weighted Kappa: Nominal Scale Agreement Provision for Scaled Disagreement or Partial Credit. Psychological Bulletin 70 (1968). https://doi.org/10.1037/h0026256
[33]
Jacob Cohen. 2013. Statistical Power Analysis for the Behavioral Sciences (2nd ed.).
[34]
Ned Cooper, Tiffanie Horne, Gillian R Hayes, Courtney Heldreth, Michal Lahav, Jess Holbrook, and Lauren Wilcox. 2022. A Systematic Review and Thematic Analysis of Community-Collaborative Approaches to Computing Research. In CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3491102.3517716
[35]
Aida Mostafazadeh Davani, Mark Diaz, Dylan Baker, and Vinodkumar Prabhakaran. 2023. Disentangling Disagreements on Offensiveness: A Cross-Cultural Study. In The 61st Annual Meeting of the Association for Computational Linguistics.
[36]
Fernando Delgado, Stephen Yang, Michael Madaio, and Qian Yang. 2021. Stakeholder Participation in AI: Beyond "Add Diverse Stakeholders and Stir". arXiv 2111.01122 (2021). http://arxiv.org/abs/2111.01122
[37]
Fernando Delgado, Stephen Yang, Michael Madaio, and Qian Yang. 2023. The Participatory Turn in AI Design: Theoretical Foundations and the Current State of Practice. In Equity and Access in Algorithms, Mechanisms, and Optimization. https://doi.org/10.1145/3617694.3623261
[38]
Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. 2023. MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots. arXiv 2307.08715 (2023). http://arxiv.org/abs/2307.08715
[39]
Emily Denton, Mark Díaz, Ian Kivlichan, Vinodkumar Prabhakaran, and Rachel Rosen. 2021. Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation. arXiv 2112.04554 (2021). http://arxiv.org/abs/2112.04554
[40]
Deque. 2023. Axe DevTools: Digital Accessibility Testing Tools Dev Teams Love. https://www.deque.com/axe/devtools/
[41]
Erik Derner and Kristina Batistič. 2023. Beyond the Safeguards: Exploring the Security Risks of ChatGPT. arXiv 2305.08005 (2023). http://arxiv.org/abs/2305.08005
[42]
Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in ChatGPT: Analyzing Persona-assigned Language Models. arXiv 2304.05335 (2023). http://arxiv.org/abs/2304.05335
[43]
Mark Díaz, Ian Kivlichan, Rachel Rosen, Dylan Baker, Razvan Amironesei, Vinodkumar Prabhakaran, and Emily Denton. 2022. CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation. In 2022 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3531146.3534647
[44]
Marc-Antoine Dilhac, Christophe Abrassart, and Nathalie Voarino. 2018. Report of the Montréal Declaration for a Responsible Development of Artificial Intelligence. (2018). https://doi.org/1866/27795
[45]
Kimberly Do, Rock Yuren Pang, Jiachen Jiang, and Katharina Reinecke. 2023. “That’s Important, but...”: How Computer Science Researchers Anticipate Unintended Consequences of Their Research Innovations. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
[46]
Doteveryone. 2019. Consequence Scanning – an Agile Practice for Responsible Innovators. https://doteveryone.org.uk/project/consequence-scanning/
[47]
Dovetail. 2023. Dovetail: All Your Customer Insights in One Place. https://dovetail.com/
[48]
Olive Jean Dunn. 1961. Multiple Comparisons among Means. J. Amer. Statist. Assoc. 56 (1961). https://doi.org/10.1080/01621459.1961.10482090
[49]
Anthony Dunne and Fiona Raby. 2013. Speculative Everything: Design, Fiction, and Social Dreaming.
[50]
Serge Egelman, Lorrie Faith Cranor, and Jason Hong. 2008. You’ve Been Warned: An Empirical Study of the Effectiveness of Web Browser Phishing Warnings. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/1357054.1357219
[51]
Serge Egelman and Stuart Schechter. 2013. The Importance of Being Earnest [in Security Warnings]. In Financial Cryptography and Data Security: 17th International Conference, FC 2013, Okinawa, Japan, April 1-5, 2013, Revised Selected Papers 17.
[52]
Upol Ehsan, Q Vera Liao, Samir Passi, Mark O Riedl, and Hal Daume III. 2022. Seamful XAI: Operationalizing Seamful Design in Explainable AI. arXiv preprint arXiv:2211.06753 (2022).
[53]
Adrienne Porter Felt, Alex Ainslie, Robert W. Reeder, Sunny Consolvo, Somas Thyagaraja, Alan Bettes, Helen Harris, and Jeff Grimes. 2015. Improving SSL Warnings: Comprehension and Adherence. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. https://doi.org/10.1145/2702123.2702442
[54]
Adrienne Porter Felt, Robert W. Reeder, Hazim Almuhimedi, and Sunny Consolvo. 2014. Experimenting at Scale with Google Chrome’s SSL Warning. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/2556288.2557292
[55]
Alexander J. Fiannaca, Chinmay Kulkarni, Carrie J Cai, and Michael Terry. 2023. Programming without a Programming Language: Challenges and Opportunities for Designing Developer Tools for Prompt Programming. In Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3544549.3585737
[56]
Casey Fiesler, Natalie Garrett, and Nathan Beard. 2020. What Do We Teach When We Teach Tech Ethics?: A Syllabi Analysis. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education. https://doi.org/10.1145/3328778.3366825
[57]
Batya Friedman. 1996. Value-Sensitive Design. interactions 3 (1996).
[58]
Batya Friedman and David Hendry. 2012. The Envisioning Cards: A Toolkit for Catalyzing Humanistic and Technical Imaginations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/2207676.2208562
[59]
Batya Friedman, David G Hendry, Alan Borning, 2017. A Survey of Value Sensitive Design Methods. Foundations and Trends® in Human–Computer Interaction 11 (2017).
[60]
Batya Friedman, Peter Kahn, and Alan Borning. 2002. Value Sensitive Design: Theory and Methods. University of Washington technical report 2 (2002).
[61]
Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Scott Johnston, Andy Jones, Nicholas Joseph, Jackson Kernian, Shauna Kravec, Ben Mann, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Tom Brown, Jared Kaplan, Sam McCandlish, Christopher Olah, Dario Amodei, and Jack Clark. 2022. Predictability and Surprise in Large Generative Models. In 2022 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3531146.3533229
[62]
Natalie Garrett, Nathan Beard, and Casey Fiesler. 2020. More Than "If Time Allows": The Role of Ethics in AI Education. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. https://doi.org/10.1145/3375627.3375868
[63]
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2020. Datasheets for Datasets. arXiv:1803.09010 [cs] (2020). http://arxiv.org/abs/1803.09010
[64]
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. arXiv 2009.11462 (2020). http://arxiv.org/abs/2009.11462
[65]
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William Isaac, John Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey Irving. 2022. Improving Alignment of Dialogue Agents via Targeted Human Judgements. arXiv 2209.14375 (2022). http://arxiv.org/abs/2209.14375
[66]
Nathaniel Good, Rachna Dhamija, Jens Grossklags, David Thaw, Steven Aronowitz, Deirdre Mulligan, and Joseph Konstan. 2005. Stopping Spyware at the Gate: A User Study of Privacy, Notice and Spyware. In Proceedings of the 2005 Symposium on Usable Privacy and Security - SOUPS ’05. https://doi.org/10.1145/1073001.1073006
[67]
Nathaniel S. Good, Jens Grossklags, Deirdre K. Mulligan, and Joseph A. Konstan. 2007. Noticing Notice: A Large-Scale Experiment on the Timing of Software License Agreements. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/1240624.1240720
[68]
Google. 2015. Lit: Simple Fast Web Components. https://lit.dev/
[69]
Google. 2016. Lighthouse. https://github.com/GoogleChrome/lighthouse
[70]
Google. 2023. Google Ai Studio: Prototype with Generative AI. https://aistudio.google.com/app
[71]
Google. 2023. PaLM API: Safety Guidance. https://developers.generativeai.google/guide/safety_guidance
[72]
Mitchell L. Gordon, Michelle S. Lam, Joon Sung Park, Kayur Patel, Jeff Hancock, Tatsunori Hashimoto, and Michael S. Bernstein. 2022. Jury Learning: Integrating Dissenting Voices into Machine Learning Models. In CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3491102.3502004
[73]
Grammarly. 2023. Grammarly: Free Writing AI Assistance. https://www.grammarly.com/
[74]
Hans W. A. Hanley and Zakir Durumeric. 2023. Machine-Made Media: Monitoring the Mobilization of Machine-Generated Articles on Misinformation and Mainstream News Websites. arXiv 2305.09820 (2023). http://arxiv.org/abs/2305.09820
[75]
Christina Harrington, Sheena Erete, and Anne Marie Piper. 2019. Deconstructing Community-Based Collaborative Design: Towards More Equitable Participatory Design Engagements. Proceedings of the ACM on Human-Computer Interaction 3 (2019). https://doi.org/10.1145/3359318
[76]
Brent Hecht, Lauren Wilcox, Jeffrey P Bigham, Johannes Schöning, Ehsan Hoque, Jason Ernst, Yonatan Bisk, Luigi De Russis, Lana Yarosh, Bushra Anjum, 2021. It’s Time to Do Something: Mitigating the Negative Impacts of Computing through a Change to the Peer Review Process. arXiv preprint arXiv:2112.09544 (2021).
[77]
Fred Hohman, Haekyu Park, Caleb Robinson, and Duen Horng Chau. 2019. SUMMIT: Scaling Deep Learning Interpretability by Visualizing Activation and Attribution Summarizations. IEEE TVCG (2019). https://doi.org/10.1109/TVCG.2019.2934659
[78]
Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé, Miro Dudik, and Hanna Wallach. 2019. Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need?. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3290605.3300830
[79]
Matthew K. Hong, Adam Fourney, Derek DeBellis, and Saleema Amershi. 2021. Planning for Natural Language Failures with the AI Playbook. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3411764.3445735
[80]
Sungsoo Ray Hong, Jessica Hullman, and Enrico Bertini. 2020. Human Factors in Model Interpretability: Industry Practices, Challenges, and Needs. Proceedings of the ACM on Human-Computer Interaction 4 (2020). https://doi.org/10.1145/3392878
[81]
The White House. 2022. Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People. https://www.whitehouse.gov/ostp/ai-bill-of-rights
[82]
Marilyn Hravnak, Tiffany Pellathy, Lujie Chen, Artur Dubrawski, Anthony Wertz, Gilles Clermont, and Michael R. Pinsky. 2018. A Call to Alarms: Current State and Future Directions in the Battle against Alarm Fatigue. Journal of Electrocardiology 51 (2018). https://doi.org/10.1016/j.jelectrocard.2018.07.024
[83]
Organizers Of Queer in AI, Anaelia Ovalle, Arjun Subramonian, Ashwin Singh, Claas Voelcker, Danica J. Sutherland, Davide Locatelli, Eva Breznik, Filip Klubička, Hang Yuan, Hetvi J, Huan Zhang, Jaidev Shriram, Kruno Lehman, Luca Soldaini, Maarten Sap, Marc Peter Deisenroth, Maria Leonor Pacheco, Maria Ryskina, Martin Mundt, Milind Agarwal, Nyx McLean, Pan Xu, A. Pranav, Raj Korpan, Ruchira Ray, Sarah Mathew, Sarthak Arora, St John, Tanvi Anand, Vishakha Agrawal, William Agnew, Yanan Long, Zijie J. Wang, Zeerak Talat, Avijit Ghosh, Nathaniel Dennler, Michael Noseworthy, Sharvani Jha, Emi Baylor, Aditya Joshi, Natalia Y. Bilenko, Andrew McNamara, Raphael Gontijo-Lopes, Alex Markham, Evyn Dǒng, Jackie Kay, Manu Saraswat, Nikhil Vytla, and Luke Stark. 2023. Queer In AI: A Case Study in Community-Led Participatory AI. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency. http://arxiv.org/abs/2303.16972
[84]
Ellen Jiang, Kristen Olson, Edwin Toh, Alejandra Molina, Aaron Donsbach, Michael Terry, and Carrie J Cai. 2022. PromptMaker: Prompt-based Prototyping with Large Language Models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. https://doi.org/10.1145/3491101.3503564
[85]
Ben Kaiser, Jerry Wei, Eli Lucherini, Kevin Lee, J. Nathan Matias, and Jonathan Mayer. 2021. Adapting Security Warnings to Counter Online Disinformation. In 30th USENIX Security Symposium (USENIX Security 21). https://www.usenix.org/conference/usenixsecurity21/presentation/kaiser
[86]
Harmanpreet Kaur, Eytan Adar, Eric Gilbert, and Cliff Lampe. 2022. Sensible AI: Re-imagining Interpretability and Explainability Using Sensemaking Theory. In 2022 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3531146.3533135
[87]
Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, and Seong Joon Oh. 2023. ProPILE: Probing Privacy Leakage in Large Language Models. arXiv 2307.01881 (2023). http://arxiv.org/abs/2307.01881
[88]
Joel Kiskola, Thomas Olsson, Aleksi H. Syrjämäki, Anna Rantasila, Mirja Ilves, Poika Isokoski, and Veikko Surakka. 2022. Online Survey on Novel Designs for Supporting Self-Reflection and Emotion Regulation in Online News Commenting. In Proceedings of the 25th International Academic Mindtrek Conference. https://doi.org/10.1145/3569219.3569411
[89]
Shamika Klassen and Casey Fiesler. 2022. "Run Wild a Little With Your Imagination": Ethical Speculation in Computing Education with Black Mirror. In Proceedings of the 53rd ACM Technical Symposium on Computer Science Education - Volume 1.
[90]
Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. 2016. Inherent Trade-Offs in the Fair Determination of Risk Scores. arXiv 1609.05807 (2016). http://arxiv.org/abs/1609.05807
[91]
Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian E Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica B Hamrick, Jason Grout, and Sylvain Corlay. 2016. Jupyter Notebooks – a Publishing Format for Reproducible Computational Workflows. (2016). https://doi.org/10.3233/978-1-61499-649-1-87
[92]
Kenneth R Koedinger, Jihee Kim, Julianna Zhuxin Jia, Elizabeth A McLaughlin, and Norman L Bier. 2015. Learning Is Not a Spectator Sport: Doing Is Better than Watching for Learning from a MOOC. In Proceedings of the Second (2015) ACM Conference on Learning@ Scale.
[93]
Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. 2021. GeDi: Generative Discriminator Guided Sequence Generation. In Findings of the Association for Computational Linguistics: EMNLP 2021. https://doi.org/10.18653/v1/2021.findings-emnlp.424
[94]
J. Richard Landis and Gary G. Koch. 1977. The Measurement of Observer Agreement for Categorical Data. Biometrics 33 (1977). https://doi.org/10.2307/2529310
[95]
Michelle Seng Ah Lee and Jatinder Singh. 2020. The Landscape and Gaps in Open Source Fairness Toolkits. SSRN Electronic Journal (2020). https://doi.org/10.2139/ssrn.3695002
[96]
Fei-Fei Li and John Etchemendy. 2022. Annual Report 2022: Stanford Institute for Human-centered Artificial Intelligence. https://hai-annual-report.stanford.edu
[97]
Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. 2023. Multi-Step Jailbreaking Privacy Attacks on ChatGPT. arXiv 2304.05197 (2023). http://arxiv.org/abs/2304.05197
[98]
Q. Vera Liao and Jennifer Wortman Vaughan. 2023. AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap. arXiv 2306.01941 (2023). http://arxiv.org/abs/2306.01941
[99]
David Liu, Priyanka Nanayakkara, Sarah Ariyan Sakha, Grace Abuhamad, Su Lin Blodgett, Nicholas Diakopoulos, Jessica R. Hullman, and Tina Eliassi-Rad. 2022. Examining Responsibility and Deliberation in AI Impact Statements and Ethics Reviews. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society. https://doi.org/10.1145/3514094.3534155
[100]
Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. 2023. Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models’ Alignment. arXiv 2308.05374 (2023). http://arxiv.org/abs/2308.05374
[101]
Chiara Longoni, Andrey Fradkin, Luca Cian, and Gordon Pennycook. 2022. News from Generative Artificial Intelligence Is Believed Less. In 2022 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3531146.3533077
[102]
Scott M. Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems(NIPS’17). https://doi.org/10.48550/arXiv.1705.07874
[103]
Michael Madaio, Lisa Egede, Hariharan Subramonyam, Jennifer Wortman Vaughan, and Hanna Wallach. 2022. Assessing the Fairness of AI Systems: AI Practitioners’ Processes, Challenges, and Needs for Support. Proceedings of the ACM on Human-Computer Interaction 6 (2022). https://doi.org/10.1145/3512899
[104]
Michael A. Madaio, Luke Stark, Jennifer Wortman Vaughan, and Hanna Wallach. 2020. Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3313831.3376445
[105]
Tambiama André Madiega. 2021. Artificial Intelligence Act. European Parliament: European Parliamentary Research Service (2021). https://artificialintelligenceact.eu
[106]
Kamil Malinka, Martin Perešíni, Anton Firc, Ondřej Hujňák, and Filip Januš. 2023. On the Educational Impact of ChatGPT: Is Artificial Intelligence Ready to Obtain a University Degree? In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1. arXiv 2303.11146. https://doi.org/10.1145/3587102.3588827
[107]
H. B. Mann and D. R. Whitney. 1947. On a Test of Whether One of Two Random Variables Is Stochastically Larger than the Other. The Annals of Mathematical Statistics 18 (1947). https://doi.org/10.1214/aoms/1177730491
[108]
Sandra Matz, Jake Teeny, Sumer Sumeet Vaid, Gabriella M. Harari, and Moran Cerf. 2023. The Potential of Generative AI for Personalized Persuasion at Scale. Preprint. PsyArXiv. https://doi.org/10.31234/osf.io/rn97c
[109]
Max-Emanuel Maurer, Alexander De Luca, and Sylvia Kempe. 2011. Using Data Type Based Security Alert Dialogs to Raise Online Security Awareness. In Proceedings of the Seventh Symposium on Usable Privacy and Security. https://doi.org/10.1145/2078827.2078830
[110]
Kenneth O. McGraw and S. P. Wong. 1992. A Common Language Effect Size Statistic. Psychological Bulletin 111 (1992). https://doi.org/10.1037/0033-2909.111.2.361
[111]
Sean McGregor. 2020. Preventing Repeated Real World AI Failures by Cataloging Incidents: The AI Incident Database. arXiv 2011.08512 (2020). http://arxiv.org/abs/2011.08512
[112]
Mary L. McHugh. 2012. Interrater Reliability: The Kappa Statistic. Biochemia Medica (2012). https://doi.org/10.11613/BM.2012.031
[113]
Leland McInnes, John Healy, and James Melville. 2020. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 1802.03426 (2020). http://arxiv.org/abs/1802.03426
[114]
MDN. 2011. WebGL: 2D and 3D Graphics for the Web - Web APIs. https://developer.mozilla.org/en-US/docs/Web/API/WebGL_API
[115]
MDN. 2021. Web Components - Web APIs. https://developer.mozilla.org/en-US/docs/Web/API/Web_components
[116]
Sharan B. Merriam. 2002. Introduction to Qualitative Research. Qualitative Research in Practice: Examples for Discussion and Analysis 1 (2002).
[117]
Meta. 2023. Llama 2: Responsible Use Guide. https://ai.meta.com/llama-project/responsible-use-guide
[118]
Milagros Miceli, Martin Schuessler, and Tianling Yang. 2020. Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision. Proceedings of the ACM on Human-Computer Interaction 4 (2020). https://doi.org/10.1145/3415186
[119]
Microsoft. 2020. Responsible AI Toolbox. Microsoft. https://github.com/microsoft/responsible-ai-toolbox
[120]
Microsoft. 2022. Harms Modeling - Azure Application Architecture Guide. https://learn.microsoft.com/en-us/azure/architecture/guide/responsible-innovation/harms-modeling/
[121]
Microsoft. 2022. Microsoft Responsible AI Impact Assessment Guide. (2022). https://aka.ms/RAIImpactAssessmentGuidePDF
[122]
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3287560.3287596
[123]
Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. 2022. Dealing with Disagreements: Looking beyond the Majority Vote in Subjective Annotations. Transactions of the Association for Computational Linguistics 10 (2022). https://doi.org/10.1162/tacl_a_00449
[124]
David Munechika, Zijie J. Wang, Jack Reidy, Josh Rubin, Krishna Gade, Krishnaram Kenthapadi, and Duen Horng Chau. 2022. Visual Auditor: Interactive Visualization for Detection and Summarization of Model Biases. In VIS. https://doi.org/10.1109/VIS54862.2022.00018
[125]
Alexios Mylonas, Anastasia Kastania, and Dimitris Gritzalis. 2013. Delegate the Smartphone User? Security Awareness in Smartphone Platforms. Computers & Security 34 (2013). https://doi.org/10.1016/j.cose.2012.11.004
[126]
Priyanka Nanayakkara, Nicholas Diakopoulos, and Jessica Hullman. 2020. Anticipatory Ethics and the Role of Uncertainty. arXiv 2011.13170 (2020). http://arxiv.org/abs/2011.13170
[127]
Priyanka Nanayakkara, Jessica Hullman, and Nicholas Diakopoulos. 2021. Unpacking the Expressed Consequences of AI Research in Broader Impact Statements. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society.
[128]
Harsha Nori, Samuel Jenkins, Paul Koch, and Rich Caruana. 2019. InterpretML: A Unified Framework for Machine Learning Interpretability. arXiv 1909.09223 (2019). http://arxiv.org/abs/1909.09223
[129]
Donald A. Norman and Stephen W. Draper. 1986. User Centered System Design: New Perspectives on Human-Computer Interaction.
[130]
Organizers of QueerInAI, Nathan Dennler, Anaelia Ovalle, Ashwin Singh, Luca Soldaini, Arjun Subramonian, Huy Tu, William Agnew, Avijit Ghosh, Kyra Yee, Irene Font Peradejordi, Zeerak Talat, Mayra Russo, and Jess de Jesus de Pinho Pinhal. 2023. Bound by the Bounty: Collaboratively Shaping Evaluation Processes for Queer AI Harms. arXiv 2307.10223 (2023). http://arxiv.org/abs/2307.10223
[131]
Cathy O’Neil and Hanna Gunn. 2020. Near-Term Artificial Intelligence and the Ethical Matrix. https://doi.org/10.1093/oso/9780190905033.003.0009
[132]
OpenAI. 2023. GPT-4 Technical Report. arXiv 2303.08774 (2023). http://arxiv.org/abs/2303.08774
[133]
OpenAI. 2023. OpenAI Playground. https://platform.openai.com/playground
[134]
Google PAIR. 2019. People + AI Guidebook. https://pair.withgoogle.com/guidebook
[135]
Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang. 2023. On the Risk of Misinformation Pollution with Large Language Models. arXiv 2305.13661 (2023). http://arxiv.org/abs/2305.13661
[136]
Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent Disagreements in Human Textual Inferences. Transactions of the Association for Computational Linguistics 7 (2019). https://doi.org/10.1162/tacl_a_00293
[137]
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red Teaming Language Models with Language Models. arXiv 2202.03286 (2022). http://arxiv.org/abs/2202.03286
[138]
Vinodkumar Prabhakaran, Christopher Homan, Lora Aroyo, Alicia Parrish, Alex Taylor, Mark Díaz, and Ding Wang. 2023. A Framework to Assess (Dis)Agreement Among Diverse Rater Groups. arXiv 2311.05074 (2023). http://arxiv.org/abs/2311.05074
[139]
Thomas W. Price, Joseph Jay Williams, Jaemarie Solyst, and Samiha Marwan. 2020. Engaging Students with Instructor Solutions in Online Programming Homework. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3313831.3376857
[140]
Carina EA Prunkl, Carolyn Ashurst, Markus Anderljung, Helena Webb, Jan Leike, and Allan Dafoe. 2021. Institutionalizing Ethics in AI through Broader Impact Requirements. Nature Machine Intelligence 3 (2021).
[141]
Inioluwa Deborah Raji, Morgan Klaus Scheuerman, and Razvan Amironesei. 2021. You Can’t Sit With Us: Exclusionary Pedagogy in AI Ethics Education. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3442188.3445914
[142]
Inioluwa Deborah Raji, Andrew Smart, Rebecca N. White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes. 2020. Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3351095.3372873
[143]
Bogdana Rakova, Jingying Yang, Henriette Cramer, and Rumman Chowdhury. 2021. Where Responsible AI Meets Reality: Practitioner Perspectives on Enablers for Shifting Organizational Practices. Proceedings of the ACM on Human-Computer Interaction 5 (2021). https://doi.org/10.1145/3449081
[144]
Charvi Rastogi, Marco Tulio Ribeiro, Nicholas King, Harsha Nori, and Saleema Amershi. 2023. Supporting Human-AI Collaboration in Auditing LLMs with LLMs. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society. https://doi.org/10.1145/3600211.3604712
[145]
Robert W. Reeder, Adrienne Porter Felt, Sunny Consolvo, Nathan Malkin, Christopher Thompson, and Serge Egelman. 2018. An Experience Sampling Study of User Reactions to Browser Warnings in the Field. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3173574.3174086
[146]
E.M. Reingold and J.S. Tilford. 1981. Tidier Drawings of Trees. IEEE Transactions on Software Engineering SE-7 (1981). https://doi.org/10.1109/TSE.1981.234519
[147]
Benjamin Reinheimer, Lukas Aldag, Peter Mayer, Mattia Mossano, Reyhan Duezguen, Bettina Lofthouse, Tatiana von Landesberger, and Melanie Volkamer. 2020. An Investigation of Phishing Awareness and Education over Time: When and How to Best Remind Users. In Sixteenth Symposium on Usable Privacy and Security (SOUPS 2020). https://www.usenix.org/conference/soups2020/presentation/reinheimer
[148]
Karen Renaud and Marc Dupuis. 2019. Cyber Security Fear Appeals: Unexpectedly Complicated. In Proceedings of the New Security Paradigms Workshop. https://doi.org/10.1145/3368860.3368864
[149]
Marco Tulio Ribeiro and Scott Lundberg. 2022. Adaptive Testing and Debugging of NLP Models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/2022.acl-long.230
[150]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2939672.2939778
[151]
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.442
[152]
Shalaleh Rismani, Renee Shelby, Andrew Smart, Edgar Jatho, Joshua Kroll, Ajung Moon, and Negar Rostamzadeh. 2023. From Plane Crashes to Algorithmic Harm: Applicability of Safety Engineering Frameworks for Responsible ML. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3544548.3581407
[153]
Samantha Robertson, Zijie J. Wang, Dominik Moritz, Mary Beth Kery, and Fred Hohman. 2023. Angler: Helping Machine Translation Practitioners Prioritize Model Improvements. In CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3544548.3580790
[154]
Sayak Saha Roy, Krishna Vamsi Naragam, and Shirin Nilizadeh. 2023. Generating Phishing Attacks Using ChatGPT. arXiv 2305.05133 (2023). http://arxiv.org/abs/2305.05133
[155]
Pedro Saleiro, Benedict Kuester, Loren Hinkson, Jesse London, Abby Stevens, Ari Anisfeld, Kit T. Rodolfa, and Rayid Ghani. 2019. Aequitas: A Bias and Fairness Audit Toolkit. arXiv 1811.05577 (2019). http://arxiv.org/abs/1811.05577
[156]
Juan Pablo Sarmiento and Alyssa Friend Wise. 2022. Participatory and Co-Design of Learning Analytics: An Initial Review of the Literature. In LAK22: 12th International Learning Analytics and Knowledge Conference. https://doi.org/10.1145/3506860.3506910
[157]
Shlomo S. Sawilowsky. 2009. New Effect Size Rules of Thumb. Journal of Modern Applied Statistical Methods 8 (2009). https://doi.org/10.22237/jmasm/1257035100
[158]
Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. arXiv 2103.00453 (2021). http://arxiv.org/abs/2103.00453
[159]
Daniel Schiff, Bogdana Rakova, Aladdin Ayesh, Anat Fanti, and Michael Lennon. 2020. Principles to Practices for Responsible AI: Closing the Gap. arXiv 2006.04707 (2020). http://arxiv.org/abs/2006.04707
[160]
Christoph Schneider, Markus Weinmann, and Jan Vom Brocke. 2018. Digital Nudging: Guiding Online User Choices through Interface Design. Commun. ACM 61 (2018). https://doi.org/10.1145/3213765
[161]
Howard J. Seltman. 2012. Experimental Design and Analysis.
[162]
S. S. Shapiro and M. B. Wilk. 1965. An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52 (1965). https://doi.org/10.1093/biomet/52.3-4.591
[163]
Filipo Sharevski, Amy Devine, Peter Jachim, and Emma Pieroni. 2022. Meaningful Context, a Red Flag, or Both? Preferences for Enhanced Misinformation Warnings Among US Twitter Users. In 2022 European Symposium on Usable Security. https://doi.org/10.1145/3549015.3555671
[164]
Renee Shelby, Shalaleh Rismani, Kathryn Henne, Ajung Moon, Negar Rostamzadeh, Paul Nicholas, N’Mah Yilla-Akbari, Jess Gallegos, Andrew Smart, Emilio Garcia, and Gurleen Virk. 2023. Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm Reduction. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society. https://doi.org/10.1145/3600211.3604673
[165]
Hong Shen, Wesley H. Deng, Aditi Chattopadhyay, Zhiwei Steven Wu, Xu Wang, and Haiyi Zhu. 2021. Value Cards: An Educational Toolkit for Teaching Social Impacts of Machine Learning through Deliberation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3442188.3445971
[166]
Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, and Allan Dafoe. 2023. Model Evaluation for Extreme Risks. arXiv 2305.15324 (2023). http://arxiv.org/abs/2305.15324
[167]
Kate Sim, Andrew Brown, and Amelia Hassoun. 2021. Thinking through and Writing about Research Ethics beyond "Broader Impact". arXiv 2104.08205 (2021). http://arxiv.org/abs/2104.08205
[168]
Gabriel Simmons. 2023. Moral Mimicry: Large Language Models Produce Moral Rationalizations Tailored to Political Identity. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). https://doi.org/10.18653/v1/2023.acl-srw.40
[169]
Guy Simon. 2020. OpenWeb Tests the Impact of “Nudges” in Online Discussions. OpenWeb Blog (2020).
[170]
Daniel Smilkov, Nikhil Thorat, Yannick Assogba, Ann Yuan, Nick Kreeger, Ping Yu, Kangyi Zhang, Shanqing Cai, Eric Nielsen, David Soergel, Stan Bileschi, Michael Terry, Charles Nicholson, Sandeep N. Gupta, Sarah Sirajuddin, D. Sculley, Rajat Monga, Greg Corrado, Fernanda B. Viégas, and Martin Wattenberg. 2019. TensorFlow.js: Machine Learning for the Web and Beyond. arXiv 1901.05350 (2019). https://arxiv.org/abs/1901.05350
[171]
Jessie J. Smith, Saleema Amershi, Solon Barocas, Hanna Wallach, and Jennifer Wortman Vaughan. 2022. REAL ML: Recognizing, Exploring, and Articulating Limitations of Machine Learning Research. In 2022 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3531146.3533122
[172]
Jessie J. Smith, Blakeley H. Payne, Shamika Klassen, Dylan Thomas Doyle, and Casey Fiesler. 2023. Incorporating Ethics in Computing Courses: Barriers, Support, and Perspectives from Educators. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. https://doi.org/10.1145/3545945.3569855
[173]
Irene Solaiman and Christy Dennison. 2021. Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets. In Advances in Neural Information Processing Systems, Vol. 34. https://proceedings.neurips.cc/paper_files/paper/2021/file/2e855f9489df0712b4bd8ea9e2848c5a-Paper.pdf
[174]
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. arXiv 2206.04615 (2022). http://arxiv.org/abs/2206.04615
[175]
Miriam Sturdee, Joseph Lindley, Conor Linehan, Chris Elsden, Neha Kumar, Tawanna Dillahunt, Regan Mandryk, and John Vines. 2021. Consequences, Schmonsequences! Considering the Future as Part of Publication and Peer Review in Computing Research. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems.
[176]
Harini Suresh and John Guttag. 2021. A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle. In Equity and Access in Algorithms, Mechanisms, and Optimization. https://doi.org/10.1145/3465416.3483305
[177]
Elham Tabassi. 2023. AI Risk Management Framework: AI RMF (1.0). Technical Report. National Institute of Standards and Technology. https://doi.org/10.6028/NIST.AI.100-1
[178]
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, et al. 2023. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2312.11805 (2023). https://arxiv.org/abs/2312.11805
[179]
Ian Tenney, James Wexler, Jasmijn Bastings, Tolga Bolukbasi, Andy Coenen, Sebastian Gehrmann, Ellen Jiang, Mahima Pushkarna, Carey Radebaugh, Emily Reif, and Ann Yuan. 2020. The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models. In EMNLP Demo. https://doi.org/10.18653/v1/2020.emnlp-demos.15
[180]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2307.09288 (2023). https://arxiv.org/abs/2307.09288
[181]
Rama Adithya Varanasi and Nitesh Goyal. 2023. “It Is Currently Hodgepodge”: Examining AI/ML Practitioners’ Challenges during Co-Production of Responsible AI Values. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
[182]
David Vilar, Jia Xu, Luis Fernando D’Haro, and Hermann Ney. 2006. Error Analysis of Statistical Machine Translation Output. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06). http://www.lrec-conf.org/proceedings/lrec2006/pdf/413_pdf.pdf
[183]
Qiaosi Wang, Michael Madaio, Shaun Kane, Shivani Kapania, Michael Terry, and Lauren Wilcox. 2023. Designing Responsible AI: Adaptations of UX Practice to Meet Responsible AI Challenges. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3544548.3581278
[184]
Zijie J. Wang, Aishwarya Chakravarthy, David Munechika, and Duen Horng Chau. 2024. Wordflow: Social Prompt Engineering for Large Language Models. arXiv 2401.14447 (2024). http://arxiv.org/abs/2401.14447
[185]
Zijie J. Wang, Katie Dai, and W. Keith Edwards. 2022. StickyLand: Breaking the Linear Presentation of Computational Notebooks. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. https://doi.org/10.1145/3491101.3519653
[186]
Zijie J. Wang, Fred Hohman, and Duen Horng Chau. 2023. WizMap: Scalable Interactive Visualization for Exploring Large Machine Learning Embeddings. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). https://aclanthology.org/2023.acl-demo.50
[187]
Zijie J. Wang, Alex Kale, Harsha Nori, Peter Stella, Mark E. Nunnally, Duen Horng Chau, Mihaela Vorvoreanu, Jennifer Wortman Vaughan, and Rich Caruana. 2022. Interpretability, Then What? Editing Machine Learning Models to Reflect Human Knowledge and Values. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’22). https://doi.org/10.1145/3534678.3539074
[188]
Zijie J. Wang, David Munechika, Seongmin Lee, and Duen Horng Chau. 2022. NOVA: A Practical Method for Creating Notebook-Ready Visual Analytics. arXiv 2205.03963 (2022). http://arxiv.org/abs/2205.03963
[189]
Hilde Weerts, Miroslav Dudík, Richard Edgar, Adrin Jalali, Roman Lutz, and Michael Madaio. 2023. Fairlearn: Assessing and Improving Fairness of AI Systems. arXiv 2303.16626 (2023). http://arxiv.org/abs/2303.16626
[190]
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. 2021. Ethical and Social Risks of Harm from Language Models. arXiv 2112.04359 (2021). https://doi.org/10.48550/arXiv.2112.04359
[191]
Laura Weidinger, Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, Iason Gabriel, Verena Rieser, and William Isaac. 2023. Sociotechnical Safety Evaluation of Generative AI Systems. arXiv 2310.11986 (2023). http://arxiv.org/abs/2310.11986
[192]
Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, Atoosa Kasirzadeh, Courtney Biles, Sasha Brown, Zac Kenton, Will Hawkins, Tom Stepleton, Abeba Birhane, Lisa Anne Hendricks, Laura Rimell, William Isaac, Julia Haas, Sean Legassick, Geoffrey Irving, and Iason Gabriel. 2022. Taxonomy of Risks Posed by Language Models. In 2022 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3531146.3533088
[193]
Benjamin Weiser and Nate Schweber. 2023. The ChatGPT Lawyer Explains Himself. The New York Times (2023). https://www.nytimes.com/2023/06/08/nyregion/lawyer-chatgpt-sanctions.html
[194]
Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. 2021. Challenges in Detoxifying Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2021. https://doi.org/10.18653/v1/2021.findings-emnlp.210
[195]
James Wexler, Mahima Pushkarna, Tolga Bolukbasi, Martin Wattenberg, Fernanda Viegas, and Jimbo Wilson. 2019. The What-If Tool: Interactive Probing of Machine Learning Models. TVCG 26 (2019). https://doi.org/10.1109/TVCG.2019.2934619
[196]
Meredith Whittaker. 2021. The Steep Cost of Capture. Interactions 28 (2021). https://doi.org/10.1145/3488666
[197]
Richmond Y. Wong and Vera Khovanskaya. 2018. Speculative Design in HCI: From Corporate Imaginations to Critical Orientations.
[198]
Richmond Y. Wong, Michael A. Madaio, and Nick Merrill. 2023. Seeing Like a Toolkit: How Toolkits Envision the Work of AI Ethics. Proceedings of the ACM on Human-Computer Interaction 7 (2023). https://doi.org/10.1145/3579621
[199]
Richmond Y. Wong and Tonya Nguyen. 2021. Timelines: A World-Building Activity for Values Advocacy. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems.
[200]
Austin P. Wright, Omar Shaikh, Haekyu Park, Will Epperson, Muhammed Ahmed, Stephane Pinel, Duen Horng (Polo) Chau, and Diyi Yang. 2021. RECAST: Enabling User Recourse and Interpretability of Toxicity Detection Models with Interactive Visualization. Proceedings of the ACM on Human-Computer Interaction 5 (2021). https://doi.org/10.1145/3449280
[201]
Austin P. Wright, Zijie J. Wang, Haekyu Park, Grace Guo, Fabian Sperrle, Mennatallah El-Assady, Alex Endert, Daniel Keim, and Duen Horng Chau. 2020. A Comparative Analysis of Industry Human-AI Interaction Guidelines. arXiv 2010.11761 (2020). http://arxiv.org/abs/2010.11761
[202]
Min Wu, Robert C. Miller, and Simson L. Garfinkel. 2006. Do Security Toolbars Actually Prevent Phishing Attacks?. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/1124772.1124863
[203]
Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel Weld. 2019. Errudite: Scalable, Reproducible, and Testable Error Analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1073
[204]
Tongshuang Wu, Michael Terry, and Carrie Jun Cai. 2022. AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts. In CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3491102.3517582
[205]
Albert Xu, Eshaan Pathak, Eric Wallace, Suchin Gururangan, Maarten Sap, and Dan Klein. 2021. Detoxifying Language Models Risks Marginalizing Minority Voices. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. https://doi.org/10.18653/v1/2021.naacl-main.190
[206]
Nur Yildirim, Mahima Pushkarna, Nitesh Goyal, Martin Wattenberg, and Fernanda Viégas. 2023. Investigating How Practitioners Use Human-AI Guidelines: A Case Study on the People + AI Guidebook. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3544548.3580900
[207]
J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3544548.3581388
