1 Introduction
The boom of the PC and mobile internet has revolutionized the way information is consumed and disseminated. AI-powered search and recommendation systems are now a common feature on social media, news, and streaming platforms, analyzing users’ behavior, preferences, and interactions to provide personalized content. While these systems have greatly improved the user experience, they can also exacerbate a phenomenon known as the “filter bubble” or “information cocoon”, where individuals tend to consume information that confirms their existing beliefs, potentially narrowing their perspectives and reinforcing their biases.
To address this issue, researchers have proposed and studied three primary approaches: optimizing recommendation algorithms, expanding users’ exposure to diverse information, and nudging them towards it. Recommendation algorithms have been designed to take more information into account, such as inter-item correlation [70], user profiles [30], social network information [60], and the diversity of the recommendations [39]. Researchers have also explored exposing users to a variety of information from other users on content platforms [6], or to news from agencies [51] or users [28] with different ideological standings. Other explorations include presenting the credibility of content [7], the reason for seeing a particular article [64], and visualizations of users’ political leanings [48] to nudge users to broaden their content consumption range and reflect on the information they engage with. However, previous approaches tend to focus on increasing the diversity of information and content exposure without sufficiently taking into account one of the most important elements in this process: the user. Simply providing users with diverse perspectives is not sufficient by itself [56]; it also requires the user’s willingness to explore and the capacity for in-depth processing of the content to truly move the needle. Therefore, it is crucial for users to discover, interact with, and reflect on diverse perspectives outside of their existing filter bubbles in order to effectively burst them.
Helping people deal with the “filter bubble” is challenging for two main reasons. First, the quantity, quality, and diversity of perspectives are highly dependent on user-generated content (UGC) on online platforms. UGC may carry its creators’ biases on a particular topic, which could further limit the availability of relevant perspectives to other users. Second, motivating users to engage with and think deeply about diverse perspectives requires a system that continuously understands its dialogues with users and provides instant, interactive, and inspiring feedback [32], which was difficult to accomplish with previous Natural Language Processing technologies [50].
Recent advancements in Large Language Models (LLMs) might provide an opportunity to overcome these challenges. These generative models possess the capability to effectively simulate a diverse array of viewpoints, personas, and expertise in a given domain [11, 41, 54]. Additionally, LLMs have shown promising results in engaging users in continuous dialogues and promoting in-depth thinking in various interactive scenarios, such as fostering meaningful conversations between teachers and students [61], as well as doctors and patients [58], in schools and hospitals [45].
Inspired by LLMs’ proficiency in generating contextually relevant text [42, 44], we proposed to utilize GPT-4 to engage users in meaningful multi-round dialogues, encouraging them to contemplate perspectives beyond their own filter bubbles rather than merely presenting them with diverse viewpoints. However, it remains unknown (1) how such an LLM-powered system should be designed and (2) whether and how such a system may help users access and reflect on diverse information. These are our two research questions (RQs).
To answer the first RQ, we adopted a human-centered approach and conducted a three-hour design workshop attended by a diverse group of participants, including HCI and UX researchers, designers, and psychologists, all of whom are also users of online content platforms. The workshop aimed to generate design ideas to address the research question. Following the workshop, we established three key design considerations around how to provide diverse perspectives, foster deliberate and critical thinking, and motivate user engagement. Based on these considerations, we designed interaction features that leverage LLM-powered multi-agent characters, a frictionless and progressive interaction flow, and gamification design to motivate users to interact with diverse perspectives and engage in thoughtful consideration while reading social media content.
To answer the second RQ, we developed a prototype incorporating the aforementioned interaction features and conducted a user study with 18 participants recruited from online content platform users. During the study, they participated in a range of activities, including reading posts and comments and interacting with multi-agent characters within our prototype. Both quantitative and qualitative methods were employed to assess the participants’ levels of engagement and the depth of their information processing. Results showed that participants were inclined to engage with unexpected viewpoints when those viewpoints were incorporated into human-like dialogues and enhanced by gamification incentives. This engagement, coupled with progressive assessment tasks, enriched their understanding and stimulated deeper reflection across a broader range of perspectives. In sum, our work made the following contributions:
• We have identified three crucial design considerations for bursting filter bubbles through a participatory design workshop;
• We have designed and developed a prototype with interaction features to promote deeper engagement and critical thinking with diverse information;
• We have carried out an empirical laboratory study that evaluated the efficacy of these design considerations and features, and we present key design implications to guide future practice in assisting users to burst filter bubbles with Large Language Models.
3 Prototype Design
To answer the first RQ, we first conducted a design workshop to derive design considerations (DCs) to guide the design of an LLM-powered multi-agent system (Figure 2). Based on the DCs, we defined corresponding interaction features and incorporated them into the LLM-powered multi-agent system design.
3.1 Design Workshop
We first conducted a three-hour design workshop that brought together a multidisciplinary team consisting of three HCI and UX researchers, two designers, and two psychologists (referred to as S1-S7 hereafter). All participants were also users of online content platforms.
During the workshop, participants were initially briefed on the concept of the “filter bubble effect” on social media and the proposal to use an LLM-powered system to help users reflect on diverse viewpoints. The target audience includes all online content consumers, regardless of their awareness of their position within filter bubbles. The goal of the workshop was to engage the participants in brainstorming the design of such a system, drawing upon their professional expertise as well as their personal experiences as social media users.
The workshop was structured into two sessions, each lasting approximately 1.5 hours. The first session discussed the interaction flow, such as how to provide relevant information to users and how to encourage user reflection. The second session focused on the interaction format, such as the visual design style and interface layout. In both sessions, participants were also encouraged to identify potential issues that could arise during user interactions and to propose any solutions they could conceive. Each session consisted of three parts:
• Part 1 (15 minutes): Brain-writing, during which participants individually brainstormed and wrote down their ideas.
• Part 2 (40 minutes): Brain-sharing, where participants sequentially shared their ideas.
• Part 3 (30 minutes): Discussion of the shared ideas, including their pros and cons, as well as new ideas inspired by those presented in Part 2.
We recorded the entire workshop and transcribed it. We also retained the sketches and idea cards drawn by participants during the workshop. Subsequently, two HCI researchers independently coded the transcripts and sketches. They organized the data into a table with columns covering potential issues, proposed solution ideas, and the advantages and disadvantages of the proposed solutions. They then discussed their coding until a consensus was reached. Based on these discussions, we derived design considerations and designed our prototype accordingly.
3.2 Design Considerations
During the workshop, one intriguing design concept that surfaced and was explored extensively was leveraging LLMs to anthropomorphize multiple AI agents (referred to as multi-agent hereafter), that is, “generating vivid human-like characters with distinctive perspectives” (S1). Integrating these personalities could “foster user empathy towards the AI agents, thereby facilitating a deeper comprehension of the diverse ideas” (S1).
Expanding upon this concept further, what design techniques could be applied to ease resistance and nurture reflection on differing ideas was also discussed in the workshop. Participants contributed ideas such as “structuring discussions in a way that incrementally introduces alternative views might alleviate the discomfort often associated with encountering opposing perspectives” (S3), and “incorporating fun reward mechanics that promote active engagement with a sense of accomplishment” (S4).
Another recurring point was the inherent conflict between engaging users in consuming diverse content and promoting deep thinking. For example, a “frictionless interaction flow with minimal cognitive load is desirable for encouraging users to view more content, yet this approach may predispose users to superficially process information” (S1). Similarly, in terms of visual design, a “thoughtful visual style might prompt users to process information more seriously” (S4) but could also “diminish their willingness to use the system” (S6).
Based on these findings, we derived the following three design considerations (DCs) that an LLM-powered system should address.
3.2.1 DC1: Providing Diverse Perspectives through Multi-agent Characters.
In order to assist users in breaking out of their filter bubbles, the AI agents in the system should offer a wide and comprehensive range of perspectives. To achieve this, their persona (including age, gender, education level, profession, etc.) and their attitudes toward the topic should be sufficiently diverse, allowing users to be exposed to a rich variety of characters and viewpoints. “The personas created should be detailed... This not only ensures better prompting outcomes for GPT but also results in more vivid character representations.” (S2)
Moreover, it is recommended that we present a holistic view of the perspectives, enabling users to easily grasp the full picture of the viewpoints. In doing so, we could “reduce users’ cognitive load” (S6) by summarizing the information for them, while still preserving the richness of the content.
3.2.2 DC2: Fostering Deliberate and Critical Thinking through Progressive Interaction and Assessment Tasks.
Simply presenting users with a range of perspectives does not ensure that the information will be effectively absorbed. It is equally crucial to steer users towards more deliberate contemplation. Given humans’ natural propensity to focus only on content that aligns with their pre-existing beliefs, it is recommended to introduce contrasting views gently. “When people use social media, encountering completely opposite opinions can be hard to accept and may even elicit anger” (S2). Thus, when AI agents present their viewpoints, we should prompt the LLMs to employ persuasive techniques. This approach aims to prevent the onset of cognitive dissonance, which could cause users to cling even more firmly to their existing beliefs.
Furthermore, we could incorporate assessment tasks to steer users toward a deeper “semantic processing, an indicator of deep processing” (S1) of the perspectives. Through feedback from these tasks, users could also check their comprehension of the presented views.
3.2.3 DC3: Motivating User Engagement through Natural Interaction and Gamification Design.
Viewing and reflecting on opposing viewpoints is not a natural inclination for humans. They may struggle to stay focused and wish to shift to tasks that require less mental effort. “People are typically not primed for deep processing of information on social media; hence, it’s crucial for the system to be engaging.” (S4) As a result, it is imperative that we use design to encourage users to interact with perspectives that challenge their own beliefs. First, a natural and frictionless interaction design that does not disturb the user’s intended browsing experience is advised. Second, the cognitive demands placed on users during their interaction with the system should be minimized. Third, we could employ gamification incentives to motivate users to prolong their exploration within the system.
3.3 Prototype Features
Based on these design considerations, we designed the architecture of our prototype to resemble a mainstream text-based forum, aiming to simulate the user experience of browsing online media while minimizing distractions. More specifically, our prototype incorporates five core interaction features (Figure 3) that could help navigate users out of their filter bubbles.
3.3.1 LLM-powered Multi-agent Characters.
Our design incorporates multi-agent characters with diverse perspectives generated by the state-of-the-art Large Language Model GPT-4. Each character has a realistic background in terms of gender, age, occupation, and education. To enhance the sense of realism, each character is represented by an avatar matching their persona, aiming to give users the impression of talking to a real person. These avatars are displayed in the avatar panel at the top to encourage users to explore other characters and their perspectives. Upon selection, a character overview is presented to facilitate a better understanding of the character (Figure 3a).
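As a concrete illustration, a character of this kind could be driven by a persona-conditioned system prompt, as in the minimal sketch below, assuming the OpenAI Python SDK; the persona fields, prompt wording, and `agent_reply` helper are illustrative assumptions, not the prompts used in our prototype.

```python
# Minimal sketch of persona-conditioned dialogue (assumed OpenAI Python SDK, v1).
# The persona values and prompt wording are illustrative, not our exact prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

persona = {
    "name": "Wang Yanli",
    "age": 48,
    "gender": "female",
    "occupation": "janitor",
    "education": "middle school",
    "attitude": "fears her physical strength will not sustain her until the delayed retirement age",
}

SYSTEM_PROMPT = (
    "Role-play {name}, a {age}-year-old {gender} {occupation} with a "
    "{education} education. Your stance on the delayed retirement policy: "
    "{attitude}. Stay in character, answer in the first person, and present "
    "disagreement gently and persuasively rather than confrontationally."
).format(**persona)

def agent_reply(history: list[dict]) -> str:
    """Return the character's next utterance given the dialogue so far."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
    )
    return response.choices[0].message.content
```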
3.3.2 Frictionless Interaction Flow.
To ensure a smooth and natural interaction, we integrated the primary entrance for initiating a dialogue with multi-agent characters directly within the comments section of the primary forum-like interface. This layout keeps users’ attention within the same visual field while reading posts and reduces the disruption caused by switching between different areas. Furthermore, we provided default response options generated by the LLM during the conversation (Figure 3b). These options, such as seeking clarification or asking for elaboration on viewpoints, serve as a user-friendly guide, reducing the cognitive burden by minimizing the need for active input.
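The default response options can be produced with a second, lightweight call after each agent reply. The following hedged sketch illustrates the idea; the prompt wording, the choice of three options, and the JSON output format are our assumptions, not the exact implementation.

```python
# Hedged sketch: ask the LLM for short follow-up questions the user can tap
# instead of typing. Prompt wording and JSON format are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def suggest_options(last_agent_reply: str, n: int = 3) -> list[str]:
    """Ask GPT-4 for n short follow-up questions grounded in the last reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f'An AI character just said:\n"{last_agent_reply}"\n'
                f"Write {n} short follow-up questions a reader might ask, "
                "e.g. asking for clarification or further elaboration. "
                "Return only a JSON array of strings."
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)
```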
3.3.3 Viewpoints Jigsaw Puzzle.
To encourage users to engage with diverse perspectives, we introduced a novel feature called the Viewpoints Jigsaw Puzzle (hereinafter the “Viewpoints Puzzle”). This feature runs parallel to the dialogue window and follows the reward mechanisms of games. A progress indicator for dialogue rounds is shown at the top of the dialogue window to encourage further conversation; when a conversation with an AI agent lasts for five or more rounds, that agent’s avatar is “lit up”, and the user is encouraged to light up the other avatars as well. The Puzzle itself consists of five pieces, each representing a different character’s viewpoint, and users are encouraged to light up all pieces by interacting with every character at the required level of engagement (Figure 3c).
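The underlying bookkeeping is simple. Below is a minimal sketch of the round-counting and unlocking state, under our reading of the five-round threshold described above; all names and the data layout are illustrative assumptions.

```python
# Minimal sketch of the Viewpoints Puzzle bookkeeping: an avatar is "lit up"
# once a dialogue with that agent reaches five rounds, per the description
# above. Names and the dict/set layout are illustrative assumptions.
from dataclasses import dataclass, field

ROUNDS_TO_LIGHT_AVATAR = 5  # "five or more rounds"

@dataclass
class PuzzleState:
    rounds: dict[str, int] = field(default_factory=dict)  # agent -> completed dialogue rounds
    lit_avatars: set[str] = field(default_factory=set)    # avatars lit by sustained conversation
    lit_pieces: set[str] = field(default_factory=set)     # pieces lit by correct answers (see 3.3.5)

    def record_round(self, agent: str) -> None:
        """Count one question-answer round with the given agent."""
        self.rounds[agent] = self.rounds.get(agent, 0) + 1
        if self.rounds[agent] >= ROUNDS_TO_LIGHT_AVATAR:
            self.lit_avatars.add(agent)

    def complete(self, agents: list[str]) -> bool:
        """True once every agent's puzzle piece has been lit."""
        return all(a in self.lit_pieces for a in agents)
```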
3.3.4 Progressive Viewpoints Sequence.
To prevent users from becoming overwhelmed by an excessive number of viewpoints, we present the various perspectives gradually. Initially, only one perspective is displayed at the entrance, with the option to expand additional perspectives if the user desires. Each click reveals an additional AI agent along with their opinion, allowing for a more gradual and incremental understanding of the content. Furthermore, we ordered the presentation of the characters by attitude, from negative (the mainstream attitude in the posts) to positive, with the intention of facilitating a progressive understanding of the differing viewpoints. Specifically, we begin by presenting characters whose viewpoints are similar to users’ existing beliefs and gradually introduce characters with increasingly contrasting viewpoints (Figure 3d).
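The ordering itself amounts to sorting agents by an attitude score. A one-step sketch follows; the numeric scores are placeholders for illustration, not values from our prototype.

```python
# One-step sketch of the progressive presentation order: agents are sorted by
# an attitude score running from negative (the posts' mainstream stance) to
# positive, so contrasting views surface gradually. Scores are placeholders.
agents = [
    {"name": "Agent 1", "attitude": -0.8},  # closest to the posts' prevailing view
    {"name": "Agent 2", "attitude": -0.4},
    {"name": "Agent 3", "attitude":  0.0},
    {"name": "Agent 4", "attitude":  0.5},
    {"name": "Agent 5", "attitude":  0.9},  # most contrasting viewpoint
]

presentation_order = sorted(agents, key=lambda a: a["attitude"])
# Each click on the entrance reveals the next agent in this order.
```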
3.3.5 Assessment Task with Multi-choice Questions.
In addition to the gamification design, we incorporated multiple-choice questions on the Puzzle interface as a special assessment task. This task provides users with an opportunity to self-evaluate their understanding of the viewpoints they have interacted with. When the questions related to a particular viewpoint are answered correctly, indicating that the user has grasped it, a piece of the puzzle is illuminated. Once the user has successfully completed the assessment tasks for all characters, the entire puzzle is illuminated, symbolizing the bursting of the filter bubble and the acquisition of more comprehensive and diverse information (Figure 3e). By combining this assessment task with the gamification incentives, we aim to encourage continuous engagement, motivate thoughtful consideration, and deepen users’ understanding of different perspectives.
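Grading ties the questions back to the Puzzle state: a correct answer lights the corresponding piece. A hedged sketch, reusing the `PuzzleState` from the earlier sketch; the question data (paraphrasing Appendix C) and field names are assumptions.

```python
# Hedged sketch of the assessment step, reusing PuzzleState from the earlier
# sketch: a correct multiple-choice answer lights that agent's puzzle piece.
# The sample question paraphrases Appendix C; field names are assumptions.
QUESTIONS = {
    "Agent 2": {
        "options": [
            "She is worried about not having enough salary.",              # wrong
            "She fears her physical strength will not sustain her "
            "until the delayed retirement age.",                           # correct
            "She is concerned about not having sufficient savings.",       # wrong
        ],
        "correct": 1,
    },
    # ... one entry per agent
}

def submit_answer(state: PuzzleState, agent: str, choice: int) -> bool:
    """Light the agent's puzzle piece if the chosen option is correct."""
    if choice == QUESTIONS[agent]["correct"]:
        state.lit_pieces.add(agent)
        return True
    return False
```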
4 User Study
To answer the second RQ, we developed a prototype with all the interaction features identified in the participatory design study. We then evaluated users’ attitudes toward communicating with AI agents while viewing online posts using our prototype, as well as the effect of LLM-generated opinions on the depth and diversity of users’ information seeking, through a user study with experienced social media and online forum users. This study was approved by the Institutional Review Board.
4.1 Participants
We recruited 18 participants (9 female, 9 male, aged 21-32, referred to as P1-P18 hereafter) through word-of-mouth and snowball sampling. All participants had more than five years of experience viewing posts on social media, and all had experience using Large Language Model chatbots (e.g., ChatGPT). Participants were compensated $25 for an approximately 60-minute session.
4.2 Materials
We gathered posts regarding the “delayed retirement policy” from the internet. The “delayed retirement policy” is designed to incrementally increase the retirement age, addressing the nation’s aging population and associated economic challenges. This topic was selected for the following reasons:
• It was a topic that had garnered widespread attention and discussion on the internet at the time the experiment was conducted.
• The policy has a significant impact on a broad demographic, especially the younger generation, as the policy is intended to be implemented progressively to allow for societal adaptation.
• Public opinion in online discussions about this policy was predominantly skewed, marked by widespread concern and discontent regarding the extension of working years and delayed pension benefits [31, 76].
Alongside, we prompted GPT-4 to generate five AI agents endowed with detailed and comprehensive personas and perspectives on this subject (Table 1).
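The persona-generation step can be expressed as a single structured-output request. A minimal sketch follows, assuming the OpenAI Python SDK; the exact prompt behind Table 1 is not reproduced in this paper, so the wording and JSON schema below are illustrative assumptions.

```python
# Minimal sketch of the persona-generation step (assumed OpenAI Python SDK).
# The prompt wording and JSON schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

PERSONA_PROMPT = (
    "Generate five distinct personas for a public discussion of the delayed "
    "retirement policy. For each persona give: name, age, gender, occupation, "
    "education, and a one-sentence attitude toward the policy. Cover the full "
    "range from strong opposition to strong support. "
    "Return only a JSON array of five objects."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PERSONA_PROMPT}],
)
personas = json.loads(response.choices[0].message.content)
```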
4.3 Procedure
Participants were first informed of the aim of this study and signed a consent form. Experimenters then introduced the key features of the system and demonstrated their usage. Participants were asked to view posts under the topic of “retirement policy” and communicate with AI agents using our system, which was deployed as a web application, on a laptop for around 30 minutes. After finishing the viewing task, participants were asked to rate their experience of using this prototype on a set of 5-point Likert scales. The experimenters then conducted a semi-structured interview based on the results and observed use patterns.
4.4 Thematic Analysis
All study sessions were recorded and transcribed. Two authors read through the transcripts of three randomly selected participants together to understand their user experience of the prototype. Then, they independently coded the transcripts using an open-coding approach [10]. They combined deductive and inductive coding techniques to form the codebook. The two coders regularly discussed the codes and resolved disagreements to create a consolidated codebook. Further meetings were scheduled with the whole research team to discuss the codes and how they should be grouped into themes. The whole team iterated on the codes and their grouping until they reached consensus. In the end, we arrived at four themes: overall user behavioral patterns, engagement, diverse information, and in-depth information processing.
5 Results
In this section, we first outline the behavioral patterns of users and their perceptions of the system. Then we discuss the findings according to our three design considerations.
5.1 User Behavioral Patterns and System Perceptions
We first examined the behavioral patterns of participant interactions. During prototype testing, participants generally began by viewing some posts, then interacted with the AI agents and explored the Viewpoints Puzzle. Based on the sequence in which participants engaged with the AI agents, we classified them into three categories (Figure 4): seven (out of 18) participants initially chose to chat with AI agents based on their own interests (Figure 4a, interest-driven conversation order); three began by following the order presented in the system (including the order displayed at the entrance, in the Viewpoints Puzzle, or in the avatar panel) but circled back to engage with agents of their interest (Figure 4b, system-guided followed by interest-driven conversation order); and the remaining eight followed the system’s presentation order (Figure 4c, system-guided conversation order). Examples of conversation logs from three participants representing each category are provided in Appendix B. Thirteen participants interacted with all the AI agents. One participant (P16) exhibited a unique behavior pattern: he chose to interact with two AI agents simultaneously, alternating between them and asking each to consider the perspective of the other.
The post-survey indicated mixed feedback among participants regarding the system. Among the 18 participants, 14 considered the system interesting, as indicated by ratings of agree/strongly agree, and the other 4 rated it neutral (Mean = 3.83, SD = 0.51). 11 participants reported a positive user experience with ratings of agree/strongly agree, 6 rated neutral, and only 1 gave a negative rating of disagree/strongly disagree (Mean = 3.56, SD = 0.62). When asked about their willingness to use the system in the future, 11 participants expressed positive attitudes with ratings of agree/strongly agree, 5 rated neutral, and 2 rated negatively with disagree/strongly disagree (Mean = 3.50, SD = 0.71). These ratings indicate that, overall, participants’ attitudes towards the system leaned positive, though not without reservations and concerns. The favorable ratings primarily stemmed from the incorporation of multi-agent characters generated by LLMs and the gamification design. Participants valued the system’s novelty, describing it as “fun character design” (P3), an “enhanced conversational experience similar to role-playing games” (P7), and “more engaging than regular social media browsing” (P8). The neutral and few negative ratings suggested that for some participants, the system did not fully meet their expectations. Some perceived it as “not as effective as talking to a real person” (P12), and others expressed concerns like “feels like taking a reading comprehension test when answering those questions” (P10).
Regarding conversations with AI agents, 12 out of 18 participants rated the conversational flow appropriate and smooth, with ratings of agree/strongly agree; 3 rated neutral, and 3 gave negative ratings of disagree/strongly disagree (Mean = 3.61, SD = 1.09). Participants in favor of the conversational flow praised the ability of LLMs to “understand the context and generate abundant content accordingly” (P16), as well as the design of pre-generated response options that “keep the dialogue moving smoothly” (P11). When engaging with the agents, participants opted for pre-generated response options 55% of the time, while they chose to manually type text for the remaining 45% of interactions. Intriguingly, three participants were inspired by the content of certain posts and asked the agents for their opinions on those specific topics. Regarding neutral and negative feedback, participants primarily raised concerns about the format of the LLM-generated responses. Some noted that “the generated text may be too long and complex for people with lower levels of education to comprehend” (P3), and others expressed expectations such as “adding pictures or visual elements to the current text-only conversation could enhance clarity” (P18), suggesting possible refinements in the future.
Table 2 outlines participants’ numbers of interaction rounds, as well as their ratings of both the pleasure and helpfulness of conversations with each AI agent. Agent 5 had the most interaction rounds but received the lowest ratings for both pleasure and helpfulness. Post-interviews revealed that, because Agent 5’s viewpoints were markedly different from those expressed in the posts (and potentially from the participants’ own perspectives), some participants expressed a desire to “debate with and convince him” (P3).
5.2 Diversity in Information Acquisition (DC1)
5.2.1 Role Settings Allow Conversations with Various Perspectives.
The core feature of our prototype is to provide varied perspectives through multiple AI agents. Results indicated that the role and perspective settings generated by GPT-4 (Table 1) effectively “offered distinct perspectives” (P7, P11). In the thematic analysis, the two coders found that the responses of the AI agents consistently aligned with their pre-determined attitudes, ranging from dissatisfaction to support for the policy. When asked to evaluate their level of agreement regarding whether engaging in conversations with AI agents would help acquire more diverse information, 12 out of 18 participants rated agree/strongly agree, 4 rated neutral, and 2 rated disagree/strongly disagree (Mean = 3.72, SD = 0.89). Those in favor highlighted the benefit of accessing diverse perspectives through interactions with AI agents, as P17 articulated: “I could immediately see the differences in perspectives from different AI identities, which broadened my desire to explore a wider array of information.” Some discontent was noted regarding the “predictability of the perspectives based on the characters’ identities” (P3) and viewpoints “not exceeding the existing scope of knowledge” (P16).
Interestingly, we found the generated perspectives could also be novel and insightful. During the post-interviews, 11 participants pointed out that their interactions with the AI agents introduced them to previously unconsidered and inspiring viewpoints. As P4 stated, “some viewpoints of the agents were unfamiliar to me, which enriched my understanding of others and the society.” Additionally, two participants mentioned that the generated response options could also be inspiring; as P4 noted, they could “stimulate and guide directions of the conversation”.
However, 12 participants pointed out that AI agents’ responses often tended to be broad and vague. Moreover, both the tone and content of the AI agents appeared to “converge as the conversation progressed” (P6), focusing predominantly on the pros and cons of the policy and related policies around the world, leading one participant to conclude that “they must all be driven by the same underlying AI” (P16). It is noteworthy that the reported “convergence” was specific to the dialogue content itself as conversations between AI agents and human users evolved, while the attitudes and perspectives of the AI agents towards the policy remained unchanged. This occurred because the AI agents did not articulate their assigned viewpoints in every response to users, similar to how we as humans might not find it necessary to constantly restate our stances throughout a conversation.
5.2.2 Viewpoints Puzzle Encourages Explorations with Various Perspectives.
Regarding the usefulness of the Viewpoints Puzzle design, there was a mix of positive, neutral, and negative ratings among the participants (Mean = 3.67, SD = 1.03). 12 out of 18 rated agree/strongly agree for this design being useful, recognizing that it could provide “a full picture of all the perspectives” (P18); 4 rated neutral, and 2 rated disagree/strongly disagree, citing hesitations like “it is only likely to be effective when I have a lot of free time” (P14). Several participants used the Viewpoints Puzzle as an index, navigating through it to engage in conversations with various AI agents. As evidenced in the transition diagram between interface elements (Figure 5), a portion of participants accessed the dialogue window through the Viewpoints Puzzle, some by clicking on individual agents’ puzzle pieces. In addition, two participants recommended enhancing the Viewpoints Puzzle with additional information, allowing them to “more effectively understand the core ideas of the perspectives” (P18).
5.2.3 The Conversation Mode Facilitates a Rapid Understanding of New Topics.
Several participants found the prototype particularly beneficial when exploring new topics. P17 stated that the prototype was “informative and valuable” for such endeavors, and P6 stated that “the conversation mode provided me a more convenient and efficient way to acquire new and diverse information”. Table 2 reveals that Agent 5 (the economist) and Agent 3 (the HR manager), who were generally considered more knowledgeable about the policy under discussion, were engaged in the highest numbers of conversational rounds. Additionally, when it came to asking questions about the policy, these two agents were the most frequently queried (Agent 3: 24%; Agent 5: 23%), on topics such as factors to consider in policy implementation and international policy practices. These results suggest that participants tended to consult these potentially knowledgeable agents for basic knowledge on the topic, which was echoed by P13, who expressed a desire for the AI agents to “provide some basic knowledge about this topic”.
5.3 Depth of Information Processing (DC2)
5.3.1 Talking with Multiple Characters Could Stimulate Users’ Reflection.
Overall, participants believed that conversations with AI agents facilitated deeper contemplation on the topic (14 out of 18 rated agree/strongly agree, 3 rated neutral, and only 1 rated disagree/strongly disagree, Mean = 3.78, SD = 0.65). During the conversations, participants could opt to respond either by typing or by selecting from the system-generated options. Log data revealed that 17 out of 18 participants engaged in typing at some point, despite not being explicitly requested to do so during the study, suggesting a certain level of deliberation, as opposed to shallow interaction through clicks. Furthermore, semantic coding of the participants’ responses showed that 12 participants typed their own opinions, oppositions, or counter-questions to the AI agents, further substantiating the argument that a certain level of in-depth thinking took place during conversations.
We identified two primary reasons from the feedback during the post-interviews. First, the conversational nature was conducive to stimulating deeper thinking. As the participants pointed out, “it allows me to discuss with ‘people’ about the topic, and the very process could prompt me to think more deeply” (P18). P16 noted that “it’s not always convenient to discuss these topics with friends, and if you try to engage through forum posts or comments, there may not be immediate or any responses. On the contrary, interacting with the agents in the system is convenient and inspiring.” Second, given the presence of multiple AI agents, the diversity of their viewpoints also helped users “understand and reflect on the topic from various perspectives” (P6). P7 noted that “engaging in conversation with diverse characters with different viewpoints made me think more critically”.
However, 4 participants also expressed concerns about the credibility of the AI agents, which “lacked evidence and concrete examples to substantiate their claims” (P12). In the post-survey, 7 out of 18 participants indicated that they disagree/strongly disagree with the statement that they could be persuaded by the AI agents (9 rated neutral, 2 rated positive, Mean = 2.61, SD = 0.85). The lack of credibility “somewhat limited my inclination for in-depth and meaningful discussions with the agents” (P18).
5.3.2 Role of Response Options.
Participants also pointed out that the response options allowed them to “think more extensively” (P11). Since we designed the provided options to be questions that could be asked based on the agent replies, these questions encouraged participants to “further inquire and engage in dialogues” (P5), probably because the options stimulated participants’ curiosity.
5.3.3 Role of the Viewpoints Puzzle and Multi-choice Questions in Summarization.
Some participants stated that the Viewpoints Puzzle served as a useful tool for “summarizing and organizing various viewpoints” (P3, P11). P12 went further and suggested the map could be “organized according to the viewpoints, such as along an axis indicating support versus opposition.”
Regarding the multi-choice questions, 13 out of 18 participants rated that these questions facilitated a better understanding of each AI agent’s perspective, while 5 rated neutral (Mean = 3.78, SD = 0.55). “The content of the questions is concise, helping me to easily grasp the main ideas” (P14) and “enhance understanding” (P17), though P18 did not consider the assessment module necessary, because “Individuals naturally comprehend viewpoints that interest them without needing specific assessment tools. Viewpoints that fail to capture one’s interest are not seen as crucial to understand.” In addition, an intriguing behavioral pattern emerged: some participants clicked back and forth to review the chat logs while answering the questions. A typical example is P6 (Figure 4c), who reviewed the dialogue history while answering the multiple-choice questions. This further attests to the role of multiple-choice questions in encouraging deeper processing of the content.
5.4 User Engagement (DC3)
5.4.1 Effect of Gamification Design.
Overall, participants were willing to engage with the AI agents to acquire information (13 out of 18 rated agree/strongly agree, 3 rated neutral, 2 rated disagree/strongly disagree, Mean = 3.67, SD = 0.77). They were also motivated to interact by the gamified feature of “lighting up” the Viewpoints Puzzle. On average, participants illuminated 4.61 agent avatars by completing five or more rounds of conversation with each agent; notably, 12 participants successfully lit up all the avatars. Similarly, through answering multiple-choice questions, participants on average lit up 4.33 of the agents’ puzzle pieces; 11 participants successfully lit up all the puzzle pieces. Most participants acknowledged a desire to illuminate the Viewpoints Puzzle, with one participant noting that successfully doing so acted as “positive reinforcement that made him feel he had gotten to know the corresponding agents” (P16). However, a few participants suggested that “more tangible rewards would be more useful” (P1), while P18 felt that the map gave him “a sense of obligation rather than motivation.”
5.4.2 Effectiveness of Role-playing of the AI Agents.
During the conversations, all participants used the second-person pronoun to talk to the AI agents, such as “What do you think about this issue?” (P1) or “I don’t think you are right” (P5). Furthermore, 13 out of 18 participants asked the agents at least one personal question, such as “What are your plans?” (P1) or “Have you considered changing careers before retirement?” (P3). These behaviors indicate that our embodiment of the AI agents was effective, as users indeed treated them as distinct characters.
However, many participants pointed out that the AI agents’ role-playing was still lacking in two aspects. First, the authenticity of the role-playing was inadequate, particularly for blue-collar workers. As participants expressed in the interviews: “A deliveryman being highly knowledgeable about policies seemed unrealistic and inconsistent with my expectations” (P17); “I prefer talking to agents whose statements align with their identity” (P10); “When I asked factual questions, like existing policies, the answers were quite similar across the agents” (P4).
Second, the AI agents fell short in their ability to convincingly play “real humans”. “They don’t feel personal enough,” said one participant (P6). This limitation may be related to the fact that we limited the response length of the agents in our prompts. “Their responses were all of a uniform length, which is not very human-like. A mixture of long and short responses would be more realistic” (P16); “I wish the format of the responses could be more diverse, such as including images or emojis” (P18).
5.4.3 Cognitive Load.
Scores from the NASA-TLX scale indicated that the system did not impose a significant burden on the participants (Table 3). Specifically, the response options effectively reduced participants’ cognitive load, as evidenced by the fact that 55% of user responses were made by clicking on these options. P18 noted, “The setting of the options is great; they were different from one another, and I could basically always find what I want.”
5.4.4 User Perceptions of the Entrance.
Regarding the setting of the entrance, participants had varying suggestions. Some suggested more interaction between the multi-agent system and the posts, such as “including AI agents’ responses in the post might make me feel more engaged” (P10) and “hoping to discuss the post content with the agents” (P2). Some participants hoped for a permanent entrance offering on-demand access, which would bring “a sense of control. It currently looks like the posts, which can be accidentally clicked on” (P18). In addition, some participants suggested the system could “automatically detect if I’m currently in a filter bubble and provide new perspectives accordingly” (P12).
6 Discussion
Our research aimed to address the two RQs outlined earlier: how such an LLM-powered system should be designed, and whether and how such a system may help users access and reflect on diverse information. For the first RQ, we orchestrated a participatory design workshop to brainstorm ideas, from which we derived three design considerations. Then we defined key interaction features accordingly and finalized the prototype design. For the second RQ, we implemented this prototype featuring LLM-powered multi-agent characters that participants interacted with while reading social media content and ran an evaluative study. Our analysis, including participants’ rating scores, interaction patterns, and interviews, unveiled three main insights:
• Participants demonstrated interest in interacting with the LLM-powered multi-agent system. Even when an AI agent’s viewpoints challenged their existing beliefs (e.g., Agent 5), they were willing, if not more inclined, to engage in dialogue, facilitated by well-designed gamification incentives and an inherent motivation probably driven by curiosity.
• Progressive interactions with assessment tasks could deepen participants’ understanding of opposing viewpoints and provoke thoughtful and careful consideration, an essential step towards escaping the filter bubble.
• Participants’ concerns revealed two main technical barriers to leveraging current Large Language Models to effectively deliver diverse perspectives: inaccurate character representation and over-generalization lacking contextual depth.
In this section, we delve into these insights one by one, and discuss design implications with the outlook of future work for better assisting users to burst their filter bubbles.
6.1 Motivating Engagement through Exploratory Time
Our study showed that when users conversed with AI-generated multi-agents possessing diverse viewpoints, they displayed a desire to understand the reasoning behind these perspectives and how they were formed, rather than dismissing them. Most users enjoyed the experience of interacting with different roles and found it helped them obtain diverse information. In fact, some participants even deviated from the predetermined sequence of conversations to prioritize interacting with the agents they found most intriguing, indicating a significant degree of motivation and engagement. Such motivation was further enhanced by a small design feature whereby participants could light up all five pieces of the Viewpoints Puzzle after completing interactions with all the multi-agent characters.
6.1.1 Design Implication 1: Providing Continuous Dialogue with Multi-agents that Offer Diverse Perspectives.
Large Language Models like GPT-4 have demonstrated the ability to convincingly portray multi-agent characters with extensive domain-specific knowledge [11, 53]. This breadth and expertise allow each character to generate distinctive viewpoints with compelling reasoning that is consistent with their character [43, 73]. Our designed characters span a range of professions, from economists to blue-collar workers, ensuring that the perspectives presented are not limited to or influenced by the background of any particular group of people.
Conversational interfaces can help people retrieve information quickly, as the natural conversational flow allows people to get concise and relevant information, and the interactive nature of conversation can adapt to users’ needs in real time [18]. In our study, participants also reported that the dialogue flow design helped them rapidly understand a new topic. As the conversation developed, users had the opportunity to engage with each character by asking questions, and responses were generated instantly by GPT-4. This experience stood in contrast to traditional online content platforms, where it is often difficult to interact with other users in real time through comments or posts. Consequently, the direct and interactive mode of conversing with AI agents about specific topics emerged as a compelling option for users browsing online media content.
Participants exhibited interesting behavior by asking some AI agents to change their roles and answer the same question again (P16). Some participants were even curious about what would happen if AI-generated characters interacted and discussed their perspectives with each other. Prior work has explored the design and development of a virtual world using Large Language Models, in which generative agents have simulated minds with memories and experiences that allow them to interact with each other [53]. Future work can explore how these social interactions among AI agents can help engage people with diverse perspectives.
While the effectiveness of generating diverse outputs by LLMs is established, the consistency of these outputs over extended conversations remains an open question. Future work is necessary to explore the utilization of interactive design features and human-in-the-loop feedback to ensure that the same prompt consistently produces responses with similar viewpoints at different times, maintaining a consistent character voice.
6.1.2 Design Implication 2: Designing Gamification Incentives to Promote Exploration.
The integration of game design elements into applications has gained increasing interest in recent years [40]. In education, healthcare, and customer engagement, gamification incentives can serve as a powerful tool to motivate users to achieve specific goals or outcomes through game design elements such as points, badges, and leaderboards [14, 24, 34, 77].
In our prototype, we designed gamification incentives accordingly, i.e., collecting all pieces of the Viewpoints Puzzle as a common form of badge, to encourage participants to engage more with new and challenging information. Our study suggested that the puzzle-collecting design effectively motivated users to explore and seek out information with diverse perspectives. Participants frequently mentioned using the Viewpoints Puzzle as a navigation bar due to its intriguing nature, with many wanting to discover what happens next by clicking on different parts. Additionally, the design instilled a sense of “winning desire” in some participants, leading them to want to collect all pieces of the puzzle through interacting with the multi-agent characters. Therefore, the gamification incentives might make users more inclined to interrupt their habitual scrolling through media content and engage with the system.
It is worth noting that one participant exhibited an interesting behavior pattern that we refer to as “rushing to the finish line” (P2). She focused exclusively on assembling all the puzzle pieces as quickly as possible, rather than taking the time to ask questions and understand the perspectives generated by the characters. This finding suggests that excessive reliance on gamification may lead some users to prioritize completing tasks or achieving rewards over the actual learning engagement itself. Previous research has found that extrinsic rewards might undermine users’ intrinsic motivations [19]. Further work could explore designing interactions that tap into users’ intrinsic motivations to create experiences that prioritize genuine engagement and avoid the potential pitfalls of over-gamification, for example, by providing generative feedback that highlights the user’s progress and understanding. Moreover, rewards and incentives can be designed to encourage collaboration among multi-agents and human users, shifting the focus from individual rewards to collective achievements and shared experiences.
6.2 Designing Progressive Interactions with Assessment Tasks to Enhance Deep Thinking and Understanding
Our work explored the use of progressive interactions with assessment tasks to encourage critical thinking about and understanding of diverse perspectives. The user study indicated that these types of interactions could encourage participants’ deliberate consideration of different viewpoints. The progressive interactions facilitated critical thinking by gradually increasing complexity and diversity, as participants engaged in careful consideration and thoughtful discussion while completing assessment tasks.
6.2.1 Design Implication 3: Providing Progressive Interactions.
Previous work suggests that structured progressive interactions could enhance critical thinking abilities among people [26, 74, 80]. Our study extends prior work by showing that presenting diverse viewpoints through natural conversational interactions with AI-generated characters encouraged participants to give careful consideration to new information.
In our prototype, two progressive interaction designs were implemented to promote deliberate and critical thinking. Immediate feedback from AI-generated dialogue serves as a natural progression, providing contextual information such as reasoning chains, examples, and stories through questioning and answering, leading participants towards a deeper understanding of differing viewpoints. The progressive role setting, starting from the viewpoint most similar to the original and gradually introducing more nuanced and diverse perspectives, guided participants from a basic understanding of their existing beliefs towards more critical thinking. However, users’ perceptions of the preset presentation order, as well as the effectiveness of this sequence for all users, remain unexplored. In addition, text-based dialogues alone may not fully capture the nuanced information present in human conversations [35, 55]. Future work could incorporate multi-modal interaction techniques, such as vocal emotions, micro-expressions, and body language, to detect users’ intent, attitude, and familiarity with the topic and viewpoints, and to promote deep thinking through more customized feedback.
6.2.2 Design Implication 4: Designing Assessment Tasks.
As discussed previously, incorporating game elements into the design could foster user engagement in exploring diverse perspectives. By presenting assessment tasks, such as multi-choice question sessions, along with gamification incentives, we could further promote deep thinking and create a synergy where the total impact is greater than the sum of its parts [12, 34, 36]. Our study demonstrated that some participants also frequently switched between conversations with AI agents and the assessment tasks, indicating a higher level of thinking and comprehension. However, the optimal balance between gamification incentives and the challenge of assessment tasks remains unclear. Some participants (P3) described the assessment tasks as “rigid like a quiz in high school class,” while others (P16) found the multi-choice questions to be “too easy without challenge.” Further research is necessary to develop adaptive assessments with feedback loops that align with users’ engagement and thinking levels, to facilitate continuous improvement.
6.3 Technical Barriers and How to Overcome Them
Although our work demonstrates the promising capabilities of LLMs in content generation and anthropomorphization, enriching user engagement and fostering deep thinking about diverse perspectives, some challenges and concerns also came to light. In our study, participants identified two technical barriers: inaccurate character representation and a lack of contextual depth. Some participants noted that the tone of some responses did not match the character’s personality. Additionally, some participants reported that some AI-generated content lacked necessary topic-specific details and elaboration, resulting in generic and shallow responses. These findings are in line with prior work that examined the performance and capabilities of LLMs in content generation and character emulation [57, 68].
These technical challenges need to be addressed to utilize these capabilities effectively and responsibly. We propose the following design implications for creating more inclusive and accurate experiences to navigate people out of their filter bubbles.
6.3.1 Design Implication 5: Improving Inaccurate or Biased Character Representation.
Large Language Models are trained on extensive data from the internet, which can lead them to reflect biases present in those datasets [11]. When asked to represent a character or perspective that is underrepresented in their training data, the outputs can be inaccurate [23, 57, 71]. For example, the GPT-generated characters in our study exhibited some gender biases: women played the roles of janitor and HR manager, while men were entrepreneurs and economics professors. These biases could be attributed to stereotypes the model inherited from the internet data used for training. To mitigate such issues in the future, it is crucial to employ more carefully curated data and fine-tuned models, adhere to ethical and responsible AI guidelines, and incorporate human oversight before deploying such systems in real-world applications. In addition, engaging in discussions on a topic with users typically requires LLMs to possess relevant background knowledge about the subject. For example, in our study some users asked the AI agents about retirement and pension policies in other countries. Although GPT accurately retrieved information on policies in Germany and Sweden, it sometimes incorrectly stated that Dutch residents had the flexibility to choose their pension age between 60 and 70, which we could not verify through online searches. Such unconfirmed information could lead to mistrust and potentially detrimental outcomes among users who rely on it. Thus, it is crucial to address issues such as hallucination [78] and factuality [66] before deploying this system in practical settings. While prior work is limited in directly addressing these issues, research has shown the potential of utilizing Reinforcement Learning from Human Feedback (RLHF) to fine-tune language models to better align with human intent [29, 52], and of prompt engineering techniques to generate content that follows factual information [63]. Future work can focus on collecting enriched information and building character-centered datasets to further fine-tune LLMs, generating more accurate and fair representations of characters.
6.3.2 Design Implication 6: Prompting with Interactive Design to Enhance Contextual Depth.
Due to the nature of generative language models, they do not possess the level of human cognition needed to understand the deep cultural, historical, or emotional contexts of characters and perspectives [4], which can result in outputs that lack contextual depth. However, there is evidence that LLMs have significant potential in few-shot learning and in-context learning [3, 17]. Changing just a few examples or prompts can help LLMs adjust their generated content, mitigating inaccuracies [43, 73]. Future work should explore interactive design techniques that make it easier for users to edit and iterate prompts or provide examples that align with the detailed contextual background, personality, and nuances of a particular character.
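As a concrete illustration of this few-shot idea, a persona’s voice can be steered by seeding the message history with a couple of in-character example exchanges before the real user turn. The sketch below shows the shape of such a message list; all example turns are invented for illustration.

```python
# Illustration of steering a persona's voice with in-context examples: the
# message history is seeded with invented in-character exchanges before the
# real user turn. All example text here is hypothetical.
few_shot_messages = [
    {"role": "system",
     "content": "Role-play Wang Yanli, a 48-year-old janitor worried about delayed retirement."},
    # In-context examples demonstrating the expected tone and background:
    {"role": "user", "content": "What do you think about working five more years?"},
    {"role": "assistant",
     "content": "My knees already ache after every shift; I honestly don't know if my body can hold out that long."},
    # The real user turn follows the seeded examples:
    {"role": "user", "content": "Have you considered changing careers before retirement?"},
]
# few_shot_messages can then be passed as the `messages` argument of a chat
# completion call, as in the earlier sketches.
```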
6.4 Limitations and Future Work
There are three primary limitations in this study: the system’s usefulness awaits further evaluation, the scope of the prepared topic was limited, and the laboratory study setting imposed constraints.
First, we aimed to explore how such a system should be designed (RQ1) and whether it could help users access and reflect on diverse information (RQ2). However, the extent to which encouraging users to access and contemplate diverse information contributes to effectively breaking the filter bubble is yet to be determined. Future research could compare participants’ opinions before and after using the system, or contrast experiences with and without it. Additionally, a controlled study comparing the utility of our system with other methods aimed at helping users overcome the filter bubble would also be helpful.
Second, the selected topic, the delayed retirement policy, may limit the generalizability of the results, as participants may lack interest or motivation to engage with the LLM-powered multi-agent characters for in-depth discussion on such a topic. Future work can explore a broader range of topics to investigate whether and how such an LLM-powered multi-agent system may help users burst filter bubbles in a variety of contexts.
Third, we conducted the evaluative study in a laboratory setting, where participants were required to complete tasks independently within a limited timeframe. Participants’ preferences and behaviors may differ if they interact with the system in a more flexible and extended setting. For example, one of the most intriguing open questions is whether users would voluntarily pause their online browsing to engage in 5-10 minute conversations with our multi-agent characters without explicit requests. Although such interactions may initially seem unnatural given our habitual experiences with social media, observations from our laboratory experiments suggest that users may feel motivated to interact with the system voluntarily when freely browsing social media. For instance, many (N=9) participants reported that the desire to win the game led them to jump back and forth between different agents more than required, in order to correctly answer questions and collect puzzle pieces. P16 deviated from the task by asking some agents to play a different role and answer the same question again. Notably, P3 and P16, on their own initiative, even proposed having the agents discuss among themselves and come back with new responses. However, we acknowledge that more systematic studies are warranted to investigate this open question. For example, future work should consider a longer-term field study to investigate how social norms, communication, and interactions among users may impact their information consumption and the system’s potential to promote diverse perspectives in real-world settings.
In future research and real-world applications, several aspects of our system can also be improved. First, the display of the entrance is determined by a predefined rule in our current system. For future research and practical implementation, the timing of the entrance display deserves more careful consideration. To display the entrance when users need it, future systems could assess the extent of the filter bubble, such as whether attitudes are one-sided or online voices are self-reinforcing, and then determine the timing of the entrance accordingly. Second, future research could consider users’ states during interaction when designing the system. On one hand, by monitoring user interaction behaviors, the agents’ responses could be dynamically adjusted; for instance, if semantic analysis detects a user becoming irritated with an AI agent, the agent could employ techniques to soothe and stabilize the user’s emotions. On the other hand, as continuous interaction [2] and the motivation to comprehend an agent’s behavior [81] may increase users’ tendency toward anthropomorphism and result in over-trust in the agents [20, 37, 65], interventions should be implemented upon detecting signs of over-trust or negligence. Indicators could include, but are not limited to, overly rapid responses, complete agreement in dialogue, or engaging predominantly with a single AI agent.
7 Conclusion
In today’s world, new technologies such as AI-powered search and recommendation systems are implicitly influencing the way people consume information. Unfortunately, this can result in people being trapped in isolated filter bubbles with narrowed perspectives and reinforced biases. Escaping these filter bubbles can be challenging, as it requires not only exposing users to diverse information but also motivating them to engage with that information, especially opposing viewpoints, through in-depth thinking. Our research aimed to understand how to design a system that leverages the power of Large Language Models to address the issue of filter bubbles, and whether and how such a system could help users broaden their perspectives. To achieve this, we conducted a participatory design workshop involving HCI and UX researchers, designers, and psychologists, all of whom were also users of online content platforms. Through this process, we identified three key design considerations with distinct interaction features that could guide users towards diverse perspectives. In light of these considerations, we designed and developed a prototype with LLM-powered multi-agent characters that users could interact with while reading social media content, and conducted an evaluative study with 18 participants. Based on our findings, we extracted six design implications and discussed a future work outlook for researchers and designers to consider when designing generative multi-agent systems to better assist people in bursting their filter bubbles.
C Examples of Assessment Tasks
Following are examples of the assessment tasks in the Viewpoints Puzzle, one for each AI agent. The tasks were generated and displayed in Chinese, and were subsequently translated.
Agent 1. What does Wu Xiaofei believe is the rationale behind the delayed retirement policy?
(A) (Wrong) It is for the sake of the country’s economic development.
(B) (Wrong) It aims to enhance the standard of living for its citizens.
(C) (Correct) It seeks to address the issues of a declining population and insufficient pension funds.
Agent 2. What are Wang Yanli’s primary concerns regarding the delayed retirement policy?
(A) (Wrong) She is worried about not having enough salary.
(B) (Correct) She fears her physical strength will not sustain her until the delayed retirement age.
(C) (Wrong) She is concerned about not having sufficient savings for retirement.
Agent 3. What kind of retirement policy does Zhang Xiaoning believe is more appropriate?
(A) (Wrong) A one-size-fits-all delayed retirement policy.
(B) (Wrong) Retirement at a uniform age as stipulated by the state.
(C) (Correct) A flexible retirement policy, where employees can choose their retirement time based on their physical condition and retirement preparations.
Agent 4. What kind of system does Li Zehan hope the government will introduce to help people adapt to delayed retirement?
(A) (Wrong) Increase pensions.
(B) (Correct) Provide psychological support and technical training.
(C) (Wrong) Reduce working hours.
Agent 5. What benefits does Professor Zhang Hua believe the delayed retirement policy will bring to societal development?
(A) (Wrong) Improve the quality of life for individuals.
(B) (Wrong) Increase government fiscal revenue.
(C) (Correct) Inject more vitality into social development.