In this section, we present the results of our analysis. First, we present descriptive statistics about the four cases, focusing on patterns of user participation and engagement. Second, we present our analysis of the top contributors in the four cases, focusing on top content producers (in terms of number of tweets contributed) and top content broadcasters (in terms of spreading the audit among their followers). Finally, we present our analysis of the five primary roles users’ tweets played in these audits, describing subdivisions of labor within each role as well as how these roles varied across the four user-driven auditing cases.
5.1 Patterns of Participation and Engagement in the Auditing Cases (RQ1)
Table 1 presents descriptive statistics of user participation and engagement in each of the four user-driven audit cases. In summary, we found: (1) three out of our four cases (Portrait AI being the exception) shared a similar pattern of user participation, where there was a large initial burst of tweeting activity, followed by a long tail of low activity; (2) the vast majority of participants in all four cases contributed just a single tweet.
Of the four cases, the Twitter Cropping case had the most total tweets contributed by users (11,323 tweets), followed by ImageNet Roulette (2,813), Apple Card (2,408), and Portrait AI (1,440). The day with the most tweets posted by users, i.e. the “peak” participation day, came within days of the first tweet for the Twitter Cropping case, within a week for Apple Card, around 6 months after the first tweet for ImageNet Roulette, and over a year after the first tweet for Portrait AI. Note that for ImageNet Roulette, the first tweet we found about it was from about 6 months before its public launch date, when it was still a physical exhibit. As such, the peak for ImageNet Roulette actually corresponds to its public launch date.
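To illustrate how such per-day participation counts and the peak day could be derived for a single case, the following is a minimal Python/pandas sketch; the file name and the created_at column are illustrative assumptions about a per-case tweet export, not our exact pipeline.

```python
import pandas as pd

# Minimal sketch (illustrative file/column names): compute daily tweet volume
# for one case and locate the "peak" participation day.
tweets = pd.read_csv("twitter_cropping_tweets.csv", parse_dates=["created_at"])

daily_counts = (
    tweets.set_index("created_at")
          .resample("D")   # one bin per calendar day
          .size()
)

peak_day = daily_counts.idxmax()          # day with the most tweets
first_day = daily_counts.index.min()      # day of the first tweet in the case
print(f"Peak: {daily_counts.max()} tweets on {peak_day.date()}, "
      f"{(peak_day - first_day).days} days after the first tweet")
```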
Next, we observed that three of our four cases (Twitter Image Cropping, ImageNet Roulette, and Apple Card) had a similar pattern of user participation, where there was a large initial burst of tweets followed by a long tail of low user activity (see Figure 1). One possible explanation for this pattern is that these audits engendered a great deal of attention from celebrities, public figures, and other influencers, which led to interventions by institutions or technology platform operators who then took action to address the concerns users raised. For instance, in the Twitter Image Cropping case, the peak of user participation was just 3 days after a user posted the initial tweet pointing out the problem with the algorithm. The same day this first tweet was posted, Twitter apologized and outlined steps it would start taking to address the algorithm’s racial bias problem [41]. Similarly, for Apple Card, the New York Department of Financial Services announced on November 9, 2019 that it would start investigating Goldman Sachs’ creditworthiness algorithm, just a few days after the original tweet was posted by a user alleging that the algorithm was biased [69].
In contrast, the Portrait AI case showed sporadic conversation and user tweeting activity throughout the observed time frame. Compared to the other user-driven audits, only a few users have raised concerns about whether its algorithm is biased since the software’s inception. Furthermore, the platform operator has not yet directly addressed the hypothesized bias issue, nor has any other authority intervened. We also note that for the Portrait AI case, there have been very few news articles about potential bias (unlike the other three cases), nor have any celebrities discussed it (unlike the Apple Card case).
Last, we found the vast majority of participants in all four cases contributed just a single tweet. While all cases had many participants, very few of them had significant participation in terms of number of tweets. Furthermore, while any one of these single tweets likely had little impact, in aggregate we believe they increased the profile of these cases and drew the attention of journalists, which ultimately led to changes in the Twitter Image Cropping, ImageNet Roulette, and Apple Card cases.
5.2 Top Contributors: Who They Are and What They Do (RQ2)
In this section, we present the results of our analysis of the top contributors for each case, focusing on “content producers” and “content broadcasters”. As described in Section 4, content producers produced the most tweets in a case (using this as a proxy for contribution towards the user-driven audit), and content broadcasters had the highest engagement metrics of likes and retweets (using these as a proxy for spreading awareness of the case). We found that, among the tweeters who actively contributed to these audits, there was a clear division between “top content producers” (people who produced and posted the largest number of tweets) and “top broadcasters” (people whose tweets received a large number of likes and retweets). In other words, although top content producers contributed more tweets, their content garnered less engagement than that of top broadcasters, suggesting user-driven audits may require both types of users to be successful.
For example, in the ImageNet Roulette, Portrait AI, and Apple Card cases, only one account per case was both a “top 10 producer” and a “top 10 broadcaster”. Furthermore, for the ImageNet Roulette and Apple Card cases, the overlapping accounts also happened to be the initiators of those cases: Kate Crawford of the ImageNet Roulette case was ranked as the second-highest content producer and content broadcaster, while David Heinemeier Hansson of the Apple Card case was ranked as the second-highest content producer and the highest-ranked content broadcaster.
We also observed a difference in the number of influencers among top content producers and top content broadcasters. We defined “influencers” as users whose follower count was greater than 1,000 at the time of our analysis (note that this might not be the same as at the time the user participated in the case). Top content broadcasters included more influencers than top content producers. Because content broadcasters are characterized by the number of likes and retweets their tweets received, it comes as no surprise that users with large followings on Twitter would become top content broadcasters. This could indicate that having the attention and contribution of influencers can help spread a user-driven audit.
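As a concrete illustration of these two contributor metrics, the sketch below shows one way the per-case top-10 rankings and the 1,000-follower influencer threshold could be computed; the file and column names (user, like_count, retweet_count, follower_count) are illustrative assumptions rather than our actual analysis code.

```python
import pandas as pd

# Sketch (illustrative file/column names): rank "top content producers"
# (most tweets) and "top content broadcasters" (most likes + retweets).
tweets = pd.read_csv("apple_card_tweets.csv")
tweets["engagement"] = tweets["like_count"] + tweets["retweet_count"]

per_user = tweets.groupby("user").agg(
    n_tweets=("engagement", "size"),      # tweets posted: proxy for contribution
    engagement=("engagement", "sum"),     # likes + retweets: proxy for reach
    followers=("follower_count", "max"),  # follower count at analysis time
)

top_producers = per_user.nlargest(10, "n_tweets")
top_broadcasters = per_user.nlargest(10, "engagement")
overlap = top_producers.index.intersection(top_broadcasters.index)

# "Influencers": accounts with a follower count greater than 1,000.
n_influencers = (top_broadcasters["followers"] > 1000).sum()
print(f"In both top-10 lists: {list(overlap)}; "
      f"influencers among top broadcasters: {n_influencers}")
```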
5.3 The Division of Labor in User-Generated Content (RQ3)
Our analysis of tweets identified five roles that tweets played in the auditing cases: hypothesizing, evidence collection, amplification, contextualization, and escalation. Note that we treated all tweets associated with a case as a form of ‘harm identification’, since these audits were raising awareness of algorithmic bias (please see Section 4.2 for details on how we determined the relevance of tweets).
Table 3 shows the description and prevalence of each role in the four audit cases. All of the roles existed across all of the auditing cases; however, their prevalence varied from case to case. For example, the majority of tweets in the Twitter Cropping case played the role of escalation, providing emotional reactions to what had been observed within the audit, while escalation tweets were almost negligible in the Apple Card case. We discuss these patterns in detail when describing each role below.
To understand the distribution of the tweet roles across the lifetime of each case, Figure 1 uses stacked bar charts to display what percentage of tweet activity on a given day played each role. This shows that the participants and roles are intertwined, corroborating previous work that user-driven algorithm auditing is not necessarily a linear/staged process [65]. For example, we observed that for the Twitter Cropping, ImageNet Roulette, Portrait AI, and Apple Card cases, users’ tweets played each of the roles over time, even when one role dominated. For these cases, therefore, our data does not suggest that users played different roles at different times, or that they played certain roles early in a case but not later.
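As a rough sketch of how the daily role proportions behind such stacked bar charts could be computed and plotted, assuming each tweet row carries a timestamp and our coded role label (the file and column names below are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sketch (illustrative file/column names): percentage of each day's tweets
# playing each of the five roles, rendered as a stacked bar chart.
tweets = pd.read_csv("imagenet_roulette_tweets.csv", parse_dates=["created_at"])

daily_roles = (
    tweets.groupby([tweets["created_at"].dt.date, "role"])
          .size()
          .unstack(fill_value=0)   # rows: days, columns: the five roles
)
daily_share = daily_roles.div(daily_roles.sum(axis=1), axis=0) * 100

daily_share.plot(kind="bar", stacked=True, width=1.0, figsize=(10, 4))
plt.ylabel("% of day's tweets")
plt.xlabel("Day")
plt.legend(title="Role", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
```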
5.3.1 Hypothesizing.
The first category of tweets was hypothesizing, in which users proposed different informal theories they had developed about algorithmic systems [22, 25] to explain how the algorithms operated to produce the harmful results observed [34]. In general, we observed two kinds of hypothesizing tweets: theoretical hypothesizing and experimental hypothesizing. Across the four cases, “hypothesizing” tweets made up one of the smallest portions of the respective data sets (refer to Table 3).
Theoretical Hypothesizing: In theoretical hypothesizing tweets, users stated what they believed caused the algorithm’s bias(es). Users’ beliefs about the sources of the bias were seemingly informed by their occupation or work experience, knowledge, and awareness of how algorithms generally make predictions or decisions. For example, one participant believed data set homogeneity explained why ImageNet Roulette produced seemingly biased image classifications: “Should this be a surprise? No. Most #AI is based on statistical merging of data from many people. Just like a corporation. We need approaches to AI that reflect the diversity of people, not burying it in big data. [URL redacted]” (T14034).
We also observed users offering hypotheses to counter others’ beliefs. For example, another user countered the data set bias hypothesis above, instead proposing that the bias users observed was best explained by a lack of understanding that ImageNet Roulette was designed as an art exhibit to showcase how algorithms can be biased: “Thought-provoking essay on the political issues of image recognition research: its not just about better datasets. (ImageNetRoulette was over-simplified/-sold in the media; I think it’s a better project when you understand that how it is exaggerating a real problem for effect.) [URL redacted]” (T13876). Some also expressed the belief that algorithmic bias stems from the biased people behind the algorithm, including developers and researchers, not the algorithm itself: “The only ML algorithms with race and gender biases are those concocted by "ethical AI" researchers.” (T17807).
Experimental Hypothesizing: Experimental hypothesizing tweets were characterized by users offering testing strategies or actions that others could employ to gather evidence of whether an algorithm was biased. In other words, these tweets suggested further courses of action users could adopt to support a hypothesis. For example, in response to one user’s tests, another participant suggested additional tests that could also be used to evaluate Twitter’s auto-cropping algorithm: “@bascule Now try Obama in the first picture on top and Obama in the last picture at the bottom. Maybe it’s just how twitter organizes pictures first top last bottom.” (T2883).
One aspect of these hypotheses we found was that users’ proposed theories influenced each other: the theories would include and/or remix aspects of others’ beliefs about how the algorithm was working. For example, in the Twitter Image Cropping case, while some users hypothesized that skin color or race was the reason an image was cropped out, others proposed checking for other facial features by switching these features in images and observing the results: “@XXXXX A better experiment would be to artificially darken Mitch’s face and whiten Obama’s face and have a black Mitch vs white Mitch or black Obama vs white Obama test. As others pointed out it could be other facial features that affect the algorithm selection, like smile or glasses.” (T2212).
5.3.2 Evidence Collection.
The second category of content was evidence collection tweets, in which users shared evidence they had collected to document a software algorithm’s bias(es). Their evidence typically supported hypotheses proposed by other users, though it occasionally invalidated one of these hypotheses. In other words, the evidence users collected provided digital documentation for the claims made by other participants. It is also possible that evidence collection tweets laid the groundwork needed to garner the attention of authority figures, such as in the Apple Card case when the New York Department of Financial Services responded on Twitter¹⁴.
Across all four cases, evidence collection tweets represented a substantial portion of the respective data sets, e.g. making up 52.44% of the ImageNet Roulette case and 72.5% of the Portrait AI case. The ImageNet Roulette case featured a top-down user-driven audit, where the biased behavior had already been publicly established, which may have motivated participants to test and see the biased behavior for themselves, possibly resulting in a larger volume of “evidence collection” tweets. The Portrait AI case is interesting here because many people shared the results of their portraits without explicitly trying to audit the system.
For the Twitter Image Cropping, ImageNet Roulette, and Portrait AI cases, we found that the user-gathered data were typically altered images: users uploaded an original image to a software app whose algorithm then transformed the image in some way, and then shared the result on Twitter. For example, in T15, a user described A/B testing images to discern whether the alleged algorithmic bias in the Twitter Image Cropping case was observable: “OK, so I am conducting a systematic experiment to see if the cropping bias is real. I am programmatically tweeting (using tweepy) a (3 x 1) image grid consisting of a self identified Black-Male + blank image + self identified White-Male (h/t @XXXXX @XXXXX)” (T15).
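To make concrete what such a programmatic test might involve, the following sketch composes the kind of vertical image grid this user describes (two portraits separated by blank space) and posts it with Tweepy so the cropping behavior can be observed in the resulting preview. The image files, credentials, and the make_grid helper are placeholders for illustration; this is not the user’s actual code.

```python
import tweepy
from PIL import Image

def make_grid(top_path, bottom_path, out_path="grid.png", gap=800):
    """Stack two portraits vertically with a large blank gap between them."""
    top, bottom = Image.open(top_path), Image.open(bottom_path)
    width = max(top.width, bottom.width)
    grid = Image.new("RGB", (width, top.height + gap + bottom.height), "white")
    grid.paste(top, (0, 0))                    # first portrait at the top
    grid.paste(bottom, (0, top.height + gap))  # second portrait at the bottom
    grid.save(out_path)
    return out_path

# Placeholder credentials and image paths.
auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET",
                                "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

grid_path = make_grid("person_a.jpg", "person_b.jpg")
media = api.media_upload(grid_path)            # upload the composite image
api.update_status(status="Image cropping test", media_ids=[media.media_id])
# The preview crop Twitter generates for this tweet shows which face the
# cropping algorithm favored.
```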
In the Apple Card case, though, the user-generated evidence consisted largely of text content, e.g. tweet text describing users’ testimonies of applying for an Apple Card. These tweets noted observations of disparate decisions that Goldman Sachs’ algorithm had made about their or others’ credit approvals or limits, mainly alleging that the algorithm was gender biased: “Fascinating. @Apple still has bias. My wife and I each applied separately for the Apple Card. Identical income and credit scores (hers is higher). I got a 25% higher limit. @XXXXX” (T15654). Another user (T17118) also accused Apple and Goldman Sachs’ algorithm of racial bias, saying the algorithm determined they had a lower creditworthiness rating than other credit bureaus had: “It’s racially biased as well. Stated my FICO was 100 points LESS than the other major reporting bureau #AppleCard Apple Card faces scrutiny following allegations of gender bias [URL Redacted]” (T17118).
Types of Media Data Collected and User Comparisons of the Data: Across the three cases other than Apple Card, we found users typically tested and shared images either of others or of themselves. For the Twitter Image Cropping case, users commonly posted images of celebrities, public figures, or politicians when testing the platform’s image cropping algorithm. In comparison, users engaged in evidence collection for ImageNet Roulette and Portrait AI often tested and shared images of themselves through these systems’ algorithms. We observed that this meant users collecting data for the Twitter Image Cropping case were producing and sharing images that were potentially more comparable. For example, many users tested and shared their results for cropped images of President Barack Obama and U.S. Senator Mitch McConnell. In comparison, ImageNet Roulette and Portrait AI users usually tested and shared classified images of themselves, typically a portrait of their face.
Once data collectors posted images to Twitter, we found users often compared and discussed their observations and beliefs about the pictures, sometimes proposing new hypotheses to explain why a software’s algorithm classified or cropped the images in the given manner.
5.3.3 Amplification.
The third category of tweets was amplification, in which users broadcast and amplified relevant information related to the audit. These tweets either included information and resources themselves, such as findings from other users, or pointed newcomers towards relevant resources. When users post amplification tweets, they expose their followers to the ongoing audit, which increases the visibility of the case and potentially attracts others to take part in the audit themselves. Amplification tweets took up a large portion of the ImageNet Roulette (30.61%), Portrait AI (17.57%), and especially Apple Card (85.05%) cases. The Apple Card case received a lot of media coverage from beginning to end, meaning there were many articles that could potentially be shared; this could explain why the vast majority of the Apple Card tweets were coded as amplification. ImageNet Roulette also garnered some media attention, potentially due to the influence of Kate Crawford and Trevor Paglen, the creators of the project. The ImageNet Roulette case also had a large portion of evidence collection tweets, which could have prompted other Twitter users to try the tool themselves.
In each case, we believe there was a distinct demand for some users to explain and provide information about an audit to others. As seen with the amplification tweets across cases, this role is distinct from sharing the data and evidence users collected, as users collecting data did not always connect the evidence to the broader narrative of the audit or to related societal issues of bias. From our observations, we believe the amplification tweets in these cases suggest users want this type of information too. In turn, we speculate amplification tweets could trigger users playing other roles to produce more content, for example by collecting additional data. In other words, users posting amplification tweets could encourage other users to adopt different roles in these audits.
Ultimately, we observed that amplification tweets made up the smallest portion of the Twitter Cropping case, despite that case having the largest total volume of tweets. Though there may be many reasons for this observation, one possible explanation is that the escalation tweets (described below) within the Twitter Cropping case, which make up 69.21% of that data set, may have generated the same effect that amplification tweets aim to accomplish. Below, we further identify three types of amplification tweets: inviting newcomers, linking news articles, and providing information and resources.
(1) Some users tag specific people, manually adding them to the discussion and pointing a potential new participant towards information about the audit. For example, in T3116, the user tags two other Twitter users to look at a thread that curates examples of Twitter’s auto-cropping algorithm: “@XXXXX @XXXXX @XXXXX must see this!” (T3116).
(2) Some users link news articles covering the progress of the audit. By doing so, they amplify the reach and exposure of relevant news articles. For example, in response to the events of the Twitter Image Cropping case, Twitter released a public apology for its algorithm, and the newspaper The Guardian [41] covered it in an article that was tweeted out multiple times within the audit:
“Twitter apologises for ’racist’ image-cropping algorithm [URL redacted]” (T382).
Similarly, the ImageNet Roulette case had its own fair share of public press, such as in Frieze Magazine [32]:
“I wrote for @frieze_magazine about @katecrawford and @trevorpaglen’s #imagenetroulette [URL redacted]” (T13194).
And in April of 2021, when Apple Card released a new feature in response [53] to the accusations of gender bias that had spread on Twitter in late 2019, users incorporated this update into the audit effort:
“Apple Card’s new feature aims to address the gender bias that’s all too common in the credit industry.[URL redacted]” (T15657).
(3) Some users include information and resources when answering others’ questions. By doing so, these users point people who have expressed interest in the events surrounding the audit towards relevant information and resources to help them get involved, if they so choose. For example, a Twitter user asks “@XXXXX What is this site I want to do it” (T12044), and in response, the participant introduced ImageNet Roulette: “@XXXXX Imagenet roulette” (T12048).
5.3.4 Contextualization.
The fourth category of tweets was contextualization, in which users place the ongoing audit into a larger social, technical, or cultural context, broadening the scope of what they are observing. They do so by providing general social, technical, and cultural information that may help others better understand the underlying algorithms, the problematic machine behaviors, and why and how those behaviors might harm particular social groups. By doing so, these tweets help inform and direct other users who might not have the social background, technical expertise, or cultural knowledge necessary to effectively understand and participate in the audit.
Users who post contextualization tweets might criticize the user-driven audit itself, placing the efforts of other participants within a larger social and cultural context. For example, in the below tweet, a participant points out that other participants were creating and testing hypotheses despite a lack of technical knowledge, and attributes this behavior to human nature: “@XXXXX Everyone in this thread is hopelessly probing an algorithm they don’t understand, and coming up with imperfect theories based on experiment. Dope, thats what humans do. interesting scroll.” (T3212).
Similarly, the following user casts doubt on the claims of bias in the Apple Card algorithm, stating that the audit participants were jumping to conclusions based on little evidence: “@XXXXX This is a great example of someone immediately playing the “censorship! Bias!” card based on very limited evidence...when (if you think about it) Apple would have no reason to do this. Anyone can find DC any # of other ways...all this would do is make people mad at Apple” (T15723).
Contextualization tweets may also include discussion of the larger landscape of algorithmic bias, which provides technical context for the audit. As seen below, a participant comments on the relevance of bias in artificial intelligence and where ImageNet Roulette stands within it: “@XXXXX Bias in AI is a very interesting and relevant topic that will continue to gain importance as we move forward. Projects like #ImageNetRoulette will hopefully raise awareness around this issue and incite change. #BCSTT” (T11357).
In a similar fashion, the following participant within the Portrait AI user-driven audit comments on the possible technical and cultural influences that may have affected the development of the Portrait AI algorithm: “@XXXXX @XXXXX @XXXXX Sadly there not that much classic portraits of PoC. Thats sucks but i don’t think that this is deva fault, they’re probably took giant collection of portraits of classic artists and just trained AI with it” (T15566).
Contextualization tweets made up the second largest portion of the Apple Card case (10.63%), but did not make up a large portion of any of the other cases. Even so, they were not the smallest portion of tweets for any of the four audit cases. These observations could indicate that creating this kind of content is not popular, or that only a small percentage of participants are willing or able to create tweets playing this role. Since “contextualization” tweets require some level of expertise, it may be that these kinds of tweets were created by a small portion of the participants.
5.3.5 Escalation.
The final category of tweets was escalation. We found that a large portion of tweets were reactions either to what users had observed in their own usage of the algorithm or to what was being presented by others. These tweets are emotional and reactive, expressing what the users were feeling about the observed algorithmic behaviors or the ongoing audit. Although escalation tweets might not contribute direct evidence to the ongoing audit or help form hypotheses, they might nevertheless have increased the visibility of the audit on an algorithmically mediated platform like Twitter, in turn increasing traffic and drawing more users towards the user-driven audit. By doing so, these tweets also help build a counter-public sphere [24] among the auditors, as individual participants may be able to relate emotionally to one another around a common topic and/or opinion, which can create a feeling of community around the audit. These tweets could also grow the number of users who engage with those tweeting about the audits, as reactive user-generated content is known to do [14].
The composition of escalation tweets was similar across all four cases. Escalation tweets were usually short, used acronyms, expletives, and slang, included emojis, and often incorporated humor. Potentially, these elements helped escalate the emotional and subjective tenor of the audit. For example, in the below tweet, the user makes another aware of what the Twitter Image Cropping algorithm did to their post, expressing their frustration with capital letters and expletives: “@XXXXX @XXXXX TWITTER CROPPING FUCKED U OVER” (T7450).
In a similar fashion, a participant in the ImageNet Roulette case shares what they thought of how ImageNet Roulette classified their images: “that imagenet roulette shit got me fucked up yo” (T12894). As another example, a participant in the Apple Card case expressed their frustration at what they observed from the Apple Card algorithm: “@XXXXX @dhh @AppleCard it’s really fucked this shit goes on behind the scenes. like why tho.” (T17973). In contrast, other users communicated their opinions using humor: “Bruh why use PortraitAI when you can just paint yourself” (T14276).
Escalation tweets were the largest portion of the Twitter Cropping case (69.21%), but made up relatively small portions of the other three cases, though it is not entirely clear why this difference exists. Also, in some cases, escalation tweets can serve as amplification, expressing negative reactions to an algorithm while increasing the visibility of the case.