5.2.1 Set-up of the Annotation Process.
Instructions to the annotators. Annotators are typically asked a binary question (sometimes with an additional "undecided" answer), potentially complemented with a rating [45, 56]. However, the psychology literature argues that rating comments on a valence scale is too vague for annotators, who prefer binary questions [254, 255]. Closer to psychology practice, which asks annotators to rate several propositions, Guberman et al. [120] investigate the perceived violence of tweets through an adapted version of the multi-proposition Buss-Perry Aggression Questionnaire (BPAQ). Using six annotators on Amazon Mechanical Turk and 14 gold questions (12 correct answers required), they still found 30% disagreement, which they partly attribute to the questionnaire not being adapted to tweet violence.
Out of the 74 papers using crowdsourcing, only 32% mention giving the annotators a definition of the concept to annotate, such as detailed offensiveness criteria or a hate speech definition. Gambäck et al. [106], through several crowdsourcing tests, provide a detailed question to the annotators. Not providing clear definitions is an issue, because the annotators might have different definitions of OCL in mind, leading to collected labels that are not suited to the goal of the application.
Data annotators. The annotation tasks are conducted on crowdsourcing platforms or with programs created by the authors of the publications. Certain papers show that the type of annotators employed influences the quality of the annotations. CrowdFlower (now Appen.com), expert annotators, and manually recruited annotators are used equally often (23.7% each), while university students (13.8%) and Amazon Mechanical Turk workers (15%) are used less. The expert category comprises the authors themselves, researchers from related fields, specialists in gender studies, "non-activist feminists" for sexism annotations, persons with a linguistics background, trained raters, educators working with middle-school children, and people with cyberbullying experience.
Annotation aggregation. Among the 50 papers for which the information is available (out of the 74 papers using crowdsourcing), 49 papers aggregate the annotations from multiple annotators into binary labels. Seventy-eight percent use majority voting, 10% filter out samples for which there is no full agreement between the annotators, and 8% create rules that define how to aggregate under different annotation scenarios (e.g., majority voting combined with the removal of the samples with the highest disagreement rates and of the samples for which the annotators agreed they are undecided [46]). One paper uses a weighted majority-vote scheme [130]. Only Wulczyn et al. [292] derive a percentage from the annotations.
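As a minimal sketch of the dominant scheme, majority voting with removal of ties and undecided samples (the function names and label strings below are illustrative, not taken from any surveyed paper):

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate one sample's annotations into a single label.

    Returns the most frequent label, or None when the vote is tied
    (tied samples can then be discarded or escalated).
    """
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no majority
    return counts[0][0]

def aggregate_dataset(annotations):
    """annotations: dict mapping sample id -> list of labels."""
    aggregated = {}
    for sample_id, labels in annotations.items():
        label = majority_vote(labels)
        if label is not None and label != "undecided":
            aggregated[sample_id] = label
    return aggregated

# Example with three annotators per sample.
raw = {
    "t1": ["offensive", "offensive", "not_offensive"],
    "t2": ["offensive", "not_offensive", "undecided"],  # no majority: dropped
}
print(aggregate_dataset(raw))  # {'t1': 'offensive'}
```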
Annotation quality control. 32.4% of the papers mention techniques to obtain high-quality labels. Within the annotation task, they investigate using precise definitions and clear questions to remove ambiguities [227]. After the task, annotations are aggregated to resolve disparities between annotators' opinions, and low-quality annotations or annotators are filtered out, with quality scores computed over the annotators' history, the time they take to answer each question, or their answers to gold questions [129].
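A minimal sketch of gold-question filtering, one of the quality signals mentioned above (the default threshold mirrors the 12-correct-out-of-14 gold questions of [120]; the function names are ours):

```python
def gold_accuracy(annotator_answers, gold_answers):
    """Fraction of gold questions the annotator answered correctly.

    annotator_answers, gold_answers: dicts mapping question id -> label.
    """
    correct = sum(
        1 for qid, truth in gold_answers.items()
        if annotator_answers.get(qid) == truth
    )
    return correct / len(gold_answers)

def keep_annotator(annotator_answers, gold_answers, min_accuracy=12 / 14):
    # Other signals (answer time, annotation history) could be combined here.
    return gold_accuracy(annotator_answers, gold_answers) >= min_accuracy
```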
Half of the tasks have 3 annotators per sample, 22% use 2, and 15% use 5; the rest employ 1, 4, 6, or 10. Using an odd number of annotators makes it possible to break ties through majority voting, while using 2 annotators is cheap and fast. Papers using more than 5 annotators per sample are rare, most probably because of the cost. One study shows that keeping only the cases of full agreement among amateur annotators produces relatively good annotations compared to those of expert annotators, and suggests employing experts only to break ties among the amateur annotators [283].
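The suggested escalation scheme [283] could be sketched as follows, with unanimous amateur labels kept and disagreements deferred to an expert (this two-stage workflow is our illustrative reading, not code from the paper):

```python
def label_with_escalation(amateur_labels, ask_expert):
    """Keep unanimous amateur labels; escalate any disagreement to an expert.

    amateur_labels: labels from crowd annotators for one sample.
    ask_expert: callable returning an expert's label for the same sample.
    """
    if len(set(amateur_labels)) == 1:   # full agreement among amateurs
        return amateur_labels[0]
    return ask_expert()                 # disagreement: defer to an expert
```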
Different metrics are employed to evaluate annotation quality by measuring the agreement between annotators (Fig. 8). Most papers use Cohen's Kappa for 2 annotators and Fleiss' Kappa for more; 22.9% of the papers mention "inter-annotator agreement" or "kappa" scores without further precision. Krippendorff's alpha and percentage agreement are less adopted, the latter resting on the possibly wrong assumption that the majority is correct [170]. In the publications, we notice a high proportion of low Cohen's Kappa and Fleiss' Kappa scores (under 0.6) for tasks with 3 or 5 annotators, which shows the difficulty of designing unambiguous tasks and hints at the subjectivity of the concepts to rate.
5.2.2 Biases in the Annotation Process.
The data annotation process introduces various types of biases through each of its design choices.
Identification of mismatches. Here, we take the hypothetical scenario of developing a dataset for aggression language. Certain definitions of aggression highlight the need to look at the context of a sentence, at the behavior of its author, and at the person judging this language to understand how the sentence would be perceived, e.g., aggression is "neither descriptive nor neutral. It deals much more with a judgmental attribute" [177]. Psychology has identified the variables that influence this judgment, mostly "cultural background" [44], the role of the judge (i.e., aggressor, target, observer, and so on), "norm deviation, intent, and injury," but also "the form and extent of injuries actually occurring" [161]. To obtain a controlled and realistic dataset and reduce ambiguity, this information about the annotators would need to be collected, the annotator's role (e.g., victim or observer) should be decided, and the context of the sentence (e.g., the harm it caused) displayed.
A similar example is the perceived offensiveness of group-based slurs, which depends on the perception of the status of the target group [126]. In this case, both the context and the observer are of importance, since the social status of a target group could be uncovered from context knowledge but can also depend on the perception of the observer.
These issues resonate with the historical biases discussed in the machine learning ethics literature [261]. In the dataset, there is a mismatch between the judgments of the annotators, the judgments of the actual targets of an OCL, and the judgments of external observers. Consequently, the dataset is not aligned with what the machine learning model is expected to learn.
Missing context information. The psychology literature has shown that, for many conflictual languages, the context of a sample influences its perception. Most crowdsourcing tasks, however, do not specify it, neither in the instructions nor within the sample presented to the annotator [53, 240]. Guberman et al. [120] put forward insufficient context, which leaves many aspects of the text open to interpretation, as a reason for disagreement in harassment annotations. Golbeck et al. [116], while not including any context in their corpus, acknowledge this limitation and develop precise annotation guidelines that aim at removing ambiguities stemming from the absence of context. Ross et al. [227] provide a definition of the OCL to annotate and find that the task remains ambiguous, suggesting that even for seemingly objective tasks, context information might be missing to provide an objective rating.
The type of context to include and its framing (e.g., a conversation, structured information about multiple characteristics) remain to be investigated to address ambiguities while controlling the cost of the annotations. Pavlopoulos et al. [199] have already shown that annotations with conversational context (the post and its parent comment, as well as the discussion title) significantly differ from annotations without it. Sap et al. [236] have explicitly primed annotators with dialect and race information to reduce racial biases in annotations (more samples written in African American English than in general American English are otherwise labeled as offensive). Creating datasets that tackle single specific contexts such as "hate speech against immigrants and women" is also a direction to investigate [28].
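One possible framing of such context-enriched annotation items, following the conversational context of [199] and the priming information of [236] (the field names are illustrative assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotationItem:
    """A sample shown to annotators together with optional context."""
    text: str                              # comment to label
    parent_comment: Optional[str] = None   # conversational context, as in [199]
    discussion_title: Optional[str] = None
    target_group: Optional[str] = None     # e.g., "immigrants and women" [28]
    dialect_info: Optional[str] = None     # priming information, as in [236]

item = AnnotationItem(
    text="you people never learn",
    parent_comment="Why was the event cancelled?",
    discussion_title="Local news discussion",
)
```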
Lack of annotator control and information. Psychology highlights that many OCLs are subjective. Linguistics also shows the diversity of interpretations of OCL across communities or within the same population [18]. For instance, a study shows that in Malta, participants typically identify homophobic comments as hate speech, but not necessarily xenophobic ones, and attributes this to the recent acceptance of the LGBTQ community in Maltese society, while "migrants are still very much left on the periphery." Similar studies in other regions of the world would probably lead to different conclusions, illustrating the importance of annotator background. Hence, choices in the crowdsourcing task design that impact the pool of annotators (their country of origin, language, expertise, educational background, and how they are filtered) implicitly integrate biases into a dataset.
Psychology identifies characteristics of an individual that impact their perception of a sentence relative to an OCL. Some of these characteristics are also observed in computer science papers, such as differences in annotations based on gender [120]. Communication studies also investigate the characteristics of an individual that impact their willingness to censor hate speech, and identify age (e.g., "older people are less willing to censor hate speech than younger people"), neuroticism, commitment to democratic principles, level of authoritarianism, level of religiosity, and gender [151]. Such factors could possibly also impact one's attitude toward annotating hate speech. While current design choices do not map to these characteristics, creating schemes to control, or at least measure, them is a valuable research direction. Certain crowdsourcing frameworks [27] are a first step towards this control. Verifying that the same characteristics apply in online and offline contexts is also important given previous contradictions, e.g., one computer science study observed that annotators of both genders usually agree on clear cases of misogyny and disagree on cases of general hate speech [290], contradicting findings in the psychology literature.
Additional properties of the annotators, not investigated in psychology, can bias the datasets. For instance, annotators from crowdsourcing platforms, who have no training on what hate speech is, are biased towards the hate label, contrary to expert annotators [283]. Research is hence also needed to assess annotators' level of education around OCL, to educate them, and to keep them engaged across annotation tasks.
Simplification of the annotations. The way the annotations are processed creates biases. Aggregating the annotations into single labels does not allow for subjectivity and skews datasets towards certain types of perceptions, generally the majority opinions [22]. This might raise issues of unfairness, i.e., the non-inclusion of certain opinions, and reinforce filter bubbles. For instance, Binns et al. [34] show that a toxicity detection algorithm performs better on annotations from male users than from female ones and is consequently unfair to women. This reflects aggregation biases [261]: a single dataset is collected to train a single machine learning model for a whole platform, whereas different populations need adapted treatment.
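A minimal way to surface such aggregation effects is to compare the global majority label with per-group majority labels (the grouping variable and label strings are illustrative):

```python
from collections import Counter, defaultdict

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def per_group_majorities(votes):
    """votes: list of (annotator_group, label) pairs for one sample."""
    by_group = defaultdict(list)
    for group, label in votes:
        by_group[group].append(label)
    return {group: majority(labels) for group, labels in by_group.items()}

# Toy sample where the global majority hides a systematic group disagreement.
votes = [("men", "not_toxic"), ("men", "not_toxic"), ("men", "not_toxic"),
         ("women", "toxic"), ("women", "toxic")]
print(majority(label for _, label in votes))  # 'not_toxic'
print(per_group_majorities(votes))            # {'men': 'not_toxic', 'women': 'toxic'}
```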
Subjectivity brings new challenges in measuring and obtaining "high-quality" annotations. Measures of quality are currently centered on agreement (the lower the disagreement, the higher the quality) and post-processing methods use the majority opinion, yet the majority is only one perception of a subjective OCL. Instead, methods should filter out annotations that are obviously incorrect (often due to spam) or erroneous for different individuals, while accounting for the existence of multiple relevant and disagreeing judgments. For that, work from the human computation community, such as CrowdTruth [17], which provides metrics for the quality of annotations and annotators without assuming the existence of a unique ground truth, could be investigated. More annotators might be needed, and schemes to infer relevant clusters of annotators could be investigated to trade off quality and cost. Mishra et al. [171] noted that in digital media, a small number of users frequently give their opinions, rating highly offensive posts positively, a form of bias towards the opinion of these few users. The researchers propose a semi-supervised method to identify these biased users and correct the ratings.
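As a rough illustration of a disagreement-aware quality signal in the spirit of CrowdTruth [17] (a simplified sketch of ours, not the actual CrowdTruth metrics or API), each annotator can be scored by the leave-one-out support the other annotators give to their labels, so that spammers stand out without declaring the majority to be the ground truth:

```python
from collections import Counter

def annotator_quality(annotations, annotator):
    """Average leave-one-out support for one annotator's labels.

    annotations: dict sample id -> dict annotator id -> label.
    Returns a score in [0, 1]; very low scores flag likely spam or
    systematic outliers, while genuine minority opinions keep partial support.
    """
    scores = []
    for labels in annotations.values():
        if annotator not in labels or len(labels) < 2:
            continue
        others = Counter(l for a, l in labels.items() if a != annotator)
        scores.append(others[labels[annotator]] / sum(others.values()))
    return sum(scores) / len(scores) if scores else 0.0

data = {
    "t1": {"a1": "toxic", "a2": "toxic", "a3": "not_toxic"},
    "t2": {"a1": "not_toxic", "a2": "not_toxic", "a3": "toxic"},
}
print(annotator_quality(data, "a3"))  # 0.0: a3 always disagrees with the others
```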
Leveraging psychology and human computation methods. Research from other fields could be adapted to improve OCL annotation pipelines, as recommendations from the crowdsourcing literature or from psychology are currently not necessarily followed. Only 32% of papers mention methods to ensure a level of quality (e.g., gold questions, annotator quality scores, precise definitions of the terms) and few papers employ more than five annotators per sample, whereas the crowdsourcing literature encourages it. Taking inspiration from psychology and judgment collection methods is also a promising direction. Psychology studies use multiple questions with scales, whose answers are aggregated to capture the perception of each person (e.g., 10, 6, or 3 propositions on [1;9], [1;6], or [1;12] scales [37, 68, 186]). To measure offensiveness, participants rate images visualizing a scenario along how comfortable, acceptable, offensive, hurtful, and annoying they are on a 7-point Likert scale [289]. Cunningham et al. [70] show participants scenarios with four situations each, from which they select the most offensive one. An example scenario and situation are, respectively, attending a men's basketball game and "A Caucasian, female said: 'Of course we lost. We played like a bunch of girls.'" While these studies are not specific to OCLs, the general method could be used, and the specific questions investigated. The challenge of asking such questions while keeping the cost low would then become important.
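A sketch of how such multi-item ratings could be aggregated into a single perception score per participant (the item names and 7-point scale follow the style of [289]; the reverse-coding choice is our assumption):

```python
def perception_score(ratings, reverse_coded=("comfortable", "acceptable"),
                     scale_max=7):
    """Average a multi-item Likert questionnaire into one offensiveness score.

    ratings: dict item name -> rating on a 1..scale_max Likert scale.
    Items where a high rating means *less* offensive are reverse-coded.
    """
    adjusted = [
        (scale_max + 1 - value) if item in reverse_coded else value
        for item, value in ratings.items()
    ]
    return sum(adjusted) / len(adjusted)

answers = {"comfortable": 2, "acceptable": 1, "offensive": 6,
           "hurtful": 5, "annoying": 6}
print(perception_score(answers))  # 6.0 on the 1..7 scale
```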