Abstract
In this work, an algorithm devoted to the detection of low-quality annotations is proposed. It is mainly focused on subjective annotation tasks carried out by means of crowdsourcing platforms. In this kind of task, where a correct response is not necessarily predefined, several measures should be considered in order to capture the different annotator behaviours associated with poor-quality results: response time, inter-annotator agreement and repeated patterns in the responses. The proposed algorithm considers all these measures and provides a set of workers whose annotations should be removed. The experiments carried out over a sarcasm annotation task show that, once the low-quality annotations were removed and acquired again, a better labeled set was achieved.
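As an illustration of how these three signals could be combined into a single filter, the sketch below flags workers by their median response time, their agreement with a per-item majority-vote label (used here as a simple stand-in for inter-annotator agreement), and the fraction of identical responses they give. The thresholds, the field names (worker_id, item_id, label, seconds) and the majority-vote proxy are assumptions made for the example, not the exact procedure described in the paper.

```python
# Illustrative sketch only: flags workers whose annotations look unreliable by
# combining the three signals mentioned in the abstract: response time,
# agreement with other annotators, and repeated response patterns.
# Thresholds, the input schema and the majority-vote agreement proxy are assumptions.
from collections import Counter, defaultdict
from statistics import median


def flag_low_quality_workers(annotations,
                             min_median_seconds=2.0,
                             min_agreement=0.4,
                             max_repeated_fraction=0.9):
    """annotations: list of dicts with the (hypothetical) keys
    'worker_id', 'item_id', 'label', 'seconds'."""
    by_worker = defaultdict(list)
    by_item = defaultdict(list)
    for a in annotations:
        by_worker[a['worker_id']].append(a)
        by_item[a['item_id']].append(a['label'])

    # Majority label per item, used here as a simple agreement proxy.
    majority = {item: Counter(labels).most_common(1)[0][0]
                for item, labels in by_item.items()}

    flagged = set()
    for worker, anns in by_worker.items():
        med_time = median(a['seconds'] for a in anns)
        agreement = sum(a['label'] == majority[a['item_id']]
                        for a in anns) / len(anns)
        top_label_fraction = (Counter(a['label'] for a in anns)
                              .most_common(1)[0][1] / len(anns))

        too_fast = med_time < min_median_seconds            # suspiciously quick answers
        low_agreement = agreement < min_agreement           # disagrees with most annotators
        repeated_pattern = top_label_fraction > max_repeated_fraction  # e.g. always the same label

        if too_fast or low_agreement or repeated_pattern:
            flagged.add(worker)
    return flagged
```

For example, a worker who labels every utterance as "not sarcastic" in under a second would be caught by both the time and repeated-pattern checks, even if that label happens to agree with the majority on many items.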
Notes
- 4. Available for the scientific community under specific constraints. http://cz.efaber.net.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Justo, R., Torres, M.I., Alcaide, J.M. (2017). Measuring the Quality of Annotations for a Subjective Crowdsourcing Task. In: Alexandre, L., Salvador Sánchez, J., Rodrigues, J. (eds) Pattern Recognition and Image Analysis. IbPRIA 2017. Lecture Notes in Computer Science, vol 10255. Springer, Cham. https://doi.org/10.1007/978-3-319-58838-4_7
DOI: https://doi.org/10.1007/978-3-319-58838-4_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58837-7
Online ISBN: 978-3-319-58838-4