Measuring the Quality of Annotations for a Subjective Crowdsourcing Task

Raquel Justo,
M. Inés Torres &
José M. Alcaide

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10255)

Included in the following conference series:

Iberian Conference on Pattern Recognition and Image Analysis

Abstract

This work proposes an algorithm for detecting low-quality annotations. It focuses on subjective annotation tasks carried out on crowdsourcing platforms. In this kind of task, where there is not necessarily a single predefined correct response, several measures should be considered in order to capture the different annotator behaviours associated with poor-quality results: time, inter-annotator agreement and repeated patterns in responses. The proposed algorithm considers all these measures and provides a set of workers whose annotations should be removed. Experiments on a sarcasm annotation task show that, once the low-quality annotations were removed and acquired again, a better labeled set was obtained.
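The abstract names three signals (time, inter-annotator agreement and repeated patterns in responses) but does not give the algorithm itself. The following is only a minimal sketch of how such signals might be combined, not the authors' method. The function name `flag_workers`, the input layout, all three thresholds, and the use of majority-vote agreement as a stand-in for an agreement coefficient are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from statistics import median


def flag_workers(annotations, min_time_ratio=0.3, min_agreement=0.6,
                 max_same_label=0.95):
    """Return {worker: [reasons]} for workers that look unreliable.

    `annotations` is a list of (worker, item, label, seconds) tuples.
    All thresholds are hypothetical defaults, not values from the paper.
    """
    by_item = defaultdict(list)
    by_worker = defaultdict(list)
    for worker, item, label, seconds in annotations:
        by_item[item].append(label)
        by_worker[worker].append((item, label, seconds))

    # Crude agreement reference: the most frequent label per item
    # (the worker's own vote is included, which is a simplification).
    majority = {item: Counter(labels).most_common(1)[0][0]
                for item, labels in by_item.items()}
    global_median = median(s for _, _, _, s in annotations)

    flagged = {}
    for worker, rows in by_worker.items():
        reasons = []
        # Signal 1: the worker answers much faster than the crowd.
        if median(s for _, _, s in rows) < min_time_ratio * global_median:
            reasons.append("time")
        # Signal 2: the worker rarely matches the majority label.
        agreement = sum(lab == majority[it] for it, lab, _ in rows) / len(rows)
        if agreement < min_agreement:
            reasons.append("agreement")
        # Signal 3: the worker gives almost always the same response.
        counts = Counter(lab for _, lab, _ in rows)
        if counts.most_common(1)[0][1] / len(rows) > max_same_label:
            reasons.append("pattern")
        if reasons:
            flagged[worker] = reasons
    return flagged


# Toy data: workers a, b, c take their time; d answers "s" to everything in 2 s.
labels = {"a": "snsnns", "b": "nnsnns", "c": "snnnns", "d": "ssssss"}
seconds = {"a": 20, "b": 25, "c": 18, "d": 2}
sample = [(w, i, lab, seconds[w])
          for w, labs in labels.items() for i, lab in enumerate(labs)]
flagged = flag_workers(sample)
print(flagged)
```

In a subjective task such as sarcasm annotation, disagreement with the majority is not by itself evidence of bad work, which is presumably why the abstract argues for considering several measures together rather than any single one.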

Notes

1. www.mturk.com.
2. www.crowdflower.com.
3. www.meneame.net.
4. Available for the scientific community under specific constraints. http://cz.efaber.net.


Author information

Authors and Affiliations

Universidad del País Vasco UPV/EHU, Sarriena s/n, 48940, Leioa, Spain
Raquel Justo, M. Inés Torres & José M. Alcaide

Corresponding author

Correspondence to Raquel Justo.

Editor information

Editors and Affiliations

Universidade da Beira Interior, Covilhã, Portugal
Luís A. Alexandre
University Jaume I, Castellón, Spain
José Salvador Sánchez
University of the Algarve, Faro, Portugal
João M. F. Rodrigues

About this paper

Cite this paper

Justo, R., Torres, M.I., Alcaide, J.M. (2017). Measuring the Quality of Annotations for a Subjective Crowdsourcing Task. In: Alexandre, L., Salvador Sánchez, J., Rodrigues, J. (eds) Pattern Recognition and Image Analysis. IbPRIA 2017. Lecture Notes in Computer Science, vol 10255. Springer, Cham. https://doi.org/10.1007/978-3-319-58838-4_7

DOI: https://doi.org/10.1007/978-3-319-58838-4_7
Published: 12 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58837-7
Online ISBN: 978-3-319-58838-4
eBook Packages: Computer Science, Computer Science (R0)
