Research article · Free access
DOI: 10.5555/1699510.1699548

Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk

Published: 06 August 2009

Abstract

Manual evaluation of translation quality is generally thought to be excessively time-consuming and expensive. We explore a fast and inexpensive way of doing it using Amazon's Mechanical Turk to pay small sums to a large number of non-expert annotators. For $10 we redundantly recreate judgments from a WMT08 translation task. We find that when combined, non-expert judgments have a high level of agreement with the existing gold-standard judgments of machine translation quality, and correlate more strongly with expert judgments than Bleu does. We go on to show that Mechanical Turk can be used to calculate human-mediated translation edit rate (HTER), to conduct reading comprehension experiments with machine translation, and to create high quality reference translations.
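The idea of combining redundant non-expert judgments and checking them against expert opinion can be illustrated with a small sketch. The snippet below is a hypothetical illustration, not the paper's actual pipeline: it combines each sentence's redundant non-expert pairwise preferences by majority vote and reports how often the combined vote matches an expert judgment. All data and variable names are invented for illustration.

    # A minimal sketch (hypothetical data, not the paper's protocol):
    # combine redundant non-expert votes by majority and measure
    # agreement with an expert judgment on the same sentence pairs.
    from collections import Counter

    def majority_vote(labels):
        # Return the most frequent label among redundant annotations.
        return Counter(labels).most_common(1)[0][0]

    # For each sentence, five Turkers say whether system A or system B
    # produced the better translation; an expert judged the same pair.
    items = [
        {"turkers": ["A", "A", "B", "A", "A"], "expert": "A"},
        {"turkers": ["B", "B", "B", "A", "B"], "expert": "B"},
        {"turkers": ["A", "B", "A", "B", "A"], "expert": "B"},
        {"turkers": ["B", "A", "B", "B", "B"], "expert": "B"},
    ]

    combined = [majority_vote(item["turkers"]) for item in items]
    matches = sum(vote == item["expert"] for vote, item in zip(combined, items))
    print(f"Combined non-expert vs. expert agreement: {matches / len(items):.2f}")

The same aggregate-then-compare pattern extends naturally to more annotators per item or to correlation statistics over system-level scores; the majority vote here simply stands in for whatever combination scheme is used.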

References

[1] Bogdan Babych and Anthony Hartley. 2004. Extending the Bleu MT evaluation method with frequency weightings. In Proceedings of ACL.
[2] Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of Bleu in machine translation research. In Proceedings of EACL.
[3] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2008. Further meta-evaluation of machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation (WMT08).
[4] Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation (WMT09), March.
[5] David Chiang, Steve DeNeefe, Yee Seng Chan, and Hwee Tou Ng. 2008. Decomposability of translation metrics for improved evaluation and efficient algorithms. In Proceedings of EMNLP.
[6] Douglas Jones, Wade Shen, Neil Granoien, Martha Herzog, and Clifford Weinstein. 2005. Measuring translation quality by testing English speakers with a new defense language proficiency test for Arabic. In Proceedings of the 2005 International Conference on Intelligence Analysis.
[7] LDC. 2005. Linguistic data annotation specification: Assessment of fluency and adequacy in translations. Revision 1.5.
[8] Evgeny Matusov, Nicola Ueffing, and Hermann Ney. 2006. Computing consensus translation for multiple machine translation systems using enhanced hypothesis alignment. In Proceedings of EACL.
[9] NIST and LDC. 2007. Post editing guidelines for GALE machine translation evaluation. Guidelines developed by the National Institute of Standards and Technology (NIST) and the Linguistic Data Consortium (LDC).
[10] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of ACL.
[11] Michael Paul. 2006. Overview of the IWSLT 2006 evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation.
[12] Antti-Veikko Rosti, Necip Fazil Ayan, Bing Xiang, Spyros Matsoukas, Richard Schwartz, and Bonnie Dorr. 2007. Combining outputs from multiple machine translation systems. In Proceedings of HLT/NAACL.
[13] Markus Schulze. 2003. A new monotonic and clone-independent single-winner election method. Voting Matters, (17), October.
[14] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA.
[15] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP.
[16] Omar F. Zaidan and Chris Callison-Burch. 2009. Feasibility of human-in-the-loop minimum error rate training. In Proceedings of EMNLP.



Information & Contributors

Information

Published In

EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 1
August 2009
505 pages
ISBN: 9781932432596

Publisher

Association for Computational Linguistics

United States

Publication History

Published: 06 August 2009

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 73 of 234 submissions, 31%


Article Metrics

  • Downloads (Last 12 months)145
  • Downloads (Last 6 weeks)11
Reflects downloads up to 11 Dec 2024


Cited By

  • (2022) Reliance and Automation for Human-AI Collaborative Data Labeling Conflict Resolution. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2), 1-27. DOI: 10.1145/3555212. Online publication date: 11-Nov-2022.
  • (2022) A Survey of Evaluation Metrics Used for NLG Systems. ACM Computing Surveys, 55(2), 1-39. DOI: 10.1145/3485766. Online publication date: 18-Jan-2022.
  • (2019) Crowdsourcing Images for Global Diversity. Proceedings of the 21st International Conference on Human-Computer Interaction with Mobile Devices and Services, 1-10. DOI: 10.1145/3338286.3347546. Online publication date: 1-Oct-2019.
  • (2019) Learning to Predict Population-Level Label Distributions. Companion Proceedings of The 2019 World Wide Web Conference, 1111-1120. DOI: 10.1145/3308560.3317082. Online publication date: 13-May-2019.
  • (2018) Towards More Robust Speech Interactions for Deaf and Hard of Hearing Users. Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility, 57-67. DOI: 10.1145/3234695.3236343. Online publication date: 8-Oct-2018.
  • (2018) Can the crowd tell how I feel? Trait empathy and ethnic background in a visual pain judgment task. Universal Access in the Information Society, 17(3), 649-661. DOI: 10.1007/s10209-018-0611-y. Online publication date: 1-Aug-2018.
  • (2017) Truth inference in crowdsourcing. Proceedings of the VLDB Endowment, 10(5), 541-552. DOI: 10.14778/3055540.3055547. Online publication date: 1-Jan-2017.
  • (2017) CrowdPickUp. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(3), 1-22. DOI: 10.1145/3130916. Online publication date: 11-Sep-2017.
  • (2017) Task Routing and Assignment in Crowdsourcing based on Cognitive Abilities. Proceedings of the 26th International Conference on World Wide Web Companion, 1023-1031. DOI: 10.1145/3041021.3055128. Online publication date: 3-Apr-2017.
  • (2017) Eliciting Structured Knowledge from Situated Crowd Markets. ACM Transactions on Internet Technology, 17(2), 1-21. DOI: 10.1145/3007900. Online publication date: 27-Mar-2017.
