Abstract
Machine learning offers a systematic framework for developing metrics that use multiple criteria to assess the quality of machine translation (MT). However, learning introduces additional complexities that may affect the resulting metric’s effectiveness. First, a learned metric is more reliable for translations that are similar to its training examples; this calls into question whether it remains effective in evaluating translations from systems that are not its contemporaries. Second, metrics trained on different sets of training examples may vary in their evaluations. Third, expensive developmental resources (such as translations that have been evaluated by humans) may be needed as training examples. This paper investigates these concerns in the context of using regression to develop metrics for evaluating machine-translated sentences. We track a learned metric’s reliability across a five-year period to measure the extent to which it can evaluate sentences produced by other systems. We compare metrics trained under different conditions to measure their variation. Finally, we present an alternative formulation of metric training in which the features are based on comparisons against pseudo-references, in order to reduce the demand on human-produced resources. Our results confirm that regression is a useful approach for developing new metrics for MT evaluation at the sentence level.
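To make the pseudo-reference idea concrete, the sketch below trains a regression model that maps a candidate translation to a predicted human quality score, using n-gram-overlap features computed against pseudo-references (outputs of other MT systems standing in for human reference translations). This is a minimal illustration only: the simplified BLEU-style features, the scikit-learn SVR learner, and the toy data are assumptions for exposition, not the authors' actual feature set or experimental setup.

```python
# Minimal sketch: regression-based sentence-level MT evaluation with pseudo-references.
# Features and learner choice are illustrative, not the paper's configuration.

from collections import Counter
from sklearn.svm import SVR  # one common regression learner; not necessarily the paper's


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_features(candidate, pseudo_refs, max_n=4):
    """Clipped n-gram precision of the candidate against each pseudo-reference,
    averaged over references, for n = 1..max_n (a simplified BLEU-style feature)."""
    cand_tokens = candidate.split()
    feats = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_counts(cand_tokens, n)
        total = sum(cand_ngrams.values()) or 1
        scores = []
        for ref in pseudo_refs:
            ref_ngrams = ngram_counts(ref.split(), n)
            matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
            scores.append(matched / total)
        feats.append(sum(scores) / len(scores))
    return feats


# Toy training data: (candidate, pseudo-references from other MT systems, human score).
# All values are made up for illustration.
train = [
    ("the cat sat on the mat", ["a cat sat on the mat", "the cat is on the mat"], 4.5),
    ("cat the mat on sat", ["a cat sat on the mat", "the cat is on the mat"], 1.5),
]

X = [overlap_features(cand, refs) for cand, refs, _ in train]
y = [score for _, _, score in train]

model = SVR(kernel="rbf").fit(X, y)

# Score a new machine-translated sentence against its pseudo-references.
print(model.predict([overlap_features("the cat sat on a mat",
                                      ["a cat sat on the mat", "the cat is on the mat"])]))
```

In practice the feature set would be richer and the human judgments would come from annotated MT evaluation data, but the structure is the same: features measure agreement with pseudo-references, and regression learns how those features relate to human assessments.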
Albrecht, J.S., Hwa, R. Regression for machine translation evaluation at the sentence level. Machine Translation 22, 1–27 (2008). https://doi.org/10.1007/s10590-008-9046-1