Using Sublexical Translations to Handle the OOV Problem in Machine Translation

Published: 01 September 2011

Abstract

We introduce a method for learning to translate out-of-vocabulary (OOV) words. The method combines the sublexical (constituent) translations of an OOV word to generate its translation candidates. Wildcard search queries, formulated from our analysis of OOVs, are designed to maximize the probability of retrieving an OOV’s sublexical translations from the existing resources of machine translation (MT) systems. At run-time, translation candidates for an unknown word are generated from its suitable sublexical translations and ranked using monolingual and bilingual information. We incorporated the OOV model into a state-of-the-art MT system; experimental results show that the model eases the impact of OOVs on translation quality, with significant improvement for sentences containing more OOVs.
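
The pipeline outlined in the abstract (wildcard retrieval of sublexical translations, then combining and ranking candidates) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy lexicon, the toy bigram counts, the prefix/suffix alignment heuristic, the scoring formula, and the function names `sublexical_translations` and `translate_oov` are all assumptions made for illustration only.

```python
from collections import Counter
from fnmatch import fnmatchcase
from itertools import product

# Toy bilingual lexicon standing in for an MT system's phrase table (made-up data).
LEXICON = {
    "apfelsaft": "apple juice",
    "apfelbaum": "apple tree",
    "apfelwein": "cider",
    "pflaumenkuchen": "plum cake",
    "obstkuchen": "fruit cake",
}

# Toy target-side bigram counts standing in for a language model (made-up data).
BIGRAM_COUNTS = Counter({("apple", "cake"): 3, ("apple", "juice"): 5})


def sublexical_translations(constituent, position):
    """Estimate translations of one constituent via a wildcard search.

    Assumed heuristic: a prefix constituent aligns with the first target
    word of each matching entry, a suffix constituent with the last one.
    """
    pattern = constituent + "*" if position == "prefix" else "*" + constituent
    votes = Counter()
    for source, target in LEXICON.items():
        if fnmatchcase(source, pattern):
            words = target.split()
            votes[words[0] if position == "prefix" else words[-1]] += 1
    total = sum(votes.values())
    # Relative frequencies serve as rough translation probabilities.
    return {word: count / total for word, count in votes.items()}


def translate_oov(prefix, suffix):
    """Combine constituent translations and rank the resulting candidates."""
    heads = sublexical_translations(prefix, "prefix")
    tails = sublexical_translations(suffix, "suffix")
    scored = []
    for (head, p_head), (tail, p_tail) in product(heads.items(), tails.items()):
        # Bilingual evidence: product of the constituent estimates.
        # Monolingual evidence: add-one smoothed target bigram count.
        score = p_head * p_tail * (BIGRAM_COUNTS[(head, tail)] + 1)
        scored.append((head + " " + tail, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# "apfelkuchen" is absent from the lexicon, so it plays the role of the OOV.
candidates = translate_oov("apfel", "kuchen")
print(candidates)
```

With this toy data, "apple cake" is ranked above "cider cake", because both the wildcard matches and the bigram counts favor it. A real system would draw the constituent translations from a phrase table and the monolingual score from a trained language model.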

    Published In

    ACM Transactions on Asian Language Information Processing  Volume 10, Issue 3
    September 2011
    114 pages
    ISSN:1530-0226
    EISSN:1558-3430
    DOI:10.1145/2002980
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 September 2011
    Accepted: 01 April 2011
    Revised: 01 February 2011
    Received: 01 November 2010
    Published in TALIP Volume 10, Issue 3

    Author Tags

    1. Out-of-vocabulary words
    2. language model
    3. machine translation
    4. phrase table
    5. sublexical translation
    6. translation model
    7. wildcard search query

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 3
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 13 Dec 2024

    Cited By

    • (2019) Automated software vulnerability assessment with concept drift. Proceedings of the 16th International Conference on Mining Software Repositories, 371-382. DOI: 10.1109/MSR.2019.00063. Online publication date: 26-May-2019
    • (2015) Adding Multilingual Terminological Resources to Parallel Corpora for Statistical Machine Translation Deteriorates System Performance. Proceedings of the 18th International Conference on Text, Speech, and Dialogue - Volume 9302, 506-514. DOI: 10.1007/978-3-319-24033-6_57. Online publication date: 14-Sep-2015
    • (2014) Statistical machine translation enhancements through linguistic levels. ACM Computing Surveys 46:3, 1-28. DOI: 10.1145/2518130. Online publication date: 1-Jan-2014
    • (2013) Chinese-Japanese Machine Translation Exploiting Chinese Characters. ACM Transactions on Asian Language Information Processing 12:4, 1-25. DOI: 10.1145/2523057.2523059. Online publication date: 1-Oct-2013
    • (2013) A Computer-Assisted Translation and Writing System. ACM Transactions on Asian Language Information Processing 12:4, 1-20. DOI: 10.1145/2505984. Online publication date: 1-Oct-2013
    • (2013) Protocol Responsibility Offloading to Improve TCP Throughput in Virtualized Environments. ACM Transactions on Computer Systems 31:3, 1-34. DOI: 10.1145/2491463. Online publication date: 1-Aug-2013
    • (2013) Spanner. ACM Transactions on Computer Systems 31:3, 1-22. DOI: 10.1145/2491245. Online publication date: 1-Aug-2013
    • (2012) Handling Unknown Words in Statistical Machine Translation from a New Perspective. Natural Language Processing and Chinese Computing, 176-187. DOI: 10.1007/978-3-642-34456-5_17. Online publication date: 2012
