Abstract
Parallel corpora play a crucial role in multilingual natural language processing. Unfortunately, the availability of such resources is the bottleneck in most applications of interest. Mining the web for parallel corpora is a viable solution that comes at a price: it is not always easy to identify parallel documents among the crawled material. In this study we address the problem of automatically identifying, within a set of documents, the pairs of texts that are translations of each other. We show that it is possible to automatically build particularly efficient content-based methods that make use of very little lexical knowledge. We also evaluate our approach on a front-end translation task and demonstrate that our parallel text classifier yields better performance than another approach based on a rich lexicon.
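To make the idea of a content-based method with very little lexical knowledge concrete, here is a minimal sketch. It is not the classifier evaluated in the paper, whose features the abstract does not specify. It assumes, purely for illustration, that the sequence of numbers in a document survives translation, and it compares those sequences with a normalized edit distance. The function names, the greedy pairing, and the 0.5 threshold are all hypothetical.

```python
import re


def number_sequence(text):
    """Return the runs of digits found in a text, in order of appearance."""
    return re.findall(r"\d+", text)


def edit_distance(a, b):
    """Levenshtein distance between two sequences, with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(src, tgt):
    """Score in [0, 1]: 1.0 means the two number sequences are identical."""
    a, b = number_sequence(src), number_sequence(tgt)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: no numbers means no evidence of parallelism
    return 1.0 - edit_distance(a, b) / longest


def pair_documents(sources, targets, threshold=0.5):
    """Greedily map each source document to its best-scoring target.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    pairs = {}
    for s_name, s_text in sources.items():
        scored = [(similarity(s_text, t_text), t_name)
                  for t_name, t_text in targets.items()]
        if not scored:
            continue
        best_score, best_name = max(scored)
        if best_score >= threshold:
            pairs[s_name] = best_name
    return pairs


if __name__ == "__main__":
    english = {
        "en1": "In 2004, 12 members voted on 3 motions.",
        "en2": "The budget was 450 million in 1999.",
    }
    french = {
        "fr_a": "Le budget était de 450 millions en 1999.",
        "fr_b": "En 2004, 12 députés ont voté sur 3 motions.",
    }
    print(pair_documents(english, french))
```

A signal of this kind needs no bilingual lexicon, which is the property the abstract emphasizes. A practical system would presumably combine several such signals and learn how to weight them rather than rely on one hand-set threshold.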
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Patry, A., Langlais, P. (2005). Automatic Identification of Parallel Documents With Light or Without Linguistic Resources. In: Kégl, B., Lapalme, G. (eds) Advances in Artificial Intelligence. Canadian AI 2005. Lecture Notes in Computer Science, vol 3501. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424918_37
DOI: https://doi.org/10.1007/11424918_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25864-3
Online ISBN: 978-3-540-31952-8
eBook Packages: Computer Science (R0)