Abstract
This paper presents an approach that improves word alignment for the English-Hindi language pair. Long sentences in a parallel corpus cause two serious problems: high computational cost and poor word alignment quality. We address both by breaking long sentence pairs into shorter ones. Our approach first splits the source and target sentences into clauses and then treats the resulting clause pairs as sentence pairs when training the word alignment model. We also report preliminary work on automatically identifying the clause boundaries that are appropriate for improving word alignment. For IBM Model 1, the approach increases precision, recall, and F-measure by approximately 11%, 7%, and 10%, respectively, and reduces Alignment Error Rate (AER) by approximately 10%. These results were obtained by training on 270 sentence pairs and testing on 30 sentence pairs. The experiments are based on the TDIL corpus.
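The pipeline described in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes the clause pairs have already been segmented (the clause boundary identification step is not reproduced here), trains a textbook IBM Model 1 with EM on those clause pairs, and scores an alignment with the commonly used definitions of precision, recall, F-measure, and AER over sure and possible links, which may differ from the exact definitions used in the paper. The function names, the `<NULL>` token, and the three transliterated toy clause pairs are all hypothetical.

```python
from collections import defaultdict

NULL = "<NULL>"  # hypothetical empty word that can generate unaligned source words


def train_ibm1(pairs, iterations=10):
    """EM training of IBM Model 1 on (source_tokens, target_tokens) pairs.

    Returns a table t[(s, w)] approximating P(source word s | target word w).
    """
    src_vocab = {s for src, _ in pairs for s in src}
    uniform = 1.0 / len(src_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for (s, w)
        total = defaultdict(float)  # expected counts for w
        for src, tgt in pairs:
            tgt_null = [NULL] + list(tgt)
            for s in src:
                z = sum(t[(s, w)] for w in tgt_null)  # normaliser for s
                for w in tgt_null:
                    frac = t[(s, w)] / z
                    count[(s, w)] += frac
                    total[w] += frac
        # M-step: renormalise the expected counts per target word
        t = defaultdict(float, {(s, w): c / total[w] for (s, w), c in count.items()})
    return t


def align(src, tgt, t):
    """Link each source position to its most probable target position.

    A source word stays unaligned when NULL is at least as probable.
    """
    links = set()
    for i, s in enumerate(src):
        best_j, best_p = None, t.get((s, NULL), 0.0)
        for j, w in enumerate(tgt):
            p = t.get((s, w), 0.0)
            if p > best_p:
                best_j, best_p = j, p
        if best_j is not None:
            links.add((i, best_j))
    return links


def alignment_scores(predicted, sure, possible=None):
    """Precision, recall, F-measure and AER for sets of (i, j) links."""
    possible = sure if possible is None else (possible | sure)
    hit_sure = len(predicted & sure)
    hit_possible = len(predicted & possible)
    precision = hit_possible / len(predicted) if predicted else 0.0
    recall = hit_sure / len(sure) if sure else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    size = len(predicted) + len(sure)
    aer = 1.0 - (hit_sure + hit_possible) / size if size else 0.0
    return precision, recall, f_measure, aer


# Toy clause pairs (English source, transliterated Hindi target); illustrative only.
clause_pairs = [
    (["this", "house"], ["yah", "ghar"]),
    (["this", "book"], ["yah", "kitab"]),
    (["a", "book"], ["ek", "kitab"]),
]
table = train_ibm1(clause_pairs)
links = align(["this", "house"], ["yah", "ghar"], table)
scores = alignment_scores(links, sure={(0, 0), (1, 1)})
```

Because each clause pair is passed to `train_ibm1` exactly as a sentence pair would be, swapping in full sentence pairs gives the baseline for comparison. The inner loops visit every source-target word combination in a pair, which is one reason shorter pairs reduce the computational cost.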
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Srivastava, J., Sanyal, S. (2012). Segmenting Long Sentence Pairs to Improve Word Alignment in English-Hindi Parallel Corpora. In: Isahara, H., Kanzaki, K. (eds.) Advances in Natural Language Processing. JapTAL 2012. Lecture Notes in Computer Science, vol 7614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33983-7_10
DOI: https://doi.org/10.1007/978-3-642-33983-7_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33982-0
Online ISBN: 978-3-642-33983-7
eBook Packages: Computer Science, Computer Science (R0)