Segmenting Long Sentence Pairs to Improve Word Alignment in English-Hindi Parallel Corpora

Jyoti Srivastava & Sudip Sanyal

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7614)

Included in the following conference series:

International Conference on NLP


Abstract

This paper presents an approach that improves word alignment for the English-Hindi language pair. Long sentences in a corpus cause two serious problems: high computational requirements and poor-quality word alignments. We address both by breaking long sentence pairs into shorter ones. Our approach first splits the source and target sentences into clauses and then treats the resulting clause pairs as sentence pairs for training the word alignment model. We also report preliminary work on automatically identifying clause boundaries that are suitable for improving word alignment. For IBM Model 1, the method increases precision, recall and F-measure by approximately 11%, 7% and 10% respectively, and reduces Alignment Error Rate (AER) by approximately 10%. These results were obtained by training on 270 sentence pairs and testing on 30 sentence pairs. The experiments are based on the TDIL corpus.
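The pipeline summarised in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the clause segmentation has already been done (the abstract does not describe the boundary detection method), it uses a textbook EM formulation of IBM Model 1, and it scores alignments with the widely used definitions of precision, recall, F-measure and AER over sure and possible links. The function names (`train_ibm_model1`, `align`, `alignment_scores`) and the romanised toy clause pairs are hypothetical and do not come from the TDIL corpus.

```python
from collections import defaultdict

NULL = "<null>"  # empty source word, so a target word may stay unaligned


def train_ibm_model1(pairs, iterations=20):
    """Estimate lexical probabilities t(f | e) with EM (textbook IBM Model 1).

    pairs: list of (source_tokens, target_tokens); here each pair is a
    clause pair that is treated exactly like a sentence pair.
    """
    pairs = [([NULL] + list(src), list(tgt)) for src, tgt in pairs]
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected link counts
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the counts per source word
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)


def align(src, tgt, t):
    """Link each target word to its most probable source word (or to NULL)."""
    candidates = [NULL] + list(src)
    links = set()
    for j, f in enumerate(tgt):
        best = max(range(len(candidates)),
                   key=lambda i: t.get((f, candidates[i]), 0.0))
        if best > 0:  # index 0 is NULL: leave the target word unaligned
            links.add((best - 1, j))
    return links


def alignment_scores(predicted, sure, possible=None):
    """Return (precision, recall, F-measure, AER) for one set of links."""
    possible = sure if possible is None else (possible | sure)
    a_s = len(predicted & sure)
    a_p = len(predicted & possible)
    precision = a_p / len(predicted)
    recall = a_s / len(sure)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    aer = 1.0 - (a_s + a_p) / (len(predicted) + len(sure))
    return precision, recall, f_measure, aer


# Toy clause pairs (hypothetical, romanised for readability).
clause_pairs = [
    (["red", "house"], ["laal", "ghar"]),
    (["red", "book"], ["laal", "kitaab"]),
    (["big", "house"], ["bada", "ghar"]),
]
t = train_ibm_model1(clause_pairs)
predicted = align(["red", "house"], ["laal", "ghar"], t)
gold = {(0, 0), (1, 1)}  # (source index, target index) links
precision, recall, f_measure, aer = alignment_scores(predicted, gold)
```

Because Model 1 sums over every source position for every target word, the cost of each EM pass grows with the product of the two lengths, which is one reason shorter clause pairs are cheaper to train on than full sentences.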

Author information

Authors and Affiliations

Indian Institute of Information Technology, Allahabad, India
Jyoti Srivastava & Sudip Sanyal

Editor information

Editors and Affiliations

Information and Media Center, Toyohashi University of Technology, 1-1 Hibarigaoka, Tenpakucho, 441-8580, Toyohashi, Japan
Hitoshi Isahara & Kyoko Kanzaki

About this paper

Cite this paper

Srivastava, J., Sanyal, S. (2012). Segmenting Long Sentence Pairs to Improve Word Alignment in English-Hindi Parallel Corpora. In: Isahara, H., Kanzaki, K. (eds) Advances in Natural Language Processing. JapTAL 2012. Lecture Notes in Computer Science, vol 7614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33983-7_10

DOI: https://doi.org/10.1007/978-3-642-33983-7_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33982-0
Online ISBN: 978-3-642-33983-7
eBook Packages: Computer Science, Computer Science (R0)
