TamSiPara: A Tamil – Sinhala Parallel Corpus

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15048)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

This paper presents the development of a Sinhala-Tamil bilingual parallel corpus with sentence-level alignment. The source-language text is drawn from contemporary writings, and all sentences were translated manually. Active learning methods were used to select sentences, ensuring the representation of effective language structures in both languages. The corpus is divided into two parts: one translated from Sinhala to Tamil, consisting of 25k parallel sentences, and one translated from Tamil to Sinhala, comprising 22k parallel sentences. The manual translations were carried out by two teams of professional translators. The final version of TamSiPara, the Tamil-Sinhala bilingual parallel corpus, consists of a total of 47k parallel sentences.

Acknowledgements

The first phase of this research was funded by the ICTA of Sri Lanka, and we appreciate their support. Furthermore, we acknowledge the partial funding received from the University of Colombo School of Computing through the Research Allocation for Research and Development. We also thank all the translators and the members of the LTRL of UCSC for their various contributions to making this work successful.

Author information

Authors and Affiliations

Language Technology Research Laboratory, University of Colombo School of Computing, Colombo, Sri Lanka
Randil Pushpananda, Chamila Liyanage & Ruvan Weerasinghe
Nara Institute of Science and Technology, Ikoma, Japan
Ashmari Pramodya

Corresponding author

Correspondence to Randil Pushpananda.

Editor information

Editors and Affiliations

Friedrich-Alexander-Universität, Erlangen, Germany
Elmar Nöth
Masaryk University, Brno, Czech Republic
Aleš Horák
Masaryk University, Brno, Czech Republic
Petr Sojka

Cite this paper

Pushpananda, R., Liyanage, C., Pramodya, A., Weerasinghe, R. (2024). TamSiPara: A Tamil – Sinhala Parallel Corpus. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science, vol 15048. Springer, Cham. https://doi.org/10.1007/978-3-031-70563-2_13

DOI: https://doi.org/10.1007/978-3-031-70563-2_13
Published: 01 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70562-5
Online ISBN: 978-3-031-70563-2
eBook Packages: Computer Science, Computer Science (R0)