Abstract
This paper presents the development of a Sinhala-Tamil bilingual parallel corpus with sentence-level alignment. The source-language text is drawn from contemporary writings, and all sentences were translated manually. Active learning methods were used to select sentences so that effective language structures are represented in both languages. The corpus is divided into two parts: one translated from Sinhala to Tamil, consisting of 25k parallel sentences, and one translated from Tamil to Sinhala, consisting of 22k parallel sentences. The manual translations were produced by two teams of professional translators. The final version of TamSiPara, the Tamil-Sinhala bilingual parallel corpus, comprises a total of 47k parallel sentences.
Acknowledgements
The first phase of this research was funded by the ICTA of Sri Lanka, and we appreciate their support. Furthermore, we acknowledge the partial funding received from the University of Colombo School of Computing through the Research Allocation for Research and Development. We also thank all the translators and the members of the LTRL of UCSC for their various contributions to making this work successful.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pushpananda, R., Liyanage, C., Pramodya, A., Weerasinghe, R. (2024). TamSiPara: A Tamil – Sinhala Parallel Corpus. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science, vol 15048. Springer, Cham. https://doi.org/10.1007/978-3-031-70563-2_13
DOI: https://doi.org/10.1007/978-3-031-70563-2_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70562-5
Online ISBN: 978-3-031-70563-2
eBook Packages: Computer Science (R0)