TamSiPara: A Tamil – Sinhala Parallel Corpus

  • Conference paper
Text, Speech, and Dialogue (TSD 2024)

Abstract

This paper presents the development of a Sinhala-Tamil bilingual parallel corpus with sentence-level alignment. The source-language text is drawn from contemporary writings, and every sentence was translated manually. Active learning methods were used to select the sentences, so that effective language structures of both languages are represented. The corpus has two parts: 25k parallel sentences translated from Sinhala to Tamil, and 22k parallel sentences translated from Tamil to Sinhala. The manual translations were carried out by two teams of professional translators. The final version of TamSiPara, the Tamil-Sinhala bilingual parallel corpus, contains 47k parallel sentences in total.

Acknowledgements

The first phase of this research was funded by the ICTA of Sri Lanka, and we appreciate their support. Furthermore, we acknowledge the partial funding received from the University of Colombo School of Computing through the Research Allocation for Research and Development. We also thank all the translators and the members of the LTRL of UCSC for their various contributions to making this work successful.

Author information

Corresponding author

Correspondence to Randil Pushpananda.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Pushpananda, R., Liyanage, C., Pramodya, A., Weerasinghe, R. (2024). TamSiPara: A Tamil – Sinhala Parallel Corpus. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science, vol 15048. Springer, Cham. https://doi.org/10.1007/978-3-031-70563-2_13

  • DOI: https://doi.org/10.1007/978-3-031-70563-2_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70562-5

  • Online ISBN: 978-3-031-70563-2

  • eBook Packages: Computer Science, Computer Science (R0)
