
TinyLLM Efficacy in Low-Resource Language: An Experiment on Bangla Text Classification Task

  • Conference paper
Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15319)

Abstract

Delving into the realm of Bangla text analysis, our study ventures to unlock the potential of both Large and Tiny Language Models across a range of classification tasks, from deciphering sentiment to detecting sarcasm, emotion, hate speech, and fake news. In a linguistic landscape where resources are scarce, we fill a crucial gap by meticulously evaluating model performance. Our findings unveil Gemma-2B and BanglaBERT as top performers: Gemma-2B excels at detecting hate speech and sarcasm, while BanglaBERT shines in sentiment analysis and emotion detection. Notably, TinyLlama emerges as a standout, showing exceptional prowess in fake news detection. We emphasize the importance of selecting models attuned to the intricacies of Bangla text, with Gemma-2B, TinyLlama, and BanglaBERT delivering notable accuracy improvements over other contenders. Furthermore, we uncover performance disparities tied to dataset origin: Bangla language models are adept at capturing social media sentiment, while large language models excel at identifying misinformation and abusive language in formal sources. Our comparison with ChatGPT’s zero-shot prompting underscores the need for advanced NLP methodologies. By spotlighting TinyLLMs, we showcase the potential of advanced NLP in Bangla text classification, paving the way for broader advances in NLP research.
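
To make the experimental setup concrete, the sketch below fine-tunes BanglaBERT for one of the classification tasks above using the Hugging Face transformers library. This is a minimal sketch, not the authors' exact pipeline: the checkpoint name, CSV inputs, and three-way label scheme are illustrative assumptions.

    # A minimal sketch (not the authors' exact setup): fine-tuning BanglaBERT
    # for Bangla sentiment classification with Hugging Face transformers.
    # The checkpoint, CSV files, and label count are illustrative assumptions.
    from datasets import load_dataset
    from transformers import (
        AutoTokenizer,
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    MODEL_ID = "csebuetnlp/banglabert"  # BanglaBERT checkpoint on the HF Hub

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=3  # e.g. positive / neutral / negative
    )

    # Hypothetical CSV files with "text" and "label" columns.
    data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

    def tokenize(batch):
        # Fixed-length padding keeps the default collator happy.
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=256)

    data = data.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="banglabert-sentiment",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=data["train"],
        eval_dataset=data["test"],
    )
    trainer.train()
    print(trainer.evaluate())  # reports eval loss; add compute_metrics for accuracy/F1

The abstract also compares against ChatGPT’s zero-shot prompting; a baseline along those lines might look like the following, where the model name and prompt wording are likewise assumptions rather than the authors' prompts.

    # A minimal sketch of a zero-shot ChatGPT baseline.
    # Model name and prompt wording are assumptions, not the authors' prompts.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def classify_zero_shot(bangla_text: str) -> str:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": ("Classify the sentiment of the given Bangla text. "
                             "Answer with exactly one word: positive, negative, "
                             "or neutral.")},
                {"role": "user", "content": bangla_text},
            ],
        )
        return response.choices[0].message.content.strip().lower()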

F. N. Dehan and Md. Fahim—These authors contributed equally to this work.

Acknowledgments

We are thankful to Independent University, Bangladesh, for its support of this project. We would also like to express our gratitude to the Center for Computational & Data Sciences (CCDS Lab) for providing computational facilities and supervising this project.

Author information

Corresponding author

Correspondence to Md Fahim.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dehan, F.N., Fahim, M., Rahman, A.K.M.M., Amin, M.A., Ali, A.A. (2025). TinyLLM Efficacy in Low-Resource Language: An Experiment on Bangla Text Classification Task. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15319. Springer, Cham. https://doi.org/10.1007/978-3-031-78495-8_30

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-78495-8_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78494-1

  • Online ISBN: 978-3-031-78495-8

  • eBook Packages: Computer Science, Computer Science (R0)
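
For reference managers, the citation above maps onto a BibTeX entry along these lines; the entry key and field selection are illustrative, not Springer's official export.

    @inproceedings{Dehan2025TinyLLM,
      author    = {Dehan, F. N. and Fahim, M. and Rahman, A. K. M. M. and Amin, M. A. and Ali, A. A.},
      title     = {TinyLLM Efficacy in Low-Resource Language: An Experiment on Bangla Text Classification Task},
      booktitle = {Pattern Recognition (ICPR 2024)},
      editor    = {Antonacopoulos, A. and Chaudhuri, S. and Chellappa, R. and Liu, CL. and Bhattacharya, S. and Pal, U.},
      series    = {Lecture Notes in Computer Science},
      volume    = {15319},
      publisher = {Springer, Cham},
      year      = {2025},
      doi       = {10.1007/978-3-031-78495-8_30},
    }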
