
TinyLLM Efficacy in Low-Resource Language: An Experiment on Bangla Text Classification Task

  • Conference paper
Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15319)

Abstract

Delving into the realm of Bangla text analysis, our study ventures to unlock the potential of both Large and Tiny Language Models across a range of classification tasks, from deciphering sentiment to detecting sarcasm, emotion, hate speech, and fake news. In a linguistic landscape where resources are scarce, we fill a crucial gap by meticulously evaluating model performance. Our findings unveil Gemma-2B and BanglaBERT as top performers: Gemma-2B excels at detecting hate speech and sarcasm, while BanglaBERT shines in sentiment analysis and emotion detection. Notably, TinyLlama emerges as a standout, showing exceptional prowess in fake news detection. We emphasize the importance of selecting models attuned to the intricacies of Bangla text, with Gemma-2B, TinyLlama, and BanglaBERT delivering notable accuracy improvements over other contenders. Furthermore, we uncover performance disparities tied to dataset origin: Bangla language models are adept at capturing social media sentiment, while large language models excel at identifying misinformation and abusive language in formal sources. Our comparison with ChatGPT’s zero-shot prompting underscores the need for advanced NLP methodologies. By spotlighting TinyLLMs, we showcase the potential of advanced NLP in Bangla text classification, paving the way for broader advances in NLP research.
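
To make the experimental setup concrete, the sketch below fine-tunes BanglaBERT for one of the classification tasks above using the Hugging Face transformers library. This is a minimal sketch, not the authors' exact pipeline: the checkpoint name, CSV inputs, and three-way label scheme are illustrative assumptions.

    # A minimal sketch (not the authors' exact setup): fine-tuning BanglaBERT
    # for Bangla sentiment classification with Hugging Face transformers.
    # The checkpoint, CSV files, and label count are illustrative assumptions.
    from datasets import load_dataset
    from transformers import (
        AutoTokenizer,
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    MODEL_ID = "csebuetnlp/banglabert"  # BanglaBERT checkpoint on the HF Hub

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=3  # e.g. positive / neutral / negative
    )

    # Hypothetical CSV files with "text" and "label" columns.
    data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

    def tokenize(batch):
        # Fixed-length padding keeps the default collator happy.
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=256)

    data = data.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="banglabert-sentiment",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=data["train"],
        eval_dataset=data["test"],
    )
    trainer.train()
    print(trainer.evaluate())  # reports eval loss; add compute_metrics for accuracy/F1

The abstract also compares against ChatGPT’s zero-shot prompting; a baseline along those lines might look like the following, where the model name and prompt wording are likewise assumptions rather than the authors' prompts.

    # A minimal sketch of a zero-shot ChatGPT baseline.
    # Model name and prompt wording are assumptions, not the authors' prompts.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def classify_zero_shot(bangla_text: str) -> str:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": ("Classify the sentiment of the given Bangla text. "
                             "Answer with exactly one word: positive, negative, "
                             "or neutral.")},
                {"role": "user", "content": bangla_text},
            ],
        )
        return response.choices[0].message.content.strip().lower()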

F. N. Dehan and Md. Fahim—These authors contributed equally to this work.

Acknowledgments

We are thankful to Independent University, Bangladesh, for its support of this project. We would also like to express our gratitude to the Center for Computational & Data Sciences (CCDS Lab) for providing computational facilities and supervising this project.

Author information

Corresponding author

Correspondence to Md Fahim.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dehan, F.N., Fahim, M., Rahman, A.K.M.M., Amin, M.A., Ali, A.A. (2025). TinyLLM Efficacy in Low-Resource Language: An Experiment on Bangla Text Classification Task. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15319. Springer, Cham. https://doi.org/10.1007/978-3-031-78495-8_30

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-78495-8_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78494-1

  • Online ISBN: 978-3-031-78495-8

  • eBook Packages: Computer Science, Computer Science (R0)
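
For reference managers, the citation above maps onto a BibTeX entry along these lines; the entry key and field selection are illustrative, not Springer's official export.

    @inproceedings{Dehan2025TinyLLM,
      author    = {Dehan, F. N. and Fahim, M. and Rahman, A. K. M. M. and Amin, M. A. and Ali, A. A.},
      title     = {TinyLLM Efficacy in Low-Resource Language: An Experiment on Bangla Text Classification Task},
      booktitle = {Pattern Recognition (ICPR 2024)},
      editor    = {Antonacopoulos, A. and Chaudhuri, S. and Chellappa, R. and Liu, CL. and Bhattacharya, S. and Pal, U.},
      series    = {Lecture Notes in Computer Science},
      volume    = {15319},
      publisher = {Springer, Cham},
      year      = {2025},
      doi       = {10.1007/978-3-031-78495-8_30},
    }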
