Computer Science > Artificial Intelligence
[Submitted on 19 Oct 2022]
Title: Machine and Deep Learning Methods with Manual and Automatic Labelling for News Classification in Bangla Language
Abstract: Research in Natural Language Processing (NLP) has become increasingly important due to applications such as text classification, text mining, sentiment analysis, part-of-speech (POS) tagging, named entity recognition, textual entailment, and many others. This paper introduces several machine learning and deep learning methods with manual and automatic labelling for news classification in the Bangla language. We implemented several machine learning (ML) and deep learning (DL) algorithms. The ML algorithms are Logistic Regression (LR), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbour (KNN), used with Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Doc2Vec embedding models. The DL algorithms are Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and Convolutional Neural Network (CNN), used with Word2Vec, GloVe, and FastText word embedding models. We develop automatic labelling methods using Latent Dirichlet Allocation (LDA) and investigate the performance of single-label and multi-label article classification methods. To investigate performance, we developed from scratch Potrika, the largest and the most extensive dataset for news classification in the Bangla language, comprising 185.51 million words and 12.57 million sentences contained in 664,880 news articles in eight distinct categories, curated from six popular online news portals in Bangladesh for the period 2014-2020. For manually labelled data, GRU with FastText achieves the highest accuracy, 91.83%. For automatically labelled data, KNN with Doc2Vec achieves the highest accuracy: 57.72% for single-label and 75% for multi-label classification. The methods developed in this paper are expected to advance research in Bangla and other languages.
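To make one of the pipelines named in the abstract concrete, the following is a minimal sketch of TF-IDF features combined with a K-Nearest Neighbour classifier. It is not the authors' implementation: it uses only the Python standard library instead of an ML toolkit, assumes a smoothed IDF formula and cosine similarity, and relies on a tiny hypothetical English corpus (`train`) standing in for tokenised Bangla articles. All function names are illustrative.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for each term.

    `docs` is a list of token lists. The smoothing (+1 terms) is an
    assumption of this sketch, not something specified in the paper.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def vectorise(doc, idf):
    """Map a token list to a sparse TF-IDF vector (dict: term -> weight).

    Terms that were never seen when fitting `idf` are ignored.
    """
    tf = Counter(t for t in doc if t in idf)
    total = sum(tf.values())
    if total == 0:
        return {}
    return {t: (freq / total) * idf[t] for t, freq in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, train_vectors, train_labels, k=3):
    """Return the majority label among the k most similar training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus; a real run would use tokenised news articles.
train = [
    ("team wins cricket match".split(), "sports"),
    ("striker scores goal in football match".split(), "sports"),
    ("bank raises interest rate".split(), "economy"),
    ("stock market index falls".split(), "economy"),
]
train_docs = [doc for doc, _ in train]
train_labels = [label for _, label in train]

idf = fit_idf(train_docs)
train_vectors = [vectorise(doc, idf) for doc in train_docs]

query = vectorise("cricket team loses match".split(), idf)
print(knn_predict(query, train_vectors, train_labels, k=3))
```

The same structure carries over to the other ML classifiers listed in the abstract: only `knn_predict` would change, while the feature extraction step stays the same.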
Submission history
From: Rashid Mehmood PhD
[v1] Wed, 19 Oct 2022 21:53:49 UTC (8,571 KB)