AFE-MERT: imbalanced text classification with abstract feature extraction

Abstract

The class imbalance problem occurs when training samples are unevenly distributed among classes, which can bias classifier models toward the classes with many training samples. The problem is inherent in text classification. The abstract feature extraction method is a versatile term weighting scheme: it serves not only as a feature extractor that gives unorganized text data a structured form but also as a dimension reduction technique and a classifier. In this study, we tackle the class imbalance problem in abstract feature extraction. The proposed method uses the relative imbalance ratio as a factor to elevate the representation of minority classes. We also integrate relevant term factors to boost overall accuracy. Experiments on three data sets, one of which was collected for this study, show that the original abstract feature extraction method indeed suffers from the class imbalance problem and that the proposed methods yield significant improvements in terms of f1-micro, f1-macro, and the Matthews correlation coefficient. The experimental results also suggest that the proposed method is competitive, both as a classifier and as a term weighting scheme, with well-known classifiers (KNN, SVM, and Nearest Centroid) and term weighting schemes (TF-IDF, TF-ICF, TF-ICSDF, TF-RF, TF-PROB, TF-IGM, and TF-MONO).
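
The three evaluation measures named in the abstract react differently to a classifier that is biased toward the majority class. The sketch below is a minimal pure-Python illustration of that effect, not the authors' evaluation code: the function names, the toy labels ("earn", "grain"), and the 9-to-1 split are assumptions made for the example, and the Matthews correlation coefficient is computed with the standard multiclass generalization.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Per-class true positives, false positives, and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    return tp, fp, fn


def f1_micro(y_true, y_pred):
    """F1 computed from counts pooled over all classes."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    denom = 2 * tp_sum + fp_sum + fn_sum
    return 2 * tp_sum / denom if denom else 0.0


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    cov_tp = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    cov_tt = n * n - sum(v * v for v in true_counts.values())
    cov_pp = n * n - sum(v * v for v in pred_counts.values())
    denom = (cov_tt * cov_pp) ** 0.5
    return cov_tp / denom if denom else 0.0  # 0.0 when a factor is degenerate


# Toy imbalanced set: nine majority documents, one minority document.
y_true = ["earn"] * 9 + ["grain"]
# A classifier biased toward the majority class predicts "earn" every time.
y_pred = ["earn"] * 10

print(f1_micro(y_true, y_pred))
print(f1_macro(y_true, y_pred))
print(mcc(y_true, y_pred))
```

On this toy input, f1-micro is 0.9 even though the minority class is never recognized, while f1-macro drops to roughly 0.47 and the Matthews correlation coefficient is 0. Reporting all three measures therefore makes a majority-class bias visible where a pooled score alone would hide it.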

Notes

Reuters-21578 retrieved from Python Natural Language Toolkit
20_newsgroups retrieved from http://qwone.com/~jason/20Newsgroups/
https://www.nltk.org/
http://uk.retuers.com
https://github.com/mrtokk
scikit-learn: https://scikit-learn.org/


Author information

Authors and Affiliations

Department of Computer Engineering, Yalova University, Yalova, Turkey
Murat Okkalioglu & Burcu Demirelli Okkalioglu
Cyber Security Application and Research Center, Yalova University, Yalova, Turkey
Murat Okkalioglu

Authors

Murat Okkalioglu
Burcu Demirelli Okkalioglu

Corresponding author

Correspondence to Murat Okkalioglu.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Okkalioglu, M., Okkalioglu, B.D. AFE-MERT: imbalanced text classification with abstract feature extraction. Appl Intell 52, 10352–10368 (2022). https://doi.org/10.1007/s10489-021-02983-2

Accepted: 05 November 2021
Published: 13 January 2022
Issue Date: July 2022
DOI: https://doi.org/10.1007/s10489-021-02983-2
