Abstract
The class imbalance problem occurs when training samples are unevenly distributed among classes, which biases classifier models toward the classes with many training samples. The problem is inherent in text classification. Abstract feature extraction is a versatile term weighting scheme: it serves not only as a feature extractor that builds a structured representation from unstructured text data but also as a dimension reduction technique and a classifier. In this study, we tackle the problem of class imbalance in abstract feature extraction. The proposed method uses the relative imbalance ratio as a factor to elevate the representation of minority classes. We also integrate relevant term factors to boost general accuracy. Experiments conducted on three data sets, one of which was collected for this study, show that the original abstract feature extraction method indeed suffers from the class imbalance problem and that the proposed methods yield significant improvements in terms of F1-micro, F1-macro, and the Matthews correlation coefficient. The experimental results also suggest that the proposed method is competitive, both as a classifier and as a term weighting scheme, with well-known classifiers (KNN, SVM, and Nearest Centroid) and term weighting schemes (TF-IDF, TF-ICF, TF-ICSDF, TF-RF, TF-PROB, TF-IGM, and TF-MONO).
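To illustrate why the three reported metrics matter under class imbalance, the following minimal sketch computes F1-micro, F1-macro, and the multi-class Matthews correlation coefficient using their standard definitions. The toy labels and the function names `f1_scores` and `mcc` are assumptions made for illustration only; they are not taken from the study and say nothing about its data sets or results.

```python
from collections import Counter
from math import sqrt


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


def mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient (0.0 when undefined)."""
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    t_k, p_k = Counter(y_true), Counter(y_pred)       # per-class true/predicted counts
    labels = set(t_k) | set(p_k)
    num = c * s - sum(t_k[k] * p_k[k] for k in labels)
    den = sqrt((s * s - sum(v * v for v in p_k.values()))
               * (s * s - sum(v * v for v in t_k.values())))
    return num / den if den else 0.0


# Hypothetical imbalanced toy set: 8 majority samples, 2 minority samples,
# scored against a classifier that always predicts the majority class.
y_true = ["major"] * 8 + ["minor"] * 2
y_pred = ["major"] * 10

micro, macro = f1_scores(y_true, y_pred)
print(round(micro, 3), round(macro, 3), round(mcc(y_true, y_pred), 3))
# prints: 0.8 0.444 0.0
```

In this toy case F1-micro stays high because the majority class dominates the pooled counts, while F1-macro and the Matthews correlation coefficient expose that the minority class is never predicted.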
Notes
Reuters-21578 retrieved from Python Natural Language Toolkit
20_newsgroups retrieved from http://qwone.com/~jason/20Newsgroups/
scikit-learn: https://scikit-learn.org/
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
About this article
Cite this article
Okkalioglu, M., Okkalioglu, B.D. AFE-MERT: imbalanced text classification with abstract feature extraction. Appl Intell 52, 10352–10368 (2022). https://doi.org/10.1007/s10489-021-02983-2
DOI: https://doi.org/10.1007/s10489-021-02983-2