AFE-MERT: imbalanced text classification with abstract feature extraction

Published in Applied Intelligence

Abstract

The class imbalance problem arises when the distribution of samples among classes is uneven, causing classifier models to bias toward classes with many training samples. Class imbalance is inherent in text classification. The abstract feature extraction (AFE) method is a versatile term weighting scheme: it serves not only as a feature extractor that imposes structure on unorganized text data but also as a dimension reduction technique and a classifier. In this study, we tackle the class imbalance problem in abstract feature extraction. The proposed method uses the relative imbalance ratio as a factor to elevate the representation of minority classes. In addition, we integrate relevant term factors to boost overall accuracy. Experiments on three data sets, one of which was collected for this study, show that the original abstract feature extraction method indeed suffers from class imbalance, while the proposed methods yield significant improvements in f1-micro, f1-macro, and the Matthews correlation coefficient. The results also suggest that the proposed method is a competitive classifier and term weighting scheme compared to well-known classifiers (KNN, SVM, and Nearest Centroid) and term weighting schemes (TF-IDF, TF-ICF, TF-ICSDF, TF-RF, TF-PROB, TF-IGM, and TF-MONO).
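
The exact AFE-MERT formulation is not reproduced on this page, but the core idea the abstract describes — boosting minority classes through a relative imbalance ratio applied to term weights — can be sketched in a few lines. The Python sketch below is a minimal illustration only: the function names and the specific definition of the ratio (majority-class size divided by each class's size) are assumptions for this sketch, not the authors' actual method.

    import numpy as np

    def relative_imbalance_ratio(class_counts):
        """Per-class boost factor: majority-class size / class size.

        Illustrative, assumed definition: minority classes receive
        larger factors, elevating their representation.
        """
        counts = np.asarray(class_counts, dtype=float)
        return counts.max() / counts

    def boost_minority_terms(tf, doc_labels, class_counts):
        """Scale each document's term frequencies by its class's factor.

        tf           : (n_docs, n_terms) raw term-frequency matrix
        doc_labels   : (n_docs,) integer class index of each document
        class_counts : (n_classes,) training documents per class
        """
        rir = relative_imbalance_ratio(class_counts)       # (n_classes,)
        return tf * rir[np.asarray(doc_labels)][:, None]   # per-doc scaling

    # Example: three classes with 900, 80, and 20 training documents.
    print(relative_imbalance_ratio([900, 80, 20]))  # [ 1.   11.25 45.  ]

Under this assumed definition, a document from the smallest class has its term weights scaled up 45-fold relative to the majority class, which is the general effect the abstract attributes to the relative imbalance ratio factor.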

Notes

  1. Reuters-21578 retrieved from Python Natural Language Toolkit

  2. 20_newsgroups retrieved from http://qwone.com/~jason/20Newsgroups/

  3. https://www.nltk.org/

  4. http://uk.reuters.com

  5. https://github.com/mrtokk

  6. scikit-learn: https://scikit-learn.org/
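
The footnoted resources are all directly scriptable. As a reference point, the sketch below — assuming NLTK's bundled Reuters-21578 corpus (footnote 1) and scikit-learn's metric functions (footnote 6), and using placeholder labels rather than the authors' experimental pipeline — shows how skewed the corpus's class distribution is and how the three metrics reported in the paper (f1-micro, f1-macro, and the Matthews correlation coefficient) are computed.

    from collections import Counter

    import nltk
    from nltk.corpus import reuters
    from sklearn.metrics import f1_score, matthews_corrcoef

    nltk.download("reuters")  # Reuters-21578, as in footnote 1

    # Reuters-21578 is heavily imbalanced: a handful of categories
    # ('earn', 'acq', ...) dominate the label distribution.
    label_counts = Counter(
        cat for fid in reuters.fileids() for cat in reuters.categories(fid)
    )
    print(label_counts.most_common(5))

    # The three metrics reported in the paper, computed with
    # scikit-learn on placeholder labels and predictions.
    y_true = [0, 1, 2, 2, 1, 0]
    y_pred = [0, 1, 2, 1, 1, 0]
    print("f1-micro:", f1_score(y_true, y_pred, average="micro"))
    print("f1-macro:", f1_score(y_true, y_pred, average="macro"))
    print("MCC:", matthews_corrcoef(y_true, y_pred))

f1-macro averages per-class scores and therefore penalizes poor minority-class performance, which is why it is reported alongside f1-micro for imbalanced data.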

References

  1. Aggarwal C C, Zhai C (2012) A survey of text classification algorithms. In: Mining text data, Springer, Boston, pp 163–222

  2. Alsaeedi A (2020) A survey of term weighting schemes for text classification. Int J Data Min Model Manag 12(2):237

  3. Biricik G, Diri B, Sonmez A C (2012) Abstract feature extraction for text classification. Turk J Electr Eng Comput Sci 20(Sup. 1):1137–1159

  4. Buda M, Maki A, Mazurowski M A (2018) A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw 106:249–259

  5. Chen K, Zhang Z, Long J, Zhang H (2016) Turning from TF-IDF to TF-IGM for term weighting in text classification. Expert Syst Appl 66:245–260

  6. Debole F, Sebastiani F (2004) Supervised term weighting for automated text categorization. In: Text mining and its applications, Springer, Berlin, pp 81–97

  7. Deisy C, Gowri M, Baskar S, Kalaiarasi S M A, Ramraj N (2010) A novel term weighting scheme midf for text categorization. J Eng Sci Technol 5(1):94–107

  8. Dogan T, Uysal AK (2019a) Improved inverse gravity moment term weighting for text classification. Expert Syst Appl 130:45–59

  9. Dogan T, Uysal AK (2019b) On term frequency factor in supervised term weighting schemes for text classification. Arab J Sci Eng 44(11):9545–9560

  10. Dogan T, Uysal AK (2020) A novel term weighting scheme for text classification: TF-MONO. J Informetr 14(4):101076. https://doi.org/10.1016/j.joi.2020.101076, https://www.sciencedirect.com/science/article/pii/S1751157720300705

  11. Domo (2021) Domo resource - data never sleeps 8.0. https://www.domo.com/learn/data-never-sleeps-8

  12. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. J Mach Learn Res 9:1871–1874

  13. Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G (2017) Learning from class-imbalanced data: Review of methods and applications. Expert Syst Appl 73:220–239

  14. Han J, Kamber M, Pei J (2012) Data mining: Concepts and techniques. Morgan Kaufmann, Oxford

  15. He H, Garcia E A (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21 (9):1263–1284

  16. Kowsari K, Jafari Meimandi K, Heidarysafa M, Mendu S, Barnes L, Brown D (2019) Text classification algorithms: a survey. Inf (Basel) 10(4):150

  17. Lan M, Tan C L, Su J, Lu Y (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans Pattern Anal Mach Intell 31(4):721–735

  18. Li Y, Sun G, Zhu Y (2010) Data imbalance problem in text classification. In: 2010 Third International Symposium on Information Processing. IEEE

  19. Liu Y, Loh H T, Sun A (2009) Imbalanced text classification: a term weighting approach. Expert Syst Appl 36(1):690–701

  20. Loyola-González O, Martínez-Trinidad JF, Carrasco-Ochoa JA, García-Borroto M (2016) Study of the impact of resampling methods for contrast pattern based classifiers in imbalanced databases. Neurocomputing 175:935–947

  21. Matthews B W (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 405(2):442–451

  22. Ortigosa-Hernández J, Inza I, Lozano J A (2017) Measuring the class-imbalance extent of multi-class problems. Pattern Recognit Lett 98:32–38

  23. Porter M F (2006) An algorithm for suffix stripping. Program 40(3):211–218

  24. Ren F, Sohrab MG (2013) Class-indexing-based term weighting for automatic text classification. Inf Sci 236:109–125. https://doi.org/10.1016/j.ins.2013.02.029, https://www.sciencedirect.com/science/article/pii/S0020025513001461

  25. Sabbah T, Selamat A, Selamat M H, Al-Anzi F S, Viedma E H, Krejcar O, Fujita H (2017) Modified frequency-based term weighting schemes for text classification. Appl Soft Comput 58:193–206

  26. Sun Y, Kamel M S, Wong A K C, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit 40(12):3358–3378

  27. Tan P N, Steinbach M, Kumar V (2005) Introduction to data mining. Pearson, Upper Saddle River

  28. Tokunaga T, Iwayama M (1994) Text categorization based on weighted inverse document frequency. In: Special Interest Groups and Information Process Society of Japan, pp 33–39

  29. Wang D, Zhang H, Wu W, Lin M (2010) Inverse category frequency based supervised term weighting scheme for text categorization. CoRR arXiv:1012.2609

  30. Zheng Z, Wu X, Srihari R (2004) Feature selection for text categorization on imbalanced data. SIGKDD Explor 6(1):80–89

Author information

Corresponding author

Correspondence to Murat Okkalioglu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Okkalioglu, M., Okkalioglu, B.D. AFE-MERT: imbalanced text classification with abstract feature extraction. Appl Intell 52, 10352–10368 (2022). https://doi.org/10.1007/s10489-021-02983-2
