
The impact of preprocessing on text classification

Published: 25 November 2019

Abstract

Preprocessing is one of the key components in a typical text classification framework. This paper extensively examines the impact of preprocessing on text classification in terms of several aspects: classification accuracy, text domain, text language, and dimension reduction. For this purpose, all possible combinations of widely used preprocessing tasks are comparatively evaluated on two different domains, namely e-mail and news, and in two different languages, namely Turkish and English. In this way, the contribution of the preprocessing tasks to classification success at various feature dimensions, possible interactions among these tasks, and the dependency of these tasks on the respective languages and domains are comprehensively assessed. Experimental analysis on benchmark datasets reveals that choosing appropriate combinations of preprocessing tasks, rather than enabling or disabling them all, may provide significant improvement in classification accuracy, depending on the domain and language studied.
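To make "all possible combinations of preprocessing tasks" concrete, the sketch below enumerates every on/off setting of a small set of tasks and applies each setting to a sample sentence. The abstract does not name the tasks, so the three used here (lowercasing, stop-word removal, and a crude plural-stripping stand-in for stemming), the tiny stop-word list, and all function names are illustrative assumptions, not the authors' setup.

```python
from itertools import product

# Assumption: a tiny illustrative stop-word list, not one used in the paper.
STOP_WORDS = {"the", "a", "is", "of", "on"}


def lowercase(tokens):
    """Convert every token to lower case."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def strip_plural_s(tokens):
    """Crude stand-in for a real stemmer: drop one trailing 's' from longer tokens."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


# Hypothetical task registry; the order here is the order of application.
TASKS = [
    ("lowercase", lowercase),
    ("stop_words", remove_stop_words),
    ("stem", strip_plural_s),
]


def all_combinations(tasks):
    """Yield every on/off assignment over the tasks (2**n configurations)."""
    for flags in product([False, True], repeat=len(tasks)):
        yield {name: flag for (name, _), flag in zip(tasks, flags)}


def preprocess(text, config):
    """Whitespace-tokenize the text, then apply only the enabled tasks."""
    tokens = text.split()
    for name, fn in TASKS:
        if config[name]:
            tokens = fn(tokens)
    return tokens


if __name__ == "__main__":
    sample = "The impact of Spam filters on Emails"
    for config in all_combinations(TASKS):
        enabled = [name for name, on in config.items() if on] or ["none"]
        print("+".join(enabled), "->", preprocess(sample, config))
```

With n independent tasks there are 2^n configurations, so three tasks give eight pipelines to compare. In an evaluation of the kind the abstract describes, each configuration would feed its tokens into the same feature selection and classification steps so that accuracy differences can be attributed to preprocessing alone.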



    Information

    Published In

    Information Processing and Management: an International Journal  Volume 50, Issue 1
    January, 2014
    234 pages

    Publisher

    Pergamon Press, Inc.

    United States

    Author Tags

    1. Pattern recognition
    2. Text categorization
    3. Text classification
    4. Text preprocessing

    Qualifiers

    • Article

    Cited By

    • (2025) An empirical study of business process models and model clones on GitHub. Empirical Software Engineering 30:2. doi:10.1007/s10664-024-10584-z. Online publication date: 1-Mar-2025
    • (2024) Towards Extending XAI for Full Data Science Pipelines. Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics (1-7). doi:10.1145/3665939.3665967. Online publication date: 14-Jun-2024
    • (2024) Drug–Drug Interaction Relation Extraction Based on Deep Learning: A Review. ACM Computing Surveys 56:6 (1-33). doi:10.1145/3645089. Online publication date: 7-Feb-2024
    • (2024) DARD: Deceptive Approaches for Robust Defense Against IP Theft. IEEE Transactions on Information Forensics and Security 19 (5591-5606). doi:10.1109/TIFS.2024.3402433. Online publication date: 17-May-2024
    • (2024) The Role of Preprocessing for Word Representation Learning in Affective Tasks. IEEE Transactions on Affective Computing 15:1 (254-272). doi:10.1109/TAFFC.2023.3270115. Online publication date: 1-Jan-2024
    • (2024) A novel redistribution-based feature selection for text classification. Expert Systems with Applications: An International Journal 246:C. doi:10.1016/j.eswa.2023.123119. Online publication date: 15-Jul-2024
    • (2024) A comprehensive review of cyberbullying-related content classification in online social media. Expert Systems with Applications: An International Journal 244:C. doi:10.1016/j.eswa.2023.122644. Online publication date: 15-Jun-2024
    • (2024) Preprocessing-Based Approach for Prompt Intrusion Detection in SDN Networks. Journal of Network and Systems Management 32:4. doi:10.1007/s10922-024-09841-9. Online publication date: 16-Aug-2024
    • (2024) Estimating vulnerability metrics with word embedding and multiclass classification methods. International Journal of Information Security 23:1 (247-270). doi:10.1007/s10207-023-00734-7. Online publication date: 1-Feb-2024
    • (2024) Shedding Light on Greenwashing: Explainable Machine Learning for Green Ad Detection. AI 2024: Advances in Artificial Intelligence (186-197). doi:10.1007/978-981-96-0348-0_14. Online publication date: 25-Nov-2024
