Abstract
Document clustering is a well-established technique for partitioning voluminous text corpora into distinct categories. In this paper we present an improved algorithm for clustering large text corpora. The proposed algorithm tries to overcome the challenges of clustering large corpora while maintaining high “goodness” values for the resulting clusters. It proceeds by optimizing a fitness function using Differential Evolution to form the initial clusters. The clusters obtained after this initial phase are then “refined” by re-evaluating the points that fall at the fringes of the clusters and reassigning them to other clusters if necessary. Two approaches, Nearest Cluster Based Re-evaluation (N-CBR) and Multiple Cluster Based Re-evaluation (M-CBR), are proposed to select candidates during the reassignment phase, and their performance is evaluated. The effect of this post-processing phase is demonstrated on a number of standard benchmark text corpora, where the algorithm is found to be quite accurate and efficient. The results obtained by the proposed method are also compared with those of other evolutionary strategies, such as Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Harmony Search (HS), and are found to be quite satisfactory.
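The two-phase idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the fitness function, the definition of a fringe point, or the N-CBR selection rule, so everything below is an assumption made for illustration. The sketch uses Euclidean distance on toy 2-D points instead of document vectors, takes the mean distance to the nearest centroid as the fitness, treats the farthest 10% of each cluster as its fringe, and reassigns those points to the nearest cluster mean in a hypothetical nearest-cluster style step. The names `de_cluster` and `reevaluate_fringe` are hypothetical, and M-CBR is not sketched.

```python
import numpy as np

def distances(points, centers):
    """Euclidean distance from every point to every center (n x k)."""
    return np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

def fitness(flat, points, k):
    """Assumed fitness: mean distance to the nearest centroid (lower is better)."""
    d = distances(points, flat.reshape(k, -1))
    return d.min(axis=1).mean()

def de_cluster(points, k, pop_size=20, gens=100, F=0.5, CR=0.9, seed=0):
    """Classic DE/rand/1/bin over flattened centroid vectors (phase 1)."""
    rng = np.random.default_rng(seed)
    dim = k * points.shape[1]
    pop = rng.uniform(points.min(), points.max(), size=(pop_size, dim))
    scores = np.array([fitness(ind, points, k) for ind in pop])
    for _ in range(gens):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(others, size=3, replace=False)]
            mutant = a + F * (b - c)              # differential mutation
            cross = rng.random(dim) < CR          # binomial crossover mask
            cross[rng.integers(dim)] = True       # keep at least one mutant gene
            trial = np.where(cross, mutant, pop[i])
            s = fitness(trial, points, k)
            if s <= scores[i]:                    # greedy selection
                pop[i], scores[i] = trial, s
    return pop[scores.argmin()].reshape(k, -1)

def reevaluate_fringe(points, labels, centroids, quantile=0.9):
    """Hypothetical refinement (phase 2): reassign each cluster's farthest
    points to the nearest cluster mean. The 0.9 quantile is an assumption."""
    means = centroids.copy()
    for c in range(len(centroids)):
        if np.any(labels == c):                   # empty clusters keep the DE centroid
            means[c] = points[labels == c].mean(axis=0)
    d = distances(points, means)
    own = d[np.arange(len(points)), labels]       # distance to own cluster mean
    new_labels = labels.copy()
    for c in range(len(centroids)):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            continue
        cutoff = np.quantile(own[members], quantile)
        fringe = members[own[members] >= cutoff]  # candidates for reassignment
        new_labels[fringe] = d[fringe].argmin(axis=1)
    return new_labels

# Toy data standing in for document vectors: two separated blobs.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)),
                    rng.normal(10.0, 0.5, size=(30, 2))])
centroids = de_cluster(points, k=2)
labels = distances(points, centroids).argmin(axis=1)
refined = reevaluate_fringe(points, labels, centroids)
```

Deferring the reassignment to a small set of fringe points keeps the second phase cheap, since only those candidates are compared against the other clusters. For real text corpora one would likely replace the Euclidean distance with a similarity measure suited to sparse term vectors; that choice is also outside what the abstract states.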
Ethics declarations
Conflict of Interests
The authors hereby declare that they have no conflict of interest. This article does not contain any studies with human participants or animals performed by any of the authors.
About this article
Cite this article
Mustafi, D., Mustafi, A. A differential evolution based algorithm to cluster text corpora using lazy re-evaluation of fringe points. Multimed Tools Appl 82, 32177–32201 (2023). https://doi.org/10.1007/s11042-023-14716-3
DOI: https://doi.org/10.1007/s11042-023-14716-3