A differential evolution based algorithm to cluster text corpora using lazy re-evaluation of fringe points

Abstract

Document clustering is a well-established technique for segregating voluminous text corpora into distinct categories. In this paper we present an improved algorithm for clustering large text corpora. The proposed algorithm tries to overcome the challenges of clustering large corpora while maintaining high “goodness” values for the proposed clusters. The algorithm first forms the initial clusters by optimizing a fitness function using Differential Evolution. These clusters are then “refined” by re-evaluating the points that fall at the fringes of the clusters and reassigning them to other clusters if necessary. Two approaches, namely Nearest Cluster Based Re-evaluation (N-CBR) and Multiple Cluster Based Re-evaluation (M-CBR), are proposed for selecting candidates during the reassignment phase, and their performance is evaluated. The effect of this post-processing phase is demonstrated on a number of standard benchmark text corpora, and the algorithm is found to be quite accurate and efficient. The results obtained by the proposed method are also compared with those of other evolutionary strategies, such as Genetic Algorithm (GA), Particle Swarm Optimization (PSO) and Harmony Search (HS), and are found to be quite satisfactory.
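
The two phases described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the fitness function, the fringe criterion or the exact N-CBR and M-CBR rules, so everything below is an assumption made for illustration. The sketch assumes unit-length document vectors (for example TF-IDF), a fitness equal to the mean cosine distance to the nearest centroid, a standard DE/rand/1/bin scheme, a fringe test based on the gap between the two most similar centroids, and a nearest-neighbour vote for reassignment. The `multi` flag only loosely mimics the difference between considering the nearest competing cluster and considering several clusters. All function and parameter names are hypothetical.

```python
import numpy as np

def unit_rows(m):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)

def assign(docs, centroids):
    """Return nearest-centroid labels and the document-centroid similarity matrix."""
    sims = docs @ unit_rows(centroids).T
    return sims.argmax(axis=1), sims

def fitness(docs, centroids):
    """Assumed fitness: mean cosine distance to the nearest centroid (lower is better)."""
    _, sims = assign(docs, centroids)
    return float(np.mean(1.0 - sims.max(axis=1)))

def de_cluster(docs, k, pop_size=20, generations=60, f=0.5, cr=0.9, seed=0):
    """DE/rand/1/bin over individuals that each encode k centroids (needs pop_size >= 4)."""
    rng = np.random.default_rng(seed)
    n, d = docs.shape
    # Initialise every individual from k randomly chosen documents.
    pop = np.stack([docs[rng.choice(n, k, replace=False)] for _ in range(pop_size)])
    scores = np.array([fitness(docs, ind) for ind in pop])
    for _ in range(generations):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(others, 3, replace=False)]
            mutant = a + f * (b - c)                       # differential mutation
            mask = rng.random((k, d)) < cr                 # binomial crossover
            mask[rng.integers(k), rng.integers(d)] = True  # at least one mutant gene
            trial = np.where(mask, mutant, pop[i])
            score = fitness(docs, trial)
            if score <= scores[i]:                         # greedy selection
                pop[i], scores[i] = trial, score
    return pop[scores.argmin()]

def reevaluate_fringe(docs, centroids, margin=0.05, n_neighbors=5, multi=False):
    """Re-label fringe documents by a vote among their nearest non-fringe neighbours.

    Assumed fringe test: the best and second-best centroid similarities differ
    by less than `margin` (requires at least two centroids).
    multi=False considers only those two clusters; multi=True considers all.
    """
    labels, sims = assign(docs, centroids)
    order = np.argsort(sims, axis=1)
    best, second = order[:, -1], order[:, -2]
    rows = np.arange(len(docs))
    fringe = (sims[rows, best] - sims[rows, second]) < margin
    core = np.flatnonzero(~fringe)
    new_labels = labels.copy()
    if len(core) == 0:
        return new_labels, fringe
    for i in np.flatnonzero(fringe):
        candidates = range(len(centroids)) if multi else (best[i], second[i])
        nearest = core[np.argsort(docs[core] @ docs[i])[-n_neighbors:]]
        votes = np.bincount(labels[nearest], minlength=len(centroids))
        # Most votes wins; ties are broken by similarity to the centroid.
        new_labels[i] = max(candidates, key=lambda c: (votes[c], sims[i, c]))
    return new_labels, fringe
```

A call such as `centroids = de_cluster(docs, k=3)` followed by `labels, fringe = reevaluate_fringe(docs, centroids)` runs both phases. Only the documents flagged as fringe are reconsidered, which is what keeps the refinement step cheap in this sketch.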

Author information

Authors and Affiliations

Department of CSE, Birla Institute of Technology, Mesra, India
D. Mustafi & A. Mustafi

Authors

D. Mustafi
A. Mustafi

Corresponding author

Correspondence to D. Mustafi.

Ethics declarations

Conflict of Interests

The authors hereby declare that they have no conflict of interest. This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mustafi, D., Mustafi, A. A differential evolution based algorithm to cluster text corpora using lazy re-evaluation of fringe points. Multimed Tools Appl 82, 32177–32201 (2023). https://doi.org/10.1007/s11042-023-14716-3

Received: 25 March 2021
Revised: 24 March 2022
Accepted: 04 February 2023
Published: 02 March 2023
Issue Date: September 2023
DOI: https://doi.org/10.1007/s11042-023-14716-3
