Abstract
The Evolving Tree (ETree) is a hierarchical clustering and visualization model that allows the number of clusters to grow and evolve with new data samples in an online learning manner. While many hierarchical clustering models are available in the literature, ETree stands out because of its visualization capability. It is an enhancement of the Self-Organizing Map, a widely used clustering and visualization model. ETree organises the trained data samples in a tree structure, which improves presentation and visualization, especially for high-dimensional data samples. Although ETree has been used in a number of applications, its use in textual document clustering and visualization is limited. In this paper, ETree is modified and deployed as a model for textual document clustering and visualization problems. We introduce a new local re-learning procedure that allows the tree structure to grow and adapt to new features, i.e., new words from new textual documents. The performance of the proposed ETree model is evaluated with two document data sets, one benchmark and one real. Several key aspects of the proposed model are evaluated, including its topology representation, learning time, and recall and precision rates. The results show that the proposed local re-learning procedure is useful for handling an increasing number of features incrementally. In summary, this study contributes a modified ETree model and its use in a new domain, i.e., textual document clustering and visualization.
Acknowledgements
The authors thank the 2nd Regional Engineering Conference 2008 (EnCon 2008) and its organizing committee, and gratefully acknowledge their permission to use the collection of abstracts from EnCon 2008. Special thanks go to Miss Liew Hui Chang, who helped with information collection and compilation.
About this article
Cite this article
Chang, W.L., Tay, K.M. & Lim, C.P. A New Evolving Tree-Based Model with Local Re-learning for Document Clustering and Visualization. Neural Process Lett 46, 379–409 (2017). https://doi.org/10.1007/s11063-017-9597-3
DOI: https://doi.org/10.1007/s11063-017-9597-3