Abstract
Severe constraints imposed by the nature of endless sequences of data collected from unstable phenomena have pushed the understanding and the development of automated analysis strategies, such as data clustering techniques. However, current clustering validation approaches are inadequate to data streams due to they do not properly evaluate representation of behavior changes. This paper proposes a novel function to continuously evaluate data stream clustering inspired in Lyapunov energy functions used by techniques such as the Hopfield artificial neural network and the Bidirectional Associative Memory (Bam). The proposed function considers three terms: i) the intra-cluster distance, which allows to evaluate cluster compactness; ii) the inter-cluster distance, which reflects cluster separability; and iii) entropy estimation of the clustering model, which permits the evaluation of the level of uncertainty in data streams. A first set of experiments illustrate the proposed function applied to scenarios of continuous evaluation of data stream clustering. Further experiments were conducted to compare this new function to well-established clustering indices and results confirm our proposal reflects the same information obtained with external clustering indices.
Similar content being viewed by others
Notes
In our experiments, we set \(\delta = 0.001\).
In our experiments, we set \(\eta = 10^{-12}\).
x \(=\) rnorm(mean \(=\) 0, sd \(=\) 0.1, n \(=\) 10000) of the R tool (R Development Core Team 2011).
This is performed by using the R command \(x = sin(seq(1,50,\mathrm{length}=10{,}000)) * seq(1{,}1{,}000,\mathrm{length}=10{,}000)\).
References
Ackley D, Hinton G, Sejnowski T (1985) A learning algorithm for Boltzmann machines. Cogn Sci 9(1): 147–169
Aggarwal C, Han J, Wang J, Yu P (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, Berlin, Germany, pp 81–92
Albertini M, Mello R (2007) A self-organizing neural network for detecting novelties. In: Proceedings of the 2007 ACM symposium on Applied computing, New York, USA, pp 462–466. http://doi.acm.org/10.1145/1244002.1244110
Albertini M, Mello R (2008) Nature-inspired informatics for intelligent applications and knowledge discovery: implications in business, science and engineering. IGI Global, chap A self-organizing neural network to approach novelty detection, pp 49–71
Allcock B, Foster I, Nefedova V, Chervenak A, Deelman E, Kesselman C, Lee J, Sim A, Shoshani A, Williams B (2001) High-performance remote access to climate simulation data: a challenge problem for data grid technologies. In: Proceedings of the ACM/IEEE (2001) Conference supercomputing. Denver, USA, pp 1–20
Alligood K, Sauer T, Yorke J (1996) Chaos: an introduction to dynamical systems. Springer, Berlin
Amigó E, Gonzalo J, Artiles J, Verdejo F (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retriev 12:461–486
Biehl M, Ghosh A, Hammer B (2006) Learning vector quantization: The dynamics of winner-takes-all algorithms. Neurocomputing 69(7–9):660–670. http://www.sciencedirect.com/science/article/pii/S0925231205003218
Bifet A, Holmes G, Kirkby R, Pfahringer B (2010a) Moa: Massive online analysis. J Mach Learn Res 11:1601–1604
Bifet A, Holmes G, Pfahringer B, Kranen P, Kremer H, Jansen T, Seidl T (2010b) Moa: Massive online analysis, a framework for stream classification and clustering. In: Journal of Machine Learning Research (JMLR) workshop and conference proceedings: workshop on applications of pattern analysis. J Mach Learn Res 11:44–50
Bifet A, Holmes G, Pfahringer B, Read J, Kranen P, Kremer H, Jansen T, Seidl T (2011) Moa: a real-time analytics open source framework. In: Machine learning and knowledge discovery in databases, lecture notes in computer science, vol 6913. Springer, Berlin, pp 617–620
Box GEP, Jenkins GM (1976) Time series analysis, forecasting, and control. Holden-Day, San Francisco
Cao F, Ester M, Qian W, Zhou A (2006) In density-based clustering over an evolving data stream with noise. SDM
Carpenter GA, Grossberg S, Rosen D (1991) Art 2-A: an adaptive resonance algorithm for rapid category learning and recognition. IJCNN-91-Seattle international joint conference on neural networks, 1991 ii, vol. 2, pp 151–156. http://dx.doi.org/10.1109/IJCNN.1991.155329
Freeman JA, Skapura DM (1991) Neural networks: algorithms, applications, and programming techniques. Addison Wesley, Redwood City
Fritzke B (1995) A growing neural gas network learns topologies. In: Advances in neural information processing systems, vol 7. MIT Press, Cambridge, pp 625–632. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.4273
Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of the (1998) ACM SIGMOD international conference on Management of data. ACM, New York, pp 73–84
Guha S, Meyerson A, Mishra N, Motwani R, O’Callaghan L (2003) Clustering data streams: Theory and practice. IEEE Trans Knowl Data Eng 15(3):515–528
Hartigan J, Wong M (1979) Algorithm as 136: a k-means clustering algorithm. J R Stat Soc Ser C (Applied Statistics) 28(1):100–108
Haykin S (1999) Neural networks: a comprehensive foundation. Prentice Hall, Englewood Cliffs
Hinneburg A, Keim D (1999) Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of the 25th international conference on very large data bases, San Francisco, USA, pp 506–517
Holt CC (2004) Forecasting seasonals and trends by exponentially weighted moving averages. Int J Forecast 20(1):5–10
Itti L, Baldi P (2005) A principled approach to detecting surprising events in video. In: Proceedings of the (2005) IEEE Computer Society conference on computer vision and pattern recognition. Washington, USA, pp 631–637
Jain A (2010) Data clustering: 50 years beyond K-means. Pattern Recogn Lett 31(8):651–666
Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Kifer D, Ben-David S, Gehrke J (2004) Detecting change in data streams. In Proceedings of the thirtieth international conference on very large data bases-volume. VLDB Endowment, pp 180–191
Kohonen T (1997) Self-organizing maps. Springer, Secaucus
Kosko B (1988) Bidirectional associative memories. IEEE Trans Syst Man Cybern 18(1):49–60
Kranen P, Assent I, Baldauf C, Seidl T (2011) The clustree: Indexing micro-clusters for anytime stream mining. In: Knowl Inf Syst J 29(2):249–272
Lindstrom P, Delany S, Mac Namee B (2010) Handling concept drift in a text data stream constrained by high labelling cost. In: Proceedings of the Florida Artificial Intelligence Research Society conference, Daytona Beach, USA, pp 1–8
Ma J, Perkins S (2003) Online novelty detection on temporal sequences. In: KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM Press, New York, pp 613–618. http://doi.acm.org/10.1145/956750.956828
Marsland S (2002) On-line novelty detection through self-organisation, with application to inspection robotics. PhD thesis, University of Manchester
Masud M, Chen Q, Gao J, Khan L, Han J, Thuraisingham B (2010) Classification and novel class detection of data streams in a dynamic feature space. Mach Learn Knowl Discov Datab 6322:337–352
Meila M (2003) Comparing clusterings by the variation of information, pp 173–187. http://www.springerlink.com/content/rdda4r5bm9ajlbw3
O’Callaghan L, Meyerson A, Motwani R, Mishra N, Guha S (2002a) Streaming-data algorithms for high-quality clustering. In: ICDE, pp 685–694
O’Callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002b) Streaming-data algorithms for high-quality clustering. In: Proceedings of the 18th IEEE international conference on data engineering, San Jose, California, pp 685–694
Pavlidis N, Tasoulis D, Adams N, Hand D (2011) [lambda]-Perceptron: an adaptive classifier for data streams. Pattern Recogn 44:78–96
Pincus SM (1991) Approximate entropy as a measure of system complexity. Proc Natl Acad Sci USA 88:2297–2301
R Development Core Team (2011) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Roberts SW (1966) A comparison of some control chart procedures. Technometrics 8(3):411–430. http://www.jstor.org/stable/1266688
Rousseeuw PJ (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65. DOI:10.1016/0377-0427(87)90125-7. http://www.sciencedirect.com/science/article/pii/037704278790-1257
Seneta E (2006) Non-negative matrices and Markov chains (Springer Series in Statistics), 2nd edn. Springer, Berlin. http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0387297650
Shannon C (1948) A mathematical theory of communication. Bell Syst Tech J 27(379–423):623–656
Sheikholeslami G, Chatterjee S, Zhang A (1998) Wavecluster: a multi-resolution clustering approach for very large spatial databases. In: Proceedings of the international conference on very large data bases, San Francisco, USA, pp 428–439
Spinosa E, de Leon F et al (2008) Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In: Proceedings of the (2008) ACM symposium on applied computing. ACM, New York, pp 976–980
Stanley JC (1976) Computer simulation of a model of habituation. Nature 261:146–148. doi:10.1038/261146a0. http://www.nature.com/nature/journal/v261/n5556/abs/261146a0.html
Swenson G Jr, Kellermann K (1975) An intercontinental array—a next-generation radio telescope. Science 188(4195):1263
Tyson T, Pike R, Stein M, Szalay A (2002) Managing and mining the Lsst data sets. Tech. rep, The Lsst Collaboration
Wang W, Yang J, Muntz R (1997) Sting: A statistical information grid approach to spatial data mining. In: Proceedings of the 23rd International Conference on Very Large Data Bases, San Francisco, USA, pp 186–195
Winters PR (1960) Forecasting sales by exponentially weighted moving averages. Manage Sci 6(3):324–342
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678
Young S, Arel I, Karnowski TP, Rose D (2010) A fast and stable incremental clustering algorithm. In: Third international conference on information technology: new generations. IEEE Computer Society, Los Alamitos, pp 204–209. http://doi.ieeecomputersociety.org/10.1109/ITNG.2010.148
Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD international conference on Management of data, vol 2. ACM, New York, pp 103–114
Acknowledgments
This paper is based upon work supported by FAPESP—São Paulo Research Foundation, Brazil, under grant no. 2011/19459-8, “CAPES—Coordenadoria de Aperfeiçoamento de Pessoal de Nível Superior” research funding agency under grant no. PDEE-4443-08-0, CNPq—National Council for Scientific and Technological Development research funding agency under grant no. 304338/2008-7. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of FAPESP, CAPES and CNPq.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Albertini, M.K., de Mello, R.F. Energy-based function to evaluate data stream clustering. Adv Data Anal Classif 7, 435–464 (2013). https://doi.org/10.1007/s11634-013-0145-3
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11634-013-0145-3