Energy-based function to evaluate data stream clustering

Marcelo Keese Albertini¹ &
Rodrigo Fernandes de Mello²

481 Accesses
6 Citations
Explore all metrics

Abstract

Severe constraints imposed by the nature of endless sequences of data collected from unstable phenomena have pushed the understanding and the development of automated analysis strategies, such as data clustering techniques. However, current clustering validation approaches are inadequate to data streams due to they do not properly evaluate representation of behavior changes. This paper proposes a novel function to continuously evaluate data stream clustering inspired in Lyapunov energy functions used by techniques such as the Hopfield artificial neural network and the Bidirectional Associative Memory (Bam). The proposed function considers three terms: i) the intra-cluster distance, which allows to evaluate cluster compactness; ii) the inter-cluster distance, which reflects cluster separability; and iii) entropy estimation of the clustering model, which permits the evaluation of the level of uncertainty in data streams. A first set of experiments illustrate the proposed function applied to scenarios of continuous evaluation of data stream clustering. Further experiments were conducted to compare this new function to well-established clustering indices and results confirm our proposal reflects the same information obtained with external clustering indices.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Institutional subscriptions

An evolving approach to the similarity-based modeling for online clustering in non-stationary environments

Article 30 November 2024

Internal Clustering Evaluation of Data Streams

Statistical hierarchical clustering algorithm for outlier detection in evolving data streams

Article 04 September 2020

Notes

In our experiments, we set \(\delta = 0.001\).
In our experiments, we set \(\eta = 10^{-12}\).
x \(=\) rnorm(mean \(=\) 0, sd \(=\) 0.1, n \(=\) 10000) of the R tool (R Development Core Team 2011).
This is performed by using the R command \(x = sin(seq(1,50,\mathrm{length}=10{,}000)) * seq(1{,}1{,}000,\mathrm{length}=10{,}000)\).

References

Ackley D, Hinton G, Sejnowski T (1985) A learning algorithm for Boltzmann machines. Cogn Sci 9(1): 147–169
Article Google Scholar
Aggarwal C, Han J, Wang J, Yu P (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, Berlin, Germany, pp 81–92
Albertini M, Mello R (2007) A self-organizing neural network for detecting novelties. In: Proceedings of the 2007 ACM symposium on Applied computing, New York, USA, pp 462–466. http://doi.acm.org/10.1145/1244002.1244110
Albertini M, Mello R (2008) Nature-inspired informatics for intelligent applications and knowledge discovery: implications in business, science and engineering. IGI Global, chap A self-organizing neural network to approach novelty detection, pp 49–71
Allcock B, Foster I, Nefedova V, Chervenak A, Deelman E, Kesselman C, Lee J, Sim A, Shoshani A, Williams B (2001) High-performance remote access to climate simulation data: a challenge problem for data grid technologies. In: Proceedings of the ACM/IEEE (2001) Conference supercomputing. Denver, USA, pp 1–20
Alligood K, Sauer T, Yorke J (1996) Chaos: an introduction to dynamical systems. Springer, Berlin
MATH Google Scholar
Amigó E, Gonzalo J, Artiles J, Verdejo F (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retriev 12:461–486
Article Google Scholar
Biehl M, Ghosh A, Hammer B (2006) Learning vector quantization: The dynamics of winner-takes-all algorithms. Neurocomputing 69(7–9):660–670. http://www.sciencedirect.com/science/article/pii/S0925231205003218
Google Scholar
Bifet A, Holmes G, Kirkby R, Pfahringer B (2010a) Moa: Massive online analysis. J Mach Learn Res 11:1601–1604
Google Scholar
Bifet A, Holmes G, Pfahringer B, Kranen P, Kremer H, Jansen T, Seidl T (2010b) Moa: Massive online analysis, a framework for stream classification and clustering. In: Journal of Machine Learning Research (JMLR) workshop and conference proceedings: workshop on applications of pattern analysis. J Mach Learn Res 11:44–50
Google Scholar
Bifet A, Holmes G, Pfahringer B, Read J, Kranen P, Kremer H, Jansen T, Seidl T (2011) Moa: a real-time analytics open source framework. In: Machine learning and knowledge discovery in databases, lecture notes in computer science, vol 6913. Springer, Berlin, pp 617–620
Box GEP, Jenkins GM (1976) Time series analysis, forecasting, and control. Holden-Day, San Francisco
MATH Google Scholar
Cao F, Ester M, Qian W, Zhou A (2006) In density-based clustering over an evolving data stream with noise. SDM
Carpenter GA, Grossberg S, Rosen D (1991) Art 2-A: an adaptive resonance algorithm for rapid category learning and recognition. IJCNN-91-Seattle international joint conference on neural networks, 1991 ii, vol. 2, pp 151–156. http://dx.doi.org/10.1109/IJCNN.1991.155329
Freeman JA, Skapura DM (1991) Neural networks: algorithms, applications, and programming techniques. Addison Wesley, Redwood City
MATH Google Scholar
Fritzke B (1995) A growing neural gas network learns topologies. In: Advances in neural information processing systems, vol 7. MIT Press, Cambridge, pp 625–632. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.4273
Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of the (1998) ACM SIGMOD international conference on Management of data. ACM, New York, pp 73–84
Guha S, Meyerson A, Mishra N, Motwani R, O’Callaghan L (2003) Clustering data streams: Theory and practice. IEEE Trans Knowl Data Eng 15(3):515–528
Article Google Scholar
Hartigan J, Wong M (1979) Algorithm as 136: a k-means clustering algorithm. J R Stat Soc Ser C (Applied Statistics) 28(1):100–108
MATH Google Scholar
Haykin S (1999) Neural networks: a comprehensive foundation. Prentice Hall, Englewood Cliffs
MATH Google Scholar
Hinneburg A, Keim D (1999) Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of the 25th international conference on very large data bases, San Francisco, USA, pp 506–517
Holt CC (2004) Forecasting seasonals and trends by exponentially weighted moving averages. Int J Forecast 20(1):5–10
Article Google Scholar
Itti L, Baldi P (2005) A principled approach to detecting surprising events in video. In: Proceedings of the (2005) IEEE Computer Society conference on computer vision and pattern recognition. Washington, USA, pp 631–637
Google Scholar
Jain A (2010) Data clustering: 50 years beyond K-means. Pattern Recogn Lett 31(8):651–666
Article Google Scholar
Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Article Google Scholar
Kifer D, Ben-David S, Gehrke J (2004) Detecting change in data streams. In Proceedings of the thirtieth international conference on very large data bases-volume. VLDB Endowment, pp 180–191
Kohonen T (1997) Self-organizing maps. Springer, Secaucus
Book MATH Google Scholar
Kosko B (1988) Bidirectional associative memories. IEEE Trans Syst Man Cybern 18(1):49–60
Article MathSciNet Google Scholar
Kranen P, Assent I, Baldauf C, Seidl T (2011) The clustree: Indexing micro-clusters for anytime stream mining. In: Knowl Inf Syst J 29(2):249–272
Lindstrom P, Delany S, Mac Namee B (2010) Handling concept drift in a text data stream constrained by high labelling cost. In: Proceedings of the Florida Artificial Intelligence Research Society conference, Daytona Beach, USA, pp 1–8
Ma J, Perkins S (2003) Online novelty detection on temporal sequences. In: KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM Press, New York, pp 613–618. http://doi.acm.org/10.1145/956750.956828
Marsland S (2002) On-line novelty detection through self-organisation, with application to inspection robotics. PhD thesis, University of Manchester
Masud M, Chen Q, Gao J, Khan L, Han J, Thuraisingham B (2010) Classification and novel class detection of data streams in a dynamic feature space. Mach Learn Knowl Discov Datab 6322:337–352
Article Google Scholar
Meila M (2003) Comparing clusterings by the variation of information, pp 173–187. http://www.springerlink.com/content/rdda4r5bm9ajlbw3
O’Callaghan L, Meyerson A, Motwani R, Mishra N, Guha S (2002a) Streaming-data algorithms for high-quality clustering. In: ICDE, pp 685–694
O’Callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002b) Streaming-data algorithms for high-quality clustering. In: Proceedings of the 18th IEEE international conference on data engineering, San Jose, California, pp 685–694
Pavlidis N, Tasoulis D, Adams N, Hand D (2011) [lambda]-Perceptron: an adaptive classifier for data streams. Pattern Recogn 44:78–96
Article MATH Google Scholar
Pincus SM (1991) Approximate entropy as a measure of system complexity. Proc Natl Acad Sci USA 88:2297–2301
Article MathSciNet MATH Google Scholar
R Development Core Team (2011) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Roberts SW (1966) A comparison of some control chart procedures. Technometrics 8(3):411–430. http://www.jstor.org/stable/1266688
Google Scholar
Rousseeuw PJ (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65. DOI:10.1016/0377-0427(87)90125-7. http://www.sciencedirect.com/science/article/pii/037704278790-1257
Seneta E (2006) Non-negative matrices and Markov chains (Springer Series in Statistics), 2nd edn. Springer, Berlin. http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0387297650
Shannon C (1948) A mathematical theory of communication. Bell Syst Tech J 27(379–423):623–656
Article MathSciNet Google Scholar
Sheikholeslami G, Chatterjee S, Zhang A (1998) Wavecluster: a multi-resolution clustering approach for very large spatial databases. In: Proceedings of the international conference on very large data bases, San Francisco, USA, pp 428–439
Spinosa E, de Leon F et al (2008) Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In: Proceedings of the (2008) ACM symposium on applied computing. ACM, New York, pp 976–980
Stanley JC (1976) Computer simulation of a model of habituation. Nature 261:146–148. doi:10.1038/261146a0. http://www.nature.com/nature/journal/v261/n5556/abs/261146a0.html
Google Scholar
Swenson G Jr, Kellermann K (1975) An intercontinental array—a next-generation radio telescope. Science 188(4195):1263
Article Google Scholar
Tyson T, Pike R, Stein M, Szalay A (2002) Managing and mining the Lsst data sets. Tech. rep, The Lsst Collaboration
Google Scholar
Wang W, Yang J, Muntz R (1997) Sting: A statistical information grid approach to spatial data mining. In: Proceedings of the 23rd International Conference on Very Large Data Bases, San Francisco, USA, pp 186–195
Winters PR (1960) Forecasting sales by exponentially weighted moving averages. Manage Sci 6(3):324–342
Article MathSciNet MATH Google Scholar
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678
Article Google Scholar
Young S, Arel I, Karnowski TP, Rose D (2010) A fast and stable incremental clustering algorithm. In: Third international conference on information technology: new generations. IEEE Computer Society, Los Alamitos, pp 204–209. http://doi.ieeecomputersociety.org/10.1109/ITNG.2010.148
Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD international conference on Management of data, vol 2. ACM, New York, pp 103–114

Download references

Acknowledgments

This paper is based upon work supported by FAPESP—São Paulo Research Foundation, Brazil, under grant no. 2011/19459-8, “CAPES—Coordenadoria de Aperfeiçoamento de Pessoal de Nível Superior” research funding agency under grant no. PDEE-4443-08-0, CNPq—National Council for Scientific and Technological Development research funding agency under grant no. 304338/2008-7. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of FAPESP, CAPES and CNPq.

Author information

Authors and Affiliations

Faculty of Computing, University of Uberlandia, Av. Joao Naves de Avila 2121, Uberlandia, MG, Brazil
Marcelo Keese Albertini
Department of Computer Science, University of Sao Paulo, Av. Trabalhador Saocarlense 400, Sao Carlos, SP, Brazil
Rodrigo Fernandes de Mello

Authors

Marcelo Keese Albertini
View author publications
You can also search for this author in PubMed Google Scholar
Rodrigo Fernandes de Mello
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Marcelo Keese Albertini.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Albertini, M.K., de Mello, R.F. Energy-based function to evaluate data stream clustering. Adv Data Anal Classif 7, 435–464 (2013). https://doi.org/10.1007/s11634-013-0145-3

Download citation

Received: 21 December 2012
Revised: 29 April 2013
Accepted: 20 July 2013
Published: 02 August 2013
Issue Date: December 2013
DOI: https://doi.org/10.1007/s11634-013-0145-3

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

An evolving approach to the similarity-based modeling for online clustering in non-stationary environments

Internal Clustering Evaluation of Data Streams

Statistical hierarchical clustering algorithm for outlier detection in evolving data streams

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification

Subscribe and save

Buy Now

Navigation

Energy-based function to evaluate data stream clustering

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

An evolving approach to the similarity-based modeling for online clustering in non-stationary environments

Internal Clustering Evaluation of Data Streams

Statistical hierarchical clustering algorithm for outlier detection in evolving data streams

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Subscribe and save

Buy Now

Search

Navigation