[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Data stream clustering: A survey

Published: 11 July 2013 Publication History

Abstract

Data stream mining is an active research area that has recently emerged to discover knowledge from large amounts of continuously generated data. In this context, several data stream clustering algorithms have been proposed to perform unsupervised learning. Nevertheless, data stream clustering imposes several challenges to be addressed, such as dealing with nonstationary, unbounded data that arrive in an online fashion. The intrinsic nature of stream data requires the development of algorithms capable of performing fast and incremental processing of data objects, suitably addressing time and memory limitations. In this article, we present a survey of data stream clustering algorithms, providing a thorough discussion of the main design components of state-of-the-art algorithms. In addition, this work addresses the temporal aspects involved in data stream clustering, and presents an overview of the usually employed experimental methodologies. A number of references are provided that describe applications of data stream clustering in different domains, such as network intrusion detection, sensor networks, and stock market analysis. Information regarding software packages and data repositories are also available for helping researchers and practitioners. Finally, some important issues and open questions that can be subject of future research are discussed.

Supplementary Material

a13-silva-apndx.pdf (silva.zip)
Supplemental movie, appendix, image and software files for, Data stream clustering: A survey

References

[1]
Ackermann, M. R., Martens, M., Raupach, C., Swierkot, K., Lammersen, C., and Sohler, C. 2012. StreamKM++: A clustering algorithm for data streams. ACM J. Exper. Algor. 17, 1.
[2]
Agarwal, P. K., Har-Peled, S., and Varadarajan, K. R. 2004a. Approximating extent measures of points. J. ACM 51, 4, 606--635.
[3]
Aggarwal, C. C. 2003. A framework for diagnosing changes in evolving data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'03). 575--586.
[4]
Aggarwal, C. C. 2007. Data Streams -- Models and Algorithms. Springer.
[5]
Aggarwal, C. C. 2010. A segment-based framework for modeling and mining data streams. Knowl. Inf. Syst, 30, 1, 1--29.
[6]
Aggarwal, C. C., Han, J., Wang, J., and Yu, P. S. 2003. A framework for clustering evolving data streams. In Proceedings of the 29th Conference on Very Large Data Bases (VLDB'03). Vol. 29, 81--92.
[7]
Aggarwal, C. C., Han, J., Wang, J., and Yu, P. S. 2004b. A framework for projected clustering of high dimensional data streams. In Proceedings of the 30th Conference on Very Large Data Bases (VLDB'04). Vol. 30, 852--863.
[8]
Aggarwal, C. C. and Yu, P. S. 2006. A framework for clustering massive text and categorical data streams. In Proceedings of the 6th International Conference on Data Mining (SDM'06). 479.
[9]
Aggarwal, C. C. and Yu, P. S. 2008. A framework for clustering uncertain data streams. In Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE'08). 150--159.
[10]
Aghbari, Z. A., Kamel, I., and Awad, T. 2012. On clustering large number of data streams. Intell. Data Anal. 16, 1, 69--91.
[11]
Amini, A., Wah, T. Y., Saybani, M. R., Aghabozorgi, S. R., and Yazdi, S. 2011. A study of density-grid based clustering algorithms on data streams. In Proceedings of the 8th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'11). IEEE Press, Los Alamitos, CA, 1652--1656.
[12]
Arabie, P. and Hubert, L. J. 1999. An Overview of Combinatorial Data Analysis. World Scientific Publishing.
[13]
Arthur, D. and Vassilvitskii, S. 2006. How slow is the k-means method? In Proceedings of the 22nd Annual Symposium on Computational Geometry (SCG'06). ACM Press, New York, 144--153.
[14]
Arthur, D. and Vassilvitskii, S. 2007. K-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms. 1027--1035.
[15]
Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. 2002. Models and issues in data stream systems. In Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'02). ACM Press, New York, 1--16.
[16]
Babcock, B., Datar, M., Motwani, R., and O'Callaghan, L. 2003. Maintaining variance and k-medians over data stream windows. In Proceedings of the 22nd ACM SIGMOD-SIGACT SIGART Symposium on Principles of Database Systems. ACM Press, New York, 234--243.
[17]
Badoiu, M., Har-Peled, S., and Indyk, P. 2002. Approximate clustering via core-sets. In Proceedings of the 34th ACM Symposium on Theory of Computing (STOC'02). ACM Press, New York, 250--257.
[18]
Ball, G. H. and Hall, D. J. 1965. ISODATA. A novel method of data analysis and pattern classification. Tech. rep. Stanford Research Institute, Menlo Park.
[19]
Barbara, D. 2002. Requirements for clustering data streams. SIGKDD Explorations 3, 23--27.
[20]
Bentley, J. L. 1975. Multidimensional binary search trees used for associative searching. Comm. ACM 18, 9, 509--517.
[21]
Bentley, J. L. and Saxe, J. B. 1980. Decomposable searching problems I: Static-to-dynamic transformation. J. Algor. 1, 4, 301--358.
[22]
Bhat, U. N. and Miller, G. K. 2002. Elements of Applied Stochastic Processes 3rd Ed. John Wiley and Sons, New Jersey.
[23]
Bifet, A. 2010. Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. IOS Press.
[24]
Bifet, A., Holmes, G., Kirkby, R., and Pfahringer, B. 2010. MOA: Massive online analysis. J. Mach. Learn. Res. 11, 1601--1604.
[25]
Bradley, P. S. and Fayyad, U. M. 1998a. Refining initial points for k-means clustering. In Proceedings of the 15th International Conference on Machine Learning (ICML'98). Morgan Kaufmann Publishers, San Francisco, 91--99.
[26]
Bradley, P. S., Fayyad, U. M., and Reina, C. 1998b. Scaling clustering algorithms to large databases. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD'98). AAAI Press, 9--15.
[27]
Cao, F., Ester, M., Qian, W., and Zhou, A. 2006. Density-based clustering over an evolving data stream with noise. In Proceedings of the 6th SIAM International Conference on Data Mining. 328--339.
[28]
Charikar, M. and Guha, S. 1999. Improved combinatorial algorithms for the facility location and k-median problems. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science. 378--388.
[29]
Charikar, M., O'Callaghan, L., and Panigrahy, R. 2003. Better streaming algorithms for clustering problems. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing. ACM Press, New York, 30--39.
[30]
Chen, Y. and Tu, L. 2007. Density-based clustering for real-time stream data. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, 133--142.
[31]
Cho, K., Jo, S., Jang, H., Kim, S. M., and Song, J. 2006. DCF: An efficient data stream clustering framework for streaming applications. In Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA'06). 114--122.
[32]
Csernel, B., Clerot, F., and Hebrail, G. 2006. Datastream clustering over tilted windows through sampling. In Proceedings of the Knowledge Discovery from Data Streams Workshop (ECML/PKDD).
[33]
Dang, X. H., Lee, V. C. S., Ng, W. K., Ciptadi, A., and Ong, K.-L. 2009. An em-based algorithm for clustering data streams in sliding windows. In Proceedings of the 14th International Conference on Database Systems for Advanced Applications. Lecture Notes in Computer Science, vol. 5463, Springer, 230--235.
[34]
Domingos, P. and Hulten, G. 2001. A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the 8th International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, 106--113.
[35]
Dunham, M. H., Menga, Y., and Huang, J. 2004. Extensible markov model. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM'04). 371--374.
[36]
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. 226--231.
[37]
Ribeiro, E. F., Barros, R. C., Carvalho, A. C. P. L. F., and Gama, J. 2012. improving the offline clustering stage of data stream algorithms in scenarios with variable number of clusters. In Proceedings of the 27th ACM Symposium on Applied Computing (SAC'12). 572--573.
[38]
Farnstrom, F., Lewis, J., and Elkan, C. 2000. Scalability for clustering algorithms revisited. SIGKDD Exploration Newslett. 2, 1, 51--57.
[39]
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. 1996. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining. American Association for Artificial Intelligence, Menlo Park, CA, 1--34.
[40]
Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml.
[41]
Gaber, M. M., Vatsavai, R. R., Omitaomu, O. A., Gama, J., Chawla, N. V., and Ganguly, A. R. 2010. Knowledge Discovery from Sensor Data. Springer.
[42]
Gama, J. 2010. Knowledge Discovery from Data Streams. Chapman Hall/CRC.
[43]
Gama, J. and Gaber, M. M. 2007. Learning from Data Streams: Processing Techniques in Sensor Networks. Springer.
[44]
Gama, J., Medas, P., Castillo, G., and Rodrigues, P. P. 2004. Learning with drift detection. In Proceedings of the 17th Brazilian Symposium on Artificial Intelligence (SBIA'04). Vol. 3171., 286--295.
[45]
Gama, J., Pereira, P. R., and Lopes, L. 2011. Clustering distributed sensor data streams using local processing and reduced communication. Intell. Data Anal. 15, 1, 3--28.
[46]
Gama, J. and Pinto, C. 2006. Discretization from data streams: Applications to histograms and data mining. In Proceedings of the ACM Symposium on Applied Computing (SAC'06). 662--667.
[47]
Gan, G., Ma, C., and Wu, J. 2007. Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied Probability. SIAM.
[48]
Gao, M.-M., Liu, J.-Z., and Gao, X.-X. 2010. Application of compound gaussian mixture model clustering in the data stream. In Proceedings of the International Conference on Computer Application and System Modeling (ICCASM'10).
[49]
Gonzalez, T. F. 1985. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci. 38, 293--306.
[50]
Guha, S. 2009. Tight results for clustering and summarizing data streams. In Proceedings of the 12th International Conference on Database Theory (ICDT'09). ACM Press, New York, 268--275.
[51]
Guha, S., Meyerson, A., Mishra, N., Motwani, R., and O'Callaghan, L. 2003. Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Engin. 15, 515--528.
[52]
Guha, S., Mishra, N., Motwani, R., and O'Callaghan, L. 2000. Clustering data streams. In Proceedings of the IEEE Symposium on Foundations of Computer Science. 359--366.
[53]
Hahsler, M. and Dunham, M. H. 2010. rEMM: Extensible markov model for data stream clustering in r. J. Statist. Softw. 35, 5, 1--31.
[54]
Hahsler, M. and Dunham, M. H. 2011. Temporal structure learning for clustering massive data streams in real-time. In Proceedings of the SIAM Conference on Data Mining. SIAM/Omnipress, 664--675.
[55]
Han, J. and Kamber, M. 2000. Data Mining: Concepts and Techniques. Morgan Kaufmann.
[56]
Har-Peled, S. and Mazumdar, S. 2004. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing. 291--300.
[57]
Hulten, G. and Domingos, P. 2003. VFML -- A toolkit for mining high-speed time-changing data streams. Tech. rep. University of Washington. http://www.cs.washington.edu/dm/vfml/.
[58]
Isaksson, C., Dunham, M. H., and Hahsler, M. 2012. SOStream: Self organizing density-based clustering over data stream. In Proceedings of the 8th International Conference on Machine Learning and Data Mining in Pattern Recognition. Lecture Notes in Computer Science, vol. 7376, Springer, 264--278.
[59]
Jain, A. K. 2009. Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31, 651--666.
[60]
Jiang, N. and Gruenwald, L. 2006. Research issues in data stream association rule mining. SIGMOD Rec. 35, 1, 14--19.
[61]
Kaufman, L. and Rousseeuw, P. J. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Interscience.
[62]
Kavitha, V. and Punithavalli, M. 2010. Clustering time series data stream - A literature survey. Int. J. Comput. Sci. Inf. Secur. 8, 1, 289--294.
[63]
Khalilian, M. and Mustapha, N. 2010. Data stream clustering: challenges and issues. In Proceedings of International Multi Conference of Engineers and Computer Scientists. 566--569.
[64]
Kogan, J. 2007. Introduction to Clustering Large and High-Dimensional Data. Cambridge University Press.
[65]
Kontaki, M., Papadopoulos, A. N., and Manolopoulos, Y. 2008. Continuous trend-based clustering in data streams. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery. 251--262.
[66]
Kranen, P., Assent, I., Baldauf, C., and Seidl, T. 2011. The clustree: Indexing microclusters for anytime stream mining. Knowl. Inf. Syst. 29, 2, 249--272.
[67]
Kremer, H., Kranen, P., Jansen, T., Seidl, T., Bifet, A., Holmes, G., and Pfahringer, B. 2011. An effective evaluation measure for clustering on evolving data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'11). ACM Press, New York, 868--876.
[68]
Li, Q., Ma, X., Tang, S., and Xie, S. 2011. Continuously identifying representatives out of massive streams. In Proceedings of the 7th International Conference on Advanced Data Mining and Applications (ADMA'11). Springer, 1--14.
[69]
Li, Y. and Tan, B. H. 2011. Data stream clustering algorithm based on affinity propagation and density. Advanced Materials Res. 267, 444--449.
[70]
Liu, Y.-B., Cai, J.-R., Yin, J., and Fu, A. W.-C. 2008. Clustering text data streams. J. Comput. Sci. Technol. 23, 1, 112--128.
[71]
Lloyd, S. P. 1982. Least squares quantization in pcm. IEEE Trans. Inf. Theory 28, 2, 129--137.
[72]
Luhr, S. and Lazarescu, M. 2009. Incremental clustering of dynamic data streams using connectivity based representative points. Data Knowl. Engin. 68, 1--27.
[73]
MacQueen, J. B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. L. M. Le Cam and J. Neyman, Eds., Vol. 1., 281--297.
[74]
De Maesschalck, R., Jouan-Rimbaud, D., and Massart, D. L. 2000. The mahalanobis distance. Chemometrics Intell. Laboratory Syst. 50, 1--18.
[75]
Mahdiraji, A. R. 2009. Clustering data stream: A survey of algorithms. Int. J. Knowl.-Based Intell. Engin. Syst. 13, 2, 39--44.
[76]
Markov, A. 1971. Extension of the limit theorems of probability theory to a sum of variables connected in a chain. In Dynamic Probabilistic Systems, Vol. 1, R. Howard, Ed., John Wiley and Sons, Chapter Appendix B, 552--577.
[77]
Metwally, A., Agrawal, D., and Abbadi, A. E. L. 2005. Duplicate detection in click streams. In Proceedings of the 14th International Conference on World Wide Web. ACM Press, New York, 12--21.
[78]
Meyerson, A. 2001. Online facility location. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science. 426--431.
[79]
Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., and Euler, T. 2006. YALE: Rapid prototyping for complex data mining tasks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06). L. Ungar, M. Craven, D. Gunopulos, and T. Eliassi-Rad, Eds., ACM Press, New York, 935--940.
[80]
O'Callaghan, L., Mishra, N., Meyerson, A., Guha, S., and Motwani, R. 2002. Streaming data algorithms for high-quality clustering. In Proceedings of the 18th International Conference on Data Engineering. 685--694.
[81]
Oliveira, M. and Gama, J. 2010. MEC --Monitoring clusters' transitions. In Proceedings of the 5th Starting AI Researchers Symposium. IOS Press, 212--224.
[82]
Ong, K. L, Li, W., Ng, W.-K., and Lim, E.-P. 2004. SCLOPE: An algorithm for clustering data streams of categorical attributes. In Proceedings of the 6th International Conference on Data Warehousing and Knowledge Discovery (KDD'04). 209--218.
[83]
Oliveira, M. D. B. and Gama, J. 2012. A framework to monitor clusters evolution applied to economy and finance problems. Intell. Data Anal. 16, 1, 93--111.
[84]
Ostfeld, A., Uber, J. G., Salomons, E., et al. 2008. The battle of the water sensor networks (BWSN): A design challenge for engineers and algorithms. J. Water Resources Plan. Manag. 134, 556.
[85]
Park, N. H. and Suk Lee, W. 2007. Cell trees: An adaptive synopsis structure for clustering multidimensional on-line data streams. Data Knowl. Engin. 63, 2, 528--549.
[86]
Ren, J. and Ma, R. 2009. Density-based data streams clustering over sliding windows. In Proceedings of the 6th International Conference on Fuzzy Systems and Knowledge Discovery. Vol. 5. 248--252.
[87]
Rodrigues, P., Gama, J., and Pedroso, J. P. 2006. ODAC: Hierarchical clustering of time series data streams. In Proceedings of the 6th SIAM International Conference on Data Mining. 499--503.
[88]
Rodrigues, P. P., Gama, J., and Pedroso, J. P. 2008. Hierarchical clustering of time-series data streams. IEEE Trans. Knowl. Data Engin. 20, 5, 615--627.
[89]
Serir, L., Ramasso, E., and Zerhouni, N. 2012. Evidential evolving gustafson--kessel algorithm for online data streams partitioning using belief function theory. Int. J. Approximate Reason. 53, 5, 1--22.
[90]
Shah, R., Krishnaswamy, S., and Gaber, M. M. 2005. Resource-aware very fast k-means for ubiquitous data stream mining. In Proceedings of the 2nd International Workshop on Knowledge Discovery in Data Streams, Held in Conjunction with the 16th European Conference on Machine Learning (ECML'05).
[91]
Silva, A., Chiky, R., and Hebrail, G. 2011. A clustering approach for sampling data streams in sensor networks. Knowl. Inf. Syst. 32, 1, 1--23.
[92]
Silva, J. A. and Hruschka, E. R. 2011. Extending k-means-based algorithms for evolving data streams with variable number of clusters. In Proceedings of the 4th International Conference on Machine Learning and Applications (ICMLA'11). Vol. 2. 14--19.
[93]
Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., and Schult, R. 2006. MONIC: Modeling and monitoring cluster transitions. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06). ACM Press, New York, 706--711.
[94]
Steinhaus, H. 1956. Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci 1, 801--804.
[95]
Tasoulis, D. K., Adams, N. M., and Hand, D. J. 2006. Unsupervised clustering in streaming data. In Proceedings of the 6th IEEE International Conference on Data Mining-Workshops (ICDM'06). 638--642.
[96]
Tavallaee, M., Bagheri, E., Lu, W., and Ghorbani, A. A. 2009. A detailed analysis of the kdd cup 99 data set. In Proceedings of the 2nd IEEE International Conference on Computational Intelligence for Security and Defense Applications. 53--58.
[97]
Vattani, A. 2009. K-means requires exponentially many iterations even in the plane. In Proceedings of the 25th Annual Symposium on Computational Geometry (SCG'09). ACM Press, New York, 324--332.
[98]
Wan, R., Yan, X., and Su, X. 2008. A weighted fuzzy clustering algorithm for data stream. In Proceedings of the ISECS International Colloquium on Computing, Communication, Control, and Management. 360--364.
[99]
Wu, X., Kumar, V., Quinlan, J. R, Ghosh, J., Yang, Q., Motoda, H., Mclachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., and Steinberg, D. 2007. Top 10 algorithms in data mining. Knowl. Inf. Syst. 14, 1--37.
[100]
Xu, R. and Wunsch, D. 2009. Clustering. Computational Intelligence Series, Wiley-IEEE Press.
[101]
Yang, C. and Zhou, J. 2006. HClustream: A novel approach for clustering evolving heterogeneous data stream. In Proceedings of the 6th IEEE International Conference on Data Mining. 682--688.
[102]
Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York, 103--114.
[103]
Zhang, T., Ramakrishnan, R., and Livny, M. 1997. BIRCH: A new data clustering algorithm and its applications. Data Mining Knowl. Discov. 1, 2, 141--182.
[104]
Zhang, X., Sebag, M., and Germain-Renaud, C. 2009. Multi-scale real-time grid monitoring with job stream mining. In Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'09). 420--427.
[105]
Zhang, X. and Wang, W. 2010. Self-adaptive change detection in streaming data with nonstationary distribution. In Advanced Data Mining and Applications. Springer, 1--12.
[106]
Zhang, X., Zhou, X., and Hu, X. 2006. Semantic smoothing for model-based document clustering. In Proceedings of the 6th International Conference on Data Mining (ICDM'06). 1193--1198.
[107]
Zhou, A., Cao, F., Qian, W., and Jin, C. 2008. Tracking clusters in evolving data streams over sliding windows. Knowl. Inf. Syst. 15, 2, 181--214.
[108]
Zhu, H., Wang, Y., and Yu, Z. 2010. Clustering of evolving data stream with multiple adaptive sliding window. In Proceedings of the International Conference on Data Storage and Data Engineering (DSDE'10). 95--100.
[109]
Zhu, Y. and Shasha, D. 2002. StatStream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02). VLBD Endowment, 358--369.

Cited By

View all
  • (2025)Non-blocking functional dependency discovery from data streamsInformation Sciences10.1016/j.ins.2024.121531690(121531)Online publication date: Feb-2025
  • (2024)Comprehensive analysis of clustering algorithms: exploring limitations and innovative solutionsPeerJ Computer Science10.7717/peerj-cs.228610(e2286)Online publication date: 29-Aug-2024
  • (2024)Fractional Adaptive Resonance Theory (FRA-ART): An Extension for a Stream Clustering Method with Enhanced Data RepresentationMathematics10.3390/math1213204912:13(2049)Online publication date: 30-Jun-2024
  • Show More Cited By

Recommendations

Reviews

Fazli Can

Data streams, which are continuously generated at uncontrollable, rapid rates, are a relatively new form of temporal data object. Data stream clustering (DSC) is an unsupervised data mining process that looks for hidden patterns in dynamic, massive, and unbounded datasets. Possible DSC applications include analysis of ATM transactions, computer network traffic, meteorological observations, and stock market transactions. In such applications, for example, one may want to find new trends and identify followers. The authors present various algorithms. They first provide an overview, followed by the data structures used for summarization, the time window models used for efficiency, outlier detection mechanisms that distinguish outliers from cluster evolution, and standard clustering algorithms used for actual clustering based on the output of the summarization step. In the later sections, the paper discusses temporal aspects that need to be considered during clustering, the assessment of the generated clustering structures, practical applications, software packages available for data stream clustering, challenges of the problem environment, and future research pointers. The paper is rounded out with a taxonomy that helps to distinguish the analyzed algorithms with respect to their important features. This is a useful paper for researchers, with a rich list of references (despite a typographical error in the reference for Faria et al., 2012). The ACM Digital Library provides online access to a short electronic appendix related to the paper with the time complexity analysis of the presented algorithms. The paper is comprehensive and nicely written. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Computing Surveys
ACM Computing Surveys  Volume 46, Issue 1
October 2013
551 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/2522968
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 July 2013
Accepted: 01 January 2013
Revised: 01 September 2012
Received: 01 June 2012
Published in CSUR Volume 46, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Data stream clustering
  2. online clustering

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)359
  • Downloads (Last 6 weeks)32
Reflects downloads up to 18 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2025)Non-blocking functional dependency discovery from data streamsInformation Sciences10.1016/j.ins.2024.121531690(121531)Online publication date: Feb-2025
  • (2024)Comprehensive analysis of clustering algorithms: exploring limitations and innovative solutionsPeerJ Computer Science10.7717/peerj-cs.228610(e2286)Online publication date: 29-Aug-2024
  • (2024)Fractional Adaptive Resonance Theory (FRA-ART): An Extension for a Stream Clustering Method with Enhanced Data RepresentationMathematics10.3390/math1213204912:13(2049)Online publication date: 30-Jun-2024
  • (2024)A Systematic Literature Review of Novelty Detection in Data Streams: Challenges and OpportunitiesACM Computing Surveys10.1145/365728656:10(1-37)Online publication date: 12-Apr-2024
  • (2024)Mini-batching with Fused Training and Testing for Data Streams Processing on the EdgeProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649188(51-60)Online publication date: 7-May-2024
  • (2024)Clustering in Dynamic Environments: A Framework for Benchmark Dataset Generation With Heterogeneous ChangesProceedings of the Genetic and Evolutionary Computation Conference10.1145/3638529.3654188(50-58)Online publication date: 14-Jul-2024
  • (2024)An Efficient and Distributed Framework for Real-Time Trajectory Stream ClusteringIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.3312319(1-17)Online publication date: 2024
  • (2024)Interrelate Training and Clustering for Online Speaker DiarizationIEEE/ACM Transactions on Audio, Speech, and Language Processing10.1109/TASLP.2024.335703332(1352-1364)Online publication date: 2024
  • (2024)Stream Clustering on a Forensic Timeline2024 12th International Symposium on Digital Forensics and Security (ISDFS)10.1109/ISDFS60797.2024.10527350(1-6)Online publication date: 29-Apr-2024
  • (2024)Ocean: Online Clustering and Evolution Analysis for Dynamic Streaming Data2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00343(4504-4517)Online publication date: 13-May-2024
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media