[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Data Stream Clustering: An In-depth Empirical Study

Published: 20 June 2023 Publication History

Abstract

Data Stream Clustering (DSC) plays an important role in mining continuous and unlabeled data streams in real-world applications. Over the last decades, numerous DSC algorithms have been proposed with promising clustering accuracy and efficiency. Despite the significant differences among existing DSC algorithms, they are commonly built around four key design aspects: summarizing data structure, window model, outlier detection mechanism, and offline refinement strategy. However, there is a lack of empirical studies on these key design aspects in the same codebase using real-world workloads with distinct characteristics. As a result, it is difficult for researchers to improve upon the state-of-the-art. In this paper, we conduct such a study of DSC on its four key design aspects. We implemented state-of-the-art variants of all of these design choices in an open-sourced platform from scratch and evaluated them using both real-world and synthetic workloads. Our analysis identifies the fundamental issues and trade-offs of each design choice in terms of both accuracy and efficiency. We even find that combining flexible design choices led to the development of a new algorithm called Benne, which can be tuned to achieve either better accuracy or better efficiency compared to the state-of-the-art.

Supplemental Material

MP4 File
Presentation video of "Data Stream Clustering: An In-depth Empirical Study" for SIGMOD 2023

References

[1]
[n.d.]. Covertype. http:// archive.ics.uci.edu/ ml/ datasets/ Covertype.
[2]
[n.d.]. Sensor. https:// www.cse.fau.edu/ xqzhu/ stream.html.
[3]
[n.d.]. Ticat, https:// github.com/ innerr/ ticat.
[4]
Marcel R. Ackermann and et al. 2012. StreamKM: A Clustering Algorithm for Data Streams. ACM J. Exp. Algorithmics 17 (May 2012), 30.
[5]
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. 2003. A Framework for Clustering Evolving Data Streams. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29 (Berlin, Germany) (VLDB '03). VLDB Endowment, 81--92.
[6]
Daniel Barbará. 2002. Requirements for Clustering Data Streams. ACM SIGKDD Explorations Newsletter 3, 2 (Jan. 2002), 23--27. https://doi.org/10.1145/507515.507519
[7]
Daniel Barbará. 2002. Requirements for Clustering Data Streams. ACM SIGKDD Explorations Newsletter 3, 2 (jan 2002), 23--27. https://doi.org/10.1145/507515.507519
[8]
Alessio Bechini and et al. 2020. TSF-DBSCAN: a Novel Fuzzy Density-based Approach for Clustering Unbounded Data Streams. IEEE Transactions on Fuzzy Systems (2020).
[9]
A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. 2010. Moa: Massive online analysis. Journal of Machine Learning Research 11 (2010), 1601--1604.
[10]
Michele Borassi, Alessandro Epasto, Silvio Lattanzi, Sergei Vassilviskii, and Morteza Zadimoghaddam. 2020. Sliding Window Algorithms for K-Clustering Problems. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS'20). Curran Associates Inc., Red Hook, NY, USA, Article 731, 12 pages.
[11]
Paul Bradley, Johannes Gehrke, Raghu Ramakrishnan, and Ramakrishnan Srikant. 2002. Scaling mining algorithms to large databases. Commun. ACM 45, 8 (2002), 38--43.
[12]
Feng Cao, Martin Estert, Weining Qian, and Aoying Zhou. 2006. Density-based clustering over an evolving data stream with noise. In Proceedings of the 2006 SIAM international conference on data mining. SIAM, 328--339.
[13]
Matthias Carnein, Dennis Assenmacher, and Heike Trautmann. 2017. An Empirical Comparison of Stream Clustering Algorithms. In Proceedings of the Computing Frontiers Conference.(CF' 17) 17 (2017), 361--366.
[14]
Yixin Chen and Li Tu. 2007. Density-Based Clustering for Real-Time Stream Data. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07). Association for Computing Machinery, New York, NY, USA, 133--142.
[15]
Gianmarco De Francisci Morales and Albert Bifet. 2015. SAMOA: Scalable Advanced Massive Online Analysis. J. Mach. Learn. Res. 16, 1 (Jan. 2015), 149--153.
[16]
M. Deepa, P. Revathy, and P. G. Student. 2012. Validation of Document Clustering based on Purity and Entropy measures.
[17]
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. AAAI Press, 226--231.
[18]
Fredrik Farnstrom, James Lewis, and Charles Elkan. 2000. Scalability for clustering algorithms revisited. ACM SIGKDD Explorations Newsletter 2, 1 (2000), 51--57.
[19]
Shufeng Gong, Yanfeng Zhang, and Ge Yu. 2018. Clustering Stream Data by Exploring the Evolution of Density Mountain. Proc. VLDB Endow. 11, 4 (oct 2018), 393--405. https://doi.org/10.1145/3164135.3164136
[20]
Michael Hahsler and Matthew Bolaños. 2016. Clustering Data Streams Based on Shared Density between Micro-Clusters. IEEE Transactions on Knowledge and Data Engineering 28, 6 (2016), 1449--1461. https://doi.org/10.1109/TKDE.2016.2522412
[21]
Michael Hahsler, Matthew Bolanos, and John Forrest. 2015. streamMOA: Interface for MOA Stream Clustering Algorithms. https://cran.r-project.org/web/packages/streamMOA/
[22]
Ahsanul Haque, Latifur Khan, Michael Baron, Bhavani Thuraisingham, and Charu Aggarwal. 2016. Efficient handling of concept drift and concept evolution over Stream Data. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE). 481--492. https://doi.org/10.1109/ICDE.2016.7498264
[23]
Philipp Kranen, Ira Assent, Corinna Baldauf, and Thomas Seidl. 2011. The ClusTree: indexing micro-clusters for anytime stream mining. Knowledge and Information Systems 29, 2 (2011), 249--272.
[24]
Hardy Kremer and et al. 2011. An Effective Evaluation Measure for Clustering on Evolving Data Streams. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2020408.2020555
[25]
S. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (1982), 129--137. https://doi.org/10.1109/TIT.1982.1056489
[26]
Jie Lu and et al. 2019. Learning under Concept Drift: A Review. IEEE Transactions on Knowledge and Data Engineering 31, 12 (2019), 2346--2363. https://doi.org/10.1109/TKDE.2018.2876857
[27]
J. Macqueen. 1967. Some methods for classification and analysis of multivariate observations. In In 5-th Berkeley Symposium on Mathematical Statistics and Probability. 281--297.
[28]
Stratos Mansalis, Eirini Ntoutsi, Nikos Pelekis, and Yannis Theodoridis. 2018. An evaluation of data stream clustering algorithms. Statistical Analysis and Data Mining: The ASA Data Science Journal 11 (06 2018). https://doi.org/10.1002/sam.11380
[29]
Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani M. Thuraisingham. 2011. Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints. IEEE Transactions on Knowledge and Data Engineering 23, 6 (2011), 859--874. https://doi.org/10.1109/TKDE.2010.61
[30]
Mohammad M. Masud, Qing Chen, Latifur Khan, Charu Aggarwal, Jing Gao, Jiawei Han, and Bhavani Thuraisingham. 2010. Addressing Concept-Evolution in Concept-Drifting Data Streams. In 2010 IEEE International Conference on Data Mining. 929--934. https://doi.org/10.1109/ICDM.2010.160
[31]
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2005. Duplicate detection in click streams. In Proceedings of the 14th international conference on World Wide Web. 12--21.
[32]
A. Meyerson. 2001. Online facility location. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science. 426--431. https://doi.org/10.1109/SFCS.2001.959917
[33]
K. Namitha and G. Santhosh Kumar. 2020. Concept Drift Detection in Data Stream Clustering and its Application on Weather Data. Int. J. Agric. Environ. Inf. Syst. 11, 1 (2020), 67--85. https://doi.org/10.4018/IJAEIS.2020010104
[34]
Jonathan A. Silva, Elaine R. Faria, Rodrigo C. Barros, Eduardo R. Hruschka, André C. P. L. F. de Carvalho, and Joao Gama. 2013. Data Stream Clustering: A Survey. ACM Comput. Surv. 46, 1, Article 13 (jul 2013), 31 pages. https://doi.org/10.1145/2522968.2522981
[35]
V. M. A. Souza, D. M. Reis, A. G. Maletzke, and G. E. A. P. A. Batista. 2020. Challenges in Benchmarking Stream Learning Algorithms with Real-world Data. Data Mining and Knowledge Discovery 34 (2020), 1805--1858. https://doi.org/10.1007/s10618-020-00698--5
[36]
Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali A. Ghorbani. 2009. A detailed analysis of the KDD CUP 99 data set. In IEEE Symposium on Computational Intelligence for Security and Defense Applications. 1--6.
[37]
Li Wan, Wee Keong Ng, Xuan Hong Dang, Philip S. Yu, and Kuan Zhang. 2009. Density-Based Clustering of Data Streams at Multiple Resolutions. ACM Trans. Knowl. Discov. Data 3, 3 (July 2009), 28.
[38]
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Rec. 25, 2 (jun 1996), 103--114. https://doi.org/10.1145/235968.233324
[39]
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1997. BIRCH: A new data clustering algorithm and its applications. Data mining and knowledge discovery 1, 2 (1997), 141--182.
[40]
Aoying Zhou, Feng Cao, Weining Qian, and Cheqing Jin. 2008. Tracking clusters in evolving data streams over sliding windows. Knowledge and Information Systems 15, 2 (2008), 181--214.

Cited By

View all
  • (2024)On Reducing Space Amplification with Multi-Column Compaction in Apache IoTDBProceedings of the VLDB Endowment10.14778/3681954.368197717:11(2974-2986)Online publication date: 30-Aug-2024
  • (2024)DIDS: Double Indices and Double Summarizations for Fast Similarity SearchProceedings of the VLDB Endowment10.14778/3665844.366585117:9(2198-2211)Online publication date: 6-Aug-2024
  • (2024)CIVET: Exploring Compact Index for Variable-Length Subsequence Matching on Time SeriesProceedings of the VLDB Endowment10.14778/3665844.366584517:9(2123-2135)Online publication date: 6-Aug-2024
  • Show More Cited By

Index Terms

  1. Data Stream Clustering: An In-depth Empirical Study

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image Proceedings of the ACM on Management of Data
      Proceedings of the ACM on Management of Data  Volume 1, Issue 2
      PACMMOD
      June 2023
      2310 pages
      EISSN:2836-6573
      DOI:10.1145/3605748
      • Editor:
      • Divyakant Agrawal
      Issue’s Table of Contents
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 20 June 2023
      Published in PACMMOD Volume 1, Issue 2

      Permissions

      Request permissions for this article.

      Author Tags

      1. clustering algorithm
      2. data stream
      3. empirical study

      Qualifiers

      • Research-article

      Funding Sources

      • Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 2
      • National Research Foundation, Singapore and Infocomm Media Development Authority under its Future Communications Research & Development Programme
      • Key R&D Program of Hubei
      • National Key R&D Program of China

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)197
      • Downloads (Last 6 weeks)19
      Reflects downloads up to 12 Jan 2025

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)On Reducing Space Amplification with Multi-Column Compaction in Apache IoTDBProceedings of the VLDB Endowment10.14778/3681954.368197717:11(2974-2986)Online publication date: 30-Aug-2024
      • (2024)DIDS: Double Indices and Double Summarizations for Fast Similarity SearchProceedings of the VLDB Endowment10.14778/3665844.366585117:9(2198-2211)Online publication date: 6-Aug-2024
      • (2024)CIVET: Exploring Compact Index for Variable-Length Subsequence Matching on Time SeriesProceedings of the VLDB Endowment10.14778/3665844.366584517:9(2123-2135)Online publication date: 6-Aug-2024
      • (2024)Visualization-Aware Time Series Min-Max Caching with Error Bound GuaranteesProceedings of the VLDB Endowment10.14778/3659437.365946017:8(2091-2103)Online publication date: 31-May-2024
      • (2024)Performance-Based Pricing for Federated Learning via AuctionProceedings of the VLDB Endowment10.14778/3648160.364816917:6(1269-1282)Online publication date: 3-May-2024
      • (2024)Hybrid Prompt Learning for Generating Justifications of Security Risks in Automation RulesACM Transactions on Intelligent Systems and Technology10.1145/367540115:5(1-26)Online publication date: 29-Jun-2024
      • (2024)Databases in Edge and Fog Environments: A SurveyACM Computing Surveys10.1145/366600156:11(1-40)Online publication date: 8-Jul-2024
      • (2024)RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor SearchProceedings of the ACM on Management of Data10.1145/36549702:3(1-27)Online publication date: 30-May-2024
      • (2024)Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join QueriesProceedings of the ACM on Management of Data10.1145/36549322:3(1-26)Online publication date: 30-May-2024
      • (2024)Time Series Representation for Visualization in Apache IoTDBProceedings of the ACM on Management of Data10.1145/36392902:1(1-26)Online publication date: 26-Mar-2024
      • Show More Cited By

      View Options

      Login options

      Full Access

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media