Abstract
Sequence clustering in a streaming environment is challenging because it is computationally expensive, and the sequences may evolve over time. K-medoids or Partitioning Around Medoids (PAM) is commonly used to cluster sequences since it supports alignment-based distances, and the k-centers being actual data items helps with cluster interpretability. However, offline k-medoids has no support for concept drift, while also being prohibitively expensive for clustering data streams. We therefore propose SECLEDS, a streaming variant of the k-medoids algorithm with constant memory footprint. SECLEDS has two unique properties: i) it uses multiple medoids per cluster, producing stable high-quality clusters, and ii) it handles concept drift using an intuitive Medoid Voting scheme for approximating cluster distances. Unlike existing adaptive algorithms that create new clusters for new concepts, SECLEDS follows a fundamentally different approach, where the clusters themselves evolve with an evolving stream. Using real and synthetic datasets, we empirically demonstrate that SECLEDS produces high-quality clusters regardless of drift, stream size, data dimensionality, and number of clusters. We compare against three popular stream and batch clustering algorithms. The state-of-the-art BanditPAM is used as an offline benchmark. SECLEDS achieves comparable F1 score to BanditPAM while reducing the number of required distance computations by 83.7%. Importantly, SECLEDS outperforms all baselines by 138.7% when the stream contains drift. We also cluster real network traffic, and provide evidence that SECLEDS can support network bandwidths of up to 1.08 Gbps while using the (expensive) dynamic time warping distance.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: Streamkm++ a clustering algorithm for data streams. JEA 17, 2–1 (2012)
Aggarwal, C.C., Philip, S.Y., Han, J., Wang, J.: A framework for clustering evolving data streams. In: VLDB, pp. 81–92. Elsevier (2003)
de Andrade Silva, J., Hruschka, E.R.: Extending k-means-based algorithms for evolving data streams with variable number of clusters. In: ICMLA, vol. 2, pp. 14–19. IEEE (2011)
Barros, R.S.M., Santos, S.G.T.C.: A large-scale comparison of concept drift detectors. Inf. Sci. 451, 348–370 (2018)
Boeva, V., Nordahl, C.: Modeling evolving user behavior via sequential clustering. In: Cellier, P., Driessens, K. (eds.) ECML PKDD 2019. CCIS, vol. 1168, pp. 12–20. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-43887-6_2
Cao, F., Estert, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: SDM, pp. 328–339. SIAM (2006)
Cook, D.J., Krishnan, N.C., Rashidi, P.: Activity discovery and activity recognition: a new partnership. IEEE Trans. Cybern. 43(3), 820–828 (2013)
Dua, D., Graff, C.: UCI machine learning repository (2017)
Fahy, C., Yang, S.: Finding and tracking multi-density clusters in online dynamic data streams. IEEE Trans. Big Data 8, 178–192 (2019)
Garcia, S., Grill, M., Stiborek, J., Zunino, A.: An empirical comparison of botnet detection methods. Comput. Secur. 45, 100–123 (2014)
Glover, F.: Future paths for integer programming and links to artificial intelligence. Comput. Oper. Res. 13(5), 533–549 (1986)
Guijo-Rubio, D., Durán-Rosal, A.M., Gutiérrez, P.A., Troncoso, A., Hervás-Martínez, C.: Time-series clustering based on the characterization of segment typologies. IEEE Trans. Cybern. 51(11), 5409–5422 (2020)
Guo, J., Liu, G., Zuo, Y., Wu, J.: Learning sequential behavior representations for fraud detection. In: ICDM, pp. 127–136. IEEE (2018)
Hyde, R., Angelov, P., MacKenzie, A.R.: Fully online clustering of evolving data streams into arbitrarily shaped clusters. Inf. Sci. 382, 96–114 (2017)
Islam, M.K., Ahmed, M.M., Zamli, K.Z.: A buffer-based online clustering for evolving data stream. Inf. Sci. 489, 113–135 (2019)
Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., Zhang, G.: Learning under concept drift: a review. TKDE 31(12), 2346–2363 (2018)
Lu, N., Zhang, G., Lu, J.: Concept drift detection via competence models. Artif. Intell. 209, 11–28 (2014)
Manning, C., Raghavan, P., Schütze, H.: Introduction to information retrieval. Nat. Lang. Eng. 16(1), 100–103 (2010)
Nadeem, A., Hammerschmidt, C., Gañán, C.H., Verwer, S.: Beyond labeling: using clustering to build network behavioral profiles of malware families. In: Stamp, M., Alazab, M., Shalaginov, A. (eds.) Malware Analysis Using Artificial Intelligence and Deep Learning, pp. 381–409. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-62582-5_15
Nadeem, A., Verwer, S., Moskal, S., Yang, S.J.: Alert-driven attack graph generation using s-PDFA. IEEE Trans. Dependable Sec. Comput. 19(2), 731–746 (2021)
Pedregosa, F., et al.: Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Schubert, E., Rousseeuw, P.J.: Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds.) SISAP 2019. LNCS, vol. 11807, pp. 171–187. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32047-8_16
Sculley, D.: Web-scale k-means clustering. In: WWW, pp. 1177–1178 (2010)
Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., de Carvalho, A.C., Gama, J.: Data stream clustering: a survey. CSUR 46(1), 1–31 (2013)
Tiwari, M., Zhang, M.J., Mayclin, J., Thrun, S., Piech, C., Shomorony, I.: Banditpam: almost linear time k-medoids clustering via multi-armed bandits. NeurIPS 33, 10211–10222 (2020)
Ushakov, A.V., Vasilyev, I.: Near-optimal large-scale k-medoids clustering. Inf. Sci. 545, 344–362 (2021)
Wang, T., Li, Q., Bucci, D.J., Liang, Y., Chen, B., Varshney, P.K.: K-medoids clustering of data sequences with composite distributions. IEEE Trans. Signal Process. 67(8), 2093–2106 (2019)
Wang, X., Mueen, A., Ding, H., Trajcevski, G., Scheuermann, P., Keogh, E.: Experimental comparison of representation methods and distance measures for time series data. Data Min. Knowl. Disc. 26(2), 275–309 (2013)
Wang, Y., Chen, L., Mei, J.P.: Incremental fuzzy clustering with multiple medoids for large data. IEEE Trans. Fuzzy Syst. 22(6), 1557–1568 (2014)
Zhang, T., Ramakrishnan, R., Livny, M.: Birch: a new data clustering algorithm and its applications. Data Min. Knowl. Disc. 1(2), 141–182 (1997)
Žliobaitė, I., Pechenizkiy, M., Gama, J.: An overview of concept drift applications. In: Japkowicz, N., Stefanowski, J. (eds.) Big Data Analysis: New Algorithms for a New Society. SBD, vol. 16, pp. 91–114. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-26989-4_4
Zubaroğlu, A., Atalay, V.: Data stream clustering: a review. Artif. Intell. Rev. 54(2), 1201–1236 (2021)
Acknowledgements
We thank Ruben te Wierik, Silviu Fucarev, and Rami Al-Obaidi for their contributions to the SECLEDS algorithm.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nadeem, A., Verwer, S. (2023). SECLEDS: Sequence Clustering in Evolving Data Streams via Multiple Medoids and Medoid Voting. In: Amini, MR., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022. Lecture Notes in Computer Science(), vol 13713. Springer, Cham. https://doi.org/10.1007/978-3-031-26387-3_10
Download citation
DOI: https://doi.org/10.1007/978-3-031-26387-3_10
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26386-6
Online ISBN: 978-3-031-26387-3
eBook Packages: Computer ScienceComputer Science (R0)