Abstract
Data stream clustering has wide applications, such as online financial transactions, telephone records, and network monitoring. Grid-based clustering partitions stream data into cells, derives statistical information of the cells, and then applies clustering on these much smaller statistical information without referring to the input data. Therefore, grid-based clustering is efficient and very suitable for high-throughput data streams, which are continuous, time-varying, and possibly unpredictable. Various grid-based clustering schemes have been proposed. However, to the best of our knowledge, none of them provides an accuracy guarantee for their clustering output. To fill this gap, in this paper we study grid-based k-median clustering. We first develop an accuracy guarantee on the cost difference between grid-based solution and the optimum. Based on the theoretical analysis, we then propose a general and adaptive solution, which partitions stream data into cells of dynamically determined granularity and runs k-median clustering on the statistical information of cells with an accuracy guarantee. An extensive experiment over three real datasets clearly shows that our solution provides high-quality clustering outputs in an efficient way.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption
Ackermann, M.R., Blömer, J.: Coresets and approximate clustering for bregman divergences. In: SODA
Ackermann, M.R., Märtens, M., Raupach, C., Swierkot, K., Lammersen, C., Sohler, C.: Streamkm++: a clustering algorithm for data streams. ACM Journal of Experimental Algorithmics 17(1) (2012)
Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams. In: VLDB, pp. 81–92 (2003)
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD Conference, pp. 94–105 (1998)
Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: NIPS, pp. 10–18 (2009)
Arora, S., Raghavan, P., Rao, S.: Approximation schemes for euclidean k-medians and related problems. In: STOC, pp. 106–113 (1998)
Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., Pandit, V.: Local search heuristic for k-median and facility location problems. In: STOC, pp. 21–29 (2001)
Bartal, Y.: Probabilistic approximations of metric spaces and its algorithmic applications. In: FOCS, pp. 184–193 (1996)
Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: SDM, pp. 328–339 (2006)
Charikar, M., Chekuri, C., Goel, A., Guha, S.: Rounding via trees: deterministic approximation algorithms for group steiner trees and k-median. In: STOC, pp. 114–123 (1998)
Charikar, M., Guha, S.: Improved combinatorial algorithms for the facility location and k-median problems. In: FOCS, pp. 378–388 (1999)
Charikar, M., Guha, S., Tardos, É., Shmoys, D.B.: A constant-factor approximation algorithm for the k-median problem (extended abstract). In: STOC, pp. 1–10 (1999)
Charikar, M., Guha, S., Tardos, É., Shmoys, D.B.: A constant-factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci. 65(1), 129–149 (2002)
Chen, Y., Tu, L.: Density-based clustering for real-time stream data. In: KDD, pp. 133–142 (2007)
Cormode, G., Muthukrishnan, S., Zhuang, W.: Conquering the divide: continuous clustering of distributed data streams. In: ICDE, pp. 1036–1045 (2007)
de Andrade Silva, J., Faria, E.R., Barros, R.C., Hruschka, E.R., de Carvalho, A.C.P.L.F., Gama, J.: Data stream clustering: a survey. ACM Comput. Surv. 46(1), 13 (2013)
Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: constant-size coresets for k-means, pca and projective clustering. In: SODA
Gama, J., Rodrigues, P.P., Lopes, L.M.B.: Clustering distributed sensor data streams using local processing and reduced communication. Intell. Data Anal. 15(1), 3–28 (2011)
Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: theory and practice. IEEE Trans. Knowl. Data Eng. 15(3), 515–528 (2003)
Guo, T., Zhu, X., Pei, J., Zhang, C.: Snoc: streaming network node classification. In: ICDM (2014)
Har-Peled, S., Kushal, A.: Smaller coresets for k-median and k-means clustering. In: Proceedings of the Twenty-first Annual Symposium on Computational Geometry
Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: STOC, pp. 291–300 (2004)
Jain, K., Vazirani, V.V.: Primal-dual approximation algorithms for metric facility location and k-median problems. In: FOCS, pp. 2–13 (1999)
Koudas, N., Ooi, B.C., Tan, K.-L., Zhang, R.: Approximate nn queries on streams with guaranteed error/performance bounds. In: VLDB, pp. 804–815 (2004)
Lin, J., Vitter, J.S.: Approximation algorithms for geometric median problems. Inf. Process. Lett. 44(5), 245–249 (1992)
Lin, J., Vitter, J.S.: Epsilon-approximations with minimum packing constraint violation (extended abstract). In: STOC, pp. 771–782 (1992)
Park, N.H., Lee, W.S.: Statistical grid-based clustering over data streams. SIGMOD Record 33(1), 32–37 (2004)
Park, N.H., Lee, W.S.: Cell trees: an adaptive synopsis structure for clustering multi-dimensional on-line data streams. Data Knowl. Eng. 63(2), 528–549 (2007)
Tao, Y., Lian, X., Papadias, D., Hadjieleftheriou, M.: Random sampling for continuous streams with arbitrary updates. IEEE Trans. Knowl. Data Eng. 19(1), 96–110 (2007)
Wang, W., Yang, J., Muntz, R.R.: Sting: a statistical information grid approach to spatial data mining. In: VLDB, pp. 186–195 (1997)
Zhang, Q., Liu, J., Wang, W.: Approximate clustering on distributed data streams. In: ICDE, pp. 1131–1139 (2008)
Zhang, Z., Shu, H., Chong, Z., Lu, H., Yang, Y.: C-cube: elastic continuous clustering in the cloud. In: ICDE, pp. 577–588 (2013)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Cao, J., Zhou, Y., Wu, M. (2015). Adaptive Grid-Based k-median Clustering of Streaming Data with Accuracy Guarantee. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science(), vol 9049. Springer, Cham. https://doi.org/10.1007/978-3-319-18120-2_5
Download citation
DOI: https://doi.org/10.1007/978-3-319-18120-2_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18119-6
Online ISBN: 978-3-319-18120-2
eBook Packages: Computer ScienceComputer Science (R0)