More Web Proxy on the site http://driver.im/

research-article

YADING: fast clustering of large-scale time series data

Editors: Chen Li, Volker Markl Authors:

Dongmei ZhangAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 8, Issue 5

Pages 473 - 484

https://doi.org/10.14778/2735479.2735481

Published: 01 January 2015 Publication History

Abstract

Fast and scalable analysis techniques are becoming increasingly important in the era of big data, because they are the enabling techniques to create real-time and interactive experiences in data analysis. Time series are widely available in diverse application areas. Due to the large number of time series instances (e.g., millions) and the high dimensionality of each time series instance (e.g., thousands), it is challenging to conduct clustering on large-scale time series, and it is even more challenging to do so in real-time to support interactive exploration.

In this paper, we propose a novel end-to-end time series clustering algorithm, YADING, which automatically clusters large-scale time series with fast performance and quality results. Specifically, YADING consists of three steps: sampling the input dataset, conducting clustering on the sampled dataset, and assigning the rest of the input data to the clusters generated on the sampled dataset. In particular, we provide theoretical proof on the lower and upper bounds of the sample size, which not only guarantees YADING's high performance, but also ensures the distribution consistency between the input dataset and the sampled dataset. We also select L₁ norm as similarity measure and the multi-density approach as the clustering method. With theoretical bound, this selection ensures YADING's robustness to time series variations due to phase perturbation and random noise.

Evaluation results have demonstrated that on typical-scale (100,000 time series each with 1,000 dimensions) datasets, YADING is about 40 times faster than the state-of-the-art, sampling-based clustering algorithm DENCLUE 2.0, and about 1,000 times faster than DBSCAN and CLARANS. YADING has also been used by product teams at Microsoft to analyze service performance. Two of such use cases are shared in this paper.

References

[1]

Debregeas, A., and Hebrail, G. 1998. Interactive interpretation of Kohonen maps applied to curves. In Proc. of KDD'98. 179--183.

[2]

Derrick, K., Bill, K., and Vamsi, C. 2012. Large scale/big data federation & virtualization: a case study.

[3]

D. A. Patterson. 2002. A simple way to estimate the cost of downtime. In Proc. of LISA' 02, pp. 185--188.

Digital Library

[4]

Eamonn, K., and Shruti, K. 2002. On the need for time series data mining benchmarks: a survey and empirical demonstration. In Proc. of KDD'02, July 23--26.

[5]

T. W. Liao. 2005. Clustering of time series data---A survey Pattern Recognit., vol. 38, no. 11, pp. 1857--1874, Nov.

Digital Library

[6]

X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, P. Boesiger. 1998. A new correlation-based fuzzy logic clustering algorithm for fMRI, Mag. Resonance Med.

[7]

B. K. Yi, H. V. Jagadish, and C. Faloutsos. 1998. Efficient retrieval of similar time sequences under time warping. In IEEE Proc. of ICDE, Feb.

Digital Library

[8]

J. Han, M. Kamber. 2001. Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, pp. 346--389.

Digital Library

[9]

Chu, K. & Wong, M. 1999. Fast time-series searching with scaling and shifting. In proc. of PODS. pp 237--248.

Digital Library

[10]

Popivanov, I. & Miller, R. J. 2002. Similarity search over time series data using wavelets. In proc. of ICDE.

Digital Library

[11]

Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. 2001. Locally adaptive dimensionality reduction for indexing large time series databases. In proc. of ACM SIGMOD.

Digital Library

[12]

Yi, B. & Faloutsos, C. 2000. Fast time sequence indexing for arbitrary lp norms. In proc. of the VLDB. pp 385--394.

Digital Library

[13]

E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. 2001. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowl. Inf. Syst., 3(3).

[14]

G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. 2003. Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE TKDE, 15(5).

Digital Library

[15]

Zhou, S., Zhou, A., Cao, J., Wen, J., Fan, Y., Hu. Y. 2000. Combining sampling technique with DBSCAN algorithm for clustering large spatial databases. In Proc. of the PAKDD.

Digital Library

[16]

Stuart, Alan. 1962. Basic Ideas of Scientific Sampling, Hafner Publishing Company, New York.

[17]

M. Ester, H. P. Kriegel, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. ACM SIGKDD 1996, pages 226--231.

[18]

M. Steinbach, L. Ertoz, and V. Kumar. 2003. Challenges of clustering high dimensional data. In L. T. Wille, editor, New Vistas in Statistical Physics -- Applications in Econophysics, Bioinformatics, and Pattern Recognition. Springer-Verlag.

[19]

Ng R. T., and Han J. 1994. Efficient and Effective Clustering Methods for Spatial Data Mining, In proc. of VLDB.

Digital Library

[20]

S. Guha, R. Rastogi, K. Shim. 1998. CURE: an efficient clustering algorithm for large databases. SIGMOD.

Digital Library

[21]

W. Wang, J. Yang, R. Muntz, R. 1997. STING: a statistical information grid approach to spatial data mining, VLDB'97.

Digital Library

[22]

Hans-Peter Kriegel, Peer Kröger, Jörg Sander, Arthur Zimek 2011. Density-based Clustering. WIREs Data Mining and Knowledge Discovery 1 (3): 231--240.

[23]

Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander. 1999. OPTICS: Ordering Points To Identify the Clustering Structure. ACM SIGMOD. pp. 49--60.

Digital Library

[24]

Tao Pei, Ajay Jasra, David J. Hand, A. X. Zhu, C. Zhou. 2009. DECODE: a new method for discovering clusters of different densities in spatial data, Data Min Knowl Disc.

Digital Library

[25]

C. Guo, H. Li, and D. Pan. 2010. An improved piecewise aggregate approximation based on statistical features for time series mining. KSEM'10, pages 234--244.

Digital Library

[26]

Beckmann, N.; Kriegel, H. P.; Schneider, R.; Seeger, B. 1990. "The R-tree: an efficient and robust access method for points and rectangles". In proc. of SIGMOD. p. 322.

Digital Library

[27]

X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, P. Boesiger, A new correlation-based fuzzy logic clustering algorithm for fMRI, Mag. Resonance Med. 40 (1998).

[28]

T. W. Liao, B. Bolt, J. Forester, E. Hailman, C. Hansen, R. C. Kaste, J. O'May, Understanding and projecting the battle state, 23rd Army Science Conference, Orlando, FL, 2002.

[29]

D. Piccolo, A distance measure for classifying ARMA models, J. Time Ser. Anal. 11 (2) (1990) 153--163.

[30]

D. Tran, M. Wagner, Fuzzy c-means clustering-based speaker verification, in: N. R. Pal, M. Sugeno (Eds.), AFSS 2002, Lecture Notes in Artificial Intelligence, 2275, 2002.

Digital Library

[31]

Danon L, Díaz-Guilera A, Duch J and Arenas A 2005 J. Stat. Mech. P09008.

[32]

M. Datar, N. Immorlica, P. Indyk, V. S. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions, Computational Geometry, pp. 253--262.

Digital Library

[33]

Berchthold S., Keim D., Kriegel H. P. 1996. The X-Tree: An Index Structure for High-Dimensional Data, VLDB.

Digital Library

[34]

Keogh, E., Zhu, Q., Hu, B., Hao. Y., Xi, X., Wei, L. & Ratanamahatana, C. A. (2011). The UCR Time Series Classification/Clustering Homepage: www.cs.ucr.edu/~eamonn/time_series_data/

[35]

http://research.microsoft.com/en-us/people/juding/yading.aspx

[36]

A. Hinneburg and H. H. Gabriel. DENCLUE 2.0: Fast Clustering Based on Kernel Density Estimation. In IDA, 2007.

Digital Library

[37]

Sheih, J., Keogh, E., 2009. iSAX: disk-aware mining and indexing of massive time series data sets. Data Mining and Knowledge Discovery 19 (1), 24--57.

Digital Library

[38]

D. Barbara. Requirements for clustering data streams. SIGKDD Explorations, 3(2): 23--27, January 2002.

Digital Library

[39]

P. Liu, D. Zhou, N. Wu. 2007. Varied Density Based Spatial Clustering of Application with Noise, In proc. of ICSSSM. 528--531

Cited By

Cai YPan XZhang LXu FZhang S(2025)Towards Sustainable Parking: Analyzing the Characteristics of Periodic Off-Street Parking Lots and Their Application in Shared ParkingSustainability10.3390/su1703083317:3(833)Online publication date: 21-Jan-2025
https://doi.org/10.3390/su17030833
Siddiqui TNarasayya VDumitru MChaudhuri S(2024)Cache-Efficient Top-k Aggregation over High Cardinality Large DatasetsProceedings of the VLDB Endowment10.14778/3636218.363622217:4(644-656)Online publication date: 5-Mar-2024
https://dl.acm.org/doi/10.14778/3636218.3636222
Zhou FChen YZhu CJiang LLiao XZhong ZChen XChen YZhao Y(2024)Visual Analysis of Money Laundering in Cryptocurrency ExchangeIEEE Transactions on Computational Social Systems10.1109/TCSS.2022.323168711:1(731-745)Online publication date: Feb-2024
https://doi.org/10.1109/TCSS.2022.3231687
Show More Cited By

YADING: fast clustering of large-scale time series data
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
  2. Information systems applications
    1. Data mining

Recommendations

Hybrid Bisect K-Means Clustering Algorithm
BCGIN '11: Proceedings of the 2011 International Conference on Business Computing and Global Informatization

In this paper, we present a hybrid clustering algorithm that combines divisive and agglomerative hierarchical clustering algorithm. Our method uses bisect K-means for divisive clustering algorithm and Unweighted Pair Group Method with Arithmetic Mean (...
Proficient Normalised Fuzzy K-Means With Initial Centroids Methodology

This article describes how data is relevant and if it can be organized, linked with other data and grouped into a cluster. Clustering is the process of organizing a given set of objects into a set of disjoint groups called clusters. There are a number ...
On cluster tree for nested and multi-density data clustering

Clustering is one of the important data mining tasks. Nested clusters or clusters of multi-density are very prevalent in data sets. In this paper, we develop a hierarchical clustering approach-a cluster tree to determine such cluster structure and ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 8, Issue 5

January 2015

181 pages

ISSN:2150-8097

Editors:
Chen Li
University of California, Irvine
,
Volker Markl
TU Berlin

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 January 2015

Published in PVLDB Volume 8, Issue 5

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

70
Total Citations
View Citations
840
Total Downloads

Downloads (Last 12 months)89
Downloads (Last 6 weeks)6

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Cai YPan XZhang LXu FZhang S(2025)Towards Sustainable Parking: Analyzing the Characteristics of Periodic Off-Street Parking Lots and Their Application in Shared ParkingSustainability10.3390/su1703083317:3(833)Online publication date: 21-Jan-2025
https://doi.org/10.3390/su17030833
Siddiqui TNarasayya VDumitru MChaudhuri S(2024)Cache-Efficient Top-k Aggregation over High Cardinality Large DatasetsProceedings of the VLDB Endowment10.14778/3636218.363622217:4(644-656)Online publication date: 5-Mar-2024
https://dl.acm.org/doi/10.14778/3636218.3636222
Zhou FChen YZhu CJiang LLiao XZhong ZChen XChen YZhao Y(2024)Visual Analysis of Money Laundering in Cryptocurrency ExchangeIEEE Transactions on Computational Social Systems10.1109/TCSS.2022.323168711:1(731-745)Online publication date: Feb-2024
https://doi.org/10.1109/TCSS.2022.3231687
Yao DWang JChen WGuo FHan PBi J(2024)Deep Dirichlet Process Mixture Model for Non-parametric Trajectory Clustering2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00339(4449-4462)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00339
Huang YLuo X(2024)A Method Based on Lie Group Machine Learning for Multivariate Time-Series Clustering2024 6th International Conference on Data-driven Optimization of Complex Systems (DOCS)10.1109/DOCS63458.2024.10704262(638-643)Online publication date: 16-Aug-2024
https://doi.org/10.1109/DOCS63458.2024.10704262
Wei YCao DCui LLi COu-Yang S(2024)Novel Adaptive and Iterative Multi-density DBSCAN Method for Space Signal Sorting in Complex Electromagnetic Environment2024 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT)10.1109/COMNETSAT63286.2024.10862591(451-458)Online publication date: 28-Nov-2024
https://doi.org/10.1109/COMNETSAT63286.2024.10862591
Chakraborty AEswaran AThorat PVerma MGupta PJayachandran P(2024)Intent-Driven Multi-Engine Observability Dataflows for Heterogeneous Geo-Distributed Clouds2024 IEEE 17th International Conference on Cloud Computing (CLOUD)10.1109/CLOUD62652.2024.00014(30-41)Online publication date: 7-Jul-2024
https://doi.org/10.1109/CLOUD62652.2024.00014
Cai BHuang GYang SXiang YChi C(2024)SE-shapeletsExpert Systems with Applications: An International Journal10.1016/j.eswa.2023.122584240:COnline publication date: 8-Aug-2024
https://dl.acm.org/doi/10.1016/j.eswa.2023.122584
Lukianchenko PKopylov D(2024)Comparative Analysis of Traditional Machine Learning Approaches for Time Series Clustering Under Colored NoiseInformation Technologies and Intelligent Decision Making Systems10.1007/978-3-031-60318-1_3(26-39)Online publication date: 1-May-2024
https://doi.org/10.1007/978-3-031-60318-1_3
Paparrizos JWu KElmore AFaloutsos CFranklin M(2023)Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding DistancesProceedings of the VLDB Endowment10.14778/3594512.359453016:8(2019-2032)Online publication date: 1-Apr-2023
https://dl.acm.org/doi/10.14778/3594512.3594530
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents