[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

YADING: fast clustering of large-scale time series data

Published: 01 January 2015 Publication History

Abstract

Fast and scalable analysis techniques are becoming increasingly important in the era of big data, because they are the enabling techniques to create real-time and interactive experiences in data analysis. Time series are widely available in diverse application areas. Due to the large number of time series instances (e.g., millions) and the high dimensionality of each time series instance (e.g., thousands), it is challenging to conduct clustering on large-scale time series, and it is even more challenging to do so in real-time to support interactive exploration.
In this paper, we propose a novel end-to-end time series clustering algorithm, YADING, which automatically clusters large-scale time series with fast performance and quality results. Specifically, YADING consists of three steps: sampling the input dataset, conducting clustering on the sampled dataset, and assigning the rest of the input data to the clusters generated on the sampled dataset. In particular, we provide theoretical proof on the lower and upper bounds of the sample size, which not only guarantees YADING's high performance, but also ensures the distribution consistency between the input dataset and the sampled dataset. We also select L1 norm as similarity measure and the multi-density approach as the clustering method. With theoretical bound, this selection ensures YADING's robustness to time series variations due to phase perturbation and random noise.
Evaluation results have demonstrated that on typical-scale (100,000 time series each with 1,000 dimensions) datasets, YADING is about 40 times faster than the state-of-the-art, sampling-based clustering algorithm DENCLUE 2.0, and about 1,000 times faster than DBSCAN and CLARANS. YADING has also been used by product teams at Microsoft to analyze service performance. Two of such use cases are shared in this paper.

References

[1]
Debregeas, A., and Hebrail, G. 1998. Interactive interpretation of Kohonen maps applied to curves. In Proc. of KDD'98. 179--183.
[2]
Derrick, K., Bill, K., and Vamsi, C. 2012. Large scale/big data federation & virtualization: a case study.
[3]
D. A. Patterson. 2002. A simple way to estimate the cost of downtime. In Proc. of LISA' 02, pp. 185--188.
[4]
Eamonn, K., and Shruti, K. 2002. On the need for time series data mining benchmarks: a survey and empirical demonstration. In Proc. of KDD'02, July 23--26.
[5]
T. W. Liao. 2005. Clustering of time series data---A survey Pattern Recognit., vol. 38, no. 11, pp. 1857--1874, Nov.
[6]
X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, P. Boesiger. 1998. A new correlation-based fuzzy logic clustering algorithm for fMRI, Mag. Resonance Med.
[7]
B. K. Yi, H. V. Jagadish, and C. Faloutsos. 1998. Efficient retrieval of similar time sequences under time warping. In IEEE Proc. of ICDE, Feb.
[8]
J. Han, M. Kamber. 2001. Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, pp. 346--389.
[9]
Chu, K. & Wong, M. 1999. Fast time-series searching with scaling and shifting. In proc. of PODS. pp 237--248.
[10]
Popivanov, I. & Miller, R. J. 2002. Similarity search over time series data using wavelets. In proc. of ICDE.
[11]
Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. 2001. Locally adaptive dimensionality reduction for indexing large time series databases. In proc. of ACM SIGMOD.
[12]
Yi, B. & Faloutsos, C. 2000. Fast time sequence indexing for arbitrary lp norms. In proc. of the VLDB. pp 385--394.
[13]
E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. 2001. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowl. Inf. Syst., 3(3).
[14]
G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. 2003. Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE TKDE, 15(5).
[15]
Zhou, S., Zhou, A., Cao, J., Wen, J., Fan, Y., Hu. Y. 2000. Combining sampling technique with DBSCAN algorithm for clustering large spatial databases. In Proc. of the PAKDD.
[16]
Stuart, Alan. 1962. Basic Ideas of Scientific Sampling, Hafner Publishing Company, New York.
[17]
M. Ester, H. P. Kriegel, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. ACM SIGKDD 1996, pages 226--231.
[18]
M. Steinbach, L. Ertoz, and V. Kumar. 2003. Challenges of clustering high dimensional data. In L. T. Wille, editor, New Vistas in Statistical Physics -- Applications in Econophysics, Bioinformatics, and Pattern Recognition. Springer-Verlag.
[19]
Ng R. T., and Han J. 1994. Efficient and Effective Clustering Methods for Spatial Data Mining, In proc. of VLDB.
[20]
S. Guha, R. Rastogi, K. Shim. 1998. CURE: an efficient clustering algorithm for large databases. SIGMOD.
[21]
W. Wang, J. Yang, R. Muntz, R. 1997. STING: a statistical information grid approach to spatial data mining, VLDB'97.
[22]
Hans-Peter Kriegel, Peer Kröger, Jörg Sander, Arthur Zimek 2011. Density-based Clustering. WIREs Data Mining and Knowledge Discovery 1 (3): 231--240.
[23]
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander. 1999. OPTICS: Ordering Points To Identify the Clustering Structure. ACM SIGMOD. pp. 49--60.
[24]
Tao Pei, Ajay Jasra, David J. Hand, A. X. Zhu, C. Zhou. 2009. DECODE: a new method for discovering clusters of different densities in spatial data, Data Min Knowl Disc.
[25]
C. Guo, H. Li, and D. Pan. 2010. An improved piecewise aggregate approximation based on statistical features for time series mining. KSEM'10, pages 234--244.
[26]
Beckmann, N.; Kriegel, H. P.; Schneider, R.; Seeger, B. 1990. "The R-tree: an efficient and robust access method for points and rectangles". In proc. of SIGMOD. p. 322.
[27]
X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, P. Boesiger, A new correlation-based fuzzy logic clustering algorithm for fMRI, Mag. Resonance Med. 40 (1998).
[28]
T. W. Liao, B. Bolt, J. Forester, E. Hailman, C. Hansen, R. C. Kaste, J. O'May, Understanding and projecting the battle state, 23rd Army Science Conference, Orlando, FL, 2002.
[29]
D. Piccolo, A distance measure for classifying ARMA models, J. Time Ser. Anal. 11 (2) (1990) 153--163.
[30]
D. Tran, M. Wagner, Fuzzy c-means clustering-based speaker verification, in: N. R. Pal, M. Sugeno (Eds.), AFSS 2002, Lecture Notes in Artificial Intelligence, 2275, 2002.
[31]
Danon L, Díaz-Guilera A, Duch J and Arenas A 2005 J. Stat. Mech. P09008.
[32]
M. Datar, N. Immorlica, P. Indyk, V. S. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions, Computational Geometry, pp. 253--262.
[33]
Berchthold S., Keim D., Kriegel H. P. 1996. The X-Tree: An Index Structure for High-Dimensional Data, VLDB.
[34]
Keogh, E., Zhu, Q., Hu, B., Hao. Y., Xi, X., Wei, L. & Ratanamahatana, C. A. (2011). The UCR Time Series Classification/Clustering Homepage: www.cs.ucr.edu/~eamonn/time_series_data/
[35]
http://research.microsoft.com/en-us/people/juding/yading.aspx
[36]
A. Hinneburg and H. H. Gabriel. DENCLUE 2.0: Fast Clustering Based on Kernel Density Estimation. In IDA, 2007.
[37]
Sheih, J., Keogh, E., 2009. iSAX: disk-aware mining and indexing of massive time series data sets. Data Mining and Knowledge Discovery 19 (1), 24--57.
[38]
D. Barbara. Requirements for clustering data streams. SIGKDD Explorations, 3(2): 23--27, January 2002.
[39]
P. Liu, D. Zhou, N. Wu. 2007. Varied Density Based Spatial Clustering of Application with Noise, In proc. of ICSSSM. 528--531

Cited By

View all
  • (2023)Cache-Efficient Top-k Aggregation over High Cardinality Large DatasetsProceedings of the VLDB Endowment10.14778/3636218.363622217:4(644-656)Online publication date: 1-Dec-2023
  • (2023)Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding DistancesProceedings of the VLDB Endowment10.14778/3594512.359453016:8(2019-2032)Online publication date: 22-Jun-2023
  • (2022)Scalable Time Series Compound InfrastructureProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517888(1685-1698)Online publication date: 10-Jun-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 8, Issue 5
January 2015
181 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 January 2015
Published in PVLDB Volume 8, Issue 5

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)101
  • Downloads (Last 6 weeks)8
Reflects downloads up to 14 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Cache-Efficient Top-k Aggregation over High Cardinality Large DatasetsProceedings of the VLDB Endowment10.14778/3636218.363622217:4(644-656)Online publication date: 1-Dec-2023
  • (2023)Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding DistancesProceedings of the VLDB Endowment10.14778/3594512.359453016:8(2019-2032)Online publication date: 22-Jun-2023
  • (2022)Scalable Time Series Compound InfrastructureProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517888(1685-1698)Online publication date: 10-Jun-2022
  • (2021)A deep bidirectional similarity learning model using dimensional reduction for multivariate time series clusteringMultimedia Tools and Applications10.1007/s11042-020-10476-680:26-27(34269-34281)Online publication date: 1-Nov-2021
  • (2020)Sorting Data via a Look-Up-Table Neural Network and Self-Regulating IndexComplexity10.1155/2020/47935452020Online publication date: 1-Jan-2020
  • (2020)CRATOSProceedings of the 2020 European Symposium on Software Engineering10.1145/3393822.3432319(194-203)Online publication date: 6-Nov-2020
  • (2020)Efficient incident identification from multi-dimensional issue reports via meta-heuristic searchProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering10.1145/3368089.3409741(292-303)Online publication date: 8-Nov-2020
  • (2020)Debunking Four Long-Standing Misconceptions of Time-Series Distance MeasuresProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3389760(1887-1905)Online publication date: 11-Jun-2020
  • (2020)Clustering Hashtags Using Temporal PatternsWeb Information Systems Engineering – WISE 202010.1007/978-3-030-62005-9_14(183-195)Online publication date: 20-Oct-2020
  • (2019)Learning representations for time series clusteringProceedings of the 33rd International Conference on Neural Information Processing Systems10.5555/3454287.3454626(3781-3791)Online publication date: 8-Dec-2019
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media