Abstract
The last decade has seen a flurry of research on all-pairs-similarity-search (or similarity joins) for text, DNA and a handful of other datatypes, and these systems have been applied to many diverse data mining problems. However, there has been surprisingly little progress made on similarity joins for time series subsequences. The lack of progress probably stems from the daunting nature of the problem. For even modest sized datasets the obvious nested-loop algorithm can take months, and the typical speed-up techniques in this domain (i.e., indexing, lower-bounding, triangular-inequality pruning and early abandoning) at best produce only one or two orders of magnitude speedup. In this work we introduce a novel scalable algorithm for time series subsequence all-pairs-similarity-search. For exceptionally large datasets, the algorithm can be trivially cast as an anytime algorithm and produce high-quality approximate solutions in reasonable time and/or be accelerated by a trivial porting to a GPU framework. The exact similarity join algorithm computes the answer to the time series motif and time series discord problem as a side-effect, and our algorithm incidentally provides the fastest known algorithm for both these extensively-studied problems. We demonstrate the utility of our ideas for many time series data mining problems, including motif discovery, novelty discovery, shapelet discovery, semantic segmentation, density estimation, and contrast set mining. Moreover, we demonstrate the utility of our ideas on domains as diverse as seismology, music processing, bioinformatics, human activity monitoring, electrical power-demand monitoring and medicine.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
There are many such worse case scenarios, including high levels of noise blurring the distinction between closest and furthest neighbors and thus rendering triangular-inequality pruning and early abandoning worthless.
References
Agrawr R, Faloutsos C, Swami AN (1993) Efficient similarity search in sequence databases. In: Proceedings of the 4th international conference on foundations of data organization and algorithms (FODO’93), pp 69–84
Alibaba.com (2017) http://www.alibaba.com/showroom/seismograph.html
Assent I, Kranen P, Baldauf C, Seidl T (2012) AnyOut: anytime outlier detection on streaming data. In: Proceedings of the 17th international conference on database systems for advanced applications—volume part I (DASFAA’12), pp 228–242
Bavardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: Proceedings of the 16th international conference on World Wide Web, pp 131–140
Begum N, Keogh E (2014) Rare time series motif discovery from unbounded streams. In: Proceedings of the VLDB endowment (VLDB), vol 8(2), pp 149–160
Beroza G (2016) Personal correspondence. Jan 21, 2016
Bouezmarni T, Rombouts J (2010) Nonparametric density estimation for positive time series. Comput Stat Data Anal 54:245–261
Brown AEX, Yemini EI, Grundy LJ, Jucikas T, Schafer WR (2013) A dictionary of behavioral motifs reveals clusters of genes affecting caenorhabditis elegans locomotion. Proc Natl Acad Sci USA 110:791–796
Chandola V, Cheboli D, Kumar V (2009) Detecting anomalies in a time series database. UMN TR09-004
Chen Y, Keogh E, Hu B, Begum N, Bagnall A, Mueen A, Batista G (2015) The UCR time series classification archive. http://www.cs.ucr.edu/~eamonn/time_series_data/
CMU Motion Capture Database (2017) http://mocap.cs.cmu.edu/
Convolution (2016) Wikipedia, The Free Encyclopedia https://en.wikipedia.org/wiki/Convolution
Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh E (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. In: Proceedings of the VLDB endowment (VLDB), vol 1(2), pp 1542–1552
Dittmar C, Hildebrand K. F, Gaertner D, Winges M, Müller F, Aichroth P (2012) Audio forensics meets music information retrieval—a toolbox for inspection of music plagiarism. In: 2012 Proceedings of the 20th European signal processing conference (EUSIPCO), pp 126–131
Geller RJ, Mueller CS (1980) Four similar earthquakes in central California. Geophys Res Lett 7:821–824
Gomez-Valero L, Rusniok C, Cazalet C, Buchrieser C (2011) Comparative and functional genomics of legionella identified eukaryotic like proteins as key players in host-pathogen interactions. Front Microbiol 2:208
Hao MC, Marwah M, Janetzko H, Dayal U, Keim DA, Patnaik D, Ramakrishnan N, Sharma RK (2012) Visual exploration of frequent patterns in multivariate time series. Inf Vis 11:71–83
Hassanieh H, Indyk P, Katabi D, Price E (2012) Nearly optimal sparse Fourier transform. In: Proceedings of the forty-fourth annual ACM symposium on theory of computing (STOC), pp 563–78
Hu B, Chen Y, Zakaria J, Ulanova L, Keogh EJ (2013) Classification of multi-dimensional streaming time series by weighting each classifier’s track record. In: 2013 IEEE 13th international conference on data mining (ICDM), pp 281–290
Huang T, Zhu Y, Mao Y, Li X, Liu M, Wu Y, Ha Y, Dobbie G (2016) Parallel discord discovery. In: Advances in knowledge discovery and data mining: 20th Pacific-Asia conference, PAKDD 2016, Auckland, New Zealand, April 19–22, 2016. Proceedings, Part II, pp 233–244
Hughes JF, Skaletsky H, Pyntikova T, Graves TA, van Daalen SK, Minx PJ, Fulton RS, McGrath SD, Locke DP, Friedman C, Trask BJ, Mardis ER, Warren WC, Repping S, Rozen S, Wilson RK, Page DC (2010) Chimpanzee and human Y chromosomes are remarkably divergent in structure and gene content. Nature 463:536–539
Lee H, Ng R, Shim K (2011) Similarity join size estimation using locality sensitive hashing. In: Proceedings of the VLDB endowment (VLDB), vol 4(6), pp 338–349
Li Y, Yiu ML, Gong Z (2015) Quick-motif: an efficient and scalable framework for exact motif discovery. In: 2015 IEEE 31st international conference on data engineering (ICDE), pp 579–590
Lian X, Chen L (2009) Efficient join processing on uncertain data streams. In: Proceedings of the ACM conference on information and knowledge management (CIKM), pp 857–866
Luo W, Tan H, Mao H, Ni, LM (2012) Efficient similarity joins on massive high-dimensional datasets using MapReduce. In: 2012 IEEE 13th international conference on mobile data management (MDM), pp 1–10
Ma Y, Meng X, Wang S (2016) Parallel similarity joins on massive high-dimensional data using MapReduce. Concurr Comput 28(1):166–183
Makonin SV (2013) AMPds: a public dataset for load disaggregation and eco-feedback research. In: 2013 IEEE electrical power & energy conference (EPEC)
Morales GDF, Gionis A (2016) Streaming similarity self-join. In: Proceedings of the VLDB endowment (VLDB), vol 9(10), pp 792–803
Motamedi-Fakhr S, Moshrefi-Torbati M, Hill M, Hill CM, White PR (2014) Signal processing techniques applied to human sleep EEG signals—a review. Biomed Signal Process Control 10(2014):21–33
Mueen A, Hamooni H, Estrada T (2014) Time series join on subsequence correlation. In: Proceedings of the 2014 IEEE international conference on data mining (ICDM), pp 450–459
Mueen A, Keogh E, Young N (2011) Logical-shapelets: an expressive primitive for time series classification. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 1154–1162
Mueen A, Keogh E, Zhu Q, Cash S, Westover B (2009) Exact discovery of time series motif. In: Proceedings of the 2009 SIAM international conference on data mining (SDM), pp 473–484
Mueen A, Nath S, Liu J (2010) Fast approximate correlation for massive time-series data. In; Proceedings of the 2010 ACM SIGMOD international conference on management of data, pp 171–182
Murray D, Liao J, Stankovic L, Stankovic V, Hauxwell-Baldwin R, Wilson C, Coleman M, Kane T, Firth S (2015) A Data management platform for personalised real-time energy feedback. In: Proceedings of the 8th international conference on energy efficiency in domestic appliances and lighting (EEDAL), pp 1293–1307
Niennattrakul V, Keogh, EJ, Ratanamahatana, CA (2010) Data editing techniques to allow the application of distance-based outlier detection to streams. In: 2010 IEEE 10th international conference on data mining (ICDM), pp 947–952
Patnaik D, Manish M, Sharma RK, Ramakrishnan N (2009) Sustainable operation and management of data center chillers using temporal data mining. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1305–1314
Quick Motif (2015) http://degroup.cis.umac.mo/quickmotifs/
Rakthanmanon T, Champana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, pp 262–270
Rakthanmanon T, Champana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2013a) Addressing big data time series: mining trillions of time series subsequences under dynamic time warping. ACM Trans Knowl Discov Data 7(3):10
Rakthanmanon T, Keogh E (2013b) Fast shapelets: a scalable algorithm for discovering time series shapelets. In: Proceedings of the 2013 SIAM international conference on data mining (SDM), pp 668–676
Reiss A, Weber M, Stricker D (2011) Exploring and extending the boundaries of physical activity recognition. In: IEEE International conference on systems, man, and cybernetics (SMC), pp 46–50
Seidl T, Assent I, Kranen K, Krieger R, Herrmann J (2009) Indexing density models for incremental learning and anytime classification on data streams. In: Proceedings of the 12th international conference on extending database technology: advances in database technology (EDBT), pp 311–322
Shao H, Marwah M, Ramakrishnan N (2013) A temporal motif mining approach to unsupervised energy disaggregation: applications to residential and commercial buildings. In: Proceedings of the twenty-seventh AAAI conference on artificial intelligence (AAAI), pp 1327–1333
Supporting Page (2017) http://www.cs.ucr.edu/~eamonn/MatrixProfile.html
Truong CD, Anh DT (2015) An efficient method for motif and anomaly detection in time series based on clustering. Int J Bus Intell Data Min 10(4):356–377
Tucker A, Liu X (2004) A Bayesian network approach to explaining time series with changing structure. Intell Data Anal 8(5):469–480
Ueno K, Xi X, Keogh EJ, Lee D-J (2006) Anytime classification using the nearest neighbor algorithm with applications to stream mining. In: Sixth international conference on data mining (ICDM), 2006
Vlachos M, Meek C, Vagena Z, Gunopulos D (2004) Identifying similarities, periodicities and bursts for online search queries. In: Proceedings of the 2004 ACM SIGMOD international conference on management of data, pp 131–142
Vlachos M, Vagena Z, Yu PS, Athitsos V (2005) Rotation invariant indexing of shapes and line drawings. In: Proceedings of the 14th ACM international conference on information and knowledge management (CIKM), pp 131–138
Wintner A (1934) On analytic convolutions of Bernoulli distributions. Am J Math 56(1/4):659–663
Ye L and Keogh E (2009) Time series shapelets: a new primitive for data mining. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 947–956
Yeh C-C. M, van Herle H, Keogh E (2016a) Matrix profile III: the matrix profile allows visualization of salient subsequences in massive time series. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 579–588
Yeh C-C M, Zhu Y, Ulanova L, Begum N, Ding Y, Dau H A, Silva D F, Mueen A, Keogh E (2016b) Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 1317–1322
Yoon C, O’Reilly O, Bergen K, Beroza G (2015) Earthquake detection through computationally efficient similarity search. Sci Adv 1(11):e1501057
Zhou F, Torre F, Hodgins J (2008) Aligned cluster analysis for temporal segmentation of human motion. In: 2008 8th IEEE international conference on automatic face & gesture recognition, pp 1–7
Zhu Y, Zimmerman Z, Senobari N S, Yeh C-C M, Funning G, Mueen A, Brisk P, Keogh E (2016) Matrix profile II: exploiting a novel algorithm and GPUs to break the one hundred million barrier for time series motifs and joins. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 739–748
Zilberstein S, Russell S (1995) Approximate reasoning using anytime algorithms. In: Imprecise and approximate computation, pp 43–62
Acknowledgements
We gratefully acknowledge funding from NSF IIS-1161997 II, MERL Labs and Samsung, and all the data donors.
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Jian Pei.
Appendix: On the unpredictable time needed for state-of-the-art algorithms
Appendix: On the unpredictable time needed for state-of-the-art algorithms
In Section 4.8 we made some unintuitive observations about all known rival motif discovery/time series join algorithms. In essence, by making the problem apparently slightly easier, by either reducing the dimensionality or time series length, the time needed can get actually much worse (and vice versa). Here we sketch out an explanation for this fact.
The key observation is that these algorithms all use some form of pruning. The utility of pruning depends on two things; the distance between the discovered motifs (relative to all pairwise distances), and how quickly the algorithm can find these best motifs (or some good best-so-far motif) to enable the pruning strategy to extract the most benefit. Note that the former factor is a property of the data, not the algorithm.
Imagine we construct a dataset of length 100,000, and search for motifs of length 100. If our data is just random numbers, this is the worst case for both Li et al. (2015) and Mueen et al. (2009), as the intrinsic dimensionality is the same as the actual dimensionality. In MATLAB, we could create such a dataset with:
As this is the worst case for Li et al. (2015) and Mueen et al. (2009), both degenerate to brute force search and will take several hours to finish. Naturally the “motif” they discover will only be slightly closer than any randomly chosen pair of subsequences.
Now let us create a near identical dataset, but one which has a critical difference, this dataset has a perfect motif embedded at the beginning and at the end of the time series:
If we run motif discovery on this dataset, both Li et al. (2015) and Mueen et al. (2009) terminate much faster, in just seconds. This is because both will find the embedded motif early on, and this will allow very aggressive pruning.
Finally, suppose we consider a new dataset, which is simply data2 with the last point truncated:
It is clear that although this dataset is very slightly smaller than data2, the time needed by either Li et al. (2015) or Mueen et al. (2009) will return to the many hours needed for than data1. This is because the best motif in data3 will once again be a time series pair that is only be slightly closer than any randomly chosen pair of subsequences, and the pruning will thus be ineffective. By similar reasoning we can construct the two other cases noted in Section 4.8.
Finally, we note that although the examples above are contrived and “worst case”, in practice both Li et al. (2015) and Mueen et al. (2009) do vary greatly in the time require to terminate, on real datasets that appear essentially identical to human inspection.
Rights and permissions
About this article
Cite this article
Yeh, CC.M., Zhu, Y., Ulanova, L. et al. Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile. Data Min Knowl Disc 32, 83–123 (2018). https://doi.org/10.1007/s10618-017-0519-9
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-017-0519-9