Abstract
The mining of time series data plays an important role in modern information retrieval and monitoring infrastructures. In particular, the identification of similarities within and across large time series is of great importance in analytics and knowledge discovery. For this task, the matrix profile similarity indexing approach, which encodes the correlations among snapshots of a time series, is well-established. However, it is computationally expensive, especially for long time series, as existing exact approaches mostly rely on exhaustive, exact query (search) operations and are inefficient. Similarly, existing approximate approaches are limited with respect to parallelism, scalability, or their extent of practicality. We, therefore, focus on an approximate parallel tree-based nearest-neighbors approach and address the weak scaling challenges raised when applied to large time series in HPC settings.
We build on the existing concept of parallel iterative tree-based nearest neighbor solvers and introduce a novel approach for the approximate calculation of the matrix profile. To improve the performance and overcome weak scalability challenges, we exploit a mix of creating a forest of parallel trees on exclusive ensembles of resources combined with pipelining of iterations. We provide an implementation targeting large-scale CPU-based HPC systems and illustrate the performance of this new approach with experimental data. Finally, we demonstrate the mining of time series at billion-records-scale datasets on the SuperMUC-NG system.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
The implementation can be provided upon request.
- 2.
Leibniz Supercomputing Centre.
- 3.
Other metrics are also used in the literature, and some might be subjective to particular practical use cases. An investigation of such alternative metrics is beyond the scope of the present work.
References
Arya, S., et al.: An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM 45(6), 891–923 (1998)
Barnes, J., Hut, P.: A hierarchical O(N log N) force-calculation algorithm. Nature 324(6096), 446–449 (1986)
Cools, S., et al.: Improving strong scaling of the conjugate gradient method for solving large linear systems using global reduction pipelining. ArXiv abs/1905.06850 (2019)
Curtin, R.R.: Faster dual-tree traversal for nearest neighbor search. In: Amato, G., Connor, R., Falchi, F., Gennaro, C. (eds.) SISAP 2015. LNCS, vol. 9371, pp. 77–89. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25087-8_7
Dau, H.A., Keogh, E.: Matrix profile V: a generic technique to incorporate domain knowledge into motif discovery. In: 23rd ACM SIGKDD, pp. 125–134 (2017)
Eamonn Keogh: Electrocardiography Dataset. https://www.cs.ucr.edu/~eamonn/ECG_one_day.zip. Accessed 15 Aug 2022
Gharghabi, S., et al.: Domain agnostic online semantic segmentation for multi-dimensional time series. In: Data Mining and Knowledge Discovery (2018)
Heldens, S., et al.: Rocket: efficient and scalable all-pairs computations on heterogeneous platforms. In: Proceedings of SC 2020. IEEE Press (2020)
Jirkovský, V., et al.: Big data analysis for sensor time-series in automation. In: IEEE Emerging Technology and Factory Automation (ETFA), pp. 1–8 (2014)
Jones, P.W., et al.: Randomized approximate nearest neighbors algorithm. Proc. Natl. Acad. Sci. 108(38), 15679–15686 (2011)
Ju, Y., et al.: Exploiting reduced precision for GPU-based Time series mining. In: IEEE IPDPS, pp. 124–134 (2022)
Karlstetter, R., et al.: Turning dynamic sensor measurements from gas turbines into insights: a big data approach. In: Turbo Expo, vol. 6 (2019)
Karlstetter, R., et al.: Living on the edge: efficient handling of large scale sensor data. In: 2021 IEEE/ACM CCGrid 2021, pp. 1–10 (2021)
Linardi, M., et al.: Matrix profile X: VALMOD - scalable discovery of variable-length motifs in data series. In: ACM SIGMOD, p. 1053–1066 (2018)
Lu, Y., et al.: Matrix profile XXIV: scaling time series anomaly detection to trillions of datapoints and ultra-fast arriving data streams. In: ACM SIGKDD (2022)
Mercer, R., et al.: Matrix profile XXIII: contrast profile: a novel time series primitive that allows real world classification. In: IEEE ICDM (2021)
Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2227–2240 (2014)
Netti, A.: HPC-ODA dataset collection (2020). https://doi.org/10.5281/zenodo.3701440
Patwary, M.M.A., et al.: PANDA: extreme scale parallel k-nearest neighbor on distributed architectures. CoRR abs/1607.08220 (2016)
Pfeilschifter, G.: time series analysis with matrix profile on HPC systems. Master thesis, Technische Universität München (2019)
Raksha, S., et al.: Weather forecasting framework for time series data using intelligent learning models. In: 5th ICEECCOT 2021, pp. 783–787 (2021)
Rakthanmanon, T., et al.: Searching and mining trillions of time series subsequences under dynamic time warping. In: ACM SIGKDD, pp. 262–270 (2012)
Ram, P., Sinha, K.: Revisiting KD-tree for nearest neighbor search. In: KDD 2019, pp. 1378–1388. Association for Computing Machinery, New York (2019)
Raoofy, A., Karlstetter, R., Yang, D., Trinitis, C., Schulz, M.: Time series mining at petascale performance. In: Sadayappan, P., Chamberlain, B.L., Juckeland, G., Ltaief, H. (eds.) ISC High Performance 2020. LNCS, vol. 12151, pp. 104–123. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50743-5_6
Rokhlin, V.: Rapid solution of integral equations of classical potential theory. J. Comput. Phys. 60(2), 187–207 (1985)
Schall-Zimmerman, Z., et al.: Matrix profile XVIII: time series mining in the face of fast moving streams using a learned approximate matrix profile. In: IEEE ICDM, pp. 936–945 (2019)
Schmidl, S., et al.: Anomaly detection in time series: a comprehensive evaluation. Proc. VLDB Endow. 15(9), 1779–1797 (2022)
Shakibay Senobari, et al.: Using the similarity matrix profile to investigate foreshock behavior of the 2004 parkfield earthquake. In: AGU Fall Meeting Abstracts, vol. 2018, pp. S51B–03 (2018)
Steinbusch, B., et al.: A massively parallel barnes-hut tree code with dual tree traversal. In: PARCO (2015)
Thill, M., et al.: MarkusThill/MGAB: The Mackey-glass anomaly benchmark (2020). https://doi.org/10.5281/zenodo.3760086
Van Der Maaten, L.: Accelerating T-SNE using tree-based algorithms. J. Mach. Learn. Res. 15(1), 3221–3245 (2014)
Xiao, B., Biros, G.: Parallel algorithms for nearest neighbor search problems in high dimensions. SIAM J. Sci. Comput. 38(5), S667–S699 (2016)
Yeh, C.M., et al.: Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: IEEE ICDM, pp. 1317–1322 (2016)
Yeh, C.M., et al.: Matrix profile III: the matrix profile allows visualization of salient subsequences in massive time series. In: IEEE ICDM, pp. 579–588 (2016)
Yu, C.D., et al.: Performance optimization for the K-nearest neighbors kernel on X86 architectures. In: ACM SC (2015)
Zheng, X., et al.: PSML: a multi-scale time-series dataset for machine learning in decarbonized energy grids (dataset) (2021). https://doi.org/10.5281/zenodo.5130612
Zhu, Y., et al.: Matrix profile XI: SCRIMP++: time series motif discovery at interactive speeds. In: IEEE ICDM, pp. 837–846 (2018)
Zhu, Y., et al.: Matrix profile VII: time series chains: a new primitive for time series data mining. In: 2017 IEEE ICDM 2017, pp. 695–704 (2017)
Zhu, Y., et al.: Matrix profile II: exploiting a novel algorithm and GPUs to break the one hundred million barrier for time series motifs and joins. Knowl. Inf. Syst. 54(1) (2018)
Zhu, Y., et al.: The swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code. In: KDD 2020, vol. 34, pp. 949–979 (2020)
Zimmerman, Z., et al.: Matrix profile XIV: scaling time series motif discovery with GPUs to break a quintillion pairwise comparisons a day and beyond. In: ACM SoCC, pp. 74–86 (2019)
Acknowledgements
This work is partially funded by Bayerische Forschungsstiftung under the research grants Optimierung von Gasturbinen mit Hilfe von Big Data (AZ-1214-16), and Von der Edge zur Cloud und zurück: Skalierbare und Adaptive Sensordatenverarbeitung (AZ-1468-20). The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this work by providing computing time on GCS Supercomputer SuperMUC-NG at at Leibniz Supercomputing Centre (www.lrz.de).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Raoofy, A., Karlstetter, R., Schreiber, M., Trinitis, C., Schulz, M. (2023). Overcoming Weak Scaling Challenges in Tree-Based Nearest Neighbor Time Series Mining. In: Bhatele, A., Hammond, J., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13948. Springer, Cham. https://doi.org/10.1007/978-3-031-32041-5_17
Download citation
DOI: https://doi.org/10.1007/978-3-031-32041-5_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-32040-8
Online ISBN: 978-3-031-32041-5
eBook Packages: Computer ScienceComputer Science (R0)