Abstract
Modern Renewable Energy System (RES) installations, e.g., wind turbines, produce petabytes of high-frequency time series. State-of-the-art systems cannot cope with such amounts of data. Thus, practitioners generally store simple aggregates, e.g., 10-min averages. Based on discussions with practitioners, we present requirements and our vision for a next-generation time series management system that can efficiently manage vast amounts of time series across edge, cloud, and client.
1 Introduction
Modern Renewable Energy System (RES) installations, such as wind turbines, are monitored by up to hundreds of high-quality sensors sampled at high frequencies, e.g., 10 Hz, 50 Hz, or 100 Hz. Each of these sensors can perform several measurements and output several time series. Thus, a wind turbine produces thousands of high-frequency time series. A time series is a sequence of data points \(\langle (t_1, v_1), (t_2, v_2), \ldots \rangle \) where \(t_i < t_{i+1}\) and \(t_i\) represents the time when the value \(v_i \in \mathbb{R}\) was measured by the sensor. Assuming a wind turbine can generate 2500 time series sampled at 100 Hz and that both timestamps and values require 8 bytes, one wind turbine generates more than 321 GiB of data each day. Thus, a park of 100 wind turbines generates more than 11 PiB of data every year.
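The arithmetic behind these estimates is easy to verify with a few lines of Python:

```python
# Back-of-the-envelope data volume for the wind park example above.
SERIES_PER_TURBINE = 2500
SAMPLE_HZ = 100
BYTES_PER_POINT = 16  # 8-byte timestamp + 8-byte value

bytes_per_day = SERIES_PER_TURBINE * SAMPLE_HZ * BYTES_PER_POINT * 86_400
gib_per_day = bytes_per_day / 2**30
print(f"One turbine per day: {gib_per_day:.0f} GiB")  # more than 321 GiB

pib_per_year = bytes_per_day * 100 * 365 / 2**50
print(f"100 turbines per year: {pib_per_year:.1f} PiB")  # more than 11 PiB
```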
Ingesting, managing, and analyzing such vast amounts of time series is infeasible with traditional Relational Database Management Systems (RDBMSs) [11]. Besides, RES installations are often located in remote areas with such limited connectivity that it is impossible to transfer the raw time series to the cloud. Thus, in addition to using lossless compression, manufacturers and owners of RES installations typically downsample the time series, e.g., by computing 10-min averages. However, this removes valuable fluctuations and outliers.
In addition, data scientists often use advanced, tailor-made, RES-specific scripts executed on their local PCs (“clients”) to analyze the performance of and detect problems in RES installations. Thus, the data scientists have to download the relevant time series from the edge and/or cloud before they can do their analysis. This is not only cumbersome but also highly inefficient and thus very costly due to the large amount of bandwidth, storage, and computation required.
As an attempt to remedy these problems, several systems for ingesting, managing, and analyzing time series have been proposed [1, 5, 9, 10]. However, these Time Series Management Systems (TSMSs) are generally designed to be deployed on the edge or in the cloud and do not take the client’s computational power into account when executing queries. Thus, they cannot optimize across edge, cloud, and client. Instead, users must connect these systems using additional software and manually determine both where to execute queries and when to transfer data.
Based on discussions with manufacturers and owners of RES installations, we present our vision for a next-generation TSMS that can efficiently ingest, manage, and analyze vast amounts of RES time series across edge, cloud, and client. To allow data scientists to focus on analytics and creating value instead of data management, the envisioned TSMS must provide the following functionality:
(R1) Error-Bounded Compression on the Edge: Time series must be compressed on the edge to reduce the required bandwidth and storage. To support accurate analytics, timestamps must be represented without any error while each value must be represented within a user-defined error bound (possibly 0%).
(R2) Continuous Transfer of Data Points from Edge to Cloud: The compressed time series must be continuously transferred to the cloud in a resource-aware manner, i.e., important data must be prioritized. The TSMS must also support persisting compressed time series on the edge if the connection fails.
(R3) Locality-Aware Execution of Queries and Scripts: To avoid transferring vast amounts of data from the edge and cloud to the client, the TSMS must efficiently execute queries and scripts across the edge, cloud, and client.
(R4) Integration with Data Analytics Tools: The TSMS must integrate with existing infrastructure and tools such that data scientists can continue using their advanced, tailor-made, RES-specific scripts. This makes it feasible for the TSMS to actually be adopted by manufacturers and owners of RES installations.
The rest of this paper is structured as follows. In Sect. 2, we describe the requirements in detail and our vision for a next-generation TSMS. In Sect. 3, we describe related work. Finally, in Sect. 4, we present our conclusion.
2 The Envisioned System
Data from modern RES installations is managed and analyzed by edge nodes, cloud nodes, and clients. Based on discussions with manufacturers and owners of RES installations, we have defined the following high-level requirements that a next-generation TSMS must meet to efficiently ingest, manage, and analyze high-frequency time series at the vast scale required for modern RES installations.
(R1) Error-Bounded Compression on the Edge: The vast amounts of raw time series data produced by sampling the many high-quality sensors in modern RES installations make it infeasible to store the raw time series on the edge nodes, transfer them to the cloud nodes, and even store them on the cloud nodes. Thus, state-of-the-art time series compression methods are required and the compression must be performed on the edge nodes. The compression to use depends on how the time series data will be used, e.g., lossy compression is known to be significantly more effective than lossless compression [8] and can improve the precision of analytics [19], but more research is required for users to trust lossy compression [3]. Thus, a next-generation TSMS for RES must automatically use the best compression possible while taking the available resources and requirements of the analytics that will be performed into account.
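To make the idea of error-bounded compression concrete, the following is a minimal sketch of one simple model type used by model-based compression methods: a midrange constant model in the style of Poor Man's Compression. The function name, the use of an absolute error bound, and the segment representation are our own simplifications, not the design of any particular system:

```python
def constant_segments(values, error_bound):
    """Greedily group consecutive values into segments whose midrange
    approximates every value within +/- error_bound (absolute error).
    Returns a list of (segment_length, approximate_value) pairs."""
    segments = []
    start = 0
    lo = hi = values[0]
    for i, v in enumerate(values[1:], start=1):
        new_lo, new_hi = min(lo, v), max(hi, v)
        # The midpoint of [lo, hi] is within the bound of every value in
        # the segment iff the interval's width is at most 2 * error_bound.
        if new_hi - new_lo <= 2 * error_bound:
            lo, hi = new_lo, new_hi
        else:
            segments.append((i - start, (lo + hi) / 2))
            start, lo, hi = i, v, v
    segments.append((len(values) - start, (lo + hi) / 2))
    return segments

# Six noisy data points collapse to two (length, value) pairs:
print(constant_segments([10.0, 10.2, 9.9, 10.1, 20.0, 20.1], 0.5))
```

With an error bound of 0 the same procedure degenerates to run-length encoding of identical values, i.e., lossless compression, which illustrates why a single mechanism can serve both the lossless and the lossy case.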
(R2) Continuous Transfer of Data Points from Edge to Cloud: Due to the very limited hardware on the edge nodes, and to enable analytics across multiple RES installations, the data points ingested on the edge nodes must be transferred to the cloud nodes. From discussions with owners of RES installations, we know that their edge nodes are low-end commodity PCs with, e.g., 4 CPU cores, 4 GiB RAM, and an HDD. However, it is often impossible to transfer the raw time series due to the limited amount of bandwidth available, as it can be as low as 500 Kbit/s to 5 Mbit/s. In addition, the connection may be down for an unknown period of time. Thus, a next-generation TSMS for RES must continuously transfer the highly compressed representations of the high-frequency time series to the cloud, decide which data must be transferred if the available bandwidth is limited, and also persist the data in case the connection is down.
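The combination of prioritized transfer and local persistence can be sketched as a small priority queue that spools segments to disk when the uplink fails. The class, its interface, and the use of pickle files as the spool format are all hypothetical illustrations, not part of any existing system:

```python
import heapq
import pathlib
import pickle
import tempfile

class EdgeTransferQueue:
    """Sketch of resource-aware transfer: segments are sent in priority
    order and spooled to local disk whenever the uplink is down."""

    def __init__(self, spool_dir):
        self.heap = []  # (priority, seq, segment); lower priority = sooner
        self.seq = 0
        self.spool = pathlib.Path(spool_dir)
        self.spool.mkdir(parents=True, exist_ok=True)

    def enqueue(self, segment, priority):
        heapq.heappush(self.heap, (priority, self.seq, segment))
        self.seq += 1

    def drain(self, send):
        """Try to send all queued segments; spool the rest on failure."""
        while self.heap:
            priority, seq, segment = heapq.heappop(self.heap)
            try:
                send(segment)
            except ConnectionError:
                # Uplink is down: persist the segment for a later retry.
                path = self.spool / f"{priority}-{seq}.seg"
                path.write_bytes(pickle.dumps(segment))

sent = []
q = EdgeTransferQueue(tempfile.mkdtemp())
q.enqueue("routine 10-min averages", priority=5)
q.enqueue("vibration alarm window", priority=1)
q.drain(sent.append)  # uplink up: everything is sent, the alarm first
print(sent)

def offline(segment):
    raise ConnectionError("uplink down")

q.enqueue("segment produced while offline", priority=3)
q.drain(offline)  # nothing is lost: the segment is spooled to disk
```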
(R3) Locality-Aware Execution of Queries and Scripts: Data scientists generally perform analytics using advanced, tailor-made, RES-specific scripts on the client. However, this can require downloading vast amounts of time series to the client before the analytics can be performed. Instead, it would be much more efficient to execute the queries or scripts where the data is located, i.e., on the edge nodes and cloud nodes. For example, if a simple aggregate such as MIN or MAX is computed, the input may be terabytes of data while the result is only an 8-byte value. However, the client should also participate in query processing when it is more efficient to do so, as even low-end commodity PCs today have multiple CPU cores and gigabytes of memory. For example, if multiple types of analytics have to be performed and the result of each is similar in size to the input data, it may be faster to download the data to the client and perform the analytics there. Thus, a next-generation TSMS for RES must support executing queries and scripts that perform advanced analytics on the edge nodes, cloud nodes, and client, and it must include a dynamic optimizer that automatically determines where to perform the analytics to use the least amount of resources.
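The trade-off that such a dynamic optimizer must weigh can be illustrated with a deliberately simplified cost model. Everything here is an assumption for illustration: the function name, the two-site choice, and the use of processing speed and bandwidth (both in bytes per second) as the only cost factors:

```python
def best_site(input_bytes, result_bytes, runs,
              server_speed, client_speed, bandwidth):
    """Compare the estimated cost (seconds) of running all `runs`
    analyses remotely and shipping each result, versus shipping the
    input once and running the analyses locally on the client."""
    remote = runs * (input_bytes / server_speed + result_bytes / bandwidth)
    local = input_bytes / bandwidth + runs * (input_bytes / client_speed)
    return "cloud" if remote <= local else "client"

# A MIN/MAX-style aggregate (terabyte input, 8-byte result) should
# clearly run where the data is:
print(best_site(1e12, 8, 1, 1e9, 1e8, 1e6))  # -> cloud

# Twenty analyses whose results are as large as the input favor
# downloading the data once and computing on the client:
print(best_site(1e9, 1e9, 20, 1e9, 1e8, 1e7))  # -> client
```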
(R4) Integration with Data Analytics Tools: As stated, data scientists employed by RES manufacturers and owners use tailor-made, specialized analytics tools to detect and understand problems in RES installations. These tools use a significant amount of custom code that has been refined for years. In addition, the computations performed by these tools are not easily expressible in SQL and are typically implemented using Python and its packages for scientific computing such as NumPy and pandas. Thus, to make it feasible for manufacturers and owners of RES installations to adopt the next-generation TSMS, it must provide effective and efficient integration with the current infrastructure and tools used by the data scientists so they can continue to use their existing tools.
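One way to satisfy this requirement is a thin client library that hands query results to the data scientists' existing pandas code. The helper below is a hypothetical sketch of that idea; `query_to_dataframe`, the `(columns, rows)` result shape, and the stand-in `fake_execute` connection are our own illustrations, not the API of any existing TSMS:

```python
import pandas as pd

def query_to_dataframe(execute_query, sql):
    """Run a query through the TSMS and return the rows as a DataFrame,
    so downstream NumPy/pandas code can stay unchanged."""
    columns, rows = execute_query(sql)
    return pd.DataFrame(rows, columns=columns)

# Stand-in for a real TSMS connection, used here for local testing:
def fake_execute(sql):
    return ["ts", "value"], [("2023-06-01T00:00:00", 13.2),
                             ("2023-06-01T00:00:01", 13.4)]

df = query_to_dataframe(fake_execute, "SELECT ts, value FROM turbine_17")
print(df["value"].mean())  # plain pandas from here on
```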
To meet these requirements, we envision a system as outlined in Fig. 1. A single-node version of the TSMS is deployed on each of the edge nodes. It collects sensor data and compresses it within an error bound (possibly 0%) based on the requirements of the analytics (R1). The compressed data is continuously transferred from the edge nodes to the cloud with important data transferred first (R2). A distributed version of the TSMS is deployed on the cloud nodes for scalable analytics. Data scientists perform analytics using their own client PCs and use the TSMS as a library that integrates with their existing tools (R4). The library sends queries and scripts to the cloud, and when cloud nodes execute them, they may send queries, scripts, or data requests to the edge nodes (R3). As the TSMS is deployed on the edge, cloud, and client, it can continue to operate on local data if a network connection is temporarily unavailable. For example, an edge node can simply write data to disk, and engineers working on the wind turbine can query its local data using their existing tools (R3, R4).
3 Related Work
Many TSMSs have been proposed to manage the vast amounts of time series data being produced [1, 5, 9, 10]. However, while some of them can be deployed across edge and cloud, none of them take a holistic approach to time series analytics.
Some TSMSs support ingesting time series on the edge nodes and then transferring them to the cloud. Respawn [2] is a TSMS that ingests sensor data on edge nodes, continuously computes aggregates from the ingested sensor data to improve query response time, and continuously transfers data to cloud nodes based on two strategies: transfer low-resolution aggregates to the cloud nodes and transfer important data based on its standard deviation. Queries are routed to the relevant edge nodes and cloud nodes. Storacle [4, 6] is a TSMS designed to be deployed on edge nodes throughout a smart grid. Storacle uses a three-tiered storage model consisting of local memory, local storage, and cloud storage. Data is continuously transferred from the edge nodes to the cloud nodes. The latest data points are not immediately deleted when they are transferred to the next tier so they can be used to answer queries with low latency. Apache IoTDB [20] is designed to be deployed on edge nodes as an embedded or standalone TSMS and on cloud nodes as a distributed TSMS. The ingested time series are stored in a novel compressed column-based format similar to Apache Parquet but optimized for time series. VergeDB [15] is a TSMS designed to ingest and compress time series on edge nodes using different lossless and lossy compression methods depending on the analytics to be performed on the data downstream. A component is being developed that will automatically select the compression method to use based on the available resources and the analytics that will be performed. ModelarDB [11,12,13,14] is a modular TSMS designed to be deployed on edge nodes and cloud nodes. It supports different query engines and data stores optimized for different use cases and it is simple to extend the system with support for additional query engines and data stores due to its modularity. It uses multiple different types of models to efficiently compress high-frequency time series within a user-defined error bound (possibly 0%). Thus, both lossless and lossy compression are supported. User-defined model types can optionally be added to ModelarDB without recompiling the system. The models and accompanying metadata are continuously transferred from the edge nodes to the cloud nodes.
Some TSMSs support performing complex user-defined analytics directly in the TSMS while others utilize the computational capabilities of both the cloud nodes and the client when executing queries. NilmDB [16] stores time series in a novel data store named BulkData and metadata in SQLite [7]. Users can retrieve data points using queries and then analyze them on the client. Alternatively, the users can run the analysis on the server by submitting a Python script. This reduces the amount of data to be transferred. LittleTable [18] is a TSMS designed like an RDBMS but specialized for time series. For example, it lacks support for updates and NULL values. The client software for LittleTable is implemented using SQLite’s [7] Virtual Table Interface. When LittleTable is queried, it guarantees that the data is transmitted to SQLite in ascending or descending order by primary key. The SQLite client can utilize this knowledge to, e.g., efficiently compute aggregates that GROUP BY time and metadata such as a device id. DuckDB [17] is not a TSMS but an embeddable RDBMS designed for analytics. Thus, it can be used like SQLite [7] but is optimized for Online Analytical Processing (OLAP) instead of Online Transaction Processing (OLTP). To do so, DuckDB combines a C/C++/SQL-API, a cost-based optimizer, Multiversion Concurrency Control (MVCC), and a vectorized query engine. Thus, it allows OLAP queries to be efficiently performed on the client without requiring installation, configuration, and management of a complex standalone RDBMS.
Existing TSMSs thus only have simple integration between the edge and the cloud [2, 4, 6, 11,12,13,14,15] (R1, R2, R3) or simple integration between the cloud and the client [16, 18] (R3, R4). We envision a TSMS that provides both to enable scalable, holistic analytics of RES sensor data across edge, cloud, and client.
4 Conclusion
Modern Renewable Energy System (RES) installations produce petabytes of high-frequency time series. State-of-the-art systems cannot cope with such vast amounts of data. In this paper, we presented our vision for a next-generation TSMS that can efficiently manage vast amounts of time series data. Based on discussions with practitioners, we presented requirements for such a system and an outline of how it should operate in a distributed manner across edge, cloud, and client.
References
Bader, A., Kopp, O., Michael, F.: Survey and comparison of open source time series databases. In: Proceedings of the BTW - Workshopband, pp. 249–268. GI (2017)
Buevich, M., Wright, A., Sargent, R., Rowe, A.: Respawn: a distributed multi-resolution time-series datastore. In: Proceedings of the RTSS, pp. 288–297. IEEE (2013)
Cappello, F., Di, S., Gok, A.M.: Fulfilling the promises of lossy compression for scientific applications. In: Nichols, J., Verastegui, B., Maccabe, A.B., Hernandez, O., Parete-Koon, S., Ahearn, T. (eds.) SMC 2020. CCIS, vol. 1315, pp. 99–116. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-63393-6_7
Cejka, S., Mosshammer, R., Einfalt, A.: Java embedded storage for time series and meta data in Smart Grids. In: Proceedings of the SmartGridComm, pp. 434–439. IEEE (2015)
DB-Engines Ranking of Time Series DBMS (2023). https://db-engines.com/en/ranking/time+series+dbms
Faschang, M., et al.: Provisioning, deployment, and operation of smart grid applications on substation level. CSRD 32(1–2), 117–130 (2017)
Gaffney, K.P., Prammer, M., Brasfield, L.C., Hipp, D.R., Kennedy, D.R., Patel, J.M.: SQLite: past, present, and future. PVLDB 15(12), 3535–3547 (2022)
Hung, N.Q.V., Jeung, H., Aberer, K.: An evaluation of model-based approaches to sensor data compression. TKDE 25(11), 2434–2447 (2013)
Jensen, S.K., Pedersen, T.B., Thomsen, C.: Time series management systems: a 2022 survey. In: Palpanas, T., Zoumpatianos, K. (eds.) Data Series Management and Analytics. ACM (Forthcoming)
Jensen, S.K., Pedersen, T.B., Thomsen, C.: Time series management systems: a survey. TKDE 29(11), 2581–2600 (2017)
Jensen, S.K., Pedersen, T.B., Thomsen, C.: ModelarDB: modular model-based time series management with spark and Cassandra. PVLDB 11(11), 1688–1701 (2018)
Jensen, S.K., Pedersen, T.B., Thomsen, C.: Demonstration of ModelarDB: model-based management of dimensional time series. In: Proceedings of the SIGMOD, pp. 1933–1936. ACM (2019)
Jensen, S.K., Pedersen, T.B., Thomsen, C.: Scalable model-based management of correlated dimensional time series in ModelarDB\(_+\). In: Proceedings of the ICDE, pp. 1380–1391. IEEE (2021)
Jensen, S.K., Thomsen, C., Pedersen, T.B.: ModelarDB: integrated model-based management of time series from edge to cloud. TLDKS 53, 1–33 (2023)
Paparrizos, J., et al.: VergeDB: a database for IoT analytics on edge devices. In: Proceedings of the CIDR (2021)
Paris, J., Donnal, J.S., Leeb, S.B.: NilmDB: the non-intrusive load monitor database. TSG 5(5), 2459–2467 (2014)
Raasveldt, M., Mühleisen, H.: DuckDB: an embeddable analytical database. In: Proceedings of the SIGMOD. ACM (2019)
Rhea, S., Wang, E., Wong, E., Atkins, E., Storer, N.: LittleTable: a time-series database and its uses. In: Proceedings of the SIGMOD, pp. 125–138. ACM (2017)
Tirupathi, S., et al.: Machine learning platform for extreme scale computing on compressed IoT data. In: Proceedings of the BigData, pp. 3179–3185. IEEE (2022)
Wang, C., et al.: Apache IoTDB: time-series database for internet of things. PVLDB 13(12), 2901–2904 (2020)
Acknowledgements
This research was supported by the MORE project funded by Horizon 2020 grant number 957345. In addition, we thank our industry partners for providing a large amount of detailed information about their domain.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
Cite this paper
Jensen, S.K., Thomsen, C. (2023). Holistic Analytics of Sensor Data from Renewable Energy Sources: A Vision Paper. In: Abelló, A., et al. New Trends in Database and Information Systems. ADBIS 2023. Communications in Computer and Information Science, vol 1850. Springer, Cham. https://doi.org/10.1007/978-3-031-42941-5_31