1 Introduction

Modern Renewable Energy System (RES) installations, such as wind turbines, are monitored by up to hundreds of high-quality sensors sampled at high frequencies, e.g., 10 Hz, 50 Hz, or 100 Hz. Each of these sensors can perform several measurements and output several time series. Thus, a wind turbine produces thousands of high-frequency time series. A time series is a sequence of data points \(\langle (t_1, v_1), (t_2, v_2), \ldots \rangle \) where \(t_i < t_{i+1}\) and \(t_i\) represents the time when the value \(v_i \in \mathbb{R}\) was measured by the sensor. Assuming a wind turbine can generate 2500 time series sampled at 100 Hz and that both timestamps and values require 8 bytes, one wind turbine generates more than 321 GiB of data each day. Thus, a park of 100 wind turbines generates more than 11 PiB of data every year.
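These data volumes follow directly from the sampling parameters; as a quick back-of-the-envelope check:

```python
# Assumptions from the example above: 2500 time series per turbine,
# 100 Hz sampling, and 8 bytes for each timestamp and each value.
SERIES, HZ, BYTES_PER_POINT = 2500, 100, 8 + 8
per_day = SERIES * HZ * BYTES_PER_POINT * 86_400
print(per_day / 2**30)              # ~321.9 GiB per turbine per day
print(per_day * 100 * 365 / 2**50)  # ~11.2 PiB per 100-turbine park per year
```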

Ingesting, managing, and analyzing such vast amounts of time series is infeasible with traditional Relational Database Management Systems (RDBMSs) [11]. Moreover, RES installations are often located in remote areas with such limited connectivity that it is impossible to transfer the raw time series to the cloud. Thus, in addition to using lossless compression, manufacturers and owners of RES installations typically downsample the time series, e.g., by computing 10-minute averages. However, this removes valuable fluctuations and outliers.

In addition, data scientists often use advanced, tailor-made, RES-specific scripts executed on their local PCs (“clients”) to analyze the performance of, and detect problems in, RES installations. Thus, the data scientists have to download the relevant time series from the edge and/or cloud before they can perform their analyses. This is not only cumbersome but also highly inefficient and thus very costly due to the large amount of bandwidth, storage, and computation required.

As an attempt to remedy these problems, several systems for ingesting, managing, and analyzing time series have been proposed [1, 5, 9, 10]. However, these Time Series Management Systems (TSMSs) are generally designed to be deployed on the edge or in the cloud and do not take the client’s computational power into account when executing queries. Thus, they cannot optimize across edge, cloud, and client. Instead, users must connect these systems using additional software and determine both where to execute queries and when to transfer data.

Based on discussions with manufacturers and owners of RES installations, we present our vision for a next-generation TSMS that can efficiently ingest, manage, and analyze vast amounts of RES time series across edge, cloud, and client. To allow data scientists to focus on analytics and creating value instead of data management, the envisioned TSMS must provide the following functionality:

(R1) Error-Bounded Compression on the Edge: Time series must be compressed on the edge to reduce the required bandwidth and storage. To support accurate analytics, timestamps must be represented without any error, while each value must be represented within a user-defined error bound (possibly 0%).

(R2) Continuous Transfer of Data Points from Edge to Cloud: The compressed time series must be continuously transferred to the cloud in a resource-aware manner, i.e., important data must be prioritized. The TSMS must also support persisting compressed time series on the edge if the connection fails.

(R3) Locality-Aware Execution of Queries and Scripts: To avoid transferring vast amounts of data from the edge and cloud to the client, the TSMS must efficiently execute queries and scripts across the edge, cloud, and client.

(R4) Integration with Data Analytics Tools: The TSMS must integrate with existing infrastructure and tools such that data scientists can continue using their advanced, tailor-made, RES-specific scripts. This makes it feasible for the TSMS to actually be adopted by manufacturers and owners of RES installations.

The rest of this paper is structured as follows. In Sect. 2, we describe the requirements in detail and our vision for a next-generation TSMS. In Sect. 3, we describe related work. Finally, in Sect. 4, we present our conclusion.

2 The Envisioned System

Data from modern RES installations is managed and analyzed by edge nodes, cloud nodes, and clients. Based on discussions with manufacturers and owners of RES installations, we have defined the following high-level requirements that a next-generation TSMS must meet to efficiently ingest, manage, and analyze high-frequency time series at the vast scale required for modern RES installations.

(R1) Error-Bounded Compression on the Edge: The vast amounts of raw time series data produced by sampling the many high-quality sensors in modern RES installations make it infeasible to store the raw time series on the edge nodes, to transfer them to the cloud nodes, and even to store them on the cloud nodes. Thus, state-of-the-art time series compression methods are required, and the compression must be performed on the edge nodes. The compression to use depends on how the time series data will be used: for example, lossy compression is known to be significantly more effective than lossless compression [8] and can improve the precision of analytics [19], but more research is required before users trust lossy compression [3]. Thus, a next-generation TSMS for RES must automatically use the best compression possible while taking into account the available resources and the requirements of the analytics that will be performed.
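To make error-bounded compression concrete, the following is a minimal sketch in the spirit of constant-value model types such as PMC-Mean; the function and its (count, representative) output are purely illustrative and not the API of any existing TSMS. For simplicity, it assumes positive sensor values; values at or near zero need special handling under a relative error bound.

```python
def compress(values, error_bound):
    """Greedily group values into (count, representative) segments such
    that every value in a segment is within the relative error_bound
    (e.g., 0.05 = 5%) of the stored representative. An error_bound of
    0.0 degenerates to lossless run-length encoding of equal values."""
    segments = []
    lo = hi = values[0]
    count = 1
    for v in values[1:]:
        new_lo, new_hi = min(lo, v), max(hi, v)
        mid = (new_lo + new_hi) / 2
        # For positive values, the midpoint is within the bound of all
        # values in [new_lo, new_hi] iff it is within the bound of the
        # two extremes.
        if all(abs(mid - x) <= error_bound * abs(x) for x in (new_lo, new_hi)):
            lo, hi, count = new_lo, new_hi, count + 1
        else:
            segments.append((count, (lo + hi) / 2))
            lo, hi, count = v, v, 1
    segments.append((count, (lo + hi) / 2))
    return segments

# Six 16-byte data points become two (count, value) pairs within a 5% bound:
print(compress([10.0, 10.2, 9.9, 10.1, 20.0, 20.3], error_bound=0.05))
# [(4, 10.05), (2, 20.15)]
```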

(R2) Continuous Transfer of Data Points from Edge to Cloud: Due to the very limited hardware on the edge nodes, and to enable analytics across multiple RES installations, the data points ingested on the edge nodes must be transferred to the cloud nodes. From discussions with owners of RES installations, we know that their edge nodes are low-end commodity PCs, e.g., with 4 CPU cores, 4 GiB RAM, and an HDD. However, it is often impossible to transfer the raw time series due to the limited amount of bandwidth available, which can be as low as 500 Kbit/s to 5 Mbit/s. In addition, the connection may be down for an unknown period of time. Thus, a next-generation TSMS for RES must continuously transfer the highly compressed representations of the high-frequency time series to the cloud, decide which data must be transferred first if the available bandwidth is limited, and persist the data in case the connection is down.
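For scale, the example turbine from Sect. 1 produces 4 MB/s (32 Mbit/s) of raw data, so fitting into a 500 Kbit/s link requires a sustained compression ratio of roughly 64 or more. The sketch below illustrates one way prioritized, failure-tolerant transfer could be structured; the class and its priority scheme are our own illustration rather than a committed design.

```python
import heapq
import itertools
import pickle

class EdgeTransferQueue:
    """Minimal sketch: compressed segments are sent most-important-first
    within a per-interval bandwidth budget and spilled to local disk
    when the connection to the cloud is down."""

    def __init__(self, spill_path):
        self._heap = []                # entries are (priority, seq, segment)
        self._seq = itertools.count()  # tie-breaker for equal priorities
        self._spill_path = spill_path

    def enqueue(self, priority, segment):
        # Lower numbers mean higher importance, e.g., outliers before
        # long stretches of normal operation.
        heapq.heappush(self._heap, (priority, next(self._seq), segment))

    def drain(self, send, budget_bytes):
        sent = 0
        while self._heap and sent < budget_bytes:
            priority, seq, segment = heapq.heappop(self._heap)
            try:
                send(segment)          # e.g., an HTTP or message-queue transfer
                sent += len(segment)
            except ConnectionError:
                # Requeue and persist everything so no data is lost
                # while the connection is down (R2).
                heapq.heappush(self._heap, (priority, seq, segment))
                with open(self._spill_path, "wb") as f:
                    pickle.dump(self._heap, f)
                break
        return sent
```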

(R3) Locality-Aware Execution of Queries and Scripts: Data scientists generally perform analytics using advanced, tailor-made, RES-specific scripts on the client; however, this can require downloading vast amounts of time series to the client before the analytics can be performed. Instead, it is often much more efficient to execute the queries or scripts where the data is located, i.e., on the edge nodes and cloud nodes. For example, if a simple aggregate such as MIN or MAX is computed, the input may be terabytes of data while the result is only an 8-byte value. However, the client should also participate in query processing when it is more efficient to do so, as even low-end commodity PCs today have multiple CPU cores and gigabytes of memory. For example, if multiple types of analytics have to be performed and the result of each is similar in size to the input data, it may be faster to download the data to the client once and perform the analytics there. Thus, a next-generation TSMS for RES must support executing queries and scripts that perform advanced analytics on the edge nodes, cloud nodes, and client, and it must include a dynamic optimizer that automatically determines where to perform the analytics to use the least amount of resources.
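A toy cost model makes this trade-off concrete; it is purely illustrative and far simpler than the envisioned dynamic optimizer, which must also account for, e.g., CPU, memory, and the current load on each node.

```python
def plan(input_bytes, result_bytes, runs):
    """Choose where to execute `runs` analyses by comparing the bytes
    moved when shipping results from the edge/cloud against downloading
    the input data once and running everything on the client."""
    ship_results = runs * result_bytes  # execute on the edge/cloud nodes
    ship_data = input_bytes             # execute on the client
    return "client" if ship_data < ship_results else "remote"

# A MIN over a terabyte returns 8 bytes, so it should run remotely:
print(plan(input_bytes=10**12, result_bytes=8, runs=1))       # remote
# Many analyses whose results match the input size favor the client:
print(plan(input_bytes=10**12, result_bytes=10**12, runs=3))  # client
```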

(R4) Integration with Data Analytics Tools: As stated, data scientists employed by RES manufacturers and owners use tailor-made, specialized analytics tools to detect and understand problems in RES installations. These tools use a significant amount of custom code that has been refined for years. In addition, the computations performed by these tools are not easily expressible in SQL and are typically implemented using Python and its packages for scientific computing, such as NumPy and pandas. Thus, to make it feasible for manufacturers and owners of RES installations to adopt the next-generation TSMS, it must provide effective and efficient integration with the current infrastructure and tools used by the data scientists so they can continue to use their existing tools.
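For illustration, such integration could look as follows from the data scientist’s perspective. The `tsms` client library and its API are hypothetical, so the query result is mocked with a small DataFrame to keep the example runnable.

```python
import pandas as pd

# Hypothetical call to the envisioned TSMS client library:
# df = tsms.connect("cloud.example.com").sql(
#     "SELECT timestamp, value FROM wind_turbine WHERE ..."
# ).to_pandas()
df = pd.DataFrame({  # mocked result standing in for the call above
    "timestamp": pd.date_range("2022-01-01", periods=6, freq="1min"),
    "value": [3.1, 3.3, 2.9, 3.0, 3.2, 3.4],
})

# Existing tailor-made, pandas-based analytics continue to work unchanged:
print(df.resample("10min", on="timestamp")["value"].mean())
```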

Fig. 1. Illustration of envisioned data collection and analysis in a RES domain

To meet these requirements, we envision a system as outlined in Fig. 1. A single-node version of the TSMS is deployed on each of the edge nodes. It collects sensor data and compresses it within an error bound (possibly 0%) based on the requirements of the analytics (R1). The compressed data is continuously transferred from the edge nodes to the cloud with important data transferred first (R2). A distributed version of the TSMS is deployed on the cloud nodes for scalable analytics. Data scientists perform analytics using their own client PCs and use the TSMS as a library that integrates with their existing tools (R4). The library sends queries and scripts to the cloud, and when cloud nodes execute them, they may send queries, scripts, or data requests to the edge nodes (R3). As the TSMS is deployed on the edge, cloud, and client, it can continue to operate on local data if a network connection is temporarily unavailable. For example, an edge node can simply write data to disk, and engineers working on the wind turbine can query its local data using their existing tools (R3, R4).

3 Related Work

Many TSMSs have been proposed to manage the vast amounts of time series data being produced [1, 5, 9, 10]. However, while some of them can be deployed across edge and cloud, none of them take a holistic approach to time series analytics.

Some TSMSs support ingesting time series on the edge nodes and then transferring them to the cloud. Respawn [2] is a TSMS that ingests sensor data on edge nodes, continuously computes aggregates from the ingested sensor data to improve query response time, and continuously transfers data to cloud nodes based on two strategies: transferring low-resolution aggregates to the cloud nodes and transferring important data based on its standard deviation. Queries are routed to the relevant edge nodes and cloud nodes. Storacle [4, 6] is a TSMS designed to be deployed on edge nodes throughout a smart grid. Storacle uses a three-tiered storage model consisting of local memory, local storage, and cloud storage. Data is continuously transferred from the edge nodes to the cloud nodes, and the latest data points are not immediately deleted when they are transferred to the next tier so they can be used to answer queries with low latency.

Apache IoTDB [20] is designed to be deployed on edge nodes as an embedded or standalone TSMS and on cloud nodes as a distributed TSMS. The ingested time series are stored in a novel compressed column-based format similar to Apache Parquet but optimized for time series. VergeDB [15] is a TSMS designed to ingest and compress time series on edge nodes using different lossless and lossy compression methods depending on the analytics to be performed on the data downstream. A component is being developed that will automatically select the compression method to use based on the available resources and the analytics that will be performed.

ModelarDB [11–14] is a modular TSMS designed to be deployed on edge nodes and cloud nodes. It supports different query engines and data stores optimized for different use cases, and its modularity makes it simple to extend the system with support for additional query engines and data stores. It uses multiple types of models to efficiently compress high-frequency time series within a user-defined error bound (possibly 0%); thus, both lossless and lossy compression are supported. User-defined model types can optionally be added to ModelarDB without recompiling the system. The models and accompanying metadata are continuously transferred from the edge nodes to the cloud nodes.

Some TSMSs support performing complex user-defined analytics directly in the TSMS, while others utilize the computational capabilities of both the cloud nodes and the client when executing queries. NilmDB [16] stores time series in a novel data store named BulkData and metadata in SQLite [7]. Users can retrieve data points using queries and then analyze them on the client. Alternatively, users can run the analysis on the server by submitting a Python script, which reduces the amount of data to be transferred.

LittleTable [18] is a TSMS designed like an RDBMS but specialized for time series; for example, it lacks support for updates and NULL values. The client software for LittleTable is implemented using SQLite’s [7] Virtual Table Interface. When LittleTable is queried, it guarantees that the data is transmitted to SQLite in ascending or descending order by primary key. The SQLite client can exploit this ordering to, e.g., efficiently compute aggregates that GROUP BY time and metadata such as a device id.

DuckDB [17] is not a TSMS but an embeddable RDBMS designed for analytics. Thus, it can be used like SQLite [7] but is optimized for Online Analytical Processing (OLAP) instead of Online Transaction Processing (OLTP). To do so, DuckDB combines a C/C++/SQL API, a cost-based optimizer, Multiversion Concurrency Control (MVCC), and a vectorized query engine. This allows OLAP queries to be performed efficiently on the client without requiring the installation, configuration, and management of a complex standalone RDBMS.
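As a brief illustration of this embedded, client-side style of analytics, DuckDB can run OLAP queries in-process over a pandas DataFrame (the data below is made up):

```python
import duckdb
import pandas as pd

readings = pd.DataFrame({
    "device_id": [1, 1, 2, 2],
    "value": [3.1, 3.5, 2.8, 2.9],
})

con = duckdb.connect()  # in-memory, in-process database
con.register("readings", readings)
print(con.execute(
    "SELECT device_id, AVG(value) AS avg_value "
    "FROM readings GROUP BY device_id ORDER BY device_id"
).fetchall())
# [(1, 3.3), (2, 2.85)] (up to floating-point rounding)
```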

Existing TSMSs thus only provide simple integration between the edge and the cloud [2, 4, 6, 11–15] (R1, R2, R3) or simple integration between the cloud and the client [16, 18] (R3, R4). We envision a TSMS that provides both to enable scalable, holistic analytics of RES sensor data across edge, cloud, and client.

4 Conclusion

Modern Renewable Energy System (RES) installations produce petabytes of high-frequency time series. State-of-the-art systems cannot cope with such vast amounts of data. In this paper, we presented our vision for a next-generation TSMS that can efficiently manage vast amounts of time series data. Based on discussions with practitioners, we presented requirements for such a system and outlined how it should operate in a distributed fashion across edge, cloud, and client.