
1 Introduction

Given the problems of gradual oil depletion and global warming, energy consumption has become a critical factor for energy-intensive sectors, especially the semiconductor, manufacturing, iron and steel, and aluminum industries. Energy is a vital resource for modern civilization and economic growth, and for long-term competitive sustainability in particular. To reduce unnecessary energy consumption and improve energy efficiency, it is critical to make informed decisions in real time. To that end, we collected energy consumption data, together with information from the corresponding production and manufacturing domains, from a co-operating steel-forging plant. Based on load profiles determined from data stream mining, we propose a refined symbolic time-series data mining approach for electricity consumption analysis and the extraction of typical patterns from the load profiles. The objective of our research is to improve the breakpoint table of the SAX algorithm. The breakpoints are the key factor by which SAX symbolizes values: SAX converts different numerical values into different symbols through the breakpoint table. In contrast to the SAX breakpoint table proposed by Lin et al. [1], this study computes the breakpoint values using the cumulative distribution function (CDF), relying on a density-based notion to improve the accuracy of the symbolization [2]. We adopt the tightness of lower bound (TLB) measure to evaluate the performance of our refined distance measure.

2 Framework

To confirm the advantage of the symbolic time-series data mining framework for electricity consumption analysis proposed in this study, we collected electricity consumption data and corresponding product information from an annealing furnace during April 1–December 31, 2014. The following describes the three-phase operational process of the framework, as shown in Fig. 1.

Fig. 1. The process of CDF-based symbolic time-series data mining for electricity consumption analysis

  • Phase 1: Data preprocessing and normalization. In this phase we carried out data cleaning and integration, for example, replacing missing values and defining a complete machining procedure. After data preprocessing, we applied z-score normalization to the data.

  • Phase 2: Dimensionality reduction and symbolization. In this phase we reduced the dimensionality of the data and then symbolized the numerical time-series data. SAX is a symbolic representation of a time series that uses a synthetic set of symbols to reduce the dimensionality of the numerical series. The algorithm follows a two-step process: (1) Piecewise Aggregate Approximation (PAA) and (2) conversion of the PAA sequence into a series of letters. PAA divides a data set of length n into w equally spaced segments, or bins, and computes the average of each segment, thereby reducing the number of dimensions from n to w (n > w); a minimal code sketch of the normalization and PAA steps appears after this list. Breakpoints are the key factor by which the SAX algorithm transforms numerical data into symbolic data. In this work, we refined the breakpoint values using the cumulative distribution function (CDF) and used the results to build the breakpoint table. The methods and the adjusted SAX algorithm are explained in Sect. 3.

  • Phase 3: Identification of machine operational states. In this phase we identified the operational states of the machine and measured the tightness of lower bound (TLB) across methods to assess the effectiveness of the modified SAX algorithm in our application domain. In addition, we improved the lower-bounding distance measure, which calculates the similarity among symbolized energy-load profiles, by considering the variance of the time-series data to obtain a new distance between two continuous strings. The methods and experimental results are explained in Sects. 4 and 5.
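To make Phases 1 and 2 concrete, the following is a minimal Python sketch of z-score normalization and the PAA step. It assumes NumPy is available; the function names and the segment-splitting strategy are our own illustration, not the exact implementation used in this study.

```python
import numpy as np

def zscore(series):
    """Phase 1: z-score normalization of a raw load profile."""
    return (series - series.mean()) / series.std()

def paa(series, w):
    """Phase 2, step 1: Piecewise Aggregate Approximation (PAA).

    Splits a length-n series into w (approximately) equal segments and
    keeps each segment's mean, reducing dimensionality from n to w.
    """
    # np.array_split tolerates n not being an exact multiple of w
    return np.array([segment.mean() for segment in np.array_split(series, w)])

# Example: reduce a 1000-point normalized load profile to 20 segment means
profile = zscore(np.random.default_rng(0).normal(size=1000))
reduced = paa(profile, w=20)
```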

3 Modified Breakpoint Table of SAX Algorithm

When dealing with time-series data efficiently, it is important to develop data representation techniques that reduce the dimensionality of a time series while preserving the fundamental characteristics of the data [3, 4]. Data-size reduction techniques can help in categorizing electrical load consumption patterns on the basis of their shape. SAX, a symbolic representation of a time series, uses a synthetic set of symbols to reduce the dimensionality of the numerical series. As introduced in the framework above, the SAX algorithm follows a two-step process: (1) Piecewise Aggregate Approximation (PAA) and (2) conversion of the PAA sequence into a series of letters. PAA divides a data set of length n into w equally spaced segments, or bins, and computes the average of each segment, reducing the number of dimensions from n to w.

Having transformed a time-series database into the PAA representation, we can apply a further transformation to obtain a discrete representation. It is desirable to have a discretization technique that produces symbols with equal probability. The amplitude intervals may be regular, or they may be determined according to the quantiles of the statistical distribution that represents the probability density of the amplitudes in the entire data set; Lin et al. [1] used the former approach, and this study uses the latter. In the latter case, the entire data set has to be processed to determine the probability distribution of the amplitudes, which this study represents through its CDF. The CDF of a random variable is a way to describe the distribution of the variable; its advantage is that it can be defined for any kind of random variable, including discrete, continuous, and mixed random variables. Starting from the CDF, the amplitude breakpoints can be identified by the quantiles of the probability curve, partitioning the amplitude axis into intervals of equal probability. SAX converts different numerical data into different symbols through the breakpoint table. In contrast to the SAX breakpoint table proposed by Lin et al. [1] (shown as Table 2), this study computes the breakpoint values using the CDF, that is, by relying on a density-based notion in which the number of symbolized data points is the same in every amplitude interval.

The cumulative distribution function (CDF) of a random variable \( X \) is defined as

$$ F_{X}(x) = P(X \le x), \quad \text{for all } x \in \mathbb{R} $$
(1)

\( F_{X}(x) \) accumulates all of the probability less than or equal to \( x \). The CDF for continuous random variables is a straightforward extension of the discrete case. The CDF gives the cumulative probability for a given x-value: for continuous distributions it is the area under the probability density function up to x, and for discrete distributions it is the cumulative probability mass up to x.
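To illustrate the difference between the two breakpoint tables, the sketch below contrasts the Gaussian breakpoints of Lin et al. [1] (standard-normal quantiles) with CDF-based breakpoints computed as equal-probability quantiles of the observed amplitudes. It assumes NumPy and SciPy; the function names are our own.

```python
import numpy as np
from scipy.stats import norm

def breakpoints_gaussian(alpha):
    """Breakpoints of Lin et al. [1]: standard-normal quantiles that give
    alpha equiprobable amplitude intervals under a Gaussian assumption."""
    return norm.ppf(np.arange(1, alpha) / alpha)

def breakpoints_cdf(data, alpha):
    """CDF-based breakpoints: quantiles of the empirical distribution, so
    each of the alpha amplitude intervals contains the same number of
    observed data points."""
    return np.quantile(data, np.arange(1, alpha) / alpha)
```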

4 The Distance Measure of the SAX Algorithm

The most common distance measure for time-series data is the Euclidean distance. Its weakness is its sensitivity to distortion along the time axis [5], that is, when two time-series sequences have an overall similar shape but are not aligned in time. Given two time series T1 and T2 of the same length n, we conducted dimension reduction using the PAA approach to transform the original T1 and T2 into T1′ and T2′, respectively. Based on Chakrabarti et al. [6], we obtained a lower-bounding Euclidean distance approximation between the original time-series data, given by Eq. (2). This lower-bounding distance measure can be applied to the reduced-dimension time-series representation and guarantees that the distance in the reduced space is less than or equal to the true distance on the raw time-series data (Ding et al. 2008).

$$ D_{LB}(T_{1}^{\prime}, T_{2}^{\prime}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} (t_{1i}^{\prime} - t_{2i}^{\prime})^{2}} $$
(2)
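A direct transcription of Eq. (2) might look like the following minimal sketch, assuming NumPy arrays and function names of our own choosing.

```python
import numpy as np

def lower_bound_distance(t1_paa, t2_paa, n):
    """Eq. (2): lower-bounding Euclidean distance between two PAA
    representations (length w) of raw series of original length n."""
    w = len(t1_paa)
    return np.sqrt(n / w) * np.sqrt(np.sum((t1_paa - t2_paa) ** 2))
```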

We further transformed the data into the symbolic representation, that is, SAX, with a lower-bounding distance measure. In the SAX algorithm, Lin et al. [1] define a \( D_{min\_dist} \) function that calculates the minimum distance between two sequences of symbols, as shown in Eq. (3).

$$ D_{min\_dist}(T_{1}^{\prime\prime}, T_{2}^{\prime\prime}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left( dist(t_{1i}^{\prime\prime}, t_{2i}^{\prime\prime}) \right)^{2}} $$
(3)

The distance between two symbols can be read off from the corresponding row and column of the lookup table [7]. The distance function is given in Eq. (4) below:

$$ dist(R, C) = \begin{cases} 0, & \text{if } \left| R - C \right| \le 1 \\ \beta_{\max(R,C)-1} - \beta_{\min(R,C)}, & \text{otherwise} \end{cases} $$
(4)

where \( \beta_{i} \) is an element of the breakpoint list \( B = (\beta_{1}, \beta_{2}, \ldots, \beta_{W-1}) \) with \( \beta_{i-1} < \beta_{i} \). R denotes the row and C the column of the lookup table, which can be referenced in Wu et al. [7].
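Putting Eqs. (3) and (4) together, a compact sketch might look like the following. It assumes the symbols have already been mapped to 0-based integer indices and that beta holds the breakpoint list; the names are our own illustration.

```python
import numpy as np

def symbol_dist(r, c, beta):
    """Eq. (4), restated for 0-based symbol indices r, c and a 0-indexed
    Python list beta = [beta_1, ..., beta_{W-1}]."""
    if abs(r - c) <= 1:
        return 0.0
    return beta[max(r, c) - 1] - beta[min(r, c)]

def min_dist(s1, s2, beta, n):
    """Eq. (3): lower-bounding distance between two symbol sequences
    (integer indices) of length w, from raw series of length n."""
    w = len(s1)
    return np.sqrt(n / w) * np.sqrt(
        sum(symbol_dist(a, b, beta) ** 2 for a, b in zip(s1, s2))
    )
```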

5 Experimental Design and Results

5.1 Distance Evaluation Metrics: Measuring the Tightness of Lower Bound

Lin et al. [1, 8] proposed empirically determining the best parameter values by simply measuring the TLB, defined as the ratio, in the range [0, 1], of the lower-bounding distance to the true Euclidean distance; the higher the ratio, the tighter the bound. Based on previous studies [3, 8], we adopted the TLB measure to evaluate the performance of our adjusted distance measure for the SAX algorithm. Since we aimed to achieve the tightest possible lower bounds, we simply estimated the lower bounds over all possible parameter values and selected the best settings. The TLB is computed with Eq. (5):

$$ \text{TLB} = \frac{\text{lower bounding distance}}{\text{true Euclidean distance}} $$
(5)

The lower-bounding distance is the distance after symbolization, whereas the true Euclidean distance is the distance between the two raw time series. The TLB always lies between 0 and 1; the higher the TLB, the closer the lower bound is to the true Euclidean distance, indicating a better result.
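In code, Eq. (5) reduces to a single ratio; the sketch below, with our own illustrative names, pairs it with the true Euclidean distance on the raw series and reuses the helpers sketched in the earlier sections.

```python
import numpy as np

def euclidean(t1, t2):
    """True Euclidean distance between two raw series of equal length."""
    return np.sqrt(np.sum((t1 - t2) ** 2))

def tlb(lower_bound_dist, true_dist):
    """Eq. (5): tightness of lower bound, a ratio in [0, 1]; values
    closer to 1 indicate a tighter (better) bound."""
    return lower_bound_dist / true_dist

# Example: TLB of the PAA lower bound for raw series t1, t2 of length n
# tlb(lower_bound_distance(paa(t1, w), paa(t2, w), len(t1)), euclidean(t1, t2))
```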

5.2 Effectiveness of the Modified Breakpoint Table of the SAX Algorithm

First, we examined the degree of distortion between two time series under the SAX algorithm and modified the breakpoint table of the SAX algorithm using the CDF: the entire data set was processed to determine the probability distribution of the amplitudes, represented through its CDF. We set α to 10 (i.e., 10 amplitude intervals) for our electricity consumption data and assigned alphabetical labels, which gave the best symbolization results. Herein, we focused on the electricity consumption data from one of the annealing furnaces to evaluate the effectiveness of the proposed CDF-refined breakpoint table approach (SAX_CDF) against the original SAX algorithm (SAX_Original).

Observation 1:

Table 1 shows a comparison of the TLB between Lin's breakpoint table and the CDF-based breakpoint table for the machine. The results show that the CDF-based SAX approach achieves a 10.38% improvement over the original SAX approach, confirming the effectiveness of our refined CDF-based SAX algorithm.

Table 1. Comparison of TLB between Lin's and the CDF-based breakpoint tables

Observation 2:

There are two optimal parameter settings: one with time window n = 1450 and the other with n = 950. Setting n to 1450 yields the highest TLB value, while n = 950 is the most efficient parameter setting for the machine. The breakpoints for n = 1450 are {−0.6024, −0.5987, −0.5945, −0.5933, −0.45, −0.2688, 0.1377, 1.6334, 1.6965}, and those for n = 950 are {−0.7455, −0.7355, −0.7267, −0.7202, −0.7177, −0.5321, −0.2688, 1.9060, 2.0317}. These values provide a reference for our future research.

6 Conclusion

This research yields the following preliminary findings. First, our experimental results show that the modified CDF-based SAX algorithm achieves higher TLB values than the unmodified algorithm, representing a 10.38% improvement. In the future, we will apply these results to further normal and abnormal electricity-pattern retrieval tasks. We also aim to deploy a visualized electricity consumption system for making real-time decisions in a real-world context.