Abstract
Machine learning approaches play a crucial role in nonlinear system modeling across diverse domains, finding applications in system monitoring, anomaly/fault detection, control, and various other areas. With technological advancements, such systems may now include hundreds or thousands of sensors that generate large amounts of multivariate data streams, which inevitably increases model complexity. In response, feature selection techniques are widely employed to reduce complexity, avoid the curse of dimensionality, decrease training and inference times, and eliminate redundant features. This paper introduces a sensitivity-inspired feature analysis technique for regression tasks. Leveraging the energy distance on the model prediction errors, this approach performs both feature ranking and selection. This paper also introduces an ensemble-based unsupervised fault detection methodology that incorporates homogeneous units, specifically long short-term memory (LSTM) predictors and cumulative sum-based detectors. The predictors utilize a variant of the teacher forcing (TF) algorithm during both the training and inference phases and model the normal behavior of the system, whereas the detectors identify deviations from normality. The detector decisions are aggregated using a majority voting scheme. The validity of the proposed approach is illustrated on two representative datasets, where numerous experiments are performed for feature selection and fault detection evaluation. Experimental assessment reveals promising results, even compared to well-established techniques. Nevertheless, the results also indicate the need for additional experiments with datasets originating from both simulators and real systems. Possible refinements of the detection ensemble include the addition of heterogeneous units and other decision fusion techniques.
1 Introduction
Nowadays, machine learning-based techniques are gaining momentum in applications targeting the modeling of nonlinear systems [1]. Such models are then applied to tasks such as monitoring [2], anomaly detection [3], and control and optimization [4], alongside various other directions [5]. The underlying approach for building machine learning-supported models belongs to the class of black-box or data-driven modeling techniques [6]. To this end, accurate system models are trained by leveraging real-life industrial process measurements. These are complemented by other data-driven methodologies [7], which aim to further enhance the generated model. A popular recurrent machine learning technique is the long short-term memory (LSTM) model. LSTM models have demonstrated their effectiveness across different domains, as evidenced by researchers in a plethora of studies [8,9,10,11]. Due to their gate-based architecture and memory mechanisms, LSTM models present a robust solution to the vanishing and exploding gradient issues encountered in vanilla recurrent neural networks (RNNs) [12].
Nonlinear systems are commonly found not just in industrial processes and engineering contexts [13, 14], but also in a wide array of areas including astronomy [15], physics [16], medicine [17] and even biology [18]. The complexity of these systems, along with their nonlinear relationships, introduces additional challenges to modeling [19, 20], especially when these models are utilized for anomaly detection tasks [21]. The presence of nonlinear patterns may result in deviations that are more difficult to identify with conventional linear models [22]. Consequently, this requires the adoption of specialized algorithms and techniques for effective modeling and detection [21].
With technological advancements, today such systems might include hundreds or thousands of sensors that generate large quantities of multivariate data streams [23]. This adds another layer of difficulty to data-driven modeling approaches, as it heavily increases the complexity of the resulting models. Furthermore, the performance of many models is affected by the high dimensionality of the data; this is known as the curse of dimensionality [24]. Consequently, feature analysis and selection has been a highly researched area for the past two decades [25]. Efficient feature selection methodologies can help reduce the complexity of machine learning models, avoid the curse of dimensionality, decrease training and inference times, and eliminate redundant features that could negatively influence the models. Although feature analysis remains a heavily researched area, with multiple well-established methods [26] that yield outstanding results, there is still no silver-bullet method that performs equally well on datasets from both linear and nonlinear systems [27].
Fault detection can be seen as a specific application of anomaly detection, focused on identifying deviations from the normal behavior of the system caused by sensor failures, component failures, or interventions on the system. The authors in [28] state that the purpose of fault detection is to determine the occurrence of an abnormal event in a process. Additionally, as highlighted in [29], misclassifying normal samples as faults may result in unnecessary operational disruptions and increased labor costs. Conversely, considering faulty samples as normal poses significant risks, potentially exposing the equipment to detrimental operating conditions. This further illustrates the need for efficient, accurate, and timely detections. Moreover, in large, sometimes distributed systems, the effects of certain faults may also propagate among interconnected subsystems or components, becoming observable in the generated time-series data flows.
This article introduces a new method inspired by sensitivity analysis [30], which includes feature ranking and automatic feature selection, for regression tasks. Although this method is proposed and tested on LSTM models with a modified version of teacher forcing (TF) [31], it is also suitable for other models. Following a backward approach, feature ranking is performed by sequentially eliminating inputs and measuring changes in model prediction residuals using the energy distance metric [32]. The automatic feature selection process adopts a forward approach, starting with a predetermined number of features, and incorporating at each step additional features based on their ranks, until a stopping criterion is met. In addition, this paper presents an ensemble-based unsupervised approach for unknown fault detection. The ensemble incorporates multiple homogeneous units consisting of LSTM with TF (LSTMTF) predictors and Cumulative Sum (CUSUM) detectors capable of detecting mean and variance shifts in the prediction residuals. Furthermore, the ensemble employs a majority voting scheme, yielding a binary decision between clean and faulty.
The experimental assessment focuses on two distinct datasets, namely the Tennessee Eastman process (TEP) [33] and a heating, ventilation, and air conditioning (HVAC) boiler plant [34]. The proposed approach is compared with 27 well-established techniques, in both the context of feature selection and fault detection. Our proposed approach yields notable results, with low prediction root mean square error (RMSE) values and up to 100% detection rates with 0% false alerts on all datasets, including hard-to-detect faults. This surpasses both supervised and unsupervised fault detection solutions proposed in the literature.
The main contributions of this work are as follows:
- The design and development of a data-driven modeling and fault detection framework.
- The design and development of a feature sensitivity analysis method that utilizes the energy distance metric on the model prediction residuals.
- The design and development of feature ranking and selection methodologies based on the previous feature sensitivity analysis applied to LSTMTF models.
- The design and development of an ensemble-based solution for fault detection that utilizes LSTMTF models and CUSUM-based detectors together with a majority voting scheme for decision aggregation.
- Extensive experimental evaluation on two nonlinear system datasets with promising results.
- Experimental comparisons with well-established feature selection and fault detection techniques.
The remainder of the paper is structured as follows. Section 2 introduces pertinent background notions together with relevant related studies. Section 3 provides detailed explanations of the proposed feature selection and fault detection techniques. Section 4 presents the experimental assessment, including relevant details about the datasets, model architectures, and feature selection. The results of the fault detection experiments are illustrated in Sect. 5. Finally, the paper concludes in Sect. 6.
2 Related work
2.1 Feature selection
In the popular and highly cited paper An Introduction to Variable and Feature Selection [25], the authors thoroughly describe the approaches needed to develop feature selection solutions. Furthermore, the authors identify the three main objectives of variable selection, namely improving prediction performance, reducing the complexity and cost of predictors, and better understanding the underlying process that generated the data. On a more granular level, the authors highlight other potential benefits of feature selection, including improved data visualization, reduced storage requirements, and reduced training and inference times. In the same paper, the authors define the three main directions utilized in feature selection, specifically wrappers, filters, and embedded methods.
Wrapper-based methods treat the machine learning model as a black box to assign an importance score to the features. Filter methods select the relevant features during the preprocessing step, independently of the prediction model. Embedded methods perform feature selection during the training procedure and are specific to particular types of machine learning models [25].
For wrappers and embedded methods, feature selection is performed utilizing three iterative search approaches: forward, backward, and hill-climbing selection. In the first approach, variables are progressively introduced based on some selection criterion. The second approach starts with the set of all available features and progressively eliminates the least important ones. In the last approach, features are added to or removed from a randomly chosen set of features at each step. The stopping criterion for this approach is based on a predefined number of iterations [35].
Feature selection has been a point of interest in research for over two decades. For example, more than 20 years ago, Kira and Rendell proposed a statistics-based general feature selection methodology named Relief [26], which is still a well-established technique, heavily utilized today. This approach was originally proposed for classification problems; however, an adaptation for regression, Regression Relief (RRelief), was introduced by Robnik-Šikonja and Kononenko in [36]. Being such a heavily researched area, numerous survey papers on feature selection are published every year, in a multitude of domains [37, 38].
Moving forward to recently proposed techniques, Thakkar and Lohiya proposed a wrapper-filter hybrid feature selection methodology for deep neural network-based intrusion detection systems [39]. In this approach, the features are ranked utilizing a fusion of statistical importance measures, namely the standard deviation and the difference between the mean and median values. For feature selection, the authors utilize a forward selection approach, where a new feature is iteratively added based on the ranks. The algorithm utilizes the accuracy of the model as a stopping criterion for selection.
Another recent feature selection technique is the work of Jenul et al. [40]. In this work, the authors introduced RENT, a Repeated Elastic Net Technique for feature selection, which utilizes an ensemble of generalized linear models with elastic net regularization, trained on subsets of the training set. Their approach selects features based on three criteria, evaluating the weight distributions over all the models in the ensemble. As proven by the authors through extensive experimental assessment, this selection algorithm is efficient and applicable to both classification and regression problems.
Random forest [41] is another popular and well-established technique that consists of an ensemble of unpruned classification or regression trees. These trees are created by drawing subsets of training samples with replacement, where the same sample can be selected multiple times. Random forest is also utilized as a wrapper-based technique for feature selection, applicable to both classification and regression [42].
Ding and Peng in [43] originally proposed the maximum relevance-minimum redundancy (mRMR) feature selection method. This algorithm selects features that are highly relevant to the target variable, while simultaneously minimizing redundancy among the subset of selected features. This is done to ensure that the selected features capture the most valuable and nonredundant information. mRMR gained new popularity after 2019, when it was integrated into a marketing machine learning platform at Uber [44].
Among the popular feature selection methods, we find the work of Robert Tibshirani, who proposed the Least Absolute Shrinkage and Selection Operator (LASSO) [45]. LASSO regression is a common statistical and machine learning approach, particularly for large datasets where feature selection is critical for model interpretability and generalization. This technique penalizes the absolute values of the regression coefficients, favoring sparse solutions in which some coefficients shrink to exactly zero. In this way, the method selects the most relevant features while minimizing the influence of less relevant ones, resulting in simpler models. Since its initial proposal, this method has found application in various fields, not only for feature selection, but also in other contexts [46, 47].
Neural network-based solutions have also been applied for feature selection. In a recent paper, Figueroa et al. [48] proposed a feature ranking and selection method that uses deep neural networks. In their work, they introduced a feature selection layer, placed after the input layer, where each feature is multiplied by a trainable weight, without bias. The weights of the feature layer are trained jointly with the rest of the network's weights and represent each feature's importance. The authors also introduce a novel metric that analyzes the feature rankings in terms of performance evolution over several iterations.
Metaheuristic algorithms have also been suggested for feature analysis and selection. Metaheuristic algorithms focus on finding the most viable solution from a multitude of possibilities, assessing the efficacy of candidate solutions through a series of operations and iteratively uncovering a range of improved solutions [49]. Here, we find bio- and nature-inspired algorithms [50] and physics-based algorithms [51].
In recent years, since model explainability has become a topic of interest, popular methods such as local interpretable model-agnostic explanations (LIME) [52] and SHapley Additive exPlanations (SHAP) [53] have gained increasing attention from the scientific community. Although these methods were not originally proposed for feature selection, some authors proposed feature ranking and selection techniques based on them [54, 55]. In contrast, other authors have shown that LIME and SHAP, when utilized as feature selectors, did not show any improvement over other techniques [56].
Compared to existing techniques, this paper introduces a new feature ranking methodology based on sensitivity analysis for LSTM models. Additionally, the proposed method performs automatic feature selection, returning the best subset of features that improve the model, reducing the RMSE.
2.2 Fault detection
Like feature selection, fault detection is a heavily researched area, particularly in the engineering domain. However, its scope extends beyond this field. The three main approaches to fault detection include supervised [57,58,59,60], semi-supervised [61,62,63], and unsupervised methods [7, 13, 64,65,66,67].
In the direction of unsupervised and semi-supervised approaches, a multitude of principal component analysis (PCA) variants have been widely employed in fault detection [13, 66, 68,69,70]. Recently, Yang J. et al. [13] proposed a hybrid fault detection framework together with variable importance analysis for nonlinear process monitoring. Their proposed approach encompasses a multivariate exponentially weighted moving average and kernel principal component analysis (MEKPCA). To fuse information from the different spaces, the authors utilized a Bayesian inference strategy. Their proposed model is validated on two nonlinear systems, including the TEP.
Avinash and Ajaya [71] proposed an unsupervised approach to fault detection utilizing Gaussian process regression (GPR). Their paper also offers an investigation of the effects that GPR parameters, such as the covariance and the mean function, have on the detection performance. The tested covariance functions include the Matérn function and the squared exponential. In terms of mean functions, the authors experimented with zero, sum, constant, and polynomial functions. Their detection approach yielded notable detection results on the TEP dataset. However, as stated by the authors, due to low detection threshold values, their method exhibits high false alarm rates, averaging 20.20%.
Another recent paper by Yang Y. et al. [61] proposed a semi-supervised feature contrast convolutional neural network for fault diagnosis. The fault diagnosis procedures include feature extraction and fault classification. Their proposed approach yields notable results on the TEP, of up to 92.4% accuracy when utilizing only 20% of the labels.
In [7] an adaptive dynamic programming approach is followed to develop a data-driven fault control method for hydraulic servo actuators. This method accounts for unknown system dynamics, uncertain distributions, and unmeasurable system states. Performance is measured using three metrics, including mean squared error (MSE). Experimental results indicate that this solution achieves better tracking performance with fewer tracking errors, smaller overshoot, and faster responses compared to a similar approach without fault compensation.
Tao et al. [63] proposed a semi-supervised planetary gearbox fault diagnosis method that utilizes graph attention networks (GAT) with few labeled samples. Their solution utilizes the fast Fourier transform to process gearbox vibration signals, which are used as graph nodes. Furthermore, the authors introduce a KNN graph construction method that uses pooling for fuzzy distance computation. Their experimental results yield accuracy scores greater than 99% even with very few labels.
Recently, Wang et al. [72] proposed a fault estimation technique in conjunction with a fault-tolerant control methodology for multi-input multi-output systems using the Q-learning algorithm. The proposed algorithm is validated using a robot numerical simulation, revealing improvements in both convergence speed and precision.
In [73], Tan et al. proposed an LSTM-based detection system for nonlinear dynamic systems. The authors construct a system model using LSTMs and monitor changes in the mean, slope, and standard deviation values of the predicted output values. Their solution yields accuracy scores up to 99.5%. However, it is worth noting that the proposed detection system is applied to a univariate, self-generated dataset, and the paper does not provide architectural information about the model, nor is the dataset available.
Ensemble-based detection techniques were also proposed for fault detection. Biao and Zhizhong in [64] developed an outlier detection method based on dynamic ensemble learning. Their proposed approach utilizes One-Class classifiers as base learners in the ensemble. To aggregate the results of the base learners, a decision template methodology is employed, with average detection rates of up to 91% on the TEP dataset.
Moving toward supervised proposed techniques, we find the work of Zhou et al. [57]. This study proposes a hybrid vision-based transformer, which encompasses three components, namely an embedding block, a feature extraction block, and a multi-layer perceptron classification block. The authors extend their proposed approach with six variants of the vision-based transformer to address both fault detection and diagnosis. The authors conducted experiments with two datasets, one of which is the TEP.
Lomov et al. [59] investigated the fault detection efficiency of a wide range of classifier deep learning models. Their study analyzes both recurrent and convolutional architectures, including LSTM, Gated Recurrent Unit (GRU), Transformers, and convolutional neural networks (CNN). Additionally, the authors proposed a Generative Adversarial Network (GAN) based approach for new data generation to extend the training datasets. Their experimental assessment is performed with sequence lengths of 60 and 800, respectively. Utilizing a sequence length of 800 observations, the tested models yield notable results, with detection rates of up to 100% on almost all faulty datasets and by almost all tested models. Conversely, shorter sequence lengths yield lower detection rates, ranging from 47% to 100%.
Compared to other methods, our proposed approach introduces an ensemble-based fault detection framework. Each detector includes an LSTMTF model with automatic feature selection. The detectors' decisions are aggregated using a majority voting scheme. Additionally, by utilizing an ensemble with simpler units positioned at various subcomponents, our proposed approach can detect deviations that might not be evident solely in the system output variables.
3 Proposed methodology
The proposed approach builds on two major concepts, namely feature analysis and fault detection. The first concept comprises two components: feature ranking and feature selection, specifically designed for regression tasks. The feature selection component is classified as a wrapper-based approach that utilizes the machine learning model's prediction errors during the selection process. The second concept, fault detection, incorporates the design of the individual detectors and of the ensemble.
In short, the methodology for feature ranking utilizes a sensitivity analysis-based backward approach, which does not require model retraining. The list of ranked features is employed for feature selection. The selection process follows a forward-based approach in which, based on a stopping criterion, the highest-ranking features are incrementally added one by one, and the model is retrained at each step. The same feature selection procedures are followed for each output variable when the ensemble is constructed. Furthermore, in both directions, the energy distance is utilized to compute a distance score. A general overview of the proposed feature analysis methodology is illustrated in Fig. 1.
It is worth mentioning that during both procedures involved in feature analysis, namely feature ranking and feature selection, the architecture of the models is not modified in any way. Specifically, the number of hidden layers or the number of hidden units remains unchanged throughout both procedures.
The proposed fault detection approach utilizes CUSUM-based detectors that work on the prediction residuals originating from each LSTMTF predictor. Each predictor takes multiple input variables and predicts a single output. Furthermore, each predictor is coupled with a CUSUM-based detector, forming the basic homogeneous units of the detection ensemble. Employing a threshold-based methodology, each detector outputs a binary decision regarding the validity of a new data point. The final decision is given by the ensemble using a majority voting scheme.
3.1 Long short-term memory with teacher forcing
LSTM models, initially proposed by Hochreiter and Schmidhuber in 1997 [74], were designed as a solution to address the vanishing and exploding gradient problem encountered in vanilla versions of RNNs. Over time, LSTM models gained substantial popularity and became increasingly preferred in the field. LSTMs feature memory units that can capture both long and short-term dependencies in time-series data.
A typical LSTM layer comprises blocks containing memory cells and three gates, including input, output, and forget gates. Moreover, each block features two recurrent connections, namely the hidden state and the cell state representing short-term and long-term memory, respectively. The three gates control the flow of information to and from the memory cell. As implied by their names, the forget gate manages discarded information, the input gate regulates information to be stored, and the output gate computes the current output of the cell.
Training models with TF implies feeding the previously observed output value, \(y(t-1)\), as an additional input at each time step. However, during testing or inference, the previous model output \(\hat{y}(t-1)\) is employed as input instead of the observed output value. Utilizing the model's output as input during testing has some drawbacks. A significant issue is that the inputs seen by the model during training may differ significantly from the inputs encountered during inference, resulting in exposure bias. The initial TF methodology assumes that the true output values will not be available after the training phase. However, the LSTMTF model described in this paper, as originally proposed in [75], incorporates the utilization of the previous true output value during both the training and inference stages. Additional information on LSTMs with TF is available in [75, 76], where it was demonstrated that such models can outperform various other existing techniques.
The equations for LSTMTF at time t are presented below. Equation 1 denotes the forget gate, Eq. 2 denotes the input gate, the cell state computation is shown in Eqs. 3 and 4, Eq. 5 shows the output gate computations and Eq. 6 illustrates the hidden state. Additionally, Eq. 7 describes the computation of the neural network’s predicted output at time t, further denoted as \(\hat{y}(t)\).
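In standard notation, with \(x(t)\) collecting the \(m-1\) measured inputs and \(y(t-1)\) entering each gate as the TF input (a sketch consistent with the description above; the per-gate subscripts, the output bias \(b_y\), and the element-wise product \(\odot\) are notational assumptions):

$$f(t) = \sigma \big ({\mathcal {W}}_f\, x(t) + {\mathcal {V}}_f\, y(t-1) + {\mathcal {U}}_f\, h(t-1) + b_f \big ) \qquad (1)$$
$$i(t) = \sigma \big ({\mathcal {W}}_i\, x(t) + {\mathcal {V}}_i\, y(t-1) + {\mathcal {U}}_i\, h(t-1) + b_i \big ) \qquad (2)$$
$$\overline{C}(t) = \tanh \big ({\mathcal {W}}_c\, x(t) + {\mathcal {V}}_c\, y(t-1) + {\mathcal {U}}_c\, h(t-1) + b_c \big ) \qquad (3)$$
$$C(t) = f(t) \odot C(t-1) + i(t) \odot \overline{C}(t) \qquad (4)$$
$$o(t) = \sigma \big ({\mathcal {W}}_o\, x(t) + {\mathcal {V}}_o\, y(t-1) + {\mathcal {U}}_o\, h(t-1) + b_o \big ) \qquad (5)$$
$$h(t) = o(t) \odot \tanh \big (C(t)\big ) \qquad (6)$$
$$\hat{y}(t) = {\mathcal {P}}\, h(t) + b_y \qquad (7)$$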
Here, \({\mathcal {W}}\), \({\mathcal {V}}\), \({\mathcal {U}}\) and \({\mathcal {P}}\) denote the weight matrices, X is the input vector, \(\overline{C}\) denotes the vector of new candidates for the cell state, C is the current cell state and b denotes the bias vectors. Let H represent the number of units in the hidden layer. Also, let m denote the total number of measured signals, where \(m-1\) signals are utilized as inputs and one signal is utilized as output. Including the TF input, the number of inputs is equal to m, and the input vector is of size \(X \in {\mathbb {R}}^{m \times 1}\). The dimensionality of the weight matrices is as follows: \({\mathcal {W}} \in {\mathbb {R}}^{H \times (m-1)}\), \({\mathcal {V}} \in {\mathbb {R}}^{H \times 1}\), \({\mathcal {U}} \in {\mathbb {R}}^{H \times H}\) and \({\mathcal {P}} \in {\mathbb {R}}^{1 \times H}\). As nonlinear activation functions, the sigmoid and hyperbolic tangent functions are utilized, as they appear in Eqs. 1−6.
Let \(\textbf{Y} = [y(0), y(1),\ldots , y(t)]\) denote the vector of observed output values. Likewise, let \(\hat{\textbf{Y}} = [\hat{y}(0), \hat{y}(1),\ldots , \hat{y}(t)]\) denote the vector that contains the LSTMTF's predictions. The prediction error vector is computed as the difference between \(\textbf{Y}\) and \(\hat{\textbf{Y}}\), as shown in Eq. 8:

$$\textbf{e} = \textbf{Y} - \hat{\textbf{Y}} \qquad (8)$$
3.2 Energy distance metric
The energy distance (ED) [32] originates from physics, from the potential energy between objects in a gravitational space. The potential energy is zero if the two objects share the same gravitational center and increases as the distance between them grows. This concept can be applied to data as follows. Let F and G denote the cumulative distribution functions (CDFs) of two vectors \(\mathbf {{e}}\) and \(\hat{\textbf{e}}\), respectively. Also, let \(|| \cdot ||\) denote the Euclidean norm and let \({\mathbb {E}}\) denote the expected value. If we consider \({\textbf{e}}'\) and \(\hat{\textbf{e}}'\) to be independent copies of \({\textbf{e}}\) and \(\hat{\textbf{e}}\), with the cumulative distribution functions F and G, then the continuous form of the ED is defined as:
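Following the standard Székely–Rizzo formulation, the (squared) continuous ED can be written as:

$$D^2(F, G) = 2\,{\mathbb {E}}\,||{\textbf{e}} - \hat{\textbf{e}}|| - {\mathbb {E}}\,||{\textbf{e}} - {\textbf{e}}'|| - {\mathbb {E}}\,||\hat{\textbf{e}} - \hat{\textbf{e}}'|| \qquad (9)$$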
Next, considering \({\textbf{e}}\) and \(\hat{\textbf{e}}\) to have n elements each, the discrete form of the ED between \({\textbf{e}}\) and \(\hat{\textbf{e}}\) is computed using the following formula:
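Following the standard sample estimator of the same quantity:

$$D^2({\textbf{e}}, \hat{\textbf{e}}) = \frac{2}{n^2}\sum _{i=1}^{n}\sum _{j=1}^{n}||e_i - \hat{e}_j|| - \frac{1}{n^2}\sum _{i=1}^{n}\sum _{j=1}^{n}||e_i - e_j|| - \frac{1}{n^2}\sum _{i=1}^{n}\sum _{j=1}^{n}||\hat{e}_i - \hat{e}_j|| \qquad (10)$$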
The ED, as shown in Eq. 10, is particularly useful as it measures the distance between distributions, capturing changes in both the shape and the position of the distributions. The ED is sensitive to differences across the entire distribution, making it suitable for cases where variations beyond the mean or variance need to be measured. In the context of our proposed approach, the discrete form of the ED is utilized to measure the sensitivity of the LSTMTF model to the input features. This is achieved by computing the distance between the prediction error distributions of the trained model with all inputs and of the model with one disabled input at a time.
3.3 Feature ranking
Let \(S_{train}\) and \(S_{fs}\) denote the fault-free training and the larger fault-free feature analysis datasets. Also, let m denote the number of available measured variables. Out of the m measured variables, \(m-1\) variables will be used as inputs and one variable as output. Feature ranking is performed as follows.
- The model is trained on \(S_{train}\), and the prediction error is computed on both the training and feature analysis datasets (\(S_{train}\) and \(S_{fs}\)), with the resulting prediction error vectors \(\mathbf {e_{train}}\) and \(\mathbf {e_{fs}}\).
- A reference score using \(\mathbf {e_{train}}\) and \(\mathbf {e_{fs}}\) is computed using Algorithm 1.
- Each of the \(m-1\) inputs is individually deactivated on the trained model, one at a time, and the prediction error vectors \(\mathbf {e_i}\), where \(i \in \{1, 2,\ldots , m-1\}\), are computed over \(S_{fs}\).
- An energy distance score (EDS) is computed for each \(\mathbf {e_i}\) using the methodology outlined in Algorithm 2, with \(\mathbf {e_{fs}}\) serving as the reference vector.
- The vector of ranks \(\mathbf {v_{r}}\) is sorted in descending order, ensuring that the most important features occupy the top positions, according to Algorithm 3.
Algorithm 1 computes a reference score denoted as \(\psi\), where a sliding window approach is applied to traverse the reference error vector \(\mathbf {e_{ref}}\) with the window size equal to the length of the training error vector \(\mathbf {e_{train}}\). The length of the sliding window is denoted as \(\phi\). At each step, the ED between \(\mathbf {e_{train}}\) and the values within the window is computed and stored in a distance vector \(\mathbf {v_r}\). The final value of \(\psi\) is determined as the maximum from \(\mathbf {v_r}\).
The EDS is computed using Algorithm 2. Here, \(\mathbf {e_{test}}\) is also traversed using a sliding window of size \(\phi\). At each step, the ED between \(\mathbf {e_{train}}\) and the values within the window is computed and stored in a distance vector \(\mathbf {v_t}\). Next, the values from \(\mathbf {v_t}\) are all divided by \(\psi\). This division is applied to obtain the ratio between the distance of the reference error vector distribution and the training error vector distribution. Finally, the EDS is computed as the average value of \(\mathbf {v_t}\).
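For illustration, a minimal NumPy sketch of the two procedures is given below. The function names and the non-overlapping window stride are assumptions; Algorithms 1 and 2 may traverse the windows with a different step.

```python
import numpy as np

def energy_distance(u, v):
    """Discrete (squared) energy distance between two 1-D error samples (Eq. 10)."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    a = np.abs(u[:, None] - v[None, :]).mean()   # E||u - v||
    b = np.abs(u[:, None] - u[None, :]).mean()   # E||u - u'||
    c = np.abs(v[:, None] - v[None, :]).mean()   # E||v - v'||
    return 2.0 * a - b - c                       # for very long vectors, see the dcor package

def reference_score(e_train, e_ref):
    """Algorithm 1 (sketch): maximum ED between e_train and windows of e_ref."""
    phi = len(e_train)                           # window size equals |e_train|
    dists = [energy_distance(e_train, e_ref[s:s + phi])
             for s in range(0, len(e_ref) - phi + 1, phi)]
    return max(dists)                            # psi

def eds(e_train, e_test, psi):
    """Algorithm 2 (sketch): average ED-to-psi ratio over windows of e_test."""
    phi = len(e_train)
    ratios = [energy_distance(e_train, e_test[s:s + phi]) / psi
              for s in range(0, len(e_test) - phi + 1, phi)]
    return float(np.mean(ratios))
```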
To mitigate the impact of variance introduced by the random parameter initialization of the LSTMTF model, the steps of Algorithm 3 are performed iteratively multiple times. The resulting ranking vectors \(\mathbf {v_{r}}\) are averaged to produce a final score. It is important to note that this additional step is optional and is included solely to address model variance. This final sorted vector holds the ranking scores for each input variable and is further utilized for automatic feature selection.
3.4 Feature selection
To establish the stopping criterion for the feature selection process, a threshold, further denoted as \(\delta\), is computed as follows. The selected model, with all available features, is trained and tested multiple times on \(S_{train}\) and \(S_{fs}\), respectively, and the prediction error vectors \(\mathbf {e_{train}}\) and \(\mathbf {e_{fs}}\) are calculated. For each repetition, the EDS value is computed and stored in \(\mathbf {v_{max}}\). The threshold value \(\delta\) is selected as the maximum value within \(\mathbf {v_{max}}\).
The proposed automatic selection process is showcased in Algorithm 4. Selection begins with the top K features chosen from the sorted vector of ranks \(\mathbf {v_{r}}\). These selected variables are then stored in \(\mathbf {v_{sel}}\), which contains the final selected features.
Subsequently, the model is retrained on \(S_{train}\) with the currently selected features from \(\mathbf {v_{sel}}\). The trained model is evaluated on \(S_{fs}\), and from the resulting error vectors, the EDS value is computed. If the resulting score is below \(\delta\), the selection process stops. Otherwise, the next highest-ranking feature is added to \(\mathbf {v_{sel}}\), and the selection process is repeated until the stopping criterion is met. Alternatively, the selection process continues until all available features are added to \(\mathbf {v_{sel}}\).
Additionally, the stopping criterion can be expanded to allow for feature addition as long as the score continues to decrease, without stopping on the first EDS value under \(\delta\). If redundancy is desired for the selected output variable, the features from \(\mathbf {v_{sel}}\) can be distributed into multiple groups based on their ranks. This results in various feature groups for the selected output variable.
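A compact sketch of the selection loop is given below; train_model and predict are hypothetical helpers, eds refers to the sketch in Sect. 3.3, and \(\delta\) is assumed to be precomputed as described above.

```python
def select_features(ranked, K, delta, psi, S_train, S_fs, y_col):
    """Forward selection sketch (Algorithm 4): add ranked features until EDS < delta."""
    selected = list(ranked[:K])                  # start with the top-K ranked features
    remaining = list(ranked[K:])
    while True:
        model = train_model(S_train[selected], S_train[y_col])            # hypothetical helper
        e_train = S_train[y_col].to_numpy() - predict(model, S_train[selected])
        e_fs = S_fs[y_col].to_numpy() - predict(model, S_fs[selected])
        if eds(e_train, e_fs, psi) < delta or not remaining:
            break                                # stopping criterion met or features exhausted
        selected.append(remaining.pop(0))        # add the next highest-ranking feature
    return selected
```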
3.5 Cumulative sum-based detector design
The CUSUM-based detectors, as we originally proposed them in [75], monitor changes in the mean and variance values of LSTMTF prediction errors, utilizing an adapted variant of the 1-CUSUM scheme [77]. As demonstrated by the original authors, this method can detect deviations in both mean and variance values. It operates with a singular two-sided control chart, analyzing individual observations.
Let \(\mu _0\) denote the mean of the prediction errors \({\textbf{e}}\), computed on fault-free data that has not been seen by the LSTMTF predictor. Also, let \(\mathbf {\nu } = {\textbf{e}} - \mu _{0}\). The CUSUM value at time t, denoted as CSM(t), is computed as follows:
In Eq. 11, \(\beta\) denotes the reference parameter value, \(\lambda\) (\(0\le \lambda \le 1\)) represents a weighting factor, and CSM(0) is initialized with 0. The selection methodology for the initial values of these two parameters is presented in Sect. 4.3.
Let d denote the number of detectors. The detection thresholds for each detector, denoted as \(\theta _j\), where \(j \in \{1, 2,\ldots , d\}\), encapsulate two components, \(\theta _j^H\) and \(\theta _j^L\), that are computed as:
Here, \({CSM}_j\) is computed over a fault-free subset. In terminology correlated with cumulative sum control chart methodologies, \(\theta _j^H\) and \(\theta _j^L\) represent the upper and lower control limits, respectively.
In the detection process, each detector continuously computes the updated CSM value utilizing the prediction error supplied by the predictor. Subsequently, each detector produces a binary decision, normal or fault, based on the comparison of the CSM values with \(\theta _j\).
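A minimal wrapper illustrating one detection unit is sketched below; the exact CSM recursion of Eq. 11 is abstracted behind a user-supplied csm_update callable, and only the deviation \(\nu (t) = e(t) - \mu _0\), the initialization CSM(0) = 0, and the control-limit comparison are reproduced.

```python
class CusumDetector:
    """Sketch of one ensemble unit's detector (Sect. 3.5).

    csm_update implements the adapted 1-CUSUM recursion of Eq. 11 (not
    reproduced here); theta_low/theta_high are the control limits of Eq. 12,
    estimated on a fault-free subset.
    """

    def __init__(self, mu0, beta, lam, theta_low, theta_high, csm_update):
        self.mu0, self.beta, self.lam = mu0, beta, lam
        self.theta_low, self.theta_high = theta_low, theta_high
        self.csm_update = csm_update
        self.csm = 0.0                            # CSM(0) = 0

    def step(self, error):
        """Consume one prediction residual and return 0 (clean) or 1 (fault)."""
        nu = error - self.mu0                     # deviation from the fault-free mean
        self.csm = self.csm_update(self.csm, nu, self.beta, self.lam)
        outside = self.csm > self.theta_high or self.csm < self.theta_low
        return 1 if outside else 0                # the detector decision Omega_j
```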
3.6 Ensemble design
Considering a system with multiple outputs, for each selected output the LSTMTF model’s inputs are automatically selected using the previously described methodology. The proposed ensemble is composed of multiple homogeneous basic units, where each unit contains an LSTMTF predictor and a CUSUM-based detector. As previously mentioned, the ensemble outputs a binary decision for each new data point. This decision represents the fusion of the decisions received from the detectors and will be denoted as \(V_E\).
The decisions of the d individual detectors, denoted by \(\Omega _{j} \in \{0,1\}\), signify the two possible outcomes (e.g., 0: clean, 1: fault). \(\Omega _{j}\) will take the value of 0, if detector j outputs a clean decision, and 1 otherwise. The final output of the ensemble at time t is computed utilizing Eq. 13, where a majority voting scheme is employed. Figure 2 visually illustrates the proposed fault detection ensemble, including the LSTMTF predictors, CUSUM-based detectors, and the majority voting scheme.
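Under a standard majority rule over the d detector decisions (a formulation consistent with the description above), the fused output is:

$$V_E(t) = {\left\{ \begin{array}{ll} 1, & \text {if } \sum _{j=1}^{d} \Omega _j(t) > d/2 \\ 0, & \text {otherwise} \end{array}\right. } \qquad (13)$$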
4 Experimental assessment
To measure the performance of the feature selection method, two metrics were selected, namely RMSE and the coefficient of determination (R-squared or \(R^2\)) [78]. The latter is utilized for feature selection comparisons. The next subsections present the two datasets used for the experimental assessment, the final architecture of the models, and the results of the feature selection.
4.1 Datasets description
4.1.1 Tennessee Eastman process
Introduced by Downs and Vogel in 1993 [33], the TEP illustrates an industrial chemical process with the primary goal of facilitating the design and evaluation of various control technologies. The TEP encompasses five major units including the reactor, product condenser, separator, compressor, and product stripper. As originally proposed by Downs and Vogel, this model features 52 variables, with 41 process variables and 11 manipulated variables. Although the TEP serves as an industrial process, it also represents a complex continuous nonlinear system. As reported by Jockenhövel et al. [79] and by Ricker and Lee [80], the TEP is described by 30 differential equations, 149 algebraic equations, 160 algebraic variables, 11 control variables and 26 states.
The publicly accessible TEP dataset [81] is extensively utilized for various applications, including control strategy design, multivariate control analysis, anomaly detection, fault diagnosis, and even cybersecurity analysis [82]. This dataset is composed of both fault-free and fault subsets. Faulty datasets include 20 generated faults on multiple system components. Such faults encompass variations in process variables, manipulated variables, or anomalous behavior in the chemical reactions occurring within the process.
The fault-free training subset consists of 500 simulations, with each simulation encompassing 500 observations, for a total of 250,000 observations. During each simulation, variables were sampled at three-minute intervals. For the training dataset, each simulation ran for 25 h. Conversely, to generate the fault-free testing subset, the simulation ran for 48 h, comprising 500 simulations with 960 observations per simulation, adding up to a total of 480,000 observations. Each subset contains 55 columns: the 52 variables, the simulation number, the sample number, and a final column for additional information.
The 20 faulty subsets consist of 500 simulations each, with each simulation encompassing 960 observations. Each fault occurs after the 160th observation and remains active for the next 800 observations. In the following, F1 to F20 will denote the 20 faulty subsets. In a recent study by Hu et al. [65], the authors discuss the nature of the faults and their effects on product quality control. As identified by the authors, nine faults (i.e., 1, 2, 5–8, 10, 12 and 13) are quality-related, while five faults (i.e., 3, 4, 9, 11 and 15) are quality-unrelated. Among the quality-unrelated faults, as identified in [83,84,85], faults 3, 9 and 15 are difficult to detect due to the lack of observable change in the mean or variance of the data. For this reason, these three faults are not considered in numerous studies. However, as previously stated, this paper considers all available faults, treating them as unknown possible future faults.
For model training, 10,240 observations are selected from the original training subset. For validation during training, 1,000 data points are selected. For fault detection, the entire faulty datasets are utilized, consisting of 480,000 data points for each anomalous subset. The subset \(S_{fs}\), which is utilized for feature analysis, is selected from the fault-free subset. As the proposed ensemble technique follows an unsupervised detection approach, only fault-free data are utilized for predictor training. The faulty datasets are seen by the models only during the detection phase. The subsets used for training, feature selection and CUSUM parameter estimation are not employed in the remaining evaluations, so as not to introduce biased results. Although for this specific dataset the fault-free observations are divided into two separate files, our proposed approach requires only fault-free observations for the training and feature analysis procedures. For a comprehensive list of variables, faults, and additional information, please consult the original TEP paper [33].
Additionally, all the datasets utilized in the experimental assessment are normalized using the feature scaling approach, with all features scaled in the [0,1] range, utilizing the minimum and maximum values from the training dataset [86].
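In practice, this corresponds to fitting the scaler on the training split only and reusing its statistics everywhere else; a short sketch using scikit-learn (variable names are placeholders):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)   # min/max learned from the training data only
X_other_scaled = scaler.transform(X_other)       # same statistics reused for all other subsets
```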
4.1.2 Heating, ventilation, and air conditioning
To further validate the proposed methods, an additional continuous nonlinear system dataset from the heating, ventilation, and air conditioning (HVAC) domain is utilized [34]. Developed by Lawrence Berkeley National Laboratory, Pacific Northwest National Laboratory, and the United States Department of Energy, this dataset encompasses observations from a boiler plant system that provides hot water to a large office building. The office building consists of 12 floors, where each floor is served by a single Air Handling Unit (AHU). Furthermore, each AHU serves five thermal zones that have dedicated thermal units that use the hot water produced by the boiler plant. Sensors and valves are also utilized to control water flow. This dataset contains 17 subsets, of which one is fault-free and 16 faulty variants. Each subset contains 23 variables sampled once per minute for 365 days (1 year), totaling 525,500 observations each. Faults include the addition of bias to the output of the hot water sensors, the multiplication of the intensity value of the heat transfer coefficient, and the modification of the controller gain values. Additional information on the dataset, the faulty scenarios, and the measured variables is available in [34].
From the fault-free dataset containing 525,500 observations, 30,000 observations are utilized for model training, 100,000 for feature selection, and 60,000 for CUSUM parameter analysis and threshold estimation. The remaining observations are utilized for false alert evaluations. For fault detection, all the observations from the 16 faulty subsets are utilized. As was the case for the previous scenario, only fault-free observations are utilized for model training, while the faulty datasets are employed only during the detection phase. From the available year's worth of measurements, only 20 days of observations are used for model training. Out of the available 23 variables, 4 variables were removed: 2 containing constant values and 2 marked with NaN values. Each subset is normalized in the [0,1] range, utilizing the minimum and maximum values from the training set.
4.2 Parameters and hyperparameters
To approximate the CUSUM parameters, namely the reference value \(\beta\) and the weighting factor \(\lambda\), the following experiment was performed. From the fault-free dataset, 20 simulations' worth of data points were selected. On this subset, utilizing one predictor and one detector, the FPR was computed for multiple values of \(\beta\) and \(\lambda\), both in the [0, 1] range, with a 0.05-step increase. The final parameter values were then selected visually from the area of the constructed heat map with the lowest FPR. As one of the primary objectives of this paper is to establish detection methodologies for unforeseen failures, the selection process deliberately excluded any anomalous data. Figure 3 illustrates the constructed heat map with the FPR areas for both parameters for the TEP dataset. The final selected values are as follows: \(\beta = 1\) and \(\lambda = 0.1\) for the TEP dataset, and \(\beta = 0.5\) and \(\lambda = 0.5\) for the HVAC system dataset.
Moving forward to the LSTMTF models' hyperparameters, in the experiments, the models are constructed with a single hidden layer containing 16 hidden units, along with an output layer consisting of 1 neuron. The training process entails a maximum of 300 epochs, with an initial learning rate of 0.005, a batch size of 32, and a training sequence length of 40. The models are trained using the Adam optimizer, with a learning rate decay of 4% every 20 epochs. Additionally, the models are designed to be stateful and operate in a many-to-many mode. The best values for the hyperparameters were obtained utilizing an exhaustive performance analysis of multiple models, including LSTM with and without TF. Additional information can be found in [76]. For the feature selection procedures, the size of the analysis dataset \(S_{fs}\) was equal to 50,000 data points for the TEP and 100,000 data points for the HVAC system.
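For reference, such a configuration can be approximated in Keras as follows. This is a sketch rather than the authors' implementation; it assumes the TF signal is concatenated with the measured inputs so that the model receives m features per time step (m = 52 for the full-feature TEP predictors).

```python
import tensorflow as tf

SEQ_LEN, BATCH, N_FEATURES, HIDDEN = 40, 32, 52, 16   # m inputs, incl. the TF signal

model = tf.keras.Sequential([
    tf.keras.layers.Input(batch_shape=(BATCH, SEQ_LEN, N_FEATURES)),
    tf.keras.layers.LSTM(HIDDEN, stateful=True, return_sequences=True),  # many-to-many
    tf.keras.layers.Dense(1),                                            # single output signal
])

def lr_schedule(epoch, lr):
    """Decay the initial learning rate of 0.005 by 4% every 20 epochs."""
    return 0.005 * (0.96 ** (epoch // 20))

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005), loss="mse")
# model.fit(x_train, y_train, epochs=300, batch_size=BATCH, shuffle=False,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```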
4.3 Model architectures
4.3.1 Tennessee Eastman process dataset
For the TEP dataset, the outputs of the predictors were selected as follows: five variables denoting the system outputs (Predictors 7–11) and one additional predictor for each of the major system components (Predictors 1–6), as described in Sect. 4. Table 1 illustrates the automatically selected features for each output variable. As observed, the number of selected features ranges from 5 to 50, depending on the output variable. Since there are 11 predictors, the number of CUSUM detectors d was also set to 11.
Moving forward, Table 2 illustrates the performance of two models, with and without feature selection, on both the training dataset and the fault-free testing set. In this direction, the RMSE metric was utilized. Furthermore, the same table also provides the number of features. As observed, with a reduced number of features, the models maintained the RMSE over the unseen testing dataset. Moreover, both feature-selected models performed better than their counterparts that utilized all the available features. For example, the first predictor with feature selection yields a decrease of \(0.17 \%\) in RMSE compared to the same model with all features. A decrease in the RMSE value is also observed for the second predictor, in this case with a \(10.94 \%\) difference. In terms of the number of reduced features, for the first predictor shown in the table, the features are reduced from 52 to 22, specifically a \(57.69\%\) reduction. For the second predictor, the number of features was reduced from 52 to 5, translating to a \(90.38\%\) reduction in the number of features, while performing equally well or better than the model with all the available features.
The prediction performances of three models are visually illustrated in Fig. 4. Here, the observed and predicted values are shown on two simulations that consist of 500 points from the fault-free subset not seen by the models during training. The selected outputs with their corresponding selected input variables are presented in Table 1.
To further illustrate the results of the proposed feature selection technique, additional measurements were performed. For all 11 predictors, the training and inference time was measured over a subset of 50,000 data points. The results of these measurements are shown in Table 3. The same table also shows the minimum, maximum, and average number of trainable parameters (e.g., model weights) computed over the 11 LSTMTF predictors.
The rationale behind computing the number of trainable parameters lies in its direct correlation with the computational complexity of these models. According to the original LSTM paper [74], the computational complexity per time step, per weight, is \({\mathcal {O}}(1)\). Therefore, the overall complexity per time step is \({\mathcal {O}}(w)\), where w is the number of weights. When we extend the complexity with the number of observations N and the number of epochs \(\kappa\), the complexity becomes \({\mathcal {O}}(w \cdot N \cdot \kappa )\). Therefore, it becomes clear that the number of weights directly influences the computational complexity.
For models with all 52 features, the number of parameters was 4433, while after feature selection, the average number of parameters was reduced to 2699, translating to an average decrease of \(39.11\%\). This reduction in complexity is subsequently reflected in training and inference times, where the average times were reduced for training and testing by \(8.91\%\) and \(9.70\%\), respectively. This reduction, in both training and inference times, is also visible when analyzing the minimum and maximum values of the same table.
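This figure is consistent with the standard LSTM parameter count, assuming one bias vector per gate and a single output neuron, with \(H = 16\) hidden units and \(m = 52\) inputs (51 measured signals plus the TF input):

$$w = 4\big (H\,m + H^2 + H\big ) + (H + 1) = 4\,(832 + 256 + 16) + 17 = 4433.$$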
4.3.2 Heating, ventilation, and air conditioning dataset
For the HVAC boiler dataset, four predictor output variables were selected as follows. Two variables on the boiler temperature sensors, one variable from the hot water loop supply, and a variable denoting the hot water loop pump speed ratio. The output variables with the selected input features are shown in Table 4. Depending on the output variable, the number of selected features ranges from 5 to 12, from the 19 available variables. The selected features for each predictor are shown in descending order according to the computed importance ranks.
Similar to the TEP dataset, Fig. 5 shows the prediction performance of an LSTMTF predictor on the HVAC boiler dataset. The same figure also shows the dynamics of the observed output variable over the 525,000 time steps, while also highlighting the portion of fault-free data used for model training. In contrast to the TEP dataset, where the signals appear to be stationary, here the signals exhibit both trend and seasonality. This is expected, as the HVAC dataset depicts a boiler system utilized for heating an office building, and the data are generated for an entire year, with lower and constant values over the summer period (see the middle region of the top subfigure).
5 Fault detection results
5.1 Tennessee Eastman process dataset
Table 5 illustrates the TPR, FPR, and DD of two detectors on the prediction residuals given by two models (Predictor 1 and Predictor 2), with and without feature selection. As seen here, the detection performance is not heavily affected by feature selection. Moreover, in the case of the second detector, after feature selection, the TPR increases by \(2.51\%\), the FPR is reduced by \(0.01\%\) and the detection delay is decreased on average by 0.05. In the case of the first detector, a minor decrease in the TPR is observed after feature selection, namely \(0.24\%\). For the case where the first predictor utilized all the available features, the detector obtained better detection rates on F2 - F4, F6, F8 - F9, F11 - F12, F14 - F17 and F20. In the case of the second predictor, when all the features were utilized, the best TPR was obtained on F2, F15, and F20. However, here, in all 20 faulty cases, the detector yields a \(0.01\%\) FPR compared to \(0\%\) after feature selection. In the case of the second predictor, the detection delay is reduced for F6 and F13 from 0.98 and 0.8 to 0.2, respectively. Overall, on the 20 faulty sets, the results are equal or better after feature selection. Moreover, Fig. 6 depicts the distributions of prediction residuals computed on both the fault-free TEP subset and nine faulty subsets for Predictor 1.
The detection results of the proposed ensemble, on all faulty datasets, are illustrated in Table 6. As seen here, the ensemble yields \(0\%\) FPR on all data sets, without any detection delays, indicating that the ensemble generates an alert on the sample immediately after the first appearance of the fault. While on some faults, the TPR rate decreased from \(100\%\) to \(99.99\%\), there are no large variations in any of the metrics. The detection ensemble yields a \(99.99\%\) detection rate on F1, F2, F6, F8, F12, F13, and F20, while on the remaining faults, it yields \(100\%\) detection rate. This further illustrates the advantages of employing ensemble-based detection solutions. For example, comparing the average results from Tables 5 and 6, it can be observed that the false positive rates were reduced by \(0.03\%\) and the detection rates were increased by a maximum of \(3.76\%\) when switching to an ensemble approach.
It is worth mentioning that even on the harder-to-detect faulty sets, namely F3, F9, and F15, our proposed ensemble yields \(100\%\) detection rates with \(0\%\) false alerts. Recall that these three faults were identified in the literature as quality-unrelated faults [65], and many authors either opted to not test their models on these faults, or obtained smaller detection rates. This will be further illustrated in the following subsection, where our proposed ensemble is compared to well-established and state-of-the-art detection approaches applied to this particular system.
5.2 Heating, ventilation, and air conditioning dataset
Table 7 illustrates the detection results in terms of FPR, TPR, and DD for the proposed ensemble on the HVAC boiler system dataset. In the same table, the metrics are computed for a single detector in two configurations, namely with a reduced set of features and with all available features. The TPR is computed as the mean value over the 15 faulty datasets.
As shown in the table, the detector that utilized all features yielded a 0.58% FPR and a TPR of 97.60%, while the detector with the reduced set of features shows an increase of 1.61% in TPR, reaching 98.61%. The best detection results were obtained by the ensemble, with a 99.99% detection rate and 0% false alerts. In terms of DD, all approaches obtained a value of one. In contrast to the TEP dataset, here each of the 15 faulty sets starts with faulty observations, and the detection CUSUM is initialized each time with 0. On the TEP dataset there was no need to initialize the CUSUM, as the faulty observations were preceded by fault-free observations in all cases. In each experiment, the initial value of the CUSUM was also counted toward the false negative decisions and the detection delay. Additionally, the distribution of the prediction errors for the HVAC fault-free and six faulty subsets is illustrated in Fig. 7.
5.3 Comparison with baselines
This subsection presents the comparison results between our proposed approach and other well-established techniques, on both feature selection and fault detection. For feature selection, the candidates for comparison are as follows: Fusion DNN [39], Repeated Elastic Net Technique (RENT) [40], RRelief [36], LASSO [45] and Random Forest Ensemble (RF) [42].
The parameters for the previous techniques were set as follows. For Fusion DNN, the authors proposed using the accuracy measure as the stopping criterion. To adapt this method to regression-based selection, the mean squared error was employed. For the RENT technique, 100 models were utilized with regularization parameters from the list [0.1, 1, 10] and l1 ratios from the list [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1]. The default test size range of (0.2, 0.6) was used. For RRelief, the number of iterations was set to 100. In the case of the LASSO technique, the alpha parameter was set to 1. For RF, the number of estimators was set to 200. For RRelief, LASSO, and RF, the number of selected features was set to be equal to our approach, specifically 12 features on the TEP dataset and 9 features on the HVAC dataset.
On the TEP dataset, the first comparison, in terms of prediction performance and fault detection, for one predictor and detector, is shown in Table 8. For this experiment, Predictor 8 was chosen (see Table 1), and the input variables were selected utilizing other well-established methods. The RMSE and \(R^2\) were computed on the fault-free subset, while the TPR, FPR, and DD were computed on the faulty subsets.
As observed in Table 8, in terms of prediction RMSE, the model utilizing the LASSO-selected features yielded the highest value of 0.05486. Conversely, the best RMSE values were obtained by our approach and RF, with values of 0.04848 and 0.04889, respectively. Fusion DNN, RENT, and RRelief obtained similar RMSE values, with minor differences between them. Compared to the model that utilized all the features, all the tested techniques showed an improvement in terms of RMSE. Similarly, in terms of \(R^2\) scores, the LASSO technique yielded the lowest score of 0.83, while our proposed approach and the remaining four techniques obtained scores of 0.87 and 0.86, respectively, both higher than that of the model with all features.
In terms of detection performance, as illustrated in Table 8, the model with all available features performed the worst, with a 96.35% TPR and the highest average detection delay of 0.15. The best values were obtained by our approach, with a 97.78% TPR, 0% FPR, and a 0.07 detection delay. It is closely followed by the RENT technique, with a 0.1% difference in TPR and a 0.4 difference in detection delay. Among the feature selection methods, the lowest detection results were obtained by the model that utilized the features selected by the RRelief technique, with a 97.04% TPR. Except for the model with all features, all the tested methods yielded TPR values above 97% and average detection delays below 1. Two of the tested techniques resulted in a low FPR of 0.01%.
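For reference, the sketch below shows one way the detection metrics reported here can be computed from binary fault labels and detector alarms. The function name and the detection delay convention (offset of the first alarm from the first faulty sample) are assumptions made for illustration; the exact counting used in our experiments may differ slightly.

```python
import numpy as np

def detection_metrics(y_true: np.ndarray, alarms: np.ndarray):
    """Compute FPR, TPR and detection delay (DD) from labels and alarms.

    y_true: 1 for faulty samples, 0 for fault-free samples.
    alarms: 1 where the detector raised an alarm, 0 otherwise.
    """
    fpr = np.sum((alarms == 1) & (y_true == 0)) / max(np.sum(y_true == 0), 1)
    tpr = np.sum((alarms == 1) & (y_true == 1)) / max(np.sum(y_true == 1), 1)

    faulty = np.flatnonzero(y_true == 1)                 # faulty sample indices
    hits = np.flatnonzero((alarms == 1) & (y_true == 1)) # detected faulty samples
    dd = int(hits[0] - faulty[0]) if hits.size and faulty.size else None
    return fpr, tpr, dd
```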
Moving on to the comparisons on the HVAC boiler system dataset, Table 9 depicts the modeling and detection performance. In terms of prediction error, the best results were obtained by the model that utilized the features selected by our approach, with an RMSE of 0.00703. The second-best RMSE was obtained by the model utilizing the features selected by Random Forest, with 0.00789, while the highest RMSE of 0.00926 was obtained with the features selected by the RENT technique. The lowest \(R^2\) value of 0.84 was obtained by the RENT technique, followed by the model with all the features, with a value of 0.86. The highest \(R^2\) values were obtained by our approach, followed by Random Forest, Fusion, and RRelief. Apart from the RENT technique, all other tested approaches yielded better RMSE and \(R^2\) values than the model that utilized all the available features.
As shown in Table 9, in terms of TPR, the best results were obtained using the features selected by our approach, followed by the model with all features. The lowest TPR was obtained by the model that utilized the features selected by the Fusion technique, with 56.60%. The third-best results were obtained by Random Forest, with a 95.20% detection rate. RENT, RRelief, and LASSO obtained similar detection results, all below 80%. The highest FPR values were obtained with the features selected by Random Forest, LASSO, and RENT, averaging 5.6%. In contrast, the lowest false alert rates were obtained by the Fusion technique and our approach, both with 0% FPR. The measured DD was equal for all the tested methods and amounted to 1 sample.
Although these results validate the efficiency of the proposed method, they also highlight the performance of both the LSTMTF predictor and the CUSUM-based detector. Additionally, while other methods correctly identified the features that contribute to the best prediction performance, our proposed approach, which accounts for the sensitivity of each selected feature, exhibited a higher TPR. Recognizing the importance of not only the previously observed output value in modeling and prediction, but also the influence of other features, is crucial, as faults may propagate and affect multiple features. Conversely, one of the drawbacks of our approach, compared to the other methods, is its limitation to time-series regression tasks, whereas the other techniques can be applied to both regression and classification. Additionally, the proposed approach requires fault-free observations for all the procedures, including LSTMTF training, feature ranking and selection, and CUSUM parameter estimation.
Moving forward, Tables 10 and 11 present the comparison results between our proposed ensemble approach and 22 fault detection solutions proposed for the TEP dataset. During the selection process, both recent and popular works were considered. The results displayed in the two tables were extracted directly from the original papers, as indicated in the column headers. The first table details recent and popular unsupervised, as well as semi-supervised, detection techniques, including several PCA variants [13, 66, 68], three ensemble-based approaches [64], and semi-supervised CNN-based solutions [61]. In the second table, our approach is compared with supervised approaches, where each proposed solution was trained with both fault-free and faulty observations. In short, most of the supervised approaches treat fault detection as a classification problem. These approaches include popular machine learning techniques such as LSTMs, GRUs, Transformers [59], multi-layer perceptrons [85], convolutional neural networks [87], and nonlinear SVMs [88].
As observed in Table 10, our proposed approach outperforms all the others in terms of FPR, TPR, and detection delay. As mentioned above, faults 3, 9, and 15 were identified as difficult to detect due to the absence of changes in the mean and variance of the measured signals [83]. Some authors specifically omitted evaluating their approaches on these three faults. In the same table, it can be observed that the three ensemble-based approaches [64] outperform the single-model solutions, especially on the F3, F9, and F15 datasets. However, these ensembles exhibit a higher FPR than the other methods. Compared to the other methods, our proposed approach reduces the FPR by up to 10.85% and increases the detection rate by up to 43.84%. Similarly, in terms of detection delay, our method achieves instant detection on all faults.
The results of the supervised approaches, shown in Table 11, clearly indicate an increase in the average detection performance compared to the unsupervised methods. This behavior is expected, as these models are trained with both fault-free and faulty measured values. Even in comparison with the supervised methods, our ensemble-based technique yields better results in terms of FPR, TPR, and DD. A very close TPR is obtained by the GRU classifier utilized in [59], with a 0.33% difference. In terms of FPR, our approach achieves a reduction of up to 10.72% compared to the other methods. While on specific faults other methods obtained a 100% TPR compared to our 99.99%, our method remains consistent across all faults, both in terms of TPR and detection delay.
As was the case with the unsupervised methods, some authors opted to exclude the detection of F3, F9, and F15. On these three faults, the worst performances are obtained by the nonlinear SVM, with a 0.40% detection rate on F3, by the Fuzzy Bayesian approach, with 36.00% on F9, and by the nonlinear SVM again, with 1.10% on F15. In contrast, apart from ours, the GRU-based approach yields the best results on these three faults.
5.4 Discussions
Although a high TPR indicates that the vast majority of faulty data points are detected, the DD reflects the speed of detection. In realistic scenarios, it is desirable to identify faults in their incipient phases to reduce the eventual costs that result from running systems in faulty states for prolonged periods. In addition, such faults can propagate to other subsystems, further increasing maintenance, repair, and replacement costs, or may affect the quality of the final product. Consider also critical infrastructure systems, where a higher DD can have, and sometimes has had, severe effects on the environment and human life. Lower false alert rates also help reduce the costs and energy invested in unwanted investigations or component replacements. Furthermore, repeated false alerts decrease the user's confidence in the detection and monitoring system over time, which can also have severe effects, as probable true events might be ignored or dismissed as false alerts.
Our proposed approach utilizes machine learning models, which can be expensive in terms of computational requirements. However, model training and feature selection can be performed offline, while the resulting parameters can be transferred to monitoring devices for real-time use. In this direction, we measured and demonstrated that LSTMTF models can be utilized on resource-constrained embedded devices, such as a Raspberry Pi. In [75], the CPU and RAM usage was measured in multiple scenarios, from running a single instance up to 20 models on the same device. The experimental results revealed that CPU usage averaged 25% with a single LSTMTF instance and increased to an average of 30% when running more than one instance; running more than two instances did not increase CPU usage significantly beyond 30%. The RAM usage was 5.5% (424 MB) and was partly attributed to loading the entire dataset into memory.
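As a rough illustration of how such resource figures can be collected on an embedded device, the snippet below samples CPU and RAM utilization with the psutil package while repeatedly invoking a prediction step. Here, run_inference_step is a hypothetical stand-in for one LSTMTF prediction, and the measurement procedure of [75] is not reproduced exactly.

```python
import time
import psutil

def monitor_resources(run_inference_step, n_steps=1000, sample_every=1.0):
    """Sample CPU and RAM utilization while repeatedly running inference.

    run_inference_step: callable performing one prediction step (hypothetical).
    Returns the average CPU and RAM utilization (in percent) over the run.
    """
    cpu, ram = [], []
    next_sample = time.time()
    for _ in range(n_steps):
        run_inference_step()
        if time.time() >= next_sample:
            cpu.append(psutil.cpu_percent(interval=None))   # system-wide CPU %
            ram.append(psutil.virtual_memory().percent)     # system-wide RAM %
            next_sample = time.time() + sample_every
    return sum(cpu) / len(cpu), sum(ram) / len(ram)

# Hypothetical usage:
# avg_cpu, avg_ram = monitor_resources(lambda: model.predict(x_window))
```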
The monitored output variables were selected for each major component of the system, as described in Sect. 4. Although this approach yielded 99.99–100% detection rates with 0% false positive rates, the proposed ensemble can be further extended to include additional predictors and detectors built on other measured signals. Furthermore, redundant groups of inputs can also be selected for each output variable, as described in Sect. 3. Moreover, if a reduction in complexity is necessary, the number of detectors can also be decreased. Since there is no backward dependency between the steps undertaken in the feature ranking and selection procedures, these methods can be further optimized during implementation by employing parallelization techniques, as sketched below.
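Because the per-feature computations in the ranking step are mutually independent, they can be distributed over worker processes. The sketch below illustrates this idea by ranking candidate features according to the energy distance between a baseline set of prediction errors and the errors associated with each feature configuration, using scipy.stats.energy_distance and a process pool. The input structure (errors_per_feature) and the interpretation of larger distances as stronger influence are assumptions made for illustration and do not reproduce the exact procedure of Sect. 3.

```python
from concurrent.futures import ProcessPoolExecutor
from scipy.stats import energy_distance

def rank_features(baseline_errors, errors_per_feature, n_workers=4):
    """Rank candidate features in parallel by the energy distance between the
    baseline prediction errors and the errors obtained for each feature
    configuration (errors_per_feature: dict of feature name -> 1-D errors).
    Larger distances are taken here to indicate a stronger influence.
    """
    names = list(errors_per_feature)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        dists = list(pool.map(energy_distance,
                              [baseline_errors] * len(names),
                              [errors_per_feature[n] for n in names]))
    return sorted(zip(names, dists), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage:
# ranking = rank_features(e_base, {"f1": e_f1, "f2": e_f2, "f3": e_f3})
```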
Nowadays, Transformers [92], Transformer-based approaches, and pre-trained models are considered state of the art among neural network-based solutions. Although these models dominate the fields of natural language processing, speech analysis [93, 94], and text classification [95], multiple studies have revealed that simpler models such as LSTMs and GRUs can obtain superior performance in other fields [59, 96, 97]. In [98], the authors introduce simple one-layer linear models and demonstrate on nine real-life datasets that these linear models surprisingly outperform sophisticated Transformer-based models, as stated by the authors, often by a large margin. Moreover, in the direction of fault detection, in [59], a study used above for performance comparison, the authors tested various supervised attention-based architectures and Transformers on the TEP dataset. As the authors highlight, the Transformer architecture underperformed compared to LSTMs in terms of TPR, FPR, and DD in most of the experiments. Furthermore, compared to the supervised deep learning methods from [59], our approach, which utilized models trained only with fault-free data and incorporating a single hidden layer, obtained overall superior results.
6 Conclusions and future work
This paper presented a systematic approach to designing a fault detection framework for nonlinear industrial processes. The framework incorporates feature analysis, predictor and detector design, and decision aggregation. These components are linked in an ensemble-based solution. Each detector includes an LSTMTF predictor and a cumulative sum control chart. To construct individual detectors, a novel feature analysis methodology was introduced. In this direction, the energy distance was utilized in the development of feature ranking, together with forward-based feature selection methodologies. The effectiveness of our contribution was showcased in the context of fault detection on nonlinear systems.
Experimental assessment conducted on two publicly available datasets yielded promising results in both the feature selection and fault detection directions. In terms of feature selection, the proposed approach obtained results comparable to those of well-established techniques, with low RMSE values and \(R^2\) values of up to 0.91. Furthermore, the ensemble detection technique proved its efficiency by detecting every unknown fault with up to a 100% detection rate and 0% FPR. This includes challenging faults, such as TEP faults 3, 9, and 15, even when compared to 22 other works covering supervised, semi-supervised, and unsupervised methods.
Despite the demonstrated efficiency of our proposed approach on both the TEP and HVAC datasets, further extensive experimental evaluation is necessary to confirm its applicability to various other nonlinear systems. Furthermore, this methodology is limited to time-series regression, unlike other methods, such as those used for comparison, which apply to both classification and regression tasks. Additionally, the proposed methodologies require fault-free observations for model training and feature analysis; these observations need to be collected during normal operating conditions and must cover all the normal operating states of the system. Future work includes an extensive evaluation of the proposed approach on other datasets, derived from both simulators and real nonlinear systems, while further refining the proposed ensemble by incorporating heterogeneous units.
Data availability
Both datasets used in this study are publicly available for download. The Tennessee Eastman process dataset [81] is available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6C3JR1 (accessed 16 December 2023), with additional relevant information in [33]. The heating, ventilation, and air conditioning boiler plant dataset [34] is available at https://faultdetection.lbl.gov/data/ (accessed 26 February 2024).
References
Zhao Y, Jiang C, Vega MA, Todd MD, Hu Z (2023) Surrogate modeling of nonlinear dynamic systems: a comparative study. J Comput Inf Sci Eng 23(1):011001
Lahdhiri H, Said M, Abdellafou KB, Taouali O, Harkat MF (2019) Supervised process monitoring and fault diagnosis based on machine learning methods. Int J Adv Manuf Technol 102:2321–2337
Kim B, Alawami MA, Kim E, Oh S, Park J, Kim H (2023) A comparative study of time series anomaly detection models for industrial control systems. Sensors 23(3):1310
Yeganeh A, Chukhrova N, Johannssen A, Fotuhi H (2023) A network surveillance approach using machine learning based control charts. Expert Syst Appl 219:119660
Chen Z, Xiao F, Guo F, Yan J (2023) Interpretable machine learning for building energy management: a state-of-the-art review. Adv Appl Energy 9:100123
Chattopadhyay A, Pathak J, Nabizadeh E, Bhimji W, Hassanzadeh P (2023) Long-term stability and generalization of observationally-constrained stochastic data-driven models for geophysical turbulence. Environ Data Sci 2:1
Stojanović V (2023) Fault-tolerant control of a hydraulic servo actuator via adaptive dynamic programming. Math Model Control 3:181–191
Xin J, Zhou C, Jiang Y, Tang Q, Yang X, Zhou J (2023) A signal recovery method for bridge monitoring system using tvfemd and encoder-decoder aided lstm. Measurement 214:112797
Zhang H, Wang L, Shi W (2023) Seismic control of adaptive variable stiffness intelligent structures using fuzzy control strategy combined with lstm. J Build Eng 78:107549
Cao Y, Liu G, Luo D, Bavirisetti DP, Xiao G (2023) Multi-timescale photovoltaic power forecasting using an improved stacking ensemble algorithm based lstm-informer model. Energy 283:128669
Bhandari HN, Rimal B, Pokhrel NR, Rimal R, Dahal KR, Khatri RK (2022) Predicting stock market index using lstm. Mach Learning Appl 9:100320
Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press, Cambridge, Massachusetts. http://www.deeplearningbook.org
Yang J, Wang J, Ye Q, Xiong Z, Zhang F, Liu H (2023) A novel fault detection framework integrated with variable importance analysis for quality-related nonlinear process monitoring. Control Eng Pract 141:105733
Bokor J, Szabó Z (2009) Fault detection and isolation in nonlinear systems. Annu Rev Control 33(2):113–123
Elgohary TAA (2015) Novel Computational and Analytic Techniques for Nonlinear Systems Applied to Structural and Celestial Mechanics. Texas A&M University, Texas, United States
Cheng C-D, Tian B, Ma Y-X, Zhou T-Y, Shen Y (2022) Pfaffian, breather, and hybrid solutions for a (2+ 1)-dimensional generalized nonlinear system in fluid mechanics and plasma physics. Phys Fluids. https://doi.org/10.1063/5.0119516
Higgins JP (2002) Nonlinear systems in medicine. Yale J Biol Med 75(5–6):247
Villaverde AF et al (2019) Observability and structural identifiability of nonlinear biological systems. Complexity 2019:8497093
Pearson RK (1995) Nonlinear input/output modelling. J Process Control 5(4):197–211
Zimmermann H-G, Tietz C, Grothmann R (2012) Forecasting with recurrent neural networks: 12 tricks. In: Neural Networks: Tricks of the Trade, pp 687–707
Zhang L, Lin J, Karim R (2018) Adaptive kernel density-based anomaly detection for nonlinear systems. Knowl-Based Syst 139:50–63
Lazar M, Pastravanu O (2002) A neural predictive controller for non-linear systems. Math Comput Simul 60(3–5):315–324
Pilario KE, Shafiee M, Cao Y, Lao L, Yang S-H (2019) A review of kernel methods for feature extraction in nonlinear process monitoring. Processes 8(1):24
Bellman R, Kalaba R (1959) On adaptive control processes. IRE Trans Autom Control 4(2):1–9
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learning Res 3:1157–1182
Kira K, Rendell LA (1992) A practical approach to feature selection. In: Machine Learning Proceedings 1992, pp 249–256. Elsevier
Jović A, Brkić K, Bogunović N (2015) A review of feature selection methods with applications. In: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp 1200–1205. IEEE
Yoon S, MacGregor JF (2001) Fault diagnosis with multivariate statistical models part i: using steady state fault signatures. J Process Control 11(4):387–400
Amini N, Zhu Q (2022) Fault detection and diagnosis with a novel source-aware autoencoder and deep residual neural network. Neurocomputing 488:618–633
Yeung DS, Cloete I, Shi D, Ng W (2010) Sensitivity Analysis for Neural Networks. Springer, Berlin, Heidelberg
Williams RJ, Zipser D (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Comput 1(2):270–280
Rizzo ML, Székely GJ (2016) Energy distance. WIREs Comput Stat 8(1):27–38. https://doi.org/10.1002/wics.1375
Downs JJ, Vogel EF (1993) A plant-wide industrial process control problem. Comput Chem Eng 17(3):245–255
Granderson J, Lin G, Chen Y, Casillas A, Im P, Jung S, Benne K, Ling J, Gorthala R, Wen J, Chen Z, Huang S, Vrabie D (2022) LBNL fault detection and diagnostics datasets. https://doi.org/10.25984/1881324
Khaire UM, Dhanalakshmi R (2022) Stability of feature selection algorithm: a review. J King Saud Univ-Comput Inf Sci 34(4):1060–1073
Robnik-Šikonja M, Kononenko I et al (1997) An adaptation of Relief for attribute estimation in regression. In: Machine Learning: Proceedings of the Fourteenth International Conference (ICML'97), vol 5, pp 296–304. Citeseer
Htun HH, Biehl M, Petkov N (2023) Survey of feature selection and extraction techniques for stock market prediction. Financ Innov 9(1):26
Kaur S, Kumar Y, Koul A, Kumar Kamboj S (2023) A systematic review on metaheuristic optimization techniques for feature selections in disease diagnosis: open issues and challenges. Archiv Comput Methods Eng 30(3):1863–1895
Thakkar A, Lohiya R (2023) Fusion of statistical importance for feature selection in deep neural network-based intrusion detection system. Inf Fusion 90:353–363
Jenul A, Schrunner S, Huynh BN, Tomic O (2021) Rent: a python package for repeated elastic net feature selection. J Open Sour Softw 6(63):3323
Breiman L (2001) Random forests. Mach Learning 45:5–32
Menze BH, Kelm BM, Masuch R, Himmelreich U, Bachert P, Petrich W, Hamprecht FA (2009) A comparison of random forest and its gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinf 10:1–16
Ding C, Peng H (2005) Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol 3(02):185–205
Zhao Z, Anand R, Wang M (2019) Maximum relevance and minimum redundancy feature selection methods for a marketing machine learning platform. In: 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp 442–452. IEEE
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Stat Methodol 58(1):267–288
Zhang H, Wang J, Sun Z, Zurada JM, Pal NR (2019) Feature selection for neural networks using group lasso regularization. IEEE Trans Knowl Data Eng 32(4):659–673
Lei L, Du L-X, He Y-L, Yuan J-P, Wang P, Ye B-L, Wang C, Hou Z (2023) Dictionary learning lasso for feature selection with application to hepatocellular carcinoma grading using contrast enhanced magnetic resonance imaging. Front Oncol 13:1123493
Figueroa Barraza J, López Droguett E, Martins MR (2021) Towards interpretable deep learning: a feature selection framework for prognostics and health management using deep neural networks. Sensors 21(17):5888
Kumar RA, Franklin JV, Koppula N (2022) A comprehensive survey on metaheuristic algorithm for feature selection techniques. Mater Today Proc 64:435–441
Sun L, Chen Y, Ding W, Xu J, Ma Y (2023) Amfsa: adaptive fuzzy neighborhood-based multilabel feature selection with ant colony optimization. Appl Soft Comput 138:110211
Priyadarshini J, Premalatha M, Čep R, Jayasudha M, Kalita K (2023) Analyzing physics-inspired metaheuristic algorithms in feature selection with k-nearest-neighbor. Appl Sci 13(2):906
Ribeiro MT, Singh S, Guestrin C (2016) "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 1135–1144
Lundberg SM, Lee S-I (2017) A unified approach to interpreting model predictions. Adv Neural Inf Proc Syst 30:4765–4774
Man X, Chan E (2021) The best way to select features? Comparing mda, lime, and shap. J Financ Data Sci Winter 3(1):127–139
Marcílio WE, Eler DM (2020) From explanations to feature selection: assessing SHAP values as feature selection mechanism. In: 2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp 340–347. IEEE
Sánchez-Hernández SE, Salido-Ruiz RA, Torres-Ramos S, Román-Godínez I (2022) Evaluation of feature selection methods for classification of epileptic seizure eeg signals. Sensors 22(8):3066
Zhou K, Tong Y, Li X, Wei X, Huang H, Song K, Chen X (2023) Exploring global attention mechanism on fault detection and diagnosis for complex engineering processes. Process Saf Environ Prot 170:660–669
Huang T, Zhang Q, Tang X, Zhao S, Lu X (2022) A novel fault diagnosis method based on cnn and lstm and its application in fault diagnosis for complex systems. Artif Intell Rev 55:1–27
Lomov I, Lyubimov M, Makarov I, Zhukov LE (2021) Fault detection in tennessee eastman process with temporal deep learning models. J Ind Inf Integr 23:100216
Chadha GS, Schwung A (2017) Comparison of deep neural network architectures for fault detection in tennessee eastman process. In: 2017 22nd IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), pp 1–8. IEEE
Yang Y, Shi H, Tao Y, Ma Y, Song B, Tan S (2023) A semi-supervised feature contrast convolutional neural network for processes fault diagnosis. J Taiwan Inst Chem Eng 151:105098
Okaro IA, Jayasinghe S, Sutcliffe C, Black K, Paoletti P, Green PL (2019) Automatic fault detection for laser powder-bed fusion using semi-supervised machine learning. Addit Manuf 27:42–53
Tao H, Shi H, Qiu J, Jin G, Stojanovic V (2023) Planetary gearbox fault diagnosis based on fdknn-dgat with few labeled data. Meas Sci Technol 35(2):025036
Wang B, Mao Z (2019) Outlier detection based on a dynamic ensemble model: applied to process monitoring. Inf Fusion 51:244–258
Hu C, Xu Z, Kong X, Luo J (2019) Recursive-cpls-based quality-relevant and process-relevant fault monitoring with application to the tennessee eastman process. IEEE Access 7:128746–128757
Samuel RT, Cao Y (2016) Nonlinear process fault detection and identification using kernel pca and kernel density estimation. Syst Sci Control Eng 4(1):165–174
Song X, Sun P, Song S, Stojanovic V (2023) Finite-time adaptive neural resilient dsc for fractional-order nonlinear large-scale systems against sensor-actuator faults. Nonlinear Dyn 111(13):12181–12196
Yin S, Wang Y, Wang G, Khan A-Q, Haghani A (2018) Key performance indicators relevant fault diagnosis and process control approaches for industrial applications. J Control Sci Eng 2018:1–2
Choi SW, Lee C, Lee J-M, Park JH, Lee I-B (2005) Fault detection and identification of nonlinear processes based on kernel pca. Chemom Intell Lab Syst 75(1):55–67
Ren M, Liang Y, Chen J, Xu X, Cheng L (2023) Fault detection for nox emission process in thermal power plants using sip-pca. ISA Trans 140:46–54
Maran Beena A, Pani AK (2021) Fault detection of complex processes using nonlinear mean function based gaussian process regression: application to the tennessee eastman process. Arab J Sci Eng 46:6369–6390
Wang R, Zhuang Z, Tao H, Paszke W, Stojanovic V (2023) Q-learning based fault estimation and fault tolerant iterative learning control for mimo systems. ISA Trans 142:123–135
Tan Y, Hu C, Zhang K, Zheng K, Davis EA, Park JS (2020) Lstm-based anomaly detection for non-linear dynamical system. IEEE access 8:103301–103308
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Bolboacă R (2022) Adaptive ensemble methods for tampering detection in automotive aftertreatment systems. IEEE Access 10:105497–105517
Bolboacă R, Haller P (2023) Performance analysis of long short-term memory predictive neural networks on time series data. Mathematics 11(6):1432
Wu Z, Wang Q (2007) A single cusum chart using a single observation to monitor a variable. Int J Prod Res 45(3):719–741
Cameron AC, Windmeijer FA (1997) An r-squared measure of goodness of fit for some common nonlinear regression models. J Econ 77(2):329–342
Jockenhövel T, Biegler LT, Wächter A (2003) Dynamic optimization of the tennessee eastman process using the optcontrolcentre. Comput Chem Eng 27(11):1513–1531
Ricker N, Lee J (1995) Nonlinear modeling and state estimation for the tennessee eastman challenge process. Comput Chem Eng 19(9):983–1005
Rieth C, Amsel B, Tran R, Cook M (2017) Additional tennessee eastman process simulation data for anomaly detection evaluation. Harv Dataverse 1:2017
Krotofil M, Larsen J (2015) Rocking the pocket book: hacking chemical plants. In: DefCon Conference, DEFCON
Basha N, Sheriff MZ, Kravaris C, Nounou H, Nounou M (2020) Multiclass data classification using fault detection-based techniques. Comp Chem Eng 136:106786
Shang J, Chen M, Ji H, Zhou D (2017) Recursive transformed component statistical analysis for incipient fault detection. Automatica 80:313–327
Heo S, Lee JH (2018) Fault detection and classification using artificial neural networks. IFAC-PapersOnLine 51(18):470–475
Kubat M, Kubat J (2017) An Introduction to Machine Learning, vol 2. Springer, Berlin, Heidelberg
Li X, Zhou K, Xue F, Chen Z, Ge Z, Chen X, Song K (2020) A wavelet transform-assisted convolutional neural network multi-model framework for monitoring large-scale fluorochemical engineering processes. Processes 8(11):1480
Onel M, Kieslich CA, Pistikopoulos EN (2019) A nonlinear support vector machine-based feature selection approach for fault detection and diagnosis: application to the tennessee eastman process. AIChE J 65(3):992–1005
Yin S, Ding SX, Haghani A, Hao H, Zhang P (2012) A comparison study of basic data-driven fault diagnosis and process monitoring methods on the benchmark tennessee eastman process. J Process Control 22(9):1567–1581
Wu H, Zhao J (2018) Deep convolutional neural network model based chemical process fault diagnosis. Comput Chem Eng 115:185–197
D’Angelo MF, Palhares RM, Camargos Filho MC, Maia RD, Mendes JB, Ekel PY (2016) A new fault classification approach applied to tennessee eastman benchmark process. Appl Soft Comput 49:676–686
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Karita S, Chen N, Hayashi T, Hori T, Inaguma H, Jiang Z, Someki M, Soplin NEY, Yamamoto R, Wang X, Watanabe S, Yoshimura T, Zhang W (2019) A comparative study on transformer vs rnn in speech applications. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp 449–456
Tunstall L, Von Werra L, Wolf T (2022) Natural Language Processing with Transformers. O’Reilly Media Inc, Sebastopol, California
Murtadha A, Pan S, Bo W, Su J, Cao X, Zhang W, Liu Y (2023) Rank-aware negative training for semi-supervised text classification. Trans Assoc Comput Linguist 11:771–786
Buestán-Andrade P-A, Santos M, Sierra-García J-E, Pazmiño-Piedra J-P (2023) Comparison of lstm, gru and transformer neural network architecture for prediction of wind turbine variables. In: International Conference on Soft Computing Models in Industrial and Environmental Applications, pp 334–343. Springer
Ezen-Can A (2020) A comparison of lstm and bert for small corpus. arXiv preprint arXiv:2009.05451. https://doi.org/10.48550/arXiv.2009.05451
Zeng A, Chen M, Zhang L, Xu Q (2023) Are transformers effective for time series forecasting? Proc AAAI Conf Artif Intell 37:11121–11128
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.