Article

Patch-Wise-Based Self-Supervised Learning for Anomaly Detection on Multivariate Time Series Data

1 Department of Intelligent Electronics and Computer Engineering, Chonnam National University, 77, Yongbong-ro, Buk-gu, Gwangju 61186, Republic of Korea
2 Department of Computational and Data Science, Astana IT University, Astana 010000, Kazakhstan
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2024, 12(24), 3969; https://doi.org/10.3390/math12243969
Submission received: 19 November 2024 / Revised: 13 December 2024 / Accepted: 16 December 2024 / Published: 17 December 2024
(This article belongs to the Special Issue Recent Advances in Time Series Analysis)

Abstract

Multivariate time series anomaly detection is a crucial technology to prevent unexpected errors from causing critical impacts. Effective anomaly detection in such data requires accurately capturing temporal patterns and ensuring the availability of adequate data. This study proposes a patch-wise framework for anomaly detection. The proposed approach comprises four key components: (i) maintaining continuous features through patching, (ii) incorporating various temporal information by learning channel dependencies and adding relative positional bias, (iii) achieving feature representation learning through self-supervised learning, and (iv) supervised learning based on anomaly augmentation for downstream tasks. The proposed method demonstrates strong anomaly detection performance by leveraging patching to maintain temporal continuity while effectively learning data representations and handling downstream tasks. Additionally, it mitigates the issue of insufficient anomaly data by supporting the learning of diverse types of anomalies. The experimental results show that our model achieved a 23% to 205% improvement in the F1 score compared to existing methods on datasets such as MSL, which has a relatively small amount of training data. Furthermore, the model also delivered a competitive performance on the SMAP dataset. By systematically learning both local and global dependencies, the proposed method strikes an effective balance between feature representation and anomaly detection accuracy, making it a valuable tool for real-world multivariate time series applications.

1. Introduction

In recent decades, massive amounts of data have been generated across various industries, such as manufacturing, finance, biotechnology, and energy. This surge in data production stems from the need to manage and optimize diverse systems and processes more efficiently [1,2]. These datasets are often collected simultaneously across multiple variables over time to monitor and analyze the complex operations of each industry in real time, resulting in multivariate time series data. For instance, in the manufacturing sector, variables such as temperature, pressure, and speed are recorded over time, while in the finance domain, indicators like stock prices, trading volumes, and interest rates are gathered in real time [3].
Multivariate time series data inherently capture the interactions among variables, making their effective processing and analysis a crucial challenge. These data are characterized by complex interdependencies among variables over time, which underscores the importance of analyzing them and performing anomaly detection [4,5,6]. Specifically, anomaly detection involves identifying abnormal states or events that deviate from the normal operation of a system, enabling early intervention before issues escalate. For example, in manufacturing, detecting minor equipment defects at an early stage can prevent major breakdowns, while in finance, identifying unusual trading patterns can help prevent financial fraud or accidents.
As a result, anomaly detection has emerged as a critical issue across various industries, driving continued research efforts to address these challenges effectively.
Various techniques have been widely employed to address the anomaly detection problem. In the early stages, traditional statistical forecasting-based methods, such as exponential smoothing, auto-regressive moving average (ARMA), and auto-regressive integrated moving average (ARIMA), were proposed for anomaly detection [7,8,9,10]. Deep learning, with its ability to effectively capture the complex relationships among variables in multivariate time series data, provides much more flexible and sophisticated models than traditional statistical methods. With recent advancements in deep learning, a variety of deep forecasting models have been developed, including recurrent neural network-based methods (e.g., DeepAR [11], LSTNet [12]), convolutional neural network-based methods (e.g., TCN [13], SCINet [14]), transformer-based methods (e.g., Informer [15], Autoformer [16], FEDformer [17], PatchTST [18]), and MLP-based methods (e.g., N-BEATS [19], NLinear [20], DLinear [20]) [21,22,23,24]. He, Y. et al. [25] demonstrated the applicability of TCN models in the field of time series anomaly detection by incorporating a multi-scale feature mixture method based on the TCN model. Li, X. et al. [26] proposed the ConvTrans-CL model, which combines 1D convolution and transformer, to address anomaly detection and representation tasks in time series temperature data with complex pattern distributions using contrastive learning. Xu, J. et al. [27] tackled issues arising from point-wise representation learning by suggesting that the attention weight distribution of the transformer model can derive correlations in time series data. Various time series forecasting models have shown promising performance when applied to the domain of time series anomaly detection. However, deep learning models require large amounts of data to achieve high performance.
Compared to image or natural language data, multivariate time series data are often limited in quantity, posing a significant challenge in training deep learning models. Additionally, anomaly detection in time series data requires expert-labeled datasets, which is a time-consuming and expensive process [28,29]. To address these challenges, unsupervised learning approaches have been actively studied in recent years. Unsupervised learning involves reconstructing unlabeled data and calculating anomaly scores based on the differences between the reconstructed data and the original data to detect anomalies [30]. However, these methods may exhibit biased tendencies when trained on limited data and often struggle with new forms of time series patterns. In this work, we propose a hybrid approach: first pre-training a model using unsupervised learning and then fine-tuning the pre-trained model for supervised learning tasks based on available labeled data. This approach leverages the strong reconstruction capabilities of the model, improving performance while addressing data scarcity and labeling cost issues. Effectively reconstructing time series data requires learning both the local and global features of each variable. To achieve this, we incorporated a frequency-domain learning approach [31]. By transforming the time series data into the frequency domain using Fourier transformation, we can effectively capture both the low-frequency components (trends) and high-frequency components (volatility) of the time series data [32]. This enables better modeling of critical patterns in time series data and significantly enhances the accuracy of anomaly detection. Another challenge in forecasting and anomaly detection for multivariate time series data lies in capturing the interactions and asynchronous characteristics among variables. Real-world industrial data often exhibit time lags and differing physical measurements, making accurate modeling crucial. 
Recently, a channel independence-based approach was proposed, treating multivariate time series data as univariate series for each channel [18]. While this approach offers advantages in terms of fast convergence and noise resistance, it has limitations in learning inter-variable relationships. To address these limitations, we propose a method that considers channel interdependence, effectively learning the relationships among variables. In this paper, we introduce a patch-wise learning framework for anomaly detection, designed to tackle these challenges. The proposed framework consists of the following key components:
  • Maintaining continuous features through patching: Effectively learns local patterns while maintaining the continuity of time series data, providing better data representation compared to conventional simple sampling methods.
  • Incorporating various temporal information by learning channel dependencies and adding relative positional bias: Addresses the issue of missing inter-variable relationships in traditional channel-independent approaches while integrating diverse temporal information.
  • Achieving feature representation learning through self-supervised learning: Reduces dependence on labeled data and enables better learning of data feature representations.
  • Supervised learning based on anomaly augmentation for downstream tasks: Alleviates the scarcity of anomaly data while allowing the model to learn various types of anomalies effectively.
The remainder of this paper is organized as follows: Section 2 presents related work, Section 3 provides a detailed explanation of the methodology, and Section 4 describes the datasets and experimental setup. Finally, Section 5 concludes the study.

2. Related Work

Deep Learning in Time Series Anomaly Detection

Traditional anomaly detection methods have primarily relied on approaches such as similarity-based detection, window-based detection using hidden Markov models, and decomposition-based methods [33,34]. More recently, the rapid advancements in deep learning have catalyzed numerous studies on deep learning-based anomaly detection. Depending on the type of input data, methods have been designed for univariate or multivariate time series, or both. Based on labeling approaches, anomaly detection methods can be categorized into unsupervised and supervised techniques. For univariate time series, statistical methods and the analysis of 20 anomaly detection techniques have identified models suitable for specific anomaly types [6,35]. In contrast, multivariate time series anomaly detection emphasizes representation learning due to the complexity of multivariate data. Approaches based on graph neural networks and regression have been explored to address these challenges [36,37].
Among the unsupervised approaches, generative model-based anomaly detection methods such as DAGMM [38], LSTM-VAE [39], AnoGAN [40], and TadGAN [41] have been developed. On the other hand, supervised learning approaches have utilized RNN-based models, including LSTM and GRU, along with traditional methods like linear regression and random forests [42]. However, supervised approaches face challenges due to the difficulty of labeling anomalous instances and the limited availability of labeled data. To address these issues, self-supervised learning methods, such as the AnomalyBERT model, which leverages synthetic data for training, have been proposed [43].
Unsupervised learning eliminates the need for labeled data, significantly reducing labeling costs and time. It identifies anomalies by analyzing patterns and distributions in the data, making it particularly suitable for multivariate time series and unstructured datasets. However, due to the absence of labeled guidance, it is challenging to clearly distinguish between normal and anomalous data. Additionally, unsupervised methods are sensitive to noise, increasing the risk of false positives in volatile cases, and interpreting the anomalies can be difficult. To address these limitations, our method leverages unsupervised learning to extract deep feature representations by capturing data patterns and distributions. By combining labeled data in downstream tasks, the model improves the interpretability and accuracy of anomaly detection.
Transformer models, which have shown exceptional performance in image and natural language processing tasks, have also demonstrated effectiveness in time series analysis. For instance, Autoformer [16] employs an autocorrelation-based attention mechanism to identify dependencies at the sub-series level. FEDformer [17], on the other hand, applies attention mechanisms in the frequency domain to analyze seasonal and trend patterns using low- and high-frequency components for linear complexity modeling. These advancements underscore the growing interest in self-attention mechanisms within the frequency domain for time series modeling.
Frequency-domain information has been proven effective for capturing periodic patterns and enhancing feature representations. Models like Frequency-MLP [44] have shown the capability to model both short-term and long-term dependencies by leveraging this information. Transforming data into the frequency domain allows for more concise representations and clearer identification of periodic patterns. However, this can lead to the loss of temporal position information. Our method addresses this limitation by learning temporal and frequency information jointly, enabling the representation of seasonal and long-term dependencies while maintaining temporal context.
Patch-based modeling has also gained attention in time series forecasting and anomaly detection. Models such as PatchTST [18] and iTransformer [45] segment time series data into patches, using transformers to learn global representations. For anomaly detection, PatchAD [46] and MPFormer [47] use multi-scale patch embedding and attention mechanisms to capture both variable and temporal dependencies effectively. Unlike traditional methods [39,42], which analyze individual points in isolation, our approach processes data in patch units. This enables localized pattern learning and minimizes the influence of noise on time series data, improving anomaly detection performance.

3. Methods

In this section, we describe the framework for the proposed patch-wise learning approach. The framework consists of two phases. The first phase is a self-supervised learning-based representation learning process using patching. The second phase involves supervised learning based on anomaly augmentation of the patches. The framework is composed of several key components: positional embedding, projection, transformer encoder, and a linear layer. The input data undergo patching and either masking or abnormal augmentation before being passed through positional embedding and projection. These processes assign temporal context to the time series data, where the order of events is critical. The data, now including positional information, are projected into a high-dimensional space and fed into the transformer encoder. The transformer encoder processes the input on a patch level, learning temporal relationships among the patches as well as dependencies across channels. Subsequently, the linear layer receives the encoder’s output, predicting the masked data regions during the first phase and producing binary classification results indicating the normality or abnormality of each data patch during the second phase. By integrating these two phases, the pre-trained model is fine-tuned for downstream tasks, enhancing its overall performance. A detailed explanation of each phase is provided in the following sections. Figure 1 visualizes the proposed framework to provide a clearer understanding.

3.1. Problem Definition

Multivariate time series (MTS) consist of signals from multiple channels and are defined by $C$ covariates at each timestamp $t$. Given a time series observation of length $T$, $x_{1:T} \in \mathbb{R}^{C \times T}$ represents $C$ channels and $T$ time steps.
To process the data efficiently, the series are divided into $P$ patches, where each patch contains $T_p = T / P$ time steps. Thus, the input can be represented as $X_{\text{patches}} = \{X_1, X_2, \ldots, X_P\}$, where each $X_p \in \mathbb{R}^{C \times T_p}$.
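The patching step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `make_patches`, the `(P, C, T_p)` output layout, and the choice to drop an incomplete trailing patch are all assumptions made for this example.

```python
import numpy as np

def make_patches(x: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a (C, T) series into non-overlapping patches of shape (P, C, patch_size)."""
    C, T = x.shape
    P = T // patch_size               # number of complete patches
    x = x[:, : P * patch_size]        # assumption: drop the incomplete tail, if any
    # (C, P, patch_size) -> (P, C, patch_size)
    return x.reshape(C, P, patch_size).transpose(1, 0, 2)

x = np.arange(2 * 10, dtype=float).reshape(2, 10)   # C = 2 channels, T = 10 steps
patches = make_patches(x, patch_size=4)
print(patches.shape)   # (2, 2, 4): P = 2 patches, C = 2 channels, T_p = 4 steps
```

Each patch keeps all channels together, which matches the channel-dependent view used later in the framework.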
As part of the input processing, a subset of these patches is randomly selected and masked. Specifically, the masked patches $M_{\text{patches}}$ are set to zeros or replaced with a learnable mask embedding, as shown in Equation (1):
$$X_{\text{masked patches}} = X_{\text{patches}} \odot M_{\text{patches}} \tag{1}$$
where $M_{\text{patches}}$ is a binary mask matrix indicating which patches are masked, and $\odot$ represents element-wise multiplication.
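As a rough illustration of Equation (1), the sketch below zeroes out a random subset of patches by element-wise multiplication with a binary mask. The helper name `mask_patches`, the `mask_ratio` parameter, and the convention that 0 marks a masked patch are assumptions for this example. The learnable mask embedding variant is not shown.

```python
import numpy as np

def mask_patches(patches: np.ndarray, mask_ratio: float, rng: np.random.Generator):
    """Zero out a random subset of patches. `patches` has shape (P, C, T_p)."""
    P = patches.shape[0]
    n_masked = int(round(mask_ratio * P))
    masked_idx = rng.choice(P, size=n_masked, replace=False)
    keep = np.ones(P, dtype=patches.dtype)   # assumption: 1 = visible, 0 = masked
    keep[masked_idx] = 0
    # Broadcast the per-patch mask over channels and time steps (element-wise product).
    return patches * keep[:, None, None], keep

rng = np.random.default_rng(0)
patches = np.ones((10, 3, 4))                # P = 10, C = 3, T_p = 4
masked, keep = mask_patches(patches, mask_ratio=0.4, rng=rng)
print(int(keep.sum()))                       # 6 patches stay visible
```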
To perform anomaly detection and prediction, the time series reconstruction process generates a series of the same length $T$, denoted as $y_{1:T} \in \mathbb{R}^{C \times T}$. The reconstruction involves learning to predict the masked portions of the input.
This reconstruction process involves learning a function $F_\theta : X_{\text{masked patches}} \to Y \in \mathbb{R}^{C \times T}$, which is trained to maximize the joint log-likelihood, as shown in Equation (2):
$$\max_{\theta} \; \mathbb{E}_{(X, Y)} \log p\left(Y \mid F_\theta\left(X_{\text{masked patches}}\right)\right) \tag{2}$$
In this context, $\theta$ represents the trainable parameters of $F_\theta$. These parameters correspond to the weights or biases of the model that maps the input time series data $X_{\text{masked patches}}$ to the output $Y$, and they are optimized during the training process to fit the data. Initially, $\theta$ is initialized through self-supervised learning by focusing on the reconstruction of masked patches, and it is subsequently fine-tuned during the supervised learning process to enhance performance in anomaly detection and prediction tasks.
Instead of initially training the reconstruction model $F_\theta$ through fully observable patches, we propose a robust masking-based self-supervised framework. This reconstruction model utilizes a transformer encoder to process the partially masked input, extracting temporal dependencies and channel-level relationships from the visible patches. The encoder processes each patch $X_p$, generating a latent representation $Z_p = \mathrm{Encoder}(X_p, P)$. The combined latent representations $Z = \{Z_1, Z_2, \ldots, Z_P\} \in \mathbb{R}^{P \times d}$, where $d$ is the embedding dimension, are then used to predict the masked portions of the input. Finally, the decoded output $\hat{Y}$ reconstructs the full time series, ensuring the output maintains the same temporal and multivariate structure as the input.
This masking-based reconstruction process enables the model to learn more robust representations by focusing on both visible and masked regions, making it effective for downstream tasks, such as anomaly detection and prediction.

3.2. Self-Supervised Learning-Based Representation Learning

In time series data research, studies have focused on improving data representation learning performance through masked modeling [18,48]. Masked modeling involves masking a portion of the data and reconstructing it. This technique plays a crucial role in handling missing data and is particularly effective for learning abstract representations while conceptualizing locality within patches. In this paper, we propose a self-supervised learning method that reconstructs masked portions of the data using a patch-based masking approach. Masking techniques are particularly valuable for addressing anomalies caused by missing physical measurements, which often occur when electronic devices encounter issues. Sensors or measurement equipment in electronic devices can experience faults or errors during data collection for various reasons, leading to data loss or the recording of anomalous values. Missing physical measurements are a frequent challenge in time series data, making reconstruction or the supplementation of missing information essential. The self-supervised learning approach to reconstruct masked data is well-suited to tackling this issue. By masking missing data segments and reconstructing them, the model learns the patterns and characteristics of the time series data. A model trained using masking techniques contributes not only to the restoration of missing physical data but also to the improvement of overall data representation learning performance. The extent to which the reconstructed data match the original data serves as a key metric for evaluating the model’s learning performance. Moreover, during the reconstruction process, the model learns both the locality within patches and the global relationships across the time series, ultimately enhancing the overall expressiveness of the data representation.
Figure 2 illustrates this process. The structure takes time series data as input, applies patching, masks certain patches, and then restores them using a transformer-based encoder. In this process, the original time series data ($\mathbb{R}^{N \times T}$) are first patched. Patching divides the time series data into small segments, which is advantageous for learning dependencies between time points within the transformer architecture. Each patch is represented in $\mathbb{R}^{N \times P}$ space, where $N$ is the dimensionality of the time series, and $P$ is the patch size.
This structure enables learning both local and global features of time series data through patch-based training, and frequency and temporal losses play a crucial role in ensuring consistency between the restored and actual values. Balancing these two loss functions is essential to optimizing the model’s performance. Therefore, this study’s approach, which aims to restore missing data caused by physical defects in electronic devices, benefits from using self-supervised learning with masking, as it simultaneously enhances the restoration performance and representation learning of time series data.

3.3. Anomaly Augmentation-Based Supervised Learning

In the second phase, the transformer model trained in the first phase using self-supervised learning is employed for anomaly detection by leveraging either transfer learning or weight initialization effects [49]. The goal of this phase is to evaluate time series data more accurately and identify anomalous events. Unlike the non-overlapping approach used in the first phase, the second phase divides the data using an overlapping approach. This method generates patches that contain more information, enabling more detailed analysis. For each patch, anomalies are randomly injected to create synthetic data. Figure 3 illustrates this process.
The synthetic anomaly generation method utilized in this phase comprises three approaches, as illustrated in Figure 4:
  • Soft replacement: Replaces data with segments from other intervals.
  • Uniform replacement: Replaces data within the patch with a constant value.
  • Peak noise: Adds noise to specific data segments to create extreme values.
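The three augmentations listed above can be sketched as follows. The source describes them only in one line each, so every detail here is an assumption: the function names, the `weight` blending factor (with `weight=1.0` giving full replacement), the constant `value`, and the spike size `scale` are hypothetical choices for illustration.

```python
import numpy as np

def soft_replacement(patch, series, rng, weight=1.0):
    """Blend the patch with an equally long segment from another interval of `series` (C, T)."""
    t_p = patch.shape[1]
    start = rng.integers(0, series.shape[1] - t_p + 1)   # random donor interval
    donor = series[:, start:start + t_p]
    return (1.0 - weight) * patch + weight * donor

def uniform_replacement(patch, value):
    """Replace every value in the patch with a single constant."""
    return np.full_like(patch, value)

def peak_noise(patch, rng, scale=5.0):
    """Add one large spike at a random channel and time step."""
    out = patch.copy()
    c = rng.integers(0, patch.shape[0])
    t = rng.integers(0, patch.shape[1])
    out[c, t] += scale * (patch.std() + 1e-8)            # spike sized relative to the patch spread
    return out

rng = np.random.default_rng(0)
series = rng.normal(size=(3, 64))      # C = 3 channels, T = 64 steps
patch = series[:, 8:12]                # one patch with T_p = 4
aug = {
    "soft": soft_replacement(patch, series, rng),
    "uniform": uniform_replacement(patch, value=0.0),
    "peak": peak_noise(patch, rng),
}
```

An augmented patch would then receive label 1 and an untouched patch label 0.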
In this step, the addition of anomalies to randomly selected patches generates labels indicating whether each patch is normal or anomalous. These labels are subsequently used for supervised learning. The encoder from the transformer model trained during the first phase is utilized to perform supervised learning based on the synthesized anomaly data. This process enhances the transformer’s ability to recognize anomalous conditions more effectively. Furthermore, unlike traditional unsupervised anomaly detection methods, the application of supervised learning allows for the integration of expert-labeled data. By leveraging supervised learning, anomaly scores are computed and used to identify anomalous conditions, enabling a more precise and robust detection process. Algorithm 1 represents the pseudocode of these methods, illustrating the flow of the algorithm.
Algorithm 1. Anomaly augmentation-based supervised learning.
Input: Multivariate time series data $X \in \mathbb{R}^{N \times T}$
Output: Anomaly scores $Y_{\text{score}} \in [0, 1]^{1 \times P}$
BEGIN
Step 1. Patching:
  Compute number of patches:
     $P = \lfloor T / \text{patch size} \rfloor$
  Divide $X$ into non-overlapping patches:
     $X_{\text{patches}} = \{X_1, X_2, \ldots, X_P\}$, where $X_p \in \mathbb{R}^{N \times T_p}$
Step 2. Random Selection:
  Randomly select a subset of patches:
     $X_{\text{selected}} \subset X_{\text{patches}}$
Step 3. Anomaly Augmentation:
  Apply anomaly transformations to selected patches:
     $X_{\text{augmented}} = \mathrm{AnomalyAugment}(X_{\text{selected}})$
  Assign labels:
     $Y_{\text{labels}} = 1$ if $X_p \in X_{\text{augmented}}$ else $0$, for all $X_p \in X_{\text{patches}}$
Step 4. 1D Convolution and Embedding:
  Pass all patches through a 1D convolution layer:
     $E = \mathrm{Conv1D}(X_{\text{patches}})$
Step 5. Transformer Encoder:
  Encode embedding to capture global dependencies:
     $Z = \mathrm{TransformerEncoder}(E)$
Step 6. Anomaly Score Prediction:
  Pass encoded embedding through a linear layer:
     $Y_{\text{score}} \leftarrow \mathrm{Linear}(Z)$
Step 7. Loss Calculation:
  Compute binary cross-entropy loss:
     Loss = Binary cross-entropy loss
Step 8. Model Training:
  Update model parameters $\theta$ by minimizing Loss: $\theta$ = Optimize(Loss)
END

3.4. Component

The proposed patch-wise learning framework consists of the following key components. The framework is designed with four main elements: (i) patching, (ii) channel dependency, (iii) 1D convolution-based value embedding, and (iv) transformer encoder. Below, we provide detailed descriptions of the specific configurations for each component.
  • Patching: Patching involves segmenting multivariate time series data, composed of $C$ variables, into patches according to a specified patch size. The original time series data $x_{1:T} \in \mathbb{R}^{C \times T}$ are divided into $P$ patches $X_{\text{patches}} = \{X_1, X_2, \ldots, X_P\}$, where each patch $X_p \in \mathbb{R}^{C \times T_p}$ spans $T_p$ time steps, and $P = T / \text{patch size}$. While this approach is similar to the sliding window method, we do not use overlapping between patches in our framework. The patching mechanism is particularly advantageous for handling high-dimensional datasets with a large number of variables ($C$) and long temporal lengths ($T$). By dividing time series data into smaller, more manageable patches, the framework significantly reduces computational overhead. Using non-overlapping patch segmentation further optimizes computational costs. For long sequences, the cost of point-wise self-attention grows quadratically with the input length ($T$). With patching, the encoder attends over $P$ patch tokens rather than $T$ individual time steps, which reduces the attention cost by a factor of roughly $T_p^2$ and makes the framework more scalable. The flexibility of the framework, allowing for adjustments in both patch size and overlap, enables a balance between computational efficiency and representational capacity. This ensures its applicability to both small-scale and large-scale time series data. Consequently, the framework is a robust choice for real-world applications involving high-dimensional and long-sequence datasets.
  • Channel dependency: Channel independency treats each variable individually, analyzing them as independent entities. This approach is robust to the typical distribution shift challenges inherent in time series data [18]. However, for tasks like anomaly detection, where the relationships among variables are critical, it is essential to account for channel dependency during learning. This is because interactions among variables in time series data can serve as important clues for detecting anomalous patterns. For example, when the value of one variable changes drastically, its relationship with other variables can help determine whether this change is normal or indicative of an anomaly. Ignoring such interdependencies increases the risk of overlooking significant anomalous patterns, which can lead to reduced detection accuracy. By incorporating channel dependency into the learning process, the complex relationships among variables can be effectively reflected, enabling more accurate anomaly detection. Figure 5 illustrates a comparison between channel-independent and channel-dependent strategies. This framework adopts a channel dependency strategy to effectively capture and learn inter-variable relationships.
  • One-dimensional convolution-based value embedding: The value embedding in a transformer model is directly related to the three key matrices used in the self-attention mechanism: queries, keys, and values [50]. Since it plays a crucial role in the attention computation, the core mechanism of transformer models, it is essential to optimize its generation process. Traditionally, MLPs (multi-layer perceptrons) are used to generate the value embedding. However, MLPs do not explicitly capture positional or sequential correlations among different parts of the input data. To address this limitation, we utilized a 1D convolution-based value embedding approach. Using 1D convolution allows the model to learn the relationships among adjacent values at each time step, effectively capturing local patterns. Additionally, since 1D convolution learns weights dynamically, it can adapt to the data and incorporate positional information. This embedding is represented in the $\mathbb{R}^{N \times P \times 512}$ space. Furthermore, to incorporate dynamic positional information, relative positional encoding is applied within the layer, enhancing the representation of position-dependent features.
  • Transformer encoder: The transformer encoder used in the proposed framework is based on the Vanilla Transformer Encoder and serves as the backbone network for pre-training [50]. The projected embedding from the previous step is input into the transformer encoder to learn the complex dependencies across the time series. This process employs multi-head self-attention (MHSA) to capture correlations among patches, with batch normalization applied. The attention for each head is calculated as follows:
    $$\mathrm{Head}_i = \mathrm{Attention}(Q_p, K_p, V_p) = \mathrm{softmax}\left(\frac{(W_i^{Q} Q_p)(W_i^{K} K_p)^{T}}{\sqrt{d}}\right) W_i^{V} V_p \tag{3}$$
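To make the per-head computation concrete, here is a minimal NumPy sketch of scaled dot-product attention over patch embeddings. It is an illustration under stated assumptions rather than the authors' code: the function names and the toy shapes are hypothetical, projections are applied in row-vector form (`Z @ W`), and $d$ is taken to be the per-head dimension as in the standard transformer, which the source does not spell out. Batch normalization and the relative positional bias are omitted.

```python
import numpy as np

def softmax(a: np.ndarray, axis: int = -1) -> np.ndarray:
    a = a - a.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Z, W_q, W_k, W_v):
    """Z: (P, d_model) patch embeddings; W_q, W_k, W_v: (d_model, d_head) projections."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    d = Q.shape[-1]                            # assumption: scale by the head dimension
    weights = softmax(Q @ K.T / np.sqrt(d))    # (P, P) patch-to-patch attention weights
    return weights @ V, weights

rng = np.random.default_rng(0)
P, d_model, d_head = 6, 16, 8                  # toy sizes, not from the source
Z = rng.normal(size=(P, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = attention_head(Z, W_q, W_k, W_v)
print(out.shape, weights.shape)                # (6, 8) (6, 6)
```

Each row of `weights` sums to one and describes how strongly one patch attends to every other patch.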

3.5. Loss Function

In this framework, we propose using a frequency loss in addition to the standard time loss during the pre-training phase.
  • Time loss: This measures the difference between the restored patch and the actual values using the mean squared error (MSE). This serves as a measure of temporal consistency in the time series data.
$$L_{\text{time}}(y, \hat{y}) = \mathbb{E}_{x}\, \frac{1}{C} \sum_{c=1}^{C} \left\lVert y_c - \hat{y}_c \right\rVert^{2} \tag{4}$$
  • Frequency loss: After transforming the time series data into the frequency domain using Fourier transformation, denoted $\mathcal{F}$, this loss measures the spectral difference between the restored patch and the actual values. It considers both high-frequency and low-frequency components, aiding the model in accurately restoring overall patterns. Although frequency-domain losses have been extensively studied in the field of time series forecasting, more research is needed in the anomaly detection domain [16,17,29,42].
$$L_{\text{freq}}(y, \hat{y}) = \mathbb{E}_{x}\, \frac{1}{C} \sum_{c=1}^{C} \tanh\left(\left\lvert \mathcal{F}(y_c) - \mathcal{F}(\hat{y}_c) \right\rvert\right) \tag{5}$$
Finally, the pre-training objective is a weighted sum of the time loss and frequency loss. Here, $\alpha$ is a hyper-parameter that adjusts the contribution of each loss term.
$$L_{\text{total}} = \alpha L_{\text{time}} + (1 - \alpha) L_{\text{freq}} \tag{6}$$
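The combined pre-training loss can be sketched in NumPy as follows. This is a hedged illustration, not the authors' implementation. It assumes the frequency term compares real-input FFT spectra via the magnitude of their difference, averages over both channels and frequency bins, and uses a plain MSE for the time term (which differs from a squared-norm form only by a constant factor). The function names and the value of `alpha` are hypothetical.

```python
import numpy as np

def time_loss(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Mean squared error between target and reconstruction, both of shape (C, T)."""
    return float(np.mean((y - y_hat) ** 2))

def freq_loss(y: np.ndarray, y_hat: np.ndarray) -> float:
    """tanh of the spectral difference magnitude, averaged over channels and frequency bins."""
    spec_diff = np.abs(np.fft.rfft(y, axis=-1) - np.fft.rfft(y_hat, axis=-1))
    return float(np.mean(np.tanh(spec_diff)))   # tanh bounds each bin's contribution to [0, 1)

def total_loss(y, y_hat, alpha):
    """Weighted sum of the two terms; alpha = 1 recovers the pure time loss."""
    return alpha * time_loss(y, y_hat) + (1.0 - alpha) * freq_loss(y, y_hat)

t = np.linspace(0.0, 2.0 * np.pi, 32)
y = np.stack([np.sin(t), np.cos(t)])            # C = 2 channels, T = 32 steps
y_hat = y + 0.1                                 # reconstruction with a constant offset
loss = total_loss(y, y_hat, alpha=0.5)          # alpha = 0.5 is an arbitrary example value
```

Because `tanh` saturates, a single badly reconstructed frequency bin cannot dominate the frequency term, whereas the time term remains sensitive to large point-wise errors.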
  • Anomaly loss: The model trained based on Equation (1) calculates anomaly loss using binary cross-entropy. The input data x t : t + L are divided into P patches, and patches are randomly selected for anomaly augmentation. The model then learns the anomaly label y ^ p for each augmented patch p . Based on the predicted label y ^ p , an anomaly score s p is derived by averaging. The model is then trained to minimize a binary cross-entropy loss function, as shown in Equation (7), based on this anomaly score.
$$Loss = -\frac{1}{P} \sum_{p=1}^{P} \left( y_p \log s_p + (1 - y_p) \log (1 - s_p) \right)$$
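The patch-level anomaly loss can be sketched as follows. This is a minimal example under stated assumptions: the model is taken to output one anomaly probability per time step, the score of a patch is the plain average of those probabilities, and scores are clamped by a small `eps` to keep the logarithm finite. The function names and toy values are hypothetical.

```python
import math

def patch_scores(point_probs, patch_size):
    """Average per-time-step anomaly probabilities within each non-overlapping patch."""
    return [
        sum(point_probs[i:i + patch_size]) / patch_size
        for i in range(0, len(point_probs) - patch_size + 1, patch_size)
    ]

def bce_loss(labels, scores, eps=1e-7):
    """Binary cross-entropy averaged over the P patches."""
    total = 0.0
    for y, s in zip(labels, scores):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return -total / len(labels)

# Toy example: eight time steps, patch size 4; the first patch was augmented (label 1).
point_probs = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.1, 0.0]
labels = [1, 0]

scores = patch_scores(point_probs, patch_size=4)
loss = bce_loss(labels, scores)
```

In this example the scores are approximately 0.75 and 0.1, so the loss is small because both patches are scored on the correct side of 0.5.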

4. Experiments

We conducted experiments on three popular multivariate datasets:
  • Mars Science Laboratory (MSL) dataset: The MSL dataset consists of sensor data collected from a NASA spacecraft and includes 55 telemetry channels. Anomalies were labeled by experts [51].
  • Soil Moisture Active Passive (SMAP) dataset: The SMAP dataset, like the MSL dataset, is a NASA telemetry dataset; it was collected from the Soil Moisture Active Passive satellite, and anomalous conditions were labeled by experts [51].
  • Server Machine Dataset (SMD): The SMD dataset is a server dataset collected over five weeks by a major internet company. The server data come from 28 different machines [52].
Figure 6 visualizes the detailed structure of each dataset. The MSL and SMAP datasets have smaller training sets than test sets, while the SMD dataset has nearly the same number of samples in its training and test sets (Table 1). Additionally, the SMD dataset has a relatively low anomaly ratio given its overall size.

4.1. Settings

All experiments were conducted using the PyTorch 2.3.1 framework on an NVIDIA H100 GPU. The common hyper-parameter settings for training were as follows. The AdamW optimizer was employed with an initial learning rate of 0.0001. A batch size of 16 was used, with the number of training epochs fixed at 5. The dimensionality of the series representation was set to 512. The patch size was 4 (2 for the MSL dataset; see Table 2), and the patches were constructed in a non-overlapping manner. The detailed configurations for the pre-training and downstream phases are provided in Table 2.
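To make the non-overlapping patching concrete, the settings above can be collected in a small configuration and applied to a toy series. This is a sketch only: the dictionary keys, the function name `make_patches`, and the choice to drop any trailing time steps that do not fill a whole patch are assumptions made for this example.

```python
# Common training settings stated in the text (key names are hypothetical).
config = {
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "batch_size": 16,
    "epochs": 5,
    "d_model": 512,
    "patch_size": 4,
}

def make_patches(series, patch_size):
    """Split a (T x C) series into non-overlapping patches of patch_size time steps.

    Trailing time steps that do not fill a whole patch are dropped (an assumption).
    """
    n_patches = len(series) // patch_size
    return [series[i * patch_size:(i + 1) * patch_size] for i in range(n_patches)]

# Toy series: 10 time steps, 2 channels.
series = [[t, 10 * t] for t in range(10)]
patches = make_patches(series, config["patch_size"])
```

With a patch size of 4, the 10-step toy series yields two patches covering time steps 0 to 3 and 4 to 7; no time step appears in more than one patch.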

4.2. Results of Experiment

For comparison, we selected DAGMM [38], LSTM-VAE [39], OmniAnomaly [52], MSCRED [53], THOC [54], USAD [55], GDN [56], AnomalyBERT [43], and AnomalyBERT* as baseline models. All results were reproduced using the official implementations under the same environment with default configurations. Unless otherwise specified, the architectures of each model were maintained with their default settings. To evaluate performance, we used the F1 score and the F1 score with point adjustment [57] as the evaluation metrics.
In this context, AnomalyBERT* represents the experimental results reproduced in our environment, rather than the official implementation. A key difference lies in the sampling-based training approach. The original method uses a sampling strategy for data augmentation, determining training data per iteration rather than sequentially using the entire dataset. As a result, the sampled data can vary with each iteration, potentially compromising reliability and stability. Moreover, this partial training approach could lead to incomplete learning. In contrast, our method generates random anomalies while training on the entire dataset across epochs. By using all the data in training, no specific patterns or data points are excluded, ensuring data reproducibility, reliability, and stability. Based on this approach, we retrained the comparative AnomalyBERT model in our environment.
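The full-dataset training loop described above can be sketched as follows: every window is visited exactly once per epoch, and the patches to be augmented are redrawn at random each time. The augmentation shown (a single additive spike, used here as a stand-in for peak noise), the augmentation probability, and all function names are assumptions made for this example, not the configuration used in the experiments.

```python
import random

def peak_noise(patch, rng, scale=3.0):
    """Return a copy of the patch with a spike added at one random position."""
    noisy = [row[:] for row in patch]
    t = rng.randrange(len(noisy))
    c = rng.randrange(len(noisy[0]))
    noisy[t][c] += scale
    return noisy

def full_pass_epoch(windows, rng, augment_prob=0.3):
    """Yield every window once per epoch with freshly drawn synthetic anomalies."""
    order = list(range(len(windows)))
    rng.shuffle(order)  # the order changes per epoch, but no window is skipped
    for idx in order:
        patches, labels = [], []
        for patch in windows[idx]:
            if rng.random() < augment_prob:
                patches.append(peak_noise(patch, rng))
                labels.append(1)  # augmented patch: anomaly label
            else:
                patches.append([row[:] for row in patch])
                labels.append(0)  # untouched patch: normal label
        yield idx, patches, labels

# Toy data: 5 windows, 3 patches each, each patch 4 time steps x 2 channels.
windows = [[[[float(w)] * 2 for _ in range(4)] for _ in range(3)] for w in range(5)]

rng = random.Random(0)
seen = []
for epoch in range(2):
    visited = [idx for idx, _, _ in full_pass_epoch(windows, rng)]
    seen.append(sorted(visited))
```

In contrast to a sampling-based loop, the set of visited windows is identical in every epoch; only the locations of the synthetic anomalies vary.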
Table 3 presents the performance comparison between our proposed patch-wise learning method and baseline models, evaluated using the F1 score and F1 score with point adjustment. Bold indicates the best performance, and underline represents the second-best performance.
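For readers unfamiliar with the two metrics, the point-adjustment protocol can be sketched as follows: if at least one time step inside a contiguous ground-truth anomaly segment is flagged, the whole segment is counted as detected before the F1 score is computed. The function names and the toy labels are assumptions made for this example.

```python
def point_adjust(pred, truth):
    """Mark a whole ground-truth anomaly segment as detected if any point in it is flagged."""
    adjusted = list(pred)
    i, n = 0, len(truth)
    while i < n:
        if truth[i] == 1:
            j = i
            while j < n and truth[j] == 1:
                j += 1  # j is one past the end of the current segment
            if any(pred[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

def f1_score(pred, truth):
    """Point-wise F1 score for binary predictions."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: two anomaly segments; only one point of the first segment is flagged.
truth = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
pred = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]

f1_plain = f1_score(pred, truth)
f1_pa = f1_score(point_adjust(pred, truth), truth)
```

Here the plain F1 score is about 0.29, while the point-adjusted score is about 0.67, which illustrates why the adjusted metric is regarded as the more lenient of the two.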
In the MSL dataset, the proposed method outperformed all baseline models (DAGMM, LSTM-VAE, OmniAnomaly, MSCRED, THOC, USAD, GDN, AnomalyBERT, and AnomalyBERT*) by 23% to 105% in relative F1 score; at the upper end, its F1 score is about 2.05 times that of the weakest baseline. Regarding the F1 score with point adjustment, USAD achieved the highest performance (0.927), while our model remained competitive with a score of 0.818. The MSL dataset has the smallest training set of the three datasets and an anomaly ratio of about 10%, the second highest after SMAP. This indicates that our method detects anomalies effectively even when training data are limited.
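The relative gains on MSL can be checked directly from the F1 column of Table 3. The sketch below computes, for each baseline, both the relative improvement (ours minus baseline, divided by baseline) and the plain ratio (ours divided by baseline); the variable and function names are assumptions made for this example.

```python
# MSL F1 scores taken from Table 3.
msl_f1 = {
    "DAGMM": 0.199,
    "LSTM-VAE": 0.212,
    "OmniAnomaly": 0.207,
    "MSCRED": 0.199,
    "THOC": 0.190,
    "USAD": 0.211,
    "GDN": 0.217,
    "AnomalyBERT": 0.302,
    "AnomalyBERT*": 0.318,
}
ours = 0.390

def relative_gain(ours_f1, baseline_f1):
    """Relative improvement over a baseline, as a fraction of the baseline score."""
    return (ours_f1 - baseline_f1) / baseline_f1

gains = {name: relative_gain(ours, f1) for name, f1 in msl_f1.items()}
ratios = {name: ours / f1 for name, f1 in msl_f1.items()}

smallest_gain = min(gains.values())  # against the strongest baseline (AnomalyBERT*)
largest_gain = max(gains.values())   # against the weakest baseline (THOC)
largest_ratio = max(ratios.values())
```

The smallest gain is roughly 23% (against AnomalyBERT*), and the largest is roughly 105% (against THOC), which is the same as a ratio of roughly 2.05.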
In the SMAP dataset, the proposed method achieved the second-highest F1 score, following AnomalyBERT, and outperformed the remaining models by 21% to 82%. For the F1 score with point adjustment, MSCRED (0.945), AnomalyBERT* (0.929), and AnomalyBERT (0.914) achieved higher scores than our model (0.842). That AnomalyBERT* scored higher than the original AnomalyBERT on the point-adjusted F1 score (0.929 versus 0.914), although not on the plain F1 score (0.340 versus 0.457), offers partial support for our full-dataset training approach. The high F1 score achieved by our method in the SMAP dataset, which has an anomaly ratio of over 12%, demonstrates its effective performance in datasets with higher anomaly proportions.
In the SMD dataset, the proposed method did not achieve competitive performance in terms of either the F1 score or the F1 score with point adjustment, although it outperformed DAGMM, MSCRED, THOC, and AnomalyBERT* on both metrics. The weaker results can be attributed to the characteristics of the SMD dataset, which contains a significantly larger number of training and test samples than the MSL and SMAP datasets but has a lower anomaly ratio of 4%. This suggests that there is potential for improvement through hyper-parameter tuning and fine-tuning. The strong F1 score performance of our method in the MSL and SMAP datasets highlights its ability to accurately identify anomalies at specific time points. The F1 score with point adjustment, which evaluates anomaly detection under relaxed conditions, does not fully reflect the detection of persistent anomaly patterns. Taken together, the plain F1 results on MSL and SMAP indicate that the proposed method is effective. All our proposed models and experimental results can be found at https://github.com/synapsespectrum/PatchwiseAD (accessed on 13 December 2024).

4.3. Effectiveness of Value Embedding

This study compared the performance of the proposed 1D convolution-based value embedding with that of traditional MLP-based value embedding.
Figure 7 presents the convergence rates and performance results of the experiments conducted on the datasets. The MLP-based value embedding, commonly used in prior works, converged more slowly and reached lower performance than the 1D convolution-based value embedding, which supports the effectiveness of the proposed approach over traditional methods. Furthermore, the 1D convolution-based value embedding showed even larger performance improvements when initialized with pre-trained parameters, reinforcing its efficiency and applicability.
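One intuition for this difference is that a convolution along the time axis mixes each value with its neighbors, whereas a map applied to each time step separately does not. The sketch below illustrates this with a single channel and a single filter, using only the Python standard library. The kernel size, zero padding, function names, and the per-time-step linear map used as a stand-in for the MLP-based embedding are all assumptions made for this example; an actual embedding layer would use many learned filters.

```python
def conv1d_same(series, kernel):
    """1D convolution (cross-correlation) along time with zero padding.

    Assumes an odd kernel length so that the output has the same length as the input.
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(series) + [0.0] * pad
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(series))
    ]

def pointwise_embed(series, weight):
    """Per-time-step linear map: each output depends on one input value only."""
    return [weight * x for x in series]

series = [1.0, 2.0, 4.0, 7.0]

# A difference-like kernel responds to the local trend around each time step.
local_trend = conv1d_same(series, [-1.0, 0.0, 1.0])
# A kernel that only keeps the center value reduces to the pointwise case.
center_only = conv1d_same(series, [0.0, 1.0, 0.0])
pointwise = pointwise_embed(series, 1.0)
```

The convolution output at each step depends on the surrounding values (for example, the rising trend in the middle of the toy series), which is the kind of local feature a pointwise map cannot express.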

5. Conclusions and Future Work

In this paper, we have proposed a framework for patch-wise learning to detect anomalies in multivariate time series data. The proposed framework consists of two key phases. In the first phase, data representation learning was performed using self-supervised learning with patching and masking techniques. In the second phase, the model’s performance was enhanced through supervised learning based on anomaly augmentation. This structure captures both local and global characteristics of the data while further improving model performance with anomaly-augmented supervised learning.
Additionally, an analysis of the effectiveness of value embedding revealed that combining 1D convolution with pre-trained value embedding resulted in faster convergence and higher F1 score performance compared to the traditional MLP-based approach. This demonstrates that the integration of a pre-trained strategy with 1D convolution’s ability to capture local features significantly contributes to improving the learning efficiency of transformer models.
This approach addresses the issue of label scarcity in many real-world datasets and has the advantage of reflecting expert-labeled data. Unsupervised learning can learn key patterns even from data without anomalous patterns and utilize these features during the supervised learning phase to build more sophisticated anomaly detection models. The proposed approach is applicable not only to time series data but also to various domains such as image and natural language processing, where it can leverage unlabeled data to build robust models. However, challenges remain, such as the difficulty of tuning various hyper-parameters (e.g., learning rate, batch size, data augmentation strategies) in both unsupervised and supervised learning stages. Additionally, even if the pre-trained model learns the general structure of the data, there is a risk of overfitting during the supervised learning phase when the labeled data are limited. These limitations could potentially be addressed by employing techniques such as genetic algorithms, evolutionary algorithms, or reinforcement learning to optimize network architectures or improve weight initialization. Additionally, while we utilized three types of anomaly augmentation, we plan to apply additional forms of augmentation to cover a wider range of anomaly detection scenarios.

Author Contributions

Conceptualization, S.O. and L.H.A.; methodology, S.O.; software, L.H.A.; validation, L.H.A., G.H.Y. and J.K.; formal analysis, D.T.V.; investigation, J.K.; resources, S.O.; data curation, S.O.; writing—original draft preparation, S.O. and L.H.A.; writing—review and editing, D.T.V. and M.H.; visualization, L.H.A. and D.T.V.; supervision, G.H.Y.; project administration, J.K.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by the Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (IITP-2024-RS-2022-00156287, 50). This work was partly supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (RS-2021-II212068, Artificial Intelligence Innovation Hub).

Data Availability Statement

All data are publicly available. MSL: https://github.com/thuml/Time-Series-Library (accessed on 13 December 2024); SMAP: https://nsidc.org/data/smap (accessed on 13 December 2024); SMD: https://github.com/NetManAIOps/OmniAnomaly (accessed on 13 December 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Nguyen, D.K.; Sermpinis, G.; Stasinakis, C. Big Data, Artificial Intelligence and Machine Learning: A Transformative Symbiosis in Favour of Financial Technology. Euro. Fin. Manag. 2023, 29, 517–548.
  2. Ao, S.-I.; Fayek, H. Continual Deep Learning for Time Series Modeling. Sensors 2023, 23, 7167.
  3. Fan, J.; Liu, Z.; Wu, H.; Wu, J.; Si, Z.; Hao, P.; Luan, T.H. LUAD: A Lightweight Unsupervised Anomaly Detection Scheme for Multivariate Time Series Data. Neurocomputing 2023, 557, 126644.
  4. Kim, B.; Alawami, M.A.; Kim, E.; Oh, S.; Park, J.; Kim, H. A Comparative Study of Time Series Anomaly Detection Models for Industrial Control Systems. Sensors 2023, 23, 1310.
  5. Mejri, N.; Lopez-Fuentes, L.; Roy, K.; Chernakov, P.; Ghorbel, E.; Aouada, D. Unsupervised Anomaly Detection in Time-Series: An Extensive Evaluation and Analysis of State-of-the-Art Methods. Expert Syst. Appl. 2024, 256, 124922.
  6. Braei, M.; Wagner, S. Anomaly Detection in Univariate Time-Series: A Survey on the State-of-the-Art. arXiv 2020, arXiv:2004.00433.
  7. Pincombe, B. Anomaly Detection in Time Series of Graphs Using ARMA Processes. Bull. Am. Soc. Overseas Res. 2005, 24, 2.
  8. Kozitsin, V.; Katser, I.; Lakontsev, D. Online Forecasting and Anomaly Detection Based on the ARIMA Model. Appl. Sci. 2021, 11, 3194.
  9. Barrientos-Torres, D.; Martinez-Ríos, E.A.; Navarro-Tuch, S.A.; Pablos-Hach, J.L.; Bustamante-Bello, R. Water Flow Modeling and Forecast in a Water Branch of Mexico City through ARIMA and Transfer Function Models for Anomaly Detection. Water 2023, 15, 2792.
  10. Xu, H.; Sun, Z.; Cao, Y.; Bilal, H. A Data-Driven Approach for Intrusion and Anomaly Detection Using Automated Machine Learning for the Internet of Things. Soft. Comput. 2023, 27, 14469–14481.
  11. Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. Int. J. Forecast. 2020, 36, 1181–1191.
  12. Lai, G.; Chang, W.-C.; Yang, Y.; Liu, H. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 95–104.
  13. Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal Convolutional Networks for Action Segmentation and Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1003–1012.
  14. Liu, M.; Zeng, A.; Chen, M.; Xu, Z.; Lai, Q.; Ma, L.; Xu, Q. SCINet: Time Series Modeling and Forecasting with Sample Convolution and Interaction. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022.
  15. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, Virtual Conference, 2–9 February 2021; Volume 35, pp. 11106–11115.
  16. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS 2021), Virtual Conference, 6–14 December 2021.
  17. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-Term Series Forecasting. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286.
  18. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series is Worth 64 Words: Long-Term Forecasting with Transformers. arXiv 2023, arXiv:2211.14730.
  19. Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-Beats: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting. arXiv 2020, arXiv:1905.1043.
  20. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are Transformers Effective for Time Series Forecasting? arXiv 2022, arXiv:2205.13504.
  21. Jin, M.; Koh, H.Y.; Wen, Q.; Zambon, D.; Alippi, C.; Webb, G.I.; King, I.; Pan, S. A Survey on Graph Neural Networks for Time Series: Forecasting, Classification, Imputation, and Anomaly Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10466–10485.
  22. Iqbal, A.; Amin, R. Time Series Forecasting and Anomaly Detection Using Deep Learning. Comput. Chem. Eng. 2024, 182, 108560.
  23. Cui, Q.D.; Xu, C.; Xu, Y.; Ou, W.; Pang, Y.; Liu, Z.; Shen, J.; Baber, M.Z.; Maharajan, C.; Ghosh, U. Bifurcation and Controller Design of 5D BAM Neural Networks With Time Delay. Int. J. Numer. Model. 2024, 37, e3316.
  24. Maharajan, C.; Sowmiya, C.; Xu, C. Delay Dependent Complex-Valued Bidirectional Associative Memory Neural Networks with Stochastic and Impulsive Effects: An Exponential Stability Approach. Kybernetika 2024, 60, 317–356.
  25. He, Y.; Zhao, J. Temporal Convolutional Networks for Anomaly Detection in Time Series. J. Phys.: Conf. Ser. 2019, 1213, 042050.
  26. Li, X.; Chen, Y.; Zhang, X.; Peng, Y.; Zhang, D.; Chen, Y. ConvTrans-CL: Ocean Time Series Temperature Data Anomaly Detection Based Context Contrast Learning. Appl. Ocean. Res. 2024, 150, 104122.
  27. Xu, J.; Wu, H.; Wang, J.; Long, M. Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. arXiv 2022, arXiv:2110.02642.
  28. Wang, D.; Shang, Y. A New Active Labeling Method for Deep Learning. In Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), Beijing, China, 6–11 July 2014; pp. 112–119.
  29. Oh, S.; Ashiquzzaman, A.; Lee, D.; Kim, Y.; Kim, J. Study on Human Activity Recognition Using Semi-Supervised Active Transfer Learning. Sensors 2021, 21, 2760.
  30. An, J.; Cho, S. Variational Autoencoder Based Anomaly Detection Using Reconstruction Probability, Technical Report; SNU Data Mining Center: Seoul, Republic of Korea, 2015.
  31. Zhang, C.; Zhou, T.; Wen, Q.; Sun, L. TFAD: A Decomposition Time Series Anomaly Detection Architecture with Time-Frequency Analysis. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 2497–2507.
  32. Yi, K.; Zhang, Q.; Cao, L.; Wang, S.; Long, G.; Hu, L.; He, H.; Niu, Z.; Fan, W.; Xiong, H. A Survey on Deep Learning Based Time Series Analysis with Frequency Transformation. arXiv 2023, arXiv:2302.02173.
  33. Gao, B.; Ma, H.-Y.; Yang, Y.-H. HMMs (Hidden Markov Models) Based on Anomaly Intrusion Detection Method. In Proceedings of the International Conference on Machine Learning and Cybernetics, Beijing, China, 4–5 November 2002; Volume 1, pp. 381–385.
  34. Gao, J.; Song, X.; Wen, Q.; Wang, P.; Sun, L.; Xu, H. RobustTAD: Robust Time Series Anomaly Detection via Decomposition and Convolutional Neural Networks. arXiv 2021, arXiv:2002.09545.
  35. Paparrizos, J.; Kang, Y.; Boniol, P.; Tsay, R.S.; Palpanas, T.; Franklin, M.J. TSB-UAD: An End-to-End Benchmark Suite for Univariate Time-Series Anomaly Detection. Proc. VLDB Endow. 2022, 15, 1697–1711.
  36. Zhao, H.; Wang, Y.; Duan, J.; Huang, C.; Cao, D.; Tong, Y.; Xu, B.; Bai, J.; Tong, J.; Zhang, Q. Multivariate Time-Series Anomaly Detection via Graph Attention Network. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 17–20 November 2020; pp. 841–850.
  37. Wong, L.; Liu, D.; Berti-Equille, L.; Alnegheimish, S.; Veeramachaneni, K. AER: Auto-Encoder with Regression for Time Series Anomaly Detection. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 1152–1161.
  38. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
  39. Park, D.; Hoshi, Y.; Kemp, C.C. A Multimodal Anomaly Detector for Robot-Assisted Feeding Using an LSTM-Based Variational Autoencoder. IEEE Robot. Autom. Lett. 2018, 3, 1544–1551.
  40. Schlegl, T.; Seeböck, P.; Waldstein, S.M.; Langs, G.; Schmidt-Erfurth, U. F-AnoGAN: Fast Unsupervised Anomaly Detection with Generative Adversarial Networks. Med. Image Anal. 2019, 54, 30–44.
  41. Geiger, A.; Liu, D.; Alnegheimish, S.; Cuesta-Infante, A.; Veeramachaneni, K. TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 33–43.
  42. Jia, W.; Shukla, R.M.; Sengupta, S. Anomaly Detection Using Supervised Learning and Multiple Statistical Methods. In Proceedings of the 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; pp. 1291–1297.
  43. Jeong, Y.; Yang, E.; Ryu, J.H.; Park, I.; Kang, M. AnomalyBERT: Self-Supervised Transformer for Time Series Anomaly Detection Using Data Degradation Scheme. arXiv 2023, arXiv:2305.04468.
  44. Yi, K.; Zhang, Q.; Fan, W.; Wang, S.; Wang, P.; He, H.; Lian, D.; An, N.; Cao, L.; Niu, Z. Frequency-Domain MLPs Are More Effective Learners in Time Series Forecasting. In Proceedings of the 37th Annual Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023.
  45. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. arXiv 2024, arXiv:2310.06625.
  46. Zhong, Z.; Yu, Z.; Yang, Y.; Wang, W.; Yang, K. PatchAD: A Lightweight Patch-Based MLP-Mixer for Time Series Anomaly Detection. arXiv 2024, arXiv:2401.09793.
  47. Zhang, H.; Li, F.; Xu, H.; Huang, S.; Liu, S.; Ni, L.M.; Zhang, L. MP-Former: Mask-Piloted Transformer for Image Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 18074–18083.
  48. Das, A.; Kong, W.; Sen, R.; Zhou, Y. A Decoder-Only Foundation Model for Time-Series Forecasting. arXiv 2024, arXiv:2310.10688.
  49. Yan, P.; Abdulkadir, A.; Luley, P.-P.; Rosenthal, M.; Schatte, G.A.; Grewe, B.F.; Stadelmann, T. A Comprehensive Survey of Deep Transfer Learning for Anomaly Detection in Industrial Time Series: Methods, Applications, and Directions. IEEE Access 2024, 12, 3768–3789.
  50. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  51. Hundman, K.; Constantinou, V.; Laporte, C.; Colwell, I.; Soderstrom, T. Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 387–395.
  52. Su, Y.; Zhao, Y.; Niu, C.; Liu, R.; Sun, W.; Pei, D. Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2828–2837.
  53. Zhang, C.; Song, D.; Chen, Y.; Feng, X.; Lumezanu, C.; Cheng, W.; Ni, J.; Zong, B.; Chen, H.; Chawla, N.V. A Deep Neural Network for Unsupervised Anomaly Detection and Diagnosis in Multivariate Time Series Data. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 1409–1416.
  54. Shen, L.; Li, Z.; Kwok, J.T. Timeseries Anomaly Detection Using Temporal Hierarchical One-Class Network. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems (NeurIPS 2020), Virtual Conference, 6–12 December 2020.
  55. Audibert, J.; Michiardi, P.; Guyard, F.; Marti, S.; Zuluaga, M.A. USAD: UnSupervised Anomaly Detection on Multivariate Time Series. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 3395–3404.
  56. Deng, A.; Hooi, B. Graph Neural Network-Based Anomaly Detection in Multivariate Time Series. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, Virtual Conference, 2–9 February 2021; Volume 35, pp. 4027–4035.
  57. Kim, S.; Choi, K.; Choi, H.-S.; Lee, B.; Yoon, S. Towards a Rigorous Evaluation of Time-Series Anomaly Detection. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, Virtual Conference, 22 February–1 March 2022; Volume 36, pp. 7194–7201.
Figure 1. Patch-wise learning framework: (Left) representation learning based on self-supervised learning using patching, (Right) supervised learning based on anomaly augmentation.
Figure 2. Self-supervised learning-based representation learning architecture.
Figure 3. Supervised learning for anomaly detection.
Figure 4. Anomaly augmentation.
Figure 5. Comparison of channel-dependent (CD) and channel-independent (CI) strategies in time series reconstruction.
Figure 6. Visualization of anomaly datasets.
Figure 7. (a) F1 score performance based on value embedding in the MSL dataset; (b) F1 score performance based on value embedding in the SMAP dataset; (c) F1 score performance based on value embedding in the SMD dataset.
Table 1. Summary of datasets.

| Dataset | Number of Features | Number of Entities | Training Size | Test Size |
|---|---|---|---|---|
| MSL | 55 | 27 | 58,317 | 73,729 |
| SMAP | 25 | 55 | 135,183 | 427,617 |
| SMD | 38 | 28 | 708,405 | 708,420 |
Table 2. Summary of configurations for pre-training and downstream.

| Setting | Pre-Train | Downstream |
|---|---|---|
| Task | Mask Modeling | Anomaly Detection |
| Patch Size | 2 (MSL), 4 (SMAP, SMD) | 2 (MSL), 4 (SMAP, SMD) |
| Masking Ratio | 0.4 | No |
| Batch Size | 16 | 16 |
| Learning Type | Self-Supervised Learning | Supervised Learning |
| Anomaly Augmentation | No | Yes |
| Percent of Anomaly Augmentation | - | Soft Replacement (50%), Uniform Replacement (15%), Peak Noise (15%) |
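Table 3. F1 scores for anomaly detection (F1-PA: F1 score with point adjustment).

| Model | MSL F1 | MSL F1-PA | SMAP F1 | SMAP F1-PA | SMD F1 | SMD F1-PA |
|---|---|---|---|---|---|---|
| DAGMM | 0.199 | 0.701 | 0.333 | 0.712 | 0.238 | 0.723 |
| LSTM-VAE | 0.212 | 0.678 | 0.235 | 0.756 | 0.435 | 0.808 |
| OmniAnomaly | 0.207 | 0.899 | 0.227 | 0.805 | 0.474 | 0.944 |
| MSCRED | 0.199 | 0.775 | 0.232 | 0.945 | 0.097 | 0.389 |
| THOC | 0.190 | 0.891 | 0.240 | 0.781 | 0.168 | 0.541 |
| USAD | 0.211 | 0.927 | 0.228 | 0.818 | 0.426 | 0.938 |
| GDN | 0.217 | 0.903 | 0.252 | 0.708 | 0.529 | 0.716 |
| AnomalyBERT | 0.302 | 0.585 | 0.457 | 0.914 | 0.535 | 0.830 |
| AnomalyBERT* | 0.318 | 0.730 | 0.340 | 0.929 | 0.255 | 0.705 |
| Ours | 0.390 | 0.818 | 0.413 | 0.842 | 0.337 | 0.725 |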
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Oh, S.; Anh, L.H.; Vu, D.T.; Yu, G.H.; Hahn, M.; Kim, J. Patch-Wise-Based Self-Supervised Learning for Anomaly Detection on Multivariate Time Series Data. Mathematics 2024, 12, 3969. https://doi.org/10.3390/math12243969
