Article

Stable Diffusion-Driven Conditional Image Augmentation for Transformer Fault Detection

1 Electric Power Research Institute, Sichuan Electric Power Corporation, Chengdu 610072, China
2 Sichuan Electric Power Corporation, Chengdu 610094, China
3 School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China
* Author to whom correspondence should be addressed.
Information 2025, 16(3), 197; https://doi.org/10.3390/info16030197
Submission received: 9 February 2025 / Revised: 25 February 2025 / Accepted: 26 February 2025 / Published: 3 March 2025
(This article belongs to the Section Information Processes)

Abstract

Existing substation equipment image data augmentation models face challenges such as high dataset size requirements, difficult training processes, and insufficient condition control. This paper proposes a transformer equipment image data augmentation method based on a Stable Diffusion model. The proposed method incorporates the Low-Rank Adaptation (LoRA) concept to fine-tune the pre-trained Stable Diffusion model weights, significantly reducing training requirements while effectively integrating the essential features of transformer equipment image data. To minimize interference from complex backgrounds, the Segment Anything Model (SAM) is employed for preprocessing, thereby enhancing the quality of generated image data. The experimental results demonstrate significant improvements in evaluation metrics using the proposed method. Specifically, when implemented with the YOLOv7 model, the accuracy metric shows a 16.4 percentage point improvement compared to “Standard image transformations” (e.g., rotation and scaling) and a 2.3 percentage point improvement over DA-Fusion. Comparable improvements are observed in the SSD and Faster-RCNN object detection models. Notably, the model demonstrates advantages in reducing false-negative rates (higher Recall). The proposed approach successfully addresses key data augmentation challenges in transformer fault detection applications.

1. Introduction

With the rapid development of object detection technology, transformer equipment inspection based on this technology has gradually become a research focus. However, transformer equipment image data face challenges such as insufficient data volume and sample imbalance; in fault detection scenarios in particular, abnormal (positive) samples are scarce and difficult to collect. Image augmentation techniques aim to address these challenges. Beyond conventional pixel-level transformations (e.g., rotation and translation), image augmentation is increasingly conceptualized as an image generation task, with two primary research directions emerging: Generative Adversarial Networks (GANs) [1] and Denoising Diffusion Probabilistic Models (DDPMs) [2].
Since GAN-generated images exhibit distributional similarity to original datasets [1], the original data can be directly used for training to enhance single-class image data [3]. To address multi-class image generation, Radford et al. proposed a Deep Convolutional Generative Adversarial Network (DCGAN), which generates transitional images between data categories through interpolation techniques [4]. To achieve conditional control over generated image categories, Antoniou et al. developed the Data Augmentation Generative Adversarial Network (DAGAN), producing new images consistent with the original class [5]. Yang et al. enhanced the diversity and quality of few-class samples by jointly training a variational autoencoder (VAE) and conditional GAN [6,7]. Hong et al. expanded augmentation possibilities by fusing features from multiple images to create novel styles [8].
Ho et al. established rigorous mathematical foundations for diffusion models and proposed a Denoising Diffusion Probabilistic Model (DDPM) [2], pioneering a novel research direction for image generation. The DDPM framework approaches image generation as a data degradation reversal process, employing the U-Net architecture [9] to maintain dimensional consistency between input and output during reverse diffusion. Building upon the DDPM, Song et al. introduced the Denoising Diffusion Implicit Model (DDIM) [10], which optimizes noise scheduling in the sampling process to significantly reduce inference steps. Dhariwal et al. enhanced the U-Net in the DDPM by integrating classifier guidance [11], enabling conditionally controlled image generation with metric superiority over GANs [12]. For data augmentation applications, Trabucco et al. developed DA-Fusion [13], implementing single-class augmentation via a textual inversion technique [14], which demonstrated improved performance on multiple public datasets.
In addition to strategy optimization, Cubuk et al. proposed AutoAugment [15], a reinforcement learning-based framework that adaptively adjusts augmentation policies according to task requirements and dataset characteristics, enhancing flexibility and applicability. Subsequently, Lim et al. developed a fast search algorithm to identify optimal augmentation strategies, improving model accuracy and generalization [16]. Hataya et al. introduced MADAO [17,18], which accelerates strategy search via approximate gradient-based methods to jointly optimize convolutional neural networks and augmentation policies. Additionally, domain-specific augmentation methods (e.g., for foggy or low-light conditions) have been explored [19,20].
Although the above studies have achieved valuable research results, in transformer fault detection scenarios, GAN-based image data augmentation models exhibit challenges such as training instability and insufficient condition control. Meanwhile, DDPM-based image augmentation requires immense computational power, with training large diffusion models typically demanding hundreds of GPU days (e.g., requiring 150–1000 V100 GPU days, as reported in [12]). To address these challenges, this paper proposes a transformer image data augmentation method based on the Stable Diffusion model, integrating Low-Rank Adaptation (LoRA) [21] and the Segment Anything Model (SAM) [22]. By fine-tuning the weights of the pre-trained Stable Diffusion model, the proposed method enables transformer image data augmentation with low dataset requirements, reduced computational demands, and enhanced condition controllability. These findings significantly advance substation equipment abnormality detection technology.

2. Proposed Approach

This section describes the proposed approach in two parts: the first part details the Stable Diffusion model framework, and the second part presents the proposed enhancements to address the identified challenges in transformer equipment image augmentation.

2.1. Diffusion Models

2.1.1. Latent Diffusion Models

Latent Diffusion Models (LDMs) provide a deep learning framework for image generation and editing, achieving state-of-the-art performance in domains such as image synthesis, inpainting, and super-resolution [23]. The framework divides the image generation process into two sequential phases, a diffusion process and a reverse process, illustrated in Figure 1. In the diffusion process $q$, the original data $x_0$ is iteratively transformed into Gaussian noise $x_T$ through noise addition. In contrast, the reverse process $p_\theta$ leverages neural networks to reconstruct $x_0$ from the noisy input $x_T$ by predicting iterative denoising steps, thereby generating high-fidelity images. The architecture of LDMs is detailed in Figure 2.
First, LDMs use an autoencoder to learn a low-dimensional latent representation that captures the original image space as faithfully as possible, a process called perceptual image compression. Specifically, the autoencoder consists of an encoder $E$ and a decoder $D$. Given an image $x \in \mathbb{R}^{H \times W \times 3}$, the encoder maps the image into the latent space as $z = E(x)$, where $z \in \mathbb{R}^{h \times w \times c}$; the decoder then reconstructs the image from the latent code, $\tilde{x} = D(z) = D(E(x))$. In LDMs, the latent-space learning builds on the DDPM framework. The DDPM introduces a time-conditioned denoising autoencoder $\epsilon_\theta(x_t, t)$, $t = 1, \dots, T$, trained to predict the noise component of $x_t$ at timestep $t$. The objective function is defined as Equation (1).
$L_{DDPM} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|_2^2 \right]$ (1)
where $t$ is sampled uniformly from $\{1, \dots, T\}$.
In LDMs, a pre-trained perceptual compression model provides the latent code $z_t$ during training, allowing the optimization to be carried out in latent space. The objective function becomes Equation (2):
$L_{LDM} := \mathbb{E}_{E(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \left\| \epsilon - \epsilon_\theta(z_t, t) \right\|_2^2 \right]$ (2)
LDMs further incorporate conditioning mechanisms through a cross-attention layer [24] integrated into the U-Net backbone. This extends the denoising autoencoder to $\epsilon_\theta(z_t, t, y)$, enabling multimodal control (e.g., text-guided image generation).
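To make the objectives in Equations (1) and (2) concrete, the following minimal PyTorch sketch shows one training step of the latent noise-prediction loss. The `eps_model` and `encoder` callables, the linear beta schedule, and T = 1000 are illustrative assumptions standing in for $\epsilon_\theta$ and $E$; this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ldm_loss(eps_model, encoder, x0, T=1000):
    """Minimal sketch of the LDM noise-prediction objective (Eq. (2)).

    eps_model : network predicting the noise eps_theta(z_t, t)  (placeholder)
    encoder   : pre-trained perceptual encoder E mapping images to latents (placeholder)
    x0        : batch of images, shape (B, 3, H, W)
    """
    with torch.no_grad():
        z0 = encoder(x0)                                   # latent code z = E(x)

    B = z0.shape[0]
    t = torch.randint(0, T, (B,), device=z0.device)        # uniform timestep

    # Linear beta schedule; alpha_bar_t is the cumulative product of (1 - beta_s)
    betas = torch.linspace(1e-4, 2e-2, T, device=z0.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(B, 1, 1, 1)

    eps = torch.randn_like(z0)                             # eps ~ N(0, I)
    z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * eps   # forward diffusion q

    eps_pred = eps_model(z_t, t)                           # predict the added noise
    return F.mse_loss(eps_pred, eps)                       # || eps - eps_theta(z_t, t) ||^2
```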

2.1.2. Stable Diffusion Model

The Stable Diffusion Model enhances Latent Diffusion Models (LDMs) through engineering improvements, replacing traditional Gaussian distributions with stable distributions to boost training stability. Trained on the LAION-2B-en dataset [25] for data quality optimization, this text-to-image model requires effective text-to-latent mapping through semantic encoding. The implementation employs a Contrastive Language-Image Pre-training (CLIP) text encoder [26] to convert textual conditions into conditional representations, enabling precise generative control.
Given dataset limitations and computational constraints in transformer equipment applications, we adopt a fine-tuning strategy using the stable-diffusion-v1-5 base model. This model initializes with v1.2 weights and undergoes fine-tuning on the LAION-Aesthetics v2.5+ dataset [25] at a 512 × 512 resolution, with the text conditioning dropped for 10% of training samples to enhance controllability. While the 512 × 512 resolution is lower than standard benchmarks like PASCAL VOC [27] (500 × 500 to 1500 × 1500) and COCO [28] (640 × 480 to 5000 × 4000), our dataset’s close-up imagery (see Section 3.1, Figure 3b) ensures sufficient per-pixel information density for object detection training.
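For reference, the snippet below shows how the stable-diffusion-v1-5 weights can be loaded and sampled at 512 × 512 with the Hugging Face diffusers library. The repository id, prompt, and sampler settings are illustrative; the paper does not specify its inference code.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative: the publicly distributed stable-diffusion-v1-5 checkpoint on the Hugging Face Hub.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="distribution transformer and substations",
    height=512, width=512,            # native training resolution of the base model
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("generated_transformer.png")
```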
In summary, the stable-diffusion-v1-5 pre-trained weights enable high-quality image generation through precise textual conditioning. However, the LAION-5B-en dataset exhibits conceptual bias issues (as illustrated in Figure 3, with prompt details provided in Section 3.2.2), resulting in either generated images lacking target dataset characteristics or significant deviations from expected outputs, thus rendering them unsuitable for training object detection models. Furthermore, transformer equipment images exhibit high scene complexity, which increases annotation difficulty and reduces augmentation effectiveness. Our methodology presented in subsequent sections specifically targets these challenges.

2.2. Proposed Stable Diffusion-Based Image Augmentation of Transformer Equipment

2.2.1. Structure of the Model

The stable-diffusion-v1-5 base model exhibits inherent conceptual biases, leading to two critical limitations: generated images either lack target dataset characteristics or deviate significantly from expected outputs, rendering them unsuitable for training object detection models. Additionally, transformer equipment image data are characterized by high scene complexity, which adversely impacts image augmentation effectiveness.
To overcome these challenges, this study enhances the stable-diffusion-v1-5 framework through integration of Low-Rank Adaptation (LoRA) and the Segment Anything Model (SAM). As illustrated in Figure 4, the proposed architecture combines LoRA-based fine-tuning with SAM-driven background segmentation to address dataset bias and scene complexity simultaneously.

2.2.2. Low-Rank Adaptation of Stable Diffusion Model

To address the conceptual bias in the stable-diffusion-v1-5 base model, this paper introduces Low-Rank Adaptation (LoRA) to incorporate dataset-specific image features into the pre-trained weights. LoRA achieves fine-tuning through low-rank matrix updates on the original parameters, improving task-specific performance while reducing computational resource requirements. Compared to training diffusion models from scratch, this approach significantly reduces training costs while maintaining model stability. The principle is illustrated in Figure 5.
Specifically, let the parameter matrix of the pre-trained model be $W_0 \in \mathbb{R}^{d \times k}$. LoRA achieves fine-tuning by adding a low-rank update $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. The updated weight can be expressed as Equation (3).
$W = W_0 + \Delta W = W_0 + BA$ (3)
Therefore, for the original computation $h = W_0 x$, the forward pass is updated as Equation (4).
$h = (W_0 + BA)x$ (4)
During training, the reduced rank constraint $r \ll \min(d, k)$ significantly decreases the number of trainable parameters. At inference, the fine-tuned parameters $W = W_0 + BA$ are obtained by merging the pre-trained model weights with the LoRA parameters. This merging process maintains computational equivalence with the base model, transitioning from $h = W_0 x$ to $h = Wx$ without introducing additional latency. The LoRA-adapted model thus achieves inference speeds comparable to the original architecture, providing critical advantages for industrial deployment scenarios.
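The sketch below illustrates Equations (3) and (4) with a hypothetical `LoRALinear` wrapper around a frozen linear layer; in practice LoRA is applied to the attention projections of the Stable Diffusion U-Net, and the `alpha / r` scaling is a common convention not discussed in the paper.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W0 plus a trainable low-rank update BA (Eqs. (3)-(4))."""

    def __init__(self, base: nn.Linear, r: int = 64, alpha: float = 64.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)         # W0 stays frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) / math.sqrt(r))
        self.B = nn.Parameter(torch.zeros(d, r))       # BA = 0 at the start of training
        self.scale = alpha / r

    def forward(self, x):
        # h = W0 x + B A x   (Eq. (4))
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

    @torch.no_grad()
    def merge(self):
        # W = W0 + BA: fold the update into the base weight for zero-overhead inference
        self.base.weight += self.scale * (self.B @ self.A)
```

Calling `merge()` once after training folds $BA$ into $W_0$, which is why LoRA inference incurs no extra latency compared with the base model.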

2.2.3. Introducing the SAM Image Segmentation Module

To enhance the LoRA model’s concept-specific focus, this study implements a scene simplification strategy by removing irrelevant elements from the image. The Segment Anything Model (SAM) is employed to identify and extract primary image components through its four-stage architecture (input layer, encoder, segmenter, and decoder), as detailed in Figure 6.
Specifically, the input layer receives various data types, including text, code, or structured data. The encoder adopts the transformer architecture [29], converting input data into vector representations via the self-attention mechanism to capture long-range dependencies, thereby providing rich contextual information for downstream processing. The segmenter utilizes the segmenter architecture [30] to partition input data according to encoded vector representations and produce segmentation masks. The decoder subsequently reconstructs the processed data to generate the final output. Through this pipeline, the SAM effectively isolates image subjects from backgrounds or irrelevant elements, facilitating data preparation for LoRA model optimization.
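As an illustration of this preprocessing step, the snippet below uses the official segment-anything package to extract a foreground mask from a single point prompt and blank out the background. The checkpoint path and the centre-point prompt are placeholders; the paper does not describe its exact prompting strategy.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder checkpoint path; the ViT-H weights are distributed by the SAM authors.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("transformer.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground click near the equipment; in practice the prompt would be chosen per image.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[image.shape[1] // 2, image.shape[0] // 2]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best = masks[np.argmax(scores)]                 # boolean mask of the selected subject
isolated = (image * best[..., None]).astype(np.uint8)   # zero out background pixels
cv2.imwrite("transformer_masked.jpg", cv2.cvtColor(isolated, cv2.COLOR_RGB2BGR))
```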
In this section, we introduced the LoRA and SAM modules into the Stable Diffusion model to address the concept bias issue present in the stable-diffusion-v1-5 base model and to enhance the LoRA model’s focus on specific concepts. At the same time, we ensured that the model requires a low amount of data (only 120 images, as detailed in Section 3.1) and low computational power for training (using a single Colorful iGame GeForce RTX 4060 Ti Ultra W OC 16GB (manufacturer: Colorful Technology Development Co., Ltd., Shenzhen, China), as described in Section 3.2.1). After the model construction, the subsequent sections present our experimental data and the series of experiments conducted to evaluate the proposed model.

3. Experiment Setup

3.1. Dataset

The dataset comprises field-collected close-up imagery from a Sichuan Province power company’s substation inspection system. It includes 100 normal images covering the transformer components and 20 oil leakage images, totaling 120 samples with varied resolutions. These field data exhibit two limitations: (1) limited sample size and (2) severe class imbalance, with normal samples exceeding abnormal cases by a 5:1 ratio. Representative samples are shown in Figure 7, with detailed component-wise image distribution provided in Table 1.

3.1.1. Data Preprocessing

The dataset underwent preprocessing operations including Gaussian noise addition, 180° image flipping, and HSV-based brightness/saturation adjustments. Specifically, it underwent the following:
  • Noise addition injects Gaussian-distributed pixel variations;
  • Image flipping rotates samples by 180° along the vertical axis;
  • HSV color space operations modify brightness and saturation levels.
The original dataset was augmented to 1280 images through these transformations. To address resolution variance, all images were cropped and resized to 768 × 512 pixels. The dataset was then partitioned into training, test, and validation subsets in a 7:2:1 ratio prior to network training to enhance model generalization.
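The three transformations and the 768 × 512 resizing can be sketched with OpenCV/NumPy as below; the noise sigma and HSV scaling factors are illustrative assumptions, and the flip is interpreted as a mirror about the vertical axis.

```python
import cv2
import numpy as np

def augment(img: np.ndarray, sigma: float = 10.0) -> list[np.ndarray]:
    """Apply the three pixel-level transformations from Section 3.1.1 to one BGR image."""
    out = []

    # 1. Gaussian noise addition (sigma is an illustrative value)
    noise = np.random.normal(0.0, sigma, img.shape).astype(np.float32)
    out.append(np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8))

    # 2. Flip about the vertical axis (the paper's 180-degree image flipping)
    out.append(cv2.flip(img, 1))

    # 3. HSV brightness/saturation adjustment (scaling factors are illustrative)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] = np.clip(hsv[..., 1] * 1.2, 0, 255)   # saturation channel
    hsv[..., 2] = np.clip(hsv[..., 2] * 0.9, 0, 255)   # value (brightness) channel
    out.append(cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR))

    # Resize every variant to the common 768 x 512 resolution used for training
    return [cv2.resize(a, (768, 512)) for a in out]
```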

3.1.2. Data Labeling

To train LoRA, we annotated 120 raw image samples, with each image corresponding to a separate text annotation file. During the annotation process, we employed two types of keywords: one type aimed at guiding LoRA to learn general features of the images, used for generating transformer component images, and the other type aimed at guiding LoRA to learn abnormal features, used for generating anomaly images. The specific keywords are presented in Table 2, and annotation examples are shown in Table 3.
In this paper, the LabelImg image labeling tool [31] was used to manually label the oil leakage data images of each part of the transformer in detail. A total of 120 raw images and the data images generated by our data augmentation model were processed. Each image has an XML file corresponding to its name, which records the name of the labeling target and the coordinate information of the upper-left and lower-right corners of the target. After the raw data are labeled, the labeling results are processed by coordinate transformation to obtain the labels of the expanded data.
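The coordinate transformation of the LabelImg annotations can be sketched as follows. The function assumes the standard Pascal VOC XML tags written by LabelImg and a plain resize to 768 × 512 (cropping would require an additional offset); `rescale_voc_labels` is a hypothetical helper, not the authors' code.

```python
import xml.etree.ElementTree as ET

def rescale_voc_labels(xml_path: str, new_w: int = 768, new_h: int = 512) -> list[dict]:
    """Read a LabelImg (Pascal VOC) XML file and scale box coordinates to a resized image."""
    root = ET.parse(xml_path).getroot()
    old_w = int(root.find("size/width").text)
    old_h = int(root.find("size/height").text)
    sx, sy = new_w / old_w, new_h / old_h

    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append({
            "name": obj.find("name").text,
            "xmin": int(int(bb.find("xmin").text) * sx),   # upper-left corner
            "ymin": int(int(bb.find("ymin").text) * sy),
            "xmax": int(int(bb.find("xmax").text) * sx),   # lower-right corner
            "ymax": int(int(bb.find("ymax").text) * sy),
        })
    return boxes
```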

3.2. Experimental Environment

3.2.1. Hardware and Software Environment

In order to train the model and perform the comparison experiments, the code for the model was implemented in this study using the Python programming language. For the construction of the neural network, PyTorch and TorchVision were chosen as the main frameworks, and CUDA technology was utilized to accelerate the training and inference process. The software environment used for the experiments includes the Windows 11 operating system, PyTorch version 2.1.0, TorchVision version 0.16.0, CUDA version 12.1, and Python version 3.10.11. The main hardware environment for the experiments is shown in Table 4.

3.2.2. Training Parameters

The experimental training in this paper is divided into two parts. The first is LoRA fine-tuning of the Stable Diffusion base model, where a cosine annealing learning rate schedule is used to reduce the probability of the model converging to a poor local optimum. The second is the training of the SSD [32], Faster-RCNN [33], and YOLOv7 [34] object detection models. Detailed hyperparameters for both training parts are documented in Table 5.
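For reference, cosine annealing between the maximum and minimum learning rates of Table 5 can be configured in PyTorch as below; the optimizer arguments and the dummy parameter are placeholders.

```python
import torch

# Placeholder parameters; schedule values follow the object detection step in Table 5.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=1e-2, momentum=0.9)   # momentum is a placeholder

# Anneal from max lr (1e-2) down to min lr (1e-4) over 100 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-4)

for epoch in range(100):
    # ... forward/backward pass for one training epoch ...
    optimizer.step()
    scheduler.step()
```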
In order to perform conditional data augmentation on transformer equipment image data, specific positive text prompts were used for the Stable Diffusion model in the experiments. A set of negative prompts was also introduced in this study; their purpose is to explicitly instruct the Stable Diffusion model to avoid generating particular image types or low-quality images. Detailed specifications are provided in Table 6.
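Reusing the `pipe` object from the earlier loading sketch, positive and negative prompts of the kind listed in Table 6 are passed to the pipeline as shown below; the prompt strings follow Table 6, while the sampler settings remain illustrative.

```python
# Conditional generation with a positive prompt and the negative prompt from Table 6.
image = pipe(
    prompt="byqsb, distribution transformer, substations, and cooler",
    negative_prompt=(
        "paintings, worst quality, low quality, normal quality, lowres, "
        "watermark, monochrome, grayscale, and blurry"
    ),
    height=512, width=512,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
```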

3.3. Evaluation Metrics

To evaluate object detection performance, this study employs three metrics: Precision (P), Recall (R), and mean Average Precision (mAP) under varying Intersection over Union (IoU) thresholds. Specifically, mAP@0.5 (IoU > 0.5) and mAP@0.5:0.95 (IoU thresholds from 0.5 to 0.95) are adopted as evaluation metrics.
The formulas for the selected evaluation metrics are given below:
$IoU = \dfrac{|A \cap B|}{|A \cup B|}$
$P = \dfrac{TP}{TP + FP}$
$R = \dfrac{TP}{TP + FN}$
$AP = \int_0^1 P(r)\, dr$
$mAP = \dfrac{1}{n} \sum_{i=1}^{n} AP_i$
where $A$ is the ground-truth box, $B$ is the predicted box, $TP$ is the number of positive cases correctly judged as positive, $FP$ is the number of negative cases incorrectly judged as positive, $FN$ is the number of positive cases incorrectly judged as negative, $AP$ is the area under the P-R curve, and $mAP$ is the mean of $AP$ over all $n$ classes.
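For completeness, the box-level IoU and the Precision/Recall counts above can be computed as in the following sketch; the box coordinates and detection counts are illustrative.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    """Precision and Recall from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

# Example: a prediction counts as a TP when its IoU with a ground-truth box exceeds 0.5.
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))   # ~0.39
print(precision_recall(tp=90, fp=10, fn=8))      # (0.9, ~0.918)
```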

4. Experimental Results and Analysis

4.1. Ablation Experiments

To evaluate the effectiveness of the integrated LoRA and SAM modules on image generation quality, we conducted ablation experiments comparing three configurations: (1) the unmodified stable-diffusion-v1-5 base model, (2) the LoRA-enhanced variant, and (3) the LoRA+SAM integrated architecture. This comparative analysis systematically isolates the contribution of each module to synthetic data quality improvement.
As shown in Figure 8, we generated images with the addition of the LoRA and SAM modules. Based on the generated results, we draw the following conclusions:
  • There is a significant difference between the data generated by the base model and the actual data. This leads to unsatisfactory performance across various metrics, reflecting the limitations of the base model in augmenting substation equipment image data.
  • After introducing the LoRA module, we observe a significant improvement in model performance. This indicates that LoRA fine tuning can effectively adjust the model, making it better suited for the generation of substation image data. However, due to various interference factors in the augmented data, the performance metrics have not yet reached an ideal level.
  • Further introducing the SAM module results in an ideal improvement in the model’s performance metrics. This shows that SAM can effectively reduce interference factors in the original image data, thus enhancing the image generation effect of the LoRA module.
The generated images were annotated (as described in Section 3.1.2) and trained under YOLOv7. The evaluation metrics after the addition of the modules were compared, with the results presented in Table 7.
From the data in Table 7, it is evident that the inclusion of the LoRA and SAM modules led to a significant improvement in the model’s performance across the evaluation metrics. Notably, there was an enhancement in the Recall metric, which typically indicates that, after data augmentation with the proposed model, the object detection model is better able to reduce false negatives. This improvement is of particular significance in the field of transformer anomaly detection.
Overall, the ablation experiment results fully validate the effectiveness of the modules introduced in this study in enhancing the object detection model’s ability to handle augmented data. By gradually incorporating the LoRA and SAM modules, we are able to significantly improve the Stable Diffusion model’s performance in a data augmentation environment.

4.2. Comparison Experiments

By applying the proposed model in object detection models such as SSD, Faster-RCNN, and YOLOv7, we compare and analyze the changes in the performance metrics of each model before and after the use of this method. Two data augmentation strategies are adopted in this paper:
  • Improving dataset balance: by supplementing the image data of each part of the normal transformer, the balance between normal and abnormal data in the dataset is improved.
  • Improving dataset diversity: taking weather changes as an example, image data of substation equipment under sunny and rainy conditions were generated to increase the diversity of the dataset. The generated augmented images were labeled and added to the training, validation, and test datasets in a ratio of 7:2:1, respectively.
LoRA fine-tuning introduces the dataset-specific features while leaving the base model's pre-trained knowledge unchanged, thus enabling the generation of high-quality image data under different weather conditions. This is evident in Figure 9, where the method effectively generates the required image data even though the original dataset contains no rainy-day images.
Through the data augmentation strategy described above, the proposed model was used to augment the original dataset. Conditional controls were applied to ensure a roughly balanced number of samples for each class, while incorporating both rainy and sunny conditions in the simulation. After augmentation, the dataset included 580 images under normal conditions, 290 images under sunny conditions, and 290 images under rainy conditions, in addition to 120 original images, resulting in a total of 1280 images. This was consistent with the number of images produced by conventional image augmentation methods (as detailed in Section 3.1.1). Since comparative models such as DAGAN [5] and AutoAugment [15] cannot perform condition-guided image augmentation, the dataset was expanded to 1280 images by augmenting each image multiple times. DA-Fusion [13] is capable of performing single-class condition augmentation, so the dataset was balanced in terms of class distribution using single-class conditional controls, and the total number of images was expanded to 1280 for the experiments. Examples of the expanded results for each data augmentation method are shown in Figure 10.
After data augmentation, we trained several commonly used object detection models, including SSD, Faster-RCNN, and YOLOv7. Some YOLOv7 detection results on real and generated images are shown in Figure 11. The results indicate a significant improvement in performance metrics across all models (as shown in Figure 12 and Table 8). Among them, the YOLOv7 model stands out with a Precision of 0.916 and a Recall of 0.926. Notably, the proposed model outperforms DA-Fusion, a data augmentation method based on the Stable Diffusion model, in terms of Recall, suggesting a greater advantage in reducing false negatives. However, DA-Fusion excels in Precision. This may be attributed to the higher similarity between DA-Fusion's augmented data and the original data (as shown in Figure 10). Additionally, DA-Fusion can only generate single-class images, and when negative samples are more prevalent in the original dataset, the number of negative samples may exceed that of positive samples. In such cases, the model tends to predict more negative samples to minimize the misclassification of positive samples, thus improving Precision. In contrast, the proposed model generates more diverse and representative data, encompassing various objects and their appearance under different weather conditions, leading to improved Recall.
In the context of transformer anomaly detection, Recall is crucial because failing to detect abnormal samples could lead to severe practical consequences. Although the proposed model may incur a certain risk of false positives compared to DA-Fusion, it significantly reduces the likelihood of missed detections. Therefore, the proposed model better meets the practical needs of transformer anomaly detection, demonstrating greater application value.

5. Conclusions

In this study, an innovative data augmentation approach is proposed to address the key challenges faced by existing data augmentation models when dealing with substation image data: the dependence on a large amount of negative sample data, the difficulty of training, and the lack of conditional control. Starting from the open-source pre-trained Stable Diffusion weights, the base model is fine-tuned using the Low-Rank Adaptation (LoRA) approach, and the dataset is pre-segmented using a general-purpose segmentation module, the Segment Anything Model (SAM). The core advantage of this method is its ability to use limited image data (120 images in this study) to generate high-quality images of oil leakage from transformer equipment under different conditions.
To validate the effectiveness of the proposed method, this study conducted ablation and comparative experiments using transformer oil leakage image data. The ablation study confirmed the effectiveness of the LoRA and SAM modules. Furthermore, the augmented image data were applied to the training of three object detection models: SSD, Faster-RCNN, and YOLOv7. The performance of these models was compared with four existing data augmentation methods—standard image transformation augmentation, DAGAN, DA-Fusion, and AutoAugment. The experimental results demonstrated a significant improvement in various performance metrics, with particular advantages in Recall. Although the proposed model may introduce a slight increase in false positives compared to DA-Fusion, it effectively reduces the probability of missed detections. Thus, the model proposed in this study better aligns with the practical needs of transformer anomaly detection and holds higher application value.

Author Contributions

Conceptualization, Y.J. and W.L.; methodology, J.H.; validation, R.L., Y.F. and Y.Z.; formal analysis, R.L., Y.Z. and Y.F.; writing—original draft preparation, W.L., J.W. and Y.J.; writing—review and editing, R.L., W.L. and J.H.; visualization, J.W. and R.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by State Grid Sichuan Electric Power Company Science and Technology Program, grant number 521997230014.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Substation instrumentation image data are not available due to privacy.

Conflicts of Interest

Authors Wenlong Liao, Yiping Jiang, Rui Liu, Yun Feng, and Yu Zhang were employed by Sichuan Electric Power Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Farley, D.-W.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  2. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  3. Sun, S.; Fan, J.; Sun, Z.; Qu, J.H.; Dai, T.T. A Survey of Image Data Augmentation Based on Deep Learning. Comput. Sci. 2024, 51, 150–167. [Google Scholar]
  4. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  5. Antoniou, A.; Storkey, A.; Edwards, H. Data augmentation generative adversarial networks. arXiv 2017, arXiv:1711.04340. [Google Scholar]
  6. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  7. Yang, H.; Zhou, Y. Ida-gan: A novel imbalanced data augmentation gan. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 8299–8305. [Google Scholar]
  8. Hong, M.; Choi, J.; Kim, G. Stylemix: Separating content and style for enhanced data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14862–14870. [Google Scholar]
  9. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings Part III 18. Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  10. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
  11. Ho, J.; Salimans, T. Classifier-free diffusion guidance. arXiv 2022, arXiv:2207.12598. [Google Scholar]
  12. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
  13. Trabucco, B.; Doherty, K.; Gurinas, M.A.; Salakhutdinov, R. Effective Data Augmentation with Diffusion Models. arXiv 2023, arXiv:2302.07944. [Google Scholar]
  14. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. arXiv 2022, arXiv:2208.01618. [Google Scholar]
  15. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 113–123. [Google Scholar]
  16. Lim, S.; Kim, I.; Kim, T.; Kim, C.; Kim, S. Fast AutoAugment. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 6665–6675. [Google Scholar]
  17. Hataya, R.; Zdenek, J.; Yoshizoe, K.; Nakayama, H. Faster autoaugment: Learning augmentation strategies using backpropagation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXV 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 1–16. [Google Scholar]
  18. Hataya, R.; Zdenek, J.; Yoshizoe, K.; Nakayama, H. Meta approach to data augmentation optimization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2574–2583. [Google Scholar]
  19. Guo, Y.; Liang, R.; Wang, R. Cross-Domain Adaptive Object Detection in Foggy Weather Based on CNN Image Augmentation. Comput. Eng. Appl. 2023, 59, 187–195. [Google Scholar]
  20. Zu, J.; Zhou, Y.; Chen, L. Low-Light Image Augmentation Combining Attention Mechanism and Dual-Branch Residual Network. Comput. Appl. 2023, 43, 1240–1247. [Google Scholar]
  21. Hu, E.J.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  22. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  23. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  24. Lin, H.; Cheng, X.; Wu, X.; Shen, D. Cat: Cross attention in vision transformer. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar]
  25. Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Adv. Neural Inf. Process. Syst. 2022, 35, 25278–25294. [Google Scholar]
  26. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  27. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  28. Lin, T.-Y.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar]
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  30. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7262–7272. [Google Scholar]
  31. Tzutalin. LabelImg. Git Code 2015. Available online: https://github.com/tzutalin/labelImg (accessed on 1 December 2024).
  32. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  33. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 9. [Google Scholar] [CrossRef] [PubMed]
  34. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
Figure 1. Diffusion and reverse processes.
Figure 2. Schematic diagram of the latent diffusion model.
Figure 3. Stable Diffusion-generated transformer images (a) and images in our dataset (b).
Figure 4. Proposed model structure.
Figure 5. LoRA schematic diagram.
Figure 6. SAM schematic diagram.
Figure 7. Some dataset images.
Figure 8. Generated transformer equipment images in different stages.
Figure 9. Generated image data under different weather conditions.
Figure 10. Examples of the expanded results for each data augmentation method.
Figure 11. Some YOLOv7 detection results on real (a) and generated (b) images.
Figure 12. Comparison of various evaluation metrics.
Table 1. Quantity of image data.
Part | Normal Data | Oil Leakage Data
Oil conservator | 30 | 6
Cooling system components | 35 | 7
Pebble bed | 35 | 7
Table 2. Keywords used in LoRA training data labeling.
Prompt Type | Keywords
transformer components | byqsb, transformer, oil conservator/cooler/pebble bed
types of anomalies | ground oil/surface oil/greasy patch
Table 3. Annotation examples.
Example image 1 | byqsb, no humans, greasy patch, pebble floor, pebble bed, ground oil, surface oil, industrial pipe, cable, scenery, realistic, and machine
Example image 2 | byqsb, cooler, no humans, outdoors, pebble floor, cable, industrial pipe, realistic, machinery, greasy patch, ground oil, and surface oil
Example image 3 | byqsb, oil conservator, no humans, sky, day, greasy patch, surface oil, scenery, outdoors, cloud, tree, and realistic
Example image 4 | byqsb, cooler, no humans, scenery, outdoors, road, street, sign, sky, day, shadow, building, blue sky, real-world location, machinery, realistic, surface oil, and greasy patch
Table 4. Experimental hardware environment.
Component | Specification | Quantity
CPU | Intel Core i5-12400F | 1
Memory | 32 GB | 1
Graphics Card | Nvidia RTX 4060 Ti 16G | 1
Table 5. Training parameter values.
Step | Parameter | Value
LoRA | lr | 5 × 10^-5
LoRA | lr scheduler | cosine annealing
LoRA | optimizer | AdamW8bit
LoRA | epoch | 200
LoRA | batch size | 2
LoRA | network dimension | 64
Object detection | optimizer | SGD
Object detection | min lr | 1 × 10^-4
Object detection | max lr | 1 × 10^-2
Object detection | epoch | 100
Object detection | batch size | 8
Table 6. Model prompt words.
Stage | Positive Prompt | Negative Prompt
Base Model | distribution transformer and substations | paintings, worst quality, low quality, normal quality, lowres, watermark, monochrome, grayscale, and blurry
LoRA | byqsb, distribution transformer, substations, and cooler/oil conservator/greasy patch | (same as above)
LoRA+SAM | (same as LoRA) | (same as above)
Table 7. Ablation experiment data in YOLOv7.
Stage | P | R | mAP@0.5 | mAP@0.5:0.95
Base Model | 0.598 | 0.621 | 0.608 | 0.471
+LoRA | 0.761 | 0.794 | 0.702 | 0.658
+LoRA+SAM | 0.916 | 0.926 | 0.966 | 0.711
Table 8. Performance metrics of different object detection models before and after data augmentation (bold values represent the highest scores in each metric category).
Method | Object Detection Model | P | R | mAP@0.5 | mAP@0.5:0.95
Standard Image Transformation | SSD | 0.617 | 0.531 | 0.456 | 0.408
Standard Image Transformation | Faster-RCNN | 0.635 | 0.568 | 0.648 | 0.454
Standard Image Transformation | YOLOv7 | 0.752 | 0.602 | 0.811 | 0.543
DAGAN [5] | SSD | 0.672 | 0.861 | 0.516 | 0.445
DAGAN [5] | Faster-RCNN | 0.656 | 0.587 | 0.456 | 0.431
DAGAN [5] | YOLOv7 | 0.812 | 0.470 | 0.897 | 0.613
DA-Fusion [13] | SSD | 0.821 | 0.674 | 0.619 | 0.533
DA-Fusion [13] | Faster-RCNN | 0.757 | 0.643 | 0.624 | 0.549
DA-Fusion [13] | YOLOv7 | 0.893 | 0.840 | 0.972 | 0.699
AutoAugment [15] | SSD | 0.687 | 0.619 | 0.804 | 0.462
AutoAugment [15] | Faster-RCNN | 0.652 | 0.651 | 0.585 | 0.525
AutoAugment [15] | YOLOv7 | 0.891 | 0.910 | 0.916 | 0.648
Ours | SSD | 0.760 | 0.854 | 0.698 | 0.481
Ours | Faster-RCNN | 0.674 | 0.831 | 0.810 | 0.462
Ours | YOLOv7 | 0.916 | 0.926 | 0.966 | 0.711
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
