Article

Applications of the FusionScratchNet Algorithm Based on Convolutional Neural Networks and Transformer Models in the Detection of Cell Phone Screen Scratches

1 Business College, Southwest University, Chongqing 402460, China
2 College of Letters & Science, University of Wisconsin-Madison, Madison, WI 53706, USA
3 School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan 430079, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(1), 134; https://doi.org/10.3390/electronics14010134
Submission received: 30 November 2024 / Revised: 28 December 2024 / Accepted: 30 December 2024 / Published: 31 December 2024

Abstract

Screen defect detection has become a crucial research domain, propelled by the growing necessity of precise and effective quality control in mobile device production. This study presents FusionScratchNet (FS-Net), a novel algorithm developed to overcome the challenges of noise interference and to characterize indistinct defects and subtle scratches on mobile phone screens. By integrating the transformer and convolutional neural network (CNN) architectures, FS-Net effectively captures both global and local features, thereby enhancing feature representation. The global–local feature integrator (GLFI) module effectively fuses global and local information through unique channel splitting, feature dependency characterization, and attention mechanisms, thereby enhancing target features and suppressing noise. The bridge attention (BA) module calculates an attention feature map based on the multi-layer fused features, precisely focusing on scratch characteristics and recovering details lost during downsampling. Evaluations using the PKU-Market-Phone dataset demonstrated an overall accuracy of 98.04%, an F1-score of 88.03%, and an extended intersection over union (EIoU) of 65.13%. In comparison to established methods like you only look once (YOLO) and retina network (RetinaNet), FS-Net demonstrated enhanced detection accuracy, computational efficiency, and resilience against noise. The experimental results demonstrated that the proposed method effectively enhances the accuracy of scratch segmentation.

1. Introduction

The screen surface condition, as a crucial element of mobile phones, directly influences the product quality [1,2]. Therefore, inspecting the screen surface before a phone is released is of significant importance. At present, manual visual inspection is the predominant technique for defect detection. However, this approach is extremely time-consuming, and the results may vary not only between different inspectors but also for the same inspector under different conditions [3], making it inadequate for the efficiency and precision required in industrial production.
Scratches are subtle features on the surface of objects, varying in length, direction, and depth. The natural texture or patterns of the product surface frequently obscure scratches, complicating precise feature extraction. In addressing these challenges, numerous researchers have investigated diverse methodologies in the domain of scratch visual detection. An image classification approach that utilizes a sliding window technique for sub-region sampling within the original image was proposed by Weimer et al. [4]. As classifiers, CNNs serve to distinguish between defective and non-defective sub-regions. However, the accuracy of this approach is affected by the size of the image window. Recent studies have shown that combining CNNs with transformers allows for more accurate object detection, which has practical uses such as mobile scratch detection. A new lightweight architecture, which surpasses lightweight CNNs and transformer models while achieving a balance between accuracy and computational efficiency, was proposed by Maaz et al. [5]. Detecting minor scratches on mobile surfaces is essential, and this hybrid method effectively integrates global and local feature extraction capabilities. A hybrid method that integrates CNNs and transformer models for real-time object detection was devised by Pan et al. [6]. This model integrates multi-scale feature processing and uncertainty reduction methods to enhance the detection speed and precision, rendering it appropriate for scratch detection in dynamic mobile settings. The you only look once version 5 (YOLOv5) model was optimized by Zhao et al. [7] to enable it to focus more precisely on the detection of surface scratches on mobile phone screens as a specific target. Modifications to the model encompass feature layer optimization and a streamlined design, both essential for mobile device applications. The faster region-based convolutional neural network (Faster R-CNN), developed by Ren et al. [8], presents training challenges and necessitates considerable annotation efforts and data, akin to semantic segmentation networks.
To address the challenges in detecting surface screen defects in electronic products, the multi-layer residual network was integrated into the CNN architecture by Ming et al. [9]. However, some false detection problems still remained. Currently, the mainstream edge detection algorithms, such as the Laplacian, Canny [10,11], Sobel and Prewitt [12,13], are commonly used to detect scratches. While these algorithms perform well on specific scratch images, they struggle to extract edge features with complex surface texture or low scratch contrast, leading to false positives or missed detections. The Kokaram algorithm [14] is one of the commonly used methods for scratch detection, which constructs a cosine distribution of scratch brightness decay. It utilizes median filtering and the Hough transform for selection, followed by Gibbs sampling to derive the scratch skeleton for authenticity assessment. This method is time-consuming and vulnerable to noise interference.
In recent years, the field of image segmentation has seen the emergence of many innovative algorithms integrating transformer and CNN architectures. The connectionist temporal classification network (CTC-Net), which combines a CNN encoder, a transformer encoder, fuzzy C-means (FCM), and a transformer decoder, was proposed by Yuan et al. [15]. The fuzzy C-means in CTC-Net combines the features from both domains, enhancing the network’s representation ability and achieving good results in medical image segmentation. The modified convolution and transformer hybrid encoder–decoder network (MCV-Unet), a hybrid network with atrous CNN layers and ViT layers, along with skip connections, was introduced by Xu and Wang [16], which performs well in ultrasound image semantic segmentation, outperforming 17 baseline methods. A network that combines a CNN and vision transformer with skip connections (CViTS-Net), which includes an MSF block and a DGL block, along with novel skip connections, was developed by Kanadaath et al. [17]. These algorithms contribute to the advancement of image segmentation technology.
Furthermore, TransBridge is a lightweight algorithm developed for image segmentation, utilizing the advantages of both the CNN and transformer architectures to improve the segmentation efficacy [18]. This combined architecture demonstrates significant potential in image segmentation [19]. Inspired by these algorithmic structures [3], this paper presents the FS-Net algorithm, which brings several notable innovations. It combines the transformer and CNN networks, where the CNNs extract detailed local features like the screen texture and the transformer captures the global context, enhancing the detection of scratches by considering both local details and global relationships. Effective fusion of global and local information is achieved by splitting channels, characterizing feature dependencies, and using an attention mechanism for feature enhancement and noise suppression through the GLFI module. Attention is calculated based on multi-layer fused features by the BA module, precisely focusing on scratch characteristics and recovering details lost during downsampling for improved precision. To further improve the model’s ability to detect fine scratches, this study employs targeted strategies for data preprocessing and noise suppression. This guarantees consistent performance even when dealing with noisy or faint scratch features, which is essential for accurate scratch detection on mobile screens.
The main innovations of this paper are as follows:
  • Proposal of a detection architecture that combines the transformer and CNN networks to effectively capture scratches on the surface of mobile phone screens.
  • Proposal of the GLFI module, which is designed to facilitate the effective fusion of two branch features through fine-grained interactions for improving the detection accuracy.
  • Proposal of a detection algorithm that combines the transformer and CNN networks to calculate attention based on multi-layer fusion features through the BA module to improve the detection accuracy.

2. FS-Net

The model is composed of the following four components: the CNN branch, the transformer branch, the GLFI module, and the BA module [20]. Feature extraction is carried out by the CNN and transformer branches. The features from these two branches are then merged through the GLFI module, and attention is calculated based on the multi-scale fused features by the BA module.
The model effectively leverages the advantages of both the transformer and CNN architectures and employs a convolutional network to gradually increase the receptive field, obtaining low-level contextual information from the input image [21]. The transformer branch, in contrast, captures the global contextual information. The transformer processes the input as a one-dimensional sequence and focuses solely on modeling global information; however, it lacks detailed low-level features and positional information, which makes it difficult to directly upsample its output to the original resolution for accurate reconstruction, resulting in imprecise segmentation contours. Conversely, CNNs excel at extracting low-level features, which helps compensate for the transformer's limitations. This paper employs the GLFI module to merge the features extracted from both branches [22].
When processing the input features, the GLFI module first performs a channel-splitting operation, dividing the input feature map $X \in \mathbb{R}^{C \times H \times W}$ into $G$ groups along the channel dimension, so that each group of features becomes $X_q \in \mathbb{R}^{\frac{C}{G} \times H \times W}$. The process is shown in the GLFI module in Figure 1. Then, these sub-feature groups are processed in parallel along the channel dimension. During the training process, each sub-feature $X_q$ gradually develops a semantic response. For each sub-feature group, the GLFI module uses a permutation unit to meticulously describe the dependency relationships of features in both the spatial and channel dimensions. To determine information such as the shape of the scratch in the spatial dimension, specific calculations capture the positional associations between features. The interactions of features across various channels are examined to assess the importance of information within each channel. After all the sub-features are aggregated, the channel shuffle operation enhances the information interaction between different sub-features, realizing the integration and refinement of local and global information. This allows the network to understand the image at multiple scales and to accurately extract scratch features, handling various complex scratch situations and interference factors. The process is further complemented and perfected by the BA module. By bridging the three layers of the fused features to calculate attention, it effectively captures the details of the mobile screen edges that are easily lost during downsampling, ultimately obtaining finer feature information and providing strong support for the entire network to accurately identify and locate scratches. The structure of the FS-Net is illustrated in Figure 1.
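For illustration only, the following minimal PyTorch sketch shows the channel-split and channel-shuffle operations described above; the function names and the (B, C, H, W) tensor layout are assumptions rather than the authors' released implementation.

```python
import torch

def channel_split(x: torch.Tensor, groups: int):
    """Split a (B, C, H, W) feature map into G sub-feature groups of C/G channels each."""
    b, c, h, w = x.shape
    assert c % groups == 0, "C must be divisible by G"
    return x.view(b, groups, c // groups, h, w).unbind(dim=1)

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so information flows between sub-features."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()   # swap group and channel axes
    return x.view(b, c, h, w)
```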

2.1. CNN Branch

The features from training samples can be automatically summarized by CNNs without the need for complex manual feature extraction operations. As the depth of the convolutional layers increases, training the network becomes progressively more difficult. In deep networks, the weight updates become smaller, causing a degradation effect in which performance is worse than that of shallower networks. Additionally, when the number of layers reaches a certain level, issues such as gradient explosion or gradient vanishing may occur [23]. To address these issues, the CNN branch selects Residual Network 50 (ResNet50) [24] as the backbone network, whose partial network structure and residual connection block structure are illustrated in Figure 2.
The unique residual connections in ResNet50 create shortcut connections between each layer and the previous layer, effectively preventing network degradation while maintaining feature extraction performance. Generally, networks based on ResNet50 consist of five layers, with each layer performing two downsampling operations on the feature maps. In this paper, the original image $X \in \mathbb{R}^{112 \times 112 \times 3}$ is input into the CNN, resulting in output features with the dimensions of $7 \times 7 \times 256$, $14 \times 14 \times 128$, and $28 \times 28 \times 64$ after three convolutional stages. These features are then fused with the features of the same dimensions from the transformer branch.
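The following PyTorch sketch shows how such a multi-scale CNN branch could be assembled from a ResNet50 backbone. The 1 × 1 projection convolutions that reduce the stock ResNet50 channel counts (256/512/1024) to the 64/128/256 channels quoted above are an assumption introduced here for illustration, since the paper does not state how those widths are obtained.

```python
import torch.nn as nn
from torchvision.models import resnet50

class CNNBranch(nn.Module):
    """Multi-scale feature extractor built on a ResNet50 backbone (a sketch).
    The 1x1 projection convolutions are an assumption to match the paper's
    quoted feature sizes of 28x28x64, 14x14x128, and 7x7x256."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.layer1, self.layer2, self.layer3 = backbone.layer1, backbone.layer2, backbone.layer3
        self.proj2 = nn.Conv2d(256, 64, kernel_size=1)    # -> 28x28x64
        self.proj1 = nn.Conv2d(512, 128, kernel_size=1)   # -> 14x14x128
        self.proj0 = nn.Conv2d(1024, 256, kernel_size=1)  # -> 7x7x256

    def forward(self, x):                  # x: (B, 3, 112, 112)
        x = self.stem(x)                   # (B, 64, 28, 28)
        f2 = self.layer1(x)                # (B, 256, 28, 28)
        f1 = self.layer2(f2)               # (B, 512, 14, 14)
        f0 = self.layer3(f1)               # (B, 1024, 7, 7)
        return self.proj0(f0), self.proj1(f1), self.proj2(f2)
```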

2.2. Transformer Branch

The transformer branch utilizes an encoder–decoder architecture. Initially, the input image X of the cell phone screen surface is divided into 49 equal parts (7 × 7), and the resolution of each part is set to 16 × 16 pixels. In order to take advantage of the spatial location information, each image block is transformed into a one-dimensional vector through a linear mapping layer, and a position encoding of the same dimension is added to obtain the embedding sequence $e \in \mathbb{R}^{49 \times 256}$. The encoder receives the embedding sequence as input and consists of $L$ layers, each containing a multi-headed self-attention mechanism and a multi-layer perceptron. The self-attention (SA) mechanism aggregates the global information, as shown in Equation (1):
$$SA(z_i) = \mathrm{Softmax}\left(\frac{q_i k^{T}}{\sqrt{D_h}}\right) v,$$
where $[q, k, v] = z W_{qkv}$, $W_{qkv} \in \mathbb{R}^{256 \times 3D_h}$ represents the projection matrix, $z_i \in \mathbb{R}^{1 \times 256}$ represents the $i$th row of the input sequence $z$, $q_i \in \mathbb{R}^{1 \times D_h}$ represents the corresponding query vector, $k \in \mathbb{R}^{1 \times D_h}$ represents the key vector, $v \in \mathbb{R}^{1 \times D_h}$ represents the value vector, and $D_h$ represents the dimension of the key vector. The self-attention mechanism captures the relationship between inputs by allowing the input vectors to interact in different linear projection spaces. As an extension to self-attention, multi-head self-attention (MSA) establishes different projection information in multiple projection spaces. Multiple output matrices are generated by projecting the input matrix and subsequently concatenating them. In the last transformer layer, layer normalization is applied to obtain the encoded sequence $z_L \in \mathbb{R}^{49 \times 256}$. The decoder component employs a progressive upsampling operation. First, $z_L$ is reshaped to $t_0 \in \mathbb{R}^{7 \times 7 \times 256}$, and then two consecutive standard upsampling convolutional layers are used to restore the spatial resolution, obtaining $t_1 \in \mathbb{R}^{14 \times 14 \times 128}$ and $t_2 \in \mathbb{R}^{28 \times 28 \times 64}$, respectively. The feature maps $t_0$, $t_1$ and $t_2$ of different scales are fused with the corresponding features of the CNN branch. The process is shown in the transformer branch module in Figure 1.
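A minimal sketch of such a transformer branch is shown below, assuming standard PyTorch building blocks; the encoder depth, feed-forward width, and use of nn.TransformerEncoder are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TransformerBranch(nn.Module):
    """Sketch of the transformer branch: 7x7 = 49 patches of 16x16 pixels,
    linear patch embedding plus learned positional encoding, an L-layer
    encoder, and a progressive-upsampling decoder."""
    def __init__(self, img_size=112, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.n = (img_size // patch) ** 2                        # 49 tokens
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n, dim))     # positional encoding
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.up1 = nn.Sequential(nn.ConvTranspose2d(256, 128, 2, stride=2), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU())

    def forward(self, x):                                        # (B, 3, 112, 112)
        z = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # (B, 49, 256)
        z = self.encoder(z)                                      # encoded sequence z_L
        t0 = z.transpose(1, 2).reshape(-1, 256, 7, 7)            # (B, 256, 7, 7)
        t1 = self.up1(t0)                                        # (B, 128, 14, 14)
        t2 = self.up2(t1)                                        # (B, 64, 28, 28)
        return t0, t1, t2
```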

2.3. GLFI

In order to effectively combine the encoded features of the CNN and the transformer branches, this paper proposes a new fusion module, the GLFI module, as shown in Figure 3.
First, the input is split into G groups along the channel dimension by the GLFI module. Each group of features is divided into an $F_X$ branch and an $S_X$ branch. The split features utilize embedded average pooling and group normalization operations to generate new features, subsequently enhancing the feature representation via the fully connected layer ($F_C$) in Figure 3. The permutation unit delineates the interdependence of features across the spatial and channel dimensions. Finally, the new features $\hat{F}_X$ and $\hat{S}_X$ are integrated, and feature communication between components is performed through channel permutation operations. This module facilitates the integration of global and local information, addressing the limitation of CNNs and transformers, which tend to concentrate on a singular feature.
Upon receiving the encoded feature maps processed by both the CNN and the transformer branches, the features are first transformed into a mixed feature map and then introduced into the GLFI module. The input feature map is grouped, with each group constituting a sub-feature. For a given feature map $X \in \mathbb{R}^{C \times H \times W}$ (where $C$, $H$ and $W$ represent the number of channels, spatial height, and width, respectively), $X$ is divided along the channel dimension into $G$ groups, with the dimension of each sub-feature being $X_q \in \mathbb{R}^{\frac{C}{G} \times H \times W}$. During training, each sub-feature $X_q$ gradually acquires a semantic response. Then, using the attention module, the corresponding importance coefficients are generated for each sub-feature.
At the beginning of each attention unit, the input $X_q$ is divided into two branches along the channel dimension, namely $X_{q1}, X_{q2} \in \mathbb{R}^{\frac{C}{2G} \times H \times W}$. The segment following "splitting" in the center of Figure 3 illustrates the two branches in distinct colors. The $F_X$ branch employs the inter-channel dependence to produce a channel attention map, while the $S_X$ branch captures the spatial dependence among features to generate a spatial attention map. The model concurrently processes semantic and locational information through these attention mechanisms. For the channel attention branch, the channel statistics $S \in \mathbb{R}^{\frac{C}{2G} \times 1 \times 1}$ are first generated by global average pooling to embed global information, which can be calculated by shrinking $X_{q1}$ along the spatial dimension $H \times W$, as shown in Equation (2):
$$S = F_{gp}(X_{q1}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{q1}(i, j).$$
The global average pooling operation commonly used in CNNs for reducing the spatial dimensions of feature maps into a single summary value per channel is represented by Equation (2).
In addition, the sigmoid activation function is used to create a compact feature to accurately and adaptively select important channels or feature locations. The channel attention’s final output is shown in Equation (3):
$$X_{q1}' = \sigma\left(F_C(S)\right) \cdot X_{q1} = \sigma\left(W_1 S + b_1\right) \cdot X_{q1},$$
where $W_1 \in \mathbb{R}^{\frac{C}{2G} \times 1 \times 1}$ and $b_1 \in \mathbb{R}^{\frac{C}{2G} \times 1 \times 1}$ are used to scale and translate the channel statistics $S$.
Spatial attention differs from channel attention and serves as a complement to it. First, group normalization is applied to $X_{q2}$ to obtain spatial statistics. Then, $F_C$ is employed to enhance the feature representation of $X_{q2}$. The final spatial attention output is shown in Equation (4):
$$X_{q2}' = \sigma\left(W_2 \cdot GN(X_{q2}) + b_2\right) \cdot X_{q2},$$
where $W_2$ and $b_2$ are parameters with the shape of $\mathbb{R}^{\frac{C}{2G} \times 1 \times 1}$. Then, the outputs of the two attention branches are concatenated, that is, $X_q' = [X_{q1}', X_{q2}'] \in \mathbb{R}^{\frac{C}{G} \times H \times W}$. At this stage, it is consistent with the input size of this group.
All the features are consolidated. The final output of the attention module retains the same shape as $X$, enabling seamless integration with other networks. In a single attention unit, each branch has a channel count of $\frac{C}{2G}$, leading to a total number of parameters of $\frac{3C}{G}$. Because $G$ is small in comparison to the millions of parameters in the network, the GLFI module is relatively lightweight. The structures of the spatial attention module and channel attention module are illustrated in Figure 4.
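The following sketch implements one such attention unit following Equations (2)–(4); the parameter initialization and the use of per-channel group normalization are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GLFIAttentionUnit(nn.Module):
    """One attention unit for a single sub-feature group (a sketch): the first
    half of the channels goes through channel attention (global average pooling,
    scale/shift, sigmoid), the second half through spatial attention (group norm,
    scale/shift, sigmoid), following Equations (2)-(4)."""
    def __init__(self, channels_per_group: int):
        super().__init__()
        c_half = channels_per_group // 2
        # learnable scale/shift: (W1, b1) for channel attention, (W2, b2) for spatial attention
        self.w1 = nn.Parameter(torch.ones(1, c_half, 1, 1))
        self.b1 = nn.Parameter(torch.zeros(1, c_half, 1, 1))
        self.w2 = nn.Parameter(torch.ones(1, c_half, 1, 1))
        self.b2 = nn.Parameter(torch.zeros(1, c_half, 1, 1))
        self.gn = nn.GroupNorm(c_half, c_half)

    def forward(self, xq: torch.Tensor) -> torch.Tensor:
        xq1, xq2 = xq.chunk(2, dim=1)                      # split along the channel dimension
        s = xq1.mean(dim=(2, 3), keepdim=True)             # Eq. (2): global average pooling
        xq1 = torch.sigmoid(self.w1 * s + self.b1) * xq1   # Eq. (3): channel attention
        xq2 = torch.sigmoid(self.w2 * self.gn(xq2) + self.b2) * xq2  # Eq. (4): spatial attention
        return torch.cat([xq1, xq2], dim=1)                # restore the C/G group size
```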

2.4. BA Attention Module

By combining the multi-layer features to compute the attention weights, the module enables the network to learn and focus on target structures of varying shapes and sizes, further enhancing the segmentation accuracy of FS-Net. As a plug-and-play module, the BA module can be positioned at any location within the network. The fused feature $f_i$ is first input into the BA module, which consists of two parts: the fusion module and the generator. After $f_i$ enters the fusion module, it passes through a global average pooling ($GAP$) layer to calculate the average value of the feature map for each channel, resulting in a corresponding feature vector. This is then processed through a fully connected ($FC$) layer and batch normalization ($BN$) to compute the compressed feature $S_i$ as follows:
$$S_i = BN_i\left(FC\left(GAP(f_i)\right)\right),$$
where $S_i \in \mathbb{R}^{1 \times 1 \times 64}$. The BA module then flattens and accumulates the compressed features to obtain the fusion feature $I_{BA} \in \mathbb{R}^{1 \times 64}$, as shown in Equation (6):
$$I_{BA} = \sum_{i=0}^{2} S_i.$$
Then, $I_{BA}$ is introduced into the generator, passing through a ReLU activation function and a fully connected layer, and the normalized attention weight $W$ is finally obtained through the sigmoid function.
The final structure of the BA module is illustrated in Figure 5.
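A minimal sketch of this bridge attention computation is given below, assuming three fused feature levels and the 64-dimensional compressed feature stated above; the generator's output width and layer ordering are illustrative assumptions.

```python
import torch.nn as nn

class BridgeAttention(nn.Module):
    """Sketch of the BA module: each fused feature f_i is squeezed by global
    average pooling, projected by a fully connected layer and batch-normalized
    (Eq. (5)); the compressed vectors are summed (Eq. (6)) and passed through a
    ReLU/FC/sigmoid generator to produce the attention weights W."""
    def __init__(self, in_channels=(256, 128, 64), hidden=64, out_channels=64):
        super().__init__()
        self.fcs = nn.ModuleList(nn.Linear(c, hidden) for c in in_channels)
        self.bns = nn.ModuleList(nn.BatchNorm1d(hidden) for _ in in_channels)
        self.generator = nn.Sequential(nn.ReLU(), nn.Linear(hidden, out_channels), nn.Sigmoid())

    def forward(self, feats):                       # feats: list of (B, C_i, H_i, W_i)
        s = 0
        for f, fc, bn in zip(feats, self.fcs, self.bns):
            gap = f.mean(dim=(2, 3))                # global average pooling -> (B, C_i)
            s = s + bn(fc(gap))                     # Eq. (5), accumulated as in Eq. (6)
        return self.generator(s)                    # normalized attention weights W
```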

3. Loss Function

This section discusses the content related to the loss function of the FS-Net algorithm in the detection of scratches on mobile phone screens. The function includes the module loss function and the combined loss function. The former calculates the binary cross-entropy loss of each module output, and the latter is constructed using specific hyperparameters. Segmentation is realized through end-to-end training and upsampling of the feature map, which is of great significance to improving the network performance and detection accuracy.

3.1. Module Loss Functions

To enhance the network’s feature extraction ability, the outputs from the transformer branch, the fusion module, and the output module are deconvoluted to the original resolution to generate segmentation maps. The binary cross-entropy loss is computed for each output as shown in Equation (7):
$$L_{trans} = L_{fusion} = L_{out} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{P} \left[ g_j^i \log t_j^i + \left(1 - g_j^i\right) \log\left(1 - t_j^i\right) \right],$$
where $L_{trans}$, $L_{fusion}$ and $L_{out}$ represent the binary cross-entropy loss functions of the transformer branch, fusion module and output module, respectively; $N$ represents the number of training samples; $P$ represents the total number of pixels in the segmentation map; $i$ indexes the training sample and $j$ the $j$th pixel in the segmentation map; $t_j^i$ represents the predicted value of the $j$th pixel in the binary map output by the prediction head for the $i$th training sample, where $t_j^i = 0$ indicates a predicted background pixel and $t_j^i = 1$ indicates a predicted pixel of the mobile phone screen surface area; and $g_j^i$ represents the ground-truth value of the $j$th pixel in the $i$th training sample.

3.2. Combined Loss Functions

The network undergoes end-to-end training utilizing a combined loss function. The final segmentation result is obtained by upsampling the last feature map to the original resolution. The overall training loss function is as follows:
$$L_{total} = \alpha L_{trans} + \beta L_{fusion} + \gamma L_{out},$$
where α, β and γ are adjustable hyperparameters, set to 0.5, 0.3, and 0.2, respectively, based on the experimental results. Parameter α controls the weight of the binary cross-entropy loss of the transformer branch; a larger α emphasizes the transformer's feature extraction ability during learning, which is beneficial for capturing global information. Parameter β, set to 0.3, governs the weight of the loss of the fusion module, helping the model find the best balance during feature fusion and ensuring that the CNN and transformer features are combined effectively. Parameter γ, set to 0.2, controls the weight of the loss of the output module, focusing the model on minimizing the difference between the predicted output and the true labels.
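As a sketch, the combined training loss of Equations (7) and (8) could be computed as follows, assuming the three heads output per-pixel probabilities in [0, 1]; the function name is illustrative.

```python
import torch.nn.functional as F

def fsnet_loss(pred_trans, pred_fusion, pred_out, target,
               alpha=0.5, beta=0.3, gamma=0.2):
    """Binary cross-entropy per head (Eq. (7)) combined with the weights
    quoted in the paper (Eq. (8)). Predictions are assumed to be probabilities."""
    l_trans = F.binary_cross_entropy(pred_trans, target)
    l_fusion = F.binary_cross_entropy(pred_fusion, target)
    l_out = F.binary_cross_entropy(pred_out, target)
    return alpha * l_trans + beta * l_fusion + gamma * l_out
```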

4. Experiment and Result Analysis

This section analyzes the results of the FS-Net algorithm. In this research, the PKU-Market-Phone dataset is utilized, then processed and partitioned. Under specific experimental environments and parameter settings, data augmentation techniques are applied to improve the performance of the algorithm.

4.1. Datasets

The experimental data in this paper are derived from the PKU-Market-Phone dataset, publicly released by Peking University in 2022 for the segmentation of mobile screen surface defects. The dataset follows the PASCAL VOC format and consists of 1200 images of mobile screens, covering three typical types of surface defects: oil stains, scratches, and spots. Each defect category contains 400 images, which were captured using industrial cameras at a resolution of 1920 × 1030. Given that no datasets with a resolution suitable for the FS-Net model could be found, this study employed bilinear interpolation to resize the images to 112 × 112 to meet the model's input requirements. Data augmentation techniques, such as random cropping and rotation, are applied to each training image during training to increase the algorithm's robustness and generalizability. As illustrated in Figure 6, the dataset was divided into three subsets: 60% as the training set, 20% as the validation set, and 20% as the test set.

4.2. Experimental Environment and Parameters

This experiment utilizes the Python programming language and the PyTorch deep learning framework. The main parameters for algorithm training included the use of the SGD optimizer and an image size of 112 × 112. Each experimental model was trained for 300 iterations. Detailed experimental environment parameters are shown in Table 1.

4.3. Data Augmentation

Due to the limited size of the PKU-Market-Phone dataset, data augmentation techniques were applied to the images by rotating them 90°, 180°, and 270°, as well as by performing horizontal and vertical flipping. Finally, a trainable dataset for mobile screen surface defects was created with a step size of 512. This study used test-time augmentation techniques during the testing phase to further improve the previously validated model’s performance. Ultimately, a total of 6480 training images and 4320 testing and validation images were generated.
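A sketch of an equivalent augmentation pipeline using torchvision is shown below; the flip probabilities are assumptions, and for segmentation the same geometric transforms would also have to be applied to the ground-truth masks.

```python
import torchvision.transforms as T

# Augmentation close to what is described above: rotations of 90/180/270 degrees
# plus horizontal and vertical flips, after bilinear resizing to the 112x112 input.
train_transform = T.Compose([
    T.Resize((112, 112)),                  # bilinear interpolation by default
    T.RandomChoice([
        T.RandomRotation((90, 90)),        # rotate exactly 90 degrees
        T.RandomRotation((180, 180)),
        T.RandomRotation((270, 270)),
    ]),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.ToTensor(),
])
```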

4.4. Evaluation Metrics

To validate the effectiveness of the proposed method, three evaluation metrics were employed: overall accuracy ( O A ), F 1 S c o r e , and E I o U . These metrics are defined in Equations (9)–(12):
$$OA = \frac{TP + TN}{TP + TN + FN + FP},$$
$$Precision\ (P) = \frac{TP}{TP + FP},$$
$$F1\text{-}Score = \frac{2 \times P \times R}{P + R},$$
$$EIoU = \frac{TP}{TP + FP + FN},$$
where $TP$ represents the pixels correctly predicted as scratches, $FP$ denotes the pixels of intact screens incorrectly predicted as scratches, $FN$ indicates the pixels of scratches mistakenly predicted as intact screens, $TN$ refers to the number of non-scratch pixels correctly predicted as intact, $P$ denotes precision, and $R$ denotes recall, where $R = TP / (TP + FN)$.
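These pixel-level metrics can be computed directly from the confusion counts, as in the following sketch (the small epsilon terms guarding against division by zero are an implementation convenience, not part of the paper's definitions):

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray):
    """Compute OA, precision, recall, F1-score, and pixel-level EIoU
    (Eqs. (9)-(12)) from binary prediction and ground-truth masks."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    oa = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    eiou = tp / (tp + fp + fn + 1e-9)
    return oa, precision, recall, f1, eiou
```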
During the defect detection process for mobile screens, multiple prediction boxes may be generated around the same target defect. Non-maximum suppression (NMS) is typically employed to select the best bounding box, thereby avoiding redundant detection results. The intersection over union (IoU) serves as the criterion for NMS, evaluating the disparity between the actual box and the predicted box, which is calculated by dividing the area of the intersection of the two boxes by the area of their union.
The traditional IoU algorithm has the following disadvantages. First, a significant flaw in the NMS algorithm is that the IoU fails to accurately reflect the degree of overlap between the predicted and actual boxes in screen defect detection. As shown in Figure 7, the green area represents the actual box, the red area indicates the predicted box, and the gray shaded area represents the overlapping region. Although the degrees of overlap are different in these three instances, the IoU values are identical. The left scenario is deemed optimal, while the right scenario is regarded as suboptimal. However, the conventional IoU calculation method may erroneously assess their degrees of overlap as equivalent. Secondly, when the predicted box and the actual box do not overlap, as defined in Equation (12), the IoU equals zero. This function fails to reflect the distance between the predicted box and the actual box, and since the loss is also zero, there is no gradient backpropagation, making it impossible to perform learning and training.
The generalized intersection over union ( GIoU ) loss function is an improved version of the traditional IoU [25]. It pertains to the scenario in which there is no overlap between the actual box and the predicted box, yielding an IoU of zero. The mathematical expression for the GIoU loss function is shown in Equation (13):
$$L_{GIoU} = 1 - GIoU = 1 - IoU + \frac{\left|C \setminus (A \cup B)\right|}{\left|C\right|},$$
where $IoU$ represents the intersection over union between the predicted box $A$ and the actual box $B$, while $C$ denotes the smallest enclosing rectangle that can simultaneously contain both the predicted and the actual boxes. This method alleviates the gradient vanishing problem in non-overlap cases. When the actual box and the predicted box exhibit a containment relationship, the area penalty term in the GIoU loss function becomes null, rendering the loss function incapable of accurately representing the positional relationship between the predicted and the actual boxes. As shown in Figure 8, when a containment relationship exists between the predicted and the actual boxes, the penalty term in the GIoU loss degrades to the IoU loss.
To resolve the limitation of the IoU and GIoU, which concentrate exclusively on overlapping regions, the distance intersection over union (DIoU) introduces a technique that assesses positional information by analyzing the Euclidean distance between the centers of the actual box and the predicted box [26]. The mathematical equation of the loss function of the model is shown in Equation (14):
$$L_{DIoU} = 1 - DIoU = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2},$$
where $c$ represents the diagonal length of the smallest enclosing rectangle that can contain both the predicted and actual boxes, $\rho(\cdot)$ denotes the Euclidean distance, and $b$ and $b^{gt}$ denote the centers of the predicted box and the actual box, respectively.
The complete intersection over union ( CIoU ) represents an enhancement over the GIoU by integrating the benefits of both the GIoU and DIoU , while also incorporating the aspect ratio to more accurately align the predicted box’s shape with that of the actual box. The equation for the CIoU loss function is provided in Equations (15)–(17):
$$L_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha \gamma,$$
$$\gamma = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2,$$
$$\alpha = \frac{\gamma}{(1 - IoU) + \gamma},$$
where γ represents the aspect ratio difference between the two boxes, and α is a balancing parameter used to control the weight of the overlapping area between the predicted box and the actual box, allowing for corresponding adjustments. Parameters $w^{gt}$ and $h^{gt}$ denote the width and height of the actual box, while $w$ and $h$ represent the width and height of the predicted box, respectively. In the context of the CIoU, if the aspect ratios of the predicted box and the actual box are identical, the penalty term equals zero. Furthermore, in the CIoU equation, the gradients of γ with respect to $w$ and $h$ are opposite in sign, indicating that $w$ and $h$ cannot both increase or decrease simultaneously. By separating the influencing factors of the aspect ratios of the predicted box and the actual box, the EIoU [27] introduces a solution to this problem. It calculates the lengths and widths of the predicted and actual boxes individually, addressing the problems present in CIoU calculations. Parameters $w_c$ and $h_c$ represent the width and height of the smallest enclosing rectangle for the predicted and actual boxes.
A loss function used in object detection tasks to improve the alignment between predicted bounding boxes and ground truth bounding boxes, namely the EIoU loss, is represented by Equation (18):
$$L_{EIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \frac{\rho^2\left(w, w^{gt}\right)}{w_c^2} + \frac{\rho^2\left(h, h^{gt}\right)}{h_c^2}.$$
This algorithm comprehensively considers factors such as the intersection area of the two boxes, the positions of their centers, and the differences in their aspect ratios when calculating the loss function values. Conversely, conventional NMS solely considers overlapping regions, potentially resulting in erroneous elimination of defect boxes when multiple defect targets coincide, as well as misclassification of proximate detection boxes as identical due to neglecting the center point positions, culminating in missed detections. By comparison, the EIoU provides a more thorough assessment of the similarity between the two boxes, demonstrating better performance in practical applications. Therefore, this study selects the EIoU as the loss function for the NMS to enhance the algorithm performance and detection accuracy.
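For reference, a sketch of the EIoU loss of Equation (18) for corner-format boxes is given below; the epsilon stabilizer and the (x1, y1, x2, y2) box format are assumptions made for illustration.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss (Eq. (18)) for boxes of shape (N, 4) in (x1, y1, x2, y2) format:
    IoU term, center-distance term, and separate width/height penalty terms
    normalized by the smallest enclosing box."""
    # intersection and union
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # smallest enclosing box and its squared diagonal
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    cw, ch = ex2 - ex1, ey2 - ey1
    c2 = cw ** 2 + ch ** 2 + eps
    # center-distance, width, and height penalty terms
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
    rho2 = dx ** 2 + dy ** 2
    dw = (pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])
    dh = (pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])
    return (1 - iou + rho2 / c2 + dw ** 2 / (cw ** 2 + eps) + dh ** 2 / (ch ** 2 + eps)).mean()
```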
Experiments comparing the performance under different IoU conditions indicate that the EIoU significantly improves the performance of mobile screen defect detection. The extended intersection over union variant was incorporated into you only look once version 8 (YOLOv8) by Yun et al. [28] and markedly enhanced the precision, recall, and mean average precision (mAP) across IoU thresholds. This optimization improves the detection precision for diverse defect categories, tackling issues such as shape irregularity and low contrast. It is asserted by Liang et al. [29] that the context-enhanced network (CE-Net) integrates adaptive receptive field attention with spatial graph reasoning to improve the feature representation for defect detection. Bounding box predictions are enhanced by the EIoU, thereby increasing the precision for minor and irregular defects. It is stated by Zhao et al. [30] that an EIoU-based network variant improves the defect detection performance by strengthening its spatial accuracy and resistance to lighting variations.
This study presents simulations demonstrating that the EIoU loss exhibits superior convergence speed and enhanced regression accuracy relative to the IoU, GIoU, and CIoU losses, particularly in scenarios where most anchor boxes are of inferior quality. The simulation results of the low quality of most anchor boxes are shown in Figure 9a,b.
The results strongly validate the exceptional performance of the EIoU loss in object detection. In comparison to the IoU, GIoU, and CIoU losses, the EIoU loss demonstrates accelerated convergence and enhanced regression accuracy. The regression error of the EIoU loss is significantly lower than that of other traditional loss functions, as shown in Figure 9a. The EIoU loss consistently demonstrates exceptional performance, as illustrated in Figure 9b, despite facing certain challenges. Its explicit consideration of three geometric factors in bounding box regression gives the EIoU loss a distinct advantage in handling complex scenarios.
The importance of the EIoU is further emphasized in Figure 10a,b. Compared with other loss functions, the EIoU not only converges faster and has a lower regression error than other loss functions but also performs better in improving the quality of high-quality samples. Although there may be some limitations, the EIoU loss indeed provides a more reliable solution for bounding box regression in object detection.
The previously mentioned simulation experiments indicate that the EIoU loss attains quicker convergence. With its effective handling of high-quality samples and precise consideration of geometric factors, the EIoU loss contributes to lower localization errors in object detection tasks, which plays a crucial role in enhancing model performance.
In order to further demonstrate the superiority of the EIoU over other measurement methods like the IoU, an overall ablation study was carried out and the results are provided in Table 2. Here, this study selected the average precision ($AP$) as the indicator for evaluating the measurement methods. The average precision is calculated as shown in Equation (19):
$$AP = \frac{1}{n} \sum_{i=0}^{n} p_i,$$
where $p_i$ represents the precision value at the $i$th equally spaced recall level. Since it is necessary to conduct comparisons under the same experimental conditions when evaluating the impact of different loss functions on performance, the weight of the bounding box regression (BBR) loss is uniformly set to 2.5, which ensures that all the loss functions perform BBR operations with the same weights. The experimental results in Table 2 show that when the same weights are applied, the average precision of the EIoU reaches 37.0, which is significantly better than that of other methods such as the IoU and GIoU.
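A sketch of this AP computation is shown below; the eleven equally spaced recall levels are an assumption, since the paper does not state the value of n.

```python
import numpy as np

def average_precision(precisions, recalls, levels=11):
    """Approximate AP as in Eq. (19): average the interpolated precision
    at equally spaced recall levels."""
    recall_points = np.linspace(0.0, 1.0, levels)
    p = []
    for r in recall_points:
        mask = recalls >= r
        p.append(precisions[mask].max() if mask.any() else 0.0)
    return float(np.mean(p))
```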

5. Experimental Results Analysis

This section discusses the experimental results and analysis of the models developed through ablation experiments and the improved model presented in this paper, as tested on the PKU-Market-Phone dataset.

5.1. Ablation Experiments

As an innovative algorithmic architecture, FS-Net is primarily composed of four crucial components, with the aim of addressing the challenging task of detecting scratches on mobile phone screens. The convolutional neural network branch adopts ResNet50 as its backbone network. In the five-layer structure, each layer conducts two downsampling operations, whereby distinct scale features such as 7 × 7 × 256 can be derived from the input images with the dimensions of 112 × 112 × 3. These features furnish abundant information for subsequent processing. The transformer branch employs an encoder–decoder architecture. Initially, the input image is partitioned into 49 segments of 7 × 7, each having a resolution of 16 × 16 pixels. Via the linear mapping layer and position encoding, an embedding sequence is obtained. The encoder aggregates the global information through L-layer multi-headed self-attention and multi-layer perceptron to update the state of each block. The decoder restores the spatial resolution through progressive upsampling, thereby capturing global context information. In fusing the features of the two branches, the GLFI module plays a pivotal role. It groups and subdivides the branches along the channel dimension. One branch generates a channel attention map via global average pooling to embed global information and selects significant channels, while the other branch utilizes group normalization and fully connected layer operations to generate a spatial attention map for capturing the spatial dependence of features. Thus, it integrates global and local information, enhances target features, and mitigates noise, offering the benefits of reduced parameters and high computational efficiency. A fusion module and a generator are encompassed by the BA module. The attention weight is computed via operations like global average pooling, allowing the network to concentrate on target structures of varying shapes and sizes, thereby improving the segmentation accuracy of FS-Net. It can be strategically placed at any location within the network, facilitating network optimization.
By introducing a cascade neural network that integrates the CNN and transformer architectures, along with the GLFI module, this study improved the you only look once version 1 (YOLOv1) model and utilized the enhanced model for segmenting screen images. A total of four groups of ablation experiments were conducted to verify the effectiveness of each improvement point. The optimal settings were selected after conducting several experiments, as shown in Table 3.
From Table 3, it can be observed that when using the YOLOv1 model for the experiments, its performance metrics are relatively lower. However, when the GLFI module is introduced into the YOLOv1 model, the OA decreases by 0.11, while the F1-score and EIoU values improve by 2.24% and 1.24%, respectively. This indicates that the introduction of the GLFI module aids in enhancing the model’s segmentation accuracy.
When the transformer module is incorporated into the YOLOv1 model, the OA increases by 0.7%, and the F1-score and EIoU values also show some improvements. This suggests that the addition of the transformer module enhances the model’s feature extraction capabilities.
When both the transformer and GLFI modules are introduced simultaneously, the model exhibits the best performance. The overall accuracy reaches 98.04%, which is 1.66 points higher than the standard YOLOv1 model; the F1-score is 88.03%, which is 3.91 points higher; and the EIoU is 65.13%, which is 2 points higher than the YOLOv1 model. This clearly demonstrates that the combined use of the transformer and GLFI modules significantly enhances the model's performance.

5.2. Comparison of Attention Modules

In this experiment, the performance of the proposed GLFI module is compared with other commonly used attention modules in terms of the surface scratch extraction task. The experimental results are presented in Table 4.
From Table 4, it can be seen that the GLFI module demonstrates remarkable performance across all the metrics, achieving OA, F1-score, and EIoU values of 98.05%, 88.05%, and 65.15%, respectively. This indicates that the GLFI module can significantly enhance the accuracy of the relevant tasks by performing classification and delineation with higher precision. Furthermore, the GLFI module has 24.20 × 10⁶ parameters, fewer than most other attention modules. This implies that it can maintain high performance while consuming relatively less memory, giving it a greater advantage under resource-constrained conditions.
In comparison, other attention modules such as the selective kernel network (SKNet), convolutional block attention module (CBAM), efficient channel attention network (ECA-Net), bottleneck attention module (BAM), dual attention network (DANet), gated context network (GC-Net), and spatial group-wise enhance network (SGE-Net) have relatively lower OA, F1-score, and EIoU values. Their parameter counts also vary, with some being higher than those of the GLFI module. This highlights the superiority of the GLFI module in terms of both performance and memory efficiency.

5.3. Transformer Scale Analysis

The scale of the transformer is determined by the size of the hidden layers and the number of transformer layers. To verify the impact of the transformer scale on the segmentation performance, this paper conducted ablation experiments. In the experiments, the base model had a hidden layer size of 512 and 8 attention heads, while the large model had a hidden layer size of 768 and 12 attention heads. The results in Table 4 indicate that while the larger model provides a certain degree of improvement in terms of screen segmentation performance, the enhancement is relatively minor and comes with an additional computational cost, leading to an increased training time. To improve efficiency and reduce computational costs, the “Base” model was used in all the experiments.

5.4. Comparison with Mainstream Methods

The numerical comparison results with two-stage object detection algorithms, such as the region-based convolutional neural network (R-CNN) and Faster R-CNN, and one-stage object detection algorithms, such as YOLO (YOLOv1 to you only look once version 3, YOLOv3) and RetinaNet, on the PKU-Market-Phone dataset are presented in Table 5, with the optimal results presented in bold. All the models were tested using the same dataset and under the same experimental environment. Images with varied resolutions and scratch complexities were chosen for comparative analysis in order to compare the performance of these models in extracting scratches on different mobile screen surfaces.
As shown in Table 5, the proposed FS-Net model exhibits remarkable advantages from multiple aspects. Concerning the EIoU index, FS-Net attained 65.13%. Compared to RetinaNet’s 63.92%, it exhibited an increase of 1.21%. In comparison to Faster R-CNN’s 62.34%, FS-Net’s EIoU index increased by 2.79%. In comparison to the region-based fully convolutional network’s (R-FCN) 62.87%, FS-Net exhibited an increase of 2.26% in this index. FS-Net achieved a higher 0.92% for this index above the cascade region-based convolutional neural network’s (Cascade R-CNN) 64.21%. The balanced learning region-based convolutional neural network’s (Libra R-CNN) accuracy is 63.98%, while FS-Net exceeds it by 1.15%. In contrast to YOLO’s 63.12%, FS-Net had an improvement of 2.01%. This shows that FS-Net has improved target positioning accuracy and can better measure the overlap between the predicted and actual boxes.
Regarding the F1-score, in comparison to RetinaNet's 86.24%, FS-Net improved by 1.79%. Compared to Faster R-CNN's 83.45%, FS-Net is 4.58% higher. FS-Net is 3.71% greater than R-FCN's 84.32%. FS-Net exceeds Cascade R-CNN's 85.67% by 2.36%. FS-Net surpasses Libra R-CNN's 85.34% by 2.69%. In comparison to YOLO's 84.12%, it has risen by 3.91%. This demonstrates that FS-Net performs better in balancing precision and recall and can adapt more effectively to different detection scenarios.
In comparison to RetinaNet’s 97.51%, the OA of FS-Net exhibits an enhancement of 0.53%. In comparison to Faster R-CNN’s 96.12%, there is an increase of 1.92%. In contrast to YOLO’s 96.39%, FS-Net shows an improvement by 1.65%. This indicates that FS-Net exhibits a markedly better performance in accurately identifying targets and can more precisely ascertain the presence and categories of targets within images.
In terms of the running time, FS-Net only took 39.1 s to process 53 images. Compared with the classic R-CNN’s 43.2 s, it was 4.1 s faster. In comparison to the other models, it possessed distinct advantages, improving the detection efficiency and delivering results for practical applications more rapidly.
Regarding the training parameters, FS-Net requires only 24.2 × 10⁶ parameters. Compared with R-CNN's 30.6 × 10⁶, Faster R-CNN's 28.9 × 10⁶, R-FCN's 27.5 × 10⁶, Cascade R-CNN's 29.3 × 10⁶, Libra R-CNN's 28.7 × 10⁶, and YOLO's 25.3 × 10⁶, FS-Net significantly reduces the computational cost and enables the model to be efficiently trained and deployed in an environment with limited resources.
In conclusion, the FS-Net model excels across all the metrics, offering a more efficient and precise solution for object detection. Furthermore, to visually depict the comparative outcomes, this study selected four images as experimental subjects, with the results displayed in Table 6.
On the one hand, in the context of screen scratch processing, the FS-Net method proposed in this paper demonstrates significant advantages. Comprehensive recognition of all the scratch details often poses a challenge for the R-CNN and Faster R-CNN networks. In contrast, FS-Net not only accurately detects scratches of various sizes but also better identifies edge details. On the other hand, in the presence of complex backgrounds or interference, the detection results from the aforementioned networks may include numerous false positives: they might mistakenly identify non-scratch areas as scratches or miss some shallower scratches. Nevertheless, the detection outcomes from FS-Net exhibit greater accuracy and clarity, being devoid of false positives. When interference factors such as stains or highlights are present on the screen, FS-Net effectively eliminates these distractions and accurately extracts scratch information. Analysis of the scratch-processing results from networks like Faster R-CNN and FS-Net reveals that the proposed FS-Net method can more effectively and comprehensively handle screen scratches. Ultimately, the processed screen scratch targets demonstrate enhanced integrity and resemblance to the actual conditions.

6. Conclusions and Future Research

6.1. Conclusions

This research introduced a novel method for detecting scratches on mobile screens, featuring several significant contributions. A hybrid architecture that integrates transformer and CNN networks was proposed, effectively combining global and local features for thorough scratch detection. The GLFI module improved the interactions between global and local data, thereby augmenting the detection precision. Furthermore, the BA module employed multi-layer fused features to accurately concentrate on scratch characteristics, enhancing the precision. Preprocessing strategies and noise suppression techniques improved the detection of subtle scratch features, ensuring reliable performance. The results revealed the superiority of the FS-Net algorithm compared to the established methods such as R-CNN, YOLO, and RetinaNet. These advancements underscore its efficiency for industrial applications.

6.2. Limitations

The algorithm has certain limitations in terms of its generalization ability. The current experiment mainly relies on the PKU-Market-Phone dataset, which is focused on the defect detection scenario of mobile phone screens. However, in practical applications, there are iPads and other devices that are not involved in the training set. The screens of these devices have significant differences in materials, sizes, and display technologies compared to mobile phone screens. The iPad screen may adopt special display technologies, and the characteristics of scratches on its image may exceed the pattern range learned by the algorithm from the mobile phone screen data, resulting in a significant decrease in the accuracy of the detection algorithm. In addition, the diversity of defect types and manifestations of different devices in different usage environments far exceeds the scope covered by the training set, which further weakens the adaptability of the algorithm to new devices and new scenarios.
At the same time, in practical applications, the transition from mobile phone screen detection to the detection of other device screens involves domain differences in many aspects, including the characteristics of image acquisition devices, changes in shooting angles, and the diversity of illumination conditions. The algorithm is devoid of an appropriate domain adaptation mechanism and may fail to automatically modify the model to accommodate these changes. For example, the brightness, contrast, and other characteristics of mobile phone screen images captured under different illumination conditions are different from the characteristic distribution of mobile phone screen images in the training set. Consequently, the algorithm may encounter challenges in precisely identifying defects within them. Moreover, due to the significant differences in the data distribution of different device screens, such as differences in the resolution and color spaces, the algorithm may not be able to effectively map features when processing the data of new device screens, resulting in reduced detection accuracy and reliability.

6.3. Future Research

Future research should focus on addressing the problems faced by the current algorithm in order to expand its application range and improve its performance. To address the algorithm’s inadequate generalization capability, it is essential to significantly augment the dataset to encompass diverse devices, including iPads, and integrate various materials, screen dimensions, and multiple defect samples across a range of real-world usage scenarios. At the same time, in-depth exploration of advanced domain adaptation techniques is required. For instance, creating novel algorithms that autonomously adjust to the attributes of various devices, investigating unsupervised domain adaptation techniques that extract domain-invariant features from unlabeled data of new devices, and employing adversarial training to mitigate domain discrepancies can improve the algorithm’s applicability across diverse devices. In addition, to enhance the robustness of the algorithm, adaptive feature extraction methods should be explored. With the help of deep learning architectures, hierarchical and context-aware features can be learned, enabling the model to dynamically adapt to different screen and defect manifestations. It is essential to investigate methods for addressing noise and variability factors, including fluctuations in lighting conditions and image quality in real-world situations, through the application of image preprocessing techniques and uncertainty estimation.
On the other hand, to meet the requirements of real-time detection, the speed and efficiency of the algorithm should be optimized. This can be achieved through model compression, pruning, and quantization. Meanwhile, the incremental learning and online adaptation capabilities can also be studied, enabling the algorithm to continuously learn new defect patterns and device characteristics during the production process, reducing the dependence on retraining with large-scale datasets. Furthermore, exploring the integration of the algorithm with other technologies also holds great potential.

Author Contributions

Conceptualization, Z.C., K.L. and S.T.; methodology, Z.C.; validation, S.T.; data curation, K.L.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C., C.Z. and K.L.; supervision, Z.C. and C.Z.; project administration, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data can be shared upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, C.; Zhang, X.; Huang, Y.; Tang, C.; Fatikow, S. A novel algorithm for defect extraction and classification of mobile phone screen based on machine vision. Comput. Ind. Eng. 2020, 146, 106530. [Google Scholar] [CrossRef]
  2. Jian, C.; Gao, J.; Ao, Y. Automatic surface defect detection for mobile phone screen glass based on machine vision. Appl. Soft Comput. 2017, 52, 348–358. [Google Scholar] [CrossRef]
  3. Kuang, Y.; Zhang, K.; Xie, H. Adaptive intelligent detection technology for digital products’ shell surface. J. South China Univ. Technology. Nat. Sci. 2015, 43, 1–8. [Google Scholar]
  4. Weimer, D.; Scholz-Reiter, B.; Shpitalni, M. Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection. CIRP Ann. 2016, 65, 417–420. [Google Scholar] [CrossRef]
  5. Maaz, M.; Shaker, A.; Cholakkal, H.; Khan, S.; Zamir, S.W.; Anwer, R.M.; Shahbaz Khan, F. Edgenext: Efficiently amalgamated cnn-transformer architecture for mobile vision applications. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 3–20. [Google Scholar]
  6. Pan, Y.; Zhou, C.; Su, L.; Hassan, H.; Huang, B. Bridging the Gap: A Fusion of CNN and Transformer Models for Real-Time Object Detection. In Proceedings of the 2023 IEEE 11th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 8–10 December 2023; Volume 11, pp. 1916–1921. [Google Scholar]
  7. Zhao, J.; Zhu, B.; Peng, M.; Li, L. Mobile phone screen surface scratch detection based on optimized YOLOv5 model (OYm). IET Image Process. 2023, 17, 1364–1374. [Google Scholar] [CrossRef]
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  9. Ming, W.; Cao, C.; Zhang, G.; Zhang, H.; Zhang, F.; Jiang, Z.; Yuan, J. Application of convolutional neural network in defect detection of 3C products. IEEE Access 2021, 9, 135657–135674. [Google Scholar] [CrossRef]
  10. Kuijper, A. p-Laplacian driven image processing. In Proceedings of the 2007 IEEE International Conference on Image Processing, San Antonio, TX, USA, 16 September–19 October 2007; Volume 5, pp. V-257–V-260. [Google Scholar] [CrossRef]
  11. Lin, X.; Wang, J.; Lin, C. Research on 3D reconstruction in binocular stereo vision based on feature point matching method. In Proceedings of the 2020 IEEE 3rd International Conference on Information Systems and Computer Aided Education (ICISCAE), Dalian, China, 27–29 September 2020; pp. 551–556. [Google Scholar]
  12. Shang, J.Y.; Zhang, Y.; Zhang, Q.B.; Wang, W.S. Distorted target recognition based on prewitt operator combined with MACH filter. Key Eng. Mater. 2013, 552, 523–528. [Google Scholar] [CrossRef]
  13. Chen, Z.; Zhang, W.; Li, S.; Zhu, K.; Chen, H. Optimization of Binocular Vision Ranging Based on Sparse Stereo Matching and Feature Point Extraction. IEEE Access 2024, 12, 153859–153873. [Google Scholar] [CrossRef]
  14. Bruni, V.; Vitulano, D. A generalized model for scratch detection. IEEE Trans. Image Process. 2004, 13, 44–50. [Google Scholar] [CrossRef] [PubMed]
  15. Yuan, F.; Zhang, Z.; Fang, Z. An effective CNN and Transformer complementary network for medical image segmentation. Pattern Recognit. 2023, 136, 109228. [Google Scholar] [CrossRef]
  16. Xu, Z.; Wang, Z. MCV-UNet: A modified convolution & transformer hybrid encoder-decoder network with multi-scale information fusion for ultrasound image semantic segmentation. PeerJ Comput. Sci. 2024, 10, e2146. [Google Scholar] [PubMed]
  17. Kanadath, A.; Jothi, J.A.A.; Urolagin, S. CViTS-Net: A CNN-ViT Network with Skip Connections for Histopathology Image Classification. IEEE Access 2024. [Google Scholar] [CrossRef]
  18. Deng, K.; Meng, Y.; Gao, D.; Bridge, J.; Shen, Y.; Lip, G.; Zheng, Y. Transbridge: A lightweight transformer for left ventricle segmentation in echocardiography. In Simplifying Medical Ultrasound: Second International Workshop, ASMUS 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, 27 September 2021; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 63–72. [Google Scholar]
  19. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Zhang, L. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  20. Zhao, Y.; Chen, J.; Zhang, Z.; Zhang, R. BA-Net: Bridge attention for deep convolutional neural networks. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 297–312. [Google Scholar]
  21. Chunmou, C. Lead line image enhancement algorithm based on histogram equalization and Laplace. Foreign Electron. Meas. Technol. 2019, 38, 131–135. [Google Scholar]
  22. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Deshpande, A.; Estrela, V.V.; Patavardhan, P. The DCT-CNN-ResNet50 architecture to classify brain tumors with super-resolution, convolutional neural network, and the ResNet50. Neurosci. Inform. 2021, 1, 100013. [Google Scholar] [CrossRef]
  25. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  26. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence (AAAI), New York, NY, USA, 7 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  27. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  28. Yunpeng, G.; Rui, Z.; Mingxu, Y.; Sabah, F. YOLOv8-TDD: An Optimized YOLOv8 Algorithm for Targeted Defect Detection in Printed Circuit Boards. J. Electron. Test. 2024, 1–12. [Google Scholar] [CrossRef]
  29. Liang, A.; Wang, Q.; Wu, X. Context-Enhanced Network with Spatial-Aware Graph for Smartphone Screen Defect Detection. Sensors 2024, 24, 3430. [Google Scholar] [CrossRef] [PubMed]
  30. Zhao, C.; Pan, J.; Tan, Q.; Wu, Z.; Chen, Z. DSU-Net: Dynamic Stacked U-Net for Enhancing Mobile Screen Defect Detection. In Proceedings of the 2023 China Automation Congress (CAC), Chongqing, China, 17–19 November 2023; pp. 7454–7459. [Google Scholar]
  31. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  32. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  33. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  34. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  35. Park, J. BAM: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  36. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  37. Su, R.; Huang, W.; Ma, H.; Song, X.; Hu, J. SGE net: Video object detection with squeezed GRU and information entropy map. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 689–693. [Google Scholar]
  38. Shi, X.; Zhou, S.; Tai, Y.; Wang, J.; Wu, S.; Liu, J.; Xu, K.; Peng, T.; Zhang, Z. An improved Faster R-CNN for steel surface defect detection. In Proceedings of the 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), Shanghai, China, 26–28 September 2022; pp. 1–5. [Google Scholar]
  39. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  40. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  41. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 821–830. [Google Scholar]
  42. Lin, T. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. [Google Scholar]
  43. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-YOLOv4: Scaling cross stage partial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13029–13038. [Google Scholar]
Figure 1. Structure of the FS-Net designed for mobile screen scratch detection.
Figure 2. (a) Part of the network structure of ResNet50; and (b) residual connection block.
Figure 3. GLFI module structure.
Figure 4. Spatial attention and channel attention modules.
Figure 5. BA attention module.
Figure 6. Sample dataset display.
Figure 7. Different overlap levels with the same IoU values.
Figure 8. Overlap between predicted and actual boxes.
Figure 9. (a) Regression error curves of different loss functions; and (b) variation trends of IoU box plots under different loss functions.
Figure 10. (a) Regression error curves of different loss functions; and (b) trends in the IoU box plots under different loss functions.
Table 1. Experimental environment’s parameters.
Equipment | Computer Configuration Parameters
Operating system | Linux
Type of operating system | Ubuntu 20.04
RAM | 64 GB
CPU | 12 vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz
GPU | RTX 3080 (10 GB) × 1
Hard disk drive | System disk: 30 GB; data disks: 50 GB SSD
Development language | Python 3.8
Deep learning framework | PyTorch 2.0.0
Table 2. Overall ablation results of the detection method.
Method | AP
IoU | 36.8
GIoU | 36.8
CIoU | 36.9
EIoU | 37.0
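For context on the loss variants compared in Table 2, the sketch below computes the IoU of two axis-aligned boxes and the corresponding EIoU loss following the formulation of Zhang et al. [27]. The (x1, y1, x2, y2) box representation and the epsilon constant are assumptions made for illustration; the training code used in this work may organize these computations differently.

# Illustrative IoU and EIoU computations for axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def eiou_loss(pred, target, eps=1e-7):
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph, tw, th = px2 - px1, py2 - py1, tx2 - tx1, ty2 - ty1
    pcx, pcy = (px1 + px2) / 2, (py1 + py2) / 2
    tcx, tcy = (tx1 + tx2) / 2, (ty1 + ty2) / 2
    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    # EIoU adds center-distance, width, and height penalties to the IoU loss.
    center_term = ((pcx - tcx) ** 2 + (pcy - tcy) ** 2) / (cw ** 2 + ch ** 2 + eps)
    width_term = (pw - tw) ** 2 / (cw ** 2 + eps)
    height_term = (ph - th) ** 2 / (ch ** 2 + eps)
    return 1.0 - iou(pred, target) + center_term + width_term + height_term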
Table 3. Comparison between ablation experiments of different modules.
Method | OA/% | F1-Score/% | EIoU/%
YOLOv1 | 96.38 | 84.12 | 63.13
YOLOv1 + GLFI | 96.27 | 86.36 | 64.37
Transformer + YOLOv1 | 97.08 | 85.97 | 64.01
Transformer + YOLOv1 + GLFI | 98.04 | 88.03 | 65.13
Table 4. Performance of different attention modules.
Attention Module | OA/% | F1-Score/% | EIoU/% | Number of Parameters/10⁶ MB
SKNet [31] | 97.20 | 86.60 | 64.40 | 26.15
CBAM [32] | 97.10 | 85.95 | 64.01 | 28.09
ECA-Net [33] | 97.05 | 87.10 | 64.13 | 25.65
BAM [35] | 97.00 | 86.20 | 64.25 | 25.80
DANet [34] | 96.90 | 85.80 | 64.10 | 26.50
GC-Net [36] | 96.40 | 84.10 | 63.12 | 28.08
SGE-Net [37] | 96.25 | 86.35 | 64.35 | 25.50
GLFI | 98.05 | 88.05 | 65.15 | 24.20
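All of the modules compared in Table 4 reweight intermediate feature maps with learned attention. As a point of reference only, the following sketch implements a generic squeeze-and-excitation style channel-attention block; it is a simplified stand-in for this family of modules, not a reimplementation of GLFI or of any cited method, and the reduction ratio of 16 is an assumed default.

# Generic channel-attention block (squeeze-and-excitation style), for illustration only.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Global average pooling over the spatial dimensions, then per-channel gating.
        weights = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        return x * weights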
Table 5. Performance metrics of different models.
Method | OA/% | F1-Score/% | EIoU/% | Time/s | Parameters/10⁶ MB
R-CNN | 95.27 | 81.34 | 60.63 | 43.2 | 30.6
Faster R-CNN [38] | 96.12 | 83.45 | 62.34 | 41.5 | 28.9
R-FCN [39] | 96.54 | 84.32 | 62.87 | 40.8 | 27.5
Cascade R-CNN [40] | 97.23 | 85.67 | 64.21 | 40.2 | 29.3
Libra R-CNN [41] | 97.15 | 85.34 | 63.98 | 39.8 | 28.7
RetinaNet [42] | 97.51 | 86.24 | 63.92 | 40.6 | 27.6
YOLO | 96.39 | 84.12 | 63.12 | 42.6 | 25.3
YOLOv1 [43] | 95.87 | 82.76 | 61.89 | 41.9 | 24.8
YOLOv2 [43] | 96.18 | 83.65 | 62.56 | 41.2 | 25.1
YOLOv3 [43] | 96.84 | 84.87 | 63.54 | 41.8 | 25.7
FS-Net | 98.04 | 88.03 | 65.13 | 39.1 | 24.2
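For clarity on how the scores in Tables 3–5 are defined, the sketch below gives generic definitions of overall accuracy (OA) and the F1-score in terms of true/false positive and negative counts. It is not the evaluation script used in this study, and the small epsilon is an assumption added only to avoid division by zero.

# Generic OA and F1-score definitions from binary classification counts.
def overall_accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn, eps=1e-7):
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)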
Table 6. Experimental results of different networks.
[Qualitative detection results on five sample screen images for R-CNN, Faster R-CNN, R-FCN, Cascade R-CNN, Libra R-CNN, RetinaNet, YOLO, YOLOv1, YOLOv2, YOLOv3, and FS-Net; result images not reproduced here.]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
