Article

Dual-Domain Prior-Driven Deep Network for Infrared Small-Target Detection

1
Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences, Shenyang 110016, China
2
Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
3
Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110169, China
4
University of Chinese Academy of Sciences, Beijing 100049, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(15), 3827; https://doi.org/10.3390/rs15153827
Submission received: 15 June 2023 / Revised: 28 July 2023 / Accepted: 28 July 2023 / Published: 31 July 2023

Abstract

In recent years, data-driven deep networks have demonstrated remarkable detection performance for infrared small targets. However, continuously increasing the depth of neural networks to enhance performance has proven impractical. Consequently, integrating prior physical knowledge about infrared small targets into deep neural networks has become crucial, as it improves the models’ awareness of the targets’ inherent physical characteristics. In this paper, we propose a novel dual-domain prior-driven deep network (DPDNet) for infrared small-target detection. Our method integrates the advantages of both data-driven and model-driven methods by leveraging the prior physical characteristics as the driving force. Initially, we utilize the sparse characteristics of infrared small targets to boost their saliency at the input level of the network. Subsequently, a high-frequency feature extraction module, seamlessly integrated into the network’s backbone, is employed to excavate feature information. DPDNet simultaneously emphasizes the prior sparse characteristics of infrared small targets in the spatial domain and their prior high-frequency characteristics in the frequency domain. Compared with previous CNN-based methods, our method achieves superior performance while utilizing fewer convolutional layers. It achieves 78.64% IoU, 95.56% Pd, and 2.15 × 10⁻⁶ Fa on the SIRST dataset.

1. Introduction

Infrared detection systems offer numerous advantages, including strong concealment, a wide detection range, and resistance to environmental interference. These systems find widespread applications on mobile platforms such as satellites, airplanes, and drones. Due to the long imaging distance, the target in the image plane only occupies a few or dozens of pixels, which leads to the lack of intrinsic features (such as the color, shape, and texture) [1]. Furthermore, the presence of complex backgrounds introduces significant interference, which causes the precise detection of the small infrared targets to become difficult [2].
After years of development, the field of infrared small-target detection has made significant progress. On the one hand, scholars’ understanding of specific domain knowledge has been deepening continuously. It began with early filtering-based methods [3,4] that simply modeled infrared small targets as protrusions in smooth backgrounds. Subsequently, more advanced methods [5,6] began to take into account the image edge structures and the targets with high local contrast. More recently, some methods [7,8] have been developed based on low-rank sparse recovery. The performance of model-driven detection methods has gradually improved.
On the other hand, with the rapid development of deep neural networks in the field of computer vision, scholars have transferred deep network models that were originally developed for visible-light image object detection to infrared image processing tasks. Liu et al. [9] first proposed a CNN-based method. They designed a multilayer perceptron (MLP) network with five layers for infrared small-target detection. Hou et al. [10] treated small targets in infrared images as a special type of sharp noise and established a mapping network between feature maps and small-target likelihood in the image. They then applied a threshold to the likelihood map to segment the real targets. Dai et al. [11] proposed an attentional local contrast network (ALCNet) that fused detailed features with deep semantic features to enhance the effect of infrared small-target detection. Deep learning offers the capability to construct end-to-end systems that autonomously extract multilevel features while learning the target detection task. This breakthrough overcomes the restrictions of manually engineered features and classifiers, leading to substantial advancements in detection performance.
However, both model-driven and data-driven approaches still suffer from pressing challenges:
  • Purely model-driven methods face challenges in achieving precise modeling because of their heavy dependence on scholars’ expertise and experience. These methods are typically simplified forms of real data, which limits their ability to address complex real-world scenarios. Consequently, they encounter difficulties in terms of their detection performance and robustness.
  • While current data-driven methods have demonstrated favorable results, achieving precise detection of infrared small targets is still a difficult task. The main challenges arise from the limited proportion of pixels occupied by targets, the serious imbalance between the foreground and background, and the weak semantic connections between the targets and the environment [12]. Since data-driven approaches fail to leverage domain-specific knowledge effectively, simply deepening the neural network has shown minimal impact on improving detection performance.
  • Most deep networks for infrared small-target detection rely on supervised learning, which utilizes labeled data. However, the simulation of infrared data is not perfect, and measured data lack samples in areas of infrared target detection and recognition [13]. Even if a batch of high-quality labeled samples is amassed, the training models are still sensitive to the variations of backgrounds, potentially leading to poor generalization performance on new datasets [14].
To tackle the aforementioned challenges, we propose a novel dual-domain prior-driven deep network (DPDNet) for infrared small-target detection, which integrates advantages of both the model-driven and data-driven methods. We embed reliable physical priors into different network levels by leveraging the sparsity and the high-frequency characteristics in the spatial and frequency domains, respectively. This integration enriches the limited annotated data, reduces the network complexity, and leads to better feature representations. Therefore, the proposed method improves the detection performance for infrared small targets. Additionally, the interpretability and reliability of the CNN are also enhanced through this augmentation.
Our proposed method focuses on existing domain knowledge to guide the model at both the input and the inner levels of the network. Given that the development of recovering the low-rank and sparse matrices is well established, the infrared patch-image model [15,16,17] based on unsupervised learning offers a reliable theoretical basis for the sparse-characteristic-driven module. In addition, targets in infrared imagery exhibit higher radiation intensity than the background, which causes distinct high-frequency characteristics in the frequency domain. To capture the specific high-frequency characteristics of small targets, a hypercomplex infrared Fourier transform approach is used as the high-frequency-characteristic-driven module. The above two characteristics enable the network to exploit more feature information of infrared small targets.
We conducted visual analyses to illustrate the impact of embedding physical characteristics at different network levels. A 3D visualization was used to analyze the effect on the input data after introducing the sparse characteristics. Additionally, we employed class activation mapping to visualize the inner features of the network, which can be used to analyze the relationship between the activated features and the high-frequency-characteristic-driven module.
The contributions are summarized as follows:
  • We propose a novel dual-domain prior-driven deep network for infrared small-target detection, which integrates the data-driven methods with the model-driven approaches.
  • We guide supervised data-driven models by proposing prior-driven modules that embed domain knowledge at both the input and the inner levels of the network.
  • We analyze the effectiveness and reliability of the prior-driven modules in guiding learning and enhancing the expression capability of target features.
The rest of this research is organized as follows: In Section 2, we introduce the background about the interpretable physical information and the prior domain knowledge used for our infrared small-target detection. Section 3 introduces each module and the specific structure of the DPD network in detail. In Section 4, we validate the detection performance of our method and compare it with other state-of-the-art methods. In addition, an ablation experiment is presented to validate the effectiveness of each module in the proposed network. Section 5 presents a visualization analysis of the impact on different network levels after incorporating physical priors. Finally, Section 6 concludes the paper.

2. Background

2.1. Informed Machine Learning

The complexity of contemporary machine learning models poses challenges for human comprehension of the precise decision-making process during inference [18]. By harnessing the domain knowledge accumulated by scholars, it is possible to strengthen models’ interpretability and robustness while reducing data requirements [19]. The fusion of domain knowledge and data-driven approaches is mainly manifested by knowledge embedding, which constructs deep network models with physical prior knowledge. On the one hand, embedding domain knowledge into the models allows the powerful fitting capability to describe complex mapping relationships among high-dimensional variables, thereby improving the model’s accuracy. On the other hand, incorporating prior knowledge ensures that the predicted results correspond to fundamental physical mechanisms and common sense.

2.2. Sparse Characteristic of Infrared Small Targets

An infrared image $f_F$ can be represented as a linear combination of a background image $f_B$, a target image $f_T$, and a noise image $f_N$:
$$f_F = f_B + f_T + f_N \qquad (1)$$
The input image can be represented as a block tensor $\mathcal{F}$ by using a sliding window with a certain step size to traverse the image from left to right and from top to bottom. Stacking the obtained image blocks into a three-dimensional cube, Equation (1) can be transformed into
$$\mathcal{F} = \mathcal{B} + \mathcal{T} + \mathcal{N} \qquad (2)$$
where $\mathcal{B}, \mathcal{T}, \mathcal{N} \in \mathbb{R}^{I \times J \times P}$ denote the background block tensor, target block tensor, and noise block tensor, respectively [7]. $I$ and $J$ are the patch height and width, respectively, and $P$ is the patch number.
Since the target occupies a small proportion of the entire infrared image, it can be regarded as a sparse matrix. In contrast, the background pixels display correlation, even if they cross distant positions in the image, which is denoted as nonlocal self-correlation. According to this characteristic, we can consider the background image as a low-rank matrix. Therefore, target detection can be transformed into an optimization problem of the low-rank and sparse matrices’ recovery.
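For readers who want a concrete picture of the patch-image construction and the low-rank/sparse separation described above, the following Python sketch illustrates the idea. It is not the authors’ implementation; the window size, step, truncation rank, and threshold are illustrative choices, and a real IPI/IPT solver would solve a robust principal component analysis problem rather than apply a single truncated SVD.

```python
# Minimal sketch of the patch-image idea in Equations (1)-(2): build the patch tensor F
# with a sliding window, then split it into an approximately low-rank background B and a
# sparse target component T. All hyperparameters below are illustrative.
import numpy as np

def build_patch_tensor(img, patch=32, step=16):
    """Stack sliding-window patches of `img` into a tensor of shape (patch, patch, P)."""
    h, w = img.shape
    patches = []
    for top in range(0, h - patch + 1, step):
        for left in range(0, w - patch + 1, step):
            patches.append(img[top:top + patch, left:left + patch])
    return np.stack(patches, axis=-1)  # F in Equation (2)

def separate_low_rank_sparse(F, rank=3, sparse_thr=0.15):
    """Crude B/T split: truncated SVD of the unfolded matrix gives B; the soft-thresholded
    residual gives T. Robust PCA solvers replace this step in the actual patch-image models."""
    patch_h, patch_w, P = F.shape
    M = F.reshape(patch_h * patch_w, P).astype(np.float64)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    B = (U[:, :rank] * s[:rank]) @ Vt[:rank]            # low-rank background estimate
    R = M - B                                            # residual containing targets and noise
    T = np.sign(R) * np.maximum(np.abs(R) - sparse_thr * np.abs(R).max(), 0.0)
    return B.reshape(F.shape), T.reshape(F.shape)

img = np.random.rand(256, 256).astype(np.float32)        # stand-in for an infrared frame
F = build_patch_tensor(img)
B, T = separate_low_rank_sparse(F)
print(F.shape, B.shape, T.shape)
```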

2.3. High-Frequency Characteristics of Infrared Small Targets

The high-intensity saliency of the specific frequency domain in infrared small targets makes the use of a dual-domain feature extraction method highly effective. This approach aims to capture the high-frequency features present in the image’s frequency domain. These features are fused with spatial-domain features to form a comprehensive representation that combines spatial and frequency characteristics.
In the development of infrared image feature extraction, traditional methods [20,21,22] mainly rely on techniques such as histogram equalization, filtering, and gray-level stretching. Nevertheless, these methods have limitations in concurrently incorporating both spatial- and frequency-domain information, which restricts the effectiveness of their practical application. With the application of wavelet transform and local contrast enhancement [16,23,24], it becomes feasible to effectively extract spatial- and frequency-domain information from infrared images. In addition, the recent development of deep learning techniques has brought new ideas for dual-domain feature extraction from infrared images. Utilizing deep learning models [25,26,27] enables the automatic extraction of dual-domain information from infrared images.

3. Methods

The workflow of the DPDNet is shown in Figure 1. At the input level, we utilize the sparse characteristics in the spatial domain as the driving module. This module utilizes the preliminary targets and background imagery recovered from the sparse and low-rank matrices to enhance the original image $x$, resulting in a more salient target in $\bar{x}$. Considering the inherent temporal physical characteristics present in infrared images, we incorporate the high-frequency physical characteristics of infrared images as the driving module at the inner level of the network. Repeating the extraction and fusion of dual-domain features makes more efficient use of frequency-domain physical characteristics. Inspired by Dense Nested Net [28], we embed the high-frequency-characteristic-driven module into a multilayer complex dense network to construct the primary detection module. The skip connections are specifically utilized to fuse low-level and high-level features in the detection module.

3.1. Sparse-Characteristic-Driven Module

To incorporate prior spatial-domain physical knowledge into the training data, the sparse-characteristic-driven module operates on the neural network before training. Background image blocks in infrared images mostly come from one or more low-rank subspaces. In addition, due to the lack of nonlocal self-similarity, small targets constitute sparse components. Through the decomposition of the infrared image into the low-rank and sparse matrices, it becomes possible to distinguish background information from target information. The target features of the original image $x$ can be enhanced and represented as follows:
$$\bar{x}_{\mathrm{background}} = \mathrm{fold}(x) < T_{\mathrm{down}} \qquad (3)$$
$$\bar{x}_{\mathrm{target}} = \mathrm{fold}(x) > T_{\mathrm{up}} \qquad (4)$$
$$\bar{x} = w_0 x \oplus \left( w_1 \bar{x}_{\mathrm{background}} + w_2 \bar{x}_{\mathrm{target}} \right) \qquad (5)$$
where $\mathrm{fold}(x)$ denotes the low-rank and sparse decomposition applied to the original image, $\bar{x}_{\mathrm{background}}$ and $\bar{x}_{\mathrm{target}}$ are obtained after adaptive dual-threshold segmentation ($T_{\mathrm{down}}$ and $T_{\mathrm{up}}$ are the thresholds), $+$ is the pixel-level image fusion, and $\oplus$ is the channel-level image fusion.
The data processed by the sparse-characteristic-driven module show richer target information, which improves the feature quality in the subsequent feature extraction processes. Figure 2 shows the resulting images $\bar{x}$ after the sparse-characteristic-driven module. The targets in $\bar{x}$ are more discernible to the naked eye.
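As a rough illustration of Equations (3)–(5), the sketch below builds $\bar{x}$ from an original image and a stand-in for the decomposition result $\mathrm{fold}(x)$. The thresholds are placeholders, the fusion weights follow the values reported in Section 5.1, and the channel-level fusion is assumed to simply stack maps as channels.

```python
import numpy as np

def enhance_input(x, fold_map, t_down=0.2, t_up=0.6, w0=0.5, w1=0.125, w2=0.375):
    """Sketch of Equations (3)-(5); thresholds are placeholders, weights follow Section 5.1."""
    background = np.where(fold_map < t_down, fold_map, 0.0)    # x_bar_background, Eq. (3)
    target = np.where(fold_map > t_up, fold_map, 0.0)          # x_bar_target, Eq. (4)
    fused = w1 * background + w2 * target                       # pixel-level fusion (+)
    # channel-level fusion: stack the weighted original image and the fused prior as channels
    return np.stack([w0 * x, fused], axis=0)                    # Eq. (5)

x = np.random.rand(256, 256).astype(np.float32)                 # stand-in infrared image
prior = np.clip(x + 0.1 * np.random.rand(256, 256), 0.0, 1.0)   # stand-in for fold(x)
x_bar = enhance_input(x, prior)
print(x_bar.shape)                                              # (2, 256, 256)
```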

3.2. High-Frequency Characteristic Extraction Module

Inspired by the successful application of the hypercomplex Fourier transform (HFT) in RGB color image processing tasks [29], we propose a hypercomplex infrared Fourier transform (HIFT) to extract high-frequency characteristics. The HIFT extends the Fourier transform from real-valued signals to hypercomplex signals and applies it to infrared images to extract features in the frequency domain. The hypercomplex input is specified to be a quaternion, which can be regarded as an extension of the concept of complex numbers. The quaternion tuple that represents the hypercomplex signal is denoted as $f(m, n)$:
$$f(m, n) = \omega_0 f_0 + \omega_1 f_1 i + \omega_2 f_2 j + \omega_3 f_3 k \qquad (6)$$
where $m$ and $n$ are discrete spatial variables; $i$, $j$, $k$ are imaginary units; $\omega_0$, $\omega_1$, $\omega_2$, $\omega_3$ are weights; and $f_0$, $f_1$, $f_2$, $f_3$ are feature matrices. $f_0$ is the motion feature; hence, $f_0 = 0$ for single-frame static input images. $f_1$ is the gray value of the image:
$$f_1 = \mathrm{Gray}(x, y) \qquad (7)$$
$f_2$ is the infrared radiation [30]:
$$f_2 = \Phi_{\mathrm{min}} + \left( \frac{g}{255} \right)^{r} \times \left( \Phi_{\mathrm{max}} - \Phi_{\mathrm{min}} \right)^{1 - r} \qquad (8)$$
where $g$ denotes the gray level of each pixel of an image, and $r \in (0, 1)$ is a constant that typically depends on the parameters of the environment and the infrared sensor. We replace $r$ with the normalized grayscale image value. The maximum and minimum radiation intensities of the target are denoted as $\Phi_{\mathrm{max}}$ and $\Phi_{\mathrm{min}}$, respectively. $f_3$ is the local thermal variation of the image, which can be calculated as follows:
$$f_3 = \frac{L_{\mathrm{in}}(i, j)}{n_{\mathrm{in}}} - \frac{L_{\mathrm{out}}(i, j)}{n_{\mathrm{out}}} \qquad (9)$$
where
$$L_{\mathrm{in}}(i, j) = \sum_{(k, l) \in N_{\mathrm{in}}(i, j)} \left| g(k, l) - \mu_{\mathrm{in}}(i, j) \right| \qquad (10)$$
$$L_{\mathrm{out}}(i, j) = \sum_{(k, l) \in N_{\mathrm{out}}(i, j)} \left| g(k, l) - \mu_{\mathrm{out}}(i, j) \right| \qquad (11)$$
$$\mu_{\mathrm{in}}(i, j) = \frac{1}{n_{\mathrm{in}}} \sum_{(k, l) \in N_{\mathrm{in}}(i, j)} g(k, l) \qquad (12)$$
$$\mu_{\mathrm{out}}(i, j) = \frac{1}{n_{\mathrm{out}}} \sum_{(k, l) \in N_{\mathrm{out}}(i, j)} g(k, l) \qquad (13)$$
where $N_{\mathrm{in}}(i, j)$ is the rectangular neighborhood of pixel $(i, j)$, and its length and width are the largest length and width of the targets in the dataset. $N_{\mathrm{out}}(i, j)$ is a larger rectangular neighborhood of pixel $(i, j)$ that excludes the pixels in $N_{\mathrm{in}}(i, j)$. $g(k, l)$ is the intensity of the pixel in the $k$-th row and $l$-th column of the neighborhood, and $\mu$ is the mean intensity of the neighborhood. $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$ are the pixel numbers of $N_{\mathrm{in}}(i, j)$ and $N_{\mathrm{out}}(i, j)$, respectively.
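To make the feature maps of Equations (7)–(13) more tangible, here is a plain NumPy sketch under a few assumptions: the radiation mapping follows Equation (8) with placeholder $\Phi_{\mathrm{min}}$, $\Phi_{\mathrm{max}}$, and $r$; the deviations in Equations (10) and (11) are taken as absolute deviations; and the window sizes are illustrative rather than derived from the dataset’s largest target.

```python
import numpy as np

def radiation_map(gray, phi_min=0.0, phi_max=1.0, r=0.5):
    """f2 of Equation (8) with placeholder sensor/environment parameters."""
    g = gray.astype(np.float64)
    return phi_min + (g / 255.0) ** r * (phi_max - phi_min) ** (1.0 - r)

def local_thermal_variation(gray, half_in=2, half_out=5):
    """f3 of Equations (9)-(13), written with plain loops for clarity rather than speed."""
    g = gray.astype(np.float64)
    h, w = g.shape
    f3 = np.zeros_like(g)
    for i in range(half_out, h - half_out):
        for j in range(half_out, w - half_out):
            outer = g[i - half_out:i + half_out + 1, j - half_out:j + half_out + 1]
            inner = g[i - half_in:i + half_in + 1, j - half_in:j + half_in + 1]
            n_in = inner.size
            n_out = outer.size - n_in                            # ring around the inner window
            mu_in = inner.mean()                                 # Eq. (12)
            mu_out = (outer.sum() - inner.sum()) / n_out         # Eq. (13)
            mask = np.ones(outer.shape, dtype=bool)              # select only the ring pixels
            mask[half_out - half_in:half_out + half_in + 1,
                 half_out - half_in:half_out + half_in + 1] = False
            l_in = np.abs(inner - mu_in).sum()                   # Eq. (10)
            l_out = np.abs(outer[mask] - mu_out).sum()           # Eq. (11)
            f3[i, j] = l_in / n_in - l_out / n_out               # Eq. (9)
    return f3

gray = (np.random.rand(64, 64) * 255.0).astype(np.float32)       # stand-in infrared frame
f1 = gray / 255.0                                                 # f1: gray value of the image
f2 = radiation_map(gray)
f3 = local_thermal_variation(gray)
```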
After defining the feature matrices, the discrete HIFT formula is
$$F_H(s, t) = \frac{1}{\sqrt{MN}} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} e^{-\omega 2 \pi \left( m s / M + n t / N \right)} f(m, n) \qquad (14)$$
where $\omega$ is a unit pure quaternion, and $\omega^2 = -1$. $M$ and $N$ denote the height and width of the image, respectively.
The inverse transformation of the HIFT is defined as follows:
$$f(m, n) = \frac{1}{\sqrt{MN}} \sum_{s=0}^{M-1} \sum_{t=0}^{N-1} e^{\omega 2 \pi \left( m s / M + n t / N \right)} F_H(s, t) \qquad (15)$$
The HIFT possesses the capability to capture both local frequency and phase information in infrared images. Therefore, it addresses the limitation of the traditional Fourier transform, which can only capture global frequency information. The HIFT enables a more accurate representation of local signal features and enhances the robustness of the network when dealing with nonstationary signals in infrared small-target detection tasks. The effectiveness of the high-frequency characteristic extraction module in processing infrared images is shown in Figure 3. This module is mainly applied in the detection module, which will be further introduced in Section 3.3.
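A faithful HIFT implementation requires a quaternion Fourier transform; as a simplified stand-in, the sketch below applies an ordinary 2D FFT to each quaternion component separately and keeps only the high-frequency band before inverting. It only illustrates how the high-frequency content of small targets can be isolated in the frequency domain; it is not the exact transform of Equations (14) and (15), and the cutoff is an assumption.

```python
import numpy as np

def high_frequency_response(channels, cutoff=0.08):
    """Channel-wise FFT high-pass: channels has shape (C, H, W); returns the same shape."""
    C, H, W = channels.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    keep = np.sqrt(fy ** 2 + fx ** 2) > cutoff          # suppress low-frequency background
    out = np.empty((C, H, W), dtype=np.float64)
    for c in range(C):
        spec = np.fft.fft2(channels[c])
        out[c] = np.real(np.fft.ifft2(spec * keep))
    return out

# stand-ins for the quaternion components f1, f2, f3 built in the previous sketch
feats = np.random.rand(3, 128, 128)
hf = high_frequency_response(feats)
saliency = np.abs(hf).sum(axis=0)                        # aggregate high-frequency energy
```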

3.3. Detection Module

The detection module consists of four main structures: an encoder, skipping connections, a high-frequency-characteristic-driven module, and a decoder, as shown in Figure 4.
Encoder: In the encoder part of the network, we build upon the traditional U-Net [31] architecture. This consists of multiple convolutional and pooling layers for feature extraction from the input image while progressively reducing the size of the feature map. Inspired by the Dense Nested Net [28] approach, we adopt a nested network structure consisting of two U-Net subnetworks with varying depths. This design enables the module to capture high-level features while preserving the representation of small targets within the deep network.
Skip connections: The skip connections are introduced to establish connections between the encoder and the decoder. These structures facilitate the fusion of high-level and low-level features, enlarging the receptive field of the feature map and improving the localization of small targets. Within the skip connection paths, we create multiple sub-nodes. All of these intermediate nodes exhibit dense connectivity, forming a nested-shaped network structure. As illustrated in Figure 5, each node has the capacity to receive features from surrounding nodes. In particular, nonadjacent nodes conduct skip connections in the same feature layer. Consequently, the representation of small targets is still preserved in the deeper neural network layers, yielding improved outcomes.
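The following PyTorch sketch shows what one internal node of Figure 5 might look like: it concatenates the same-level features (including skip connections from earlier, nonadjacent nodes), an upsampled coarser-level feature, and a downsampled finer-level feature, and fuses them with a convolution. The class name and channel counts are illustrative, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedNode(nn.Module):
    """One internal node: fuse same-level, finer (upper), and coarser (lower) features."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, same_level_feats, finer_feat=None, coarser_feat=None):
        h, w = same_level_feats[0].shape[-2:]
        feats = list(same_level_feats)
        if coarser_feat is not None:                     # lower layer: upsample to this level
            feats.append(F.interpolate(coarser_feat, size=(h, w), mode="bilinear",
                                       align_corners=False))
        if finer_feat is not None:                       # upper layer: downsample to this level
            feats.append(F.max_pool2d(finer_feat, kernel_size=2))
        return self.fuse(torch.cat(feats, dim=1))

# toy usage: two same-level maps (32 ch each), one finer (16 ch), one coarser (64 ch)
node = NestedNode(in_channels=32 * 2 + 16 + 64, out_channels=32)
same = [torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)]
out = node(same, finer_feat=torch.randn(1, 16, 128, 128), coarser_feat=torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 32, 64, 64])
```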
High-frequency-characteristic-driven module: In each feature extraction and fusion pathway, we apply a high-frequency-characteristic-driven module, as illustrated in Figure 6. This module consists of two parts (the frequency-domain part and the spatial-domain part). The high-frequency characteristic extraction module, as detailed in Section 3.2, is designed to extract frequency-domain features that propagate among different nodes.
$$f_{T_{\mathrm{out}}} = F_H(f_{\mathrm{in}}) \qquad (16)$$
Inspired by CBAM [32], the attention mechanism, which includes both the channel attention and the spatial attention, is used to extract features in the spatial domain. The channel attention dynamically adjusts the feature weights in the channel dimension, allowing the network to focus on more relevant and informative channels. The spatial attention dynamically adjusts the feature weights in the spatial dimension, enabling the network to focus on important regions within the feature maps.
$$f_{S_{\mathrm{out}}} = A(f_{\mathrm{in}}) \qquad (17)$$
In Figure 6, we integrate dual-domain features:
$$f_D = f_{T_{\mathrm{out}}} \oplus f_{S_{\mathrm{out}}} \qquad (18)$$
where $f_{\mathrm{in}}$ denotes the input feature, $f_{T_{\mathrm{out}}}$ is the frequency-domain portion of the output feature, and $f_{S_{\mathrm{out}}}$ is the spatial portion of the output feature. $A(\cdot)$ indicates the attention extraction of spatial features. The fused dual-domain features are denoted as $f_D$, and $\oplus$ denotes the channel-level image fusion.
This module enables the network to acquire inherent features in the dual domain and dynamically adjusts the feature weights of different convolutional layers based on the input data characteristics. Figure 7 illustrates the feature map of each sub-node when employing the high-frequency-characteristic-driven module. It can be seen that the features of small targets are fully preserved at both high and low levels. This validates the effectiveness of the module in capturing and retaining crucial information related to small targets.
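A hedged PyTorch sketch of the dual-domain block in Figure 6 is given below. The frequency pathway uses an FFT-based high-pass filter as a simplified stand-in for the HIFT of Section 3.2, the spatial pathway uses CBAM-style channel and spatial attention, and the two outputs are concatenated channel-wise in the spirit of Equation (18). Layer sizes and the cutoff are assumptions.

```python
import torch
import torch.nn as nn

class DualDomainBlock(nn.Module):
    """Frequency pathway (FFT high-pass stand-in) + spatial pathway (CBAM-like attention)."""
    def __init__(self, channels, cutoff=0.08, reduction=4):
        super().__init__()
        self.cutoff = cutoff
        self.channel_att = nn.Sequential(                     # channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial_att = nn.Sequential(                     # spatial attention
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def frequency_path(self, x):                              # f_T_out, simplified
        spec = torch.fft.fft2(x)
        fy = torch.fft.fftfreq(x.shape[-2], device=x.device)[:, None]
        fx = torch.fft.fftfreq(x.shape[-1], device=x.device)[None, :]
        keep = (fy ** 2 + fx ** 2).sqrt() > self.cutoff
        return torch.fft.ifft2(spec * keep.to(x.dtype)).real

    def spatial_path(self, x):                                # f_S_out = A(f_in)
        x = x * self.channel_att(x)
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return x * self.spatial_att(pooled)

    def forward(self, x):                                     # f_D: channel-level fusion
        return torch.cat([self.frequency_path(x), self.spatial_path(x)], dim=1)

block = DualDomainBlock(channels=32)
print(block(torch.randn(2, 32, 64, 64)).shape)  # torch.Size([2, 64, 64, 64])
```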
Decoder: The decoder has the opposite structure to the encoder, which gradually increases the size of the feature maps by employing multiple upsampling operations. After each upsampling operation, the decoder also performs feature fusion with the corresponding high-frequency-characteristic-driven module to strengthen the feature information.

4. Results

4.1. Dataset

We used the SIRST dataset [33] to validate our network. The SIRST dataset is a single-frame infrared small-target dataset containing 427 infrared images from various scenes, with over 10% of the images containing multiple targets. Many of the infrared small targets in SIRST are extremely dim and hidden in complex backgrounds; approximately 55% of the targets occupy less than 0.02% of the image area. Moreover, only 35% of the targets in the dataset contain the brightest pixels of their images. Representative images from the SIRST dataset are shown in Figure 8.

4.2. Implementation Details

We utilized the dataset described in Section 4.1 for training. All images of different sizes were first cropped to a resolution of 256 × 256 and normalized. The U-Net paradigm with ResNet [34] was chosen as our deep feature extraction backbone. The number of downsampling layers was set to four ($i = 4$) in our network. Our network was trained using the Soft-IoU loss function and optimized with the Adam method [35]. CosineAnnealingLR was used as the learning rate decay strategy. We initialized the model’s weights and biases using the Xavier method [36]. The batch size was set to four, and the learning rate was set to 0.001. All experiments were implemented in PyTorch on a computer equipped with an NVIDIA GeForce RTX 3090 Ti GPU.
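The training configuration above can be reproduced in outline with the sketch below. The Soft-IoU loss, Xavier initialization, optimizer, scheduler, batch size, and learning rate follow the description in this subsection; the tiny stand-in model and the random tensors are placeholders for DPDNet and the SIRST data loader.

```python
import torch
import torch.nn as nn

def soft_iou_loss(logits, target, eps=1e-6):
    """Soft-IoU: 1 minus the soft intersection-over-union of the predicted probability map."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return 1.0 - ((inter + eps) / (union + eps)).mean()

def xavier_init(module):
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# tiny stand-in for DPDNet; random tensors stand in for cropped 256x256 SIRST batches
model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 3, padding=1))
model.apply(xavier_init)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(2):                                    # a couple of epochs for illustration
    for _ in range(4):
        images = torch.rand(4, 1, 256, 256)               # batch size of four
        masks = (torch.rand(4, 1, 256, 256) > 0.995).float()
        optimizer.zero_grad()
        loss = soft_iou_loss(model(images), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()
```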

4.3. Evaluation Metrics

Pioneering CNN-based works [11,32,34] mainly use pixel-level evaluation metrics such as IoU, precision, and recall. However, overall target localization is the most important criterion for SIRST detection [28]. To better evaluate the proposed network, we used the probability of detection and the false alarm rate to evaluate localization ability, and we used IoU to evaluate shape description ability.
Intersection over union (IoU): IoU is a pixel-level evaluation metric for detection accuracy, which is calculated by the ratio of the intersection area and the union area between the predicted values and the ground-truth labels. The calculation formula is expressed as follows:
$$IoU = \frac{\sum_{i}^{N} TP_i}{\sum_{i}^{N} \left( T_i + P_i - TP_i \right)} \qquad (19)$$
where $N$ is the number of targets in the test set, $TP$ denotes the true positive prediction, $T$ denotes the small-target region in the ground-truth label, and $P$ denotes the small-target region predicted by the model.
The ROC curve is used to show the trade-off between the true positive rate and the false positive rate at various classification thresholds.
$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad (20)$$
$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (21)$$
where $FN$ denotes the number of false negative predictions, $FP$ stands for the number of false positive predictions, and $TN$ represents the number of true negative predictions.
Probability of detection and false alarm rate (Pd and Fa): These are object-level evaluation metrics. Pd measures the ratio of correctly predicted targets to the total number of targets, while Fa measures the ratio of incorrectly predicted pixels to all pixels in the image. Pd and Fa are defined as follows:
$$P_d = \frac{T_{\mathrm{correct}}}{T_{\mathrm{all}}} \qquad (22)$$
$$F_a = \frac{P_{\mathrm{false}}}{P_{\mathrm{all}}} \qquad (23)$$
where $T_{\mathrm{correct}}$ is the number of correctly detected targets, and $T_{\mathrm{all}}$ is the number of all targets. $P_{\mathrm{false}}$ is the number of falsely predicted pixels, and $P_{\mathrm{all}}$ is the total number of image pixels. If the centroid deviation of a predicted target is larger than the maximum allowed deviation, we consider its pixels as falsely predicted ones. We set the maximum centroid deviation to three pixels in this paper.
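The sketch below computes these metrics for binary prediction and ground-truth masks: pixel-level IoU accumulated over the test set as in Equation (19), and object-level Pd/Fa with a maximum centroid deviation of three pixels. It uses SciPy’s connected-component labelling and a simplified matching rule (each predicted blob is matched to its nearest ground-truth centroid), so it mirrors the definitions above rather than reproducing the authors’ exact evaluation code.

```python
import numpy as np
from scipy.ndimage import label, center_of_mass

def dataset_iou(preds, gts):
    """Pixel-level IoU accumulated over lists of boolean masks (Equation (19))."""
    tp = sum((p & g).sum() for p, g in zip(preds, gts))
    denom = sum(p.sum() + g.sum() - (p & g).sum() for p, g in zip(preds, gts))
    return tp / max(denom, 1)

def pd_fa(pred, gt, max_dev=3.0):
    """Object-level Pd/Fa for one image (Equations (22)-(23)), simplified matching."""
    gt_lbl, n_gt = label(gt)
    pr_lbl, n_pr = label(pred)
    gt_centroids = center_of_mass(gt, gt_lbl, range(1, n_gt + 1)) if n_gt else []
    correct, false_pixels = 0, 0
    for k in range(1, n_pr + 1):
        cy, cx = center_of_mass(pred, pr_lbl, k)
        dists = [np.hypot(cy - gy, cx - gx) for gy, gx in gt_centroids]
        if dists and min(dists) <= max_dev:
            correct += 1                               # prediction close enough to a true target
        else:
            false_pixels += int((pr_lbl == k).sum())   # every pixel of this blob counts as false
    return correct / max(n_gt, 1), false_pixels / pred.size

pred = np.zeros((256, 256), dtype=bool); pred[100:103, 100:103] = True
gt = np.zeros((256, 256), dtype=bool); gt[101:104, 101:104] = True
print(dataset_iou([pred], [gt]), pd_fa(pred, gt))
```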

4.4. Effect Verification

To validate the performance of the proposed method, we compared it with pure model-driven methods (Top-Hat [4], WSLCM [37], and IPI [3]) and previous CNN-based methods (MDvsFA-cGAN [38], ACM [33], ALCNet [11], and DNANet [28]). The results are shown in Table 1.
For all compared traditional methods (Top-Hat [4], WSLCM [37], and IPI [3]), we first obtained prediction values and then suppressed noise by setting a threshold. The adaptive thresholds for the traditional methods were calculated using Equation (24). For the CNN-based approaches (MDvsFA-cGAN [38], ACM [33], ALCNet [11], and DNANet [28]), we followed their original papers and employed their fixed thresholds.
$$T = \max \left( Max_P \times 0.7,\ \sigma_P \times 0.5 + Avg_P \right) \qquad (24)$$
where $Max_P$ denotes the maximum value of the output, $\sigma_P$ denotes the standard deviation of the output, and $Avg_P$ denotes the average value of the output.
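Written out directly from Equation (24), the adaptive threshold used for the traditional methods looks like this (the random map stands in for a method’s output):

```python
import numpy as np

def adaptive_threshold(pred_map):
    """Equation (24): T = max(Max_P * 0.7, sigma_P * 0.5 + Avg_P)."""
    return max(pred_map.max() * 0.7, pred_map.std() * 0.5 + pred_map.mean())

pred_map = np.random.rand(256, 256)          # stand-in output of a traditional detector
mask = pred_map > adaptive_threshold(pred_map)
```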
Compared with the traditional methods, our method achieved significant improvements. This is because the traditional methods are usually designed for specific scenarios (e.g., specific target size and cluttered background), and manually selected parameters limit their generalization performance.
Similarly, our method outperformed the previous best CNN-based method by 2.4% in the IoU, performed well in the evaluation of the Fa, and maintained a high level of the Pd. In addition, as shown in Figure 9, the proposed DPDNet achieved a higher true positive rate at the same false positive rate. This is because we embedded physical modules in the deep network based on physical prior knowledge. These physical modules guide the extraction of target features and reduce the network’s dependence on structural complexity.
Interestingly, it can be seen in Table 2 that increasing the depth of the feature extraction network by stacking convolutional blocks improved the detection performance to some extent. However, our method using a shallow network still achieved a higher value of the IoU than SOTA DNANet-ResNet34 [28]. This merit can be attributed to the combination of data-driven and model-driven approaches in our method, which enables the network to obtain more physical information about the target features during the feature extraction. This proves that embedding knowledge into the deep network can compensate for the limitations of reducing model complexity. Thanks to the utilization of a shallower neural network architecture, the proposed DPDNet has significantly fewer parameters compared with previous CNN-based methods.

4.5. Ablation Study

To demonstrate the effectiveness of each module in our proposed method, we conducted ablation studies. We used ResNet10 and Dense-Net as shared modules and evaluated the impacts of the sparse-characteristic-driven module and the high-frequency-characteristic-driven module separately, as shown in Table 3. The results show that integrating both the sparse-characteristic-driven module and the high-frequency-characteristic-driven module led to a significant improvement in the IoU compared with the baseline. Moreover, the Fa value decreased remarkably when the two modules were used simultaneously.
Table 4 shows the effectiveness of integrating the high-frequency-characteristic-driven module in different feature layers. Figure 10 shows the definition of the feature layers. As more modules are added in deeper layers, the value of IoU gradually improves, while the value of Fa decreases significantly, and the value of Pd remains at a high level. In addition, gradually adding the high-frequency-characteristic-driven module can reduce the number of network parameters, effectively saving computational resources.

5. Discussion

In this section, we focus on the physical interpretability of DPDNet. First, the sparse-characteristic-driven module enhances the contrast between the targets and the background information, which helps the network to capture stronger physical features during training. Second, the dual-domain prior-driven module preserves the physical interpretability of targets in different feature layers, which promotes the acquisition of meaningful outcomes during the training process. Owing to these two aspects, our approach has higher physical interpretability and is more reliable than other deep neural networks.

5.1. Physics Explanation of $\bar{x}$

Detecting small targets in infrared images is a challenging task due to the unclear representation of the targets and the poor contrast between targets and the background [39]. Figure 11 shows the 3D visualization results of the original images and $\bar{x}$. It shows that the values of the target pixels in $\bar{x}$ are more prominent compared with the background information. Consequently, in the subsequent feature extraction processes, the target information in $\bar{x}$ is better preserved.
In Section 3.1, we described the equation of $\bar{x}$, where $w_0$, $w_1$, and $w_2$ are weight coefficients. Theoretically, increasing $w_2$ could make the target representation more prominent in $\bar{x}$. However, we found that setting $w_2$ too large led to overfitting of the network. Therefore, we chose $w_0 = \frac{1}{2}$, $w_1 = \frac{0.5}{4}$, and $w_2 = \frac{1.5}{4}$ to balance the target representation and the overfitting issues.

5.2. Physical Interpretability Discussion

In deep learning models, understanding the internal workings is a great challenge due to their high complexity and black-box nature. The activation map provides visual demonstrations of the regions in the feature maps that guide detection. The first row of Figure 12 shows the feature maps propagated among the mesh nodes of DPDNet. The second row shows the corresponding activation maps of these nodes. It can be observed that the target regions are well activated and utilized during the feature propagation, which effectively explains the high performance on the values of IoU, Pd, and Fa achieved by our network.

6. Conclusions

In this paper, we propose a novel dual-domain prior-driven deep network for infrared small-target detection. DPDNet consists of three key modules: the sparse-characteristic-driven module, the high-frequency-characteristic-driven module, and the main detection module. The sparse-characteristic-driven module utilizes spatial-domain prior physical knowledge to guide the learning and decision-making process. The high-frequency-characteristic-driven module incorporates dual-domain knowledge into the feature extraction layer to enhance the representation of small targets. The main detection module efficiently exploits the intrinsic information of small targets by repeated fusion and enhancement. Our research demonstrates the significant impact of embedding physical knowledge into deep learning on the detection performance and the interpretability of infrared small-target detection. The experimental results on the SIRST dataset show that DPDNet achieves superior performance compared with previous SOTA CNN-based methods. Moreover, our proposed method is a lightweight network. By decreasing the depth of the convolutional layers, the network has only 1.84 million parameters.

Author Contributions

Conceptualization, Y.L.; methodology, Y.H.; software, Y.H.; validation, Y.H.; writing—original draft preparation, Y.H.; writing—review and editing, Y.H., J.Z. and C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by [Infrared vision theory and method] grant number [2023-JCJQ-ZD-011-12] and the APC was funded by [Infrared vision theory and method].

Data Availability Statement

The data presented in this study are openly available in SIRST at 10.1109/WACV48630.2021.00099, reference number [33].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Song, Q.; Wang, Y.; Dai, K.; Bai, K. Single frame infrared image small target detection via patch similarity propagation based background estimation. Infrared Phys. Technol. 2020, 106, 103197. [Google Scholar] [CrossRef]
  2. Fan, J.; Wei, J.; Huang, H.; Zhang, D.; Chen, C. IRSDT: A Framework for Infrared Small Target Tracking with Enhanced Detection. Sensors 2023, 23, 4240. [Google Scholar] [CrossRef] [PubMed]
  3. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the SPIE’s International Symposium on Optical Science, Engineering, and Instrumentation, Denver, CO, USA, 4 October 1999; pp. 74–83. [Google Scholar] [CrossRef]
  4. Rivest, J.F.; Fortin, R. Detection of dim targets in digital infrared imagery by morphological image processing. Opt. Eng. 1996, 35, 1886–1893. [Google Scholar] [CrossRef]
  5. Wei, Y.; You, X.; Li, H. Multiscale patch-based contrast measure for small infrared target detection. Pattern Recognit. 2016, 58, 216–226. [Google Scholar] [CrossRef]
  6. Chen, C.L.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A Local Contrast Method for Small Infrared Target Detection. IEEE Trans. Geosci. Remote Sens. 2014, 52, 574–581. [Google Scholar] [CrossRef]
  7. Dai, Y.; Wu, Y. Reweighted Infrared Patch-Tensor Model With Both Nonlocal and Local Priors for Single-Frame Small Target Detection. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2017, 10, 3752–3767. [Google Scholar] [CrossRef]
  8. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared Patch-Image Model for Small Target Detection in a Single Image. IEEE Trans. Image Process. 2013, 22, 4996–5009. [Google Scholar] [CrossRef] [PubMed]
  9. Liu, M.; Du, H.; Zhao, Y.; Dong, L.; Hui, M. Image Small Target Detection based on Deep Learning with SNR Controlled Sample Generation. In Current Trends in Computer Science and Mechanical Automation; Wang, S.X., Ed.; De Gruyter Open Ltd.: Warsaw, Poland, 2017; Volume 1, pp. 211–220. Available online: https://www.webofscience.com/wos/alldb/full-record/WOS:000594997600023 (accessed on 6 May 2023).
  10. Hou, Q.; Wang, Z.; Tan, F.; Zhao, Y.; Zheng, H.; Zhang, W. RISTDnet: Robust Infrared Small Target Detection Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 7000805. [Google Scholar] [CrossRef]
  11. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional Local Contrast Networks for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824. [Google Scholar] [CrossRef]
  12. Lu, Z.; Zhi-Yong, Z.; Shan-Zhu, X.; Huan-Zhang, L.U. Detection of Dim Infrared Targets by Multi-Direction Prediction of Background. J. Signal Process. 2010, 26, 1646–1651. [Google Scholar]
  13. Bingwei, H.; Zhiyong, S.; Hongqi, F.; Ping, Z.; Weidong, H.; Xiaofeng, Z.; Jianguo, L.; Hongyan, S.; Wei, J.; Yongjie, Z.; et al. A Dataset for Infrared Image Dim-Small Aircraft Target Detection and Tracking Under Ground/Air Background; Science Data Bank: Beijing, China, 2019; p. 912518825. [Google Scholar] [CrossRef]
  14. Huang, Z.; Yao, X.; Liu, Y.; Dumitru, C.O.; Datcu, M.; Han, J. Physically Explainable CNN for SAR Image Classification. ISPRS J. Photogramm. Remote Sens. 2022, 190, 25–37. [Google Scholar] [CrossRef]
  15. Xia, C.; Li, X.; Zhao, L.; Shu, R. Infrared Small Target Detection Based on Multiscale Local Contrast Measure Using Local Energy Factor. IEEE Geosci. Remote Sens. Lett. 2020, 17, 157–161. [Google Scholar] [CrossRef]
  16. Chen, Z.; Chen, S.; Zhai, Z.; Zhao, M.; Jie, F.; Li, W. Infrared small-target detection via tensor construction and decomposition. Remote Sens. Lett. 2021, 12, 900–909. [Google Scholar] [CrossRef]
  17. Zhang, X.; Ding, Q.; Luo, H.; Hui, B.; Chang, Z. Infrared small target detection based on an image-patch tensor model. Infrared Phys. Technol. 2019, 99, 55–63. [Google Scholar] [CrossRef]
  18. Beckh, K.; Müller, S.; Jakobs, M.; Toborek, V.; Tan, H.; Fischer, R.; Welke, P.; Houben, S.; von Rueden, L. Explainable Machine Learning with Prior Knowledge: An Overview. arXiv 2021, arXiv:2105.10172. [Google Scholar]
  19. Karniadakis, G.E.; Kevrekidis, I.G.; Lu, L.; Perdikaris, P.; Wang, S.; Yang, L. Physics-informed machine learning. Nat. Rev. Phys. 2021, 3, 6. [Google Scholar] [CrossRef]
  20. Wu, L.; Fang, S.; Ma, Y.; Fan, F.; Huang, J. Infrared small target detection based on gray intensity descent and local gradient watershed. Infrared Phys. Technol. 2022, 123, 104171. [Google Scholar] [CrossRef]
  21. Li, F.; Liu, S.; Qin, H. Dim Infrared Targets Detection Based on Adaptive Bilateral Filtering. Guangzi Xuebao/Acta Photonica Sin. 2010, 39, 1129–1131. [Google Scholar] [CrossRef]
  22. Wang, F.; Li, C.; Wu, B.; Yu, K.; Jin, C. Infrared small target detection method based on multi-scale feature fusion. J. Phys. Conf. Ser. 2021, 2024, 012012. [Google Scholar] [CrossRef]
  23. Zhang, X.; Ru, J.; Wu, C. Infrared Small Target Detection Based on Gradient Correlation Filtering and Contrast Measurement. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5603012. [Google Scholar] [CrossRef]
  24. Zhang, Y.; Zhang, J.; Wang, D.; Chen, C. Infrared small target detection based on morphology and wavelet transform. In Proceedings of the 2011 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), Zhengzhou, China, 8–10 August 2011; pp. 4033–4036. [Google Scholar] [CrossRef]
  25. Fan, Z.; Bi, D.; Xiong, L.; Ma, S.; He, L.; Ding, W. Dim infrared image enhancement based on convolutional neural network. Neurocomputing 2018, 272, 396–404. [Google Scholar] [CrossRef]
  26. IR Image Small Target Detection Based on Multi-Scale Feature Fusion. Available online: https://www.webofscience.com/wos/alldb/full-record/CSCD:2104565 (accessed on 6 May 2023).
  27. Zhang, Y.; Zhang, Y.; Shi, Z.; Zhang, J.; Wei, M. Design and Training of Deep CNN-Based Fast Detector in Infrared SUAV Surveillance System. IEEE Access 2019, 7, 137365–137377. [Google Scholar] [CrossRef]
  28. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense Nested Attention Network for Infrared Small Target Detection. arXiv 2022, arXiv:2106.00487. [Google Scholar] [CrossRef]
  29. Ell, T.A. Quaternion-Fourier transforms for analysis of two-dimensional linear time-invariant partial differential systems. In Proceedings of the 32nd IEEE Conference on Decision and Control, San Antonio, TX, USA, 15–17 December 1993; Volume 2, pp. 1830–1841. [Google Scholar] [CrossRef]
  30. Zhang, R.; Mu, C.; Xu, M.; Xu, L.; Shi, Q.; Wang, J. Synthetic IR Image Refinement Using Adversarial Learning With Bidirectional Mappings. IEEE Access 2019, 7, 153734–153750. [Google Scholar] [CrossRef]
  31. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar] [CrossRef]
  32. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar] [CrossRef]
  33. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric Contextual Modulation for Infrared Small Target Detection. arXiv 2020, arXiv:2009.14530. [Google Scholar] [CrossRef]
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar] [CrossRef]
  35. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  36. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; Available online: https://www.semanticscholar.org/paper/Understanding-the-difficulty-of-training-deep-Glorot-Bengio/b71ac1e9fb49420d13e084ac67254a0bbd40f83f (accessed on 6 May 2023).
  37. Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Small Target Detection Based on the Weighted Strengthened Local Contrast Measure. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1670–1674. [Google Scholar] [CrossRef]
  38. Miss Detection vs. False Alarm: Adversarial Learning for Small Object Segmentation in Infrared Images. IEEE Conference Publication. IEEE Xplore. Available online: https://ieeexplore.ieee.org/document/9009584 (accessed on 6 May 2023).
  39. Kwan, C.; Budavari, B. Enhancing Small Moving Target Detection Performance in Low-Quality and Long-Range Infrared Videos Using Optical Flow Techniques. Remote Sens. 2020, 12, 24. [Google Scholar] [CrossRef]
Figure 1. The workflow of the proposed dual-domain prior-driven deep network (DPDNet). The targets in $\bar{x}$ become more salient after being processed by the sparse-characteristic-driven module. The high-frequency-characteristic-driven module is embedded into the main detection network to extract and fuse dual-domain features of the targets.
Figure 2. The comparison results between the original images and the corresponding $\bar{x}$ generated by the sparse-characteristic-driven module. The small targets are more discernible to the naked eye in $\bar{x}$. The targets in each image are highlighted by the red dotted squares.
Figure 3. Examples showing the outputs of the high-frequency characteristic extraction module. The first line shows the original images. The second line shows the output after applying Equation (15).
Figure 4. An illustration of the detection module: The input image is downsampled to different resolutions by the encoder. There are densely distributed nodes in the skip connection module, where the high-frequency-characteristic-driven module is inserted and used to repeatedly fuse the multilayer features. Finally, the feature resolution is restored by the decoder, and the targets are predicted.
Figure 5. An illustration of connections among nodes in the network: It mainly consists of two U-Nets with different numbers of layers, and each internal node fuses the features of the layer from the same level as well as the upper and lower layers. In the same layer, nonadjacent nodes conduct skip connections, as shown by the dotted line in the figure.
Figure 6. An illustration of the high-frequency-characteristic-driven module in the network; it consists of two parts: The input feature map is propagated through the spatial-domain pathway and the frequency pathway. Then, the features of the two paths are integrated.
Figure 7. Visualization results of feature maps propagated among sub-nodes after the high-frequency-characteristic-driven module.
Figure 8. (a–x) are representative infrared images from the SIRST dataset with different backgrounds. For better display, the demarcated areas are enlarged in the red squares [33].
Figure 9. The comparison results between the proposed DPDNet and previous CNN-based methods. The DPDNet achieves a higher true positive rate at the same false positive rate.
Figure 10. The definition of the feature layers.
Figure 11. The 3D visualization results of the original images and $\bar{x}$. The second and fourth lines show the pixel values of the original images and $\bar{x}$, respectively. The target pixels are more distinguishable from the background pixels in $\bar{x}$ than those in the original images.
Figure 12. The feature maps and their corresponding activation maps. The target regions are well activated and utilized during the feature propagation.
Table 1. The comparison results of different methods in terms of IoU, Pd, and Fa metrics on the SIRST dataset. For IoU and Pd, larger values indicate higher performance. For Fa, smaller values indicate higher performance.
Method | IoU (×10²) | Pd (×10²) | Fa (×10⁻⁶)
(Tr = 50%)
Filtering-Based: Top-Hat [4] | 7.143 | 79.84 | 1012
Local-Contrast-Based: WSLCM [37] | 1.158 | 77.95 | 5446
Local-Rank-Based: IPI [3] | 25.67 | 85.55 | 11.47
CNN-Based: MDvsFA-cGAN [38] | 60.3 | 89.35 | 56.35
CNN-Based: ACM [33] | 70.33 | 93.91 | 3.728
CNN-Based: ALCNet [11] | 73.33 | 96.57 | 30.47
CNN-Based: DNANet [28] | 76.24 | 97.71 | 12.8
DPDNet (Ours) | 78.64 | 95.56 | 2.15
Table 2. The performance comparison between the previous SOTA data-driven methods with more convolutional layers and the DPDNet with fewer layers.
Method | IoU (×10²) | Pd (×10²) | Fa (×10⁻⁶) | #Params (M)
(Tr = 50%)
DNANet-ResNet10 [28] | 76.24 | 97.71 | 12.8 | 2.61
DNANet-ResNet18 [28] | 77.47 | 98.48 | 2.35 | 4.7
DNANet-ResNet34 [28] | 77.54 | 98.1 | 2.51 | 8.79
DPDNet-ResNet10 (Ours) | 78.64 | 95.56 | 2.15 | 1.81
Table 3. The verification results for the sparse-characteristic-driven module and the high-frequency-characteristic-driven module.
Shared Module | Sparse-Characteristic-Driven Module | High-Frequency-Characteristic-Driven Module | IoU (×10²) | Pd (×10²) | Fa (×10⁻⁶)
(Tr = 50%)
ResNet10 and Dense-Net | – | – | 76.24 | 97.71 | 12.8
ResNet10 and Dense-Net | ✓ | – | 78.18 | 94.44 | 5.09
ResNet10 and Dense-Net | – | ✓ | 78.53 | 95.19 | 6.23
ResNet10 and Dense-Net | ✓ | ✓ | 78.64 | 95.56 | 2.15
Table 4. The verification results of adding the high-frequency-characteristic-driven module in different feature layers.
Method Description | IoU (×10²) | Pd (×10²) | Fa (×10⁻⁶) | #Params (M)
(Tr = 50%)
Layer 0 | 77.08 | 94.81 | 7.02 | 2.6
Layer 0, 1 | 78.04 | 94.81 | 4.8 | 2.5
Layer 0, 1, 2 | 78.2 | 96.3 | 0.72 | 2.27
Layer 0, 1, 2, 3 | 78.64 | 95.56 | 2.15 | 1.84
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
