Article

Lightweight Multi-Scale Feature Fusion Network for Salient Object Detection in Optical Remote Sensing Images

College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu 610059, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(1), 8; https://doi.org/10.3390/electronics14010008
Submission received: 10 December 2024 / Revised: 22 December 2024 / Accepted: 23 December 2024 / Published: 24 December 2024

Abstract

Salient object detection in optical remote sensing images (ORSI-SOD) encounters notable challenges, mainly because of the small scale of salient objects and the similarity between these objects and their backgrounds in images captured by satellite and aerial sensors. Conventional approaches frequently struggle to efficiently leverage multi-scale and multi-stage features. Moreover, these methods usually rely on sophisticated and resource-heavy architectures, which can limit their practicality and efficiency in real-world applications. To overcome these limitations, this paper proposes a novel lightweight network called the Multi-scale Feature Fusion Network (MFFNet). Specifically, a Multi-stage Information Fusion (MIF) module is created to improve the detection of salient objects by effectively integrating features from multiple stages and scales. Additionally, we design a Semantic Guidance Fusion (SGF) module to specifically alleviate the problem of semantic dilution often observed in U-Net architecture. Comprehensive evaluations on two benchmark datasets show that the MFFNet attains outstanding performance in four out of eight evaluation metrics while only having 12.14M parameters and 2.75G FLOPs. These results highlight significant advancements over 31 state-of-the-art models, underscoring the efficiency of MFFNet in salient object-detection tasks.

1. Introduction

Salient object detection (SOD) endeavors to emulate the human visual attention mechanism, precisely delineating and segmenting the most visually salient objects within images [1]. As a crucial step in image preprocessing, SOD has been successfully applied across numerous image processing fields, including object tracking [2], object segmentation [3], image retargeting [4], and image quality assessment [5]. The swift progress of deep learning has enabled significant advancements in salient object detection for natural scene images (NSI-SOD). Recently, salient object detection in optical remote sensing images (ORSI-SOD) has emerged as a new research topic, attracting considerable attention from researchers. Unlike NSIs, ORSIs are typically acquired by satellites, drones, or other flying devices with a bird's-eye view. Images obtained from these platforms are often captured at high altitudes, exhibiting more pronounced scale variations, more complex background information, and larger image dimensions. Consequently, existing NSI-SOD models cannot be directly applied to ORSI-SOD tasks. Therefore, designing specialized ORSI-SOD models remains a pressing and challenging task.
Traditional SOD methods primarily rely on hand-crafted features, including image contrast, brightness, and color. While these methods can be effective in certain scenarios, their performance frequently fails to meet the desired standards when dealing with scenes of considerable complexity, especially when backgrounds and objects have similar brightness and color features or when objects are partially occluded. This limitation arises mainly because these methods fail to fully exploit multi-scale information and high-level semantic features within images. In the deep learning era, researchers have proposed a large number of deep-learning-based SOD methods, particularly for NSI, leading to significant improvements in detection accuracy. Notably, among the various methods, the classic encoder–decoder architecture [6] has emerged as the most universal and efficacious structure. Models built on this architecture often incorporate various strategies to enhance performance, such as edge assistance [7], gating mechanisms [8], progressive architectures [9], and deep supervision [10]. Even though NSI-SOD models cannot directly address the complexities of ORSI environments, the strategies employed in these models lay a solid foundation for designing specialized ORSI-SOD models.
ORSIs are characterized by high resolution, rich content, and broad coverage, providing precise data support for resource management. They find extensive applications in areas like military reconnaissance [11] and land planning [12]. Existing methods specifically designed for ORSI-SOD consider the unique attributes of salient objects and scenarios in ORSIs. For instance: LVNet [13] utilizes a dual-stream pyramid structure to obtain complementary features, thereby enhancing its capability to perceive objects at varying scales and capture localized features. MCCNet [14] employs foreground features, boundary features, background features, and global image-level features from RSI, using their complementary content to highlight salient regions. ACCoNet [15] introduces a new adjacent context coordination network to investigate the coordination among adjacent features within the U-shaped network of ORSI-SOD, thereby maximizing the utilization of contextual information in ORSI.
In spite of the advances made by current methodologies in ORSI-SOD, challenges remain concerning the quantity of parameters and computational load. Current approaches frequently contain numerous parameters and are computationally intensive, posing challenges for deployment on mobile devices. As shown in Figure 1, six leading ORSI-SOD methods (SARNet [16], DAFNet [17], MJRBM [18], ERPNet [19], RRNet [20], and EMFINet [21]), and nine lightweight SOD methods (CSNet [22], SAMNet [23], MSCNet [24], FSMINet [25], CorrNet [26], SeaNet [27], MEANet [28], CSFFNet [29], and SAFINet [30]) were evaluated on the ORSI dataset. Cutting-edge ORSI-SOD methodologies deliver impressive performance, but this often comes with substantial computational demands, rendering them unsuitable for deployment on mobile devices. In practical applications, certain scenarios require real-time responses to changes in user input, which lightweight models can provide more efficiently on mobile terminals. Figure 1 also demonstrates that the nine lightweight SOD methods aim to address computational inefficiency but, unfortunately, often at the expense of some performance. This highlights the primary challenge currently faced by ORSI-SOD: attaining an ideal equilibrium between accuracy and efficiency. Because ORSI-SOD is a fairly specialized topic, this work may initially appeal to a relatively narrow audience; however, as the algorithm is refined for faster processing speeds and higher sample resolutions, the technique may find broader use in computer vision applications.
Additionally, ORSI images contain complex boundaries and transition regions, which are crucial factors for identifying salient objects against varying backgrounds. Feature maps from shallower network layers have greater resolution and richer detail information. In contrast, feature maps from deeper network layers contain richer contextual information but with reduced detail. To achieve high-precision detection in these images, it is necessary to skillfully integrate multi-stage and multi-scale information. By introducing our Multi-stage Information Fusion (MIF) module, which builds a more comprehensive understanding of the image content, we enable the model to better capture the intricate relationships between objects and their backgrounds. This allows for the precise discrimination of salient objects within the intricate and varied environments of ORSI.
Finally, the multifaceted nature of ORSI is evident, as it includes objects of varying scales and diverse orientations. U-Net architectures are commonly used in SOD because they can balance broad background information and fine detail information through skip connections. Nevertheless, during the upsampling process, the network combines fine low-level detail information with coarse high-level features, which may result in the loss of nuanced details. This leads to misdetections of very small objects, a phenomenon known as semantic dilution. To address this issue, we introduce the Semantic Guidance Fusion (SGF) module to compensate for the salient features lost during upsampling. This module resolves the issue of semantic dilution, ensuring that abundant semantic features from deep layers guide multi-scale feature fusion throughout the upsampling and refinement stages of SOD. Consequently, this enhances the accuracy of detecting small objects.
In summary, to address the common challenges in ORSI-SOD, we propose a novel lightweight multi-scale feature fusion network (MFFNet) that aims to enhance SOD performance while keeping the model lightweight. Built upon an efficient backbone network, MFFNet incorporates innovative modules designed to fuse multi-stage information and guide the fusion process using high-level semantic information. The method employs a Multi-stage Information Fusion (MIF) module to fuse features from different stages and scales, generating attention maps at both spatial and channel levels. Additionally, we present a Semantic Guidance Fusion (SGF) module, ensuring that high-level semantic information serves an essential function in the generation of saliency maps. The key contributions of our study can be summarized as follows.
(1)
Multi-Scale Feature Fusion Network (MFFNet): The MFFNet adopts the lightweight encoder UniFormer-L [31] to extract global dependencies and local features of salient objects, fully leveraging the benefits of convolutional and self-attention mechanisms. This network not only improves the accuracy of ORSI-SOD but also maintains a lightweight model. For images of size 288 × 288, the number of parameters is 12.14M, with FLOPs at 2.75G.
(2)
Multi-stage Information Fusion (MIF) Module: We propose the MIF module, which is used to fuse multi-scale features from stages 1 to 4, generating attention maps in both the spatial and channel levels. The MIF module enhances the model’s capability to integrate salient information across different stages and scales, thereby improving its perception and understanding of salient objects in challenging scenarios.
(3)
Semantic Guidance Fusion (SGF) Module: We introduce the SGF module to handle the problem of semantic dilution, ensuring that high-level features rich in semantic information are used to guide the fusion of multi-stage features during upsampling and refinement for saliency prediction. The SGF module better preserves the integrity of object structures and significantly reduces over-prediction in foreground regions, thereby improving the accuracy of SOD.
The remainder of this paper is structured as follows: Section 2 presents a review of recent work. Section 3 presents a detailed description of our MFFNet. Section 4 demonstrates the superiority of the MFFNet through extensive experiments. Finally, Section 5 outlines the critical outcomes of this study, reaffirms the contributions of the MFFNet and identifies potential research avenues moving forward.

2. Related Work

The goal of salient object detection (SOD) is to identify and obtain visually prominent regions from images. This technology is applicable in various domains, leading to the development of various SOD methods. In this section, we review historic approaches for NSI-SOD and ORSI-SOD, encompassing both traditional methods and deep-learning-based methods. Additionally, we highlight methods that emphasize efficiency in lightweight SOD approaches.

2.1. Traditional SOD Methods

NSI-SOD Methods: Traditional SOD methods primarily depend on hand-crafted features. Itti et al. [32] pioneered the initial computational visual attention model based on a center–surroundings disparity mechanism. Li et al. [33] designed a regularized random walk ranking method, generating pixel-level saliency maps based on background and foreground saliency estimates at the superpixel level. Kim et al. [34] used high-dimensional color transformation to encode the colors of image pixels into a high-dimensional representation, thereby better capturing the distinctions and connections between colors. Zhou et al. [35] computed two initial saliency maps by leveraging saliency and background seed vectors and then employed a manifold ranking diffusion method to achieve more accurate results. Yuan et al. [36] introduced a SOD model that incorporates regression calibration to achieve more accurate and reliable saliency estimation.
ORSI-SOD Methods: Through the analysis of ORSI features, traditional ORSI-SOD methods have developed specialized feature extraction algorithms tailored for SOD. Zhang et al. [37] conducted an analysis of color information content in ORSI, calculating saliency scores for every color layer and fusing these color elements to obtain the final saliency result. Zhang et al. [38] proposed a bidirectional supplementary saliency analysis strategy that integrates visual saliency and knowledge-driven saliency, constructing an adaptive boundary model for airport extraction oriented towards saliency detection. Zhang et al. [39] introduced a flexible integration method using low-rank matrix recovery, which integrates color features, strength features, texture, and global contrast for SOD in ORSI. Huang et al. [40] introduced a contrast-weighted dictionary learning-based method for SOD in very high-resolution ORSI.
Although the aforementioned methods generalize poorly to new scenes, they establish the foundation for subsequent approaches.

2.2. Deep Learning-Based SOD Methods

NSI-SOD Methods: Deep learning-based SOD methods utilize end-to-end optimization, which enables them to autonomously extract multi-level, high-expressive features without manual intervention. Wang et al. [41] designed two sub-networks, one for local measurement and the other for global search, to detect salient regions more effectively. Liu et al. [42] provided spatial cues for potential salient objects by efficiently fusing multi-level features and then refined these cues through progressive semantic enhancement, thus generating saliency maps rich in detailed information. Qin et al. [43] introduced a hybrid loss method that combines binary cross-entropy, a structural similarity index measure, and intersection over union loss, aiming to achieve effective segmentation of structurally fine and clearly bounded salient targets. Zhou et al. [44] adopted a dynamic dual-stream decoder architecture to evaluate and exploit the interaction between saliency features and edge features, thereby improving the overall detection accuracy. Liu et al. [45] achieved SOD by modeling visual tasks as a dictionary lookup problem with learnable query functionality, where the Transformer decoder acts as the task-specific unit atop the CNN backbone in this network.
ORSI-SOD Methods: ORSI-SOD is a promising newcomer in the SOD field, with many deep-learning-based methods being proposed recently. Huang et al. [16] proposed a Semantic Guidance Decoder (SGD), which aggregates diverse high-order features to identify multi-scale targets and integrates global semantic information through a step-by-step feedback process. Zhang et al. [17] introduced the Dense Attention Fluid (DAF) structure, which passes shallow attention information to deeper layers and combines an edge supervision mechanism to enhance object boundaries. Zhou et al. [21] enhanced multi-scale deep features through an edge extraction module and incorporated fine edge details into the saliency maps using a hybrid loss that includes edge-aware constraints. Liu et al. [46] utilized Transformer blocks with global perception areas to extract features and adopted a divide-and-conquer method to extract and integrate relevant features from various branches. Yan et al. [47] ingeniously integrated Transformers and CNNs into the encoder, using an adaptive semantic matching mechanism to model both global and local relationships.
While considerable success has been achieved by these methods in SOD, they typically require substantial computational resources, which can pose a challenge for devices with limited processing capabilities.

2.3. Lightweight SOD Methods

NSI-SOD Methods: Lightweight SOD is an emerging research direction that was initially studied in the context of NSI-SOD. Gao et al. [22] introduced a dynamic and adaptable convolutional layer with powerful multi-scale representation capabilities, based on which they created a remarkably lightweight network for NSI-SOD. Liu et al. [23] introduced a novel stereo attention multi-scale module that dynamically fuses features from different scales using a stereo attention approach. This method efficiently captures detailed information from salient objects with fewer computational resources. Liu et al. [48] introduced a layered visual processing module to mimic hierarchical perceptual learning in the primate visual cortex. This module effectively learns and utilizes contextual information across multiple scales through dense connections.
ORSI-SOD Methods: The push for lightweight models has extended to ORSI-SOD. Given the specific attributes of ORSI, ensuring superior performance while maintaining model lightweightness is a highly challenging task. Lin et al. [24] designed a lightweight multi-scale contextual network using the MobileNet V2 [49] architecture, specifically tailored for precise salient object detection in ORSI environments. Shen et al. [25] developed a lightweight network that compresses fine-resolution attributes while effectively extracting salient targets using a multi-scale strategy. Li et al. [26] simplified the VGG-16 [50] backbone to enhance feature extraction efficiency and employed a coarse-to-fine scheme, utilizing dense lightweight refinement blocks to detect salient targets. Li et al. [27] used a lightweight MobileNet V2 encoder to extract features, incorporating a flexible semantic fitting module to handle higher-order information and a boundary matching module to refine lower-order information. Liang et al. [28] combined a multi-scale edge embedding attention module with a multi-level semantic guidance module, both of which have low computational burdens. Wang et al. [29] fed local feature representations from a CNN-based encoder to a Transformer-based feature pyramid module while constructing an information optimization module to merge information from multiple levels with enhanced accuracy. Luo et al. [30] proposed a novel recursive structure to effectively bridge the gap between features of distinct resolutions and enhanced the model’s contextual understanding with multi-scale methods. They also utilized shallow features to complement and optimize high-level features, ensuring feature consistency and ultimately improving detection performance.
These advancements signify a crucial trend in the design of SOD models, not only achieving superior detection precision but also maintaining low computational demands. This development is essential to meet the increasing need for deploying such technologies on devices with limited processing power, ensuring broader applicability and efficiency in real-world scenarios.

3. Proposed Method

3.1. Network Overview

As illustrated in Figure 2, the MFFNet employs an encoder–decoder structure, leveraging UniFormer-L as its lightweight encoder. This design allows it to preserve more detailed information about salient areas while preventing non-relevant information from masking significant features. Unlike the MobileNet V2 used in other lightweight networks, UniFormer-L merges convolution and self-attention mechanisms, paralleling the approach of Transformers. This combination enables the effective learning of both local and global relationships, achieving superior accuracy–computation trade-offs in image processing tasks and thus improving the computational efficiency of saliency detection. The encoder is segmented into four modules, each labeled as UniFL-$t$ for $t = 1, 2, 3, 4$. In the first stage, a downsampling operation with a stride of 4 is performed, while in subsequent stages, progressive downsampling with a stride of 2 is used. Each block processes the input ORSI $I \in \mathbb{R}^{3 \times H \times W}$ to extract feature maps, denoted as $f_t \in \mathbb{R}^{c_t \times h_t \times w_t}$. Specifically, $h_t = H/2^{t+1}$, $w_t = W/2^{t+1}$, and for $t = 1$ through 4, the channel dimensions $c_t$ are set to 56, 112, 224, and 448, respectively.
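As a quick sanity check on these dimensions, the short sketch below (our own illustrative helper, not part of the released code) computes the expected feature-map shape of each encoder stage for a 288 × 288 input using the strides and channel counts stated above.

```python
# Expected UniFormer-L feature-map shapes for a 288 x 288 input (illustrative only).
H = W = 288
channels = [56, 112, 224, 448]          # c_t for stages 1-4, as stated above

for t in range(1, 5):
    h_t = H // (2 ** (t + 1))           # h_t = H / 2^(t+1)
    w_t = W // (2 ** (t + 1))           # w_t = W / 2^(t+1)
    print(f"stage {t}: {channels[t - 1]} x {h_t} x {w_t}")
# stage 1: 56 x 72 x 72, stage 2: 112 x 36 x 36,
# stage 3: 224 x 18 x 18, stage 4: 448 x 9 x 9
```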
The decoder network within MFFNet leverages the well-established U-Net architecture. Specifically, the UniFormer-L encoder outputs four-level feature maps, which are then forwarded to the Multi-stage Information Fusion (MIF) module for further processing to extract multi-stage and multi-scale features. Following this, feature maps refined by the MIF modules at Stages 2 to 4 are integrated using a coarse-to-fine scheme guided by the Semantic Guidance Fusion (SGF) module, resulting in a salient prediction map. Lastly, the model executes an upsampling to upscale the salient prediction map, ensuring it matches the dimensions of the original ORSI and thus producing the final prediction map.

3.2. Multi-Stage Information Fusion (MIF) Module

Salient features in ORSI demonstrate considerable scale variations and complex backgrounds, necessitating powerful feature derivation technologies to extract their diversity and intricacy. As shown in Figure 3, the proposed Multistage Information Fusion (MIF) module is designed to adequately merge multi-stage and multi-scale information, generating richer attention feature maps that significantly enhance the performance of SOD. Fundamentally, the MIF module consists of two primary components: the Spatial Attention Module, which concentrates on spatial relationships within the image, and the Channel Attention Module, which emphasizes channel-wise dependencies.
Spatial Attention Block (SAB): The efficient processing of multi-scale information is essential for ORSI-SOD. The integration of multi-scale information has proven a critical factor in improving SOD performance. Therefore, we introduce a Spatial Attention Block (SAB) that integrates multi-scale features, generating corresponding attention maps to improve the integration of information.
As illustrated on the left in Figure 3, we initially conducted max pooling and average pooling processes on the feature representations of each stage across the channel axis, combining these two results to produce a dual-channel feature map while also keeping the original height and width. Following this, we applied a dilated convolution with an expansion rate of 3 and a kernel size of 7 and subsequently employed a sigmoid activation function. Finally, we performed a component-wise multiplication between the generated spatial attention map and the original feature map, incorporating residual information. The procedure of the SAB can be mathematically described by
$$T_t = \mathrm{Concat}\big(\mathrm{Max}(f_t), \mathrm{Avg}(f_t)\big)$$
$$T_s^t = \sigma\big(\mathrm{Conv}(T_t)\big)$$
$$S_{sab}^t = f_t \cdot T_s^t + f_t$$
where $\mathrm{Max}$ denotes the max pooling operation, $\mathrm{Avg}$ denotes the average pooling operation, $f_t$ represents the feature maps obtained from different stages by the encoder, $\mathrm{Concat}$ indicates the concatenation operation along the channel dimension, and $\sigma$ represents the sigmoid function.
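A minimal PyTorch sketch of the SAB as described above is given below; the class name and the padding value are our assumptions, the latter chosen so that the 7 × 7 dilated convolution (dilation 3) preserves the spatial resolution.

```python
import torch
import torch.nn as nn

class SpatialAttentionBlock(nn.Module):
    """Sketch of the SAB described above (not the authors' released code)."""
    def __init__(self):
        super().__init__()
        # Input has 2 channels (channel-wise max and average maps); the 7x7 conv
        # with dilation 3 needs padding = 3 * (7 - 1) / 2 = 9 to keep H x W unchanged.
        self.conv = nn.Conv2d(2, 1, kernel_size=7, dilation=3, padding=9)

    def forward(self, f_t):                                  # f_t: (B, C_t, H_t, W_t)
        max_map, _ = torch.max(f_t, dim=1, keepdim=True)     # Max(f_t) along channels
        avg_map = torch.mean(f_t, dim=1, keepdim=True)       # Avg(f_t) along channels
        T_t = torch.cat([max_map, avg_map], dim=1)           # Concat -> (B, 2, H_t, W_t)
        T_s = torch.sigmoid(self.conv(T_t))                  # spatial attention map T_s^t
        return f_t * T_s + f_t                               # S_sab^t = f_t * T_s^t + f_t
```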
Channel Attention Block (CAB): According to CBAM [51], both spatial and channel attention mechanisms are necessary for filtering out unnecessary features. With this goal in mind, we introduce the Channel Attention Block (CAB). This module enhances information integration by generating channel attention maps through the concatenation of features from different stages along the channel axis. By focusing on relevant channels, the CAB improves the model’s capability to highlight important features and suppresses noise.
As illustrated on the right in Figure 3, the CAB generates more comprehensive attention feature maps by dividing the fusion of multi-stage features into two parts: local feature integration, which employs convolutional processing, and global feature integration, which leverages distinct fully connected layers at every stage. This dual approach enables the CAB to extract both detailed and broad-spectrum information effectively. The operation of the CAB can be mathematically expressed as follows:
$$R_t = \mathrm{Concat}\big(\mathrm{GAP}(S_{sab}^1), \mathrm{GAP}(S_{sab}^2), \mathrm{GAP}(S_{sab}^3), \mathrm{GAP}(S_{sab}^4)\big)$$
$$R_c^t = \sigma\big(\mathrm{FC}_t(\mathrm{Conv}(R_t))\big)$$
$$S_{mif}^t = S_{sab}^t \cdot R_c^t + S_{sab}^t$$
where $\mathrm{GAP}$ denotes Global Average Pooling and $\mathrm{FC}_t$ indicates the fully connected layer of stage $t$.
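The following PyTorch sketch shows one plausible reading of the CAB equations above; the use of a 1 × 1 convolution for local integration and the exact layer sizes are our assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    """Sketch of the CAB described above; layer shapes are our assumptions."""
    def __init__(self, channels=(56, 112, 224, 448)):
        super().__init__()
        total = sum(channels)                                    # 840 concatenated channels
        # Local integration of the concatenated GAP descriptors via a 1x1 convolution.
        self.conv = nn.Conv2d(total, total, kernel_size=1)
        # Global integration: one fully connected layer per stage (FC_t).
        self.fcs = nn.ModuleList([nn.Linear(total, c) for c in channels])

    def forward(self, s_sab):                                    # list of 4 tensors (B, C_t, H_t, W_t)
        gap = [x.mean(dim=(2, 3), keepdim=True) for x in s_sab]  # GAP(S_sab^t) -> (B, C_t, 1, 1)
        R = self.conv(torch.cat(gap, dim=1)).flatten(1)          # Conv(R_t) -> (B, 840)
        out = []
        for t, x in enumerate(s_sab):
            R_c = torch.sigmoid(self.fcs[t](R))[..., None, None]  # sigma(FC_t(.)) -> (B, C_t, 1, 1)
            out.append(x * R_c + x)                               # S_mif^t = S_sab^t * R_c^t + S_sab^t
        return out
```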

3.3. Semantic Guidance Fusion (SGF) Module

In the decoding stage, generating an ultimate salient prediction map requires the continuous integration of semantic information from higher network layers with detailed features from lower layers. Research has demonstrated a significant semantic gap between high-level and low-level features. High-level features are rich in semantic data that helps in comprehending the varying regions within the image. Low-level features encompass abundant detailed data such as edges and textures but may also include irrelevant background noise. However, in the decoder structure based on the U-Net architecture, merely passing high-level semantic features progressively and directly concatenating them with low-level features can cause semantic confusion between foreground and background information. This approach can lead to issues such as fragmented object compositions or the overforecasting of foreground areas, impacting the overall accuracy and robustness of the saliency detection.
To tackle this issue, we have introduced the Semantic Guidance Fusion (SGF) module within the decoder. The SGF module leverages deep-layer features that are rich in semantic information to generate a semantic guidance map, which serves as prior information for guiding the fusion of low-layer features. Through a self-attention mechanism, the SGF explores the projection correspondence of both global information and local features, infusing global semantic information into local features, thereby significantly enhancing the feature representation capability. This mechanism filters out unimportant parts from a large amount of information, focusing on processing and selecting key important information, thus strengthening the model’s awareness of global information and capturing long-distance dependencies.
As shown in Figure 4, first, convolution operations are performed on the semantically rich features extracted from MIF-4, reducing their dimensions to a single-channel output. After resizing the semantic guidance map, the Sigmoid function is used to transform the input into values between 0 and 1, resulting in $W_{sg}$:
$$W_{sg} = \sigma\big(\mathrm{Resize}(\mathrm{Conv}(S_{mif}^4))\big)$$
Then, a linear transformation is applied to the input feature maps to reduce computational complexity, generating three attention components, which are then multiplied by $W_{sg}$. These components are reshaped to fit the form required for matrix multiplication:
$$Q, K, V = \big(\mathrm{view}(W_{sg} \cdot T_q),\ \mathrm{view}(W_{sg} \cdot T_k),\ \mathrm{view}(W_{sg} \cdot T_v)\big)$$
where $T_q$, $T_k$, and $T_v$ denote the query, key, and value components obtained by applying linear transformations to the input feature map $S_{mif}^t$, and $\mathrm{view}$ refers to the reshape operation. The attention map is then computed as
$$A = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)$$
where $A$ is the attention map and $d$ is the channel count in the key matrix.
Finally, the output of the SGF module is computed as
$$S_{sgf}^t = A \cdot V + S_{mif}^t$$
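Below is a minimal PyTorch sketch of the SGF module as we read the description above; the reduced channel dimension, the use of 1 × 1 convolutions as the linear transformations, and bilinear resizing are our assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGuidanceFusion(nn.Module):
    """Sketch of the SGF described above; reduced dims and resize mode are assumptions."""
    def __init__(self, in_channels, deep_channels=448, reduced=None):
        super().__init__()
        reduced = reduced or in_channels // 2                       # d, the key/query channel count
        self.guide_conv = nn.Conv2d(deep_channels, 1, kernel_size=1)  # Conv(S_mif^4) -> 1 channel
        self.to_q = nn.Conv2d(in_channels, reduced, kernel_size=1)    # linear transforms T_q, T_k, T_v
        self.to_k = nn.Conv2d(in_channels, reduced, kernel_size=1)
        self.to_v = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.d = reduced

    def forward(self, s_mif_t, s_mif_4):
        B, C, H, W = s_mif_t.shape
        # Semantic guidance map W_sg, resized to the current stage's resolution.
        w_sg = torch.sigmoid(F.interpolate(self.guide_conv(s_mif_4), size=(H, W),
                                           mode='bilinear', align_corners=False))
        # Guided attention components, reshaped ("view") for matrix multiplication.
        Q = (w_sg * self.to_q(s_mif_t)).flatten(2).transpose(1, 2)   # (B, HW, d)
        K = (w_sg * self.to_k(s_mif_t)).flatten(2)                   # (B, d, HW)
        V = (w_sg * self.to_v(s_mif_t)).flatten(2).transpose(1, 2)   # (B, HW, C)
        A = torch.softmax(Q @ K / self.d ** 0.5, dim=-1)             # softmax(QK^T / sqrt(d))
        out = (A @ V).transpose(1, 2).reshape(B, C, H, W)
        return out + s_mif_t                                         # S_sgf^t = A * V + S_mif^t
```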

3.4. Loss Function

As marked by the red arrows in Figure 2, this study incorporates deep supervision during the training process. Specifically, each output of the three SGF modules, represented as $S_{sgf}^t$ ($t \in \{2, 3, 4\}$), is passed to a 3 × 3 convolutional layer to generate the saliency prediction map $S_{pred}^t$ ($t \in \{2, 3, 4\}$). Subsequently, each $S_{pred}^t$ is upsampled to the same resolution as the GT for loss computation. The MFFNet uses the saliency map $S_{pred}^1$ as the final saliency prediction.
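A sketch of such a deep-supervision head is shown below; the class name and the bilinear upsampling mode are our own choices for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class SaliencyHead(nn.Module):
    """Illustrative deep-supervision head: a 3x3 conv produces a single-channel
    prediction, which is then upsampled to the GT resolution (our naming)."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, s_sgf_t, gt_size):
        pred = self.conv(s_sgf_t)                              # S_pred^t
        return F.interpolate(pred, size=gt_size, mode='bilinear', align_corners=False)
```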
The MFFNet is trained with a combination of the binary cross-entropy loss ($L_{bce}$), the map-level intersection-over-union loss ($L_{iou}$) [43], and the F-measure loss ($L_{fm}$) [52]. The loss functions are expressed as follows:
$$L_{bce} = -\frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\left[G_{ij}\log S_{ij} + (1 - G_{ij})\log(1 - S_{ij})\right]$$
$$L_{iou} = 1 - \frac{\sum_{i=1}^{H}\sum_{j=1}^{W} S_{ij} G_{ij}}{\sum_{i=1}^{H}\sum_{j=1}^{W}\left(S_{ij} + G_{ij} - S_{ij} G_{ij}\right)}$$
$$L_{fm} = 1 - \frac{(1 + \beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall}$$
where $H$ and $W$ represent the height and width of the image, $G_{ij}$ indicates the GT for pixel $(i, j)$, and $S_{ij}$ signifies the predicted value for that pixel. We set $\beta^2$ to 0.3 to underscore precision over recall [53].
The overall loss of the model is defined as
$$L_{total} = \sum_{t=1}^{4}\left(L_{bce}^{t} + L_{iou}^{t} + L_{fm}^{t}\right)$$
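The sketch below implements this hybrid objective for a single supervised output; the epsilon terms, the sigmoid applied to logits, and the batch-wise reductions are our assumptions about details the text does not specify.

```python
import torch
import torch.nn.functional as F

def hybrid_saliency_loss(pred_logits, gt, beta2=0.3, eps=1e-7):
    """One supervision term, L_bce + L_iou + L_fm (illustrative, assumed reductions)."""
    s = torch.sigmoid(pred_logits)                         # S_ij in [0, 1]
    l_bce = F.binary_cross_entropy(s, gt)
    inter = (s * gt).sum(dim=(1, 2, 3))                    # soft intersection
    union = (s + gt - s * gt).sum(dim=(1, 2, 3))           # soft union
    l_iou = (1.0 - inter / (union + eps)).mean()
    precision = inter / (s.sum(dim=(1, 2, 3)) + eps)
    recall = inter / (gt.sum(dim=(1, 2, 3)) + eps)
    l_fm = (1.0 - (1 + beta2) * precision * recall /
            (beta2 * precision + recall + eps)).mean()
    return l_bce + l_iou + l_fm

# L_total would then sum this term over the supervised stage outputs, e.g.:
# loss = sum(hybrid_saliency_loss(p, gt) for p in stage_predictions)
```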

4. Experiments

In this section, we first outline the experiment details, including the benchmark datasets, parameter settings, and performance metrics. Following this, we demonstrate a comprehensive comparison of the effectiveness across 31 state-of-the-art salient object detection (SOD) models. Additionally, we conduct a sequence of ablation studies to systematically evaluate and highlight the efficacy and contributions of the proposed modules within the entire structure of the MFFNet model.

4.1. Experiment Details

4.1.1. Datasets

We conducted experiments using two ORSI-SOD datasets, detailed as follows.
(1)
The ORSSD dataset [13] contains 800 ORSI along with their corresponding per-pixel labels. These images are primarily captured by aircraft and satellites and were selected and compiled by the authors based on existing databases and Google Earth imagery. The dataset encompasses a diverse range of scenes, such as islands, boats, cars, roads, rivers, and airplanes. Within this dataset, 600 images were assigned for training the model, and 200 images were reserved for testing its performance.
(2)
The EORSSD dataset [17] enlarges the ORSSD dataset by including more images, featuring 2000 ORSI paired with their respective pixel-level ground truth (GT). In contrast to its predecessor, EORSSD poses increased difficulties in the object detection of tiny objects, the interpretation of more intricate scenes, and the mitigation of various image interferences. This dataset is organized into two subsets: 1400 images were assigned for the training phase, and 600 images were earmarked for the testing phase.
Furthermore, to enhance the training data, we applied data augmentation techniques that included rotating the images and their corresponding GT by 90°, 180°, and 270°. Additionally, we performed horizontal flips on the images and GT, followed by further rotations at the same angles, to broaden the range of the training set. Through these augmentation methods, the total number of training samples for the ORSSD dataset was expanded to 4800, while the EORSSD dataset's training samples were increased to 11,200.
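This amounts to an eightfold expansion (four rotations of the original image/GT pair and of its horizontally flipped copy); a minimal sketch using Pillow is shown below, with the function name being our own.

```python
from PIL import Image, ImageOps

def augment_pair(image: Image.Image, gt: Image.Image):
    """Eightfold augmentation described above: rotations of the original pair and
    of its horizontal flip by 0, 90, 180 and 270 degrees (illustrative)."""
    pairs = []
    for flipped in (False, True):
        img = ImageOps.mirror(image) if flipped else image
        mask = ImageOps.mirror(gt) if flipped else gt
        for angle in (0, 90, 180, 270):
            pairs.append((img.rotate(angle, expand=True),
                          mask.rotate(angle, expand=True)))
    return pairs   # 8 pairs per sample: 600 -> 4800 (ORSSD), 1400 -> 11,200 (EORSSD)
```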

4.1.2. Parameter Settings

All experiments described in this paper were performed on an NVIDIA GeForce RTX 4080 GPU using the open-source PyTorch 2.1.2 framework. During the training stage, each input image was resized to 288 × 288 pixels. To optimize the network, we employed the Adam optimizer with a batch size of eight. The initial learning rate was set to 0.0001 and was reduced tenfold after 30 training epochs. Training was conducted for 50 and 55 epochs on the EORSSD and ORSSD datasets, respectively.
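These settings translate directly into a standard PyTorch optimizer and step scheduler, sketched below; the stand-in model and the loop body are placeholders, not the authors' training script.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)           # placeholder for MFFNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial learning rate 0.0001
# Reduce the learning rate tenfold after 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

batch_size, num_epochs = 8, 55                               # 55 epochs on ORSSD (50 on EORSSD)
for epoch in range(num_epochs):
    # ... iterate over 288x288 training batches, compute the hybrid loss, and step ...
    scheduler.step()
```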

4.1.3. Evaluation Metrics

To thoroughly evaluate the model's performance, we utilized a set of standard performance metrics to conduct our quantitative analysis, including the F-measure ($F_\beta$) [54], E-measure ($E_\xi$) [55], S-measure ($S_m$) [56], Mean Absolute Error (MAE), the F-measure curve, and the PR curve.
The F-measure is determined by taking the weighted average of precision and recall, offering an extensive evaluation of a model’s performance:
$$F_\beta = \frac{(1 + \beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall}$$
where $\beta^2$ is set to 0.3. To evaluate performance, we use the mean F-measure and F-measure curves calculated across a range of thresholds from 0 to 255.
The E-measure assesses how closely the predicted saliency maps match the ground truth (GT) by accounting for both local and overall saliency values. This comprehensive approach ensures a highly precise assessment of the model’s performance. It is computed as
$$E_\xi = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\theta(\xi)$$
where $H$ and $W$ indicate the height and width of the ORSI, $\xi$ serves as the alignment matrix, and $\theta(\xi)$ signifies the improved version of this matrix.
The S-measure evaluates the structural similarity between the predicted salient regions and GT, offering insight into how closely the structures align. It can be computed as
$$S_m = \alpha \times S_o + (1 - \alpha) \times S_r$$
where $S_o$ represents the object-sensitive similarity term, $S_r$ represents the region-sensitive similarity term, and $\alpha$ is set to 0.5 to ensure that both terms contribute equally to the final result.
The MAE metric evaluates the mean absolute difference at every pixel location, providing a comprehensive assessment of pixel-wise accuracy:
$$MAE = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\left|S_{ij} - G_{ij}\right|$$
where $S_{ij}$ indicates the saliency prediction and $G_{ij}$ represents the corresponding GT at pixel location $(i, j)$.
The PR curve visualizes how precision changes with respect to recall as the saliency map threshold is incrementally adjusted from 0 to 255. It offers a visual representation of the balance between precision rate and recall rate.
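For concreteness, the sketch below shows how two of these metrics, MAE and the mean F-measure over 256 thresholds, can be computed with NumPy; the function names, the 0.5 binarization of the GT, and the epsilon terms are our assumptions.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and its GT, both scaled to [0, 1]."""
    return float(np.abs(pred - gt).mean())

def mean_f_measure(pred, gt, beta2=0.3, eps=1e-8):
    """Mean F-measure: F_beta averaged over the 256 binarization thresholds."""
    gt_bin = gt > 0.5
    scores = []
    for thr in np.arange(256) / 255.0:
        sal = pred >= thr
        tp = np.logical_and(sal, gt_bin).sum()
        precision = tp / (sal.sum() + eps)
        recall = tp / (gt_bin.sum() + eps)
        scores.append((1 + beta2) * precision * recall /
                      (beta2 * precision + recall + eps))
    return float(np.mean(scores))
```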

4.2. Performance Comparison

To validate the advancement of our approach, the MFFNet model is compared with 31 state-of-the-art salient object detection (SOD) models. The models for comparison included five traditional NSI-SOD models (RRWR [33], HDCT [34], DSG [35], SMD [57], RCRR [36]), three traditional ORSI-SOD models (VOS [38], SMFF [39], CMC [58]), six DL-based NSI-SOD models (PoolNet [42], EGNet [7], ITSD [44], GCPANet [59], SUCA [60], PA-KRN [9]), seven DL-based ORSI-SOD models (LVNet [13], SARNet [16], DAFNet [17], MJRBM [18], ERPNet [19], RRNet [20], EMFINet [21]), three lightweight NSI-SOD models (CSNet [22], SAMNet [23], HVPNet [48]), and seven lightweight ORSI-SOD models (MSCNet [24], FSMINet [25], CorrNet [26], SeaNet [27], MEANet [28], CSFFNet [29], and SAFINet [30]). We sourced saliency maps for other methods from the original authors or accessible repositories, ensuring that all predicted saliency maps were evaluated according to the same standards.

4.2.1. Quantitative Comparison

Table 1 provides a quantitative comparison of the MFFNet model against 31 state-of-the-art SOD models across two benchmark datasets. The metrics employed for evaluating the models in this comparison are $F_\beta^{mean}$, $E_\xi^{mean}$, $S_m$, and MAE. The top three scores are marked in red, blue, and green. The symbol "↑" denotes that higher scores are preferable, while "↓" indicates that lower scores are better.
Utilizing the strengths of deep learning, DL-based methods showcase remarkable results in SOD, surpassing traditional approaches across all four evaluation metrics. This consistent superiority underscores the effectiveness of deep learning in this domain. Moreover, the results clearly demonstrate that ORSI-SOD methods achieve better performance compared to NSI-SOD approaches, highlighting the importance of specialized SOD methods for ORSI to achieve optimal results.
Despite its lightweight design, our MFFNet outperforms non-lightweight, deep-learning-based NSI-SOD methods, achieving superior performance across various metrics. For instance, compared to the state-of-the-art PA-KRN, MFFNet requires 11.6 times fewer parameters and 224.6 times fewer FLOPs, yet it still delivers significant improvements in the evaluation metrics. This balance between efficiency and effectiveness underscores MFFNet's capability to provide high performance with minimal computational resources.
In contrast to non-lightweight ORSI-SOD methods based on deep learning, MFFNet delivers better results with a notably reduced number of parameters and FLOPs, highlighting its efficiency and effectiveness in SOD. For the EORSSD dataset, our model achieves the highest rank in $F_\beta^{mean}$, $E_\xi^{mean}$, and $S_m$ and second in MAE. In the ORSSD dataset, our model obtains the highest rankings in all evaluation metrics.
Compared to ten recent lightweight SOD methods, the MFFNet model demonstrates superior performance, securing four first-place, two second-place, and one third-place rankings across the various metrics. In particular, on the EORSSD dataset, our model ranks first in $E_\xi^{mean}$, $S_m$, and MAE. For the ORSSD dataset, our model achieves first place in $E_\xi^{mean}$, second place in $S_m$ and MAE, and third place in $F_\beta^{mean}$. These results not only underscore the effectiveness of the MFFNet model in addressing ORSI-SOD issues but also showcase its applicability to real-world remote sensing image analysis tasks.
Additionally, Figure 5 presents the precision–recall (PR) and F-measure curves for 13 ORSI-SOD models (LVNet, SARNet, DAFNet, MJRBM, ERPNet, RRNet, EMFINet, MSCNet, FSMINet, CorrNet, SeaNet, MEANet, and SAFINet), as well as the proposed MFFNet. The curves for the proposed MFFNet model are highlighted in red to ensure clarity and ease of comparison. As visible in the evaluation plots, the proposed model (marked by the red curve) consistently positions itself nearer to the top-right corner in both datasets. This positioning indicates superior performance metrics, suggesting that our method exceeds most other approaches with respect to precision and recall.
To summarize, the proposed MFFNet model shows remarkable robustness in handling ORSI-SOD tasks, achieving superior outcomes compared to other state-of-the-art methods. This consistent outperformance underscores MFFNet’s capability to deliver high-quality results with greater efficiency.

4.2.2. Visual Comparison

To further highlight the strengths of the proposed MFFNet model, Figure 6 illustrates a visual comparison between the proposed model and the predictions from the 13 ORSI-SOD models listed in Table 1, across various challenging ORSI scenarios. These scenarios encompass a variety of conditions, including multiple objects, large objects, and other comparable elements. Figure 6 organizes its content such that the first column contains the input images, and the second column shows the associated ground truth (GT). The following 13 columns display the prediction results for each ORSI-SOD model, with the results from our proposed model shown in the last column. In the figures below, red highlights mark false positives (areas where the model erroneously flags background regions as targets), while blue highlights denote false negatives (instances where the model overlooks actual targets, misclassifying them as background). This color coding enables a detailed understanding of each model's error patterns.
In the first row of Figure 6, a scenario with multiple objects—a particularly challenging context for ORSI-SOD—is presented. Our method illustrates its capability by precisely locating all salient objects while preserving fine details. In contrast, some models struggle with accuracy: MSCNet, MEANet, and SAFINet misidentify targets, whereas FSMINet misses objects. This variability underscores the importance of robust detection methods.
The second row depicts a scene containing a large island with a cluttered exterior boundary. In this scenario, SeaNet and MEANet either partially segment the island or result in over-segmentation due to the large area and complex boundaries. Unlike these methods, which struggle to clearly define the edges around the island, our proposed model accurately segments the entire island with precise, finely detailed boundaries.
The third row illustrates a scenario involving small objects, where accurate detection is particularly challenging. In this scene, MSCNet, SeaNet, and SAFINet erroneously detect more regions, whereas other methods only offer a rough localization without detailed detection. Conversely, our method accurately identifies and segments the small object with high precision and fine detail.
As depicted in the fourth row, the scene includes a cluttered background that poses challenges for other methods, leading to the generation of incorrect objects (such as CorrNet, SeaNet, MEANet, and SAFINet) and non-smooth edges (such as DAFNet and MSCNet) in the saliency maps.
Additional evidence showcasing the robustness of the MFFNet model is provided in the next three rows of Figure 6. As shown in these rows, our model maintains its superior performance across three challenging conditions: objects with low contrast, objects with shadow interference, and narrow and elongated objects.
Overall, the results shown in Figure 6 emphasize the reliability and robustness of the proposed method in detecting saliency across various complex scenarios. This consistent performance not only proves our method’s effectiveness in integrating multi-stage and multi-scale information but also demonstrates the advantages of using high-level semantic information as guidance in detecting small objects.

4.2.3. Computational Performance

Table 1 presents a comparative assessment of our model relative to 31 state-of-the-art models, examining key metrics of computational complexity, namely model parameters and FLOPs. Compared to leading non-lightweight networks such as RRNet and EMFINet, our model is much more efficient, featuring only 12.14M parameters (versus 86.27M for RRNet and 107.26M for EMFINet) and requiring only 2.75G FLOPs, a stark contrast to the 692.15G for RRNet and 480.9G for EMFINet.
Among lightweight leading models, our model stands out by requiring fewer FLOPs and demonstrating superior effectiveness. However, it does have a slightly higher parameter count. By achieving a balance between performance and efficiency, the MFFNet model highlights its potential as a robust solution for applications in real-world remote sensing image analysis.

4.3. Ablation Studies

4.3.1. Effectiveness of MIF and SGF

The MFFNet model features two key modules: the Multi-stage Information Fusion (MIF) module and the Semantic Guidance Fusion (SGF) module. We conducted ablation experiments to assess the effectiveness of these modules, comparing the performance of various configurations. To establish the baseline model, we removed the MIF and SGF modules, retaining only the UniFormer-L encoder together with a simplified decoder composed of three up-sampling modules. We designed four variants of MFFNet, which are presented in Table 2: (1) Baseline; (2) Baseline + MIF; (3) Baseline + SGF; (4) Baseline + MIF + SGF.
As illustrated in Table 2, the incorporation of these key modules results in incremental improvements in the $F_\beta^{mean}$, $E_\xi^{mean}$, $S_m$, and MAE scores. These gains come at a minimal increase in parameters and computational load, highlighting the crucial role of each module in enhancing the overall performance of the model. Notably, the enhancements gained by adding the SGF module are more substantial than those obtained with the MIF module. This distinction highlights the significant advantage of our Semantic Guidance Fusion mechanism, which drives the improvement in performance.

4.3.2. Effectiveness of Using Multistage Features in MIF

The MIF module combines features from four stages, generating corresponding channel attention maps in the Channel Attention Block. To validate the effectiveness of fusing multi-stage features in improving SOD performance, we conducted ablation studies in which the number of stage features to be fused was varied, as follows.
From Table 3, as the number of stage features integrated by the MIF module increases from one to four, there is a notable improvement in the performance metrics. Specifically, increasing the number of fused features from one to four adds only 0.66 M parameters and 0.0009 G FLOPs, while achieving performance gains of 0.96% in $F_\beta^{mean}$, 0.83% in $E_\xi^{mean}$, 1.04% in $S_m$, and 0.13% in MAE. This trend highlights the significant positive impact of fusing multi-stage features on the overall performance of SOD.

5. Conclusions

This paper introduces the MFFNet, an innovative and lightweight network architecture tailored to perform SOD in ORSI. The MFFNet is specifically designed to tackle the distinctive challenges encountered in ORSI-SOD, offering optimized performance for this specialized application. By leveraging the MIF modules, the MFFNet skillfully integrates information from multiple stages and scales, enabling it to capture a more comprehensive feature representation. The SGF module utilizes deep-layer semantic information as prior information to mitigate the semantic dilution that occurs throughout the layered feature fusion process. This approach enables the model to more effectively extract and preserve key salient features, enhancing its performance in detecting and highlighting important features.
Experimental evaluations show that our model, featuring just 12.14M parameters and 2.75G FLOPs, not only surpasses existing state-of-the-art SOD methods but also keeps an exceptionally light computational footprint. These results validate the efficiency of our lightweight network architecture in achieving high-performance tasks, demonstrating its value for deployment in resource-constrained settings.
In future work, we plan to refine and optimize the multi-scale information fusion process to enhance detection performance even further. We will also concentrate on enhancing the efficiency of our methodology, aiming to make it more adaptable and impactful across a wider range of computer vision applications. This includes exploring new techniques and algorithms that can streamline processing while maintaining high performance, thereby expanding the potential uses of our approach in diverse fields.

Author Contributions

Methodology, K.H.; formal analysis, J.L.; investigation, J.L.; writing—original draft preparation, K.H.; writing—review and editing, K.H.; visualization, K.H.; supervision, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

ORSSD: https://pan.baidu.com/s/1k44UlTLCW17AS0VhPyP7JA (accessed on 23 December 2024). EORSSD: https://github.com/rmcong/EORSSD-dataset (accessed on 23 December 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, W.; Lai, Q.; Fu, H.; Shen, J.; Ling, H.; Yang, R. Salient object detection in the deep learning era: An in-depth survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3239–3259. [Google Scholar] [CrossRef] [PubMed]
  2. Fu, C.; Lu, K.; Zheng, G.; Ye, J.; Cao, Z.; Li, B.; Lu, G. Siamese object tracking for unmanned aerial vehicle: A review and comprehensive analysis. Artif. Intell. Rev. 2023, 56, 1417–1477. [Google Scholar] [CrossRef]
  3. Li, G.; Liu, Z.; Shi, R.; Wei, W. Constrained fixation point based segmentation via deep neural network. Neurocomputing 2019, 368, 180–187. [Google Scholar] [CrossRef]
  4. Fang, Y.; Chen, Z.; Lin, W.; Lin, C.W. Saliency detection in the compressed domain for adaptive image retargeting. IEEE Trans. Image Process. 2012, 21, 3888–3901. [Google Scholar] [CrossRef] [PubMed]
  5. Yang, S.; Jiang, Q.; Lin, W.; Wang, Y. SGDNet: An end-to-end saliency-guided deep neural network for no-reference image quality assessment. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21 October 2019; pp. 1383–1391. [Google Scholar]
  6. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  7. Zhao, J.X.; Liu, J.J.; Fan, D.P.; Cao, Y.; Yang, J.; Cheng, M.M. EGNet: Edge guidance network for salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8779–8788. [Google Scholar]
  8. Zhao, X.; Pang, Y.; Zhang, L.; Lu, H.; Zhang, L. Suppress and balance: A simple gated network for salient object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 35–51. [Google Scholar]
  9. Xu, B.; Liang, H.; Liang, R.; Chen, P. Locate globally, segment locally: A progressive architecture with knowledge review network for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 3004–3012. [Google Scholar]
  10. Hou, Q.; Cheng, M.M.; Hu, X.; Borji, A.; Tu, Z.; Torr, P.H. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3203–3212. [Google Scholar]
  11. Casagli, N.; Intrieri, E.; Tofani, V.; Gigli, G.; Raspini, F. Landslide detection, monitoring and prediction with remote-sensing techniques. Nat. Rev. Earth Environ. 2023, 4, 51–64. [Google Scholar] [CrossRef]
  12. Wellmann, T.; Lausch, A.; Andersson, E.; Knapp, S.; Cortinovis, C.; Jache, J.; Scheuer, S.; Kremer, P.; Mascarenhas, A.; Kraemer, R.; et al. Remote sensing in urban planning: Contributions towards ecologically sound policies? Landsc. Urban Plan. 2020, 204, 103921. [Google Scholar] [CrossRef]
  13. Li, C.; Cong, R.; Hou, J.; Zhang, S.; Qian, Y.; Kwong, S. Nested network with two-stream pyramid for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9156–9166. [Google Scholar] [CrossRef]
  14. Li, G.; Liu, Z.; Lin, W.; Ling, H. Multi-content complementation network for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
  15. Li, G.; Liu, Z.; Zeng, D.; Lin, W.; Ling, H. Adjacent context coordination network for salient object detection in optical remote sensing images. IEEE Trans. Cybern. 2022, 53, 526–538. [Google Scholar] [CrossRef] [PubMed]
  16. Huang, Z.; Chen, H.; Liu, B.; Wang, Z. Semantic-guided attention refinement network for salient object detection in optical remote sensing images. Remote Sens. 2021, 13, 2163. [Google Scholar] [CrossRef]
  17. Zhang, Q.; Cong, R.; Li, C.; Cheng, M.M.; Fang, Y.; Cao, X.; Zhao, Y.; Kwong, S. Dense attention fluid network for salient object detection in optical remote sensing images. IEEE Trans. Image Process. 2020, 30, 1305–1317. [Google Scholar] [CrossRef]
  18. Tu, Z.; Wang, C.; Li, C.; Fan, M.; Zhao, H.; Luo, B. ORSI salient object detection via multiscale joint region and boundary model. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
  19. Zhou, X.; Shen, K.; Weng, L.; Cong, R.; Zheng, B.; Zhang, J.; Yan, C. Edge-guided recurrent positioning network for salient object detection in optical remote sensing images. IEEE Trans. Cybern. 2022, 53, 539–552. [Google Scholar] [CrossRef]
  20. Cong, R.; Zhang, Y.; Fang, L.; Li, J.; Zhao, Y.; Kwong, S. RRNet: Relational reasoning network with parallel multiscale attention for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11. [Google Scholar] [CrossRef]
  21. Zhou, X.; Shen, K.; Liu, Z.; Gong, C.; Zhang, J.; Yan, C. Edge-Aware Multiscale Feature Integration Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  22. Gao, S.H.; Tan, Y.Q.; Cheng, M.M.; Lu, C.; Chen, Y.; Yan, S. Highly efficient salient object detection with 100k parameters. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 702–721. [Google Scholar]
  23. Liu, Y.; Zhang, X.Y.; Bian, J.W.; Zhang, L.; Cheng, M.M. SAMNet: Stereoscopically attentive multi-scale network for lightweight salient object detection. IEEE Trans. Image Process. 2021, 30, 3804–3814. [Google Scholar] [CrossRef] [PubMed]
  24. Lin, Y.; Sun, H.; Liu, N.; Bian, Y.; Cen, J.; Zhou, H. A lightweight multi-scale context network for salient object detection in optical remote sensing images. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 238–244. [Google Scholar]
  25. Shen, K.; Zhou, X.; Wan, B.; Shi, R.; Zhang, J. Fully squeezed multiscale inference network for fast and accurate saliency detection in optical remote-sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  26. Li, G.; Liu, Z.; Bai, Z.; Lin, W.; Ling, H. Lightweight salient object detection in optical remote sensing images via feature correlation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5617712. [Google Scholar]
  27. Li, G.; Liu, Z.; Zhang, X.; Lin, W. Lightweight salient object detection in optical remote-sensing images via semantic matching and edge alignment. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–11. [Google Scholar] [CrossRef]
  28. Liang, B.; Luo, H. MEANet: An effective and lightweight solution for salient object detection in optical remote sensing images. Expert Syst. Appl. 2024, 238, 121778. [Google Scholar] [CrossRef]
  29. Wang, L.; Long, C.; Li, X.; Tang, X.; Bai, Z.; Gao, H. CSFFNet: Lightweight cross-scale feature fusion network for salient object detection in remote sensing images. IET Image Process. 2024, 18, 602–614. [Google Scholar] [CrossRef]
  30. Luo, H.; Wang, J.; Liang, B. Spatial attention feedback iteration for lightweight salient object detection in optical remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 13809–13823. [Google Scholar] [CrossRef]
  31. Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12581–12600. [Google Scholar] [CrossRef]
  32. Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259. [Google Scholar] [CrossRef]
  33. Li, C.; Yuan, Y.; Cai, W.; Xia, Y.; Dagan Feng, D. Robust saliency detection via regularized random walks ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2710–2717. [Google Scholar]
  34. Kim, J.; Han, D.; Tai, Y.W.; Kim, J. Salient region detection via high-dimensional color transform and local spatial support. IEEE Trans. Image Process. 2015, 25, 9–23. [Google Scholar] [CrossRef]
  35. Zhou, L.; Yang, Z.; Zhou, Z.; Hu, D. Salient region detection using diffusion process on a two-layer sparse graph. IEEE Trans. Image Process. 2017, 26, 5882–5894. [Google Scholar] [CrossRef]
  36. Yuan, Y.; Li, C.; Kim, J.; Cai, W.; Feng, D.D. Reversion correction and regularized random walk ranking for saliency detection. IEEE Trans. Image Process. 2017, 27, 1311–1322. [Google Scholar] [CrossRef]
  37. Zhang, L.; Wang, S.; Li, X. Salient region detection in remote sensing images based on color information content. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 1877–1880. [Google Scholar]
  38. Zhang, Q.; Zhang, L.; Shi, W.; Liu, Y. Airport extraction via complementary saliency analysis and saliency-oriented active contour model. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1085–1089. [Google Scholar] [CrossRef]
  39. Zhang, L.; Liu, Y.; Zhang, J. Saliency detection based on self-adaptive multiple feature fusion for remote sensing images. Int. J. Remote Sens. 2019, 40, 8270–8297. [Google Scholar] [CrossRef]
  40. Huang, Z.; Chen, H.X.; Zhou, T.; Yang, Y.Z.; Wang, C.Y.; Liu, B.Y. Contrast-weighted dictionary learning based saliency detection for VHR optical remote sensing images. Pattern Recognit. 2021, 113, 107757. [Google Scholar] [CrossRef]
  41. Wang, L.; Lu, H.; Ruan, X.; Yang, M.H. Deep networks for saliency detection via local estimation and global search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3183–3192. [Google Scholar]
  42. Liu, J.J.; Hou, Q.; Cheng, M.M.; Feng, J.; Jiang, J. A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3917–3926. [Google Scholar]
  43. Qin, X.; Zhang, Z.; Huang, C.; Gao, C.; Dehghan, M.; Jagersand, M. Basnet: Boundary-aware salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7479–7489. [Google Scholar]
  44. Zhou, H.; Xie, X.; Lai, J.H.; Chen, Z.; Yang, L. Interactive two-stream decoder for accurate and fast saliency detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9141–9150. [Google Scholar]
  45. Liu, N.; Zhang, N.; Wan, K.; Shao, L.; Han, J. Visual saliency transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 4722–4732. [Google Scholar]
  46. Liu, K.; Zhang, B.; Lu, J.; Yan, H. Towards Integrity and Detail with Ensemble Learning for Salient Object Detection in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  47. Yan, R.; Yan, L.; Geng, G.; Cao, Y.; Zhou, P.; Meng, Y. ASNet: Adaptive Semantic Network Based on Transformer-CNN for Salient Object Detection in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
  48. Liu, Y.; Gu, Y.C.; Zhang, X.Y.; Wang, W.; Cheng, M.M. Lightweight salient object detection via hierarchical visual perception learning. IEEE Trans. Cybern. 2020, 51, 4439–4449. [Google Scholar] [CrossRef] [PubMed]
  49. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  50. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  51. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  52. Zhao, K.; Gao, S.; Wang, W.; Cheng, M.M. Optimizing the F-measure for threshold-free salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8849–8857. [Google Scholar]
  53. Cheng, M.M.; Mitra, N.J.; Huang, X.; Torr, P.H.; Hu, S.M. Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 569–582. [Google Scholar] [CrossRef] [PubMed]
  54. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1597–1604. [Google Scholar]
  55. Fan, D.P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.M.; Borji, A. Enhanced-alignment measure for binary foreground map evaluation. arXiv 2018, arXiv:1805.10421. [Google Scholar]
  56. Fan, D.P.; Cheng, M.M.; Liu, Y.; Li, T.; Borji, A. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE international Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4548–4557. [Google Scholar]
  57. Peng, H.; Li, B.; Ling, H.; Hu, W.; Xiong, W.; Maybank, S.J. Salient object detection via structured matrix decomposition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 818–832. [Google Scholar] [CrossRef]
  58. Liu, Z.; Zhao, D.; Shi, Z.; Jiang, Z. Unsupervised saliency model with color Markov chain for oil tank detection. Remote Sens. 2019, 11, 1089. [Google Scholar] [CrossRef]
  59. Chen, Z.; Xu, Q.; Cong, R.; Huang, Q. Global context-aware progressive aggregation network for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10599–10606. [Google Scholar]
  60. Li, J.; Pan, Z.; Liu, Q.; Wang, Z. Stacked U-shape network with channel-wise attention for salient object detection. IEEE Trans. Multimed. 2020, 23, 1397–1409. [Google Scholar] [CrossRef]
Figure 1. Comparison of performance and efficiency.
Figure 2. The overall framework of MFFNet is built on the U-Net architecture. The UniFormer-L encoder captures four-level features, which are subsequently forwarded to the MIF modules. In the decoder, four MIF modules integrate multi-stage and multi-scale information, while the SGF module leverages top-level semantic features to guide the synthesis of lower-level information.
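For orientation, the data flow described in the Figure 2 caption can be summarized in code. The following PyTorch-style outline is a minimal illustrative sketch only: the encoder is passed in as a black box, and PlaceholderBlock, the channel widths, and the wiring of the guidance path are assumptions standing in for the actual MIF and SGF modules, whose structures are shown in Figures 3 and 4.

```python
# Minimal PyTorch-style sketch of the Figure 2 data flow, assuming a
# four-stage encoder; PlaceholderBlock, the channel widths, and the guidance
# wiring are illustrative stand-ins for the actual MIF and SGF modules.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlaceholderBlock(nn.Module):
    """Stand-in fusion block: resize inputs to a common size, concatenate, conv."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, *feats):
        base = feats[0]
        resized = [base] + [
            F.interpolate(f, size=base.shape[-2:], mode="bilinear", align_corners=False)
            for f in feats[1:]
        ]
        return self.conv(torch.cat(resized, dim=1))


class MFFNetSketch(nn.Module):
    def __init__(self, encoder, enc_chs=(64, 128, 320, 512), dec_ch=64):
        super().__init__()
        self.encoder = encoder  # assumed to return four feature maps, shallow to deep
        # Four per-level fusion blocks standing in for the MIF modules.
        self.mif = nn.ModuleList(
            [PlaceholderBlock(c + dec_ch, dec_ch) for c in enc_chs[:3]]
            + [PlaceholderBlock(enc_chs[3], dec_ch)]
        )
        # Stand-in for SGF: injects top-level semantics into each decoder level.
        self.sgf = PlaceholderBlock(2 * dec_ch, dec_ch)
        self.head = nn.Conv2d(dec_ch, 1, 1)

    def forward(self, x):
        f1, f2, f3, f4 = self.encoder(x)        # four-level encoder features
        d4 = self.mif[3](f4)                    # deepest level: semantics only
        d3 = self.sgf(self.mif[2](f3, d4), d4)  # fuse, then guide with top-level d4
        d2 = self.sgf(self.mif[1](f2, d3), d4)
        d1 = self.sgf(self.mif[0](f1, d2), d4)
        sal = self.head(d1)                     # one-channel saliency logits
        return F.interpolate(sal, size=x.shape[-2:], mode="bilinear", align_corners=False)
```

A backbone such as UniFormer-L would plug in as the encoder here, returning its four stage outputs; the listed channel widths are placeholders and should be set to whatever the chosen backbone actually produces.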
Figure 3. Illustration of the structure of MIF.
Figure 4. Illustration of the structure of SGF.
Figure 5. PR curves and F-measure curves on the EORSSD and ORSSD datasets.
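The curves in Figure 5 follow the usual SOD protocol: each predicted saliency map is binarized at a sweep of thresholds, precision and recall are computed against the binary ground truth, and the per-threshold values are averaged over the dataset; the F-measure curve then applies the β² = 0.3 weighting that is standard in the SOD literature [54]. The snippet below is a conventional sketch of that procedure, not the evaluation code used for the paper.

```python
# Conventional computation of PR and F-measure curves for SOD evaluation.
# Assumes predictions are float maps in [0, 1] and ground truths are binary masks.
import numpy as np


def pr_fmeasure_curves(preds, gts, num_thresholds=256, beta2=0.3, eps=1e-8):
    """preds, gts: lists of 2-D arrays; returns dataset-averaged P, R, F curves."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    precision = np.zeros(num_thresholds)
    recall = np.zeros(num_thresholds)
    for pred, gt in zip(preds, gts):
        gt = gt > 0.5
        for i, t in enumerate(thresholds):
            binary = pred >= t
            tp = np.logical_and(binary, gt).sum()
            precision[i] += tp / (binary.sum() + eps)
            recall[i] += tp / (gt.sum() + eps)
    precision /= len(preds)
    recall /= len(preds)
    # Weighted harmonic mean with beta^2 = 0.3, as is common for SOD.
    fmeasure = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return precision, recall, fmeasure
```

Plotting recall against precision gives the PR curves, while plotting the F-measure values against the threshold index gives the F-measure curves.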
Figure 6. Visual comparisons with 13 ORSI-SOD models.
Table 1. Quantitative comparison on EORSSD and ORSSD datasets.
Methods | Type | Year | Backbone | Params (M) ↓ | FLOPs (G) ↓ | EORSSD: F_β^mean / E_ξ^mean / S_m / MAE ↓ | ORSSD: F_β^mean / E_ξ^mean / S_m / MAE ↓
RRWR [33] | T.N.S. | 2015 | - | - | - | 0.3686 / 0.5943 / 0.5992 / 0.1677 | 0.5125 / 0.7017 / 0.6835 / 0.1324
HDCT [34] | T.N.S. | 2016 | - | - | - | 0.4018 / 0.6376 / 0.5971 / 0.1088 | 0.4235 / 0.6495 / 0.6197 / 0.1309
DSG [35] | T.N.S. | 2017 | - | - | - | 0.4597 / 0.6594 / 0.6420 / 0.1246 | 0.5747 / 0.7337 / 0.7195 / 0.1041
SMD [57] | T.N.S. | 2017 | - | - | - | 0.5473 / 0.7286 / 0.7101 / 0.0771 | 0.6214 / 0.7745 / 0.7640 / 0.0715
RCRR [36] | T.N.S. | 2018 | - | - | - | 0.3685 / 0.5946 / 0.6007 / 0.1644 | 0.5126 / 0.7021 / 0.6849 / 0.1277
VOS [38] | T.R.S. | 2018 | - | - | - | 0.2107 / 0.4886 / 0.5082 / 0.2096 | 0.2717 / 0.5352 / 0.5366 / 0.2151
SMFF [39] | T.R.S. | 2019 | - | - | - | 0.2992 / 0.5197 / 0.5401 / 0.1434 | 0.2684 / 0.4920 / 0.5312 / 0.1854
CMC [58] | T.R.S. | 2019 | - | - | - | 0.2692 / 0.5894 / 0.5798 / 0.1057 | 0.3454 / 0.6417 / 0.6033 / 0.1267
PoolNet [42] | D.N.S. | 2019 | VGG16 | 53.63 | 123.4 | 0.6406 / 0.8193 / 0.8207 / 0.0210 | 0.6999 / 0.8650 / 0.8403 / 0.0358
EGNet [7] | D.N.S. | 2019 | ResNet50 | 108.07 | 291.9 | 0.6967 / 0.8775 / 0.8601 / 0.0110 | 0.7500 / 0.9013 / 0.8721 / 0.0216
ITSD [44] | D.N.S. | 2020 | VGG16 | 17.08 | 54.5 | 0.8221 / 0.9407 / 0.9050 / 0.0106 | 0.8502 / 0.9482 / 0.9050 / 0.0165
GCPANet [59] | D.N.S. | 2020 | ResNet50 | 67.06 | 54.3 | 0.7905 / 0.9167 / 0.8869 / 0.0102 | 0.8433 / 0.9341 / 0.9026 / 0.0168
SUCA [60] | D.N.S. | 2021 | ResNet50 | 117.71 | 56.4 | 0.7949 / 0.9277 / 0.8988 / 0.0097 | 0.8237 / 0.9400 / 0.8989 / 0.0145
PA-KRN [9] | D.N.S. | 2021 | ResNet50 | 141.06 | 617.7 | 0.8358 / 0.9536 / 0.9192 / 0.0104 | 0.8727 / 0.9620 / 0.9239 / 0.0139
LVNet [13] | D.R.S. | 2019 | - | - | - | 0.7356 / 0.8826 / 0.8644 / 0.0145 | 0.7995 / 0.9259 / 0.8815 / 0.0207
SARNet [16] | D.R.S. | 2021 | VGG16 | 25.91 | 118.16 | 0.8541 / 0.9555 / 0.9240 / 0.0099 | 0.8619 / 0.9477 / 0.9134 / 0.0187
DAFNet [17] | D.R.S. | 2021 | Res2Net-50 | 29.35 | 839.21 | 0.7980 / 0.9382 / 0.9184 / 0.0053 | 0.8442 / 0.9537 / 0.9118 / 0.0106
MJRBM [18] | D.R.S. | 2022 | ResNet50 | 63.28 | 80.56 | 0.8058 / 0.9212 / 0.9091 / 0.0099 | 0.8573 / 0.9394 / 0.9211 / 0.0146
ERPNet [19] | D.R.S. | 2022 | VGG16 | 56.48 | 87.04 | 0.8304 / 0.9401 / 0.9210 / 0.0089 | 0.8745 / 0.9566 / 0.9254 / 0.0135
RRNet [20] | D.R.S. | 2022 | Res2Net-50 | 86.27 | 692.15 | 0.8377 / 0.9449 / 0.9264 / 0.0074 | 0.8747 / 0.9553 / 0.9339 / 0.0112
EMFINet [21] | D.R.S. | 2022 | VGG16 | 107.26 | 480.9 | 0.8486 / 0.9604 / 0.9290 / 0.0084 | 0.8856 / 0.9671 / 0.9366 / 0.0109
CSNet [22] | L.N.S. | 2020 | - | 0.14 | 0.7 | 0.7656 / 0.8929 / 0.8364 / 0.0169 | 0.8285 / 0.9171 / 0.8910 / 0.0186
SAMNet [23] | L.N.S. | 2021 | - | 1.33 | 0.5 | 0.6879 / 0.8473 / 0.8537 / 0.0151 | 0.7753 / 0.8930 / 0.8835 / 0.0214
HVPNet [48] | L.N.S. | 2021 | - | 1.23 | 1.1 | 0.7377 / 0.8721 / 0.8734 / 0.0110 | 0.7396 / 0.8717 / 0.8610 / 0.0225
MSCNet [24] | L.R.S. | 2022 | MobileNetV2 | 3.26 | 5.87 | 0.8151 / 0.9551 / 0.9071 / 0.0090 | 0.8676 / 0.9653 / 0.9227 / 0.0129
FSMINet [25] | L.R.S. | 2022 | - | 3.56 | 5.24 | 0.8436 / 0.9567 / 0.9255 / 0.0079 | 0.8878 / 0.9672 / 0.9361 / 0.0101
CorrNet [26] | L.R.S. | 2022 | - | 4.09 | 21.09 | 0.8620 / 0.9646 / 0.9289 / 0.0083 | 0.9002 / 0.9746 / 0.9380 / 0.0098
SeaNet [27] | L.R.S. | 2023 | MobileNetV2 | 2.76 | 1.7 | 0.8519 / 0.9651 / 0.9208 / 0.0073 | 0.8772 / 0.9722 / 0.9260 / 0.0105
MEANet [28] | L.R.S. | 2023 | MobileNetV2 | 3.27 | 9.62 | 0.8678 / 0.9658 / 0.9282 / 0.0070 | 0.8934 / 0.9730 / 0.9340 / 0.0098
CSFFNet [29] | L.R.S. | 2023 | - | 25.63 | 17.21 | 0.734 / 0.891 / 0.896 / 0.010 | 0.885 / 0.930 / 0.930 / 0.017
SAFINet [30] | L.R.S. | 2024 | MobileNetV2 | 3.12 | 7.63 | 0.8710 / 0.9682 / 0.9267 / 0.0065 | 0.9030 / 0.9748 / 0.9401 / 0.0086
Ours | | | UniFormer-L | 12.14 | 2.75 | 0.8585 / 0.9688 / 0.9292 / 0.0064 | 0.8984 / 0.9754 / 0.9384 / 0.0093
T.N.S.: Traditional NSI-SOD method; T.R.S.: Traditional ORSI-SOD method; D.N.S.: DL-based NSI-SOD method; D.R.S.: DL-based ORSI-SOD method; L.N.S.: Lightweight NSI-SOD method; L.R.S.: Lightweight ORSI-SOD method. The top three scores are marked in red, blue, and green.
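The Params (M) and FLOPs (G) columns summarize model complexity. As an illustration of how such figures are commonly obtained for a PyTorch model, the snippet below counts learnable parameters directly and profiles multiply–accumulate operations with the third-party thop package; the 256 × 256 input size and the choice of thop are assumptions for this sketch and need not match the measurement setup used for Table 1.

```python
# Illustrative measurement of Params (M) and FLOPs (G) for a PyTorch model.
# Assumptions: the thop package is installed and a 1 x 3 x 256 x 256 input is
# representative of the evaluation resolution.
import torch
from thop import profile


def count_complexity(model, input_size=(1, 3, 256, 256)):
    params_m = sum(p.numel() for p in model.parameters()) / 1e6   # parameters, in millions
    dummy = torch.randn(*input_size)
    macs, _ = profile(model, inputs=(dummy,), verbose=False)      # multiply-accumulate ops
    flops_g = macs / 1e9                                          # often reported as FLOPs (G)
    return params_m, flops_g
```

Note that some papers report MACs as "FLOPs" while others report 2 × MACs, so like-for-like efficiency comparisons should check which convention each source uses.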
Table 2. Effectiveness of MIF and SGF.
Baseline | MIF | SGF | Params (M) ↓ | FLOPs (G) ↓ | EORSSD: F_β^mean / E_ξ^mean / S_m / MAE ↓
| | | 9.91 | 1.9935 | 0.8438 / 0.9628 / 0.9171 / 0.0080
| | | 10.62 | 1.9951 | 0.8577 / 0.9683 / 0.9213 / 0.0070
| | | 11.43 | 2.7565 | 0.8561 / 0.9676 / 0.9242 / 0.0069
| | | 12.14 | 2.7582 | 0.8585 / 0.9688 / 0.9292 / 0.0064
The best one in each column is shown in red.
Table 3. Effectiveness of using multi-stage features in MIF.
No. | Params (M) ↓ | FLOPs (G) ↓ | EORSSD: F_β^mean / E_ξ^mean / S_m / MAE ↓
1 | 11.48 | 2.7573 | 0.8489 / 0.9605 / 0.9188 / 0.0077
2 | 11.57 | 2.7576 | 0.8498 / 0.9631 / 0.9221 / 0.0071
3 | 11.76 | 2.7578 | 0.8530 / 0.9663 / 0.9242 / 0.0067
4 | 12.14 | 2.7582 | 0.8585 / 0.9688 / 0.9292 / 0.0064
The best one in each column is shown in red.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
