1. Introduction
With the rapid advancement of remote sensing technology, the ability to observe the Earth’s surface has reached unprecedented levels, with remote sensing satellites generating vast amounts of data daily. Efficiently organizing, managing, and rapidly querying these large datasets to retrieve relevant image data from remote sensing image databases remains a significant challenge [
1,
2]. Remote sensing image retrieval (RSIR) has garnered increasing attention from the remote sensing community as a key technique for querying and retrieving similar images from large-scale databases [
3].
Early text-based image retrieval (TBIR) methods involved manually assigning keywords to each image, allowing users to retrieve corresponding images based on these keywords [
4]. However, the proliferation of remote sensing data makes it impractical to manually annotate each image. In addition, RS images cover large areas and contain multiple features, and the subjective judgment and cognitive biases of annotators can lead to inconsistencies in image descriptions. As a result, accurate retrieval of remote sensing images based solely on keywords is challenging. Content-Based Image Retrieval (CBIR) eliminates the need for metadata or textual descriptions, overcoming the limitations of TBIR by directly analyzing the visual features of images for retrieval tasks [
5]. On the one hand, CBIR can support various image processing tasks, such as scene classification [
6], target identification and localization [
7,
8]. On the other hand, it has a wide range of applications across fields such as ecological monitoring [
9], disaster assessment [
10], and urban management [
11], making it a key area of research in remote sensing image retrieval technology.
Content-based remote sensing image retrieval consists of two main components: feature extraction and similarity metrics [
12]. Feature extraction focuses on obtaining representative features of remote sensing images, while similarity metrics calculate the similarity between query image and database images to retrieve similar images. Deep learning has been increasingly applied to remote sensing image retrieval due to its powerful feature extraction capabilities. The advanced features extracted by deep learning methods are more stable and generalizable, and extensive research has been conducted on using pre-trained or fine-tuned models to extract the content features of remote sensing images, yielding promising results [
13,
14,
15,
16,
17,
18]. However, most of these methods assume that both query image and database image are single-labeled. While this approach may be sufficient for certain remote sensing scenarios, such as a beach or forest with a uniform background, images often contain diverse features (e.g., buildings, roads, vegetation, etc.), making it difficult to fully represent image content with a single label. In contrast, multi-label images can describe the information within remote sensing images in greater detail and provide a deeper understanding of image semantics [
19]. Many researchers have focused on multi-label remote sensing image retrieval methods. However, most of these studies emphasize improving feature extraction from remote sensing images, which often results in high-dimensional features and the "curse of dimensionality". Additionally, there is relatively little research on similarity metrics for multi-label remote sensing image retrieval [
20]. Some studies have even oversimplified the multi-label retrieval problem by assuming that images are similar if they share a common label.
To address the aforementioned challenges, we propose an end-to-end deep hashing network model incorporating semantic information named Semantically Guided Deep Supervised Hashing (SGDSH). For feature extraction, we enhance the Swin Transformer model to obtain multi-scale feature fusion from remote sensing images and reduce feature dimensionality through hash learning. Additionally, the multi-label semantic information of images is fully utilized to guide the learning of hash codes, thereby improving retrieval accuracy. This approach provides an effective solution for large-scale remote sensing image retrieval tasks. The main contributions of this paper can be summarized as follows:
(1) In this paper, we propose a deep hashing network based on Swin Transformer for extracting image features. We design a loss function that integrates image label loss, image pair similarity loss, and hash code quantization loss.
(2) Leveraging the model’s ability to recognize image categories, we propose a category-weighted Hamming distance-based ranking strategy to enhance the similarity metric.
(3) We conduct experiments on three public remote sensing datasets of varying sizes. In addition to evaluating the retrieval accuracy of the proposed method across different dataset sizes, we also compare it with other state-of-the-art methods to demonstrate its effectiveness and superiority.
The remainder of this paper is organized as follows:
Section 2 reviews conventional and deep hashing methods for retrieval tasks.
Section 3 details the SGDSH model. Experimental results and discussions are presented in
Section 4 and
Section 5. Finally,
Section 6 concludes this paper.
2. Related Work
2.1. Multi-Label Remote Sensing Image Retrieval
Multi-label remote sensing image retrieval associates each remote sensing image with multiple labels. However, labeling remote sensing images with multi-label information is a costly task, resulting in a lack of large-scale multi-label remote sensing datasets, which limits research in this area.
Shao et al. [
19] released the DLRSD multi-label dataset based on the UCM dataset, extending the annotation of image libraries across 17 categories. They evaluated the performance of both single-label and multi-label remote sensing image retrieval methods, providing a benchmark for multi-label retrieval. Additionally, a novel multi-labeling method based on Fully Convolutional Networks (FCNs) was proposed, marking the first introduction of deep learning into multi-label retrieval tasks [
20]. Chaudhuri et al. [
21] developed a multi-label RS dataset based on the UCM dataset and proposed a semi-supervised graph-theoretic approach for remote sensing image retrieval. Cheng [
22] introduced a multi-label semantic preservation deep hashing model to improve retrieval efficiency, reduce feature storage, and retain semantic information. Jin et al. [
23] proposed an interpretable network model for hash retrieval of cloud images, consisting of a feature learning module and a hash learning module, for efficient content-based multi-label retrieval of satellite cloud images. Despite these advancements, the scarcity of large-scale multi-label remote sensing image databases remains a significant challenge. To address this gap, Qi et al. [
24] released the MLRS dataset, a large-scale multi-label remote sensing image dataset designed for multi-label classification and retrieval tasks. It contains 109,161 samples across 46 scenes. The existing research primarily focuses on extracting powerful features from RS images, while there is relatively limited work on similarity metrics. Regarding similarity metric computation, Imbriaco et al. [
25] proposed a multi-label loss function and reordering method, investigating the impact of commonly used loss functions and reordering techniques on multi-label retrieval performance. Lu et al. [
26] introduced a two-stage hierarchical image retrieval (HIR) approach, which incorporates a semantic segmentation module and supplements the similarity metric with visual similarity and semantic statistical information to enhance retrieval accuracy.
The development of deep learning techniques has enabled multi-label remote sensing image retrieval to achieve substantial gains in retrieval accuracy. These methods focus on enhancing the extraction of high-level semantic features and improving the model's understanding of image content. However, there remains a gap in research on similarity metrics, and the discussion of storage and retrieval efficiency for large-scale remote sensing image datasets is still limited.
2.2. Hash-Based Remote Sensing Image Retrieval
Image retrieval tasks typically rely on an Approximate Nearest Neighbor (ANN) search [
27], which can lead to significant computational burdens in large-scale data retrieval, especially as deep learning methods tend to produce high-dimensional feature vectors. To mitigate this issue, feature hashing is used to reduce storage requirements and improve computational efficiency by mapping high-dimensional feature vectors to low-dimensional binary codes [
28]. In recent years, numerous hash retrieval methods have been proposed, which can generally be classified into two categories: unsupervised hashing and supervised hashing.
Unsupervised hashing methods rely on unlabeled data to generate binary hash codes for images. Li et al. [
29] employed random projection to generate an initial estimate of the hash code, which was subsequently reprojected into the original feature space using a linear model to train the projection matrix and generate a final hash code. Sun et al. [
30] proposed an unsupervised deep hashing method based on soft pseudo-labeling, which autonomously generates image soft pseudo-labels and local similarity matrices to facilitate similarity learning among remote sensing images. Although unsupervised hashing methods are simple and effective in generating binary codes for large-scale remote sensing image datasets, their accuracy is relatively limited due to the lack of supervisory information.
In comparison, supervised hashing, particularly methods incorporating deep learning, fully leverages the potential of deep networks and hash learning to improve retrieval accuracy. Li et al. [
31] proposed an end-to-end Deep Hash Neural Network (DHNN), which uses hashing to transform high-dimensional features into low-dimensional hash codes. Inspired by generative adversarial networks (GANs), Liu et al. [
32] used a uniformly distributed truth matrix as input to the discriminator and designed a loss function with multiple constraints to train the generator and produce high-quality hash codes. Song et al. [
33] introduced an asymmetric hash learning method to generate hash codes for querying database images in an asymmetric manner. In deep hash learning, to generate hash codes that effectively distinguish remote sensing images with different contents, a metric learning strategy is often employed. The core objective of this strategy is to maximize the distance between different classes while minimizing the variation within the same class, thus enhancing the distinguishing ability of the deep hash model. Roy et al. [
34] proposed a hash network based on metric learning, utilizing a pre-trained deep CNN without fine-tuning, focusing solely on learning the hash function. Cheng et al. [
22] introduced a pairwise label similarity loss algorithm that fully exploits multi-label information and demonstrated the effectiveness of this hashing method in multi-label remote sensing image retrieval. Wang et al. [
13] proposed a ternary ordered cross-entropy hashing method to address the limitation of most hashing algorithms, which focus only on pointwise or pairwise similarity. Zhou et al. [
35] introduced a deep global semantic structure-preserving hashing (DGSSH) method based on triplet loss, optimizing triplet loss and asymmetric constraints in hash learning to generate distinguishable hash codes.
Efficient hash code learning and the utilization of loss functions to retain as much semantic information as possible remain critical challenges in deep hash-based image retrieval research.
3. Proposed Method
The overall framework of Semantically Guided Deep Supervised Hashing (SGDSH) is illustrated in
Figure 1. In
Section 3.1, we provide a detailed description of the model architecture for our multi-label image retrieval model.
Section 3.2 focuses on the design of loss function, and, in
Section 3.3, we introduce a similarity metric ranking scheme that incorporates label information.
3.1. Overall Model Framework
3.1.1. Feature Extraction Module
In order to extract multi-scale feature information and global contextual dependencies from remote sensing images, we propose a multi-scale feature fusion network based on Swin Transformer [
36]. As shown in
Figure 2, the Swin Transformer module consists of a W-MSA module, a SW-MSA module, and two MLPs, which capture the dependencies of input sequences through a self-attention mechanism. The self-attention operation is expressed in Equation (1):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \quad (1)$$
where $Q$, $K$, and $V$ denote the query, key, and value matrices; $\mathrm{softmax}(\cdot)$ denotes the normalization function; and $d_k$ denotes the dimension of the key vector.
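As a point of reference, Equation (1) corresponds directly to the following minimal PyTorch computation (tensor names and shapes are illustrative, not taken from the released implementation):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, tokens, d_k) query/key/value matrices
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # QK^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)            # softmax normalization over keys
    return weights @ v
```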
The image features pass through the Swin Transformer module according to Equations (2)–(5):
$$\hat{z}^{l} = \text{W-MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1} \quad (2)$$
$$z^{l} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l} \quad (3)$$
$$\hat{z}^{l+1} = \text{SW-MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l} \quad (4)$$
$$z^{l+1} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1} \quad (5)$$
where $\hat{z}^{l}$ and $z^{l}$ are the output features of the W-MSA module and the MLP module in block $l$; $\hat{z}^{l+1}$ and $z^{l+1}$ are the output features of the SW-MSA module and the MLP module in block $l+1$; and $\mathrm{LN}(\cdot)$ denotes layer normalization.
Given an input remote sensing image $X$, feature extraction is performed using a pre-trained Swin Transformer model. The image and its generated feature maps exhibit multi-scale properties across the different stages of the model's output. The feature maps from the final stage are then flattened to obtain a global feature vector $F_g$ of the image.
The mathematical expression of $F_g$ is shown in Equation (6):
$$F_g = \mathrm{Flatten}\big(\mathrm{SwinT}(X; \theta)\big) \quad (6)$$
where $\theta$ denotes the parameters of the network.
Due to the rich feature content and feature variability at different scales in remote sensing images, it is challenging to comprehensively capture multi-scale information of the targets with only the final output of the network. Additionally, as the number of network layers increases, the receptive field expands, which can reduce the model's ability to detect small targets and increase the risk of information loss. Considering these factors, we argue that the global feature $F_g$ alone cannot fully represent all the information contained in the RS image. To address this, we propose a multi-scale feature fusion module, which extracts features from each stage of the network and performs multi-scale feature fusion. The specific process is as follows:
$$F_l = \mathrm{Concat}\big(f_1, \mathrm{Up}(f_2)\big), \qquad F_h = \mathrm{Concat}\big(f_3, \mathrm{Up}(f_4)\big)$$
where $f_i$ denotes the image features output from the $i$-th stage; $\mathrm{Concat}(\cdot)$ and $\mathrm{Up}(\cdot)$ represent the concatenation and upsampling operations; $F_l$ denotes the low-level image features; and $F_h$ denotes the high-level feature generated from the higher stages.
We thus obtain two key features: the high-level feature $F_h$, enriched with global semantic information of the image, and the low-level feature $F_l$, which captures rich local details. To further enhance the representational capability of these features, we introduce the Convolutional Block Attention Module (CBAM) [37]. By employing the CBAM module, we apply a weighting mechanism to the feature map, emphasizing important feature information while reducing the influence of background noise. This allows the model to focus its attention on the most informative areas of the image. The CBAM module can be summarized by the following equations:
$$F' = M_c(F) \otimes F$$
$$F'' = M_s(F') \otimes F'$$
where $F$ denotes the input feature map; $F'$ represents the intermediate features generated by channel attention; $F''$ refers to the output features generated by spatial attention; $M_c(\cdot)$ and $M_s(\cdot)$ denote the channel attention module and the spatial attention module; and $\otimes$ denotes element-wise multiplication. The structure is shown in Figure 3 and Figure 4.
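For reference, a compact PyTorch sketch of a CBAM block is given below; the channel reduction ratio and spatial kernel size are common defaults rather than values reported in this paper:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention (Woo et al., 2018)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over avg- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: conv over channel-wise avg/max maps
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # ----- channel attention M_c -----
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)           # F' = M_c(F) ⊗ F
        # ----- spatial attention M_s -----
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map = torch.amax(x, dim=1, keepdim=True)
        attn = torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                           # F'' = M_s(F') ⊗ F'
```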
After processing by the CBAM module, the updated expressions for the low-level and high-level features are as follows:
$$F_l' = \mathrm{CBAM}(F_l), \qquad F_h' = \mathrm{CBAM}(F_h)$$
We then perform pooling operations on the high-level feature $F_h'$ and the low-level feature $F_l'$ and fuse them with the global feature $F_g$ to generate the final image feature. This feature fusion process is carried out by the Multi-Scale Feature Fusion Module (MSFFM), which is expressed as follows:
$$F = \mathrm{Concat}\big(\mathrm{Pool}(F_l'),\ \mathrm{Pool}(F_h'),\ F_g\big)$$
With this feature fusion strategy, the model is able to more accurately capture and interpret the complex scenes and fine-grained details within remote sensing images, thereby providing robust and detailed feature representations for remote sensing image retrieval tasks.
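Below is a rough PyTorch sketch of how the stage-wise features, CBAM weighting, and pooling-plus-concatenation described above could be wired together. The backbone interface, stage channel sizes, and the low/high stage grouping are our assumptions, and the CBAM class from the previous sketch is reused:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Fuses shallow, deep, and global Swin features into one vector.

    `backbone` is assumed to return a list of four stage feature maps
    [f1, f2, f3, f4] in NCHW layout (e.g. a timm model built with
    features_only=True); per-stage channel sizes are passed in `dims`.
    """
    def __init__(self, backbone, dims=(96, 192, 384, 768)):
        super().__init__()
        self.backbone = backbone
        self.cbam_low = CBAM(dims[0] + dims[1])   # attention on fused shallow stages
        self.cbam_high = CBAM(dims[2] + dims[3])  # attention on fused deep stages

    def forward(self, x):
        f1, f2, f3, f4 = self.backbone(x)
        # Low-level branch: upsample stage 2 to stage 1 resolution and concatenate
        low = torch.cat([f1, F.interpolate(f2, size=f1.shape[-2:], mode="bilinear",
                                           align_corners=False)], dim=1)
        # High-level branch: upsample stage 4 to stage 3 resolution and concatenate
        high = torch.cat([f3, F.interpolate(f4, size=f3.shape[-2:], mode="bilinear",
                                            align_corners=False)], dim=1)
        low, high = self.cbam_low(low), self.cbam_high(high)
        # Global feature F_g from the final-stage map (pooled here for a fixed-size vector)
        g = torch.flatten(F.adaptive_avg_pool2d(f4, 1), 1)
        # Pool the attended branches and concatenate everything into the final feature F
        low = torch.flatten(F.adaptive_avg_pool2d(low, 1), 1)
        high = torch.flatten(F.adaptive_avg_pool2d(high, 1), 1)
        return torch.cat([low, high, g], dim=1)
```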
3.1.2. Hashing Layer
To reduce the storage space required for image features and enable fast retrieval from large remote sensing databases, we employ a hashing method to compress the high-dimensional image features. After the image passes through the feature extraction network to obtain the feature $F$, it is processed by a fully connected layer to produce the hash feature $u \in \mathbb{R}^{K}$, where $K$ denotes the length of the hash code. This process is expressed by Equation (15):
$$u = \tanh\big(\mathrm{FC}_h(F)\big) \quad (15)$$
where $\mathrm{FC}_h(\cdot)$ denotes the fully connected hash layer, and the tanh activation function maps the feature vector values between −1 and 1.
The classification labels of an image typically capture its overall semantic content, particularly for multi-labeled images. These labels can serve as valuable semantic clues to assist network training and facilitate the generation of hash codes that contain richer semantic information. To leverage this, we introduce a classification layer after the hash layer, with the number of neurons set to match the predefined number of $C$ categories. The classification result, denoted as $\hat{y} \in \mathbb{R}^{C}$, is obtained through a fully connected layer. A sigmoid activation function is then applied to predict the probability that the image belongs to each of the $C$ categories. This process can be expressed by Equation (16):
$$\hat{y} = \mathrm{sigmoid}\big(\mathrm{FC}_c(u)\big) \quad (16)$$
where $\mathrm{FC}_c(\cdot)$ denotes the fully connected classification layer, and the sigmoid activation function maps the category vector values between 0 and 1.
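A minimal sketch of the hash and classification layers in Equations (15) and (16) is shown below; the feature dimension, code length, and number of classes are illustrative:

```python
import torch
import torch.nn as nn

class HashClassificationHead(nn.Module):
    """Maps the fused feature F to a K-bit hash feature u and C-way label probabilities."""
    def __init__(self, feature_dim, hash_bits=128, num_classes=60):
        super().__init__()
        self.hash_fc = nn.Linear(feature_dim, hash_bits)   # FC_h in Eq. (15)
        self.cls_fc = nn.Linear(hash_bits, num_classes)    # FC_c in Eq. (16)

    def forward(self, feat):
        u = torch.tanh(self.hash_fc(feat))      # continuous hash feature in (-1, 1)
        y_hat = torch.sigmoid(self.cls_fc(u))   # per-category probabilities in (0, 1)
        return u, y_hat

    @torch.no_grad()
    def binarize(self, u):
        return torch.sign(u)                    # final binary code b in {-1, +1}^K
```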
3.2. Loss Function Design
3.2.1. Pairwise Similarity Loss
The primary goal of hash learning is to learn a set of nonlinear hash functions that map data from the original feature space to a binary hash code while preserving the semantic similarity between data points. Specifically, the objective is to make samples from the same category closer in Hamming distance, while pushing samples from different categories further apart. In hash learning for single-label remote sensing images, the similarity between two images is determined by whether their category labels are the same. However, for multi-label remote sensing images, the assumption that two images are similar simply because they share a label does not fully capture the nuanced similarity between images. Inspired by [25,38], we propose a method based on image label similarity to better reflect the true similarity relationship. Our approach calculates the label similarity between pairs of images, dividing the samples into positive and negative pairs. This label similarity is then used to weight the loss function, contributing more effectively to the learning process. The label similarity is computed as follows:
$$S_{ij} = \frac{\langle l_i, l_j \rangle}{\|l_i\|_2 \, \|l_j\|_2}$$
where $l_i$ and $l_j$ denote the label vectors of image $i$ and image $j$, and $\|\cdot\|_2$ denotes the L2 norm.
According to the definition in [25], two images are considered similar if the similarity of their labels $S_{ij} > 0.4$. Otherwise, the images are deemed dissimilar, and the distance between their hash codes should be maximized. This can be expressed as follows:
$$y_{ij} = \begin{cases} 1, & S_{ij} > 0.4 \\ 0, & S_{ij} \leq 0.4 \end{cases}$$
In addition, the Hamming distance between the hash codes of two images is computed, with the label similarity serving as a weight in the loss term. This ensures that image pairs with a higher number of shared labels contribute more heavily to the loss function and encourages the model to better capture the semantic similarity between images and learn more accurate hash codes. Specifically, our pairwise similarity loss is defined as follows:
$$L_{pair} = \frac{1}{N}\sum_{i,j} \Big[\, y_{ij}\, S_{ij}\, D_H(b_i, b_j) + (1 - y_{ij})\max\big(0,\, m - D_H(b_i, b_j)\big) \Big]$$
where $D_H(b_i, b_j)$ denotes the Hamming distance between the hash codes extracted from image $i$ and image $j$, $N$ is the number of image pairs, and $m$ is a hyperparameter indicating the minimum distance between pairs of negative samples.
The Hamming distance between image pairs is calculated as follows:
$$D_H(b_i, b_j) = \frac{1}{2}\big(K - b_i^{\top} b_j\big)$$
With the proposed contrastive loss function, the trained model minimizes the difference between the hash codes of remote sensing image pairs that share more labels, reflecting their greater similarity. Conversely, for dissimilar images with lower label similarity, the corresponding hash codes are pushed further apart, ensuring proper differentiation.
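A compact PyTorch sketch of this label-similarity-weighted pairwise loss might look as follows, assuming the relaxed hash features u stand in for the binary codes during training and using an illustrative margin value:

```python
import torch
import torch.nn.functional as F

def pairwise_similarity_loss(u, labels, margin=36.0, threshold=0.4):
    """u: (B, K) tanh outputs of the hash layer; labels: (B, C) multi-hot vectors."""
    K = u.size(1)
    # Cosine similarity between label vectors -> S_ij
    l = F.normalize(labels.float(), dim=1)
    S = l @ l.t()
    y = (S > threshold).float()                 # 1 for positive pairs, 0 otherwise
    # Relaxed Hamming distance D_H = 0.5 * (K - u_i . u_j)
    D = 0.5 * (K - u @ u.t())
    pos = y * S * D                                       # pull similar pairs together, weighted by S_ij
    neg = (1.0 - y) * torch.clamp(margin - D, min=0.0)    # push dissimilar pairs beyond the margin m
    # Exclude self-pairs on the diagonal and average over all remaining pairs
    mask = 1.0 - torch.eye(u.size(0), device=u.device)
    return ((pos + neg) * mask).sum() / mask.sum()
```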
3.2.2. Quantization Loss
Ideally, the hash code $b$ should be compact and consist of discretized 0s and 1s. However, the discrete nature of the hash code can cause the gradient to vanish during model training, hindering backpropagation and parameter updates. To address this issue, we use the floating-point vector $u$ output from the hash layer to approximate the hash code $b$ during model training. Additionally, we introduce a quantization loss function that encourages each element of $u$ to move closer to either −1 or 1. The quantization loss function is defined as follows:
$$L_q = \frac{1}{N}\sum_{i=1}^{N} \big\|\, |u_i| - \mathbf{1} \,\big\|_2^2$$
where $\mathbf{1}$ denotes the all-ones vector of length $K$.
3.2.3. Classification Loss
For multi-label classification problems, the cross-entropy loss function is the most commonly used objective function. In a multi-category prediction scenario, a sigmoid activation function is used instead of a softmax function, transforming the problem into a set of binary classification tasks, one per category. The objective function is then constructed by applying the cross-entropy loss to the binomial distribution of each category label. Thus, the classification loss function is defined as follows:
$$L_{cls} = -\frac{1}{C}\sum_{c=1}^{C}\big[\, y_c \log \hat{y}_c + (1 - y_c)\log(1 - \hat{y}_c)\,\big]$$
where $y_c$ represents the actual probability of the image belonging to class $c$, taking values of 1 or 0, and $\hat{y}_c$ denotes the model's predicted probability of the image belonging to class $c$.
For the retrieval task of multi-label remote sensing image datasets, this study aims not only to ensure the accuracy of retrieval results but also to prioritize returning images that share more labels with the query image. To achieve this, we propose a semantic-guided deep supervised hashing method that fully leverages image semantic information. This approach efficiently integrates deep hash learning with the multi-label classification task by designing an objective function that simultaneously considers both the multi-label semantic information of remote sensing images and the similarity features between image pairs. Specifically, the objective function of this paper is defined as follows:
$$L = \alpha L_{cls} + \beta L_{pair} + \gamma L_q \quad (23)$$
where $L_{cls}$ is the multi-label classification loss; $L_{pair}$ is the pairwise similarity loss; $L_q$ is the quantization loss; and $\alpha$, $\beta$, and $\gamma$ are sensitivity parameters that control the three parts of the loss, respectively.
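Combining the three terms, the overall objective of Equation (23) can be sketched as below, with α, β, and γ set to the values selected in Section 4.3, the classification and quantization terms written in the standard forms assumed above, and the pairwise loss sketch from Section 3.2.1 reused:

```python
import torch
import torch.nn.functional as F

def sgdsh_loss(u, y_hat, labels, alpha=2.5, beta=1.0, gamma=0.1, margin=36.0):
    """Total objective L = alpha*L_cls + beta*L_pair + gamma*L_q.

    u:      (B, K) relaxed hash features from the tanh hash layer
    y_hat:  (B, C) sigmoid outputs of the classification layer
    labels: (B, C) ground-truth multi-hot label vectors
    """
    # Multi-label classification loss (binary cross-entropy per category)
    l_cls = F.binary_cross_entropy(y_hat, labels.float())
    # Label-similarity-weighted pairwise loss (see sketch in Section 3.2.1)
    l_pair = pairwise_similarity_loss(u, labels, margin=margin)
    # Quantization loss pushing |u| toward 1
    l_q = ((u.abs() - 1.0) ** 2).mean()
    return alpha * l_cls + beta * l_pair + gamma * l_q
```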
3.3. Similarity Metric Search
A set of query images is defined as $Q = \{q_1, q_2, \ldots, q_M\}$, and a corresponding hash code $b_q$ is obtained for each image after passing through the deep hash network. During the retrieval phase, the similarity between the hash code of the query image and those of all images in the dataset is measured by calculating the Hamming distance between them. The Hamming distance results are then sorted to determine the set of images most similar to the query image. The Hamming distance between hash codes is computed as follows:
$$D_H(b_q, b_d) = \frac{1}{2}\big(K - b_q^{\top} b_d\big)$$
where $b_d$ denotes the hash code of a database image.
In Hamming distance-based retrieval, the quality of the retrieval results heavily depends on the quality of learned hash codes. However, the lack of an intuitive understanding of how these hash codes represent image content makes the results from Hamming distance ranking difficult to interpret. Additionally, the retrieval database often contains a large volume of data, and, since Hamming distance measures the sum of dissimilar bits between two hash codes, it can produce many results with identical Hamming distances. This creates the need for finer ordering of images that have similar Hamming distances. Furthermore, differences between images are not only reflected in the feature vector values but also in the overall category differences. Yet, most methods do not fully leverage the learned category information during retrieval.
To address these challenges, this paper proposes an Improved Similarity Measurement Strategy (ISMS), which aims to better utilize the category information learned by the model and reduce the distance between images belonging to the same category. Specifically, after the query image passes through the network, we obtain its hash code $b_q$ as well as its predicted category label vector $\hat{y}_q$. In the similarity calculation, the label similarity between the query image's predicted categories and those of the image to be retrieved is computed using the Jaccard coefficient and used as a distance weight. The Jaccard coefficient is calculated using the following formula:
$$J(\hat{y}_q, \hat{y}_d) = \frac{|\hat{y}_q \cap \hat{y}_d|}{|\hat{y}_q \cup \hat{y}_d|}$$
The Hamming distance between the two hash codes is then weighted by this coefficient to obtain the final similarity result, and the similarity metric is defined as follows:
$$D(q, d) = \big(1 - J(\hat{y}_q, \hat{y}_d)\big) \cdot D_H(b_q, b_d)$$
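The following sketch shows one way the ISMS ranking could be implemented, assuming binary codes in {−1, +1}, thresholded label predictions, and the (1 − Jaccard) weighting written above, which is our reading of the description:

```python
import torch

def isms_rank(query_code, query_labels, db_codes, db_labels, threshold=0.5):
    """Rank database images for one query using Jaccard-weighted Hamming distance.

    query_code: (K,) binary code in {-1, +1};  db_codes: (N, K)
    query_labels / db_labels: sigmoid outputs, thresholded into multi-hot label sets
    """
    K = query_code.numel()
    # Hamming distance D_H = 0.5 * (K - b_q . b_d) to every database code
    hamming = 0.5 * (K - db_codes @ query_code)
    # Binarize predicted labels and compute the Jaccard coefficient per database image
    q = (query_labels > threshold).float()
    d = (db_labels > threshold).float()
    inter = (d * q).sum(dim=1)
    union = ((d + q) > 0).float().sum(dim=1).clamp(min=1.0)
    jaccard = inter / union
    # Weight the Hamming distance: shared predicted categories shrink the distance
    weighted = (1.0 - jaccard) * hamming
    return torch.argsort(weighted)   # indices of database images, most similar first
```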
4. Experiments and Analysis
To evaluate the effectiveness of the proposed method, several experiments are conducted in this section. First, the datasets and experimental environment are introduced, followed by the implementation details of the various experiments. Finally, the results are presented, analyzed, and discussed.
4.1. Datasets and Evaluation Indicators
4.1.1. Datasets
In this paper, we utilized three widely used publicly available multi-labeled remote sensing datasets: the UC Merced multi-labeled dataset [
21], AID multi-labeled dataset [
39], and MLRS dataset [
24].
The UCM-multi dataset is labeled with 17 categories based on the UC Merced remote sensing image dataset [
40], which contains 21 geographic categories, with 100 remote sensing images per category, totaling 2100 images. The image size is 256 × 256 pixels, and the spatial resolution is 30 cm/pixel.
The AID-multi dataset is an enhanced version of the AID dataset [
41], containing 3000 aerial images selected from 30 scenes in the AID dataset. Multiple object labels are manually assigned to each image through visual inspection. The image resolution for different object categories ranges from 0.5 m/pixel to 8 m/pixel, and each image has a size of 600 × 600 pixels, making it more challenging compared to the UCM-multi dataset.
The MLRS dataset consists of 109,161 labeled RGB images obtained from Google Earth, annotated into 46 broad categories. The number of images varies across categories, ranging from 1500 to 3000 images per category. Each image is assigned a subset of 60 pre-defined category labels, with the number of labels per image ranging from 1 to 13. The spatial resolution of the images ranges from 0.1 m to 10 m, making it ideal for data-driven deep learning research.
4.1.2. Evaluation Indicators
In our experiments, we used four commonly used retrieval evaluation metrics to assess the performance of the multi-label remote sensing image retrieval model: mean average precision (MAP), normalized discounted cumulative gain (NDCG), average cumulative gain (ACG), and weighted average precision (WAP).
MAP measures the mean of the average precision (AP) across all query images, where AP represents the average precision for each query. MAP is calculated as follows:
$$\mathrm{MAP} = \frac{1}{Q}\sum_{q=1}^{Q} \mathrm{AP}(q)$$
where $Q$ denotes the number of query images. The average precision (AP) over the first $n$ retrieved images is defined as follows:
$$\mathrm{AP}@n(q) = \frac{1}{\sum_{i=1}^{n}\delta(q,i)} \sum_{i=1}^{n} \delta(q,i)\,\frac{\sum_{j=1}^{i}\delta(q,j)}{i}$$
where $\delta(q,i)$ is an indicator function; if the label similarity between the $q$-th query image and the $i$-th retrieved image is greater than 0.4, then $\delta(q,i) = 1$; otherwise, it is zero.
NDCG is well suited to multi-label retrieval problems because it measures the relevance of the top-ranked results. It is computed with the following formula:
$$\mathrm{NDCG}@n = \frac{\mathrm{DCG}@n}{\mathrm{IDCG}@n}$$
where DCG is calculated as follows:
$$\mathrm{DCG}@n = \sum_{i=1}^{n} \frac{2^{r(q,i)} - 1}{\log_2(1 + i)}$$
where $r(q,i)$ denotes the number of common labels of the query image $q$ and the $i$-th retrieved image. $\mathrm{IDCG}@n$ denotes $\mathrm{DCG}@n$ in the ideal state, calculated as follows:
$$\mathrm{IDCG}@n = \sum_{i=1}^{n} \frac{2^{\tilde{r}(q,i)} - 1}{\log_2(1 + i)}$$
where $\tilde{r}(q,i)$ denotes the number of common labels of the query image $q$ and the $i$-th retrieved image in the ideal (relevance-descending) ranking.
ACG describes the average similarity between the query image and the first $n$ retrieved images, calculated as follows:
$$\mathrm{ACG}@n = \frac{1}{n}\sum_{i=1}^{n} r(q,i)$$
where $r(q,i)$ denotes the number of labels shared between the query image $q$ and the $i$-th retrieved image.
WAP is a variant of MAP that can be computed based on ACG and more accurately evaluates the retrieval accuracy of the model. The WAP of the first $n$ retrieved images is calculated as follows:
$$\mathrm{WAP}@n = \frac{1}{\sum_{i=1}^{n}\delta(q,i)}\sum_{i=1}^{n} \delta(q,i)\cdot \mathrm{ACG}@i$$
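For concreteness, ACG@n, WAP@n, and NDCG@n for a single query can be computed as in the following sketch (PyTorch tensors; the 0.4 relevance threshold follows the definition of δ above):

```python
import torch

def retrieval_metrics(query_labels, ranked_db_labels, n=50, threshold=0.4):
    """ACG@n, WAP@n, NDCG@n for one query, given labels of the ranked results.

    query_labels:     (C,) multi-hot ground-truth vector of the query
    ranked_db_labels: (>=n, C) multi-hot vectors of the retrieved images, in rank order
    """
    q = query_labels.float()
    d = ranked_db_labels[:n].float()
    n = d.size(0)
    shared = (d * q).sum(dim=1)                      # r(q, i): common labels per result
    # delta(q, i): relevant if cosine label similarity > threshold
    sim = shared / (q.norm() * d.norm(dim=1)).clamp(min=1e-8)
    delta = (sim > threshold).float()
    # ACG@i for every prefix; ACG@n is the last entry
    acg_prefix = shared.cumsum(0) / torch.arange(1, n + 1, dtype=torch.float)
    acg = acg_prefix[-1]
    # WAP@n: delta-weighted average of ACG@i over relevant positions
    wap = (delta * acg_prefix).sum() / delta.sum().clamp(min=1.0)
    # NDCG@n: gain 2^r - 1 with log2 position discount, normalized by the ideal ordering
    discount = 1.0 / torch.log2(torch.arange(2, n + 2, dtype=torch.float))
    dcg = ((2.0 ** shared - 1.0) * discount).sum()
    idcg = ((2.0 ** shared.sort(descending=True).values - 1.0) * discount).sum()
    ndcg = dcg / idcg.clamp(min=1e-8)
    return acg, wap, ndcg
```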
4.2. Experimental Settings
To validate the effectiveness of the proposed method, we conducted experiments on three publicly available multi-labeled remote sensing image datasets. In each experiment, the dataset was split into two parts: 80% of images were used to construct the training set, while the remaining 20% were used as the validation set. In the retrieval phase, all images in UCM-multi and AID-multi datasets were used to form the retrieval database, with each image serving as a query instance. For the MLRS dataset, the validation set was used as the retrieval database, and 10% of images were randomly selected to serve as query examples.
The experiments in this paper were conducted on the PyTorch platform, with the software environment consisting of Ubuntu 5.4, CUDA-11.4, PyTorch-1.10.0, and Python 3.7, and accelerated by NVIDIA A100. During the model training, all the input images were resized to 224 × 224. A learning rate decay strategy was applied with a decay rate of 0.05 and an initial learning rate of 0.0001. The batch size was set to 128, and AdamW was used for optimization. The number of epochs was set to 200 for both the UCM-multi and AID-multi datasets, and 150 for the MLRS dataset.
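A rough PyTorch sketch of the reported input pipeline and optimizer settings is given below; interpreting the 0.05 decay rate as AdamW weight decay is our assumption:

```python
from torch.optim import AdamW
from torchvision import transforms

# Input pipeline: all images resized to 224 x 224 as described
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def configure_optimizer(model):
    # AdamW with the reported initial learning rate of 1e-4; the 0.05 decay rate
    # is applied here as weight decay, which is one plausible reading of the text
    return AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
```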
4.3. Parameter Analysis
As shown in Equation (23), our loss function consists of three components: classification loss, pairwise similarity loss, and quantization loss, each associated with penalty coefficients α, β, and γ, respectively. To deeply analyze the impact of these parameters on retrieval performance, we conducted experiments on the UCM-multi, AID-multi, and MLRS datasets. In these experiments, the hash code length was set to 128 bits, and ACG@50 was used as the evaluation metric for retrieval results. During analysis, when the effect of a particular parameter is evaluated, the other two parameters are kept fixed.
Figure 5a,b,c,
Figure 6a,b,c, and
Figure 7a,b,c show the variation in retrieval results as the values of α, β, and γ change on these three datasets, respectively.
We first determined the range of variation for penalty coefficient α, fixing β and γ at 1.0 and 0.1, respectively, based on empirical observation. By analyzing the trends in
Figure 5a,
Figure 6a, and
Figure 7a, we observed that model achieves optimal retrieval performance when α is set to 2.5 for all three datasets. Additionally, when α and γ are set to 2.5 and 0.1, respectively, the best retrieval results are obtained when β is set to 1.0. As β increases, the model’s retrieval accuracy begins to decrease, particularly on the MLRS and UCM-multi dataset. For sensitivity analysis of quantization loss, we tested a set of empirical values for γ: {0.01, 0.05, 0.1, 0.1, 1.0}, and the retrieval accuracy reaches its peak when γ is set to 0.1 in the three datasets, as shown in
Figure 5c,
Figure 6c, and
Figure 7c. Further increases in the weight of γ lead to a substantial decline in accuracy. This suggests that the weight of quantization loss should be carefully tuned to avoid over-emphasizing the quantization effect on hash codes.
Based on the above analyses, we determined that penalty coefficients α, β, and γ should be set to 2.5, 1.0, and 0.1, respectively. These parameter values strike a balance for the model, ensuring high classification accuracy of hash codes while maintaining an appropriate trade-off between pairwise similarity loss and quantization loss. This optimal configuration enables the model to achieve the best retrieval performance for the task of multi-label remote sensing image retrieval.
4.4. Ablation Experiment
After determining the optimal weights of each component of the loss function, we conducted a series of ablation experiments on three datasets. These experiments aim to not only validate the contribution of each module to overall accuracy in different datasets but also to assess model’s retrieval performance when handling a large-scale remote sensing dataset (the MLRS dataset being retrieved in the ablation experiments includes more than 20,000 images). The proposed model consists of two core modules: a feature extraction module and a hash learning module. In the feature extraction module, we incorporate a fusion strategy that combines shallow and deep features to enhance feature representation. In the hash learning module, we leverage image category information learned by the model and integrate it into the similarity metric to improve retrieval performance.
We progressively integrated these modules into the baseline model and constructed four configurations for comparison: (i) Extracting image features using the Swin Transformer base model and relying on hash codes in the similarity metric; (ii) Using the Swin Transformer base model and weighting the Hamming distance by incorporating category labeling information; (iii) Using the improved Swin Transformer model with Hamming distance; (iv) Extracting image features using the improved Swin Transformer model and weighting Hamming distance by combining category label information.
We evaluated the performance using NDCG, ACG, and WAP metrics, focusing on the top 10 and top 50 retrieved images. Final experimental results, presented in
Table 1,
Table 2 and
Table 3, illustrate the quantitative impact of different model components on retrieval performance across the three datasets.
By analyzing the experimental results, we find that the proposed method achieves satisfactory performance in the retrieval task, with all components of the model contributing positively to the results. Specifically, feature extraction capability is enhanced by the feature fusion module compared to the base Swin Transformer model. For the top 10 retrieved images, the accuracy of NDCG, ACG, and WAP improved by 0.2%, 0.1%, and 0.1% on the UCM-multi dataset; 0.5%, 0.6%, and 0.9% on the AID-multi dataset; and 0.2%, 0.4%, and 0.5% on the MLRS dataset. For the top 50 retrieved images, these metrics showed improvements of 0.1%, 0.1%, and 0.1% on the UCM-multi dataset; 0.2%, 0.8%, and 1.0% on the AID-multi dataset; and 0.2%, 0.6%, and 0.8% on the MLRS dataset.
Furthermore, by incorporating model-predicted image category information into the similarity metric, retrieval accuracy improves even more significantly, especially on the MLRS dataset. For the top 10 retrieved images, the accuracy increased by 0.2%, 0.7%, and 0.7% on the UCM-multi dataset; 0.1%, 0.2%, and 0.2% on the AID-multi dataset; and 1.1%, 2.7%, and 3.4% on the MLRS dataset. For the top 50 retrieved images, the accuracy improved by 0.2%, 1.5%, and 1.5% on the UCM-multi dataset; 0.2%, 0.4%, and 0.5% on the AID-multi dataset; and 1.0%, 2.5%, and 3.3% on the MLRS dataset. These results demonstrate that the inclusion of inter-image category information can effectively enhance ACG, with the improvement being particularly evident in large-scale data retrieval, indicating that retrieved images share more common labels with the query image, thereby contributing to a substantial increase in WAP.
The experimental results demonstrate that the multi-feature fusion module proposed in this paper enhances the feature extraction capability of remote sensing images compared to the basic Swin Transformer model. Additionally, by incorporating category label information into the similarity metric and weighting the Hamming distance according to label similarity, the number of common labels between the query image and retrieved images increases. This ensures that retrieved images share more similar visual content with the query image, thereby improving retrieval accuracy.
4.5. Comparative Experiments
To comprehensively evaluate the effectiveness of the proposed method, we compare it with several existing multi-label retrieval methods, including HyP [
42], DMHR [
22], IDHN [
43], and FAH [
16], all of which use pairwise similarity to learn the relationships between image pairs. The comparisons are conducted on the UCM-multi and AID-multi datasets, using three different hash code lengths to assess the performance of retrieving the top 10 and top 50 most similar images.
Experimental results for UCM-multi dataset are presented in
Table 4 and
Table 5. The data clearly demonstrate that the SGDSH model outperforms comparison methods across all hash code lengths. Specifically, when hash code length is set to 96 or 128, our method achieves superior results. For the task of retrieving the top 10 most similar images, MAP reaches 99.623%, outperforming HyP by 0.237%, DMHR by 3.491%, IDHN by 10.236%, and FAH by 28.916%. In the task of retrieving the top 50 images, MAP reaches 98.688%, which is 0.347% higher than HyP, 8.57% higher than DMHR, 16.947% higher than IDHN, and 32.657% higher than FAH.
Figure 8 and
Figure 9 present the performance results of each method across different hash code lengths on the UCM-multi dataset.
Experimental results on the AID-multi dataset, shown in
Table 6 and
Table 7, demonstrate a performance similar to that observed on the UCM-multi dataset, with the SGDSH method outperforming all other methods when hash code length is set to 128 bits. As the dataset size increases, particularly with the variation in spatial resolution across different images, the gap between the SGDSH method and the other methods becomes more pronounced. Notably, the HyP method remains competitive, maintaining a similar mean average precision (mAP) to SGDSH on the AID-multi dataset. Specifically, for the task of retrieving the top 10 most similar images, SGDSH achieves a MAP of 99.566%, surpassing the other methods by 0.492% over HyP, 2.459% over DMHR, 20.522% over IDHN, and 21.733% over FAH. In the task of retrieving the top 50 most similar images, SGDSH reaches a MAP of 98.932%, which is 0.888% higher than HyP, 4.938% higher than DMHR, 20.626% higher than FAH, and 21.094% higher than IDHN. In terms of ACG, SGDSH shows an even more significant superiority on this dataset compared to the UCM-multi dataset. For the top 10 and top 50 image retrieval tasks, the ACG results of SGDSH are higher than HyP by up to 7.853% and 7.62%, higher than DMHR by up to 17.707% and 17.2%, higher than FAH by up to 37.163% and 32.762%, and higher than IDHN by up to 39.862% and 35.802%, respectively. These results indicate that the SGDSH method is particularly effective in generating distinguishable hash codes for datasets with varying spatial resolutions.
Figure 10 and
Figure 11 present the performance results of each method across different hash code lengths on the AID-multi dataset.
In the previous comparison experiments, both our method and the HyP method performed significantly better than the other comparison methods, and their MAP results on both datasets are very close to each other. To further validate the effectiveness of the proposed method, we compared it with HyP on the MLRS dataset. In contrast to the UCM-multi dataset (2100 images) and AID-multi dataset (3000 images), the MLRS dataset contains a significantly larger volume of data (109,161 images) and a greater number of categories (60 predefined types), making it more representative of actual applications. The retrieval results on the MLRS dataset are presented in
Table 8 and
Table 9, and
Figure 12 and
Figure 13 present the performance results of our method and the HyP method.
The experimental results show that both methods achieve the best performance when hash code length is set to 128 bits, indicating that longer hash codes are more effective at retaining the original high-dimensional feature information. In the mAP results for retrieving the top 10 most similar images, the proposed method improves by 1.134% compared to the HyP method. For retrieving the top 50 most similar images, the proposed method improves by 2.342%. This demonstrates that the proposed method maintains excellent retrieval accuracy even when dealing with large-scale datasets. In terms of ACG, the proposed method outperforms the HyP method by 4.049% for the top 10 images and by 4.241% for the top 50 images. This consistently shows that the retrieval results in the proposed method maintain a high similarity to the query images in terms of feature type and visual content. These results further validate the effectiveness of the proposed method in multi-label remote sensing image retrieval tasks.
Summarizing the results of comparison experiments in this section, our method demonstrates the effectiveness of the SGDSH approach when compared with other state-of-the-art methods across three different datasets. This can be attributed to the following key factors: The Multi-scale Feature Fusion Module effectively learns and fuses various types of features from remote sensing images at different scales, enhancing the model’s feature extraction capability and ensuring a comprehensive representation of different feature types. Our proposed pairwise similarity loss strategy effectively reduces the distance between images with more common labels. This strategy helps ensure that images with similar labels are ranked higher in retrieval results, thereby improving the retrieval performance. The sorting strategy combined with category labels enhances visual similarity between query image and retrieved images. This results in satisfying retrieval outcomes, even when the dataset is large and images have inconsistent spatial resolutions. These experimental results fully confirm the effectiveness and practicality of the SGDSH method in the multi-label remote sensing image retrieval task.
4.6. Visual Analysis
To visualize the performance of different methods in the task of multi-label remote sensing image retrieval, we present a comparison of query retrieval results of the proposed method with those of HyP, DMHR, IDHN, and FAH on the UCM-multi and AID-multi datasets. It is important to note that all methods use a uniform 128-bit hash code, and the query image does not serve as a retrieved image, meaning it is excluded from the retrieval process. To further validate the effectiveness of the similarity metric strategy proposed in this paper, we also include the results of our method without Jaccard weighting, referred to as “our method”. The retrieval results are displayed in
Figure 14 and
Figure 15.
We specifically selected natural landscape images and artificial facility images as two types of retrieval queries. The retrieval results demonstrate that our proposed method ensures all returned results are correct for both queries, with retrieved images showing a higher similarity to the actual labels of query images. This means the retrieved images share more labels with the query images. Notably, by weighting the Hamming distance, our method preferentially returns images that share more labels with the query, leading to retrieval results that are closer to the ideal outcome. This indicates that the proposed method can effectively recognize different feature classes within remote sensing images, extract features from each class, and deliver strong ranking performance, thus providing more accurate and reliable retrieval results for practical applications.
5. Discussion
To enhance the feature extraction capability of remote sensing images, this study proposes a multi-scale feature fusion model, Semantically Guided Deep Supervised Hashing (SGDSH), based on Swin Transformer. The model effectively leverages detailed information in shallow features and rich semantic information in high-level features, while suppressing irrelevant information. This end-to-end deep hashing framework enables multi-scale feature fusion, optimizing the retrieval process. To improve the quality of hash code extraction, a classification layer is added after the hash layer, using real feature labels to guide training of the hash code. The predicted label information is then incorporated into the image retrieval task. The experimental results demonstrate that the inclusion of label information during computation and similarity ranking significantly improves the accuracy of multi-label remote sensing image retrieval. This improvement indicates that the proposed method effectively captures both the categories and distributions of ground objects within remote sensing images. It prioritizes the retrieval of images with higher semantic similarity to the query in terms of land cover types. This leads to retrieval results that better reflect the actual composition of the query image, making the method more suitable for practical applications in remote sensing image analysis. For example, in disaster assessment, retrieving semantically similar historical images from remote sensing databases aids in damage evaluation and response planning. In agricultural monitoring, identifying similar land cover types across different periods helps track crop growth patterns and detect anomalies. By improving the efficiency and accuracy of multi-label remote sensing image retrieval, this method provides a valuable tool for large-scale geospatial analysis and decision-making.
While the proposed method exhibits strong retrieval performance, there are still some limitations that need to be addressed in future work. Firstly, this paper introduces a classification layer to incorporate predicted labels into the similarity measure calculation; however, improving the accuracy of the predicted labels remains an important issue. Secondly, in judging the similarity of image pairs, this paper uses a label similarity threshold of 0.4 as the criterion for determining whether two images are similar, and the impact of higher similarity thresholds on model training and retrieval performance has not been explored. Additionally, the retrieval speed has not been experimentally analyzed. In our model, we adopt an improved Swin Transformer framework as the backbone. While comparative experiments demonstrate that our method achieves higher retrieval accuracy than other state-of-the-art approaches, the increased complexity of the Transformer-based model introduces higher computational overhead, leading to longer inference times than traditional CNN-based models. Moreover, while the similarity metric proposed in this paper enhances retrieval accuracy, it also requires the storage of image label information when constructing the feature library, which significantly increases storage costs, especially when building large-scale remote sensing image feature libraries. Furthermore, the two-stage calculation process for the similarity metric (first computing the hash code and label information of the query image, then combining them with the hash codes and category labels of all images in the database) inevitably adds a time overhead. In future research, we aim to develop a lightweight Transformer-based architecture that maintains the feature extraction capability for remote sensing images while reducing model complexity. Additionally, we plan to integrate hash codes and predicted label information into a unified compressed feature representation, aiming to reduce storage requirements and improve retrieval efficiency while preserving retrieval accuracy. Through these research directions, we aim to develop a more lightweight and efficient multi-label remote sensing image retrieval method.
6. Conclusions
In this paper, we propose an end-to-end multi-label remote sensing image retrieval method called Semantically Guided Deep Supervised Hashing (SGDSH), based on an improved Swin Transformer model. By incorporating an innovative multi-scale feature fusion module, we enhance the model’s ability to extract deep features from remote sensing images. Additionally, our method leverages the semantic information in the data, with a carefully designed weighted pairwise similarity loss function and semantic labels, enabling the model to generate hash features with superior differentiation. To further improve the ranking performance of hash features, we introduce image category labels to compute similarity and apply a weighted Hamming distance between image features. This approach significantly enhances retrieval accuracy and ensures that the retrieved images share more common labels with the query image.
The experimental results demonstrate that the SGDSH method achieves outstanding retrieval performance across all three datasets. In the retrieval task of returning the top 10 most similar images, the ACG and WAP metrics for SGDSH reach 94.148%, 93.388%, and 81.824% (ACG) and 93.924%, 93.178%, and 79.975% (WAP) on the UCM-multi, AID-multi, and MLRS datasets, respectively. For the top 50 images, these metrics are 86.203%, 89.579%, and 78.992% (ACG) and 85.438%, 89.042%, and 76.194% (WAP). Comparative experiments with other methods further highlight the advantages of SGDSH, showing significant improvements in ACG and WAP over the best-performing comparison method. Specifically, for the top 10 images returned, SGDSH outperforms it by 2.673%, 7.853%, and 4.049% in ACG and by 2.827%, 8.163%, and 4.722% in WAP on the three datasets, respectively, and, for the top 50 images, the improvements are 2.181%, 7.62%, and 4.241% in ACG and 2.339%, 8.07%, and 5.358% in WAP. Additionally, the visualization analysis confirms that SGDSH not only ensures high retrieval accuracy but also enhances the label similarity between the query and retrieved images. These results validate the effectiveness of SGDSH in large-scale remote sensing image retrieval.
Despite promising results, the proposed SGDSH method has some limitations, including higher inference time due to the Transformer-based backbone, increased storage costs from label information in the feature library, and additional retrieval time caused by the two-stage similarity computation, which may hinder large-scale retrieval. To address these issues, future research will focus on developing a lightweight Transformer-based architecture that reduces model complexity while retaining feature extraction capability, as well as integrating hash codes and label information into a unified compressed representation. These efforts aim to improve the scalability and efficiency of multi-label remote sensing image retrieval, providing valuable insights and opening up new research avenues in this field.