Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning
Figure 1. The overall structure of our CRSR model. The image features and text features are extracted by the CLIP image encoder and text encoder from the input image and the sentence pool, respectively. The image features are transformed into a sequence of visual tokens by the Transformer Mapper network with learnable queries. Based on cross-modal retrieval, the retrieved relevant sentences are split into individual words and fed into the semantic refinement module, which filters out irrelevant words. Finally, the visual tokens, query tokens, and semantic tokens are fed into the cross-modal decoder to generate the corresponding image caption.
Figure 2. The structure of the *i*-th cross-modal transformer block of the decoding module. "M-H Att" denotes the multi-head attention layer; "Input Embed" denotes the input of the *i*-th block.
Figure 3. Captioning results of the baseline model and our proposed CRSR method. "GT" denotes the ground-truth captions, "base" the descriptions generated by the baseline model, and "ours" the descriptions generated by our CRSR method. "retr" denotes the words retrieved from the sentence pool, while "miss" denotes the words overlooked during retrieval. Blue words mark scene content captured only by our model compared with the baseline; red words mark incorrectly generated words in the descriptions and retrieved words.
Figure 4. Visualized attention matrix between the semantic tokens and the generated captions of the input images.
Abstract
1. Introduction
- We propose a new CRSR method that incorporates a CLIP-based retrieval model and a semantic refinement module, effectively addressing the limitations of existing two-stage approaches. We first obtain semantic information through a fine-tuned retrieval model; the semantic refinement module then filters out misleading words through a masked cross-attention mechanism (see the sketch under Section 3.1).
- We introduce a Transformer Mapper network designed to provide a comprehensive representation of the image features that extends beyond the retrieved information, using attention mechanisms and learnable queries for semantic prediction. The projected image features, together with the learnable queries, are processed with self-attention to capture intricate relationships and dependencies within the image features, with particular focus on otherwise overlooked semantic regions, enabling the model to analyze and understand more of the semantic details present in RSIs (see the sketch under Section 3.2).
- Extensive experiments on three diverse datasets, RSICD, UCM-Captions, and Sydney-Captions, validate the effectiveness of the proposed CRSR method: it achieves higher captioning accuracy than other state-of-the-art methods on all three benchmarks.
2. Related Works
2.1. One-Stage Methods
2.2. Two-Stage Methods
3. Methodology
3.1. Semantic Refinement
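As a companion to the first contribution listed above, the following is a minimal PyTorch-style sketch of how CLIP-based cross-modal retrieval and masked cross-attention refinement could fit together. The cosine-similarity retrieval, the linear relevance head, and the top-k keep ratio are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def retrieve_sentences(image_feat, sentence_feats, k=5):
    """Cross-modal retrieval sketch: cosine similarity between the CLIP image
    embedding (d,) and every sentence embedding in the pool (N, d); returns the
    indices of the top-k most similar sentences. Plain cosine retrieval is assumed."""
    sims = F.cosine_similarity(image_feat.unsqueeze(0), sentence_feats, dim=-1)
    return sims.topk(k).indices

class SemanticRefinement(nn.Module):
    """Filter the words of the retrieved sentences with masked cross-attention:
    word embeddings attend to the visual tokens, and words whose learned relevance
    score falls outside the top keep_ratio fraction are dropped (illustrative choice)."""
    def __init__(self, d_model=512, n_heads=8, keep_ratio=0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.score = nn.Linear(d_model, 1)
        self.keep_ratio = keep_ratio

    def forward(self, word_tokens, visual_tokens):
        # word_tokens: (B, W, d) embeddings of words split from the retrieved sentences
        # visual_tokens: (B, V, d) outputs of the Transformer Mapper
        attended, _ = self.attn(word_tokens, visual_tokens, visual_tokens)
        relevance = self.score(attended).squeeze(-1)            # (B, W) relevance per word
        k = max(1, int(self.keep_ratio * word_tokens.size(1)))
        keep = relevance.topk(k, dim=-1).indices                # indices of retained words
        batch_idx = torch.arange(word_tokens.size(0)).unsqueeze(-1)
        return attended[batch_idx, keep]                        # (B, k, d) semantic tokens
```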
3.2. Transformer Mapper
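In the same spirit, here is a hedged sketch of a Transformer Mapper that expands the global CLIP image embedding into a sequence of visual tokens, appends learnable queries, and applies self-attention over both so the queries can pick up semantic details beyond the retrieved words. The projection scheme, token counts, and encoder depth are assumed values, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TransformerMapper(nn.Module):
    """Map a global CLIP image feature to visual tokens plus learnable query tokens."""
    def __init__(self, clip_dim=512, d_model=512, vis_len=10, n_queries=10,
                 n_layers=4, n_heads=8):
        super().__init__()
        # Expand the single CLIP vector into vis_len token embeddings.
        self.project = nn.Linear(clip_dim, vis_len * d_model)
        # Learnable queries, analogous to DETR-style query embeddings.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.vis_len, self.d_model = vis_len, d_model

    def forward(self, clip_feat):
        # clip_feat: (B, clip_dim) global feature from the CLIP image encoder
        B = clip_feat.size(0)
        vis = self.project(clip_feat).view(B, self.vis_len, self.d_model)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        tokens = self.encoder(torch.cat([vis, q], dim=1))       # (B, vis_len + n_queries, d)
        return tokens[:, :self.vis_len], tokens[:, self.vis_len:]  # visual tokens, query tokens
```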
3.3. Cross-Modal Decoder
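Following Figure 2, a minimal sketch of one cross-modal transformer block of the decoder: masked self-attention over the caption tokens ("Input Embed"), cross-attention ("M-H Att") over the concatenated visual, query, and semantic tokens, and a feed-forward network. Layer ordering, normalization placement, and dimensions are illustrative assumptions rather than the paper's exact specification.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One decoder block: masked self-attention, cross-attention over the
    visual/query/semantic memory, then a feed-forward network (post-norm layout assumed)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, memory, causal_mask=None):
        # x: (B, T, d) caption token embeddings ("Input Embed" of the i-th block)
        # memory: (B, S, d) concatenated visual tokens, query tokens, and semantic tokens
        h, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + h)
        h, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + h)
        return self.norm3(x + self.ffn(x))
```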
4. Experiments and Analysis
4.1. Datasets
4.1.1. RSICD
4.1.2. UCM-Captions
4.1.3. Sydney-Captions
4.1.4. Datasets Alignment
4.2. Evaluation Metrics
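The results tables below report BLEU-1 through BLEU-4, METEOR, ROUGE-L, CIDEr, SPICE, and a summary score labeled Sm here; its definition is inferred from the tabulated numbers as the mean of BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE, which the following check reproduces:

```python
def s_m(bleu4: float, meteor: float, rouge_l: float, cider: float, spice: float) -> float:
    """Inferred Sm score: the mean of BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE."""
    return (bleu4 + meteor + rouge_l + cider + spice) / 5.0

# Sydney-Captions, ViT-B/32 row of the image-encoder table below:
assert round(s_m(0.6602, 0.4150, 0.7488, 2.8900, 0.4845), 4) == 1.0397
```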
4.3. Experimental Details
4.4. Experiments on Image Encoder
4.4.1. Different Image Feature Extractors
4.4.2. Transformer Mapper Projection Lengths
4.5. Ablation Studies
4.6. Comparison with Other Methods
4.7. Analysis of Training and Testing Time
4.8. Qualitative Analysis
4.8.1. Generated Captioning Results
4.8.2. Visualization of Attention Weights
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Shi, Z.; Zou, Z. Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image? IEEE Trans. Geosci. Remote Sens. 2017, 55, 3623–3634.
- Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2183–2195.
- Recchiuto, C.T.; Sgorbissa, A. Post-disaster assessment with unmanned aerial vehicles: A survey on practical implementations and research approaches. J. Field Robot. 2018, 35, 459–490.
- Tian, Y.; Sun, X.; Niu, R.; Yu, H.; Zhu, Z.; Wang, P.; Fu, K. Fully-weighted HGNN: Learning efficient non-local relations with hypergraph in aerial imagery. ISPRS J. Photogram. Remote Sens. 2022, 191, 263–276.
- Hossain, M.Z.; Sohel, F.; Shiratuddin, M.F.; Laga, H. A comprehensive survey of deep learning for image captioning. ACM Comput. Surv. 2019, 51, 1–36.
- Zhao, B. A systematic survey of remote sensing image captioning. IEEE Access 2021, 9, 154086–154111.
- Elman, J.L. Finding structure in time. Cognit. Sci. 1990, 14, 179–211.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Sun, X.; Tian, Y.; Lu, W.; Wang, P.; Niu, R.; Yu, H.; Fu, K. From single- to multi-modal remote sensing imagery interpretation: A survey and taxonomy. Sci. China Inf. Sci. 2023, 66, 140301.
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057.
- Cheng, G.; Li, Q.; Wang, G.; Xie, X.; Min, L.; Han, J. SFRNet: Fine-Grained Oriented Object Recognition via Separate Feature Refinement. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610510.
- Niu, R.; Sun, X.; Tian, Y.; Diao, W.; Chen, K.; Fu, K. Hybrid Multiple Attention Network for Semantic Segmentation in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5603018.
- Zhang, X.; Wang, X.; Tang, X.; Zhou, H.; Li, C. Description generation for remote sensing images using attribute attention mechanism. Remote Sens. 2019, 11, 612.
- Wang, B.; Zheng, X.; Qu, B.; Lu, X. Retrieval topic recurrent memory network for remote sensing image captioning. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2020, 13, 256–270.
- Chen, J.; Han, Y.; Wan, L.; Zhou, X.; Deng, M. Geospatial relation captioning for high-spatial-resolution images by using an attention-based neural network. Int. J. Remote Sens. 2019, 40, 6482–6498.
- Ye, X.; Wang, S.; Gu, Y.; Wang, J.; Wang, R.; Hou, B.; Giunchiglia, F.; Jiao, L. A Joint-Training Two-Stage Method for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4709616.
- Zhang, Z.; Diao, W.; Zhang, W.; Yan, M.; Gao, X.; Sun, X. LAM: Remote sensing image captioning with label-attention mechanism. Remote Sens. 2019, 11, 2349.
- Zhao, R.; Shi, Z.; Zou, Z. High-Resolution Remote Sensing Image Captioning Based on Structured Attention. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5603814.
- Sarto, S.; Cornia, M.; Baraldi, L.; Cucchiara, R. Retrieval-Augmented Transformer for Image Captioning. In Proceedings of the 19th International Conference on Content-Based Multimedia Indexing, Graz, Austria, 14–16 September 2022.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the ICML, Online, 18–24 July 2021.
- Shen, S.; Li, L.H.; Tan, H.; Bansal, M.; Rohrbach, A.; Chang, K.W.; Yao, Z.; Keutzer, K. How much can CLIP benefit vision-and-language tasks? arXiv 2021, arXiv:2107.06383.
- Lu, J.; Goswami, V.; Rohrbach, M.; Parikh, D.; Lee, S. 12-in-1: Multi-task vision and language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10437–10446.
- Mokady, R.; Hertz, A.; Bermano, A.H. ClipCap: CLIP prefix for image captioning. arXiv 2021, arXiv:2111.09734.
- Qu, B.; Li, X.; Tao, D.; Lu, X. Deep semantic understanding of high resolution remote sensing image. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 6–8 July 2016; pp. 1–5.
- Li, Y.; Fang, S.; Jiao, L.; Liu, R.; Shang, R. A multi-level attention model for remote sensing image captions. Remote Sens. 2020, 12, 939.
- Huang, W.; Wang, Q.; Li, X. Denoising-based multiscale feature fusion for remote sensing image captioning. IEEE Geosci. Remote Sens. Lett. 2021, 18, 436–440.
- Li, Y.; Zhang, X.; Gu, J.; Li, C.; Wang, X.; Tang, X.; Jiao, L. Recurrent attention and semantic gate for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16.
- Li, X.; Zhang, X.; Huang, W.; Wang, Q. Truncation cross entropy loss for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5246–5257.
- Zhang, Z.; Zhang, W.; Yan, M.; Gao, X.; Fu, K.; Sun, X. Global visual feature and linguistic state guided attention for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5608816.
- Hoxha, G.; Melgani, F. A novel SVM-based decoder for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5404514.
- Wang, Q.; Huang, W.; Zhang, X.; Li, X. Word–Sentence framework for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 10532–10543.
- Sumbul, G.; Nayak, S.; Demir, B. SD-RSIC: Summarization-driven deep remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6922–6934.
- Kandala, H.; Saha, S.; Banerjee, B.; Zhu, X. Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6514905.
- Yang, Q.; Ni, Z.; Ren, P. Meta captioning: A meta learning based remote sensing image captioning framework. ISPRS J. Photogram. Remote Sens. 2022, 186, 190–200.
- Zhang, X.; Li, Y.; Wang, X.; Liu, F.; Wu, Z.; Cheng, X.; Jiao, L. Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning. Remote Sens. 2023, 15, 579.
- Shen, X.; Liu, B.; Zhou, Y.; Zhao, J.; Liu, M. Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning. Knowl.-Based Syst. 2020, 203, 105920.
- Du, R.; Cao, W.; Zhang, W.; Zhi, G.; Sun, X.; Li, S.; Li, J. From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2023, 16, 7704–7717.
- Li, Y.; Zhang, X.; Cheng, X.; Tang, X.; Jiao, L. Learning consensus-aware semantic knowledge for remote sensing image captioning. Pattern Recognit. 2024, 145, 109893.
- Carion, N.; Massa, F.; Synnaeve, G. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 213–229.
- Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (GIS), San Jose, CA, USA, 2–5 November 2010; pp. 270–279.
- Zhang, F.; Du, B.; Zhang, L. Saliency-guided unsupervised feature learning for scene classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2175–2184.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318.
- Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation (StatMT), Morristown, NJ, USA, 29–30 June 2005; pp. 65–72.
- Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81.
- Vedantam, R.; Zitnick, C.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4566–4575.
- Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic propositional image caption evaluation. Proc. Eur. Conf. Comput. Vis. 2016, 9909, 382–398.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the ICLR, San Diego, CA, USA, 7–9 May 2015.
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086.
Statistics of the three datasets.
Dataset | Categories | Mean Caption Length | Vocab Size (before) | Vocab Size (after) | Number of Images
---|---|---|---|---|---
Sydney-Captions | 7 | 13.2 | 231 | 179 | 613
UCM-Captions | 21 | 11.5 | 315 | 298 | 2100
RSICD | 30 | 11.4 | 2695 | 1252 | 10,000
Captioning results with different CLIP image feature extractors.
Dataset | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE | Sm
---|---|---|---|---|---|---|---|---|---|---
Sydney-Captions | RN50 | 0.7827 | 0.7241 | 0.6775 | 0.6369 | 0.4141 | 0.7351 | 2.7790 | 0.4614 | 1.0053
Sydney-Captions | RN50x4 | 0.7826 | 0.7226 | 0.6701 | 0.6265 | 0.3984 | 0.7260 | 2.7286 | 0.4680 | 0.9895
Sydney-Captions | RN101 | 0.7935 | 0.7284 | 0.6792 | 0.6358 | 0.3951 | 0.7230 | 2.7214 | 0.4789 | 0.9908
Sydney-Captions | ViT-B/16 | 0.7756 | 0.7137 | 0.6612 | 0.6146 | 0.3958 | 0.7303 | 2.8420 | 0.4670 | 1.0100
Sydney-Captions | ViT-B/32 | 0.7994 | 0.7440 | 0.6987 | 0.6602 | 0.4150 | 0.7488 | 2.8900 | 0.4845 | 1.0397
UCM-Captions | RN50 | 0.8959 | 0.8425 | 0.7941 | 0.7491 | 0.4794 | 0.8425 | 3.8020 | 0.5253 | 1.2797
UCM-Captions | RN50x4 | 0.9034 | 0.8553 | 0.8137 | 0.7751 | 0.4994 | 0.8472 | 3.7401 | 0.5448 | 1.2813
UCM-Captions | RN101 | 0.9026 | 0.8559 | 0.8097 | 0.7638 | 0.4853 | 0.8544 | 3.8206 | 0.5179 | 1.2884
UCM-Captions | ViT-B/16 | 0.8928 | 0.8451 | 0.7981 | 0.7520 | 0.4919 | 0.8587 | 3.7753 | 0.5277 | 1.2811
UCM-Captions | ViT-B/32 | 0.9060 | 0.8561 | 0.8122 | 0.7681 | 0.4956 | 0.8586 | 3.8069 | 0.5201 | 1.2899
RSICD | RN50 | 0.8063 | 0.7029 | 0.6178 | 0.5458 | 0.3911 | 0.7027 | 2.9708 | 0.5085 | 1.0238
RSICD | RN50x4 | 0.8056 | 0.7025 | 0.6173 | 0.5458 | 0.3909 | 0.7011 | 2.9630 | 0.5133 | 1.0228
RSICD | RN101 | 0.7998 | 0.6981 | 0.6122 | 0.5392 | 0.3921 | 0.6993 | 2.9863 | 0.5130 | 1.0260
RSICD | ViT-B/16 | 0.8136 | 0.7077 | 0.6207 | 0.5484 | 0.3956 | 0.7052 | 3.0062 | 0.5190 | 1.0349
RSICD | ViT-B/32 | 0.8192 | 0.7171 | 0.6307 | 0.5574 | 0.4015 | 0.7134 | 3.0687 | 0.5276 | 1.0537
Captioning results with different Transformer Mapper projection lengths.
Dataset | Projection Length | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE | Sm
---|---|---|---|---|---|---|---|---|---|---
Sydney-Captions | 5 | 0.7715 | 0.7130 | 0.6666 | 0.6300 | 0.4068 | 0.7372 | 2.8260 | 0.4596 | 1.0119
Sydney-Captions | 10 | 0.7994 | 0.7440 | 0.6987 | 0.6602 | 0.4150 | 0.7488 | 2.8900 | 0.4845 | 1.0397
Sydney-Captions | 15 | 0.7886 | 0.7220 | 0.6662 | 0.6187 | 0.3991 | 0.7232 | 2.7212 | 0.4557 | 0.9836
Sydney-Captions | 20 | 0.7829 | 0.7148 | 0.6549 | 0.6068 | 0.4007 | 0.7261 | 2.7861 | 0.4756 | 0.9991
UCM-Captions | 5 | 0.8851 | 0.8345 | 0.7911 | 0.7502 | 0.4871 | 0.8424 | 3.7313 | 0.5010 | 1.2624
UCM-Captions | 10 | 0.9060 | 0.8561 | 0.8122 | 0.7681 | 0.4956 | 0.8586 | 3.8069 | 0.5201 | 1.2899
UCM-Captions | 15 | 0.8924 | 0.8410 | 0.7976 | 0.7574 | 0.4899 | 0.8531 | 3.7736 | 0.5272 | 1.2802
UCM-Captions | 20 | 0.9034 | 0.8580 | 0.8144 | 0.7689 | 0.4860 | 0.8471 | 3.7945 | 0.4990 | 1.2791
RSICD | 5 | 0.8050 | 0.6978 | 0.6072 | 0.5309 | 0.3957 | 0.7065 | 2.9726 | 0.5164 | 1.0245
RSICD | 10 | 0.8192 | 0.7171 | 0.6307 | 0.5574 | 0.4015 | 0.7134 | 3.0687 | 0.5276 | 1.0537
RSICD | 15 | 0.8110 | 0.7070 | 0.6205 | 0.5471 | 0.3973 | 0.7083 | 3.0261 | 0.5208 | 1.0399
RSICD | 20 | 0.8054 | 0.7038 | 0.6203 | 0.5500 | 0.4030 | 0.7114 | 3.0364 | 0.5193 | 1.0440
Ablation results of the proposed components on the three datasets.
Dataset | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE | Sm
---|---|---|---|---|---|---|---|---|---|---
Sydney-Captions | bs | 0.7697 | 0.6914 | 0.6219 | 0.5584 | 0.3817 | 0.6933 | 2.4097 | 0.4283 | 0.8943
Sydney-Captions | bs+m | 0.7754 | 0.6985 | 0.6297 | 0.5651 | 0.3894 | 0.7042 | 2.4429 | 0.4261 | 0.9055
Sydney-Captions | bs+mq | 0.7947 | 0.7200 | 0.6546 | 0.5932 | 0.4019 | 0.7265 | 2.6237 | 0.4461 | 0.9583
Sydney-Captions | bs+sr | 0.7873 | 0.7088 | 0.6425 | 0.5857 | 0.4035 | 0.7217 | 2.6867 | 0.4542 | 0.9704
Sydney-Captions | bs+mq+sr | 0.7994 | 0.7440 | 0.6987 | 0.6602 | 0.4150 | 0.7488 | 2.8900 | 0.4845 | 1.0397
UCM-Captions | bs | 0.8295 | 0.7660 | 0.7184 | 0.6747 | 0.4467 | 0.7820 | 3.5171 | 0.4898 | 1.1821
UCM-Captions | bs+m | 0.8434 | 0.7870 | 0.7414 | 0.7010 | 0.4639 | 0.8068 | 3.5358 | 0.5072 | 1.2029
UCM-Captions | bs+mq | 0.8623 | 0.8145 | 0.7703 | 0.7287 | 0.4694 | 0.8246 | 3.4941 | 0.5001 | 1.2034
UCM-Captions | bs+sr | 0.8918 | 0.8457 | 0.8015 | 0.7602 | 0.4909 | 0.8531 | 3.7335 | 0.5109 | 1.2697
UCM-Captions | bs+mq+sr | 0.9060 | 0.8561 | 0.8122 | 0.7681 | 0.4956 | 0.8586 | 3.8069 | 0.5201 | 1.2899
RSICD | bs | 0.7823 | 0.6729 | 0.5837 | 0.5090 | 0.3899 | 0.7012 | 2.8694 | 0.5076 | 0.9954
RSICD | bs+m | 0.8038 | 0.6943 | 0.6045 | 0.5313 | 0.3890 | 0.6964 | 2.8975 | 0.5063 | 1.0041
RSICD | bs+mq | 0.8068 | 0.6978 | 0.6090 | 0.5352 | 0.3894 | 0.6985 | 2.9658 | 0.5095 | 1.0197
RSICD | bs+sr | 0.7977 | 0.6906 | 0.6018 | 0.5288 | 0.3952 | 0.6956 | 2.9830 | 0.5189 | 1.0243
RSICD | bs+mq+sr | 0.8192 | 0.7171 | 0.6307 | 0.5574 | 0.4015 | 0.7134 | 3.0687 | 0.5276 | 1.0537
Comparison with other methods on the Sydney-Captions dataset.
Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE | Sm
---|---|---|---|---|---|---|---|---|---
Soft-Att [2] | 0.7322 | 0.6674 | 0.6223 | 0.5820 | 0.3942 | 0.7127 | 2.4993 | - | - |
Hard-Att [2] | 0.7591 | 0.6610 | 0.5889 | 0.5258 | 0.3898 | 0.7189 | 2.1819 | - | - |
Up-Down [48] | 0.8180 | 0.7484 | 0.6879 | 0.6305 | 0.3972 | 0.7270 | 2.6766 | - | - |
MLAM [25] | 0.7900 | 0.7108 | 0.6517 | 0.6052 | 0.4741 | 0.7353 | 2.1811 | 0.4089 | 0.8809 |
Re-ATT [27] | 0.8000 | 0.7217 | 0.6531 | 0.5909 | 0.3908 | 0.7218 | 2.6311 | 0.4301 | 0.9529 |
GVFGA + LSGA [29] | 0.7681 | 0.6846 | 0.6145 | 0.5504 | 0.3866 | 0.7030 | 2.4522 | 0.4532 | 0.9091 |
SVM-D CONC [30] | 0.7547 | 0.6711 | 0.5970 | 0.5308 | 0.3643 | 0.6746 | 2.2222 | - | - |
FC-ATT [13] | 0.8076 | 0.7160 | 0.6276 | 0.5544 | 0.4099 | 0.7114 | 2.2033 | 0.3951 | 0.8355 |
SM-ATT [13] | 0.8143 | 0.7351 | 0.6586 | 0.5806 | 0.4111 | 0.7195 | 2.3021 | 0.3976 | 0.8593 |
SAT (LAM-TL) [17] | 0.7425 | 0.6570 | 0.5913 | 0.5369 | 0.3700 | 0.6819 | 2.3563 | 0.4048 | 0.8698 |
Adaptive (LAM-TL) [17] | 0.7365 | 0.6440 | 0.5835 | 0.5348 | 0.3693 | 0.6827 | 2.3513 | 0.4351 | 0.8746 |
struc-att [18] | 0.7795 | 0.7019 | 0.6392 | 0.5861 | 0.3954 | 0.7299 | 2.3791 | - | - |
JTTS [16] | 0.8492 | 0.7797 | 0.7137 | 0.6496 | 0.4457 | 0.7660 | 2.8010 | 0.4679 | 1.0260 |
Meta-ML [34] | 0.7958 | 0.7274 | 0.6638 | 0.6068 | 0.4247 | 0.7300 | 2.3987 | - | - |
SCST [35] | 0.7643 | 0.6919 | 0.6283 | 0.5725 | 0.3946 | 0.7172 | 2.8122 | - | - |
DTFB [37] | 0.8373 | 0.7771 | 0.7198 | 0.6659 | 0.4548 | 0.7860 | 3.0369 | 0.4839 | 1.0855 |
CASK [38] | 0.7908 | 0.7200 | 0.6605 | 0.6088 | 0.4031 | 0.7354 | 2.6788 | 0.4637 | 0.9780 |
ours | 0.7994 | 0.7440 | 0.6987 | 0.6602 | 0.4150 | 0.7488 | 2.8900 | 0.4845 | 1.0397 |
Comparison with other methods on the UCM-Captions dataset.
Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE | Sm
---|---|---|---|---|---|---|---|---|---
Soft-Att [2] | 0.7454 | 0.6545 | 0.5855 | 0.5250 | 0.3886 | 0.7237 | 2.6124 | - | - |
Hard-Att [2] | 0.8157 | 0.7312 | 0.6702 | 0.6182 | 0.4263 | 0.7698 | 2.9947 | - | - |
Up-Down [48] | 0.8356 | 0.7748 | 0.7264 | 0.6833 | 0.4447 | 0.7967 | 3.3626 | - | - |
MLAM [25] | 0.8864 | 0.8233 | 0.7735 | 0.7271 | 0.5222 | 0.8441 | 3.3074 | 0.5021 | 1.1806 |
Re-ATT [27] | 0.8518 | 0.7925 | 0.7432 | 0.6976 | 0.4571 | 0.8072 | 3.3887 | 0.4891 | 1.1679 |
GVFGA + LSGA [29] | 0.8319 | 0.7657 | 0.7103 | 0.6596 | 0.4436 | 0.7845 | 3.3270 | 0.4853 | 1.1400 |
SVM-D CONC [30] | 0.7653 | 0.6947 | 0.6417 | 0.5942 | 0.3702 | 0.6877 | 2.9228 | - | - |
FC-ATT [13] | 0.8135 | 0.7502 | 0.6849 | 0.6352 | 0.4173 | 0.7504 | 2.9958 | 0.4867 | 1.1339 |
SM-ATT [13] | 0.8154 | 0.7575 | 0.6936 | 0.6458 | 0.4240 | 0.7632 | 3.1864 | 0.4875 | 1.1435 |
SAT (LAM-TL) [17] | 0.8208 | 0.7856 | 0.7525 | 0.7229 | 0.4880 | 0.7933 | 3.7088 | 0.5126 | 1.2450 |
Adaptive (LAM-TL) [17] | 0.857 | 0.812 | 0.775 | 0.743 | 0.510 | 0.826 | 3.758 | 0.535 | 1.2734 |
struc-att [18] | 0.8538 | 0.8035 | 0.7572 | 0.7149 | 0.4632 | 0.8141 | 3.3489 | - | - |
JTTS [16] | 0.8696 | 0.8224 | 0.7788 | 0.7376 | 0.4906 | 0.8364 | 3.7102 | 0.5231 | 1.2596 |
Meta-ML [34] | 0.8714 | 0.8199 | 0.7769 | 0.7390 | 0.4956 | 0.8344 | 3.7823 | - | - |
SCST [35] | 0.8727 | 0.8096 | 0.7551 | 0.7039 | 0.4652 | 0.8258 | 3.7129 | - | - |
DTFB [37] | 0.8230 | 0.7700 | 0.7228 | 0.6792 | 0.4439 | 0.7839 | 3.4629 | 0.4825 | 1.1705 |
CASK [38] | 0.8900 | 0.8416 | 0.7987 | 0.7575 | 0.4931 | 0.8578 | 3.8314 | 0.5227 | 1.2925 |
ours | 0.9060 | 0.8561 | 0.8122 | 0.7681 | 0.4956 | 0.8586 | 3.8069 | 0.5201 | 1.2899 |
Comparison with other methods on the RSICD dataset.
Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE | Sm
---|---|---|---|---|---|---|---|---|---
Soft-Att [2] | 0.6753 | 0.5308 | 0.4333 | 0.3617 | 0.3255 | 0.6109 | 1.9643 | - | - |
Hard-Att [2] | 0.6669 | 0.5182 | 0.4164 | 0.3407 | 0.3201 | 0.6084 | 1.7925 | - | - |
Up-Down [48] | 0.7679 | 0.6579 | 0.5699 | 0.4962 | 0.3534 | 0.6590 | 2.6022 | - | - |
MLAM [25] | 0.8058 | 0.6778 | 0.5866 | 0.5163 | 0.4718 | 0.7237 | 2.7716 | 0.4786 | 0.9924 |
Re-ATT [27] | 0.7729 | 0.6651 | 0.5782 | 0.5062 | 0.3626 | 0.6691 | 2.7549 | 0.4719 | 0.9529 |
GVFGA + LSGA [29] | 0.6779 | 0.5600 | 0.4781 | 0.4165 | 0.3285 | 0.5929 | 2.6012 | 0.4683 | 0.8815 |
SVM-D CONC [30] | 0.5999 | 0.4347 | 0.3355 | 0.2689 | 0.2299 | 0.4557 | 0.6854 | - | - |
FC-ATT [13] | 0.6671 | 0.5511 | 0.4691 | 0.4059 | 0.3225 | 0.5781 | 2.5763 | 0.4673 | 0.8700 |
SM-ATT [13] | 0.6699 | 0.5523 | 0.4703 | 0.4068 | 0.3255 | 0.5802 | 2.5738 | 0.4687 | 0.8710 |
SAT (LAM-TL) [17] | 0.6790 | 0.5616 | 0.4782 | 0.4148 | 0.3298 | 0.5914 | 2.6672 | 0.4707 | 0.8946 |
Adaptive (LAM-TL) [17] | 0.6756 | 0.5549 | 0.4714 | 0.4077 | 0.3261 | 0.5848 | 2.6285 | 0.4671 | 0.8828 |
struc-att [18] | 0.7016 | 0.5614 | 0.4648 | 0.3934 | 0.3291 | 0.5706 | 1.7031 | - | - |
JTTS [16] | 0.7893 | 0.6795 | 0.5893 | 0.5135 | 0.3773 | 0.6823 | 2.7958 | 0.4877 | 0.9713 |
Meta-ML [34] | 0.6866 | 0.5679 | 0.4839 | 0.4196 | 0.3249 | 0.5882 | 2.5244 | - | - |
SCST [35] | 0.7836 | 0.6679 | 0.5774 | 0.5042 | 0.3672 | 0.6730 | 2.8436 | - | - |
DTFB [37] | 0.7581 | 0.6416 | 0.5585 | 0.4923 | 0.3550 | 0.6523 | 2.5814 | 0.4579 | 0.9078 |
CASK [38] | 0.7965 | 0.6856 | 0.5964 | 0.5224 | 0.3745 | 0.6833 | 2.9343 | 0.4914 | 1.0012 |
ours | 0.8192 | 0.7171 | 0.6307 | 0.5574 | 0.4015 | 0.7134 | 3.0687 | 0.5276 | 1.0537 |
Training and testing time per epoch, parameter count, and Sm score (RSICD).
Model | Training Time/Epoch (s) | Testing Time/Epoch (s) | Params (M) | Sm
---|---|---|---|---
bs | 375 | 40 | 52.02 | 0.9954 |
bs+mq+sr(ours) | 414 | 45 | 67.26 | 1.0537 |
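For reference, a hedged sketch of how the quantities in this table are typically obtained; count_params_m and timed_epoch are hypothetical helpers, and the epoch callable stands in for the actual training or testing loop.

```python
import time
import torch

def count_params_m(model: torch.nn.Module) -> float:
    """Trainable parameters in millions, as in the 'Params (M)' column."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def timed_epoch(run_epoch) -> float:
    """Wall-clock seconds for one full training or testing epoch."""
    start = time.perf_counter()
    run_epoch()  # placeholder callable for the actual epoch loop
    return time.perf_counter() - start
```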
Citation: Li, Z.; Zhao, W.; Du, X.; Zhou, G.; Zhang, S. Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning. Remote Sens. 2024, 16, 196. https://doi.org/10.3390/rs16010196