research-article

Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval

Authors:

Cong Bai

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval

Pages 398 - 406

https://doi.org/10.1145/3591106.3592236

Published: 12 June 2023

Abstract

Recently, remote sensing cross-modal retrieval has received considerable attention from researchers. However, the unique nature of remote sensing images creates many semantic confusion zones in the semantic space, which greatly degrade retrieval performance. We propose a novel scene-aware aggregation network (SWAN) that reduces semantic confusion by improving scene perception capability. For visual representation, a visual multiscale fusion module (VMSF) fuses visual features at different scales and serves as the visual representation backbone. Meanwhile, a scene fine-grained sensing module (SFGS) establishes associations among salient features at different granularities. The visual information generated by these two modules forms a scene-aware visual aggregation representation. For textual representation, a textual coarse-grained enhancement module (TCGE) is designed to enhance the semantics of text and to align visual information. Furthermore, because the diversity and differentiation of remote sensing scenes weaken the understanding of scenes, we propose a new metric, scene recall, which measures the perception of scenes by evaluating scene-level retrieval performance and can also verify the effectiveness of our approach in reducing semantic confusion. Through performance comparisons, ablation studies, and visualization analysis, we validate the effectiveness and superiority of our approach on two datasets, RSICD and RSITMD. The source code is available at https://github.com/kinshingpoon/SWAN-pytorch.
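The abstract does not give a formula for scene recall, so the following is only a minimal sketch of one plausible reading: a query counts as a hit at rank K if at least one of its top-K retrieved items carries the same scene label as the query. The function name `scene_recall_at_k`, the label format, and the hit rule are all assumptions made for illustration; the paper and its repository define the actual metric.

```python
from typing import Sequence


def scene_recall_at_k(
    query_scenes: Sequence[str],
    ranked_gallery_scenes: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Percentage of queries whose top-k results contain the query's scene label.

    ASSUMPTION: this hit rule is an illustrative guess at a scene-level
    recall, not the definition used in the SWAN paper.

    query_scenes[i] is the scene label of query i.
    ranked_gallery_scenes[i] lists the scene labels of the gallery items
    retrieved for query i, ordered from most to least similar.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if len(query_scenes) != len(ranked_gallery_scenes):
        raise ValueError("one ranked list is required per query")
    if not query_scenes:
        return 0.0

    hits = sum(
        1
        for scene, ranked in zip(query_scenes, ranked_gallery_scenes)
        if scene in ranked[:k]  # any same-scene item within the top k
    )
    return 100.0 * hits / len(query_scenes)


if __name__ == "__main__":
    # Toy labels, invented purely for the example.
    queries = ["airport", "beach", "farmland"]
    rankings = [
        ["airport", "parking", "airport"],
        ["desert", "beach", "beach"],
        ["forest", "meadow", "forest"],
    ]
    for k in (1, 2, 3):
        print(k, round(scene_recall_at_k(queries, rankings, k), 2))
```

Under this reading, the metric credits a retrieval that lands in the correct scene even when the exact paired item is missed, which is what separates it from instance-level recall.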

Index Terms

Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval
1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval

Information & Contributors

Information

Published In

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval

June 2023

694 pages

ISBN:9798400701788

DOI:10.1145/3591106

Editors:
Ioannis (Yiannis) Kompatsiaris, Centre for Research and Technology Hellas, Greece
Jiebo Luo, University of Rochester, USA
Nicu Sebe, University of Trento, Italy
Angela Yao, National University of Singapore, Singapore
Vasileios Mezaris, Centre for Research and Technology Hellas, Greece
Symeon Papadopoulos, Centre for Research and Technology Hellas, Greece
Adrian Popescu, CEA LIST, France
Zi (Helen) Huang, University of Queensland, Australia

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Zhejiang Provincial Natural Science Foundation of China
Natural Science Foundation of China

Conference

ICMR '23

Sponsor:

SIGMM

ICMR '23: International Conference on Multimedia Retrieval

June 12 - 15, 2023

Thessaloniki, Greece

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 18
Total Downloads: 386
Downloads (Last 12 months): 201
Downloads (Last 6 weeks): 27

Reflects downloads up to 01 Jan 2025

Citations

Cited By

Liao Y, Zhang X, Yang R, Tao J, Liu B, Hu Z, Wang S, and Zhao Z. 2024. Selection and Reconstruction of Key Locals: A Novel Specific Domain Image-Text Retrieval Method. In Proceedings of the 32nd ACM International Conference on Multimedia, Cai J, Kankanhalli M, Prabhakaran B, Boll S, Subramanian R, Zheng L, Singh V, Cesar P, Xie L, and Xu D (Eds.). 5653–5662. Online publication date: 28-Oct-2024.
https://dl.acm.org/doi/10.1145/3664647.3681421

Yang R, Wang S, Tao J, Han Y, Lin Q, Guo Y, Hou B, and Jiao L. 2024. Accurate and Lightweight Learning for Specific Domain Image-Text Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, Cai J, Kankanhalli M, Prabhakaran B, Boll S, Subramanian R, Zheng L, Singh V, Cesar P, Xie L, and Xu D (Eds.). 9719–9728. Online publication date: 28-Oct-2024.
https://dl.acm.org/doi/10.1145/3664647.3681280

Ji Z, Meng C, Zhang Y, Wang H, Pang Y, and Han J. 2024. Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning. In Proceedings of the 32nd ACM International Conference on Multimedia, Cai J, Kankanhalli M, Prabhakaran B, Boll S, Subramanian R, Zheng L, Singh V, Cesar P, Xie L, and Xu D (Eds.). 1662–1671. Online publication date: 28-Oct-2024.
https://dl.acm.org/doi/10.1145/3664647.3681270

He X, Tang C, Liu X, Li C, An S, and Li Z. 2024. Heterogeneous Graph Guided Contrastive Learning for Spatially Resolved Transcriptomics Data. In Proceedings of the 32nd ACM International Conference on Multimedia, Cai J, Kankanhalli M, Prabhakaran B, Boll S, Subramanian R, Zheng L, Singh V, Cesar P, Xie L, and Xu D (Eds.). 8287–8295. Online publication date: 28-Oct-2024.
https://dl.acm.org/doi/10.1145/3664647.3680941

Zhong S, Hao X, Yan Y, Zhang Y, Song Y, and Liang Y. 2024. UrbanCross: Enhancing Satellite Image-Text Retrieval with Cross-Domain Adaptation. In Proceedings of the 32nd ACM International Conference on Multimedia, Cai J, Kankanhalli M, Prabhakaran B, Boll S, Subramanian R, Zheng L, Singh V, Cesar P, Xie L, and Xu D (Eds.). 6307–6315. Online publication date: 28-Oct-2024.
https://dl.acm.org/doi/10.1145/3664647.3680604

Chen Y, Huang J, Sun Z, Xiong S, and Lu X. 2024. Thread the Needle: Cues-Driven Multiassociation for Remote Sensing Cross-Modal Retrieval. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1–13. Online publication date: 2024.
https://doi.org/10.1109/TGRS.2024.3509639

Yang R, Wang S, Han Y, Li Y, Zhao D, Quan D, Guo Y, Jiao L, and Yang Z. 2024. Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1–17. Online publication date: 2024.
https://doi.org/10.1109/TGRS.2024.3496898

Mi L, Dai X, Castillo-Navarro J, and Tuia D. 2024. Knowledge-Aware Text–Image Retrieval for Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1–13. Online publication date: 2024.
https://doi.org/10.1109/TGRS.2024.3486977

Yang L, Zhou T, Ma W, Du M, Liu L, Li F, Zhao S, and Wang Y. 2024. Remote Sensing Image-Text Retrieval With Implicit-Explicit Relation Reasoning. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1–11. Online publication date: 2024.
https://doi.org/10.1109/TGRS.2024.3466909

He Y, Xu X, Chen H, Li J, and Pu F. 2024. Visual Global-Salient-Guided Network for Remote Sensing Image-Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1–14. Online publication date: 2024.
https://doi.org/10.1109/TGRS.2024.3466389
