DOI: 10.1145/3591106.3592236
research-article

Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval

Published: 12 June 2023

Abstract

Remote sensing cross-modal retrieval has recently received considerable attention from researchers. However, the unique nature of remote sensing images leads to many semantic confusion zones in the semantic space, which greatly degrades retrieval performance. We propose a novel scene-aware aggregation network (SWAN) that reduces semantic confusion by improving scene perception capability. For visual representation, a visual multiscale fusion module (VMSF) fuses visual features at different scales and serves as the visual representation backbone. Meanwhile, a scene fine-grained sensing module (SFGS) establishes associations among salient features at different granularities. The visual information generated by these two modules forms a scene-aware visual aggregation representation. For textual representation, a textual coarse-grained enhancement module (TCGE) is designed to enhance the semantics of the text and to align it with the visual information. Furthermore, because the diversity and differentiation of remote sensing scenes weaken the understanding of scenes, we propose a new metric, scene recall, which measures scene perception by evaluating scene-level retrieval performance and can also verify the effectiveness of our approach in reducing semantic confusion. Through performance comparisons, ablation studies, and visualization analysis, we validate the effectiveness and superiority of our approach on two datasets, RSICD and RSITMD. The source code is available at https://github.com/kinshingpoon/SWAN-pytorch.
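The abstract does not give a formula for scene recall, so the following is only a minimal sketch of what a scene-level retrieval score could look like. It assumes a hit rule in which a query counts as correct when any of its top-k retrieved items carries the same scene label as the query. The function name `scene_recall_at_k`, the hit rule, the toy similarity scores, and the scene labels are all hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def scene_recall_at_k(
    similarity: Sequence[Sequence[float]],
    query_scenes: Sequence[str],
    gallery_scenes: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose top-k results contain the query's scene label.

    ASSUMPTION: this hit rule is illustrative only; the abstract does not
    state the exact definition of scene recall.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not query_scenes:
        raise ValueError("at least one query is required")
    hits = 0
    for scores, scene in zip(similarity, query_scenes):
        # Rank gallery indices by descending similarity to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # A scene-level hit: any top-k item shares the query's scene label.
        if any(gallery_scenes[j] == scene for j in ranked[:k]):
            hits += 1
    return hits / len(query_scenes)


# Toy data (hypothetical): 2 text queries scored against 3 images.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]
query_scenes = ["airport", "beach"]
gallery_scenes = ["airport", "harbor", "beach"]

print(scene_recall_at_k(sim, query_scenes, gallery_scenes, k=1))  # 0.5
print(scene_recall_at_k(sim, query_scenes, gallery_scenes, k=2))  # 1.0
```

Under this assumed rule, a retrieval that returns the wrong image from the right scene still counts as a hit, which is one way a scene-level score can differ from instance-level recall.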


    Published In

    ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
    June 2023
    694 pages
    ISBN: 9798400701788
    DOI: 10.1145/3591106

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Cross-Modal Retrieval
    2. Remote Sensing
    3. Scene Perception

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Zhejiang Provincial Natural Science Foundation of China
    • Natural Science Foundation of China

    Conference

    ICMR '23

    Acceptance Rates

    Overall Acceptance Rate 254 of 830 submissions, 31%

    Article Metrics

    • Downloads (last 12 months): 201
    • Downloads (last 6 weeks): 27
    Reflects downloads up to 01 Jan 2025

    Cited By

    • (2024) Selection and Reconstruction of Key Locals: A Novel Specific Domain Image-Text Retrieval Method. Proceedings of the 32nd ACM International Conference on Multimedia, 5653-5662. DOI: 10.1145/3664647.3681421. Online publication date: 28-Oct-2024
    • (2024) Accurate and Lightweight Learning for Specific Domain Image-Text Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 9719-9728. DOI: 10.1145/3664647.3681280. Online publication date: 28-Oct-2024
    • (2024) Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning. Proceedings of the 32nd ACM International Conference on Multimedia, 1662-1671. DOI: 10.1145/3664647.3681270. Online publication date: 28-Oct-2024
    • (2024) Heterogeneous Graph Guided Contrastive Learning for Spatially Resolved Transcriptomics Data. Proceedings of the 32nd ACM International Conference on Multimedia, 8287-8295. DOI: 10.1145/3664647.3680941. Online publication date: 28-Oct-2024
    • (2024) UrbanCross: Enhancing Satellite Image-Text Retrieval with Cross-Domain Adaptation. Proceedings of the 32nd ACM International Conference on Multimedia, 6307-6315. DOI: 10.1145/3664647.3680604. Online publication date: 28-Oct-2024
    • (2024) Thread the Needle: Cues-Driven Multiassociation for Remote Sensing Cross-Modal Retrieval. IEEE Transactions on Geoscience and Remote Sensing 62, 1-13. DOI: 10.1109/TGRS.2024.3509639. Online publication date: 2024
    • (2024) Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing 62, 1-17. DOI: 10.1109/TGRS.2024.3496898. Online publication date: 2024
    • (2024) Knowledge-Aware Text–Image Retrieval for Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 62, 1-13. DOI: 10.1109/TGRS.2024.3486977. Online publication date: 2024
    • (2024) Remote Sensing Image-Text Retrieval With Implicit-Explicit Relation Reasoning. IEEE Transactions on Geoscience and Remote Sensing 62, 1-11. DOI: 10.1109/TGRS.2024.3466909. Online publication date: 2024
    • (2024) Visual Global-Salient-Guided Network for Remote Sensing Image-Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing 62, 1-14. DOI: 10.1109/TGRS.2024.3466389. Online publication date: 2024
