DOI: 10.1145/3678717.3691318
Research article | Open access

Multilingual Vision-Language Pre-training for the Remote Sensing Domain

Published: 22 November 2024

Abstract

Methods based on Contrastive Language-Image Pre-training (CLIP) are now extensively used to support vision-and-language tasks involving remote sensing data, such as cross-modal retrieval. Adaptation of CLIP to this specific domain has relied on model fine-tuning with the standard contrastive objective, using existing human-labeled image-caption datasets or synthetic image-caption pairs derived from other annotations over remote sensing images (e.g., object classes). The use of different pre-training mechanisms has received less attention, and only a few exceptions have considered multilingual inputs. This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model and testing a self-supervised method, based on aligning local and global representations from individual input images, in combination with the standard CLIP objective. Model training relied on assembling pre-existing datasets of remote sensing images paired with English captions, followed by automated machine translation into nine additional languages. We show that the translated data is indeed helpful, e.g., also improving performance on English. Our resulting model, named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results on a variety of vision-and-language tasks, including cross-modal and multilingual image-text retrieval and zero-shot image classification.
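For readers unfamiliar with the "standard contrastive objective" mentioned above, the following PyTorch sketch shows the symmetric InfoNCE loss that CLIP-style training minimizes; the function name, tensor shapes, and temperature value are illustrative assumptions, not details from the paper's released code.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so that dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; matched pairs lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast each image against all captions in the batch, and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

With multilingual training data, the caption side of each pair can come from any of the ten languages, so the same objective aligns images with text across languages. Zero-shot classification and cross-modal retrieval then reuse this similarity at inference time. A minimal sketch with the OpenCLIP library follows, assuming a multilingual XLM-RoBERTa text tower; the model and checkpoint names, prompts, and file path are assumptions for illustration, not the released RS-M-CLIP weights.

import torch
import open_clip
from PIL import Image

# Assumed multilingual CLIP checkpoint from the OpenCLIP catalogue.
model, _, preprocess = open_clip.create_model_and_transforms(
    "xlm-roberta-base-ViT-B-32", pretrained="laion5b_s13b_b90k")
tokenizer = open_clip.get_tokenizer("xlm-roberta-base-ViT-B-32")

image = preprocess(Image.open("scene.png")).unsqueeze(0)  # hypothetical file
# Class prompts can be written in different languages.
prompts = tokenizer([
    "a satellite photo of a forest",
    "une photo satellite d'un aéroport",
    "eine Satellitenaufnahme eines Hafens",
])

with torch.no_grad():
    img = torch.nn.functional.normalize(model.encode_image(image), dim=-1)
    txt = torch.nn.functional.normalize(model.encode_text(prompts), dim=-1)
    probs = (100.0 * img @ txt.t()).softmax(dim=-1)  # zero-shot class scores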

Published In

SIGSPATIAL '24: Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems
October 2024
743 pages
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. Contrastive Language-Image Pre-training
  2. Cross-Modal Retrieval
  3. Remote Sensing
  4. Self-Supervised Pre-training
  5. Vision and Language

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Fundação para a Ciência e Tecnologia
  • Center for Responsible AI

Conference

SIGSPATIAL '24

Acceptance Rates

SIGSPATIAL '24 paper acceptance rate: 37 of 122 submissions (30%)
Overall acceptance rate: 257 of 1,238 submissions (21%)
