
Boosting Vision-and-Language Navigation with Direction Guiding and Backtracing

Published: 05 January 2023

Abstract

Vision-and-Language Navigation (VLN) is an emerging and fast-developing research topic in which an embodied agent must navigate a real-world environment by following natural language instructions. In this article, we present a Direction-guided Navigator Agent (DNA) that integrates direction clues derived from the instructions into the standard encoder-decoder navigation framework. Specifically, DNA couples the instruction encoder with an additional direction branch that sequentially encodes the direction clues in the instructions to boost navigation. Furthermore, an Instruction Flipping mechanism is devised to enable fast data augmentation as well as follow-up backtracing, allowing the agent to navigate in the backward direction. This design amplifies the grounding of the instruction in the local visual scenes along both forward and backward directions, thereby strengthening the alignment between the instruction and the action sequence. Extensive experiments on the Room-to-Room (R2R) dataset validate our proposal and demonstrate quantitatively compelling results.
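
Two components of the abstract are easy to picture with a toy example: a direction branch that consumes the ordered direction clues found in an instruction, and an Instruction Flipping step that reverses an instruction so the same path can be supervised in the backward direction. The sketch below is only a rough illustration of those two ideas under an assumed tokenization, a hypothetical direction vocabulary, and a hypothetical antonym table; it is not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of (1) extracting direction clues for a
# separate direction branch and (2) flipping an instruction for backtracing.
# DIRECTION_WORDS, FLIP_MAP, and the clause-level flipping rule are assumptions
# made purely for illustration.

DIRECTION_WORDS = {"left", "right", "forward", "straight", "up", "down", "around"}

# Hypothetical antonym table used when reversing an instruction.
FLIP_MAP = {"left": "right", "right": "left", "up": "down", "down": "up"}


def extract_direction_clues(instruction: str) -> list[str]:
    """Return the ordered sequence of direction words found in the instruction."""
    tokens = instruction.lower().replace(",", " ").replace(".", " ").split()
    return [t for t in tokens if t in DIRECTION_WORDS]


def flip_instruction(instruction: str) -> str:
    """Reverse the clause order and swap direction words, approximating an
    instruction for walking the same path backward."""
    clauses = [c.strip() for c in instruction.lower().split(",") if c.strip()]
    flipped = []
    for clause in reversed(clauses):
        words = [FLIP_MAP.get(w, w) for w in clause.split()]
        flipped.append(" ".join(words))
    return ", ".join(flipped)


if __name__ == "__main__":
    instr = "Walk forward past the sofa, turn left at the stairs, go up and stop at the door"
    print(extract_direction_clues(instr))  # ['forward', 'left', 'up']
    print(flip_instruction(instr))         # clauses reversed, left/right and up/down swapped
```

In the paper's framing, the extracted clue sequence would feed the direction branch alongside the full instruction encoding, while the flipped instruction provides an extra training trajectory traversed in reverse; the string-level heuristics above stand in for whatever clue extraction and flipping procedure the authors actually use.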



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1
January 2023, 505 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3572858
Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 05 January 2023
      Online AM: 17 March 2022
      Accepted: 24 January 2022
      Revised: 06 July 2021
      Received: 18 October 2020
      Published in TOMM Volume 19, Issue 1


      Author Tags

      1. Vision-and-language navigation
      2. cross-modal matching

      Qualifiers

      • Research-article
      • Refereed

      Funding Sources

      • National Key R&D Program of China
      • NSF of China
