Article

Learning Disentanglement with Decoupled Labels for Vision-Language Navigation

Published: 23 October 2022

Abstract

Vision-and-Language Navigation (VLN) requires an agent to follow complex natural language instructions and perceive the visual environment for real-world navigation. Intuitively, we find that disentangling the instruction for each viewpoint along the agent's path is critical for accurate navigation. However, most methods rely on the whole complex instruction or on inaccurate sub-instructions, because accurate disentanglement is unavailable as intermediate supervision. To address this problem, we propose a new Disentanglement framework with Decoupled Labels (DDL) for VLN. First, we manually extend the benchmark dataset Room-to-Room with landmark- and action-aware labels to provide fine-grained information for each viewpoint. Furthermore, to enhance generalization, we propose a Decoupled Label Speaker module that generates pseudo-labels for augmented data and reinforcement training. To make full use of the proposed fine-grained labels, we design a Disentangled Decoding Module that guides discriminative feature extraction and facilitates the alignment of multiple modalities. To demonstrate the generality of the proposed method, we apply it to an LSTM-based model and two recent Transformer-based models. Extensive experiments on two VLN benchmarks (i.e., R2R and R4R) demonstrate the effectiveness of our approach, which achieves better performance than previous state-of-the-art methods.
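
As a rough illustration of the decoupled labels described above, the following minimal Python sketch (not the authors' implementation; all class, field, and function names are hypothetical) shows how landmark- and action-aware labels attached to each viewpoint of an R2R-style path might be represented and looked up as intermediate supervision targets for a decoding step.

```python
# Hypothetical sketch of per-viewpoint decoupled labels for VLN supervision.
# Names and structure are illustrative assumptions, not the paper's code.

from dataclasses import dataclass
from typing import List


@dataclass
class ViewpointLabel:
    """Fine-grained supervision for one viewpoint on the path."""
    landmark_phrase: str   # e.g. "the staircase"
    action_phrase: str     # e.g. "turn left"


@dataclass
class DecoupledInstruction:
    """A full instruction plus its per-viewpoint decomposition."""
    instruction: str
    viewpoint_labels: List[ViewpointLabel]


def supervision_targets(sample: DecoupledInstruction, step: int) -> ViewpointLabel:
    """Return the landmark/action targets used as intermediate supervision
    for the decoding branch at a given navigation step."""
    return sample.viewpoint_labels[step]


# Example usage with a made-up R2R-style instruction.
sample = DecoupledInstruction(
    instruction="Walk past the sofa, turn left at the staircase, and stop by the door.",
    viewpoint_labels=[
        ViewpointLabel("the sofa", "walk past"),
        ViewpointLabel("the staircase", "turn left"),
        ViewpointLabel("the door", "stop by"),
    ],
)
print(supervision_targets(sample, 1))
```

In this reading, the Decoupled Label Speaker would produce such per-viewpoint label pairs for augmented trajectories, while the Disentangled Decoding Module would consume them as step-wise targets alongside the full instruction.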



Published In

Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI
Oct 2022, 809 pages
ISBN: 978-3-031-20058-8
DOI: 10.1007/978-3-031-20059-5

Publisher

Springer-Verlag, Berlin, Heidelberg

Publication History

Published: 23 October 2022

Author Tags

1. Vision-and-Language Navigation
2. Disentanglement
3. Modular network
4. Imitation/Reinforcement learning
5. LSTM and Transformer

Qualifiers

• Article
