DOI: 10.1145/3664647.3681248

Embodied Contrastive Learning with Geometric Consistency and Behavioral Awareness for Object Navigation

Published: 28 October 2024

Abstract

Object Navigation (ObjectNav), which requires an agent to seek any instance of an object category specified by a semantic label, has advanced rapidly. However, current agents are built upon occlusion-prone visual observations or compressed 2D semantic maps, which hinder their embodied perception of 3D scene geometry and easily lead to ambiguous object localization and blind exploration. To address these limitations, we present an Embodied Contrastive Learning (ECL) method with Geometric Consistency (GC) and Behavioral Awareness (BA), which motivates agents to actively encode 3D scene layouts and semantic cues. Driven by our embodied exploration strategy, BA is modeled by predicting navigational actions from multi-frame visual observations, since the behaviors that cause differences between adjacent visual sensations are crucial for learning correlations among continuous views. GC is modeled as the alignment of behavior-aware visual stimuli with 3D semantic shapes through unsupervised contrastive learning. The aligned behavior-aware visual features and geometric invariance priors are injected into a modular ObjectNav framework to enhance object recognition and exploration. Our ECL method performs well on object detection and instance segmentation tasks, and our ObjectNav strategy outperforms state-of-the-art methods on the MP3D and Gibson datasets, demonstrating the potential of ECL for embodied navigation.
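
To make the two self-supervised signals in the abstract concrete, the sketch below illustrates one plausible way to set them up in PyTorch: an action-prediction head for Behavioral Awareness and a symmetric InfoNCE-style contrastive loss aligning behavior-aware visual embeddings with 3D shape embeddings for Geometric Consistency. All module names, feature dimensions, the number of actions, and the choice of objective are illustrative assumptions; this is a minimal sketch, not the authors' released implementation.

```python
# Minimal sketch of the BA and GC objectives described in the abstract (assumptions:
# hypothetical encoders produce paired (B, D) embeddings; the action space has 4 moves).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BehaviorAwareHead(nn.Module):
    """Behavioral Awareness (BA): predict the action that transformed frame t into frame t+1."""

    def __init__(self, feat_dim: int = 512, num_actions: int = 4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_actions),
        )

    def forward(self, feat_t: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
        # Concatenate features of adjacent frames and classify the intervening action.
        return self.classifier(torch.cat([feat_t, feat_t1], dim=-1))


def geometric_consistency_loss(vis_emb: torch.Tensor,
                               geo_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Geometric Consistency (GC): InfoNCE-style alignment of behavior-aware visual
    embeddings with embeddings of the corresponding 3D semantic shapes.

    vis_emb, geo_emb: (B, D) paired embeddings; row i of both tensors comes from the
    same scene region, so diagonal entries of the similarity matrix are positives.
    """
    vis = F.normalize(vis_emb, dim=-1)
    geo = F.normalize(geo_emb, dim=-1)
    logits = vis @ geo.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(vis.size(0), device=vis.device)
    # Symmetric contrastive loss: image-to-shape and shape-to-image.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, D = 8, 512
    feat_t, feat_t1 = torch.randn(B, D), torch.randn(B, D)   # adjacent-frame features
    geo = torch.randn(B, D)                                   # 3D shape features
    ba_head = BehaviorAwareHead(D, num_actions=4)
    actions = torch.randint(0, 4, (B,))                       # ground-truth actions
    ba_loss = F.cross_entropy(ba_head(feat_t, feat_t1), actions)
    gc_loss = geometric_consistency_loss(feat_t1, geo)
    print(float(ba_loss + gc_loss))
```

In the paper, the features pre-trained with these signals are then injected into a modular ObjectNav pipeline; the sketch above only covers the self-supervised pre-training objectives.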





      Published In

      MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
      October 2024
      11719 pages
      ISBN:9798400706868
      DOI:10.1145/3664647

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 28 October 2024


      Author Tags

      1. behavioral awareness
      2. contrastive representation learning
      3. embodied ai
      4. geometric consistency
      5. object navigation

      Qualifiers

      • Research-article

      Conference

      MM '24: The 32nd ACM International Conference on Multimedia
      October 28 - November 1, 2024
      Melbourne, VIC, Australia

      Acceptance Rates

      MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
      Overall acceptance rate: 2,145 of 8,556 submissions (25%)

