research-article

Foundations of spatial perception for robotics: Hierarchical representations and real-time systems

Published: 16 October 2024

Abstract

3D spatial perception is the problem of building and maintaining an actionable and persistent representation of the environment in real time using sensor data and prior knowledge. Despite the fast-paced progress in robot perception, most existing methods build either purely geometric maps (as in traditional SLAM) or “flat” metric-semantic maps that do not scale to large environments or large dictionaries of semantic labels. The first part of this paper is concerned with representations: we show that scalable representations for spatial perception need to be hierarchical in nature. Hierarchical representations are efficient to store and lead to layered graphs with small treewidth, which enable provably efficient inference. We then introduce an example of a hierarchical representation for indoor environments, namely a 3D scene graph, and discuss its structure and properties. The second part of the paper focuses on algorithms to incrementally construct a 3D scene graph as the robot explores the environment. Our algorithms combine 3D geometry (e.g., to cluster the free space into a graph of places), topology (to cluster the places into rooms), and geometric deep learning (e.g., to classify the type of rooms the robot is moving across). The third part of the paper focuses on algorithms to maintain and correct 3D scene graphs during long-term operation. We propose hierarchical descriptors for loop closure detection and describe how to correct a scene graph in response to loop closures by solving a 3D scene graph optimization problem. We conclude the paper by combining the proposed perception algorithms into Hydra, a real-time spatial perception system that builds a 3D scene graph from visual-inertial data. We showcase Hydra’s performance in photo-realistic simulations and on real data collected by a Clearpath Jackal robot and a Unitree A1 robot. We release an open-source implementation of Hydra at https://github.com/MIT-SPARK/Hydra.
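To make the notion of a layered graph concrete, the following is a minimal Python sketch of a hierarchical graph in which each node belongs to a layer, edges within a layer connect siblings, and each node has at most one parent in the layer directly above. It is an illustration under stated assumptions, not Hydra's data structure or API: the class `LayeredGraph`, its methods, the layer indices, and the example node names (`mug`, `place_0`, `kitchen`, `building`) are all hypothetical. Only the places and rooms layers are named in the abstract; the object and building layers are assumed here for the sake of the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LayeredGraph:
    """Toy hierarchical graph (hypothetical, for illustration only)."""

    layer_of: dict = field(default_factory=dict)   # node id -> layer index
    parent_of: dict = field(default_factory=dict)  # node id -> parent one layer up
    siblings: dict = field(default_factory=lambda: defaultdict(set))  # intra-layer edges

    def add_node(self, node, layer, parent=None):
        # A parent, if given, must already exist exactly one layer above.
        if parent is not None and self.layer_of[parent] != layer + 1:
            raise ValueError("parent must sit exactly one layer above its child")
        self.layer_of[node] = layer
        if parent is not None:
            self.parent_of[node] = parent

    def add_edge(self, a, b):
        # Intra-layer edges only connect nodes that share a layer.
        if self.layer_of[a] != self.layer_of[b]:
            raise ValueError("intra-layer edges must connect nodes in the same layer")
        self.siblings[a].add(b)
        self.siblings[b].add(a)

    def ancestors(self, node):
        # Walk parent links upward, returning the chain from nearest to farthest.
        chain = []
        while node in self.parent_of:
            node = self.parent_of[node]
            chain.append(node)
        return chain


def build_example():
    # Assumed layers: 0 = objects, 1 = places, 2 = rooms, 3 = building.
    g = LayeredGraph()
    g.add_node("building", layer=3)
    g.add_node("kitchen", 2, "building")
    g.add_node("hallway", 2, "building")
    g.add_edge("kitchen", "hallway")
    g.add_node("place_0", 1, "kitchen")
    g.add_node("place_1", 1, "hallway")
    g.add_edge("place_0", "place_1")
    g.add_node("mug", 0, "place_0")
    return g
```

In this sketch, a query such as "which room contains the mug?" follows the parent links upward instead of searching a flat map, which is one informal way to see why a hierarchy can keep queries local.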

References

[1]
Aditya S, Yang Y, and Baral C, et al. (2018) Image understanding using vision and reasoning through scene description graph. Computer Vision and Image Understanding 173: 33–45.
[2]
Agia C, Jatavallabhula KM, and Khodeir M, et al. (2022) Taskography: evaluating robot task planning over large 3D scene graphs. Conference on Robot Learning (CoRL), Auckland, New Zealand, 14–18 December 2022.
[3]
Aktas ME, Akbas E, and Fatmaoui AE (2019) Persistence homology of networks: methods and applications. Applied Network Science 4(1): 28–61.
[4]
Ali D, Asaad A, and Jimenez MJ, et al. (2022) A Survey of Vectorization Methods in Topological Data Analysis. arXiv preprint arXiv:2212.09703.
[5]
Amodeo F, Caballero F, and Díaz-Rodríguez N, et al. (2022) Og-sgg: ontology-guided scene graph generation—a case study in transfer learning for telepresence robotics. IEEE Access 10: 132564–132583.
[6]
Anderson P, Fernando B, and Johnson M, et al. (2016) Spice: semantic propositional image caption evaluation. European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–13 October 2016.
[7]
Antonante P, Tzoumas V, and Yang H, et al. (2021) Outlier-robust estimation: hardness, minimally tuned algorithms, and applications. IEEE Transactions on Robotics 38(1): 281–301.
[8]
Arandjelovic R, Gronat P, and Torii A, et al. (2016) NetVLAD: CNN architecture for weakly supervised place recognition IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
[9]
Armeni I, He Z, and Gwak J, et al. (2019) 3D scene graph: a structure for unified semantics, 3D space, and camera International Conference on Computer Vision (ICCV), Seoul, Korea, 2 November 2019.
[10]
Auer S, Bizer C, and Kobilarov G, et al. (2007) DBpedia: a nucleus for a web of open data. Semantic Web. Berlin, Germany: Springer.
[11]
Bavle H, Sanchez-Lopez JL, and Shaheer M, et al. (2022a) S-graphs+: real-time localization and mapping leveraging hierarchical representations. arXiv preprint arXiv:2212.11770.
[12]
Bavle H, Sanchez-Lopez JL, and Shaheer M, et al. (2022b) Situational graphs for robot navigation in structured indoor environments. IEEE Robotics and Automation Letters 7(4): 9107–9114.
[13]
Becker A and Geiger D (1996) A sufficiently fast algorithm for finding close to optimal junction trees. Conference on Uncertainty in Artificial Intelligence (UAI), Portland, OR, 1–4 August 1996.
[14]
Beeson P, Modayil J, and Kuipers B (2010) Factoring the mapping problem: mobile robot map-building in the hybrid spatial semantic hierarchy. The International Journal of Robotics Research 29(4): 428–459.
[15]
Beetz M, Beßler D, and Haidu A, et al. (2018) KnowRob 2.0—a 2nd generation knowledge processing framework for cognition-enabled robotic agents. 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018.
[16]
Behley J, Garbade M, and Milioto A, et al. (2019) SemanticKITTI: a dataset for semantic scene understanding of LiDAR sequences. International Conference on Computer Vision (ICCV), Seoul, Korea, 2 November 2019.
[17]
Berg M, Konidaris G, and Tellex S (2022) Using language to generate state abstractions for long-range planning in outdoor environments. In: IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022.
[18]
Blanco JL and Rai PK (2014) Nanoflann: a C++ header-only fork of FLANN. A library for nearest neighbor (NN) with kd-trees. https://github.com/jlblancoc/nanoflann
[19]
Bodlaender HL (1988) Dynamic programming on graphs with bounded treewidth Automata, Languages and Programming 3: 105–118.
[20]
Bodlaender HL (2006) Treewidth: characterizations, applications, and computations. Graph-Theoretic Concepts in Computer Science. Berlin, Germany: Springer Berlin Heidelberg, 1–14.
[21]
Bodlaender HL and Koster AM (2010) Treewidth computations I: upper bounds. Information and Computation 208(3): 259–275.
[22]
Bodlaender HL and Koster AM (2011) Treewidth computations II: lower bounds. Information and Computation 209(7): 1103–1119.
[23]
Bollacker K, Evans C, and Paritosh P, et al. (2008) Freebase: a collaboratively created graph database for structuring human knowledge. Proceedings of the ACM SIGMOD International Conference on Management of Data, Houston, TX, USA, 10–15 June 2008.
[24]
Bormann R, Jordan F, and Li W, et al. (2016) Room segmentation: survey, implementation, and analysis. 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016.
[25]
Borst WN (1997) Construction of Engineering Ontologies for Knowledge Sharing and Reuse. PhD Dissertation. Enschede, Netherlands: University of Twente.
[26]
Bowman S, Atanasov N, and Daniilidis K, et al. (2017) Probabilistic data association for semantic SLAM. IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May 2017.
[27]
Bronstein MM, Bruna J, and LeCun Y, et al. (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34(4): 18–42.
[28]
Busbridge D, Sherburn D, and Cavallo P, et al. (2019) Relational Graph Attention Networks. arXiv preprint arXiv:1904.05811.
[29]
Cadena C, Carlone L, and CarrilloLatif HY, et al. (2016) Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. IEEE Transactions on Robotics 32(6): 1309–1332.
[30]
Carlson A, Betteridge J, and Kisiel B, et al. (2010) Toward an architecture for never-ending language learning. Proceedings of the AAAI Conference on Artificial Intelligence 24: 1306–1313.
[31]
Chandrasekaran V, Srebro N, and Harsha P (2008) Complexity of inference in graphical models. Conference on Uncertainty in Artificial Intelligence (UAI), Helsinki, Finland, 9–12 July 2008.
[32]
Chang A, Dai A, and Funkhouser T, et al. (2017) Matterport3d: learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017.
[33]
Chang Y, Ebadi K, and Denniston C, et al. (2022) Lamp 2.0: a robust multi-robot SLAM system for operation in challenging large-scale underground environments. IEEE Robotics and Automation Letters 7(4): 9175–9182.
[34]
Chang X, Ren P, and Xu P, et al. (2023a) A comprehensive survey of scene graphs: generation and application. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(1): 1–26.
[35]
Chang Y, Ballotta L, and Carlone L, et al. (2023b) D-Lite: navigation-oriented compression of 3D scene graphs for multi-robot collaboration. IEEE Robotics and Automation Letters 8: 7527–7534.
[36]
Chatila R and Laumond JP (1985) Position referencing and consistent world modeling for mobile robots. IEEE International Conference on Robotics and Automation (ICRA), St. Louis, Missouri, USA, 25–28 March 1985.
[37]
Chen H, Tan H, and Kuntz A, et al. (2020) Enabling robots to understand incomplete natural language instructions using commonsense reasoning. In: IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 August 2020.
[38]
Chen W, Hu S, and Talak R, et al. (2022) Leveraging Large (Visual) Language Models for Robot 3d Scene Understanding. arXiv preprint: 2209.05629.
[39]
Chen Z, Rezayi S, and Li S (2023) More knowledge, less bias: unbiasing scene graph generation with explicit ontological adjustment. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023.
[40]
Choset H and Nagatani K (2001) Topological simultaneous localization and mapping (SLAM): toward exact localization without explicit localization. IEEE Transactions on Robotics and Automation 17(2): 125–137.
[41]
Chua J (2018) Probabilistic Scene Grammars: A General-Purpose Framework for Scene Understanding. Providence, RI: Brown University Thesis, 1–146.
[42]
Cooper G (1990) The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence 42(2-3): 393–405.
[43]
Czarnowski J, Laidlow T, and Clark R, et al. (2020) DeepFactors: real-time probabilistic dense monocular SLAM. IEEE Robotics and Automation Letters 5(2): 721–728.
[44]
Dai A, Nießner M, and Zollhöfer M, et al. (2017) Bundlefusion: real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics 36(4): 1.
[45]
Daruna A, Nair L, and Liu W, et al. (2021) Towards robust one-shot task execution using knowledge graph embeddings. IEEE International Conference on Robotics and Automation (ICRA). Yokohama, Japan, 5 June 2021.
[46]
Davison AJ (2018) FutureMapping: The Computational Structure of Spatial AI Systems. arXiv preprint arXiv:1803.11288.
[47]
Dechter R and Mateescu R (2007) AND/OR search spaces for graphical models. Artificial Intelligence 171(2-3): 73–106.
[48]
Defferrard M, Bresson X, and Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems 29: 3844–3852.
[49]
Diab M, Akbari A, and Ud Din M, et al. (2019) PMK—a knowledge processing framework for autonomous robotics perception and manipulation. Sensors 19(5): 1166.
[50]
Ding Y, Yu J, and Liu B, et al. (2022) MuKEA: multimodal knowledge extraction and accumulation for knowledge-based visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18 June 2022.
[51]
Dong J, Fei X, and Soatto S (2017) Visual-Inertial-Semantic scene representation for 3D object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21 July 2017.
[52]
Feder T and Vardi M (1993) Monotone monadic snp and constraint satisfaction ACM Symp. On Theory of Computing (STOC). New York, NY, USA: ACM Press, 612–622.
[53]
Fey M and Lenssen JE (2019) Fast graph representation learning with PyTorch Geometric. International Conference on Learning Representations (ICLR) Workshop on Representation Learning on Graphs and Manifolds, Eindhoven, The Netherlands, 6 March 2019.
[54]
Fodor JA and Pylyshyn ZW (1988) Connectionism and cognitive architecture: a critical analysis. Cognition 28(1): 3–71.
[55]
Foskey M, Lin MC, and Manocha D (2003) Efficient computation of a simplified medial axis. Journal of Computing and Information Science in Engineering 3(4): 274–284.
[56]
Friedman S, Pasula H, and Fox D (2007) Voronoi random fields: extracting the topological structure of indoor environments via place labeling. International Joint Conference On AI (IJCAI). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2109–2114.
[57]
Furukawa Y, Curless B, and Seitz SM, et al. (2009) Reconstructing building interiors from images. International Conference on Computer Vision (ICCV), Kyoto, Japan, 2 October 2009.
[58]
Galindo C, Saffiotti A, and Coradeschi S, et al. (2005) Multi-hierarchical semantic maps for mobile robotics. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Edmonton, AB, Canada, 2–6 August 2005.
[59]
Gálvez-López D and Tardós JD (2012) Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics 28(5): 1188–1197.
[60]
Garcia-Garcia A, Orts-Escolano S, and Oprea S, et al. (2017) A Review on Deep Learning Techniques Applied to Semantic Segmentation. ArXiv Preprint: 1704.06857.
[61]
Gardères F, Ziaeefard M, and Abeloos B, et al. (2020) ConceptBert: concept-aware representation for visual question answering. Conference on Empirical Methods in Natural Language Processing, 16–20 November 2020.
[62]
Garg S and Milford M (2021) Seqnet: learning descriptors for sequence-based hierarchical place recognition. IEEE Robotics and Automation Letters 6(3): 4305–4312.
[63]
Gawel A, Don CD, and Siegwart R, et al. (2018) X-view: graph-based semantic multi-view localization. IEEE Robotics and Automation Letters 3(3): 1687–1694.
[64]
Gay P, Stuart J, and Del Bue A (2018) Visual graphs from motion (VGfM): scene understanding with object geometry reasoning. Asian Conference On Computer Vision (ACCV). Berlin, Germany: Springer International Publishing, 330–346.
[65]
Geman S, Potter DF, and Chi Z (2002) Composition systems. Quarterly of Applied Mathematics 60(4): 707–736.
[66]
Genesereth MR and Nilsson NJ (2012) Logical Foundations of Artificial Intelligence. Burlington, Massachusetts: Morgan Kaufmann.
[67]
Gothoskar N, Cusumano-Towner M, and Zinberg B, et al. (2021) 3DP3: 3D Scene Perception via Probabilistic Programming. ArXiv preprint: 2111.00312.
[68]
Grinvald M, Furrer F, and Novkovic T, et al. (2019) Volumetric instance-aware semantic mapping and 3D object discovery. IEEE Robotics and Automation Letters 4(3): 3037–3044.
[69]
Grohe M, Neuen D, and Schweitzer P, et al. (2020) An improved isomorphism test for bounded-tree-width graphs. ACM Transactions on Algorithms 16(3): 1–31.
[70]
Gruber TR (1995) Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human-Computer Studies 43(5-6): 907–928.
[71]
Guarino N, Oberle D, and Staab S (2009) What is an ontology? Handbook on ontologies 1: 1–17.
[72]
Guo Y, Gao L, and Wang X, et al. (2021) From general to specific: informative scene graph generation via balance adjustment. International Conference on Computer Vision (ICCV), Montreal, Canada, 17 October 2021.
[73]
Ha H and Song S (2022) Semantic abstraction: open-world 3d scene understanding from 2d vision-language models. 6th Annual Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022.
[74]
Hamilton WL, Ying R, and Leskovec J (2017) Inductive representation learning on large graphs. Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017.
[75]
Hao J, Chen M, and Yu W, et al. (2019) Universal representation learning of knowledge bases by jointly embedding instances and ontological concepts. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019.
[76]
Harnad S (1990) The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1): 335–346.
[77]
Henaff M, Bruna J, and LeCun Y (2015) Deep Convolutional Networks on Graph-Structured Data. arXiv preprint arXiv:1506.05163.
[78]
Huber S (2021) Persistent homology in data science. Data Science–Analytics and Applications: Proceedings of the 3rd International Data Science Conference–iDSC2020. Berlin, Germany: Springer, 81–88.
[79]
Hughes N, Chang Y, and Carlone L (2022) Hydra: a real-time spatial perception engine for 3D scene graph construction and optimization. Robotics: science and systems (RSS), New York City, 27 June 2022.
[80]
Ichien N, Liu Q, and Fu S, et al. (2021) Visual analogy: deep learning versus compositional models. Annual Meeting of the Cognitive Science Society 43.
[81]
Izatt G and Tedrake R (2020) Generative modeling of environments with scene grammars and variational inference. 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 01 June 2020.
[82]
Izatt G and Tedrake R (2021) Scene understanding and distribution modeling with mixed-integer scene parsing. Cambridge, MA: Massachusetts Institute of Technology.
[83]
Jain J, Li J, and Chiu M, et al. (2023) OneFormer: one transformer to rule universal image segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17 June 2023.
[84]
James S, Rosman B, and Konidaris G (2020) Learning portable representations for high-level planning. International Conference on Machine Learning (ICML), Vienna, Austria, 18 Jul 2020.
[85]
James S, Rosman B, and Konidaris G (2022) Autonomous learning of object-centric abstractions for high-level planning. International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 29 April 2022.
[86]
Jatavallabhula KM, Kuwajerwala A, and Gu Q, et al. (2023) ConceptFusion: Open-Set Multimodal 3D Mapping. arXiv: 2302.07241.
[87]
Jensen FV and Jensen F (1994) Optimal junction trees. Conference on Uncertainty in Artificial Intelligence (UAI), Seattle, Washington, USA, 29–31 July 1994.
[88]
Jepsen T (2009) Just what is an ontology, anyway? IT Professional 11(5): 22–27.
[89]
Jinnai Y, Abel D, and Hershkowitz D, et al. (2019) Finding options that minimize planning time. International Conference on Machine Learning (ICML), Long Beach, CA, USA, 15 June 2019.
[90]
Johnson J, Krishna R, and Stark M, et al. (2015) Image retrieval using scene graphs. IEEE Conference on Computer Vision And Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
[91]
Jordan M (2002) An Introduction to Probabilistic Graphical Models. Pittsburgh, PA: University of Pittsburgh.
[92]
Karpathy A and Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
[93]
Kim U, Park J, and Song T, et al. (2019) 3-D scene graph: a sparse and semantic representation of physical environments for intelligent agents. IEEE Transactions on Cybernetics 50(12): 1–13.
[94]
Kipf T and Welling M (2017) Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
[95]
Kleiner A, Baravalle R, and Kolling A, et al. (2017) A solution to room-by-room coverage for autonomous cleaning robots. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, Canada, 24–28 September 2017.
[96]
Koller D and Friedman N (2009) Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA: The MIT Press.
[97]
Kong X, Liu S, and Taher M, et al. (2023) vMAP: Vectorised Object Mapping for Neural Field SLAM. ArXiv, preprint:2302.01838.
[98]
Konidaris G (2019) On the necessity of abstraction. Current Opinion in Behavioral Sciences 29: 1–7.
[99]
Konidaris G, Kaelbling LP, and Lozano-Perez T (2018) From skills to symbols: learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research 61: 215–289.
[100]
Krishna R, Zhu Y, and Groth O, et al. (2016) Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arXiv preprints arXiv:1602.07332.
[101]
Kuipers B (1978) Modeling spatial knowledge. Cognitive Science 2: 129–153.
[102]
Kuipers B (2000) The spatial semantic hierarchy. Artificial Intelligence 119: 191–233.
[103]
Kwak JH, Lee J, and Whang JJ, et al. (2022) Semantic grasping via a knowledge graph of robotic manipulation: a graph representation learning approach. IEEE Robotics and Automation Letters 7(4): 9397–9404.
[104]
Lake BM, Ullman TD, and Tenenbaum JB, et al. (2017) Building machines that learn and think like people. Behavioral and Brain Sciences 40: e253.
[105]
Lau B, Sprunk C, and Burgard W (2013) Efficient grid-based spatial representations for robot navigation in dynamic environments. Robotics and Autonomous Systems 61(10): 1116–1130.
[106]
Lee J, Rossi R, and Kim S, et al. (2019) Attention models in graphs: a survey. ACM Transactions on Knowledge Discovery from Data 13(6): 1–25.
[107]
Lemaignan S, Ros R, and Mösenlechner L, et al. (2010) ORO, a knowledge management platform for cognitive architectures in robotics. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Taipei, Taiwan, 18–24 October 2010.
[108]
Lenat DB (1995) Cyc: a large-scale investment in knowledge infrastructure. Communications of the ACM 38(11): 33–38.
[109]
Li C, Xiao H, and Tateno K, et al. (2016) Incremental scene understanding on dense SLAM. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9 October 2016.
[110]
Li Y, Ouyang W, and Zhou B, et al. (2017) Scene graph generation from objects, phrases and region captions. International Conference on Computer Vision (ICCV), Venice, Italy, 29 October 2017.
[111]
Li Y, Gu C, and Dullien T, et al. (2019) Graph matching networks for learning the similarity of graph structured objects. International Conference on Machine Learning (ICML), Long Beach, CA, USA, 15 June 2019.
[112]
Lianos K, Schönberger J, and Pollefeys M, et al. (2018) Vso: visual semantic odometry. European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
[113]
Lin S, Wang J, and Xu M, et al. (2021) Topology aware object-level semantic mapping towards more robust loop closure. IEEE Robotics and Automation Letters 6(4): 7041–7048.
[114]
Liu C, Wu J, and Furukawa Y (2018) FloorNet: a unified framework for floorplan reconstruction from 3d scans. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
[115]
Liu Y, Petillot Y, and Lane D, et al. (2019) Global localization with object-level semantics and topology. 2019 International Conference on Robotics and Automation (ICRA), Montreal, Quebec, Canada, 24 May 2019.
[116]
Looper S, Rodriguez-Puigvert J, and Siegwart R, et al. (2023) 3D VSG: long-term semantic scene change prediction through 3D variable scene graphs. 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2 June 2023.
[117]
Lowry S, Sünderhauf N, and Newman P, et al. (2016) Visual place recognition: a survey. IEEE Transactions on Robotics 32(1): 1–19.
[118]
Lu C, Krishna R, and Bernstein M, et al. (2016) Visual relationship detection with language priors. European Conference on Computer Vision, Amsterdam, The Netherlands, 16 September 2016.
[119]
Lukierski R, Leutenegger S, and Davison AJ (2017) Room layout estimation from rapid omnidirectional exploration. IEEE International Conference on Robotics and Automation (ICRA), Singapore, 3 June 2017.
[120]
Maniu S, Senellart P, and Jog S (2019) An experimental study of the treewidth of real-world graph data. International Conference Database Theory, Edinburgh, UK, 26–29 March 2019.
[121]
Marino K, Chen X, and Parikh D, et al. (2021) KRISP: integrating implicit and symbolic knowledge for open-domain knowledge-based VQA. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Nashville, USA, 19–25 June 2021.
[122]
McCormac J, Handa A, and Davison AJ, et al. (2017) SemanticFusion: dense 3D semantic mapping with convolutional neural networks. IEEE International Conference on Robotics and Automation (ICRA), Singapore, 3 June 2017.
[123]
McCormac J, Clark R, and Bloesch M, et al. (2018) Fusion++: volumetric object-level SLAM. International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018.
[124]
McGuinness D and Van Harmelen F (2004) OWL Web Ontology Language Overview. Cambridge, MA: W3C recommendation.
[125]
Mhaskar HN (1996) Neural networks for optimal approximation of smooth and analytic functions. Neural Computation 8(1): 164–177.
[126]
Mhaskar HN and Poggio T (2016) Deep vs. shallow networks: an approximation theory perspective. Analysis and Applications 14(06): 829–848.
[127]
Mikolov T, Chen K, and Corrado G, et al. (2013) Efficient Estimation of Word Representations in Vector Space. arXiv preprints arXiv:1301.3781.
[128]
Milford M and Wyeth G (2012) Seqslam: visual route-based navigation for sunny summer days and stormy winter nights. IEEE International Conference on Robotics and Automation (ICRA), St Paul, Minnesota, USA, 14–18 May 2012.
[129]
Miller GA (1995) Wordnet: a lexical database for English. Communications of the ACM 38(11): 39–41.
[130]
Mo K, Guerrero P, and Yi L, et al. (2020) StructEdit: learning structural shape variations. IEEe Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
[131]
Movshovitz-Attias Y, Yu Q, and Stumpe MC, et al. (2015) Ontological supervision for fine grained classification of street view storefronts. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7 June 2015.
[132]
Narita G, Seno T, and Ishikawa T, et al. (2019) Panopticfusion: online volumetric semantic mapping at the level of stuff and things. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), The Venetian Macau, Macau, China, 4–8 September 2019.
[133]
Nicholson L, Milford M, and Sünderhauf N (2018) QuadricSLAM: dual quadrics from object detections as landmarks in object-oriented SLAM. IEEE Robotics and Automation Letters 4: 1–8.
[134]
Niemeyer M and Geiger A (2021) GIRAFFE: representing scenes as compositional generative neural feature fields. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021.
[135]
Nießner M, Zollhöfer M, and Izadi S, et al. (2013) Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics 32(6): 169.
[136]
Niles I and Pease A (2001) Towards a standard upper ontology. Proceedings of the International Conference on Formal Ontology in Information Systems, Ogunquit, Maine, USA, 17–19 October 2001.
[137]
Ok K, Liu K, and Roy N (2021) Hierarchical object map estimation for efficient and robust navigation. 2021 IEEE International Conference on Robotics and Automation (ICRA), Xian, China, 5 June 2021.
[138]
Oleynikova H, Taylor Z, and Fehr M, et al. (2017) Voxblox: incremental 3d euclidean signed distance fields for on-board mav planning. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, Canada, 24–28 September 2017.
[139]
Oleynikova H, Taylor Z, and Siegwart R, et al. (2018) Sparse 3D topological graphs for micro-aerial vehicle planning. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018.
[140]
Park J, Florence P, and Straub J, et al. (2019) DeepSDF: learning continuous signed distance functions for shape representation. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
[141]
Paszke A, Gross S, and Massa F, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32.
[142]
Poggio T, Mhaskar H, and Rosasco L, et al. (2017) Why and when can deep - but not shallow - networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing 14: 503–519.
[143]
Porello D, Cristani M, and Ferrario R (2015) Integrating ontologies and computer vision for classification of objects in images. Workshop on Neural Cognitive Integration 15.
[144]
Qi S, Zhu Y, and Huang S, et al. (2018) Human-centric indoor scene synthesis using stochastic grammar. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
[145]
Qin C, Zhang Y, and Liu Y, et al. (2021) Semantic loop closure detection based on graph matching in multi-objects scenes. Journal of Visual Communication and Image Representation 76: 103072.
[146]
Rana K, Haviland J, and Garg S, et al. (2023) SayPlan: grounding large language models using 3d scene graphs for scalable task planning. 7th Annual Conference on Robot Learning, Atlanta, USA, 18 January 2023.
[147]
Ravichandran Z, Peng L, and Hughes N, et al. (2022) Hierarchical representations and explicit memory: learning effective navigation policies on 3D scene graphs using graph neural networks. IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022.
[148]
Reijgwart V, Millane A, and Oleynikova H, et al. (2020) Voxgraph: globally consistent, volumetric mapping using signed distance function submaps. IEEE Robotics and Automation Letters 5: 227–234.
[149]
Reinke A, Palieri M, and Morrell B, et al. (2022) Locus 2.0: robust and computationally efficient lidar odometry for real-time 3D mapping. IEEE Robotics and Automation Letters 7(4): 9043–9050.
[150]
Ren M, Kiros R, and Zemel RS (2015) Image Question Answering: A Visual Semantic Embedding Model and a New Dataset. arXiv preprints arXiv:1505.02074.
[151]
Rosinol A, Abate M, and Chang Y, et al. (2020a) Kimera: an open-source library for real-time metric-semantic localization and mapping. IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 August 2020.
[152]
Rosinol A, Gupta A, and Abate M, et al. (2020b) 3D dynamic scene graphs: actionable spatial perception with places, objects, and humans. Robotics: Science and Systems (RSS), Daegu, Republic of Korea, 12–16 July 2020. https://news.mit.edu/2020/robots-spatial-perception-0715
[153]
Rosinol A, Violette A, and Abate M, et al. (2021) Kimera: from SLAM to spatial perception with 3D dynamic scene graphs. The International Journal of Robotics Research 40(12–14): 1510–1546.
[154]
Rosinol A, Leonard J, and Carlone L (2023) NeRF-SLAM: real-time dense monocular SLAM with neural radiance fields. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, Michigan, USA, 1–5 October 2023.
[155]
Rosu R, Quenzel J, and Behnke S (2019) Semi-supervised semantic mapping through label propagation with semantic texture meshes. International Journal of Computer Vision 128: 1220–1238.
[156]
Ruiz-Sarmiento JR, Galindo C, and Gonzalez-Jimenez J (2017) Building multiversal semantic maps for mobile robot operation. Knowledge-Based Systems 119: 257–272.
[157]
Rusu RB (2009) Semantic 3D Object Maps for Everyday Manipulation in Human Living Environments. PhD Thesis. Muenchen, Germany: Technische Universitaet.
[158]
Salas-Moreno RF, Newcombe RA, and Strasdat H, et al. (2013) SLAM++: simultaneous localisation and mapping at the level of objects. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013.
[159]
Sandler M, Howard A, and Zhu M, et al. (2018) Mobilenetv2: inverted residuals and linear bottlenecks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
[160]
Savva M, Kadian A, and Maksymets O, et al. (2019) Habitat: a platform for embodied AI research. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October 2019.
[161]
Schlenoff C, Prestes E, and Madhavan R, et al. (2012) An IEEE standard ontology for robotics and automation. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vilamoura, Algarve, Portugal, 7–12 October 2012.
[162]
Schmid L, Delmerico J, and Schönberger J, et al. (2021) Panoptic Multi-TSDFs: A Flexible Representation for Online Multi-Resolution Volumetric Mapping and Long-Term Dynamic Scene Consistency. arXiv preprint arXiv:2109.10165.
[163]
Schroff F, Kalenichenko D, and Philbin J (2015) FaceNet: a unified embedding for face recognition and clustering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
[164]
Schubert S, Neubert P, and Protzel P (2021) Fast and memory efficient graph optimization via ICM for visual place recognition. Proceedings of Robotics: Science and Systems (RSS), New York City, NY, USA, 12–16 July 2021.
[165]
Shan M, Feng Q, and Atanasov N (2020) Object residual constrained visual-inertial odometry. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, Nevada, USA, 25–29 October 2020.
[166]
Shi J, Talak R, and Maggio D, et al. (2023) A correct-and-certify approach to self-supervise object pose estimators via ensemble self-training. Robotics: Science and Systems (RSS), Daegu, Republic of Korea, 14 July 2023.
[167]
Smith B (2004) Beyond Concepts: Ontology as Reality Representation. Buffalo, NY: University at Buffalo.
[168]
Speer R, Chin J, and Havasi C (2017) ConceptNet 5.5: an open multilingual graph of general knowledge. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17. Washington, DC: AAAI Press, 4444–4451.
[169]
Stekovic S, Rad M, and Fraundorfer F, et al. (2021) MonteFloor: Extending MCTS for Reconstructing Accurate Large-Scale Floor Plans. Paris, France: Universite Paris-Est.
[170]
Stückler J and Behnke S (2014) Multi-resolution surfel maps for efficient dense 3D modeling and tracking. Journal of Visual Communication and Image Representation 25(1): 137–147.
[171]
Studer R, Benjamins V, and Fensel D (1998) Knowledge engineering: principles and methods. Data & Knowledge Engineering 25(1-2): 161–197.
[172]
Sucar E, Wada K, and Davison A (2020) NodeSLAM: neural object descriptors for multi-view shape reconstruction. 2020 International Conference on 3D Vision (3DV), Fukuoka, Japan, 25–28 November 2020.
[173]
Suchanek FM, Kasneci G, and Weikum G (2008) YAGO: a large ontology from Wikipedia and WordNet. Journal of Web Semantics 6(3): 203–217.
[174]
Sumner R, Schmid J, and Pauly M (2007) Embedded deformation for shape manipulation. ACM Transactions on Graphics 26(3): 80.
[175]
Tagliasacchi A, Delame T, and Spagnuolo M, et al. (2016) 3D skeletons: a state-of-the-art report. Computer Graphics Forum 35: 573–597.
[176]
Talak R, Hu S, and Peng L, et al. (2021) Neural trees for learning on graphs. Conference on Neural Information Processing Systems (NeurIPS), Canada, 6–14 December 2021.
[177]
Tateno K, Tombari F, and Navab N (2015) Real-time and scalable incremental segmentation on dense SLAM. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September 2015.
[178]
Tenorth M and Beetz M (2013) KnowRob: a knowledge processing infrastructure for cognition-enabled robots. The International Journal of Robotics Research 32(5): 566–590.
[179]
Thomas A and Green PJ (2009) Enumerating the junction trees of a decomposable graph. Journal of Computational and Graphical Statistics 18(4): 930–940.
[180]
Thrun S (2003) Robotic mapping: a survey. Exploring Artificial Intelligence in the New Millennium. Burlington, MA: Morgan Kaufmann, Inc., 1–35.
[181]
Tuli S, Bansal R, and Paul R, et al. (2022) ToolTango: common sense generalization in predicting sequential tool interactions for robot plan synthesis. Journal of Artificial Intelligence Research 75: 1595–1631.
[182]
Veličković P, Cucurull G, and Casanova A, et al. (2018) Graph attention networks. International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 3 May 2018.
[183]
Vrandečić D and Krötzsch M (2014) Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10): 78–85.
[184]
Wald J, Dhamo H, and Navab N, et al. (2020) Learning 3D semantic scene graphs from 3D indoor reconstructions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
[185]
Wang J, Sun K, and Cheng T, et al. (2021) Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(10): 3349–3364.
[186]
Wang W, Zhou T, and Qi S, et al. (2022) Hierarchical human semantic parsing with comprehensive part-relation modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(7): 3508–3522.
[187]
Webb T, Holyoak KJ, and Lu H (2022) Emergent analogical reasoning in large language models. Nature Human Behaviour 7: 1526.
[188]
Whelan T, McDonald JB, and Kaess M, et al. (2012) Kintinuous: spatially extended KinectFusion. RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras, Sydney, Australia, 12 July 2012.
[189]
Whelan T, Kaess M, and Johannsson H, et al. (2015) Real-time large-scale dense RGB-D SLAM with volumetric fusion. The International Journal of Robotics Research 34(4–5): 598–626.
[190]
Whelan T, Salas-Moreno R, and Glocker B, et al. (2016) ElasticFusion: real-time dense SLAM and light source estimation. The International Journal of Robotics Research 35(14).
[191]
Wu S, Wald J, and Tateno K, et al. (2021) SceneGraphFusion: incremental 3D scene graph prediction from RGB-D sequences. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021.
[192]
Xie S, Morcos AS, and Zhu SC, et al. (2022) COAT: measuring object compositionality in emergent representations. International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022.
[193]
Xu D, Zhu Y, and Choy CB, et al. (2017) Scene graph generation by iterative message passing. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
[194]
Xu B, Li W, and Tzoumanikas D, et al. (2019a) MID-Fusion: Octree-Based Object-Level Multi-Instance Dynamic SLAM. arXiv:1812.07976v4.
[195]
Xu K, Hu W, and Leskovec J, et al. (2019b) How powerful are graph neural networks? International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
[196]
Yang J, Lu J, and Lee S, et al. (2018) Graph R-CNN for scene graph generation. European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
[197]
Yang H, Shi J, and Carlone L (2020) TEASER: fast and certifiable point cloud registration. IEEE Transactions on Robotics 37(2): 314–333.
[198]
Yuan J, Chen T, and Li B, et al. (2023) Compositional Scene Representation Learning via Reconstruction: A Survey. arXiv:2202.07135v4.
[199]
Zareian A, Karaman S, and Chang SF (2020) Bridging knowledge graphs to generate scene graphs. European Conference on Computer Vision (ECCV). Berlin, Germany: Springer, 606–623.
[200]
Zellers R, Yatskar M, and Thomson S, et al. (2017) Neural motifs: scene graph parsing with global context. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
[201]
Zender H, Martínez Mozos O, and Jensfelt P, et al. (2008) Conceptual spatial representations for indoor mobile robots. Robotics and Autonomous Systems 56(6): 493–502.
[202]
Zeng M, Zhao F, and Zheng J, et al. (2013) Octree-based fusion for realtime 3D reconstruction. Graphical Models 75(3): 126–136.
[203]
Zheng T, Zhang G, and Han L, et al. (2020) Building fusion: semantic-aware structural building-scale 3D reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(5): 2328–2345.
[204]
Zheng W, Yin L, and Chen X, et al. (2021) Knowledge base graph embedding module design for visual question answering model. Pattern Recognition 120: 108153.
[205]
Zhou B, Zhao H, and Puig X, et al. (2017) Scene parsing through ADE20K dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
[206]
Zhu SC and Huang S (2021) Computer Vision: Stochastic Grammars for Parsing Objects, Scenes, and Events. Berlin, Germany: Springer.
[207]
Zhu SC and Mumford D (2006) A stochastic grammar of images. Foundations and Trends® in Computer Graphics and Vision 2(4): 259–362.
[208]
Zhu LL, Chen Y, and Yuille A (2011) Recursive compositional models for vision: description and review of recent work. Journal of Mathematical Imaging and Vision 41(1): 122–146.
[209]
Zhu G, Zhang L, and Jiang Y, et al. (2022) Scene Graph Generation: A Comprehensive Survey. arXiv preprint arXiv:2201.00443.

            Published In

            International Journal of Robotics Research  Volume 43, Issue 10
            Sep 2024
            204 pages
            This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).

            Publisher

            Sage Publications, Inc.

            United States

            Publication History

            Published: 16 October 2024

            Author Tags

            1. Simultaneous localization and mapping (SLAM)
            2. spatial perception
            3. 3D scene graphs
            4. graph learning
            5. computer vision

            Qualifiers

            • Research-article
