
Unlocking Robotic Autonomy: A Survey on the Applications of Foundation Models

  • Regular Papers
  • Invited Paper
International Journal of Control, Automation and Systems

Abstract

The advancement of foundation models, such as large language models (LLMs), vision-language models (VLMs), diffusion models, and robotics foundation models (RFMs), has opened a new paradigm in robotics by offering innovative approaches to the long-standing challenge of building robot autonomy. These models enable the development of robotic agents that can independently understand and reason about semantic contexts, plan actions, physically interact with their surroundings, and adapt to new environments and untrained tasks. This paper presents a comprehensive and systematic survey of recent advancements in applying foundation models to robot perception, planning, and control. It introduces the key concepts and terminology associated with foundation models, providing a clear understanding for researchers in robotics and control engineering. The relevant studies are categorized according to how foundation models are utilized in the various elements of robotic autonomy, focusing on 1) perception and situational awareness: object detection and classification, semantic understanding, mapping, and navigation; 2) decision making and task planning: mission understanding, task decomposition and coordination, planning with symbolic and learning-based approaches, plan validation and correction, and LLM-robot interaction; and 3) motion planning and control: motion planning, control command and reward generation, and trajectory generation and optimization with diffusion models. Furthermore, the survey covers essential environmental setups, including the real-world and simulation datasets and platforms used in training and validating these models. It concludes with a discussion of current challenges, such as robustness, explainability, data scarcity, and real-time performance, and highlights promising future directions, including retrieval-augmented generation, on-device foundation models, and explainability. This survey aims to systematically summarize the latest research trends in applying foundation models to robotics, bridging the gap between state-of-the-art artificial intelligence and robotics. By sharing knowledge and resources, it is expected to foster a new research paradigm for building generalized and autonomous robots.


References

  1. J. H. Lee, “Model predictive control: Review of the three decades of development,” International Journal of Control, Automation, and Systems, vol. 9, pp. 415–424, 2011.

    Article  Google Scholar 

  2. C. Jing, H. Shu, and Y. Song, “Model predictive control for integrated lateral stability and rollover prevention based on a multi-actuator control system,” International Journal of Control, Automation, and Systems, vol. 21, no. 5, pp. 1518–1537, 2023.

    Article  Google Scholar 

  3. Y. Zhang, S. Li, and L. Liao, “Near-optimal control of nonlinear dynamical systems: A brief survey,” Annual Reviews in Control, vol. 47, pp. 71–80, 2019.

    Article  MathSciNet  Google Scholar 

  4. K. Prag, M. Woolway, and T. Celik, “Toward data-driven optimal control: A systematic review of the landscape,” IEEE Access, vol. 10, pp. 32190–32212, 2022.

    Article  Google Scholar 

  5. Y.-Q. Jiang, S.-Q. Zhang, P. Khandelwal, and P. Stone, “Task planning in robotics: An empirical comparison of PDDL-and ASP-based systems,” Frontiers of Information Technology & Electronic Engineering, vol. 20, pp. 363–373, 2019.

    Article  Google Scholar 

  6. L. G. D. Véras, F. L. Medeiros, and L. N. Guimaráes, “Systematic literature review of sampling process in rapidly-exploring random trees,” IEEE Access, vol. 7, pp. 50933–50953, 2019.

    Article  Google Scholar 

  7. S. Lim and S. Jin, “Safe trajectory path planning algorithm based on RRT* while maintaining moderate margin from obstacles,” International Journal of Control, Automation, and Systems, vol. 21, no. 11, pp. 3540–3550, 2023.

    Article  Google Scholar 

  8. S. Aradi, “Survey of deep reinforcement learning for motion planning of autonomous vehicles,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 2, pp. 740–759, 2020.

    Article  Google Scholar 

  9. B. Singh, R. Kumar, and V. P. Singh, “Reinforcement learning in robotic applications: A comprehensive survey,” Artificial Intelligence Review, vol. 55, no. 2, pp. 945–990, 2022.

    Article  Google Scholar 

  10. X. Xiao, B. Liu, G. Warnell, and P. Stone, “Motion planning and control for mobile robot navigation using machine learning: A survey,” Autonomous Robots, vol. 46, no. 5, pp. 569–597, 2022.

    Article  Google Scholar 

  11. L. Le Mero, D. Yi, M. Dianati, and A. Mouzakitis, “A survey on imitation learning techniques for end-to-end autonomous vehicles,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 14128–14147, 2022.

    Article  Google Scholar 

  12. S. Choi, S. Kim, and H. Jin Kim, “Inverse reinforcement learning control for trajectory tracking of a multirotor UAV,” International Journal of Control, Automation, and Systems, vol. 15, pp. 1826–1834, 2017.

    Article  Google Scholar 

  13. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.

    Google Scholar 

  14. H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.

  15. R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “LaMDA: Language models for dialog applications,” arXiv preprint arXiv:2201.08239, 2022.

  16. R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “PaLM 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.

  17. A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818, 2023.

  18. M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes, P. Sermanet, N. Sievers, C. Tan, A. Toshev, V. Vanhoucke, F. Xia, T. Xiao, P. Xu, S. Xu, M. Yan, and A. Zeng, “Do as I can, not as I say: Grounding language in robotic affordances,” arXiv preprint arXiv:2204.01691, 2022.

  19. A. Prasad, A. Koller, M. Hartmann, P. Clark, A. Sabharwal, M. Bansal, and T. Khot, “ADaPT: As-needed decomposition and planning with language models,” arXiv preprint arXiv:2311.05772, 2023.

  20. I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, “Prog-Prompt: Generating situated robot task plans using large language models,” Proc. of 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, pp. 11523–11530, 2023.

  21. S. S. Kannan, V. L. N. Venkatesh, and B.-C. Min, “SMART-LLM: Smart multi-agent robot task planning using large language models,” arXiv preprint arXiv:2309.10062, 2023.

  22. Z. Liu, W. Yao, J. Zhang, L. Yang, Z. Liu, J. Tan, P. K. Choubey, T. Lan, J. Wu, H. Wang et al., “AgentLite: A lightweight library for building and advancing task-oriented LLM agent system,” arXiv preprint arXiv:2402.15538, 2024.

  23. C. Lynch, A. Wahid, J. Tompson, T. Ding, J. Betker, R. Baruch, T. Armstrong, and P. Florence, “Interactive language: Talking to robots in real time,” IEEE Robotics and Automation Letters, 2023.

  24. C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y. Su, “LLM-Planner: Few-shot grounded planning for embodied agents with large language models,” Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 2998–3009, 2023.

  25. G. Dagan, F. Keller, and A. Lascarides, “Dynamic planning with a LLM,” arXiv preprint arXiv:2308.06391, 2023.

  26. K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf, “SayPlan: Grounding large language models using 3d scene graphs for scalable robot task planning,” Proc. of Conference on Robot Learning (CoRL), 2023.

  27. Y. Ding, X. Zhang, S. Amiri, N. Cao, H. Yang, A. Kaminski, C. Esselink, and S. Zhang, “Integrating action knowledge and LLMs for task planning and situation handling in open worlds,” Autonomous Robots, 2023.

  28. E. Zelikman, Q. Huang, G. Poesia, N. Goodman, and N. Haber, “Parsel: Algorithmic reasoning with language models by composing decompositions,” Advances in Neural Information Processing Systems, vol. 36, pp. 31466–31523, 2023.

    Google Scholar 

  29. Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents,” arXiv preprint arXiv:2302.01560, 2023.

  30. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” Proc. of International Conference on Machine Learning, PMLR, pp. 8748–8763, 2021.

  31. C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” Proc. of International Conference on Machine Learning, PMLR, pp. 4904–4916, 2021.

  32. J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” Proc. of International Conference on Machine Learning, PMLR, pp. 12888–12900, 2022.

  33. Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan, “VIMA: General robot manipulation with multimodal prompts,” NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.

  34. R. Shah, R. Martín-Martín, and Y. Zhu, “MUTEX: Learning unified policies from multimodal task specifications,” Proc. of 7th Annual Conference on Robot Learning, 2023.

  35. C. Huang, O. Mees, A. Zeng, and W. Burgard, “Audio visual language maps for robot navigation,” arXiv preprint arXiv:2303.07522, 2023.

  36. K. M. Jatavallabhula, A. Kuwajerwala, Q. Gu, M. Omama, T. Chen, A. Maalouf, S. Li, G. Iyer, S. Saryazdi, N. Keetha et al., “ConceptFusion: Open-set multimodal 3d mapping,” arXiv preprint arXiv:2302.07241, 2023.

  37. D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “PaLM-E: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023.

  38. L. Xue, N. Yu, S. Zhang, A. Panagopoulou, J. Li, R. Martin-Martin, J/ Wu, C. Xiong, R. Xu, J. C. Niebles, and S. Savarese, “Ulip-2: Towards scalable multimodal pre-training for 3d understanding”, arXiv preprint arXiv:2305.08275, 2023.

  39. P. Sermanet, T. Ding, J. Zhao, F. Xia, D. Dwibedi, K. Gopalakrishnan, C. Chan, G. Dulac-Arnold, S. Maddineni, N. J. Joshi et al., “RoboVQA: Multimodal long-horizon reasoning for robotics,” arXiv preprint arXiv:2311.00899, 2023.

  40. A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., “RT-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022.

  41. K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauza, T. Davchev, Y. Zhou, A. Gupta, A. Raju et al., “RoboCat: A self-improving foundation agent for robotic manipulation,” arXiv preprint arXiv:2306.11706, 2023.

  42. Y. Cheng, C. Zhang, Z. Zhang, X. Meng, S. Hong, W. Li, Z. Wang, Z. Wang, F. Yin, J. Zhao et al., “Exploring large language model based intelligent agents: Definitions, methods, and prospects,” arXiv preprint arXiv:2401.03428, 2024.

  43. X. Huang, W. Liu, X. Chen, X. Wang, H. Wang, D. Lian, Y. Wang, R. Tang, and E. Chen, “Understanding the planning of LLM agents: A survey,” arXiv preprint arXiv:2402.02716, 2024.

  44. H. Li, J. Leung, and Z. Shen, “Towards goal-oriented large language model prompting: A survey,” arXiv preprint arXiv:2401.14043, 2024.

  45. K. Yang, J. Liu, J. Wu, C. Yang, Y. R. Fung, S. Li, Z. Huang, X. Cao, X. Wang, Y. Wang et al., “If LLM is the wizard, then code is the wand: A survey on how code empowers large language models to serve as intelligent agents,” arXiv preprint arXiv:2401.00812, 2024.

  46. K. Kawaharazuka, T. Matsushima, A. Gambardella, J. Guo, C. Paxton, and A. Zeng, “Real-world robot applications of foundation models: A review,” arXiv preprint arXiv:2402.05741, 2024.

  47. R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman et al., “Foundation models in robotics: Applications, challenges, and the future,” arXiv preprint arXiv:2312.07843, 2023.

  48. J. Wang, Z. Wu, Y. Li, H. Jiang, P. Shu, E. Shi, H. Hu, C. Ma, Y. Liu, X. Wang et al., “Large language models for robotics: Opportunities, challenges, and perspectives,” arXiv preprint arXiv:2401.04334, 2024.

  49. Y. Hu, Q. Xie, V. Jain, J. Francis, J. Patrikar, N. Keetha, S. Kim, Y. Xie, T. Zhang, Z. Zhao et al., “Toward generalpurpose robots via foundation models: A survey and meta-analysis,” arXiv preprint arXiv:2312.08782, 2023.

  50. T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.

  51. J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.

  52. J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.

    Google Scholar 

  53. S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, “Tree of Thoughts: Deliberate problem solving with large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024.

  54. A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.

  55. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.

  56. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.

  57. L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.

    Google Scholar 

  58. Y. Inoue and H. Ohashi, “Prompter: Utilizing large language model prompting for a data efficient embodied instruction following,” arXiv preprint arXiv:2211.03267, 2022.

  59. Y. Ding, X. Zhang, C. Paxton, and S. Zhang, “Task and motion planning with large language models for object rearrangement,” Proc. of 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 2086–2092, 2023.

  60. V. S. Dorbala, J. F. Mullen Jr, and D. Manocha, “Can an embodied agent find your “cat-shaped mug”? LLM-based zero-shot object navigation,” IEEE Robotics and Automation Letters, 2023.

  61. Z. Wang, S. Cai, A. Liu, Y. Jin, J. Hou, B. Zhang, H. Lin, Z. He, Z. Zheng, Y. Yang et al., “JARVIS-1: Open-world multi-task agents with memory-augmented multimodal language models,” arXiv preprint arXiv:2311.05997, 2023.

  62. K. Zhou, K. Zheng, C. Pryor, Y. Shen, H. Jin, L. Getoor, and X. E. Wang, “ESC: Exploration with soft commonsense constraints for zero-shot object navigation,” Proc. of International Conference on Machine Learning, PMLR, 2023, pp. 42829–42842.

  63. X. Sun, H. Meng, S. Chakraborty, A. S. Bedi, and A. Bera, “Beyond Text: Utilizing vocal cues to improve decision making in llms for robot navigation tasks,” arXiv preprint arXiv:2402.03494, 2024.

  64. T. Birr, C. Pohl, A. Younes, and T. Asfour, “AutoGPT+P: Affordance-based task planning with large language models,” arXiv preprint arXiv:2402.10778, 2024.

  65. H. H. Zhuo, X. Chen, and R. Pan, “On the roles of llms in planning: Embedding llms into planning graphs,” arXiv preprint arXiv:2403.00783, 2024.

  66. J. Yang, Y. Dong, S. Liu, B. Li, Z. Wang, C. Jiang, H. Tan, J. Kang, Y. Zhang, K. Zhou et al., “Octopus: Embodied vision-language programmer from environmental feedback,” arXiv preprint arXiv:2310.08588, 2023.

  67. Y. Chen, J. Arkin, Y. Zhang, N. Roy, and C. Fan, “Scalable multi-robot collaboration with large language models: Centralized or decentralized systems?” arXiv preprint arXiv:2309.15943, 2023.

  68. Y.-J. Wang, B. Zhang, J. Chen, and K. Sreenath, “Prompt a robot to walk with large language models,” arXiv preprint arXiv:2309.09969, 2023.

  69. W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. G. Arenas, H.-T. L. Chiang, T. Erez, L. Hasenclever, J. Humplik et al., “Language to rewards for robotic skill synthesis,” arXiv preprint arXiv:2306.08647, 2023.

  70. Y. Shukla, W. Gao, V. Sarathy, A. Velasquez, R. Wright, and J. Sinapov, “LgTS: Dynamic task sampling using LLM-generated sub-goals for reinforcement learning agents,” arXiv preprint arXiv:2310.09454, 2023.

  71. Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” arXiv preprint arXiv:2310.12931, 2023.

  72. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.

  73. S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.

  74. M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, X. Wang, X. Zhai, T. Kipf, and N. Houlsby, “Simple open-vocabulary object detection,” Proc. of European Conference on Computer Vision, pp. 726–755, 2022.

  75. X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary object detection via vision and language knowledge distillation,” arXiv preprint arXiv:2104.13921, 2021.

  76. X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting twenty-thousand classes using imagelevel supervision,” Proc. of European Conference on Computer Vision, Springer, pp. 350–368, 2022.

  77. L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10965–10975, 2022.

  78. J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” Proc. of International Conference on Machine Learning, PMLR, pp. 2256–2265, 2015.

  79. Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” Advances in Neural Information Processing Systems, vol. 32, 2019.

  80. J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.

    Google Scholar 

  81. Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020.

  82. P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021.

    Google Scholar 

  83. M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine, “Planning with diffusion for flexible behavior synthesis,” arXiv preprint arXiv:2205.09991, 2022.

  84. J. Carvalho, A. T. Le, M. Baierl, D. Koert, and J. Peters, “Motion planning diffusion: Learning and planning of robot motions with diffusion models,” Proc. of 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 1916–1923, 2023.

  85. A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “PaLM: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.

    Google Scholar 

  86. S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg et al., “A generalist agent,” arXiv preprint arXiv:2205.06175, 2022.

  87. J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.

  88. (2023) GPT-4 model documentation. [Online]. Available: https://platform.openai.com/docs/models/gpt-4

  89. M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024.

  90. S. Ma, S. Vemprala, W. Wang, J. K. Gupta, Y. Song, D. McDufft, and A. Kapoor, “Compass: Contrastive multimodal pretraining for autonomous systems,” Proc. of 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 1000–1007, 2022.

  91. Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa et al., “Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,” arXiv preprint arXiv:2309.16650, 2023.

  92. N. H. Yokoyama, S. Ha, D. Batra, J. Wang, and B. Bucher, “Vlfm: Vision-language frontier maps for zero-shot semantic navigation,” Proc. of 2nd Workshop on Language and Robot Learning: Language as Grounding, 2023.

  93. J. Yang, W. Tan, C. Jin, B. Liu, J. Fu, R. Song, and L. Wang, “Pave the way to grasp anything: Transferring foundation models for universal pick-place robots,” arXiv preprint arXiv:2306.05716, 2023.

  94. A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 4015–4026, 2023.

  95. F. Liu, K. Fang, P. Abbeel, and S. Levine, “MOKA: Openvocabulary robotic manipulation through mark-based visual prompting,” arXiv preprint arXiv:2403.03174, 2024.

  96. T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan et al., “Grounded sam: Assembling open-world models for diverse visual tasks,” arXiv preprint arXiv:2401.14159, 2024.

  97. P. Liu, Y. Orru, C. Paxton, N. M. M. Shafiullah, and L. Pinto, “OK-robot: What really matters in integrating open-knowledge models for robotics,” arXiv preprint arXiv:2401.12202, 2024.

  98. S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,” Proc. of Proc of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23171–23181, 2023.

  99. A. Stone, T. Xiao, Y. Lu, K. Gopalakrishnan, K.-H. Lee, Q. Vuong, P. Wohlhart, S. Kirmani, B. Zitkovich, F. Xia et al., “Open-world object manipulation using pre-trained vision-language models,” arXiv preprint arXiv:2303.00905, 2023.

  100. W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “Voxposer: Composable 3d value maps for robotic manipulation with language models,” arXiv preprint arXiv:2307.05973, 2023.

  101. W. Huang, F. Xia, D. Shah, D. Driess, A. Zeng, Y. Lu, P. Florence, I. Mordatch, S. Levine, K. Hausman et al., “Grounded decoding: Guiding text generation with grounded models for robot control,” arXiv preprint arXiv:2303.00855, 2023.

  102. J. Gao, B. Sarkar, F. Xia, T. Xiao, J. Wu, B. Ichter, A. Majumdar, and D. Sadigh, “Physically grounded vision-language models for robotic manipulation,” arXiv preprint arXiv:2309.02561, 2023.

  103. N. M. M. Shafiullah, C. Paxton, L. Pinto, S. Chintala, and A. Szlam, “Clip-fields: Weakly supervised semantic fields for robotic memory,” arXiv preprint arXiv:2210.05663, 2022.

  104. S. Yenamandra, A. Ramachandran, K. Yadav, A. Wang, M. Khanna, T. Gervet, T.-Y. Yang, V. Jain, A. W. Clegg, J. Turner et al., “Homerobot: Open-vocabulary mobile manipulation,” arXiv preprint arXiv:2306.11565, 2023.

  105. B. Chen, F. Xia, B. Ichter, K. Rao, K. Gopalakrishnan, M. S. Ryoo, A. Stone, and D. Kappler, “Open-vocabulary queryable scene representations for real world planning,” Proc. of 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, pp. 11509–11522, 2023.

  106. T. Yoneda, J. Fang, P. Li, H. Zhang, T. Jiang, S. Lin, B. Picker, D. Yunis, H. Mei, and M. R. Walter, “Statler: State-maintaining language models for embodied reasoning,” arXiv preprint arXiv:2306.17840, 2023.

  107. A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, “Mdetr-modulated detection for end-to-end multi-modal understanding,” Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 1780–1790, 2021.

  108. B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven semantic segmentation,” arXiv preprint arXiv:2201.03546, 2022.

  109. C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual language maps for robot navigation,” Proc. of 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, pp. 10608–10615, 2023.

  110. J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg, S. Rusinkiewicz, and T. Funkhouser, “Tidybot: Personalized robot assistance with large language models,” Autonomous Robots, vol. 47, no. 8, pp. 1087–1102, 2023.

    Article  Google Scholar 

  111. A. Khandelwal, L. Weihs, R. Mottaghi, and A. Kembhavi, “Simple but effective: Clip embeddings for embodied AI,” Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14829–14838, 2022.

  112. R. Abdelfattah, Q. Guo, X. Li, X. Wang, and S. Wang, “Cdul: Clip-driven unsupervised learning for multi-label image classification,” Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 1348–1357, 2023.

  113. N. Kanazawa, K. Kawaharazuka, Y. Obinata, K. Okada, and M. Inaba, “Recognition of heat-induced food state changes by time-series use of vision-language model for cooking robot,” arXiv preprint arXiv:2309.01528, 2023.

  114. R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li, “Pointclip: Point cloud understanding by clip,” Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8552–8562, 2022.

  115. L. Xue, M. Gao, C. Xing, R. Martín-Martín, J. Wu, C. Xiong, R. Xu, J. C. Niebles, and S. Savarese, “Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,” Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1179–1189, 2023.

  116. J. McCormac, A. Handa, A. Davison, and S. Leutenegger, “Semanticfusion: Dense 3d semantic mapping with convolutional neural networks,” Proc. of 2017 IEEE International Conference on Robotics and automation (ICRA), IEEE, pp. 4628–4635, 2017.

  117. C. Liu, K. Wang, J. Shi, Z. Qiao, and S. Shen, “FM-Fusion: Instance-aware semantic mapping boosted by vision-language foundation models,” IEEE Robotics and Automation Letters, 2024.

  118. S. Taguchi and H. Deguchi, “Online embedding multi-scale CLIP features into 3D maps,” arXiv preprint arXiv:2403.18178, 2024.

  119. K. Yamazaki, T. Hanyu, K. Vo, T. Pham, M. Tran, G. Doretto, A. Nguyen, and N. Le, “Open-Fusion: Real-time open-vocabulary 3D mapping and queryable scene representation,” arXiv preprint arXiv:2310.03923, 2023.

  120. N. Keetha, A. Mishra, J. Karhade, K. M. Jatavallabhula, S. Scherer, M. Krishna, and S. Garg, “AnyLoc: Towards universal visual place recognition,” IEEE Robotics and Automation Letters, 2023.

  121. B. Yu, H. Kasaei, and M. Cao, “L3MVN: Leveraging large language models for visual target navigation,” Proc. of 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 3554–3560, 2023.

  122. A. Rajvanshi, K. Sikka, X. Lin, B. Lee, H.-P. Chiu, and A. Velasquez, “SayNav: Grounding large language models for dynamic planning to navigation in new environments,” arXiv preprint arXiv:2309.04077, 2023.

  123. Y. Qiao, Y. Qi, Z. Yu, J. Liu, and Q. Wu, “March in Chat: Interactive prompting for remote embodied referring expression,” Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 15758–15767, 2023.

  124. G. Zhou, Y. Hong, and Q. Wu, “NavGPT: Explicit reasoning in vision-and-language navigation with large language models,” Proc. of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, pp. 7641–7649, 2024.

    Article  Google Scholar 

  125. D. Shah, A. Sridhar, N. Dashora, K. Stachowicz, K. Black, N. Hirose, and S. Levine, “ViNT: A foundation model for visual navigation,” arXiv preprint arXiv:2306.14846, 2023.

  126. A. Majumdar, G. Aggarwal, B. Devnani, J. Hoffman, and D. Batra, “ZSON: Zero-shot object-goal navigation using multimodal goal embeddings,” Advances in Neural Information Processing Systems, vol. 35, pp. 32340–32352, 2022.

    Google Scholar 

  127. K. Yadav, A. Majumdar, R. Ramrakhya, N. Yokoyama, A. Baevski, Z. Kira, O. Maksymets, and D. Batra, “OVRLV2: A simple state-of-art baseline for ImageNav and ObjectNav,” arXiv preprint arXiv:2303.07798, 2023.

  128. Y. Kuang, H. Lin, and M. Jiang, “OpenFMNav: Towards open-set zero-shot object navigation via vision-language foundation models,” arXiv preprint arXiv:2402.10670, 2024.

  129. J. Chen, G. Li, S. Kumar, B. Ghanem, and F. Yu, “How to not train your dragon: Training-free embodied object goal navigation with semantic frontiers,” arXiv preprint arXiv:2305.16925, 2023.

  130. P. Wu, Y. Mu, B. Wu, Y. Hou, J. Ma, S. Zhang, and C. Liu, “VoroNav: Voronoi-based zero-shot object navigation with large language model,” arXiv preprint arXiv:2401.02695, 2024.

  131. L. Zhang, Q. Zhang, H. Wang, E. Xiao, Z. Jiang, H. Chen, and R. Xu, “TriHelper: Zero-shot object navigation with dynamic assistance,” arXiv preprint arXiv:2403.15223, 2024.

  132. Q. Xie, T. Zhang, K. Xu, M. Johnson-Roberson, and Y. Bisk, “Reasoning about the unseen for efficient outdoor object navigation,” arXiv preprint arXiv:2309.10103, 2023.

  133. R. Schumann, W. Zhu, W. Feng, T.-J. Fu, S. Riezler, and W. Y. Wang, “VELMA: Verbalization embodiment of LLM agents for vision and language navigation in street view,” Proc. of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, pp. 18924–18933, 2024.

    Article  Google Scholar 

  134. S. Zheng, Y. Feng, Z. Lu et al., “Steve-Eye: Equipping LLM-based embodied agents with visual perception in open worlds,” Proc. of The Twelfth International Conference on Learning Representations, 2023.

  135. M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman et al., “Do as I can, not as I say: Grounding language in robotic affordances,” arXiv preprint arXiv:2204.01691, 2022.

  136. A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan et al., “Open X-Embodiment: Robotic learning datasets and RT-X models,” arXiv preprint arXiv:2310.08864, 2023.

  137. M. Ahn, D. Dwibedi, C. Finn, M. G. Arenas, K. Gopalakrishnan, K. Hausman, B. Ichter, A. Irpan, N. Joshi, R. Julian et al., “AutoRT: Embodied foundation models for large scale orchestration of robotic agents,” arXiv preprint arXiv:2401.12963, 2024.

  138. X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu, C. R. Ruiz, S. Goodman, X. Wang, Y. Tay et al., “PaLI-X: On scaling up a multilingual vision and language model,” arXiv preprint arXiv:2305.18565, 2023.

  139. Z. Fu, T. Z. Zhao, and C. Finn, “Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation,” arXiv preprint arXiv:2401.02117, 2024.

  140. J. Aldaco, T. Armstrong, R. Baruch, J. Bingham, S. Chan, K. Draper, D. Dwibedi, C. Finn, P. Florence, S. Goodrich et al., “ALOHA 2: An enhanced low-cost hardware for bimanual teleoperation,” arXiv preprint arXiv:2405.02292, 2024.

  141. D. Shah, B. Osiński, S. Levine et al., “LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action,” Proc. of Conference on Robot Learning, PMLR, pp. 492–504, 2023.

  142. K. Hori, K. Suzuki, and T. Ogata, “Interactively robot action planning with uncertainty analysis and active questioning by large language model,” Proc. of 2024 IEEE/SICE International Symposium on System Integration (SII), IEEE, pp. 85–91, 2024.

  143. Z. Yang, S. S. Raman, A. Shah, and S. Tellex, “Plug in the safety chip: Enforcing constraints for LLM-driven robot agents,” arXiv preprint arXiv:2309.09919, 2023.

  144. H. Sha, Y. Mu, Y. Jiang, L. Chen, C. Xu, P. Luo, S. E. Li, M. Tomizuka, W. Zhan, and M. Ding, “LanguageMPC: Large language models as decision makers for autonomous driving,” arXiv preprint arXiv:2310.03026, 2023.

  145. L. Guan, K. Valmeekam, S. Sreedharan, and S. Kambhampati, “Leveraging pre-trained large language models to construct and utilize world models for model-based task planning,” Advances in Neural Information Processing Systems, vol. 36, pp. 79081–79094, 2023.

    Google Scholar 

  146. B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone, “LLM+P: Empowering large language models with optimal planning proficiency,” arXiv preprint arXiv:2304.11477, 2023.

  147. Y. Xie, C. Yu, T. Zhu, J. Bai, Z. Gong, and H. Soh, “Translating natural language to planning goals with large-language models,” arXiv preprint arXiv:2302.05128, 2023.

  148. Z. Zhao, W. S. Lee, and D. Hsu, “Large language models as commonsense knowledge for large-scale task planning,” Advances in Neural Information Processing Systems, vol. 36, 2024.

  149. G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “VOYAGER: An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291, 2023.

  150. S. Lifshitz, K. Paster, H. Chan, J. Ba, and S. McIl-raith, “Steve-1: A generative model for text-to-behavior in minecraft,” Advances in Neural Information Processing Systems, vol. 36, 2024.

  151. K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati, “On the planning abilities of large language models-a critical investigation,” Advances in Neural Information Processing Systems, vol. 36, 2024.

  152. S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu, “Reasoning with language model is planning with world model,” arXiv preprint arXiv:2305.14992, 2023.

  153. T. Silver, V. Hariprasad, R. S. Shuttleworth, N. Kumar, T. Lozano-Pérez, and L. P. Kaelbling, “PDDL planning with pretrained large language models,” Proc. of NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.

  154. Y. Ding, X. Zhang, C. Paxton, and S. Zhang, “Leveraging commonsense knowledge from large language models for task and motion planning,” Proc. of RSS 2023 Workshop on Learning for Task and Motion Planning, 2023.

  155. D. Shah, B. Eysenbach, G. Kahn, N. Rhinehart, and S. Levine, “ViNG: Learning open-world navigation with visual goals,” Proc. of 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE, pp. 13215–13222, 2021.

  156. S. Chen, A. Xiao, and D. Hsu, “LLM-State: Expandable state representation for long-horizon task planning in the open world,” arXiv preprint arXiv:2311.17406, 2023.

  157. E. Latif, “3P-LLM: Probabilistic path planning using large language model for autonomous robot navigation,” arXiv preprint arXiv:2403.18778, 2024.

  158. W. Chen, S. Koenig, and B. Dilkina, “Why solving multiagent path finding with large language model has not succeeded yet,” arXiv preprint arXiv:2401.03630, 2024.

  159. Y. Kong, J. Ruan, Y. Chen, B. Zhang, T. Bao, S. Shi, G. Du, X. Hu, H. Mao, Z. Li et al., “TPTU-v2: Boosting task planning and tool usage of large language model-based agents in real-world systems,” arXiv preprint arXiv:2311.11315, 2023.

  160. T. Silver, S. Dan, K. Srinivas, J. B. Tenenbaum, L. Kaelbling, and M. Katz, “Generalized planning in PDDL domains with pretrained large language models,” Proc. of the AAAI Conference on Artificial Intelligence, vol. 38, no. 18, pp. 20256–20264, 2024.

    Article  Google Scholar 

  161. Y. Wu, J. Zhang, N. Hu, L. Tang, G. Qi, J. Shao, J. Ren, and W. Song, “MLDT: Multi-level decomposition for complex long-horizon robotic task planning with open-source large language model,” arXiv preprint arXiv:2403.18760, 2024.

  162. W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” Proc. of International Conference on Machine Learning, PMLR, pp. 9118–9147, 2022.

  163. Z. Wu, Z. Wang, X. Xu, J. Lu, and H. Yan, “Embodied task planning with large language models,” arXiv preprint arXiv:2307.01848, 2023.

  164. Y. Zhen, S. Bi, L. Xing-tong, P. Wei-qin, S. Haipeng, C. Zi-rui, and F. Yi-shu, “Robot task planning based on large language model representing knowledge with directed graph structures,” arXiv preprint arXiv:2306.05171, 2023.

  165. K. Lin, C. Agia, T. Migimatsu, M. Pavone, and J. Bohg, “Text2Motion: From natural language instructions to feasible plans,” Autonomous Robots, vol. 47, no. 8, pp. 1345–1365, 2023.

    Article  Google Scholar 

  166. B. Pan, J. Lu, K. Wang, L. Zheng, Z. Wen, Y. Feng, M. Zhu, and W. Chen, “AgentCoord: Visually exploring coordination strategy for llm-based multi-agent collaboration,” arXiv preprint arXiv:2404.11943, 2024.

  167. Z. Zhou, J. Song, K. Yao, Z. Shu, and L. Ma, “ISRLLM: Iterative self-refined large language model for long-horizon sequential task planning,” arXiv preprint arXiv:2308.13724, 2023.

  168. Y. Liu, L. Palmieri, S. Koch, I. Georgievski, and M. Aiello, “DELTA: Decomposed efficient long-term robot task planning using large language models,” arXiv preprint arXiv:2404.03275, 2024.

  169. Z. Yang, A. Ishay, and J. Lee, “Coupling large language models with logic programming for robust and general reasoning from text,” arXiv preprint arXiv:2307.07696, 2023.

  170. G. Chalvatzaki, A. Younes, D. Nandha, A. T. Le, L. F. R. Ribeiro, and I. Gurevych, “Learning to reason over scene graphs: A case study of finetuning GPT-2 into a robot language model for grounded task planning,” Frontiers in Robotics and AI, 2023.

  171. D. Han, T. McInroe, A. Jelley, S. V. Albrecht, P. Bell, and A. Storkey, “LLM-Personalize: Aligning LLM planners with human preferences via reinforced self-training for housekeeping robots,” arXiv preprint arXiv:2404.14285, 2024.

  172. B. Y. Lin, Y. Fu, K. Yang, F. Brahman, S. Huang, C. Bhagavatula, P. Ammanabrolu, Y. Choi, and X. Ren, “Swift-Sage: A generative agent with fast and slow thinking for complex interactive tasks,” Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.

  173. N. Wake, A. Kanehira, K. Sasabuchi, J. Takamatsu, and K. Ikeuchi, “ChatGPT empowered long-step robot control in various environments: A case application,” IEEE Access, 2023.

  174. S. H. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor, “ChatGPT for robotics: Design principles and model abilities,” IEEE Access, 2024.

  175. S. S. Raman, V. Cohen, E. Rosen, I. Idrees, D. Paulius, and S. Tellex, “Planning with large language models via corrective re-prompting,” Proc. of NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.

  176. W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar et al., “Inner Monologue: Embodied reasoning through planning with language models,” arXiv preprint arXiv:2207.05608, 2022.

  177. J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as Policies: Language model programs for embodied control,” Proc. of 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, pp. 9493–9500, 2023.

  178. A. Jiao, T. P. Patel, S. Khurana, A.-M. Korol, L. Brunke, V. K. Adajania, U. Culha, S. Zhou, and A. P. Schoellig, “Swarm-GPT: Combining large language models with safe motion planning for robot choreography design,” arXiv preprint arXiv:2312.01059, 2023.

  179. Z. Mandi, S. Jain, and S. Song, “RoCo: Dialectic multirobot collaboration with large language models,” arXiv preprint arXiv:2307.04738, 2023.

  180. A. Bucker, L. Figueredo, S. Haddadin, A. Kapoor, S. Ma, S. Vemprala, and R. Bonatti, “LaTTe: Language trajectory transformer,” Proc. of 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, pp. 7287–7294, 2023.

  181. S. Wang, M. Han, Z. Jiao, Z. Zhang, Y. N. Wu, S.-C. Zhu, and H. Liu, “LLM3: Large language model-based task and motion planning with motion failure reasoning,” arXiv preprint arXiv:2403.11552, 2024.

  182. T. Xie, S. Zhao, C. H. Wu, Y. Liu, Q. Luo, V. Zhong, Y. Yang, and T. Yu, “TEXT2REWARD: Reward shaping with language models for reinforcement learning,” arXiv preprint arXiv:2309.11489, 2023.

  183. D. M. Proux, C. Roux, M. Niemaz et al., “LARG2, language-based automatic reward and goal generation,” 2023.

  184. J. Song, Z. Zhou, J. Liu, C. Fang, Z. Shu, and L. Ma, “Self-refined large language model as automated reward function designer for deep reinforcement learning in robotics,” arXiv preprint arXiv:2309.06687, 2023.

  185. Y. Tang, W. Yu, J. Tan, H. Zen, A. Faust, and T. Harada, “SayTap: Language to quadrupedal locomotion,” arXiv preprint arXiv:2306.07580, 2023.

  186. J. Y. Zhu, C. G. Cano, D. V. Bermudez, and M. Drozdzal, “InCoRo: In-context learning for robotics control with feedback loops,” 2024.

  187. Y. Cao and C. G. Lee, “Ground manipulator primitive tasks to executable actions using large language models,” Proc. of the AAAI Symposium Series, vol. 2, no. 1, pp. 502–507, 2023.

    Article  Google Scholar 

  188. H. He, C. Bai, K. Xu, Z. Yang, W. Zhang, D. Wang, B. Zhao, and X. Li, “Diffusion model is an effective planner and data synthesizer for multi-task reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, 2024.

  189. J. Chang, H. Ryu, J. Kim, S. Yoo, J. Seo, N. Prakash, J. Choi, and R. Horowitz, “Denoising heat-inspired diffusion with insulators for collision free motion planning,” arXiv preprint arXiv:2310.12609, 2023.

  190. H. Ryu, J. Kim, J. Chang, H. S. Ahn, J. Seo, T. Kim, J. Choi, and R. Horowitz, “Diffusion-EDFs: Bi-equivariant denoising generative modeling on se (3) for visual robotic manipulation,” arXiv preprint arXiv:2309.02685, 2023.

  191. J. Urain, N. Funk, J. Peters, and G. Chalvatzaki, “SE(3)-DiffusionFields: Learning smooth cost functions for joint grasp and motion optimization through diffusion,” Proc. of 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, pp. 5923–5930, 2023.

  192. J. Carvalho, M. Baierl, J. Urain, and J. Peters, “Conditioned score-based models for learning collision-free trajectory generation,” Proc. of NeurIPS 2022 Workshop on Score-Based Methods, 2022.

  193. Z. Wu, S. Ye, M. Natarajan, and M. C. Gombolay, “Diffusion-reinforcement learning hierarchical motion planning in adversarial multi-agent games,” arXiv preprint arXiv:2403.10794, 2024.

  194. C. Jiang, A. Cornman, C. Park, B. Sapp, Y. Zhou, D. Anguelov et al., “MotionDiffuser: Controllable multiagent motion prediction using diffusion,” Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9644–9653, 2023.

  195. K. Saha, V. Mandadi, J. Reddy, A. Srikanth, A. Agarwal, B. Sen, A. Singh, and M. Krishna, “EDMP: Ensemble-of-costs-guided diffusion for motion planning,” 2023.

  196. S. Zhou, Y. Du, S. Zhang, M. Xu, Y. Shen, W. Xiao, D.-Y. Yeung, and C. Gan, “Adaptive online replanning with diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024.

  197. W. Liu, Y. Du, T. Hermans, S. Chernova, and C. Paxton, “StructDiffusion: Language-guided creation of physically-valid structures using unseen objects,” arXiv preprint arXiv:2211.04604, 2022.

  198. S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn, “RoboNet: Large-scale multi-robot learning,” arXiv preprint arXiv:1910.11215, 2019.

  199. F. Ebert, Y. Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine, “Bridge Data: Boosting generalization of robotic skills with cross-domain datasets,” arXiv preprint arXiv:2109.13396, 2021.

  200. H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. W. He, V. Myers, M. J. Kim, M. Du et al., “BridgeData V2: A dataset for robot learning at scale,” Proc. of Conference on Robot Learning, PMLR, pp. 1723–1736, 2023.

  201. H.-S. Fang, H. Fang, Z. Tang, J. Liu, J. Wang, H. Zhu, and C. Lu, “RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot,” Proc. of RSS 2023 Workshop on Learning for Task and Motion Planning, 2023.

  202. D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine, “GNM: A general navigation model to drive any robot,” Proc. of 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2023, pp. 7226–7233.

  203. K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu et al., “Ego4D: Around the world in 3,000 hours of egocentric video,” Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995–19012, 2022.

  204. J. Slaney and S. Thiébaux, “Blocks world revisited,” Artificial Intelligence, vol. 125, no. 1–2, pp. 119–153, 2001.

    Article  MathSciNet  Google Scholar 

  205. M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “ALFRED: A benchmark for interpreting grounded instructions for everyday tasks,” Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10740–10749, 2020.

  206. S. James, Z. Ma, D. Rovick Arrojo, and A. J. Davison, “RLBench: The robot learning benchmark & learning environment,” IEEE Robotics and Automation Letters, 2020.

  207. A. Zeng, P. Florence, J. Tompson, S. Welker, J. Chien, M. Attarian, T. Armstrong, I. Krasin, D. Duong, V. Sindhwani et al., “Transporter Networks: Rearranging the visual world for robotic manipulation,” Proc. of Conference on Robot Learning, PMLR, pp. 726–747, 2021.

  208. O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7327–7334, 2022.

    Article  Google Scholar 

  209. J. Panerati, H. Zheng, S. Zhou, J. Xu, A. Prorok, and A. P. Schoellig, “Learning to fly—a gym environment with pybullet physics for reinforcement learning of multiagent quadcopter control,” Proc. of 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 7512–7519, 2021.

  210. C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Martin-Martin, C. Wang, G. Levine, M. Lingelbach, J. Sun et al., “BEHAVIOR-1K: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation,” Proc. of Conference on Robot Learning, PMLR, pp. 80–93, 2023.

  211. Z. Mandi, H. Bharadhwaj, V. Moens, S. Song, A. Rajeswaran, and V. Kumar, “CACTI: A framework for scalable multi-task multi-scene visual imitation learning,” arXiv preprint arXiv:2212.05711, 2022.

  212. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

  213. Z. Chen, S. Kiami, A. Gupta, and V. Kumar, “GenAug: Retargeting behaviors to unseen situations via generative augmentation,” arXiv preprint arXiv:2302.06671, 2023.

  214. T. Yu, T. Xiao, A. Stone, J. Tompson, A. Brohan, S. Wang, J. Singh, C. Tan, J. Peralta, B. Ichter et al., “Scaling robot learning with semantically imagined experience,” arXiv preprint arXiv:2302.11550, 2023.

  215. C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in Neural Information Processing Systems, vol. 35, pp. 36479–36494, 2022.

    Google Scholar 

  216. K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine, “Zero-shot robotic manipulation with pretrained image-editing diffusion models,” arXiv preprint arXiv:2310.10639, 2023.

  217. T. Xiao, H. Chan, P. Sermanet, A. Wahid, A. Brohan, K. Hausman, S. Levine, and J. Tompson, “Robotic skill acquisition via instruction augmentation with vision-language models,” arXiv preprint arXiv:2211.11736, 2022.

  218. H. Ha, P. Florence, and S. Song, “Scaling Up and Distilling Down: Language-guided robot skill acquisition,” Proc. of Conference on Robot Learning, PMLR, pp. 3766–3777, 2023.

  219. L. Wang, Y. Ling, Z. Yuan, M. Shridhar, C. Bao, Y. Qin, B. Wang, H. Xu, and X. Wang, “GenSim: Generating robotic simulation tasks via large language models,” arXiv preprint arXiv:2310.01361, 2023.

  220. Y. Wang, Z. Xian, F. Chen, T.-H. Wang, Y. Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan, “Robo-Gen: Towards unleashing infinite data for automated robot learning via generative simulation,” arXiv preprint arXiv:2311.01455, 2023.

  221. J. Gu, S. Kirmani, P. Wohlhart, Y. Lu, M. G. Arenas, K. Rao, W. Yu, C. Fu, K. Gopalakrishnan, Z. Xu et al., “RT-Trajectory: Robotic task generalization via hindsight trajectory sketches,” arXiv preprint arXiv:2311.01977, 2023.

  222. A. Z. Ren, A. Dixit, A. Bodrova, S. Singh, S. Tu, N. Brown, P. Xu, L. Takayama, F. Xia, J. Varley et al., “Robots that ask for help: Uncertainty alignment for large language model planners,” arXiv preprint arXiv:2307.01928, 2023.

  223. C. Kassab, M. Mattamala, L. Zhang, and M. Fallon, “Language-EXtended Indoor SLAM (LEXIS): A versatile system for real-time visual scene understanding,” arXiv preprint arXiv:2309.15065, 2023.

  224. Z. Liu, A. Bahety, and S. Song, “REFLECT: Summarizing robot experiences for failure explanation and correction,” arXiv preprint arXiv:2306.15724, 2023.

  225. G. Tatiya, J. Francis, and J. Sinapov, “Cross-tool and cross-behavior perceptual knowledge transfer for grounded object recognition,” arXiv preprint arXiv:2303.04023, 2023.

  226. S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta, “R3M: A universal visual representation for robot manipulation,” arXiv preprint arXiv:2203.12601, 2022.

  227. A. Z. Ren, B. Govil, T.-Y. Yang, K. R. Narasimhan, and A. Majumdar, “Leveraging language for accelerated learning of tool manipulation,” Proc. of Conference on Robot Learning, PMLR, pp. 1531–1541, 2023.

  228. M. Shridhar, L. Manuelli, and D. Fox, “CLIPort: What and where pathways for robotic manipulation,” Proc. of Conference on robot learning, PMLR, pp. 894–906, 2022.

  229. L.-H. Lin, Y. Cui, Y. Hao, F. Xia, and D. Sadigh, “Gesture-informed robot assistance via foundation models,” Proc. of 7th Annual Conference on Robot Learning, 2023.

  230. R. Mirjalili, M. Krawez, and W. Burgard, “FM-Loc: Using foundation models for improved vision-based localization,” Proc. of 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 1381–1387, 2023.

  231. Y. Ze, G. Yan, Y.-H. Wu, A. Macaluso, Y. Ge, J. Ye, N. Hansen, L. E. Li, and X. Wang, “GNFactor: Multitask real robot learning with generalizable neural feature fields,” Proc. of Conference on Robot Learning, PMLR, pp. 284–301, 2023.

  232. K. Chu, X. Zhao, C. Weber, M. Li, W. Lu, and S. Wermter, “Large language models for orchestrating bimanual robots,” arXiv preprint arXiv:2404.02018, 2024.

  233. X. Zhao, M. Li, C. Weber, M. B. Hafez, and S. Wermter, “Chat with the environment: Interactive multimodal perception using large language models,” Proc. of 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 3590–3596, 2023.

  234. T. Gervet, Z. Xian, N. Gkanatsios, and K. Fragkiadaki, “Act3D: 3d feature field transformers for multi-task robotic manipulation,” Proc. of 7th Annual Conference on Robot Learning, 2023.

  235. M. Gramopadhye and D. Szafir, “Generating executable action plans with environmentally-aware language models,” Proc. of 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 3568–3575, 2023.

  236. M. Hu, Y. Mu, X. Yu, M. Ding, S. Wu, W. Shao, Q. Chen, B. Wang, Y. Qiao, and P. Luo, “Tree-Planner: Efficient close-loop task planning with large language models,” arXiv preprint arXiv:2310.08582, 2023.

  237. Z. Liu, H. Hu, S. Zhang, H. Guo, S. Ke, B. Liu, and Z. Wang, “Reason for future, act for now: A principled framework for autonomous LLM agents with provable sample efficiency,” arXiv preprint arXiv:2309.17382, 2023.

  238. J. Yu, R. He, and R. Ying, “Thought Propagation: An analogical approach to complex reasoning with large language models,” arXiv preprint arXiv:2310.03965, 2023.

  239. J. Brawer, K. Bishop, B. Hayes, and A. Roncone, “Towards a natural language interface for flexible multi-agent task assignment,” Proc. of the AAAI Symposium Series, vol. 2, no. 1, pp. 167–171, 2023.

    Article  Google Scholar 

  240. T. T. Andersen, “Optimizing the universal robots ros driver.” 2015.

  241. S. Haddadin, S. Parusel, L. Johannsmeier, S. Golz, S. Gabl, F. Walch, M. Sabaghian, C. Jähne, L. Hausperger, and S. Haddadin, “The Franka Emika robot: A reference platform for robotics research and education,” IEEE Robotics & Automation Magazine, vol. 29, no. 2, pp. 46–64, 2022.

    Article  Google Scholar 

  242. F. Kaplan, “Everyday robotics: Robots as everyday objects,” Proc. of the 2005 Joint Conference on Smart Objects and Ambient Intelligence: Innovative Context-aware Services: Usages and Technologies, pp. 59–64, 2005.

  243. U. Yamaguchi, F. Saito, K. Ikeda, and T. Yamamoto, “HSR, human support robot as research and development platform,” Proc. of The Abstracts of the international conference on advanced mechatronics: Toward evolutionary fusion of IT and mechatronics: ICAM 2015.6, The Japan Society of Mechanical Engineers, pp. 39–40, 2015.

  244. G. Elias, M. Schuenck, Y. Negócio, J. Dias Jr, and S. M. Filho, “X-ARM: An asset representation model for component repository systems,” Proc. of the 2006 ACM symposium on Applied computing, pp. 1690–1694, 2006.

  245. R. Amsters and P. Slaets, “Turtlebot 3 as a robotics education platform,” Proc. of Robotics in Education: Current Research and Innovations 10, Springer, pp. 170–181, 2020.

  246. M. Kerzel, P. Allgeuer, E. Strahl, N. Frick, J.-G. Habekost, M. Eppe, and S. Wermter, “NICOL: A neuro-inspired collaborative semi-humanoid robot that bridges social interaction and reliable manipulation,” IEEE Access, vol. 11, pp. 123531–123542, 2023.

    Article  Google Scholar 

  247. E. Rohmer, S. P. Singh, and M. Freese, “V-REP: A versatile and scalable robot simulation framework,” Proc. of 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp. 1321–1326, 2013.

  248. E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” Proc. of 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp. 5026–5033, 2012.

  249. J. Liang, V. Makoviychuk, A. Handa, N. Chentanez, M. Macklin, and D. Fox, “GPU-accelerated robotic simulation for distributed reinforcement learning,” Proc. of Conference on Robot Learning, PMLR, pp. 270–282, 2018.

  250. X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba, “VirtualHome: Simulating household activities via programs,” Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8494–8502, 2018.

  251. X. Puig, T. Shu, S. Li, Z. Wang, J. B. Tenenbaum, S. Fidler, and A. Torralba, “Watch-And-Help: A challenge for social perception and human-AI collaboration,” 2020.

  252. M. Shridhar, X. Yuan, M.-A. Côté, Y. Bisk, A. Trischler, and M. Hausknecht, “ALFWorld: Aligning text and embodied environments for interactive learning,” Proc. of the International Conference on Learning Representations (ICLR), 2021.

  253. M.-A. Côté, A. Kádár, X. Yuan, B. Kybartas, T. Barnes, E. Fine, J. Moore, R. Y. Tao, M. Hausknecht, L. E. Asri, M. Adada, W. Tay, and A. Trischler, “TextWorld: A learning environment for text-based games,” arXiv preprint arXiv:1806.11532, 2018.

  254. M. Deitke, W. Han, A. Herrasti, A. Kembhavi, E. Kolve, R. Mottaghi, J. Salvador, D. Schwenk, E. VanderBilt, M. Wallingford, L. Weihs, M. Yatskar, and A. Farhadi, “RoboTHOR: An open simulation-to-real embodied AI platform,” Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

  255. E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi, “AI2-THOR: An interactive 3D environment for visual AI,” arXiv preprint arXiv:1712.05474, 2017.

  256. X. Puig, E. Undersander, A. Szot, M. D. Cote, R. Partsey, J. Yang, R. Desai, A. W. Clegg, M. Hlavac, T. Min, T. Gervet, V. Vondruš, V.-P. Berges, J. Turner, O. Maksymets, Z. Kira, M. Kalakrishnan, J. Malik, D. S. Chaplot, U. Jain, D. Batra, A. Rai, and R. Mottaghi, “Habitat 3.0: A co-habitat for humans, avatars and robots,” 2023.

  257. A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V. Vondrus, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V. Koltun, J. Malik, M. Savva, and D. Batra, “Habitat 2.0: Training home assistants to rearrange their habitat,” Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2021.

  258. M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra, “Habitat: A platform for embodied AI research,” Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

  259. M. Carroll, R. Shah, M. K. Ho, T. Griffiths, S. Seshia, P. Abbeel, and A. Dragan, “On the utility of learning about humans for human-AI coordination,” Advances in Neural Information Processing Systems, vol. 32, 2019.

Author information

Corresponding author

Correspondence to Han-Lim Choi.

Ethics declarations

The authors declare that there is no competing financial interest or personal relationship that could have appeared to influence the work reported in this paper. The corresponding author, Han-Lim Choi, is a senior editor of this journal.

Additional information

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research was partly supported by the Unmanned Vehicles Core Technology Research and Development Program through the National Research Foundation of Korea (NRF) (2020M3C1C1A0108237512), and by the Gyonggi-do Regional Research Centre (GRRC) funded by Gyonggi Province (GRRC Aerospace 2023-B01).

Dae-Sung Jang received his B.S. and Ph.D. degrees in aerospace engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2008 and 2015, respectively. He is an Associate Professor with the Department of Aeronautical and Astronautical Engineering, Korea Aerospace University (KAU), Goyang, Korea. Before joining KAU in 2018, he worked at KAIST and then NASA Ames Research Center as a Postdoctoral Researcher. His research interests include multi-agent system decision making and task assignment/scheduling, combinatorial optimization and approximation algorithms, sensor system resource management, navigation and planning of autonomous robots, and cooperative estimation/control.

Doo-Hyun Cho received his B.S., M.S., and Ph.D. degrees in aerospace engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2013, 2015, and 2019, respectively. He is currently with D.Notitia, having joined the organization in 2023. Prior to this, he worked at Samsung Electronics DS and later at Stradvision as a research engineer. His research interests include foundation model-based multi-agent system decision-making, task assignment/scheduling, system resource optimization, and vision-related AI model optimization.

Woo-Cheol Lee received his B.S. degree in aerospace engineering from Korea Aerospace University (KAU), Goyang, Korea, in 2015. He received his M.S. and Ph.D. degrees in aerospace engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2017 and 2022, respectively. He is a Senior Researcher with the Extreme Robotics Team, Korea Atomic Energy Research Institute (KAERI), Daejeon, Korea. Before joining KAERI in 2023, he worked at Samsung Electronics and then Hyundai Motor Company as a Senior Robotics Researcher. His research interests include navigation, perception, and task management.

Seung-Keol Ryu is a post-master researcher at Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea. He received his B.S. degree (double majors) in aerospace engineering and physics from KAIST in 2022. He received an M.S. degree in aerospace engineering from KAIST in 2024. He will begin his Ph.D. studies in aerospace engineering at the University of Colorado Boulder (CU Boulder) in fall 2024. His research interests include motion planning, aerial robotics, and diffusion models.

Byeongmin Jeong is currently a Ph.D. student of aerospace engineering at Korea Advanced Institute of Science and Technology (KAIST). He received his B.S. degree in mechanical engineering from Korea Aerospace University (KAU) in 2012. He received an M.S. degree in aerospace engineering from KAIST in 2014. His research interests include multi-agent systems and path planning.

Minji Hong received her B.S. degree in aerospace engineering from Korea Aerospace University (KAU), Goyang, Korea, in 2022. She is a Ph.D. student of aerospace engineering at Korea Advanced Institute of Science and Technology (KAIST). Her research interests include multi-agent path planning and task assignment.

Minjo Jung received his B.S. degree in mechanical engineering from Korea Aerospace University (KAU), Goyang, Korea, in 2023. He is an M.S. student of aerospace engineering at Korea Advanced Institute of Science and Technology (KAIST). His research interests include task assignment, multi-agent path planning, and robot motion planning.

Minchae Kim received her B.S. degree in aerospace engineering from Inha University, Incheon, Korea, in 2023. She is an M.S. student of aerospace engineering at Korea Advanced Institute of Science and Technology (KAIST). Her research interests include decision-making under uncertainty, astrodynamics, and spacecraft autonomous control.

Minjoon Lee received his B.S. and M.S. degrees in aerospace engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2020 and 2022, respectively. He is a Researcher with Space Systems Team, Defense Agency for Technology and Quality. His research interests include task assignment/scheduling and spacecraft autonomous control.

SeungJae Lee received his B.S. and M.S. degrees in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2004 and 2006, respectively. He obtained his Ph.D. in electrical and computer engineering from the University of California Irvine, CA, USA, in 2016. He has held various engineering roles, including serving as a principal engineer at Samsung Electronics. He later worked as a computer vision/perception engineer at autonomous driving companies 42dot and Stradvision. Currently, he is with D.notitia, leading the AI team and focusing on research in physics-informed neural networks, large language models (LLM/VLM), and their applications to solving diverse real-world problems.

Han-Lim Choi is a Professor of aerospace engineering at Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea. He received his B.S. and M.S. degrees in aerospace engineering from KAIST, in 2000 and 2002, respectively, and his Ph.D. degree in aeronautics and astronautics from Massachusetts Institute of Technology (MIT), Cambridge, MA, USA, in 2009. He then studied at MIT as a postdoctoral associate until he joined KAIST in 2010. His current research interests include decision-making for multi-agent systems, decision under uncertainty and learning, and intelligent aerospace systems.

About this article

Cite this article

Jang, DS., Cho, DH., Lee, WC. et al. Unlocking Robotic Autonomy: A Survey on the Applications of Foundation Models. Int. J. Control Autom. Syst. 22, 2341–2384 (2024). https://doi.org/10.1007/s12555-024-0438-7

Keywords

Navigation