Sample-Efficient Reinforcement Learning Based on Dynamics Models via Meta-policy Optimization

Guoyu Zuo (ORCID: orcid.org/0000-0002-7624-4728), Zhipeng Tian, Shuai Huang & Daoxiong Gong

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1515)

Included in the following conference series:

International Conference on Cognitive Systems and Signal Processing

Abstract

Model-based reinforcement learning (RL) can achieve remarkable sample efficiency, which makes it a suitable choice for applications where experimental data are hard to collect. However, it is difficult to learn an accurate dynamics model that fully matches the real world, and the accuracy of the model usually affects the agent’s final performance. In this paper, we propose a novel model-based RL approach called Meta-policy Optimization method with branched rollouts (MPOBR), which removes the strong dependency on an accurate model. In MPOBR, meta-learning is used to train a policy prior on an ensemble of learned dynamics models, so that this prior can be rapidly adapted to the environment when combined with environment rollouts. To reduce the effect of compounding model bias, short model-generated rollouts branched from real data are used to update the meta-policy. Experiments on simulated robotic tasks are designed to verify the effectiveness of our method. Results show that our approach can achieve the same asymptotic performance as state-of-the-art model-free algorithms while significantly reducing sample complexity.
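
The idea of short model-generated rollouts branched from real data can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `branched_rollouts`, the scalar state and action types, the toy linear dynamics, and the choice of one randomly selected ensemble member per rollout are all assumptions made for illustration. It only shows that each rollout starts at a state observed in the real environment and is then simulated by a learned model for a small number of steps.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical scalar types; a real task would use state/action vectors.
State = float
Action = float
Transition = Tuple[State, Action, State]
Model = Callable[[State, Action], State]
Policy = Callable[[State], Action]


def branched_rollouts(
    real_states: Sequence[State],
    models: Sequence[Model],
    policy: Policy,
    num_branches: int,
    horizon: int,
    rng: random.Random,
) -> List[List[Transition]]:
    """Generate short model rollouts that each start from a real state."""
    rollouts = []
    for _ in range(num_branches):
        # Branch point: a state that was actually visited in the environment.
        state = rng.choice(real_states)
        # Assumption: one ensemble member simulates the whole short rollout.
        model = rng.choice(models)
        trajectory = []
        for _ in range(horizon):
            action = policy(state)
            next_state = model(state, action)
            trajectory.append((state, action, next_state))
            state = next_state
        rollouts.append(trajectory)
    return rollouts


# Toy ensemble: 1-D dynamics s' = s + a, each member with a different bias.
models = [lambda s, a, b=b: s + a + b for b in (-0.01, 0.0, 0.01)]
policy = lambda s: -0.5 * s      # toy policy that pushes the state toward zero
real_states = [1.0, -2.0, 0.5]   # stand-in for states from environment rollouts

rollouts = branched_rollouts(real_states, models, policy,
                             num_branches=4, horizon=3, rng=random.Random(0))
```

Keeping `horizon` small limits how far model error can compound before the rollout ends, which is the motivation the abstract gives for branching from real data rather than simulating full episodes.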

Supported by the National Natural Science Foundation of China (61873008 and 61773022), the Beijing Natural Science Foundation (4192010) and the National Key R&D Plan (2018YFB1307004).

Author information

Authors and Affiliations

Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
Guoyu Zuo, Zhipeng Tian, Shuai Huang & Daoxiong Gong
Beijing Key Laboratory of Computing Intelligence and Intelligent Systems, Beijing, 100124, China
Guoyu Zuo, Zhipeng Tian, Shuai Huang & Daoxiong Gong

Corresponding author

Correspondence to Guoyu Zuo.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Fuchun Sun
National University of Defense Technology, Changsha, China
Dewen Hu
Universität Hamburg, Hamburg, Germany
Stefan Wermter
Tsingzhan Artificial Intelligence Research Institute, Nanjing, China
Lei Yang
Tsinghua University, Beijing, China
Huaping Liu
Tsinghua University, Beijing, China
Bin Fang

About this paper

Cite this paper

Zuo, G., Tian, Z., Huang, S., Gong, D. (2022). Sample-Efficient Reinforcement Learning Based on Dynamics Models via Meta-policy Optimization. In: Sun, F., Hu, D., Wermter, S., Yang, L., Liu, H., Fang, B. (eds) Cognitive Systems and Information Processing. ICCSIP 2021. Communications in Computer and Information Science, vol 1515. Springer, Singapore. https://doi.org/10.1007/978-981-16-9247-5_28

DOI: https://doi.org/10.1007/978-981-16-9247-5_28
Published: 11 January 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-9246-8
Online ISBN: 978-981-16-9247-5
eBook Packages: Computer Science, Computer Science (R0)
