Abstract
Model-based reinforcement learning (RL) is generally believed to be more sample efficient than model-free RL. However, model-based RL methods typically suffer from model bias, which severely limits their asymptotic performance. Although previous model-based RL approaches use ensemble models to reduce model error, we find that vanilla ensemble learning ignores the discrepancy between models: individual models can disagree substantially, which hinders policy optimization. To alleviate this problem, this paper proposes an Ensemble Model Consistency Actor-Critic (EMC-AC) method that decreases the discrepancy between models while maintaining model diversity. We further design ablation experiments to analyze how the trade-off between diversity and consistency affects the performance of EMC-AC. Finally, extensive experiments on continuous control benchmarks demonstrate that our approach exceeds the sample efficiency of prior model-based RL methods while matching the asymptotic performance of a state-of-the-art model-free RL algorithm.
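To make the core idea concrete, the sketch below shows one plausible way to combine per-member prediction losses with a consistency penalty on inter-member disagreement in an ensemble of learned dynamics models, implemented in PyTorch. This is an illustrative assumption, not the authors' implementation: the class names (EnsembleDynamics, ensemble_loss), the network sizes, and the consistency_coef weighting are all hypothetical, and only reflect the high-level trade-off between diversity and consistency described in the abstract.

# Illustrative sketch (assumed PyTorch implementation; names and loss form are hypothetical).
import torch
import torch.nn as nn


class EnsembleDynamics(nn.Module):
    """K independently initialized MLPs, each predicting the next state."""

    def __init__(self, state_dim, action_dim, num_models=5, hidden=200):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, state_dim),
            )
            for _ in range(num_models)
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        # Returns predictions of shape (num_models, batch, state_dim).
        return torch.stack([m(x) for m in self.members], dim=0)


def ensemble_loss(model, state, action, next_state, consistency_coef=0.1):
    """Per-member prediction error plus a penalty on inter-member disagreement.

    consistency_coef trades off diversity against consistency: 0 recovers a
    vanilla ensemble, while large values push members toward the ensemble mean.
    """
    preds = model(state, action)                           # (K, B, S)
    pred_loss = ((preds - next_state.unsqueeze(0)) ** 2).mean()
    discrepancy = preds.var(dim=0, unbiased=False).mean()  # disagreement across members
    return pred_loss + consistency_coef * discrepancy


if __name__ == "__main__":
    model = EnsembleDynamics(state_dim=11, action_dim=3)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    s, a, s_next = torch.randn(256, 11), torch.randn(256, 3), torch.randn(256, 11)
    loss = ensemble_loss(model, s, a, s_next)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"loss = {loss.item():.4f}")

In such a scheme, diversity comes from the independently initialized members and the individual prediction losses, while the variance term regularizes their disagreement; the weighting coefficient plays the role of the diversity-consistency trade-off studied in the ablation experiments.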
Acknowledgement
This work is funded by the National Natural Science Foundation of China (Grant No. 61876181), the Beijing Nova Program of Science and Technology under Grant No. Z191100001119043, and in part by the Youth Innovation Promotion Association, CAS.
Cite this paper
Jia, R., Li, Q., Huang, W., Zhang, J., Li, X. (2021). Consistency Regularization for Ensemble Model Based Reinforcement Learning. In: Pham, D.N., Theeramunkong, T., Governatori, G., Liu, F. (eds) PRICAI 2021: Trends in Artificial Intelligence. PRICAI 2021. Lecture Notes in Computer Science, vol 13033. Springer, Cham. https://doi.org/10.1007/978-3-030-89370-5_1