Abstract
Offline reinforcement learning (RL) can learn an effective policy from a fixed batch of previously collected data without further interaction with the environment. However, real-world requirements such as high performance and high sample efficiency pose substantial challenges for current offline RL algorithms. In this paper, we propose a novel offline RL method, Constrained and Conservative Reinforcement Learning with Anderson Acceleration (CCRL-AA), which enables the agent to learn effectively and efficiently from offline demonstration data. In our method, Constrained and Conservative Reinforcement Learning (CCRL) constrains the policy's actions to remain close to the batch of training data and learns a conservative Q-function, so that the agent can learn effectively from the previously collected demonstrations. Anderson acceleration (AA) is integrated to speed up the learning process and improve sample efficiency. Experiments were conducted on robotic simulation tasks, and the results demonstrate that our method learns efficiently from the given demonstrations and outperforms several state-of-the-art methods.
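To make the Anderson acceleration mechanism named in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of AA(m) applied to tabular value iteration on a toy random MDP. The MDP, the window size m, and the regularisation constant are illustrative assumptions; CCRL-AA applies the same mixing of recent fixed-point iterates to its Q-function updates.

```python
# Minimal sketch of Anderson acceleration (AA) for tabular value iteration.
# NOT the authors' code: the random MDP, window size m, and regularisation
# constant are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, m = 20, 4, 0.95, 5

# Random MDP: P[a, s, s'] are transition probabilities, R[s, a] rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.random((n_states, n_actions))

def bellman(V):
    """Optimal Bellman operator T(V): Q[s, a] = R[s, a] + gamma * E[V(s')]."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.max(axis=1)

def anderson_mix(Vs, TVs, reg=1e-8):
    """Find weights alpha (summing to 1) that minimise the combined residual
    ||sum_i alpha_i (T(V_i) - V_i)||, then return sum_i alpha_i T(V_i)."""
    res = np.stack([tv - v for v, tv in zip(Vs, TVs)], axis=1)  # residual matrix
    G = res.T @ res + reg * np.eye(res.shape[1])                # regularised Gram matrix
    w = np.linalg.solve(G, np.ones(res.shape[1]))
    alpha = w / w.sum()                                         # enforce sum(alpha) = 1
    return np.stack(TVs, axis=1) @ alpha

V = np.zeros(n_states)
Vs, TVs = [], []
for k in range(500):
    TV = bellman(V)
    Vs, TVs = (Vs + [V])[-m:], (TVs + [TV])[-m:]  # keep a window of m iterates
    V_next = anderson_mix(Vs, TVs)
    if np.max(np.abs(V_next - V)) < 1e-8:
        break
    V = V_next
print(f"Anderson-accelerated value iteration stopped after {k + 1} sweeps")
```

With a window of m = 1 this reduces to plain value iteration; larger windows reuse past iterates to take a better-informed step, which is the source of the claimed speed-up when the same weighting is applied to successive Q-function targets in the deep RL setting.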
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China (61873008), the Beijing Natural Science Foundation (4192010), and the National Key R&D Plan (2018YFB1307004).
Ethics declarations
Competing Interests
All authors declare that they have no conflict of interest and agree to the submission of this manuscript to Applied Intelligence.
Additional information
Author contributions
All authors contributed to the study conception and design. Material preparation, experimental design and data analysis were performed by Guoyu Zuo and Shuai Huang. Manuscript writing and organization were done by Jiangeng Li. Review and commentary were done by Daoxiong Gong. All authors read and approved this manuscript.
Availability of data and materials
The raw/processed data required to reproduce these findings cannot be shared at this time as the data also forms part of an ongoing study.
Ethical Approval
The authors declare that they have no conflict of interest. This paper has not been previously published; it is published with the permission of the authors' institution, and all authors are responsible for the authenticity of the data in the paper.
Consent to Participate
All authors of this paper have been informed of the revision and publication of the paper, have checked all data, figures and tables in the manuscript, and are responsible for their truthfulness and accuracy. Names of all contributing authors: Guoyu Zuo; Shuai Huang; Jiangeng Li; Daoxiong Gong.
Consent to Publish
The publication has been approved by all co-authors.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Zuo, G., Huang, S., Li, J. et al. Offline reinforcement learning with Anderson acceleration for robotic tasks. Appl Intell 52, 9885–9898 (2022). https://doi.org/10.1007/s10489-021-02953-8