DOI: 10.1145/3292500.3330892

Separated Trust Regions Policy Optimization Method

Published: 25 July 2019

Abstract

In this work, we propose a moderate policy update method for reinforcement learning that encourages the agent to explore more boldly in early episodes while updating the policy more cautiously. Based on the maximum entropy framework, we propose a softer objective with more conservative constraints and build separated trust regions for optimization. To reduce the variance of the expected entropy return, we use the analytically calculated entropy of the Gaussian policy at each state rather than estimating it from the log-probabilities of sampled actions. The new method, which we call separated trust regions for policy mean and variance (STRMV), can be viewed as an extension of proximal policy optimization (PPO), but it is gentler in its policy updates and more active in exploration. We evaluate our approach on a wide variety of continuous control benchmark tasks in the MuJoCo environment. The experiments demonstrate that STRMV outperforms previous state-of-the-art on-policy methods, not only achieving higher rewards but also improving sample efficiency.
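The abstract's two technical ideas can be made concrete. The sketch below (Python with NumPy; not the authors' code, and every function name is invented for illustration) shows, first, the closed-form entropy of a diagonal Gaussian policy versus a Monte Carlo estimate built from sampled log-probabilities, which is the variance-reduction point, and second, the exact decomposition of the Gaussian KL divergence into a mean term and a variance term, one natural way to give the policy mean and variance separate trust regions. The paper's actual objective may well differ.

```python
import numpy as np

def analytic_gaussian_entropy(sigma):
    """Closed-form entropy of a diagonal Gaussian policy N(mu, diag(sigma^2)):
    H = 0.5 * sum_i log(2 * pi * e * sigma_i^2).
    Exact per state, so it adds no sampling noise to the entropy term."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * sigma ** 2), axis=-1)

def sampled_gaussian_entropy(mu, sigma, n_samples=10, seed=0):
    """Monte Carlo entropy estimate E[-log pi(a|s)] from sampled actions.
    Unbiased but noisy -- the variance the abstract says STRMV avoids."""
    rng = np.random.default_rng(seed)
    a = mu + sigma * rng.standard_normal((n_samples,) + np.shape(mu))
    log_prob = -0.5 * np.sum(((a - mu) / sigma) ** 2
                             + np.log(2.0 * np.pi * sigma ** 2), axis=-1)
    return -np.mean(log_prob, axis=0)

def separated_gaussian_kl(mu_old, sigma_old, mu_new, sigma_new):
    """KL(pi_old || pi_new) for diagonal Gaussians, split exactly into a
    mean term and a variance term. Bounding each term by its own radius is
    one natural reading of 'separated trust regions' for mean and variance
    (illustrative only; the paper's construction may differ)."""
    kl_mean = np.sum((mu_new - mu_old) ** 2 / (2.0 * sigma_new ** 2), axis=-1)
    kl_var = np.sum(np.log(sigma_new / sigma_old)
                    + sigma_old ** 2 / (2.0 * sigma_new ** 2) - 0.5, axis=-1)
    return kl_mean, kl_var  # kl_mean + kl_var equals the full KL divergence

if __name__ == "__main__":
    mu, sigma = np.zeros(6), np.full(6, 0.5)
    print("analytic entropy:", analytic_gaussian_entropy(sigma))
    print("sampled entropy :", sampled_gaussian_entropy(mu, sigma))
```

For a 6-dimensional action with sigma = 0.5 in every dimension, the analytic entropy is a fixed number, while the 10-sample estimate fluctuates from seed to seed; summed over the thousands of states in each update batch, that fluctuation is exactly the variance the calculated entropy avoids.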

Supplementary Material

MP4 File (rt1227o.mp4)
Supplemental video




Published In

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2019
3305 pages
ISBN:9781450362016
DOI:10.1145/3292500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. entropy maximization
  2. reinforcement learning
  3. trust region

Qualifiers

  • Research-article

Funding Sources

  • the National Key R&D Program of China
  • National Natural Science Foundation of China

Conference

KDD '19

Acceptance Rates

KDD '19 paper acceptance rate: 110 of 1,200 submissions (9%).
Overall acceptance rate: 1,133 of 8,635 submissions (13%).



Article Metrics

  • Downloads (last 12 months): 28
  • Downloads (last 6 weeks): 1
Reflects downloads up to 17 Dec 2024.

Cited By

View all
  • (2022)Research on Knowledge Graph Completion Model Combining Temporal Convolutional Network and Monte Carlo Tree SearchMathematical Problems in Engineering10.1155/2022/22905402022(1-13)Online publication date: 16-Mar-2022
  • (2020)Soft Policy Optimization using Dual-Track Advantage Estimator2020 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM50108.2020.00126(1064-1069)Online publication date: Nov-2020
