DOI: 10.5555/3327546.3327661

Non-delusional Q-learning and value iteration

Published: 03 December 2018

Abstract

We identify a fundamental source of error in Q-learning and other forms of dynamic programming with function approximation. Delusional bias arises when the approximation architecture limits the class of expressible greedy policies. Since standard Q-updates make globally uncoordinated action choices with respect to the expressible policy class, inconsistent or even conflicting Q-value estimates can result, leading to pathological behaviour such as over/under-estimation, instability and even divergence. To solve this problem, we introduce a new notion of policy consistency and define a local backup process that ensures global consistency through the use of information sets—sets that record constraints on policies consistent with backed-up Q-values. We prove that both the model-based and model-free algorithms using this backup remove delusional bias, yielding the first known algorithms that guarantee optimal results under general conditions. These algorithms furthermore only require polynomially many information sets (from a potentially exponential support). Finally, we suggest other practical heuristics for value-iteration and Q-learning that attempt to reduce delusional bias.
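To make the idea concrete, the toy sketch below (hypothetical Python: the two-state MDP, the state-aliasing constraint, and names such as pc_values and expressible are illustrative and not taken from the paper) tags each backed-up value with the greedy-action commitments it relies on and discards combinations the restricted policy class cannot express, in contrast to a standard max backup that combines them freely.

GAMMA = 0.9

# Toy deterministic MDP: state -> action -> (next_state, reward).
MDP = {
    "s0": {"a": ("s1", 0.0), "b": ("end", 0.5)},
    "s1": {"a": ("end", 0.0), "b": ("end", 1.0)},
}

def expressible(commitments):
    # Hypothetical restricted policy class: the approximator aliases s0 and s1,
    # so an expressible greedy policy must pick the same action in both states.
    if "s0" in commitments and "s1" in commitments:
        return commitments["s0"] == commitments["s1"]
    return True

def consistent(x, y):
    # Two information sets are consistent if they agree wherever they overlap.
    return all(y.get(s, a) == a for s, a in x.items())

def pc_values(state, q):
    # Policy-consistent backup: return (info_set, value) pairs, one per
    # expressible set of greedy-action commitments, instead of a single max
    # over successor values.
    entries = []
    for action, (nxt, reward) in MDP[state].items():
        commit = {state: action}
        if nxt == "end":
            entries.append((commit, reward))
            continue
        for info, value in q[nxt]:
            merged = {**info, **commit}
            if consistent(info, commit) and expressible(merged):
                entries.append((merged, reward + GAMMA * value))
    return entries

q = {}
q["s1"] = pc_values("s1", q)  # [({'s1': 'a'}, 0.0), ({'s1': 'b'}, 1.0)]
q["s0"] = pc_values("s0", q)  # {'s0': 'a', 's1': 'b'} is rejected as inexpressible

# Standard (delusional) backup: Q(s0, a) = 0 + GAMMA * max_a' Q(s1, a') = 0.9,
# yet taking action a at s0 forces action a at s1 under this policy class, earning 0.
print(max(q["s0"], key=lambda e: e[1]))  # ({'s0': 'b'}, 0.5)

In this toy example a standard backup assigns Q(s0, a) = 0 + 0.9 * max_a' Q(s1, a') = 0.9 by assuming greedy action b at s1, even though any expressible policy that takes a at s0 must also take a at s1 and earns 0; the policy-consistent entries keep these incompatible assumptions separate and recover the achievable optimum of 0.5 via action b at s0.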

Published In

NIPS'18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, December 2018, 11021 pages

Publisher: Curran Associates Inc., Red Hook, NY, United States
