DOI: 10.5555/3327546.3327661

Non-delusional Q-learning and value iteration

Published: 03 December 2018

Abstract

We identify a fundamental source of error in Q-learning and other forms of dynamic programming with function approximation. Delusional bias arises when the approximation architecture limits the class of expressible greedy policies. Since standard Q-updates make globally uncoordinated action choices with respect to the expressible policy class, inconsistent or even conflicting Q-value estimates can result, leading to pathological behaviour such as over/under-estimation, instability and even divergence. To solve this problem, we introduce a new notion of policy consistency and define a local backup process that ensures global consistency through the use of information sets—sets that record constraints on policies consistent with backed-up Q-values. We prove that both the model-based and model-free algorithms using this backup remove delusional bias, yielding the first known algorithms that guarantee optimal results under general conditions. These algorithms furthermore only require polynomially many information sets (from a potentially exponential support). Finally, we suggest other practical heuristics for value-iteration and Q-learning that attempt to reduce delusional bias.
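To make the idea concrete, the toy sketch below (hypothetical Python: the two-state MDP, the state-aliasing constraint, and names such as pc_values and expressible are illustrative and not taken from the paper) tags each backed-up value with the greedy-action commitments it relies on and discards combinations the restricted policy class cannot express, in contrast to a standard max backup that combines them freely.

GAMMA = 0.9

# Toy deterministic MDP: state -> action -> (next_state, reward).
MDP = {
    "s0": {"a": ("s1", 0.0), "b": ("end", 0.5)},
    "s1": {"a": ("end", 0.0), "b": ("end", 1.0)},
}

def expressible(commitments):
    # Hypothetical restricted policy class: the approximator aliases s0 and s1,
    # so an expressible greedy policy must pick the same action in both states.
    if "s0" in commitments and "s1" in commitments:
        return commitments["s0"] == commitments["s1"]
    return True

def consistent(x, y):
    # Two information sets are consistent if they agree wherever they overlap.
    return all(y.get(s, a) == a for s, a in x.items())

def pc_values(state, q):
    # Policy-consistent backup: return (info_set, value) pairs, one per
    # expressible set of greedy-action commitments, instead of a single max
    # over successor values.
    entries = []
    for action, (nxt, reward) in MDP[state].items():
        commit = {state: action}
        if nxt == "end":
            entries.append((commit, reward))
            continue
        for info, value in q[nxt]:
            merged = {**info, **commit}
            if consistent(info, commit) and expressible(merged):
                entries.append((merged, reward + GAMMA * value))
    return entries

q = {}
q["s1"] = pc_values("s1", q)  # [({'s1': 'a'}, 0.0), ({'s1': 'b'}, 1.0)]
q["s0"] = pc_values("s0", q)  # {'s0': 'a', 's1': 'b'} is rejected as inexpressible

# Standard (delusional) backup: Q(s0, a) = 0 + GAMMA * max_a' Q(s1, a') = 0.9,
# yet taking action a at s0 forces action a at s1 under this policy class, earning 0.
print(max(q["s0"], key=lambda e: e[1]))  # ({'s0': 'b'}, 0.5)

In this toy example a standard backup assigns Q(s0, a) = 0 + 0.9 * max_a' Q(s1, a') = 0.9 by assuming greedy action b at s1, even though any expressible policy that takes a at s0 must also take a at s1 and earns 0; the policy-consistent entries keep these incompatible assumptions separate and recover the achievable optimum of 0.5 via action b at s0.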

Published In

NIPS'18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, December 2018, 11021 pages

Publisher: Curran Associates Inc., Red Hook, NY, United States
