DOI: 10.5555/2968618.2968820
Article

A convergent form of approximate policy iteration

Published: 01 January 2002

Abstract

We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a "policy improvement operator" to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
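
For illustration only (a sketch of the idea, not the authors' algorithm or experimental setup), the following Python fragment shows what approximate policy iteration of the kind described above can look like: Sarsa-style policy evaluation with a linear state-action value approximator, alternated with an ε-soft policy improvement operator whose sensitivity to the action values is controlled by a temperature parameter; here the operator is taken to be a softmax mixed with the uniform distribution. The environment interface (reset/step), the feature map phi, and all hyperparameter values are assumptions made for the example.

    import numpy as np

    def epsilon_soft_softmax(q, temperature=1.0, epsilon=0.1):
        # Illustrative policy improvement operator: maps action values q
        # (shape [num_actions]) to an epsilon-soft distribution. The temperature
        # bounds how sharply the policy reacts to changes in q, which is what
        # governs the operator's Lipschitz constant.
        z = (q - q.max()) / temperature           # subtract max for numerical stability
        p = np.exp(z)
        p /= p.sum()
        n = len(q)
        return (1.0 - epsilon) * p + epsilon / n  # every action keeps probability >= epsilon / n

    def approximate_policy_iteration(env, phi, feat_dim, num_actions,
                                     iterations=50, episodes_per_eval=200,
                                     alpha=0.05, gamma=0.99,
                                     temperature=1.0, epsilon=0.1):
        # Assumed interfaces: env.reset() -> state, env.step(a) -> (state, reward, done);
        # phi(s, a) is an assumed feature map returning a vector of length feat_dim.
        w = np.zeros(feat_dim)        # weights of the learned linear Q estimate
        policy_w = w.copy()           # weights defining the policy currently being evaluated

        def q_values(weights, s):
            return np.array([weights @ phi(s, a) for a in range(num_actions)])

        def act(s):
            probs = epsilon_soft_softmax(q_values(policy_w, s), temperature, epsilon)
            return np.random.choice(num_actions, p=probs)

        for _ in range(iterations):
            # Policy evaluation: Sarsa updates of the linear weights while following
            # the fixed epsilon-soft policy induced by policy_w.
            for _ in range(episodes_per_eval):
                s = env.reset()
                a = act(s)
                done = False
                while not done:
                    s_next, r, done = env.step(a)
                    a_next = act(s_next)
                    target = r if done else r + gamma * (w @ phi(s_next, a_next))
                    w += alpha * (target - w @ phi(s, a)) * phi(s, a)
                    s, a = s_next, a_next
            # Policy improvement: the operator is applied to the freshly learned
            # action values by making them the basis of the next policy.
            policy_w = w.copy()

        return w, policy_w

The convergence guarantee stated in the abstract concerns improvement operators of this general kind only when their sensitivity to the action values (here, set by the temperature) is small enough; the sketch makes no attempt to verify that condition.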

References

[1]
L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30-37. Morgan Kaufmann, 1995.
[2]
A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81-138, 1995.
[3]
D. P. Bertsekas. Dynamic Programming and Optimal Control, Volumes 1 and 2. Athena Scientific, 2001.
[4]
D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[5]
D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Opt. Theory and Applications, 105(3), 2000.
[6]
G. Gordon. Chattering in Sarsa(λ). CMU Learning Lab Internal Report. Available at www.cs.cmu.edu/~ggordon, 1996.
[7]
G. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, 1999.
[8]
G. J. Gordon. Reinforcement learning with function approximation converges to a region. Advances in Neural Information Processing Systems 13, pages 1040-1046. MIT Press, 2001.
[9]
C. D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2000.
[10]
T. J. Perkins and M. D. Pendrith. On the existence of fixed points for Q-learning and Sarsa in partially observable domains. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002.
[11]
M. L. Puterman. Markov Decision Processes: Disrete Stochastic Dynamic Programming. John Wiley & Sons, Inc, New York, 1994.
[12]
E. Seneta. Sensitivity analysis, ergodicity coefficients, and rank-one updates for finite markov chains. In W. J. Stewart, editor, Numerical Solutions of Markov Chains. Dekker, NY, 1991.
[13]
S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvari. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287-308, 2000.
[14]
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press/Bradford Books, Cambridge, Massachusetts, 1998.
[15]
G.J. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215-219, 1994.
[16]
J. N. Tsitsiklis and B. Van Roy. Optimal stopping of markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44(10):1840-1851, 1999.
[17]
J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

Cited By

  • (2017) An alternative softmax operator for reinforcement learning. Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 243-252. 10.5555/3305381.3305407. Online publication date: 6-Aug-2017.
  • (2016) PAC reinforcement learning with rich observations. Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1848-1856. 10.5555/3157096.3157303. Online publication date: 5-Dec-2016.
  • (2013) Optimistic policy iteration and natural actor-critic. Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, pages 1592-1600. 10.5555/2999611.2999789. Online publication date: 5-Dec-2013.

Published In

NIPS'02: Proceedings of the 16th International Conference on Neural Information Processing Systems
January 2002
1674 pages

Publisher

MIT Press

Cambridge, MA, United States

Publication History

Published: 01 January 2002

Qualifiers

  • Article
