DOI: 10.5555/2968618.2968820
Article

A convergent form of approximate policy iteration

Published: 01 January 2002

Abstract

We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a "policy improvement operator" to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
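
For illustration only (a sketch of the idea, not the authors' algorithm or experimental setup), the following Python fragment shows what approximate policy iteration of the kind described above can look like: Sarsa-style policy evaluation with a linear state-action value approximator, alternated with an ε-soft policy improvement operator whose sensitivity to the action values is controlled by a temperature parameter; here the operator is taken to be a softmax mixed with the uniform distribution. The environment interface (reset/step), the feature map phi, and all hyperparameter values are assumptions made for the example.

    import numpy as np

    def epsilon_soft_softmax(q, temperature=1.0, epsilon=0.1):
        # Illustrative policy improvement operator: maps action values q
        # (shape [num_actions]) to an epsilon-soft distribution. The temperature
        # bounds how sharply the policy reacts to changes in q, which is what
        # governs the operator's Lipschitz constant.
        z = (q - q.max()) / temperature           # subtract max for numerical stability
        p = np.exp(z)
        p /= p.sum()
        n = len(q)
        return (1.0 - epsilon) * p + epsilon / n  # every action keeps probability >= epsilon / n

    def approximate_policy_iteration(env, phi, feat_dim, num_actions,
                                     iterations=50, episodes_per_eval=200,
                                     alpha=0.05, gamma=0.99,
                                     temperature=1.0, epsilon=0.1):
        # Assumed interfaces: env.reset() -> state, env.step(a) -> (state, reward, done);
        # phi(s, a) is an assumed feature map returning a vector of length feat_dim.
        w = np.zeros(feat_dim)        # weights of the learned linear Q estimate
        policy_w = w.copy()           # weights defining the policy currently being evaluated

        def q_values(weights, s):
            return np.array([weights @ phi(s, a) for a in range(num_actions)])

        def act(s):
            probs = epsilon_soft_softmax(q_values(policy_w, s), temperature, epsilon)
            return np.random.choice(num_actions, p=probs)

        for _ in range(iterations):
            # Policy evaluation: Sarsa updates of the linear weights while following
            # the fixed epsilon-soft policy induced by policy_w.
            for _ in range(episodes_per_eval):
                s = env.reset()
                a = act(s)
                done = False
                while not done:
                    s_next, r, done = env.step(a)
                    a_next = act(s_next)
                    target = r if done else r + gamma * (w @ phi(s_next, a_next))
                    w += alpha * (target - w @ phi(s, a)) * phi(s, a)
                    s, a = s_next, a_next
            # Policy improvement: the operator is applied to the freshly learned
            # action values by making them the basis of the next policy.
            policy_w = w.copy()

        return w, policy_w

The convergence guarantee stated in the abstract concerns improvement operators of this general kind only when their sensitivity to the action values (here, set by the temperature) is small enough; the sketch makes no attempt to verify that condition.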

References

[1]
L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30-37. Morgan Kaufmann, 1995.
[2]
A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81-138, 1995.
[3]
D. P. Bertsekas. Dynamic Programming and Optimal Control, Volumes 1 and 2. Athena Scientific, 2001.
[4]
D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[5]
D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Opt. Theory and Applications, 105(3), 2000.
[6]
G. Gordon. Chattering in Sarsa(λ). CMU Learning Lab Internal Report. Available at www.cs.cmu.edu/~ggordon, 1996.
[7]
G. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, 1999.
[8]
G. J. Gordon. Reinforcement learning with function approximation converges to a region. Advances in Neural Information Processing Systems 13, pages 1040-1046. MIT Press, 2001.
[9]
C. D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2000.
[10]
T. J. Perkins and M. D. Pendrith. On the existence of fixed points for Q-learning and Sarsa in partially observable domains. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002.
[11]
M. L. Puterman. Markov Decision Processes: Disrete Stochastic Dynamic Programming. John Wiley & Sons, Inc, New York, 1994.
[12]
E. Seneta. Sensitivity analysis, ergodicity coefficients, and rank-one updates for finite markov chains. In W. J. Stewart, editor, Numerical Solutions of Markov Chains. Dekker, NY, 1991.
[13]
S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvari. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287-308, 2000.
[14]
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press/Bradford Books, Cambridge, Massachusetts, 1998.
[15]
G.J. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215-219, 1994.
[16]
J. N. Tsitsiklis and B. Van Roy. Optimal stopping of markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44(10):1840-1851, 1999.
[17]
J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

Cited By

  • (2017) An alternative softmax operator for reinforcement learning. Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 243-252. 10.5555/3305381.3305407. Online publication date: 6-Aug-2017.
  • (2016) PAC reinforcement learning with rich observations. Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1848-1856. 10.5555/3157096.3157303. Online publication date: 5-Dec-2016.
  • (2013) Optimistic policy iteration and natural actor-critic. Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, pages 1592-1600. 10.5555/2999611.2999789. Online publication date: 5-Dec-2013.

Published In

NIPS'02: Proceedings of the 16th International Conference on Neural Information Processing Systems
January 2002
1674 pages

Publisher

MIT Press

Cambridge, MA, United States

Publication History

Published: 01 January 2002

Qualifiers

  • Article
