Abstract
We consider the active learning problem of inferring the transition model of a Markov Decision Process by acting and observing transitions. This is particularly useful when no reward function is defined a priori. Our proposal is to cast the active learning task as a utility maximization problem, using Bayesian reinforcement learning with belief-dependent rewards. After presenting three possible performance criteria, we derive from them the belief-dependent rewards to be used in the decision-making process. Since computing the optimal Bayesian value function is intractable for large horizons, we use a simple algorithm to approximately solve this optimization problem. Despite the sub-optimality of this technique, we show experimentally that our proposal is efficient in a number of domains.
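Since the abstract compresses the approach heavily, the following Python sketch illustrates the general idea under assumptions of our own: a Dirichlet belief over the next-state distribution of each (state, action) pair, a belief-dependent reward taken to be the expected one-step information gain (KL divergence from prior to posterior belief), and a greedy one-step policy as a crude stand-in for the paper's approximate multi-step Bayesian planner. The class name ActiveMDPLearner and all parameters are hypothetical; the paper's three performance criteria and its actual algorithm are defined in the full text.

import numpy as np
from scipy.special import digamma

class ActiveMDPLearner:
    """Greedy active learning of a discrete MDP's transition model.

    Belief: one Dirichlet distribution per (state, action) pair over
    the next-state distribution, stored as pseudo-counts alpha[s, a, s'].
    """

    def __init__(self, n_states, n_actions, prior=1.0):
        self.n_actions = n_actions
        self.alpha = np.full((n_states, n_actions, n_states), prior)

    def expected_info_gain(self, s, a):
        """Expected KL(posterior || prior) of one transition drawn at (s, a).

        For a Dirichlet with counts alpha and total a0, observing next
        state k gives KL = log(a0) - log(alpha_k)
                          + digamma(alpha_k + 1) - digamma(a0 + 1);
        this is averaged under the predictive distribution alpha / a0.
        """
        alpha = self.alpha[s, a]
        a0 = alpha.sum()
        kl = (np.log(a0) - np.log(alpha)
              + digamma(alpha + 1.0) - digamma(a0 + 1.0))
        return float((alpha / a0) @ kl)

    def act(self, s):
        """One-step greedy action: maximize the belief-dependent reward."""
        gains = [self.expected_info_gain(s, a) for a in range(self.n_actions)]
        return int(np.argmax(gains))

    def update(self, s, a, s_next):
        """Bayesian update: add the observed transition to the counts."""
        self.alpha[s, a, s_next] += 1.0

A hypothetical exploration loop on a random ground-truth model might look like:

rng = np.random.default_rng(0)
true_T = rng.dirichlet(np.ones(5), size=(5, 2))   # 5 states, 2 actions
learner = ActiveMDPLearner(n_states=5, n_actions=2)
s = 0
for _ in range(1000):
    a = learner.act(s)
    s_next = int(rng.choice(5, p=true_T[s, a]))
    learner.update(s, a, s_next)
    s = s_next
# learner.alpha, normalized over its last axis, now estimates true_T.

Expected information gain is only one instance of a belief-dependent reward, and the one-step greedy policy is the simplest possible approximation; a deeper lookahead over belief states would bring this sketch closer to the Bayesian planning the paper approximates.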
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Araya-López, M., Buffet, O., Thomas, V., Charpillet, F. (2012). Active Learning of MDP Models. In: Sanner, S., Hutter, M. (eds.) Recent Advances in Reinforcement Learning. EWRL 2011. Lecture Notes in Computer Science, vol. 7188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29946-9_8
DOI: https://doi.org/10.1007/978-3-642-29946-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29945-2
Online ISBN: 978-3-642-29946-9