Abstract
While trading off exploration and exploitation in reinforcement learning is hard in general, relatively simple solutions exist under some formulations. Optimal decision thresholds are derived for the multi-armed bandit problem, one for the infinite-horizon discounted-reward case and one for the finite-horizon undiscounted-reward case, which make explicit the link between the reward horizon, uncertainty, and the need for exploration. From this result follow two practical approximate algorithms, which are illustrated experimentally.
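The paper's exact thresholds are derived in the body of the text; what follows is only a minimal sketch of the general idea for a two-armed Bernoulli bandit with a finite undiscounted horizon, in which the agent keeps exploring an uncertain arm while its value gap to the empirical best arm could plausibly be overturned within the remaining horizon. The constant c and the square-root uncertainty term are illustrative assumptions, not the thresholds from the paper.

import numpy as np

rng = np.random.default_rng(0)

def threshold_policy(means, counts, horizon, c=1.0):
    # Exploit the empirical best arm unless the gap to the runner-up is
    # smaller than an uncertainty term that grows with the horizon and
    # shrinks as the runner-up accumulates samples; then explore instead.
    # The form of the uncertainty term is an illustrative choice, not the
    # paper's derived threshold.
    best = int(np.argmax(means))
    other = 1 - best  # two-armed case for simplicity
    gap = means[best] - means[other]
    uncertainty = c * np.sqrt(np.log(max(horizon, 2)) / max(counts[other], 1))
    return other if gap < uncertainty else best

# Two-armed Bernoulli bandit with a finite undiscounted horizon.
true_p = np.array([0.4, 0.6])
counts = np.ones(2)           # one pseudo-observation per arm
successes = np.full(2, 0.5)
T = 1000
total = 0.0
for t in range(T):
    means = successes / counts
    a = threshold_policy(means, counts, horizon=T - t)
    r = rng.binomial(1, true_p[a])
    successes[a] += r
    counts[a] += 1
    total += r
print(f"average reward: {total / T:.3f}")

Under such a rule, exploration naturally ceases as the remaining horizon shrinks or the estimated gap widens, which is the qualitative connection between reward horizon, uncertainty, and exploration that the abstract describes.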
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dimitrakakis, C. (2006). Nearly Optimal Exploration-Exploitation Decision Thresholds. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds) Artificial Neural Networks – ICANN 2006. ICANN 2006. Lecture Notes in Computer Science, vol 4131. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11840817_88
DOI: https://doi.org/10.1007/11840817_88
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-38625-4
Online ISBN: 978-3-540-38627-8