
Nearly Optimal Exploration-Exploitation Decision Thresholds

  • Conference paper
Artificial Neural Networks – ICANN 2006 (ICANN 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4131)


Abstract

While trading off exploration and exploitation in reinforcement learning is hard in general, relatively simple solutions exist under some formulations. Optimal decision thresholds are derived for the multi-armed bandit problem, one for the infinite-horizon discounted-reward case and one for the finite-horizon undiscounted-reward case, which make the link between the reward horizon, uncertainty and the need for exploration explicit. Two practical approximate algorithms follow from this result and are illustrated experimentally.
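The abstract links the decision to explore to the remaining reward horizon and the uncertainty of the reward estimates. The paper's closed-form thresholds are not reproduced on this page; as a rough illustrative sketch of the general idea only (not the paper's rule — the function, its exploration constant `c`, and the horizon gate below are assumptions), a horizon- and uncertainty-dependent decision might look like:

```python
import math

def choose_arm(counts, means, remaining_horizon, c=0.5):
    """Illustrative threshold rule for a multi-armed bandit.

    counts[i] is the number of pulls of arm i, means[i] its empirical
    mean reward. Exploit the empirically best arm unless the runner-up's
    estimation uncertainty exceeds the empirical gap AND enough horizon
    remains to profit from exploring. `c` is a hypothetical tuning
    constant, not a quantity from the paper.
    """
    order = sorted(range(len(means)), key=lambda i: means[i], reverse=True)
    best, runner_up = order[0], order[1]
    gap = means[best] - means[runner_up]
    # Standard-error style uncertainty of the runner-up's estimate.
    uncertainty = c / math.sqrt(max(counts[runner_up], 1))
    if uncertainty > gap and remaining_horizon > counts[runner_up]:
        return runner_up  # explore: the runner-up might still be better
    return best           # exploit: gap too large or horizon too short

# Clear gap, well-sampled arms: exploit arm 0.
print(choose_arm([100, 50], [0.6, 0.1], remaining_horizon=500))  # 0
# Small gap, under-sampled runner-up, long horizon: explore arm 1.
print(choose_arm([100, 2], [0.6, 0.55], remaining_horizon=500))  # 1
# Same uncertainty but almost no horizon left: exploit arm 0.
print(choose_arm([100, 2], [0.6, 0.55], remaining_horizon=1))    # 0
```

The point of the sketch is the dependence on the horizon: with many steps remaining, resolving uncertainty about a nearly-as-good arm is worthwhile; with few steps remaining, exploitation dominates.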




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dimitrakakis, C. (2006). Nearly Optimal Exploration-Exploitation Decision Thresholds. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds) Artificial Neural Networks – ICANN 2006. ICANN 2006. Lecture Notes in Computer Science, vol 4131. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11840817_88


  • DOI: https://doi.org/10.1007/11840817_88

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-38625-4

  • Online ISBN: 978-3-540-38627-8

  • eBook Packages: Computer Science, Computer Science (R0)
