Abstract
Consider an agent interacting with an environment in cycles. In every interaction cycle the agent is rewarded for its performance. We compare the average reward U over cycles 1 to m (average value) with the future discounted reward V from cycle k to ∞ (discounted value). We consider essentially arbitrary (non-geometric) discount sequences and arbitrary reward sequences (non-MDP environments). We show that, asymptotically, U for m→∞ and V for k→∞ are equal, provided both limits exist. Further, if the effective horizon grows linearly with k or faster, then the existence of the limit of U implies that the limit of V exists. Conversely, if the effective horizon grows linearly with k or slower, then the existence of the limit of V implies that the limit of U exists.
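For readers who want the quantities in the abstract pinned down, here is a minimal sketch of the usual definitions in this setting; the symbols U_{1m}, V_{kγ}, Γ_k, h_k and the exact normalizations below are assumptions reconstructed from the abstract, not quoted from the paper. For bounded rewards r_i and a summable discount sequence γ_i > 0,

  U_{1m} := \frac{1}{m} \sum_{i=1}^{m} r_i , \qquad
  V_{k\gamma} := \frac{1}{\Gamma_k} \sum_{i=k}^{\infty} \gamma_i r_i , \qquad
  \Gamma_k := \sum_{i=k}^{\infty} \gamma_i .

One common formalization of the effective horizon at cycle k is h_k := \min\{ h \ge 0 : \Gamma_{k+h} \le \tfrac{1}{2}\Gamma_k \}, the point by which half the remaining discount mass is used up. Under this reading, "grows linearly with k or faster" means h_k \ge c\,k for some c > 0 and all large k, and "or slower" means h_k \le c\,k.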
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Hutter, M. (2006). General Discounting Versus Average Reward. In: Balcázar, J.L., Long, P.M., Stephan, F. (eds) Algorithmic Learning Theory. ALT 2006. Lecture Notes in Computer Science, vol. 4264. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11894841_21
DOI: https://doi.org/10.1007/11894841_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46649-9
Online ISBN: 978-3-540-46650-5