research-article

Underestimation estimators to Q-learning

Published: 01 August 2022

Abstract

Q-learning (QL) is a popular method for control problems. It approximates the maximum expected action value with the maximum estimated action value and therefore suffers from positive overestimation bias. Various algorithms have been proposed to reduce this overestimation bias, but some of them introduce underestimation bias instead, and it remains poorly understood which kinds of estimators cause underestimation. In this paper, rather than studying one specific method, we focus on underestimation estimators, in particular estimators built from K estimates of the action values. We generalize these estimators into an Underestimation Estimator Set (UES) and prove theoretically that every estimator in this set suffers from underestimation bias. We further study the bias properties of these estimators and conclude that their biases differ from one another and depend on the specific conditions each estimator satisfies; the set therefore provides a range of estimators for QL in different settings. Finally, to illustrate these properties, we evaluate the performance of several estimators in the set. Empirical results show that the Median estimator (Me) underestimates less than double Q-learning (DQL) and does not overestimate as QL does, while the Min estimator (M1E) underestimates more than DQL. Moreover, Me and M1E perform as well as or better than the other estimators on several benchmark environments.
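As a rough illustration of the bias behaviour described above, the following sketch compares three estimators of max_a E[X_a] on a toy bandit problem: the single-estimate maximum used by QL, the double estimator used by DQL, and a min-over-K-estimates variant in the spirit of M1E. The toy setup (one optimal action, Gaussian reward noise) and the exact aggregation rules are assumptions made for illustration only; the paper's UES members, including its Median estimator, are defined more generally than this sketch shows.

```python
# Illustrative Monte Carlo sketch (not the authors' code): compare the bias of
# the max estimator (QL), the double estimator (DQL), and a min-over-K variant
# on a toy bandit whose true maximum action value is 0 by construction.
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 10
TRUE_MEANS = np.full(N_ACTIONS, -0.5)
TRUE_MEANS[0] = 0.0            # max_a E[X_a] = 0 by construction
N_SAMPLES = 10                 # reward samples behind each estimate
K = 4                          # independent estimates per action (illustrative choice)
N_TRIALS = 20_000

def estimates():
    """Return a (K, N_ACTIONS) array of independent sample-mean estimates."""
    rewards = rng.normal(TRUE_MEANS, 1.0, size=(K, N_SAMPLES, N_ACTIONS))
    return rewards.mean(axis=1)

totals = {"max (QL)": 0.0, "double (DQL)": 0.0, f"min of K={K}": 0.0}
for _ in range(N_TRIALS):
    q = estimates()
    # QL-style target: select and evaluate with the same estimate -> overestimates.
    totals["max (QL)"] += q[0].max()
    # DQL-style target: select with one estimate, evaluate with an independent one.
    totals["double (DQL)"] += q[1][q[0].argmax()]
    # Min-over-K target (one plausible underestimation-set member, Maxmin-style):
    # take the per-action minimum over the K estimates, then the greedy value.
    totals[f"min of K={K}"] += q.min(axis=0).max()

for name, total in totals.items():
    print(f"{name:>12}: empirical bias = {total / N_TRIALS:+.3f} (true max is 0)")
```

Under this setup one would expect the printed bias to be positive for the max estimator and negative for the other two, with the min-over-K variant typically below the double estimator, mirroring the ordering the abstract reports for QL, DQL, and M1E.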


Cited By

  • (2023) Off-policy correction algorithm for double Q network based on deep reinforcement learning. IET Cyber-Systems and Robotics 5(4). https://doi.org/10.1049/csy2.12102. Online publication date: 15-Nov-2023.
  • (2023) Traffic signal optimization control method based on adaptive weighted averaged double deep Q network. Applied Intelligence 53(15), 18333–18354. https://doi.org/10.1007/s10489-023-04469-9. Online publication date: 1-Aug-2023.



Published In

Information Sciences: an International Journal, Volume 607, Issue C
Aug 2022, 1637 pages

Publisher

Elsevier Science Inc., United States

Publication History

Published: 01 August 2022

Author Tags

  1. Q-learning
  2. Double Q-learning
  3. Overestimation Bias reduction
  4. Maximum estimator
  5. Cross-validation
  6. Underestimation reduction
