research-article

Underestimation estimators to Q-learning

Published: 01 August 2022

Abstract

Q-learning (QL) is a popular method for control problems. It approximates the maximum expected action value with the maximum estimated action value and therefore suffers from positive overestimation bias. Various algorithms have been proposed to reduce this overestimation bias, but some of them introduce underestimation bias instead, and it remains poorly understood which kinds of estimators cause underestimation. In this paper, rather than studying one specific method, we focus on underestimation estimators, in particular estimators built from K estimates of the action values. We generalize these estimators into an Underestimation Estimator Set (UES) and prove theoretically that every estimator in this set suffers from underestimation bias. We further study the bias properties of these estimators and conclude that their biases differ from one another and depend on the specific conditions each estimator satisfies; the set therefore provides a range of estimators for QL in different settings. Finally, to illustrate these properties, we evaluate the performance of several estimators in the set. Empirical results show that the Median estimator (Me) underestimates less than double Q-learning (DQL) and does not overestimate as QL does, while the Min estimator (M1E) underestimates more than DQL. Moreover, Me and M1E perform as well as or better than the other estimators on several benchmark environments.
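As a rough illustration of the bias behaviour described above, the following sketch compares three estimators of max_a E[X_a] on a toy bandit problem: the single-estimate maximum used by QL, the double estimator used by DQL, and a min-over-K-estimates variant in the spirit of M1E. The toy setup (one optimal action, Gaussian reward noise) and the exact aggregation rules are assumptions made for illustration only; the paper's UES members, including its Median estimator, are defined more generally than this sketch shows.

```python
# Illustrative Monte Carlo sketch (not the authors' code): compare the bias of
# the max estimator (QL), the double estimator (DQL), and a min-over-K variant
# on a toy bandit whose true maximum action value is 0 by construction.
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 10
TRUE_MEANS = np.full(N_ACTIONS, -0.5)
TRUE_MEANS[0] = 0.0            # max_a E[X_a] = 0 by construction
N_SAMPLES = 10                 # reward samples behind each estimate
K = 4                          # independent estimates per action (illustrative choice)
N_TRIALS = 20_000

def estimates():
    """Return a (K, N_ACTIONS) array of independent sample-mean estimates."""
    rewards = rng.normal(TRUE_MEANS, 1.0, size=(K, N_SAMPLES, N_ACTIONS))
    return rewards.mean(axis=1)

totals = {"max (QL)": 0.0, "double (DQL)": 0.0, f"min of K={K}": 0.0}
for _ in range(N_TRIALS):
    q = estimates()
    # QL-style target: select and evaluate with the same estimate -> overestimates.
    totals["max (QL)"] += q[0].max()
    # DQL-style target: select with one estimate, evaluate with an independent one.
    totals["double (DQL)"] += q[1][q[0].argmax()]
    # Min-over-K target (one plausible underestimation-set member, Maxmin-style):
    # take the per-action minimum over the K estimates, then the greedy value.
    totals[f"min of K={K}"] += q.min(axis=0).max()

for name, total in totals.items():
    print(f"{name:>12}: empirical bias = {total / N_TRIALS:+.3f} (true max is 0)")
```

Under this setup one would expect the printed bias to be positive for the max estimator and negative for the other two, with the min-over-K variant typically below the double estimator, mirroring the ordering the abstract reports for QL, DQL, and M1E.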


Cited By

  • (2023) Off-policy correction algorithm for double Q network based on deep reinforcement learning. IET Cyber-Systems and Robotics 5(4). https://doi.org/10.1049/csy2.12102. Online publication date: 15-Nov-2023.
  • (2023) Traffic signal optimization control method based on adaptive weighted averaged double deep Q network. Applied Intelligence 53(15), 18333–18354. https://doi.org/10.1007/s10489-023-04469-9. Online publication date: 1-Aug-2023.



Published In

Information Sciences: an International Journal, Volume 607, Issue C
Aug 2022, 1637 pages

Publisher

Elsevier Science Inc., United States

Publication History

Published: 01 August 2022

Author Tags

  1. Q-learning
  2. Double Q-learning
  3. Overestimation Bias reduction
  4. Maximum estimator
  5. Cross-validation
  6. Underestimation reduction
