
Adaptive pessimism via target Q-value for offline reinforcement learning

Published: 07 January 2025

Abstract

Offline reinforcement learning (RL) methods learn from fixed datasets without further environment interaction and therefore suffer from errors caused by out-of-distribution (OOD) actions. Although effective methods have been proposed that conservatively estimate the Q-values of OOD actions to mitigate this problem, insufficient or excessive pessimism under a constant constraint often harms policy learning. Moreover, since the data distribution of each task varies with the environment and the behavior policy, it is desirable to learn, for each task, an adaptive weight that balances the conservative Q-value constraint against the standard RL objective. To this end, we point out that a quantile of the Q-value is an effective reference point in the Q-value distribution of the fixed dataset. Based on this observation, we design the Adaptive Pessimism via Target Q-value (APTQ) algorithm, which balances the pessimism constraint and the RL objective so that the expected Q-value stably converges to a target Q-value taken from a reasonable quantile of the dataset's Q-value distribution. Experiments show that our method remarkably improves the performance of the state-of-the-art method CQL, by 6.20% on D4RL-v0 and 1.89% on D4RL-v2.

Highlights

Dynamically balancing constraints and reinforcement learning objectives.
Enhancing CQL on challenging datasets while maintaining training stability.
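As a concrete illustration of the mechanism summarized above, the following is a minimal sketch, assuming a CQL-style conservatism penalty, of how a pessimism weight might be adapted so that the policy's expected Q-value tracks a target taken from a quantile of the dataset's Q-value distribution. The network, the dual-ascent-style update rule, and all names and hyperparameters (QNet, update_pessimism_weight, target_quantile, alpha_lr) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): adapt the weight on a CQL-style
# conservatism penalty so that the expected Q-value of policy actions converges
# toward a target Q-value taken from a chosen quantile of the dataset's Q-values.
import torch
import torch.nn as nn


class QNet(nn.Module):
    """Minimal state-action value network."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def update_pessimism_weight(q_net, log_alpha, obs, dataset_actions, policy_actions,
                            target_quantile=0.7, alpha_lr=3e-4):
    """Adapt the conservatism weight alpha = exp(log_alpha).

    The target Q-value is the `target_quantile` quantile of Q over dataset
    (state, action) pairs; alpha grows when the policy's expected Q-value
    overshoots this target (more pessimism) and shrinks otherwise.
    """
    with torch.no_grad():
        q_data = q_net(obs, dataset_actions)                 # Q on in-dataset actions
        target_q = torch.quantile(q_data, target_quantile)   # target from the data distribution
        q_pi = q_net(obs, policy_actions).mean()             # expected Q under the learned policy
        log_alpha += alpha_lr * (q_pi - target_q)            # dual-ascent-style weight update
    alpha = log_alpha.exp().item()
    # alpha would then weight the conservatism penalty in the critic loss, e.g.
    #   critic_loss = bellman_error + alpha * (q_policy_batch - q_data).mean()
    return alpha, target_q.item()


if __name__ == "__main__":
    obs_dim, act_dim, batch = 11, 3, 256
    q_net = QNet(obs_dim, act_dim)
    log_alpha = torch.zeros(())                              # alpha starts at 1.0
    obs = torch.randn(batch, obs_dim)
    a_data = torch.randn(batch, act_dim)                     # actions from the offline dataset
    a_pi = torch.randn(batch, act_dim)                       # actions sampled from the policy
    alpha, target_q = update_pessimism_weight(q_net, log_alpha, obs, a_data, a_pi)
    print(f"alpha={alpha:.4f}, target_q={target_q:.4f}")
```

The point mirrored here is that the penalty weight is not fixed: it rises while the policy's expected Q-value overshoots the quantile-derived target and falls once it drops below it, which is the feedback loop between pessimism constraint and RL objective that the abstract describes.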

References

[1]
Agarwal R., Schuurmans D., Norouzi M., An optimistic perspective on offline reinforcement learning, in: International conference on machine learning, PMLR, 2020, pp. 104–114.
[2]
An G., Moon S., Kim J., Song H.O., Uncertainty-based offline reinforcement learning with diversified q-ensemble, in: Advances in neural information processing systems, 2021.
[3]
Aravkin A.Y., Kambadur A., Lozano A.C., Luss R., Sparse quantile Huber regression for efficient and robust estimation, 2014, arXiv preprint arXiv:1402.4624.
[4]
Baker B., Kanitscheider I., Markov T.M., Wu Y., Powell G., McGrew B., et al., Emergent tool use from multi-agent autocurricula, in: International conference on learning representations, 2020.
[5]
Chen X., Zhou Z., Wang Z., Wang C., Wu Y., Ross K.W., BAIL: best-action imitation learning for batch deep reinforcement learning, in: Advances in neural information processing systems, 2020.
[6]
DI-engine Contributors, DI-engine: OpenDILab decision intelligence engine, 2021, https://github.com/opendilab/DI-engine.
[7]
Dabney W., Ostrovski G., Silver D., Munos R., Implicit quantile networks for distributional reinforcement learning, in: International conference on machine learning, PMLR, 2018, pp. 1096–1105.
[8]
Dabney W., Rowland M., Bellemare M.G., Munos R., Distributional reinforcement learning with quantile regression, in: Thirty-second AAAI conference on artificial intelligence, 2018.
[9]
Fu J., Kumar A., Nachum O., Tucker G., Levine S., D4rl: Datasets for deep data-driven reinforcement learning, 2020, arXiv preprint arXiv:2004.07219.
[10]
Fujimoto S., Gu S.S., A minimalist approach to offline reinforcement learning, in: Advances in neural information processing systems, 2021.
[11]
Fujimoto S., Meger D., Precup D., Off-policy deep reinforcement learning without exploration, in: International conference on machine learning, PMLR, 2019, pp. 2052–2062.
[12]
Gu S., Holly E., Lillicrap T., Levine S., Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates, in: 2017 IEEE international conference on robotics and automation, IEEE, 2017, pp. 3389–3396.
[13]
Jaques N., Ghandeharioun A., Shen J.H., Ferguson C., Lapedriza A., Jones N., et al., Way off-policy batch deep reinforcement learning of implicit human preferences in dialog, 2019, arXiv preprint arXiv:1907.00456.
[14]
Kendall A., Hawke J., Janz D., Mazur P., Reda D., Allen J.-M., et al., Learning to drive in a day, in: 2019 international conference on robotics and automation, IEEE, 2019, pp. 8248–8254.
[15]
Kingma D.P., Ba J., Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.
[16]
Koenker R., Hallock K.F., Quantile regression, Journal of Economic Perspectives 15 (2001) 143–156.
[17]
Kumar A., Fu J., Soh M., Tucker G., Levine S., Stabilizing off-policy q-learning via bootstrapping error reduction, in: Advances in neural information processing systems, 2019, pp. 11761–11771.
[18]
Kumar A., Zhou A., Tucker G., Levine S., Conservative q-learning for offline reinforcement learning, in: Advances in neural information processing systems, 2020.
[19]
Lange S., Gabel T., Riedmiller M., Batch reinforcement learning, in: Reinforcement learning, Springer, 2012, pp. 45–73.
[20]
Leibo J.Z., Zambaldi V.F., Lanctot M., Marecki J., Graepel T., Multi-agent reinforcement learning in sequential social dilemmas, in: Proceedings of the 16th conference on autonomous agents and multiAgent systems, ACM, 2017, pp. 464–473.
[21]
Levine S., Kumar A., Tucker G., Fu J., Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020, arXiv preprint arXiv:2005.01643.
[22]
Li C., Jia R., Yao J., Liu J., Zhang Y., Niu Y., et al., Theoretically guaranteed policy improvement distilled from model-based planning, in: PRL workshop at IJCAI, 2023.
[23]
Li C., Liu J., Zhang Y., Wei Y., Niu Y., Yang Y., et al., Ace: Cooperative multi-agent q-learning with bidirectional action-dependency, in: Proceedings of the AAAI conference on artificial intelligence, vol. 37, 2023, pp. 8536–8544.
[24]
Liu J., Zhang Y., Li C., Yang C., Yang Y., Liu Y., et al., Masked pretraining for multi-agent decision making, 2023, arXiv preprint arXiv:2310.11846.
[25]
Liu M., Zhao H., Yang Z., Shen J., Zhang W., Zhao L., et al., Curriculum offline imitating learning, in: Advances in neural information processing systems, vol. 34, 2021.
[26]
Loshchilov I., Hutter F., Decoupled weight decay regularization, 2017, arXiv preprint arXiv:1711.05101.
[27]
O’Donoghue B., Osband I., Munos R., Mnih V., The uncertainty Bellman equation and exploration, in: International conference on machine learning, 2018, pp. 3836–3845.
[28]
Oh J., Guo X., Lee H., Lewis R.L., Singh S., Action-conditional video prediction using deep networks in Atari games, Advances in Neural Information Processing Systems 28 (2015) 2863–2871.
[29]
Peng X.B., Kumar A., Zhang G., Levine S., Advantage-weighted regression: Simple and scalable off-policy reinforcement learning, 2019, arXiv preprint arXiv:1910.00177.
[30]
Rajeswaran A., Kumar V., Gupta A., Vezzani G., Schulman J., Todorov E., et al., Learning complex dexterous manipulation with deep reinforcement learning and demonstrations, in: Robotics: science and systems XIV, 2018.
[31]
Robbins H., Monro S., A stochastic approximation method, The Annals of Mathematical Statistics (1951) 400–407.
[32]
Sallab A.E., Abdou M., Perot E., Yogamani S., Deep reinforcement learning framework for autonomous driving, Electronic Imaging 2017 (2017) 70–76.
[33]
Siegel N.Y., Springenberg J.T., Berkenkamp F., Abdolmaleki A., Neunert M., Lampe T., et al., Keep doing what worked: Behavioral modelling priors for offline reinforcement learning, 2020, arXiv preprint arXiv:2002.08396.
[34]
Tieleman T., Hinton G., Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning 6 (2012) 26–31.
[35]
Wang L., Liu J., Shao H., Wang W., Chen R., Liu Y., et al., Efficient reinforcement learning for autonomous driving with parameterized skills and priors, Robotics: Science and Systems (RSS) (2023).
[36]
Wu Y., Tucker G., Nachum O., Behavior regularized offline reinforcement learning, 2019, arXiv:1911.11361.
[37]
Wu Y., Zhai S., Srivastava N., Susskind J.M., Zhang J., Salakhutdinov R., et al., Uncertainty weighted actor-critic for offline reinforcement learning, in: International conference on machine learning, PMLR, 2021, pp. 11319–11328.
[38]
Yu T., Thomas G., Yu L., Ermon S., Zou J.Y., Levine S., et al., Mopo: Model-based offline policy optimization, Advances in Neural Information Processing Systems 33 (2020) 14129–14142.
[39]
Zhang Y., Liu J., Li C., Niu Y., Yang Y., Liu Y., et al., A perspective of q-value estimation on offline-to-online reinforcement learning, in: Proceedings of the AAAI conference on artificial intelligence, 2024.

Information
Published In

Neural Networks, Volume 180, Issue C, December 2024, 1432 pages

Publisher

Elsevier Science Ltd.

United Kingdom

Author Tags

  1. Reinforcement learning
  2. Offline reinforcement learning
  3. Machine learning

Qualifiers

  • Research-article
