
Adaptive pessimism via target Q-value for offline reinforcement learning

Published: 07 January 2025

Abstract

Offline reinforcement learning (RL) methods learn from fixed datasets without further environment interaction and therefore suffer from errors caused by out-of-distribution (OOD) actions. Although effective methods have been proposed that conservatively estimate the Q-values of OOD actions to mitigate this problem, insufficient or excessive pessimism under a constant constraint often harms policy learning. Moreover, since the data distribution of each task varies with the environment and the behavior policy, it is desirable to learn, for each task, an adaptive weight that balances the conservative Q-value constraint against the standard RL objective. To this end, we point out that a quantile of the Q-value is an effective reference point in the Q-value distribution of the fixed dataset. Based on this observation, we design the Adaptive Pessimism via Target Q-value (APTQ) algorithm, which balances the pessimism constraint and the RL objective so that the expected Q-value stably converges to a target Q-value taken from a reasonable quantile of the dataset's Q-value distribution. Experiments show that our method remarkably improves the performance of the state-of-the-art method CQL, by 6.20% on D4RL-v0 and 1.89% on D4RL-v2.

Highlights

Dynamically balancing constraints and reinforcement learning objectives.
Enhancing CQL on challenging datasets while maintaining training stability.
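As a concrete illustration of the mechanism summarized above, the following is a minimal sketch, assuming a CQL-style conservatism penalty, of how a pessimism weight might be adapted so that the policy's expected Q-value tracks a target taken from a quantile of the dataset's Q-value distribution. The network, the dual-ascent-style update rule, and all names and hyperparameters (QNet, update_pessimism_weight, target_quantile, alpha_lr) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): adapt the weight on a CQL-style
# conservatism penalty so that the expected Q-value of policy actions converges
# toward a target Q-value taken from a chosen quantile of the dataset's Q-values.
import torch
import torch.nn as nn


class QNet(nn.Module):
    """Minimal state-action value network."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def update_pessimism_weight(q_net, log_alpha, obs, dataset_actions, policy_actions,
                            target_quantile=0.7, alpha_lr=3e-4):
    """Adapt the conservatism weight alpha = exp(log_alpha).

    The target Q-value is the `target_quantile` quantile of Q over dataset
    (state, action) pairs; alpha grows when the policy's expected Q-value
    overshoots this target (more pessimism) and shrinks otherwise.
    """
    with torch.no_grad():
        q_data = q_net(obs, dataset_actions)                 # Q on in-dataset actions
        target_q = torch.quantile(q_data, target_quantile)   # target from the data distribution
        q_pi = q_net(obs, policy_actions).mean()             # expected Q under the learned policy
        log_alpha += alpha_lr * (q_pi - target_q)            # dual-ascent-style weight update
    alpha = log_alpha.exp().item()
    # alpha would then weight the conservatism penalty in the critic loss, e.g.
    #   critic_loss = bellman_error + alpha * (q_policy_batch - q_data).mean()
    return alpha, target_q.item()


if __name__ == "__main__":
    obs_dim, act_dim, batch = 11, 3, 256
    q_net = QNet(obs_dim, act_dim)
    log_alpha = torch.zeros(())                              # alpha starts at 1.0
    obs = torch.randn(batch, obs_dim)
    a_data = torch.randn(batch, act_dim)                     # actions from the offline dataset
    a_pi = torch.randn(batch, act_dim)                       # actions sampled from the policy
    alpha, target_q = update_pessimism_weight(q_net, log_alpha, obs, a_data, a_pi)
    print(f"alpha={alpha:.4f}, target_q={target_q:.4f}")
```

The point mirrored here is that the penalty weight is not fixed: it rises while the policy's expected Q-value overshoots the quantile-derived target and falls once it drops below it, which is the feedback loop between pessimism constraint and RL objective that the abstract describes.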

References

[1]
Agarwal R., Schuurmans D., Norouzi M., An optimistic perspective on offline reinforcement learning, in: International conference on machine learning, PMLR, 2020, pp. 104–114.
[2]
An G., Moon S., Kim J., Song H.O., Uncertainty-based offline reinforcement learning with diversified q-ensemble, in: Advances in neural information processing systems, 2021.
[3]
Aravkin A.Y., Kambadur A., Lozano A.C., Luss R., Sparse quantile Huber regression for efficient and robust estimation, 2014, arXiv preprint arXiv:1402.4624.
[4]
Baker B., Kanitscheider I., Markov T.M., Wu Y., Powell G., McGrew B., et al., Emergent tool use from multi-agent autocurricula, in: International conference on learning representations, 2020.
[5]
Chen X., Zhou Z., Wang Z., Wang C., Wu Y., Ross K.W., BAIL: best-action imitation learning for batch deep reinforcement learning, in: Advances in neural information processing systems, 2020.
[6]
DI-engine Contributors, DI-engine: OpenDILab decision intelligence engine, 2021, https://github.com/opendilab/DI-engine.
[7]
Dabney W., Ostrovski G., Silver D., Munos R., Implicit quantile networks for distributional reinforcement learning, in: International conference on machine learning, PMLR, 2018, pp. 1096–1105.
[8]
Dabney W., Rowland M., Bellemare M.G., Munos R., Distributional reinforcement learning with quantile regression, in: Thirty-second AAAI conference on artificial intelligence, 2018.
[9]
Fu J., Kumar A., Nachum O., Tucker G., Levine S., D4rl: Datasets for deep data-driven reinforcement learning, 2020, arXiv preprint arXiv:2004.07219.
[10]
Fujimoto S., Gu S.S., A minimalist approach to offline reinforcement learning, in: Advances in neural information processing systems, 2021.
[11]
Fujimoto S., Meger D., Precup D., Off-policy deep reinforcement learning without exploration, in: International conference on machine learning, PMLR, 2019, pp. 2052–2062.
[12]
Gu S., Holly E., Lillicrap T., Levine S., Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates, in: 2017 IEEE international conference on robotics and automation, IEEE, 2017, pp. 3389–3396.
[13]
Jaques N., Ghandeharioun A., Shen J.H., Ferguson C., Lapedriza A., Jones N., et al., Way off-policy batch deep reinforcement learning of implicit human preferences in dialog, 2019, arXiv preprint arXiv:1907.00456.
[14]
Kendall A., Hawke J., Janz D., Mazur P., Reda D., Allen J.-M., et al., Learning to drive in a day, in: 2019 international conference on robotics and automation, IEEE, 2019, pp. 8248–8254.
[15]
Kingma D.P., Ba J., Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.
[16]
Koenker R., Hallock K.F., Quantile regression, Journal of Economic Perspectives 15 (2001) 143–156.
[17]
Kumar A., Fu J., Soh M., Tucker G., Levine S., Stabilizing off-policy q-learning via bootstrapping error reduction, in: Advances in neural information processing systems, 2019, pp. 11761–11771.
[18]
Kumar A., Zhou A., Tucker G., Levine S., Conservative q-learning for offline reinforcement learning, in: Advances in neural information processing systems, 2020.
[19]
Lange S., Gabel T., Riedmiller M., Batch reinforcement learning, in: Reinforcement learning, Springer, 2012, pp. 45–73.
[20]
Leibo J.Z., Zambaldi V.F., Lanctot M., Marecki J., Graepel T., Multi-agent reinforcement learning in sequential social dilemmas, in: Proceedings of the 16th conference on autonomous agents and multiAgent systems, ACM, 2017, pp. 464–473.
[21]
Levine S., Kumar A., Tucker G., Fu J., Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020, arXiv preprint arXiv:2005.01643.
[22]
Li C., Jia R., Yao J., Liu J., Zhang Y., Niu Y., et al., Theoretically guaranteed policy improvement distilled from model-based planning, in: PRL workshop at IJCAI, 2023.
[23]
Li C., Liu J., Zhang Y., Wei Y., Niu Y., Yang Y., et al., Ace: Cooperative multi-agent q-learning with bidirectional action-dependency, in: Proceedings of the AAAI conference on artificial intelligence, vol. 37, 2023, pp. 8536–8544.
[24]
Liu J., Zhang Y., Li C., Yang C., Yang Y., Liu Y., et al., Masked pretraining for multi-agent decision making, 2023, arXiv preprint arXiv:2310.11846.
[25]
Liu M., Zhao H., Yang Z., Shen J., Zhang W., Zhao L., et al., Curriculum offline imitating learning, in: Advances in neural information processing systems, vol. 34, 2021.
[26]
Loshchilov I., Hutter F., Decoupled weight decay regularization, 2017, arXiv preprint arXiv:1711.05101.
[27]
O’Donoghue B., Osband I., Munos R., Mnih V., The uncertainty Bellman equation and exploration, in: International conference on machine learning, 2018, pp. 3836–3845.
[28]
Oh J., Guo X., Lee H., Lewis R.L., Singh S., Action-conditional video prediction using deep networks in Atari games, Advances in Neural Information Processing Systems 28 (2015) 2863–2871.
[29]
Peng X.B., Kumar A., Zhang G., Levine S., Advantage-weighted regression: Simple and scalable off-policy reinforcement learning, 2019, arXiv preprint arXiv:1910.00177.
[30]
Rajeswaran A., Kumar V., Gupta A., Vezzani G., Schulman J., Todorov E., et al., Learning complex dexterous manipulation with deep reinforcement learning and demonstrations, in: Robotics: science and systems XIV, 2018.
[31]
Robbins H., Monro S., A stochastic approximation method, The Annals of Mathematical Statistics (1951) 400–407.
[32]
Sallab A.E., Abdou M., Perot E., Yogamani S., Deep reinforcement learning framework for autonomous driving, Electronic Imaging 2017 (2017) 70–76.
[33]
Siegel N.Y., Springenberg J.T., Berkenkamp F., Abdolmaleki A., Neunert M., Lampe T., et al., Keep doing what worked: Behavioral modelling priors for offline reinforcement learning, 2020, arXiv preprint arXiv:2002.08396.
[34]
Tieleman T., Hinton G., Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning 6 (2012) 26–31.
[35]
Wang L., Liu J., Shao H., Wang W., Chen R., Liu Y., et al., Efficient reinforcement learning for autonomous driving with parameterized skills and priors, Robotics: Science and Systems (RSS) (2023).
[36]
Wu Y., Tucker G., Nachum O., Behavior regularized offline reinforcement learning, 2019, arXiv:1911.11361.
[37]
Wu Y., Zhai S., Srivastava N., Susskind J.M., Zhang J., Salakhutdinov R., et al., Uncertainty weighted actor-critic for offline reinforcement learning, in: International conference on machine learning, PMLR, 2021, pp. 11319–11328.
[38]
Yu T., Thomas G., Yu L., Ermon S., Zou J.Y., Levine S., et al., Mopo: Model-based offline policy optimization, Advances in Neural Information Processing Systems 33 (2020) 14129–14142.
[39]
Zhang Y., Liu J., Li C., Niu Y., Yang Y., Liu Y., et al., A perspective of q-value estimation on offline-to-online reinforcement learning, in: Proceedings of the AAAI conference on artificial intelligence, 2024.

Information
Published In

Neural Networks, Volume 180, Issue C, December 2024, 1432 pages

Publisher

Elsevier Science Ltd.

United Kingdom

Author Tags

  1. Reinforcement learning
  2. Offline reinforcement learning
  3. Machine learning

Qualifiers

  • Research-article
