DOI: 10.5555/3618408.3620068

On the power of pre-training for generalization in RL: provable benefits and hardness

Published: 23 July 2023

Abstract

Generalization in Reinforcement Learning (RL) aims to train an agent on training environments so that it performs well in the target environment. In this work, we first point out that RL generalization is fundamentally different from generalization in supervised learning, and that fine-tuning on the target environment is necessary for good test performance. Therefore, we seek to answer the following question: how much can we expect pre-training over training environments to help with efficient and effective fine-tuning? On the one hand, we give a surprising result: asymptotically, the improvement from pre-training is at most a constant factor. On the other hand, we show that pre-training can indeed be helpful in the non-asymptotic regime by designing a policy collection-elimination (PCE) algorithm and proving a distribution-dependent regret bound that is independent of the state-action space. We hope our theoretical results can provide insight toward understanding pre-training and generalization in RL.
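
The abstract does not spell out the PCE procedure itself. As a rough illustration of what a policy collection-elimination loop can look like, below is a minimal successive-elimination sketch over a finite collection of pre-trained candidate policies, evaluated during fine-tuning on the target environment. All names (`eliminate_policies`, `policy_returns`) and the Hoeffding-style confidence radius are assumptions made for illustration; this is not the paper's algorithm or its regret analysis.

```python
import math
import random


def eliminate_policies(policy_returns, num_rounds=2000, delta=0.05):
    """Generic successive elimination over a finite policy collection.

    policy_returns: dict mapping policy name -> zero-argument callable that
    simulates one fine-tuning episode in the target environment and returns
    its (stochastic) return, assumed to lie in [0, 1].
    Policies whose upper confidence bound falls below the best lower
    confidence bound are eliminated; the survivors are returned.
    """
    active = list(policy_returns)
    counts = {p: 0 for p in active}
    means = {p: 0.0 for p in active}

    for _ in range(num_rounds):
        if len(active) == 1:
            break
        # Play every surviving policy once (round-robin evaluation).
        for p in active:
            r = policy_returns[p]()
            counts[p] += 1
            means[p] += (r - means[p]) / counts[p]

        # Hoeffding-style confidence radius (an assumed choice).
        def radius(p):
            return math.sqrt(
                math.log(2 * len(policy_returns) * counts[p] / delta)
                / (2 * counts[p])
            )

        best_lcb = max(means[p] - radius(p) for p in active)
        active = [p for p in active if means[p] + radius(p) >= best_lcb]
    return active


if __name__ == "__main__":
    random.seed(0)
    # Toy target environment: each candidate policy's episode return is a
    # Bernoulli draw with a different success probability.
    candidates = {
        "policy_a": lambda: float(random.random() < 0.8),
        "policy_b": lambda: float(random.random() < 0.5),
        "policy_c": lambda: float(random.random() < 0.3),
    }
    print(eliminate_policies(candidates))
```

The key design point this sketch shares with a collection-elimination approach is that fine-tuning only has to distinguish among a finite set of pre-trained candidates, so the sample cost scales with the size of that collection rather than with the state-action space.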


Published In

ICML'23: Proceedings of the 40th International Conference on Machine Learning, July 2023, 43479 pages

Publisher

JMLR.org
