DOI: 10.5555/3692070.3692287
research-article

Limited preference aided imitation learning from imperfect demonstrations

Published: 21 July 2024

Abstract

Imitation learning mimics high-quality policies from expert data for sequential decision-making tasks. Its efficacy is hindered, however, when optimal demonstrations are unavailable and only imperfect demonstrations are present. Introducing additional, limited human preferences is a suitable way to address this issue: preferences can be collected in a human-friendly manner and offer a promising path toward learning a policy that exceeds the performance of the imperfect demonstrations. In this paper, we propose a novel imitation learning (IL) algorithm, Preference Aided Imitation Learning from imperfect demonstrations (PAIL). Specifically, PAIL learns a preference reward by querying experts for limited preferences over imperfect demonstrations. This reward serves two purposes during training: 1) reweighting the imperfect demonstrations by preference reward to raise their quality, and 2) selecting explored trajectories with high cumulative preference rewards to augment the imperfect demonstrations. The resulting dataset, whose quality improves continually, empowers PAIL to transcend the performance of the initial demonstrations. Comprehensive empirical results on a synthetic task and two locomotion benchmarks show that PAIL surpasses baselines by 73.2% and breaks through the performance bottleneck of imperfect demonstrations.
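The abstract only sketches the mechanism at a high level. As a rough illustration of how a preference reward of this kind is typically learned from limited expert queries and then used to reweight demonstrations, the Python sketch below trains a Bradley-Terry reward model on expert-labelled segment pairs and converts cumulative rewards into demonstration weights. This is a minimal sketch under common preference-based RL conventions, not the authors' implementation; the network architecture, the names PreferenceReward, preference_loss, and demonstration_weights, and the softmax-with-temperature weighting are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code): a Bradley-Terry preference
# reward learned from expert-labelled segment pairs, then used to reweight
# imperfect demonstrations by cumulative preference reward.
import torch
import torch.nn as nn


class PreferenceReward(nn.Module):
    """Maps a (state, action) pair to a scalar preference reward."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def preference_loss(reward, seg_a, seg_b, a_preferred: int):
    """Bradley-Terry loss for one expert query.

    Each segment is an (obs, act) pair of shape (T, dim); its score is the
    sum of per-step preference rewards. `a_preferred` is 1 if the expert
    preferred segment A, else 0.
    """
    score_a = reward(*seg_a).sum()
    score_b = reward(*seg_b).sum()
    logits = torch.stack([score_a, score_b]).unsqueeze(0)  # shape (1, 2)
    target = torch.tensor([0 if a_preferred else 1])       # index of the winner
    return nn.functional.cross_entropy(logits, target)


def demonstration_weights(reward, demos, temperature: float = 1.0):
    """Weight each imperfect demonstration by its cumulative preference reward."""
    with torch.no_grad():
        returns = torch.stack([reward(obs, act).sum() for obs, act in demos])
    return torch.softmax(returns / temperature, dim=0)
```

Selecting explored trajectories would follow the same scoring step: roll out the current policy, compute each trajectory's cumulative preference reward, and add the top-scoring trajectories to the demonstration set.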

Published In

ICML'24: Proceedings of the 41st International Conference on Machine Learning
July 2024
63010 pages

Publisher

JMLR.org

Publication History

Published: 21 July 2024

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%
