DOI: 10.5555/3692070.3692287
research-article

Limited preference aided imitation learning from imperfect demonstrations

Published: 21 July 2024

Abstract

Imitation learning mimics high-quality policies from expert data for sequential decision-making tasks. Its efficacy is hindered, however, when optimal demonstrations are unavailable and only imperfect demonstrations are present. Introducing additional, limited human preferences is a suitable way to address this issue: preferences can be collected in a human-friendly manner and offer a promising path toward learning a policy that exceeds the performance of the imperfect demonstrations. In this paper, we propose a novel imitation learning (IL) algorithm, Preference Aided Imitation Learning from imperfect demonstrations (PAIL). Specifically, PAIL learns a preference reward by querying experts for limited preferences over imperfect demonstrations. This reward serves two purposes during training: 1) reweighting the imperfect demonstrations by preference reward to raise their quality, and 2) selecting explored trajectories with high cumulative preference rewards to augment the imperfect demonstrations. The resulting dataset, whose quality improves continually, empowers PAIL to transcend the performance of the initial demonstrations. Comprehensive empirical results on a synthetic task and two locomotion benchmarks show that PAIL surpasses baselines by 73.2% and breaks through the performance bottleneck of imperfect demonstrations.
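The abstract only sketches the mechanism at a high level. As a rough illustration of how a preference reward of this kind is typically learned from limited expert queries and then used to reweight demonstrations, the Python sketch below trains a Bradley-Terry reward model on expert-labelled segment pairs and converts cumulative rewards into demonstration weights. This is a minimal sketch under common preference-based RL conventions, not the authors' implementation; the network architecture, the names PreferenceReward, preference_loss, and demonstration_weights, and the softmax-with-temperature weighting are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code): a Bradley-Terry preference
# reward learned from expert-labelled segment pairs, then used to reweight
# imperfect demonstrations by cumulative preference reward.
import torch
import torch.nn as nn


class PreferenceReward(nn.Module):
    """Maps a (state, action) pair to a scalar preference reward."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def preference_loss(reward, seg_a, seg_b, a_preferred: int):
    """Bradley-Terry loss for one expert query.

    Each segment is an (obs, act) pair of shape (T, dim); its score is the
    sum of per-step preference rewards. `a_preferred` is 1 if the expert
    preferred segment A, else 0.
    """
    score_a = reward(*seg_a).sum()
    score_b = reward(*seg_b).sum()
    logits = torch.stack([score_a, score_b]).unsqueeze(0)  # shape (1, 2)
    target = torch.tensor([0 if a_preferred else 1])       # index of the winner
    return nn.functional.cross_entropy(logits, target)


def demonstration_weights(reward, demos, temperature: float = 1.0):
    """Weight each imperfect demonstration by its cumulative preference reward."""
    with torch.no_grad():
        returns = torch.stack([reward(obs, act).sum() for obs, act in demos])
    return torch.softmax(returns / temperature, dim=0)
```

Selecting explored trajectories would follow the same scoring step: roll out the current policy, compute each trajectory's cumulative preference reward, and add the top-scoring trajectories to the demonstration set.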

Published In

ICML'24: Proceedings of the 41st International Conference on Machine Learning
July 2024
63010 pages

Publisher

JMLR.org

Publication History

Published: 21 July 2024

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%
