Abstract
Scalable robot learning from human-robot interaction is critical if robots are to solve a multitude of tasks in the real world. Current approaches to imitation learning suffer from one of two drawbacks. On the one hand, they rely solely on off-policy human demonstrations, which in some cases leads to a mismatch between the training and test distributions. On the other, they burden the human with labeling every state the learner visits, rendering them impractical in many applications. We argue that learning interactively from expert interventions enjoys the best of both worlds. Our key insight is that any amount of expert feedback, whether by intervention or non-intervention, provides information about the quality of the current state, the quality of the action, or both. We formalize this as a constraint on the learner’s value function, which we can efficiently learn using no-regret online learning techniques. We call our approach Expert Intervention Learning (EIL), and evaluate it on a real and simulated driving task with a human expert, where it learns collision avoidance from scratch with just a few hundred samples (about one minute) of expert control.
Notes
While we assume \(Q_\theta (\cdot )\) is convex to prove regret guarantees, the update can be applied to non-convex function classes like neural networks as done in similar works (Sun et al. 2017)
Fréchet distance is a distance metric commonly used to compare trajectories of potentially uneven length. Informally, given a person walking along one trajectory and a dog following the other without either backtracking, the Fréchet distance is the length of the shortest possible leash for both to make it from start to finish.
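To make the metric concrete, the following is a minimal sketch of the discrete Fréchet distance (the dynamic-programming formulation of Eiter and Mannila) between two polyline trajectories; the function name, Euclidean point distance, and example trajectories are illustrative choices, not taken from the paper.

```python
import numpy as np

def discrete_frechet(p, q):
    """Discrete Fréchet distance between polylines p, q (arrays of shape [n, d])."""
    n, m = len(p), len(q)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # pairwise point distances
    c = np.full((n, m), np.inf)                                 # coupling table
    c[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(
                c[i - 1, j] if i > 0 else np.inf,                 # person advances
                c[i, j - 1] if j > 0 else np.inf,                 # dog advances
                c[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # both advance
            )
            c[i, j] = max(prev, d[i, j])  # leash must also span the current pair
    return c[-1, -1]

# Example: trajectories of uneven length
traj_a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
traj_b = np.array([[0.0, 1.0], [2.0, 1.0]])
print(discrete_frechet(traj_a, traj_b))  # ~1.414 (vertex-only version, slightly above the continuous value 1.0)
```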
We modify the action space to have a low constant acceleration and no braking, so that the action space is just a discrete set of possible steering angles \([-1,0,1]\), more closely matching that of the original DAgger experiment. We pre-process the 96x96 RGB pixel observation by converting to LAB color values and using the A and B channels to form a single-channel binary thresholded image containing all relevant features. We downscale that image to an 8x8 float image and reshape it into the final state vector \(s\in {\mathbb {R}}^{64}\). The expert network is a DQN of dimensions 64, (8), 3 with tanh activation at the hidden layer. We use the 8 hidden-layer outputs as our feature vector. The learner function class \(\scriptstyle \varvec{q}(s,a)\) is the set of 27 weights and biases for the output layer.
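A minimal sketch of that preprocessing pipeline, assuming scikit-image is available; the threshold values and the helper name `preprocess` are illustrative placeholders, since the note only states that a single-channel binary image with the relevant features is produced.

```python
import numpy as np
from skimage.color import rgb2lab
from skimage.transform import resize

def preprocess(rgb_frame, a_thresh=10.0, b_thresh=10.0):
    """Map a 96x96x3 RGB frame to the 64-dim state vector described above.

    The A/B thresholds are illustrative, not the exact constants from the experiments.
    """
    lab = rgb2lab(rgb_frame)                       # convert to LAB color space
    a, b = lab[..., 1], lab[..., 2]                # keep only the A and B channels
    binary = ((np.abs(a) > a_thresh) | (np.abs(b) > b_thresh)).astype(np.float32)
    small = resize(binary, (8, 8), anti_aliasing=True)  # downscale to an 8x8 float image
    return small.reshape(-1)                       # flatten into s in R^64

state = preprocess(np.zeros((96, 96, 3), dtype=np.uint8))
assert state.shape == (64,)
```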
References
Abbeel, P., & Ng, A.Y. (2004). Apprenticeship learning via inverse reinforcement learning. In: Proceedings of the twenty-first International Conference on Machine learning (ICML)
Alt, H., & Godau, M. (1995). Computing the Fréchet distance between two polygonal curves. International Journal of Computational Geometry & Applications, 5, 75–91.
Amershi, S., Cakmak, M., Knox, W. B., & Kulesza, T. (2014). Power to the people: The role of humans in interactive machine learning. AI Magazine, 35, 105–120.
Argall, B. D., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems.
Bajcsy, A., Losey, D.P., O’Malley, M.K., & Dragan, A.D. (2018). Learning from physical human corrections, one feature at a time. In: Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction (HRI)
Bajcsy, A., Losey, D.P., O’Malley, M.K., & Dragan, A.D. (2017). Learning robot objectives from physical human interaction. In: Proceedings of the 1st Annual Conference on Robot Learning (CoRL). PMLR
Bi, J., Dhiman, V., Xiao, T., & Xu, C. (2020). Learning from interventions using hierarchical policies for safe learning. Proceedings of the AAAI Conference on Artificial Intelligence, 34(06), 10352–10360.
Bi, J., Xiao, T., Sun, Q., & Xu, C. (2018). Navigation by imitation in a pedestrian-rich environment. arXiv preprint arXiv:1811.00506
Celemin, C., & Ruiz-del Solar, J. (2019). An interactive framework for learning continuous actions policies based on corrective feedback. Journal of Intelligent & Robotic Systems, 95, 77–97.
Chen, M., Nikolaidis, S., Soh, H., Hsu, D., & Srinivasa, S. (2018). Planning with trust for human-robot collaboration. In: Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction (HRI)
Chernova, S., & Veloso, M. (2009). Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research, 34, 1–25.
Choudhury, S., Dugar, V., Maeta, S., MacAllister, B., Arora, S., Althoff, D., & Scherer, S. (2019). High performance and safe flight of full-scale helicopters from takeoff to landing with an ensemble of planners. Journal of Field Robotics (JFR), 36(8), 1275–1332.
Daumé III, H., Langford, J., & Marcu, D. (2009). Search-based structured prediction. Machine Learning Journal (MLJ), 75(3), 297–325.
Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., & Zhokhov, P. (2017). OpenAI Baselines. https://github.com/openai/baselines
Fisac, J.F., Gates, M.A., Hamrick, J.B., Liu, C., Hadfield-Menell, D., Palaniappan, M., Malik, D., Sastry, S.S., Griffiths, T.L., & Dragan, A.D. (2019). Pragmatic-pedagogic value alignment. Robotics Research, pp. 49–57.
Goecks, V. G., Gremillion, G. M., Lawhern, V. J., Valasek, J., & Waytowich, N. R. (2019). Efficiently combining human demonstrations and interventions for safe training of autonomous systems in real-time. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 2462–2470.
Grollman, D.H., & Jenkins, O.C. (2007). Dogged learning for robots. In: Proceedings 2007 IEEE International Conference on Robotics and Automation (ICRA).
Gupta, S., Davidson, J., Levine, S., Sukthankar, R., & Malik, J. (2017). Cognitive mapping and planning for visual navigation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Hadfield-Menell, D., Russell, S.J., Abbeel, P., & Dragan, A. (2016). Cooperative inverse reinforcement learning. In: Advances in Neural Information Processing Systems (NeurIPS).
Jain, A., Wojcik, B., Joachims, T., & Saxena, A. (2013). Learning trajectory preferences for manipulators via iterative improvement. In: Advances in Neural Information Processing Systems (NeurIPS).
Judah, K., Fern, A.P., & Dietterich, T.G. (2012). Active imitation learning via reduction to iid active learning. In: 2012 AAAI Fall Symposium Series.
Kelly, M., Sidrane, C., Driggs-Campbell, K., & Kochenderfer, M.J. (2019). HG-DAgger: Interactive imitation learning with human experts. In: 2019 International Conference on Robotics and Automation (ICRA).
Kim, B., Farahmand, A., Pineau, J., & Precup, D. (2013). Learning from limited demonstrations. In: Advances in Neural Information Processing Systems (NeurIPS).
Kim, B., & Pineau, J. (2013). Maximum mean discrepancy imitation learning. In: Robotics: Science and Systems (RSS)
Kollmitz, M., Koller, T., Boedecker, J., & Burgard, W. (2020). Learning human-aware robot navigation from physical interaction via inverse reinforcement learning. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 11025–11031. IEEE
Laskey, M., Chuck, C., Lee, J., Mahler, J., Krishnan, S., Jamieson, K., Dragan, A., & Goldberg, K. (2017). Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations. In: IEEE International Conference on Robotics and Automation (ICRA).
Laskey, M., Lee, J., Hsieh, W., Liaw, R., Mahler, J., Fox, R., & Goldberg, K. (2017). Iterative noise injection for scalable imitation learning. arXiv preprint arXiv:1703.09327
Laskey, M., Staszak, S., Hsieh, W.Y.S., Mahler, J., Pokorny, F.T., Dragan, A.D., & Goldberg, K. (2016). SHIV: Reducing supervisor burden in dagger using support vectors for efficient learning from demonstrations in high dimensional state spaces. In: 2016 IEEE International Conference on Robotics and Automation (ICRA).
Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., & Quillen, D. (2018). Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research (IJRR).
Loftin, R., Peng, B., MacGlashan, J., Littman, M. L., Taylor, M. E., Huang, J., & Roberts, D. L. (2016). Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning. Autonomous Agents and Multi-Agent Systems.
MacGlashan, J., Ho, M.K., Loftin, R., Peng, B., Wang, G., Roberts, D.L., Taylor, M.E., & Littman, M.L. (2017). Interactive learning from policy-dependent human feedback. In: Proceedings of the 34th International Conference on Machine Learning (ICML).
McPherson, D.L., Scobee, D.R., Menke, J., Yang, A.Y., & Sastry, S.S. (2018). Modeling supervisor safe sets for improving collaboration in human-robot teams. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 861–868. IEEE.
Menda, K., Driggs-Campbell, K.R., & Kochenderfer, M.J. (2018). EnsembleDAgger: A Bayesian approach to safe imitation learning. arXiv preprint arXiv:1807.08364
Osa, T., Pajarinen, J., Neumann, G., Bagnell, J.A., Abbeel, P., & Peters, J. (2018). An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2), 1–179.
Packard, B., & Ontañón, S. (2017). Policies for active learning from demonstration. In: 2017 AAAI Spring Symposium Series
Pomerleau, D.A. (1989). Alvinn: An autonomous land vehicle in a neural network. In: Advances in Neural Information Processing Systems (NeurIPS)
Ross, S., Gordon, G., & Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics (AIStats).
Ross, S., Melik-Barkhudarov, N., Shankar, K.S., Wendel, A., Dey, D., Bagnell, J.A., & Hebert, M. (2013). Learning monocular reactive UAV control in cluttered natural environments. In: IEEE International Conference on Robotics and Automation (ICRA).
Sadat, A., Ren, M., Pokrovsky, A., Lin, Y.C., Yumer, E., & Urtasun, R. (2019). Jointly learnable behavior and trajectory planning for self-driving vehicles. arXiv preprint arXiv:1910.04586
Sadigh, D., Sastry, S.S., Seshia, S.A., & Dragan, A. (2016). Information gathering actions over human internal state. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Saunders, W., Sastry, G., Stuhlmueller, A., & Evans, O. (2017). Trial without error: Towards safe reinforcement learning via human intervention. arXiv preprint arXiv:1707.05173
Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.
Spencer, J., Choudhury, S., Barnes, M., Schmittle, M., Chiang, M., Ramadge, P., & Srinivasa, S. (2020). Learning from interventions: Human-robot interaction as both explicit and implicit feedback. In: Robotics: Science and Systems (RSS).
Spencer, J., Choudhury, S., Venkatraman, A., Ziebart, B., & Bagnell, J.A. (2021). Feedback in imitation learning: The three regimes of covariate shift. arXiv preprint arXiv:2102.02872
Srinivasa, S.S., Lancaster, P., Michalove, J., Schmittle, M., Summers, C., Rockett, M., Smith, J.R., Choudhury, S., Mavrogiannis, C., & Sadeghi, F. (2019). MuSHR: A Low-Cost, Open-Source Robotic Racecar for Education and Research. arXiv preprint arXiv:1908.08031
Sun, W., Venkatraman, A., Gordon, G.J., Boots, B., & Bagnell, J.A. (2017). Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction. In: Proceedings of the 34th International Conference on Machine Learning (ICML).
Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th International Conference on Machine Learning (ICML).
Acknowledgements
This work was (partially) funded by the DARPA Dispersed Computing program, NIH R01 (R01EB019335), NSF CPS (#1544797), NSF NRI (#1637748), the Office of Naval Research, RCTA, Amazon, and Honda Research Institute USA.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This is one of the several papers published in Autonomous Robots comprising the Special Issue on Robotics: Science and Systems 2020.
Appendices
Proofs
1.1 Reduction to no-regret online learning
The general non-i.i.d. optimization we wish to solve is \(\min _{\theta } \; {\mathbb {E}}_{(s,a) \sim d^I_{\pi _\theta }} \ell _C(s,a,\theta ) + \lambda \, {\mathbb {E}}_{(s,a) \sim d_{\pi _\theta }} \ell _B(s,a,\theta )\), where both expectations are taken over distributions induced by the learner's own policy \(\pi _\theta \).
We’ll directly prove the general setting here rather than proving individually for \(\ell _C\) and \(\ell _B\).
We reduce this optimization problem to a sequence of convex losses \(\ell _i(\theta )\), where the i-th loss is a function of the distribution at that iteration, \(\ell _i(\theta )={\mathbb {E}}_{(s,a) \sim d^I_i} \ell _C(s,a,\theta ) + \lambda {\mathbb {E}}_{(s,a) \sim d_i} \ell _B(s,a,\theta )\). In our algorithm, the learner at iteration i applies Follow-the-Leader (FTL), i.e., \(\theta _{i+1} = \arg \min _{\theta } \sum _{j=1}^{i} \ell _j(\theta )\).
Since FTL is a no-regret algorithm, the average regret \(\gamma _N = \frac{1}{N}\sum _{i=1}^N \ell _i(\theta _i) - \min _{\theta }\frac{1}{N}\sum _{i=1}^N \ell _i(\theta )\) goes to 0 as \(N\rightarrow \infty \), with rate \({\tilde{O}}(\tfrac{1}{N})\) for strongly convex \(\ell _i\) (see Theorem 2.4 and Corollary 2.2 in Shalev-Shwartz (2012)).
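As a concrete illustration of this reduction, here is a minimal sketch of an FTL-style outer loop under the assumption of a generic convex, differentiable loss; the `rollout` and `per_iter_loss` callables and the toy quadratic example are placeholders, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def ftl_loop(theta0, rollout, per_iter_loss, num_iters=10):
    """Follow-the-Leader over the sequence of induced-distribution losses.

    rollout(theta)            -> dataset D_i gathered under the current policy
    per_iter_loss(theta, D_i) -> scalar convex loss ell_i(theta) on that dataset
    """
    theta = np.asarray(theta0, dtype=float)
    datasets = []
    for _ in range(num_iters):
        datasets.append(rollout(theta))  # induce d_i with the current learner
        # FTL: minimize the sum of all per-iteration losses seen so far
        objective = lambda th: sum(per_iter_loss(th, d) for d in datasets)
        theta = minimize(objective, theta).x
    return theta

# Toy usage: quadratic losses around "datasets" drawn under the current theta
rng = np.random.default_rng(0)
rollout = lambda th: th + 0.1 * rng.standard_normal(2)      # fake induced dataset
per_iter_loss = lambda th, d: float(np.sum((th - d) ** 2))  # convex surrogate loss
print(ftl_loop(np.zeros(2), rollout, per_iter_loss, num_iters=5))
```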
In this framework, we restate and prove Thm. 1.
Theorem 2
Let \(\ell _i(\theta ) = {\mathbb {E}}_{(s,a)\sim d_{\pi _{\theta _i}}} \ell (s,a,\theta )\). Also let \(\epsilon _N = \min _{\theta }\frac{1}{N} \sum _{i=1}^N \ell _i(\theta )\) be the loss of the best parameter in hindsight after N iterations. Let \(\gamma _N\) be the average regret of \(\theta _{1:N}\). Then there exists a \(\theta \in \theta _{1:N}\) s.t. \({\mathbb {E}}_{(s,a)\sim d_{\pi _{\theta }}} \ell (s,a,\theta ) \le \epsilon _N + \gamma _N\).
Proof
The performance of the best learner in the sequence \(\theta _1,\ldots ,\theta _N\) must be smaller than the average loss of each learner on its own induced distribution (the minimum is at most the average):
\(\min _{\theta \in \theta _{1:N}} {\mathbb {E}}_{(s,a)\sim d_{\pi _{\theta }}} \ell (s,a,\theta ) \le \frac{1}{N}\sum _{i=1}^N \ell _i(\theta _i)\).
Using (20) we have
\(\frac{1}{N}\sum _{i=1}^N \ell _i(\theta _i) \le \min _{\theta }\frac{1}{N}\sum _{i=1}^N \ell _i(\theta ) + \gamma _N = \epsilon _N + \gamma _N\).
\(\square \)
This proof can be extended to finite-sample cases following the original DAgger proofs. The theorem applies to each portion of the objective individually, yielding regret terms \(\gamma _N^B\) and \(\gamma _N^I\) that each go to zero as \(N\rightarrow \infty \); thus the combined objective, as well as each individual objective, is no-regret.
HG-DAgger counter-example
We construct a counter-example for HG-DAgger approaches (Kelly et al. 2019; Goecks et al. 2019; Bi et al. 2018) in Fig. 9. Recall that in HG-DAgger, we only use the intervention loss \(\ell _C(\cdot )\).
The MDP is such that the learner can choose between two actions, Left (L) and Right (R), only at states \(s_0\) and \(s_1\). Unknown to the learner, but known to the expert, some of the edges are associated with costs. The expert deems a “good enough” state to be one with value \(-9\). Hence, whenever the learner enters \(s_1\), the expert takes over to intervene and demonstrates \((s_1, L)\).
HG-DAgger keeps only this intervention data and uses it as a classification loss. Suppose it uses a tabular policy. If it learns the policy \((s_0,L)\), \((s_1,L)\), it will indeed achieve \(\ell _C(s,a,\theta )=0\). However, the expert will continue to intervene, as this policy always exits the good-enough state.
Let’s look at all policies and their implicit bounds and intervention losses. Assume we get a penalty of 1 for every bad state or misclassified action. We have:
1. Policy \((s_0, L), (s_1,L)\): Loss \(\ell _B = 2\), \(\ell _C=0\)
2. Policy \((s_0, L), (s_1,R)\): Loss \(\ell _B = 2\), \(\ell _C=1\)
3. Policy \((s_0, R), (s_1,L)\): Loss \(\ell _B = 0\), \(\ell _C=0\)
4. Policy \((s_0, R), (s_1,R)\): Loss \(\ell _B = 0\), \(\ell _C=0\)
The last two policies have the same intervention loss because the induced distribution is such that these policies never result in interventions (even though one learns an incorrect intervention action).
HG-DAgger looks only at the intervention loss \(\ell _C\) and hence may not end up learning \((s_0, R)\). EIL, on the other hand, will, as the sketch below illustrates.
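To make the comparison explicit, here is a small sketch using the loss values from the enumeration above: minimizing the intervention loss alone can select the policy starting with \((s_0, L)\), while adding the implicit bound loss rules it out. The weight \(\lambda =1\) is an arbitrary illustrative choice.

```python
# (policy, ell_B, ell_C) for the four tabular policies enumerated above
policies = [
    ("(s0,L),(s1,L)", 2, 0),
    ("(s0,L),(s1,R)", 2, 1),
    ("(s0,R),(s1,L)", 0, 0),
    ("(s0,R),(s1,R)", 0, 0),
]

lam = 1.0  # arbitrary trade-off weight for this illustration

# HG-DAgger-style selection: intervention loss only -- the bad policy 1 is among the minimizers
hg_best = min(policies, key=lambda p: p[2])
# EIL-style selection: combined loss -- only policies that take (s0, R) remain optimal
eil_best = min(policies, key=lambda p: p[2] + lam * p[1])

print("ell_C only    :", hg_best[0])   # returns "(s0,L),(s1,L)" here
print("ell_C + ell_B :", eil_best[0])  # returns a policy with (s0,R)
```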
Cite this article
Spencer, J., Choudhury, S., Barnes, M. et al. Expert Intervention Learning. Auton Robot 46, 99–113 (2022). https://doi.org/10.1007/s10514-021-10006-9