Abstract
Scalable robot learning from human-robot interaction is critical if robots are to solve a multitude of tasks in the real world. Current approaches to imitation learning suffer from one of two drawbacks. On the one hand, they rely solely on off-policy human demonstrations, which in some cases leads to a mismatch between the training and test distributions. On the other, they burden the human with labeling every state the learner visits, rendering them impractical in many applications. We argue that learning interactively from expert interventions enjoys the best of both worlds. Our key insight is that any amount of expert feedback, whether by intervention or non-intervention, provides information about the quality of the current state, the quality of the action, or both. We formalize this as a constraint on the learner’s value function, which we can efficiently learn using no-regret online learning techniques. We call our approach Expert Intervention Learning (EIL), and evaluate it on a real and simulated driving task with a human expert, where it learns collision avoidance from scratch with just a few hundred samples (about one minute) of expert control.
Notes
While we assume \(Q_\theta (\cdot )\) is convex to prove regret guarantees, the update can be applied to non-convex function classes like neural networks as done in similar works (Sun et al. 2017)
Fréchet distance is a distance metric commonly used to compare trajectories of potentially uneven length. Informally, given a person walking along one trajectory and a dog following the other without either backtracking, the Fréchet distance is the length of the shortest possible leash for both to make it from start to finish.
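To make the metric concrete, the following is a minimal sketch of the discrete Fréchet distance (the dynamic-programming formulation of Eiter and Mannila) between two polyline trajectories; the function name, Euclidean point distance, and example trajectories are illustrative choices, not taken from the paper.

```python
import numpy as np

def discrete_frechet(p, q):
    """Discrete Fréchet distance between polylines p, q (arrays of shape [n, d])."""
    n, m = len(p), len(q)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # pairwise point distances
    c = np.full((n, m), np.inf)                                 # coupling table
    c[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(
                c[i - 1, j] if i > 0 else np.inf,                 # person advances
                c[i, j - 1] if j > 0 else np.inf,                 # dog advances
                c[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # both advance
            )
            c[i, j] = max(prev, d[i, j])  # leash must also span the current pair
    return c[-1, -1]

# Example: trajectories of uneven length
traj_a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
traj_b = np.array([[0.0, 1.0], [2.0, 1.0]])
print(discrete_frechet(traj_a, traj_b))  # ~1.414 (vertex-only version, slightly above the continuous value 1.0)
```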
We modify the action space to have a low constant acceleration and no braking, so that the action space is just a discrete set of possible steering angles \([-1,0,1]\), more closely matching that of the original DAgger experiment. We pre-process the 96x96 RGB pixel observation by converting to LAB color values and using the A and B channels to form a single-channel binary thresholded image containing all relevant features. We downscale that image to an 8x8 float image and reshape it into the final state vector \(s\in {\mathbb {R}}^{64}\). The expert network is a DQN of dimensions 64, (8), 3 with tanh activation at the hidden layer. We use the 8 hidden-layer outputs as our feature vector. The learner function class \(\scriptstyle \varvec{q}(s,a)\) is the set of 27 weights and biases for the output layer.
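A minimal sketch of that preprocessing pipeline, assuming scikit-image is available; the threshold values and the helper name `preprocess` are illustrative placeholders, since the note only states that a single-channel binary image with the relevant features is produced.

```python
import numpy as np
from skimage.color import rgb2lab
from skimage.transform import resize

def preprocess(rgb_frame, a_thresh=10.0, b_thresh=10.0):
    """Map a 96x96x3 RGB frame to the 64-dim state vector described above.

    The A/B thresholds are illustrative, not the exact constants from the experiments.
    """
    lab = rgb2lab(rgb_frame)                       # convert to LAB color space
    a, b = lab[..., 1], lab[..., 2]                # keep only the A and B channels
    binary = ((np.abs(a) > a_thresh) | (np.abs(b) > b_thresh)).astype(np.float32)
    small = resize(binary, (8, 8), anti_aliasing=True)  # downscale to an 8x8 float image
    return small.reshape(-1)                       # flatten into s in R^64

state = preprocess(np.zeros((96, 96, 3), dtype=np.uint8))
assert state.shape == (64,)
```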
References
Abbeel, P., & Ng, A.Y. (2004). Apprenticeship learning via inverse reinforcement learning. In: Proceedings of the twenty-first International Conference on Machine learning (ICML)
Alt, H., & Godau, M. (1995). Computing the Fréchet distance between two polygonal curves. International Journal of Computational Geometry & Applications, 5, 75–91.
Amershi, S., Cakmak, M., Knox, W. B., & Kulesza, T. (2014). Power to the people: The role of humans in interactive machine learning. AI Magazine, 35, 105–120.
Argall, B. D., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems.
Bajcsy, A., Losey, D.P., O’Malley, M.K., & Dragan, A.D. (2018). Learning from physical human corrections, one feature at a time. In: Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction (HRI)
Bajcsy, A., Losey, D.P., O’Malley, M.K., & Dragan, A.D. (2017). Learning robot objectives from physical human interaction. In: Proceedings of the 1st Annual Conference on Robot Learning (CoRL). PMLR
Bi, J., Dhiman, V., Xiao, T., & Xu, C. (2020). Learning from interventions using hierarchical policies for safe learning. Proceedings of the AAAI Conference on Artificial Intelligence, 34(06), 10352–10360.
Bi, J., Xiao, T., Sun, Q., & Xu, C. (2018). Navigation by imitation in a pedestrian-rich environment. arXiv preprint arXiv:1811.00506
Celemin, C., & Ruiz-del Solar, J. (2019). An interactive framework for learning continuous actions policies based on corrective feedback. Journal of Intelligent & Robotic Systems, 95, 77–97.
Chen, M., Nikolaidis, S., Soh, H., Hsu, D., & Srinivasa, S. (2018). Planning with trust for human-robot collaboration. In: Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction (HRI)
Chernova, S., & Veloso, M. (2009). Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research, 34, 1–25.
Choudhury, S., Dugar, V., Maeta, S., MacAllister, B., Arora, S., Althoff, D., & Scherer, S. (2019). High performance and safe flight of full-scale helicopters from takeoff to landing with an ensemble of planners. Journal of Field Robotics (JFR), 36(8), 1275–1332.
Daumé III, H., Langford, J., & Marcu, D. (2009). Search-based structured prediction. Machine Learning Journal (MLJ), 75(3), 297–325.
Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., & Zhokhov, P. (2017). OpenAI Baselines. https://github.com/openai/baselines
Fisac, J.F., Gates, M.A., Hamrick, J.B., Liu, C., Hadfield-Menell, D., Palaniappan, M., Malik, D., Sastry, S.S., Griffiths, T.L., & Dragan, A.D. (2019). Pragmatic-pedagogic value alignment. Robotics Research, pp. 49–57.
Goecks, V. G., Gremillion, G. M., Lawhern, V. J., Valasek, J., & Waytowich, N. R. (2019). Efficiently combining human demonstrations and interventions for safe training of autonomous systems in real-time. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 2462–2470.
Grollman, D.H., & Jenkins, O.C. (2007). Dogged learning for robots. In: Proceedings 2007 IEEE International Conference on Robotics and Automation (ICRA).
Gupta, S., Davidson, J., Levine, S., Sukthankar, R., & Malik, J. (2017). Cognitive mapping and planning for visual navigation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Hadfield-Menell, D., Russell, S.J., Abbeel, P., & Dragan, A. (2016). Cooperative inverse reinforcement learning. In: Advances in Neural Information Processing Systems (NeurIPS).
Jain, A., Wojcik, B., Joachims, T., & Saxena, A. (2013). Learning trajectory preferences for manipulators via iterative improvement. In: Advances in Neural Information Processing Systems (NeurIPS).
Judah, K., Fern, A.P., & Dietterich, T.G. (2012). Active imitation learning via reduction to iid active learning. In: 2012 AAAI Fall Symposium Series.
Kelly, M., Sidrane, C., Driggs-Campbell, K., & Kochenderfer, M.J. (2019). HG-DAgger: Interactive imitation learning with human experts. In: 2019 International Conference on Robotics and Automation (ICRA).
Kim, B., Farahmand, A., Pineau, J., & Precup, D. (2013). Learning from limited demonstrations. In: Advances in Neural Information Processing Systems (NeurIPS).
Kim, B., & Pineau, J. (2013). Maximum mean discrepancy imitation learning. In: Robotics: Science and Systems (RSS)
Kollmitz, M., Koller, T., Boedecker, J., & Burgard, W. (2020). Learning human-aware robot navigation from physical interaction via inverse reinforcement learning. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 11025–11031. IEEE
Laskey, M., Chuck, C., Lee, J., Mahler, J., Krishnan, S., Jamieson, K., Dragan, A., & Goldberg, K. (2017). Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations. In: IEEE International Conference on Robotics and Automation (ICRA).
Laskey, M., Lee, J., Hsieh, W., Liaw, R., Mahler, J., Fox, R., & Goldberg, K. (2017). Iterative noise injection for scalable imitation learning. arXiv preprint arXiv:1703.09327
Laskey, M., Staszak, S., Hsieh, W.Y.S., Mahler, J., Pokorny, F.T., Dragan, A.D., & Goldberg, K. (2016). SHIV: Reducing supervisor burden in dagger using support vectors for efficient learning from demonstrations in high dimensional state spaces. In: 2016 IEEE International Conference on Robotics and Automation (ICRA).
Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., & Quillen, D. (2018). Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research (IJRR).
Loftin, R., Peng, B., MacGlashan, J., Littman, M. L., Taylor, M. E., Huang, J., & Roberts, D. L. (2016). Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning. Autonomous Agents and Multi-Agent Systems.
MacGlashan, J., Ho, M.K., Loftin, R., Peng, B., Wang, G., Roberts, D.L., Taylor, M.E., & Littman, M.L. (2017). Interactive learning from policy-dependent human feedback. In: Proceedings of the 34th International Conference on Machine Learning (ICML).
McPherson, D.L., Scobee, D.R., Menke, J., Yang, A.Y., & Sastry, S.S. (2018). Modeling supervisor safe sets for improving collaboration in human-robot teams. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 861–868. IEEE.
Menda, K., Driggs-Campbell, K.R., & Kochenderfer, M.J. (2018). EnsembleDAgger: A Bayesian approach to safe imitation learning. arXiv preprint arXiv:1807.08364
Osa, T., Pajarinen, J., Neumann, G., Bagnell, J.A., Abbeel, P., & Peters, J. (2018). An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2), 1–179.
Packard, B., & Ontañón, S. (2017). Policies for active learning from demonstration. In: 2017 AAAI Spring Symposium Series
Pomerleau, D.A. (1989). Alvinn: An autonomous land vehicle in a neural network. In: Advances in Neural Information Processing Systems (NeurIPS)
Ross, S., Gordon, G., & Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics (AIStats).
Ross, S., Melik-Barkhudarov, N., Shankar, K.S., Wendel, A., Dey, D., Bagnell, J.A., & Hebert, M. (2013). Learning monocular reactive UAV control in cluttered natural environments. In: IEEE International Conference on Robotics and Automation (ICRA).
Sadat, A., Ren, M., Pokrovsky, A., Lin, Y.C., Yumer, E., & Urtasun, R. (2019). Jointly learnable behavior and trajectory planning for self-driving vehicles. arXiv preprint arXiv:1910.04586
Sadigh, D., Sastry, S.S., Seshia, S.A., & Dragan, A. (2016). Information gathering actions over human internal state. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Saunders, W., Sastry, G., Stuhlmueller, A., & Evans, O. (2017). Trial without error: Towards safe reinforcement learning via human intervention. arXiv preprint arXiv:1707.05173
Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.
Spencer, J., Choudhury, S., Barnes, M., Schmittle, M., Chiang, M., Ramadge, P., & Srinivasa, S. (2020). Learning from interventions: Human-robot interaction as both explicit and implicit feedback. In: Robotics: Science and Systems (RSS).
Spencer, J., Choudhury, S., Venkatraman, A., Ziebart, B., & Bagnell, J.A. (2021). Feedback in imitation learning: The three regimes of covariate shift. arXiv preprint arXiv:2102.02872
Srinivasa, S.S., Lancaster, P., Michalove, J., Schmittle, M., Summers, C., Rockett, M., Smith, J.R., Choudhury, S., Mavrogiannis, C., & Sadeghi, F. (2019). MuSHR: A Low-Cost, Open-Source Robotic Racecar for Education and Research. arXiv preprint arXiv:1908.08031
Sun, W., Venkatraman, A., Gordon, G.J., Boots, B., & Bagnell, J.A. (2017). Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction. In: Proceedings of the 34th International Conference on Machine Learning (ICML).
Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th International Conference on Machine Learning (ICML).
Acknowledgements
This work was (partially) funded by the DARPA Dispersed Computing program, NIH R01 (R01EB019335), NSF CPS (#1544797), NSF NRI (#1637748), the Office of Naval Research, RCTA, Amazon, and Honda Research Institute USA.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This is one of the several papers published in Autonomous Robots comprising the Special Issue on Robotics: Science and Systems 2020.
Appendices
Proofs
1.1 Reduction to no-regret online learning
The general non-i.i.d. optimization we wish to solve is \(\min _{\theta } \; {\mathbb {E}}_{(s,a) \sim d^I_{\pi _\theta }} \ell _C(s,a,\theta ) + \lambda \, {\mathbb {E}}_{(s,a) \sim d_{\pi _\theta }} \ell _B(s,a,\theta )\), where both expectations are taken over distributions induced by the learner's own policy \(\pi _\theta \).
We’ll directly prove the general setting here rather than proving individually for \(\ell _C\) and \(\ell _B\).
We reduce this optimization problem to a sequence of convex losses \(\ell _i(\theta )\), where the i-th loss is a function of the distribution at that iteration, \(\ell _i(\theta )={\mathbb {E}}_{(s,a) \sim d^I_i} \ell _C(s,a,\theta ) + \lambda {\mathbb {E}}_{(s,a) \sim d_i} \ell _B(s,a,\theta )\). In our algorithm, the learner at iteration i applies Follow-the-Leader (FTL), i.e., \(\theta _{i+1} = \arg \min _{\theta } \sum _{j=1}^{i} \ell _j(\theta )\).
Since FTL is a no-regret algorithm, the average regret \(\gamma _N = \frac{1}{N}\sum _{i=1}^N \ell _i(\theta _i) - \min _{\theta }\frac{1}{N}\sum _{i=1}^N \ell _i(\theta )\) goes to 0 as \(N\rightarrow \infty \), with rate \({\tilde{O}}(\tfrac{1}{N})\) for strongly convex \(\ell _i\) (see Theorem 2.4 and Corollary 2.2 in Shalev-Shwartz (2012)).
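As a concrete illustration of this reduction, here is a minimal sketch of an FTL-style outer loop under the assumption of a generic convex, differentiable loss; the `rollout` and `per_iter_loss` callables and the toy quadratic example are placeholders, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def ftl_loop(theta0, rollout, per_iter_loss, num_iters=10):
    """Follow-the-Leader over the sequence of induced-distribution losses.

    rollout(theta)            -> dataset D_i gathered under the current policy
    per_iter_loss(theta, D_i) -> scalar convex loss ell_i(theta) on that dataset
    """
    theta = np.asarray(theta0, dtype=float)
    datasets = []
    for _ in range(num_iters):
        datasets.append(rollout(theta))  # induce d_i with the current learner
        # FTL: minimize the sum of all per-iteration losses seen so far
        objective = lambda th: sum(per_iter_loss(th, d) for d in datasets)
        theta = minimize(objective, theta).x
    return theta

# Toy usage: quadratic losses around "datasets" drawn under the current theta
rng = np.random.default_rng(0)
rollout = lambda th: th + 0.1 * rng.standard_normal(2)      # fake induced dataset
per_iter_loss = lambda th, d: float(np.sum((th - d) ** 2))  # convex surrogate loss
print(ftl_loop(np.zeros(2), rollout, per_iter_loss, num_iters=5))
```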
In this framework, we restate and prove Thm. 1.
Theorem 2
Let \(\ell _i(\theta ) = {\mathbb {E}}_{(s,a)\sim d_{\pi _{\theta _i}}} \ell (s,a,\theta )\). Also let \(\epsilon _N = \min _{\theta }\frac{1}{N} \sum _{i=1}^N \ell _i(\theta )\) be the loss of the best parameter in hindsight after N iterations. Let \(\gamma _N\) be the average regret of \(\theta _{1:N}\). Then there exists a \(\theta \in \theta _{1:N}\) s.t. \({\mathbb {E}}_{(s,a)\sim d_{\pi _{\theta }}} \ell (s,a,\theta ) \le \epsilon _N + \gamma _N\).
Proof
The performance of the best learner in the sequence \(\theta _1,\ldots ,\theta _N\) must be smaller than the average loss of each learner on its own induced distribution (the minimum is at most the average):
\(\min _{\theta \in \theta _{1:N}} {\mathbb {E}}_{(s,a)\sim d_{\pi _{\theta }}} \ell (s,a,\theta ) \le \frac{1}{N}\sum _{i=1}^N \ell _i(\theta _i)\).
Using (20) we have
\(\frac{1}{N}\sum _{i=1}^N \ell _i(\theta _i) \le \min _{\theta }\frac{1}{N}\sum _{i=1}^N \ell _i(\theta ) + \gamma _N = \epsilon _N + \gamma _N\).
\(\square \)
This proof can be extended to finite-sample cases following the original DAgger proofs. The theorem applies to each portion of the objective individually, yielding regret terms \(\gamma _N^B\) and \(\gamma _N^I\) that each go to zero as \(N\rightarrow \infty \); thus the combined objective, as well as each individual objective, is no-regret.
HG-DAgger counter-example
We construct a counter-example for HG-DAgger approaches (Kelly et al. 2019; Goecks et al. 2019; Bi et al. 2018) in Fig. 9. Recall that in HG-DAgger, we only use the intervention loss \(\ell _C(\cdot )\).
The MDP is such that the learner can choose between two actions, Left (L) and Right (R), only at states \(s_0\) and \(s_1\). Unknown to the learner, but known to the expert, some of the edges are associated with costs. The expert deems a “good enough” state to be one with value \(-9\). Hence, whenever the learner enters \(s_1\), the expert takes over to intervene and demonstrates \((s_1, L)\).
HG-DAgger keeps only this intervention data and uses it as a classification loss. Suppose it uses a tabular policy. If it learns the policy \((s_0,L)\), \((s_1,L)\), it will indeed achieve \(\ell _C(s,a,\theta )=0\). However, the expert will continue to intervene, as this policy always exits the good-enough state.
Let’s look at all policies and their implicit bounds and intervention losses. Assume we get a penalty of 1 for every bad state or misclassified action. We have:
1. Policy \((s_0, L), (s_1,L)\): Loss \(\ell _B = 2\), \(\ell _C=0\)
2. Policy \((s_0, L), (s_1,R)\): Loss \(\ell _B = 2\), \(\ell _C=1\)
3. Policy \((s_0, R), (s_1,L)\): Loss \(\ell _B = 0\), \(\ell _C=0\)
4. Policy \((s_0, R), (s_1,R)\): Loss \(\ell _B = 0\), \(\ell _C=0\)
The last two policies have the same intervention loss because the induced distribution is such that these policies never result in interventions (even though one learns an incorrect intervention action).
HG-DAgger looks only at the intervention loss \(\ell _C\) and hence may not end up learning \((s_0, R)\). EIL, on the other hand, will, as the sketch below illustrates.
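To make the comparison explicit, here is a small sketch using the loss values from the enumeration above: minimizing the intervention loss alone can select the policy starting with \((s_0, L)\), while adding the implicit bound loss rules it out. The weight \(\lambda =1\) is an arbitrary illustrative choice.

```python
# (policy, ell_B, ell_C) for the four tabular policies enumerated above
policies = [
    ("(s0,L),(s1,L)", 2, 0),
    ("(s0,L),(s1,R)", 2, 1),
    ("(s0,R),(s1,L)", 0, 0),
    ("(s0,R),(s1,R)", 0, 0),
]

lam = 1.0  # arbitrary trade-off weight for this illustration

# HG-DAgger-style selection: intervention loss only -- the bad policy 1 is among the minimizers
hg_best = min(policies, key=lambda p: p[2])
# EIL-style selection: combined loss -- only policies that take (s0, R) remain optimal
eil_best = min(policies, key=lambda p: p[2] + lam * p[1])

print("ell_C only    :", hg_best[0])   # returns "(s0,L),(s1,L)" here
print("ell_C + ell_B :", eil_best[0])  # returns a policy with (s0,R)
```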
Cite this article
Spencer, J., Choudhury, S., Barnes, M. et al. Expert Intervention Learning. Auton Robot 46, 99–113 (2022). https://doi.org/10.1007/s10514-021-10006-9