
Dynamic Shielding for Reinforcement Learning in Black-Box Environments

  • Conference paper
Automated Technology for Verification and Analysis (ATVA 2022)

Abstract

It is challenging to use reinforcement learning (RL) in cyber-physical systems due to the lack of safety guarantees during learning. Although there have been various proposals to reduce undesired behaviors during learning, most of these techniques require prior system knowledge, and their applicability is limited. This paper aims to reduce undesired behaviors during learning without requiring any prior system knowledge. We propose dynamic shielding: an extension, via automata learning, of the model-based safe RL technique called shielding. The dynamic shielding technique constructs an approximate system model in parallel with RL using a variant of the RPNI algorithm and suppresses undesired exploration with a shield constructed from the learned model. Through this combination, potentially unsafe actions can be foreseen before the agent experiences them. Experiments show that our dynamic shield significantly decreases the number of undesired events during training.
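
For intuition, the following is a minimal, self-contained sketch of the loop described above: traces observed during RL are used to rebuild an approximate model, and a preemptive shield derived from that model removes actions it already knows to be unsafe. The PrefixTreeShield class, toy_env function, and random policy below are hypothetical stand-ins chosen for brevity; they are not the authors' implementation, and this shield is far cruder than the RPNI-variant learner used in the paper.

```python
import random

# Hypothetical stand-in for the learned shield: it only remembers exact action
# prefixes that previously ended in a safety violation and blocks re-exploring
# them. The paper's RPNI-variant learner generalizes far beyond this.
class PrefixTreeShield:
    def __init__(self):
        self.unsafe_prefixes = set()

    def learn(self, trace, violated):
        if violated:
            self.unsafe_prefixes.add(tuple(trace))  # remember the fatal prefix

    def allows(self, prefix):
        return tuple(prefix) not in self.unsafe_prefixes


def run_episode(step_fn, actions, shield, horizon=20):
    """One shielded episode; step_fn(trace) -> (reward, violated, done)."""
    trace, violated = [], False
    for _ in range(horizon):
        # Preemptive shielding: drop actions whose resulting prefix is already
        # known to be unsafe, before the (here: random) policy chooses.
        allowed = [a for a in actions if shield.allows(trace + [a])]
        if not allowed:          # if the shield blocks everything, fall back
            allowed = list(actions)
        trace.append(random.choice(allowed))
        reward, violated, done = step_fn(trace)
        if violated or done:
            break
    shield.learn(trace, violated)  # refine the shield with the finished trace
    return violated


# Toy black-box environment: any occurrence of the action pair "a", "a" is unsafe.
def toy_env(trace):
    violated = trace[-2:] == ["a", "a"]
    return 0.0, violated, len(trace) >= 5


shield = PrefixTreeShield()
violations = sum(run_episode(toy_env, ["a", "b"], shield) for _ in range(100))
print("episodes with a safety violation:", violations)
```

Running the sketch shows the intended qualitative effect: once a violating prefix has been observed, the shield blocks it in later episodes, so violations become rarer as training proceeds.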

S. Pruekprasert and T. Takisaka—The work was done during the employment of S.P. and T.T. at NII, Tokyo.


Notes

  1. The shield we use in this paper is the variant called preemptive shield in [1]. It is straightforward to apply our framework to the classic post-posed shield instead; a sketch contrasting the two placements follows these notes.

  2. The artifact is publicly available at https://doi.org/10.5281/zenodo.6906673.
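
The following sketch contrasts the two shield placements mentioned in note 1. The shield.allows(state, action) and policy(state, actions) interfaces are hypothetical and chosen only for illustration; this is not the code of this paper or of [1].

```python
# Hypothetical interfaces: shield.allows(state, action) -> bool,
# policy(state, actions) -> action. Both placements enforce the same property;
# they differ in where enforcement happens relative to the agent's choice.

def preemptive_step(state, actions, policy, shield):
    # Preemptive shield (used in this paper): unsafe actions are removed from
    # the action set before the learning agent chooses.
    allowed = [a for a in actions if shield.allows(state, a)] or list(actions)
    return policy(state, allowed)


def post_posed_step(state, actions, policy, shield):
    # Post-posed shield: the agent chooses freely and the shield overrides the
    # chosen action afterwards if it is deemed unsafe.
    action = policy(state, actions)
    if shield.allows(state, action):
        return action
    safe = [a for a in actions if shield.allows(state, a)]
    return safe[0] if safe else action  # substitute a safe action if one exists
```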

References

  1. Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: McIlraith, S.A., Weinberger, K.Q. (eds.) Proceedings of the AAAI 2018, pp. 2669–2678. AAAI Press (2018)

  2. Avni, G., Bloem, R., Chatterjee, K., Henzinger, T.A., Könighofer, B., Pranger, S.: Run-time optimization for learned controllers through quantitative games. In: Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11561, pp. 630–649. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25540-4_36

  3. Bharadwaj, S., Bloem, R., Dimitrova, R., Könighofer, B., Topcu, U.: Synthesis of minimum-cost shields for multi-agent systems. In: Proceedings of the ACC 2019, pp. 1048–1055. IEEE (2019)

  4. Bloem, R., Jensen, P.G., Könighofer, B., Larsen, K.G., Lorber, F., Palmisano, A.: It’s time to play safe: shield synthesis for timed systems. CoRR abs/2006.16688 (2020)

  5. Bloem, R., Könighofer, B., Könighofer, R., Wang, C.: Shield synthesis: runtime enforcement for reactive systems. In: Baier, C., Tinelli, C. (eds.) TACAS 2015. LNCS, vol. 9035, pp. 533–548. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46681-0_51

  6. Bouton, M., Karlsson, J., Nakhaei, A., Fujimura, K., Kochenderfer, M.J., Tumova, J.: Reinforcement learning with probabilistic guarantees for autonomous driving. CoRR abs/1904.07189 (2019)

  7. Brockman, G., et al.: OpenAI Gym. CoRR abs/1606.01540 (2016)

  8. Cheng, R., Orosz, G., Murray, R.M., Burdick, J.W.: End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In: Proceedings of the AAAI 2019, pp. 3387–3395. AAAI Press (2019)

  9. Chevalier-Boisvert, M.: Gym-MiniWorld environment for OpenAI Gym (2018). https://github.com/maximecb/gym-miniworld

  10. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16, 1437–1480 (2015)

  11. Hasanbeig, M., Abate, A., Kroening, D.: Cautious reinforcement learning with logical constraints. In: Seghrouchni, A.E.F., Sukthankar, G., An, B., Yorke-Smith, N. (eds.) Proceedings of the AAMAS 2020, pp. 483–491. IFAAMAS (2020)

  12. Hunt, N., Fulton, N., Magliacane, S., Hoang, T.N., Das, S., Solar-Lezama, A.: Verifiably safe exploration for end-to-end reinforcement learning. In: Bogomolov, S., Jungers, R.M. (eds.) Proceedings of the HSCC 2021, pp. 14:1–14:11. ACM (2021)

  13. Isberner, M., Howar, F., Steffen, B.: The open-source LearnLib - a framework for active automata learning. In: Kroening, D., Păsăreanu, C.S. (eds.) CAV 2015. LNCS, vol. 9206, pp. 487–495. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21690-4_32

  14. Jansen, N., Könighofer, B., Junges, S., Serban, A., Bloem, R.: Safe reinforcement learning using probabilistic shields (invited paper). In: Konnov, I., Kovács, L. (eds.) Proceedings of the CONCUR 2020. LIPIcs, vol. 171, pp. 3:1–3:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2020)

  15. Kupferman, O., Lampert, R.: On the construction of fine automata for safety properties. In: Graf, S., Zhang, W. (eds.) ATVA 2006. LNCS, vol. 4218, pp. 110–124. Springer, Heidelberg (2006). https://doi.org/10.1007/11901914_11

  16. Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In: Honavar, V., Slutzki, G. (eds.) ICGI 1998. LNCS, vol. 1433, pp. 1–12. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0054059

  17. López, D., García, P.: On the inference of finite state automata from positive and negative data. In: Heinz, J., Sempere, J.M. (eds.) Topics in Grammatical Inference, pp. 73–112. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-48395-4_4

  18. Mao, H., Chen, Y., Jaeger, M., Nielsen, T.D., Larsen, K.G., Nielsen, B.: Learning Markov decision processes for model checking. In: Fahrenberg, U., Legay, A., Thrane, C.R. (eds.) Proceedings of the QFM 2012. EPTCS, vol. 103, pp. 49–63 (2012)

  19. Mnih, V., et al.: Playing Atari with deep reinforcement learning. CoRR abs/1312.5602 (2013)

  20. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  21. Oncina, J., García, P.: Identifying regular languages in polynomial time. Series in Machine Perception and Artificial Intelligence, pp. 99–108 (1993)

  22. Plappert, M.: Keras-RL (2016). https://github.com/keras-rl/keras-rl

  23. Pranger, S., Könighofer, B., Tappler, M., Deixelberger, M., Jansen, N., Bloem, R.: Adaptive shielding under uncertainty. In: Proceedings of the ACC 2021, pp. 3467–3474. IEEE (2021)

  24. Raffin, A., Hill, A., Ernestus, M., Gleave, A., Kanervisto, A., Dormann, N.: Stable Baselines3 (2019). https://github.com/DLR-RM/stable-baselines3

  25. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. CoRR abs/1707.06347 (2017)

  26. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)

  27. Sutton, R.S., Barto, A.G.: Reinforcement Learning - An Introduction. Adaptive Computation and Machine Learning. MIT Press (1998)

  28. Wu, M., Wang, J., Deshmukh, J., Wang, C.: Shield synthesis for real: enforcing safety in cyber-physical systems. In: Barrett, C.W., Yang, J. (eds.) Proceedings of the FMCAD 2019, pp. 129–137. IEEE (2019)


Acknowledgements

This work is partially supported by JST ERATO HASUO Metamathematics for Systems Design Project (No. JPMJER1603). Masaki Waga is also supported by JST ACT-X Grant No. JPMJAX200U. Stefan Klikovits is also supported by JSPS Grant-in-Aid No. 20K23334. Sasinee Pruekprasert is also supported by JSPS Grant-in-Aid No. 21K14191. Toru Takisaka is also supported by NSFC Research Fund for International Young Scientists No. 62150410437.

Author information


Corresponding author

Correspondence to Masaki Waga.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Waga, M., Castellano, E., Pruekprasert, S., Klikovits, S., Takisaka, T., Hasuo, I. (2022). Dynamic Shielding for Reinforcement Learning in Black-Box Environments. In: Bouajjani, A., Holík, L., Wu, Z. (eds) Automated Technology for Verification and Analysis. ATVA 2022. Lecture Notes in Computer Science, vol 13505. Springer, Cham. https://doi.org/10.1007/978-3-031-19992-9_2


  • DOI: https://doi.org/10.1007/978-3-031-19992-9_2


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19991-2

  • Online ISBN: 978-3-031-19992-9

  • eBook Packages: Computer Science, Computer Science (R0)
