Abstract
It is challenging to use reinforcement learning (RL) in cyber-physical systems because of the lack of safety guarantees during learning. Although various proposals aim to reduce undesired behaviors during learning, most of these techniques require prior system knowledge, which limits their applicability. This paper aims to reduce undesired behaviors during learning without requiring any prior system knowledge. We propose dynamic shielding: an extension, via automata learning, of the model-based safe RL technique called shielding. Dynamic shielding constructs an approximate system model in parallel with RL using a variant of the RPNI algorithm and suppresses undesired exploration with a shield constructed from the learned model. Through this combination, potentially unsafe actions can be foreseen before the agent experiences them. Experiments show that our dynamic shield significantly decreases the number of undesired events during training.
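To make this loop concrete, the sketch below shows one way a shield learned from execution traces can restrict the agent's action set before each step. This is a minimal conceptual sketch, not the authors' implementation: the environment and agent interfaces (env.actions, env.step, agent.act, agent.learn) are hypothetical placeholders, states are assumed hashable, and LearnedShield merely memorises (state, action) pairs that led to unsafe events, whereas the paper derives the shield from a Mealy machine learned with an RPNI variant.

# Conceptual sketch of a dynamic-shielding training loop (illustration only;
# all interfaces below are hypothetical stand-ins, not the paper's artifact).

class LearnedShield:
    """Approximate safety model built from observed traces.

    Simplification: we only remember (state, action) pairs that were
    followed by an unsafe event. A real dynamic shield would instead
    merge states RPNI-style and block actions that can reach an unsafe
    state in the learned automaton.
    """
    def __init__(self):
        self.unsafe = set()

    def update(self, trace):
        # trace: list of (state, action, unsafe_flag) triples from one episode
        for state, action, unsafe_flag in trace:
            if unsafe_flag:
                self.unsafe.add((state, action))

    def safe_actions(self, state, actions):
        allowed = [a for a in actions if (state, a) not in self.unsafe]
        return allowed or list(actions)  # never leave the agent without a choice

def train(env, agent, episodes=100):
    shield = LearnedShield()
    for _ in range(episodes):
        state, done, trace = env.reset(), False, []
        while not done:
            # Preemptive shielding: prune the action set *before* selection.
            actions = shield.safe_actions(state, env.actions)
            action = agent.act(state, actions)
            next_state, reward, done, unsafe = env.step(action)
            trace.append((state, action, unsafe))
            agent.learn(state, action, reward, next_state)
            state = next_state
        shield.update(trace)  # refine the shield from the new trace

The essential point is that the shield is refined in parallel with RL and applied preemptively, so actions that the learned model already classifies as unsafe are removed before the agent can try them.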
S. Pruekprasert and T. Takisaka: this work was done while S.P. and T.T. were employed at NII, Tokyo.
Notes
1. The shield we use in this paper is the variant called preemptive shield in [1]. It is straightforward to apply our framework to the classic shield, called post-posed shield; the sketch after these notes contrasts the two variants.
2. The artifact is publicly available at https://doi.org/10.5281/zenodo.6906673.
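For readers unfamiliar with the two shield variants mentioned in note 1, the following sketch contrasts them. It is an illustration under assumed interfaces only: shield.is_safe, env.actions, env.step, and agent.act are hypothetical placeholders, not API from the paper's artifact.

def preemptive_step(env, agent, shield, state):
    # Preemptive shield: restrict the action set before the agent chooses.
    allowed = [a for a in env.actions if shield.is_safe(state, a)]
    action = agent.act(state, allowed or list(env.actions))
    return env.step(action)

def post_posed_step(env, agent, shield, state):
    # Post-posed shield: the agent chooses freely; an unsafe choice is overridden.
    action = agent.act(state, env.actions)
    if not shield.is_safe(state, action):
        safe = [a for a in env.actions if shield.is_safe(state, a)]
        if safe:
            action = safe[0]
    return env.step(action)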
References
Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: McIlraith, S.A., Weinberger, K.Q. (eds.) Proceedings of the AAAI 2018, pp. 2669–2678. AAAI Press (2018)
Avni, G., Bloem, R., Chatterjee, K., Henzinger, T.A., Könighofer, B., Pranger, S.: Run-time optimization for learned controllers through quantitative games. In: Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11561, pp. 630–649. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-25540-4_36
Bharadwaj, S., Bloem, R., Dimitrova, R., Könighofer, B., Topcu, U.: Synthesis of minimum-cost shields for multi-agent systems. In: Proceedings of the ACC 2019, pp. 1048–1055. IEEE (2019)
Bloem, R., Jensen, P.G., Könighofer, B., Larsen, K.G., Lorber, F., Palmisano, A.: It’s time to play safe: shield synthesis for timed systems. CoRR abs/2006.16688 (2020)
Bloem, R., Könighofer, B., Könighofer, R., Wang, C.: Shield synthesis: runtime enforcement for reactive systems. In: Baier, C., Tinelli, C. (eds.) TACAS 2015. LNCS, vol. 9035, pp. 533–548. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46681-0_51
Bouton, M., Karlsson, J., Nakhaei, A., Fujimura, K., Kochenderfer, M.J., Tumova, J.: Reinforcement learning with probabilistic guarantees for autonomous driving. CoRR abs/1904.07189 (2019)
Brockman, G., et al.: OpenAI Gym. CoRR abs/1606.01540 (2016)
Cheng, R., Orosz, G., Murray, R.M., Burdick, J.W.: End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In: Proceedings of the AAAI 2019, pp. 3387–3395. AAAI Press (2019)
Chevalier-Boisvert, M.: Gym-MiniWorld Environment for OpenAI Gym (2018). https://github.com/maximecb/gym-miniworld
García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16, 1437–1480 (2015)
Hasanbeig, M., Abate, A., Kroening, D.: Cautious reinforcement learning with logical constraints. In: Seghrouchni, A.E.F., Sukthankar, G., An, B., Yorke-Smith, N. (eds.) Proceedings of the AAMAS 2020, pp. 483–491. IFAAMS (2020)
Hunt, N., Fulton, N., Magliacane, S., Hoang, T.N., Das, S., Solar-Lezama, A.: Verifiably safe exploration for end-to-end reinforcement learning. In: Bogomolov, S., Jungers, R.M. (eds.) Proceedings of the HSCC 2021, pp. 14:1–14:11. ACM (2021)
Isberner, M., Howar, F., Steffen, B.: The open-source LearnLib - a framework for active automata learning. In: Kroening, D., Păsăreanu, C.S. (eds.) CAV 2015. LNCS, vol. 9206, pp. 487–495. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21690-4_32
Jansen, N., Könighofer, B., Junges, S., Serban, A., Bloem, R.: Safe reinforcement learning using probabilistic shields (invited paper). In: Konnov, I., Kovács, L. (eds.) Proceedings of the CONCUR 2020. LIPIcs, vol. 171, pp. 3:1–3:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2020)
Kupferman, O., Lampert, R.: On the construction of fine automata for safety properties. In: Graf, S., Zhang, W. (eds.) ATVA 2006. LNCS, vol. 4218, pp. 110–124. Springer, Heidelberg (2006). https://doi.org/10.1007/11901914_11
Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo one DFA learning competition and a new evidence-driven state merging algorithm. In: Honavar, V., Slutzki, G. (eds.) ICGI 1998. LNCS, vol. 1433, pp. 1–12. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0054059
López, D., García, P.: On the inference of finite state automata from positive and negative data. In: Heinz, J., Sempere, J.M. (eds.) Topics in Grammatical Inference, pp. 73–112. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-48395-4_4
Mao, H., Chen, Y., Jaeger, M., Nielsen, T.D., Larsen, K.G., Nielsen, B.: Learning Markov decision processes for model checking. In: Fahrenberg, U., Legay, A., Thrane, C.R. (eds.) Proceedings of the QFM 2012. EPTCS, vol. 103, pp. 49–63 (2012)
Mnih, V., et al.: Playing Atari with deep reinforcement learning. CoRR abs/1312.5602 (2013)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Oncina, J., García, P.: Identifying regular languages in polynomial time. Series in Machine Perception and Artificial Intelligence, pp. 99–108 (1993)
Plappert, M.: Keras-RL (2016). https://github.com/keras-rl/keras-rl
Pranger, S., Könighofer, B., Tappler, M., Deixelberger, M., Jansen, N., Bloem, R.: Adaptive shielding under uncertainty. In: Proceedings of the ACC 2021, pp. 3467–3474. IEEE (2021)
Raffin, A., Hill, A., Ernestus, M., Gleave, A., Kanervisto, A., Dormann, N.: Stable Baselines3 (2019). https://github.com/DLR-RM/stable-baselines3
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. CoRR abs/1707.06347 (2017)
Silver, D., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354–359 (2017)
Sutton, R.S., Barto, A.G.: Reinforcement Learning - An Introduction. Adaptive Computation and Machine Learning. MIT Press (1998)
Wu, M., Wang, J., Deshmukh, J., Wang, C.: Shield synthesis for real: enforcing safety in cyber-physical systems. In: Barrett, C.W., Yang, J. (eds.) Proceedings of the FMCAD 2019, pp. 129–137. IEEE (2019)
Acknowledgements
This work is partially supported by JST ERATO HASUO Metamathematics for Systems Design Project (No. JPMJER1603). Masaki Waga is also supported by JST ACT-X Grant No. JPMJAX200U. Stefan Klikovits is also supported by JSPS Grant-in-Aid No. 20K23334. Sasinee Pruekprasert is also supported by JSPS Grant-in-Aid No. 21K14191. Toru Takisaka is also supported by NSFC Research Fund for International Young Scientists No. 62150410437.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Waga, M., Castellano, E., Pruekprasert, S., Klikovits, S., Takisaka, T., Hasuo, I. (2022). Dynamic Shielding for Reinforcement Learning in Black-Box Environments. In: Bouajjani, A., Holík, L., Wu, Z. (eds) Automated Technology for Verification and Analysis. ATVA 2022. Lecture Notes in Computer Science, vol 13505. Springer, Cham. https://doi.org/10.1007/978-3-031-19992-9_2
DOI: https://doi.org/10.1007/978-3-031-19992-9_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19991-2
Online ISBN: 978-3-031-19992-9