Abstract
Reactive synthesis algorithms allow automatic construction of policies that control an environment modeled as a Markov Decision Process (MDP) and are optimal with respect to high-level temporal logic specifications. However, they assume that the MDP model is known a priori. Reinforcement Learning (RL) algorithms, in contrast, are designed to learn an optimal policy when the transition probabilities of the MDP are unknown, but they require the user to associate local rewards with transitions. The appeal of high-level temporal logic specifications has motivated research on RL algorithms that synthesize policies from such specifications. To understand the techniques in the growing body of resulting literature, and the nuanced variations in their theoretical guarantees, we develop a formal framework for defining transformations among RL tasks with different forms of objectives. We define the notion of a sampling-based reduction that transforms a given MDP into another one which can be simulated even when the transition probabilities of the original MDP are unknown. We formalize the notions of preservation of optimal policies, convergence, and robustness of such reductions. We then use our framework to restate known results, establish new results that fill in some gaps, and identify open problems. In particular, we show that certain kinds of reductions from LTL specifications to reward-based ones do not exist, and we prove the non-existence of RL algorithms with PAC-MDP guarantees for safety specifications.
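To make the flavor of a sampling-based reduction concrete, the sketch below (our illustration, not the paper's construction) wraps black-box sampling access to an MDP with a deterministic automaton tracking a specification, yielding a product environment that emits a reward whenever the automaton accepts. All names (`BlackBoxMDP`, `ProductEnv`, the toy two-state automaton) are hypothetical.

```python
# Illustrative sketch of a sampling-based reduction: given only a black-box
# sampler for an MDP M, simulate a product MDP M' whose extra state component
# tracks a deterministic automaton for the specification, and emit a reward
# when the automaton reaches an accepting state. Names and the toy automaton
# are hypothetical; this is not the paper's API.
import random

class BlackBoxMDP:
    """Sampler access to M: we can reset and step, but the transition
    probabilities are unknown to the learner."""
    def reset(self):
        return "s0"
    def step(self, state, action):
        # Unknown dynamics; a toy coin flip stands in for them here.
        return random.choice(["s0", "goal"])

class ProductEnv:
    """Simulates the product MDP M' = M x A using only samples from M."""
    def __init__(self, mdp, delta, q0, accepting):
        self.mdp, self.delta, self.q0, self.accepting = mdp, delta, q0, accepting
    def reset(self):
        self.s, self.q = self.mdp.reset(), self.q0
        return (self.s, self.q)
    def step(self, action):
        self.s = self.mdp.step(self.s, action)   # sample the original MDP
        self.q = self.delta(self.q, self.s)      # advance the automaton
        reward = 1.0 if self.q in self.accepting else 0.0
        return (self.s, self.q), reward

# Toy specification "eventually reach goal" as a two-state automaton.
delta = lambda q, s: "q_acc" if (q == "q_acc" or s == "goal") else "q0"
env = ProductEnv(BlackBoxMDP(), delta, "q0", {"q_acc"})
state = env.reset()
state, reward = env.step("a")
```

Any off-the-shelf RL algorithm can then be run on the simulated product environment; the framework developed in this chapter formalizes when such a reduction preserves optimal policies.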
Notes
1. A distribution \(\eta \) over initial states can be modeled by adding a new state \(s_0\) from which taking any action leads to a state sampled from \(\eta \).
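A minimal sketch of this construction, assuming black-box sampler access of the form `step(s, a)`; the names `add_initial_state` and `"s_0"` are illustrative.

```python
# Hypothetical illustration of the footnote: wrap an MDP sampler so that a
# fresh state "s_0" acts as the single initial state, with every action from
# "s_0" leading to a state drawn from the initial distribution eta.
import random

def add_initial_state(step, eta):
    """`step(s, a)` samples the original MDP; `eta` maps state -> probability."""
    states, probs = zip(*eta.items())
    def wrapped_step(s, a):
        if s == "s_0":
            return random.choices(states, weights=probs)[0]
        return step(s, a)
    return wrapped_step

# Example: eta puts probability 0.3 on "u" and 0.7 on "v".
step = add_initial_state(lambda s, a: s, {"u": 0.3, "v": 0.7})
print(step("s_0", "any_action"))  # "u" with prob 0.3, "v" with prob 0.7
```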
Acknowledgements
We would like to thank Michael Littman, Sheila McIlraith, Ufuk Topcu, Ashutosh Trivedi, and the anonymous reviewers for their feedback on an early version of this paper. This work was supported in part by a CRA/NSF Computing Innovations Fellow Award, the DARPA Assured Autonomy project under Contract No. FA8750-18-C-0090, NSF Awards CCF-1910769 and CCF-1917852, ARO Award W911NF-20-1-0080, and ONR Award N00014-20-1-2115.
Copyright information
© 2022 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Alur, R., Bansal, S., Bastani, O., Jothimurugan, K. (2022). A Framework for Transforming Specifications in Reinforcement Learning. In: Raskin, JF., Chatterjee, K., Doyen, L., Majumdar, R. (eds) Principles of Systems Design. Lecture Notes in Computer Science, vol 13660. Springer, Cham. https://doi.org/10.1007/978-3-031-22337-2_29
DOI: https://doi.org/10.1007/978-3-031-22337-2_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22336-5
Online ISBN: 978-3-031-22337-2