Abstract
Reactive synthesis algorithms allow automatic construction of policies that control an environment modeled as a Markov Decision Process (MDP) and are optimal with respect to high-level temporal logic specifications. However, they assume that the MDP model is known a priori. Reinforcement Learning (RL) algorithms, in contrast, are designed to learn an optimal policy when the transition probabilities of the MDP are unknown, but they require the user to associate local rewards with transitions. The appeal of high-level temporal logic specifications has motivated research on RL algorithms that synthesize policies from such specifications. To understand the techniques in the growing body of resulting literature, and the nuanced variations in their theoretical guarantees, we develop a formal framework for defining transformations among RL tasks with different forms of objectives. We define the notion of a sampling-based reduction that transforms a given MDP into another one which can be simulated even when the transition probabilities of the original MDP are unknown. We formalize the notions of preservation of optimal policies, convergence, and robustness of such reductions. We then use our framework to restate known results, establish new results that fill in some gaps, and identify open problems. In particular, we show that certain kinds of reductions from LTL specifications to reward-based ones do not exist, and we prove the non-existence of RL algorithms with PAC-MDP guarantees for safety specifications.
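To make the flavor of a sampling-based reduction concrete, the sketch below (our illustration, not the paper's construction) wraps black-box sampling access to an MDP with a deterministic automaton tracking a specification, yielding a product environment that emits a reward whenever the automaton accepts. All names (`BlackBoxMDP`, `ProductEnv`, the toy two-state automaton) are hypothetical.

```python
# Illustrative sketch of a sampling-based reduction: given only a black-box
# sampler for an MDP M, simulate a product MDP M' whose extra state component
# tracks a deterministic automaton for the specification, and emit a reward
# when the automaton reaches an accepting state. Names and the toy automaton
# are hypothetical; this is not the paper's API.
import random

class BlackBoxMDP:
    """Sampler access to M: we can reset and step, but the transition
    probabilities are unknown to the learner."""
    def reset(self):
        return "s0"
    def step(self, state, action):
        # Unknown dynamics; a toy coin flip stands in for them here.
        return random.choice(["s0", "goal"])

class ProductEnv:
    """Simulates the product MDP M' = M x A using only samples from M."""
    def __init__(self, mdp, delta, q0, accepting):
        self.mdp, self.delta, self.q0, self.accepting = mdp, delta, q0, accepting
    def reset(self):
        self.s, self.q = self.mdp.reset(), self.q0
        return (self.s, self.q)
    def step(self, action):
        self.s = self.mdp.step(self.s, action)   # sample the original MDP
        self.q = self.delta(self.q, self.s)      # advance the automaton
        reward = 1.0 if self.q in self.accepting else 0.0
        return (self.s, self.q), reward

# Toy specification "eventually reach goal" as a two-state automaton.
delta = lambda q, s: "q_acc" if (q == "q_acc" or s == "goal") else "q0"
env = ProductEnv(BlackBoxMDP(), delta, "q0", {"q_acc"})
state = env.reset()
state, reward = env.step("a")
```

Any off-the-shelf RL algorithm can then be run on the simulated product environment; the framework developed in this chapter formalizes when such a reduction preserves optimal policies.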
Notes
1. A distribution \(\eta \) over initial states can be modeled by adding a new state \(s_0\) from which taking any action leads to a state sampled from \(\eta \).
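A minimal sketch of this construction, assuming black-box sampler access of the form `step(s, a)`; the names `add_initial_state` and `"s_0"` are illustrative.

```python
# Hypothetical illustration of the footnote: wrap an MDP sampler so that a
# fresh state "s_0" acts as the single initial state, with every action from
# "s_0" leading to a state drawn from the initial distribution eta.
import random

def add_initial_state(step, eta):
    """`step(s, a)` samples the original MDP; `eta` maps state -> probability."""
    states, probs = zip(*eta.items())
    def wrapped_step(s, a):
        if s == "s_0":
            return random.choices(states, weights=probs)[0]
        return step(s, a)
    return wrapped_step

# Example: eta puts probability 0.3 on "u" and 0.7 on "v".
step = add_initial_state(lambda s, a: s, {"u": 0.3, "v": 0.7})
print(step("s_0", "any_action"))  # "u" with prob 0.3, "v" with prob 0.7
```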
Acknowledgements
We would like to thank Michael Littman, Sheila McIlraith, Ufuk Topcu, Ashutosh Trivedi, and the anonymous reviewers for their feedback on an early version of this paper. This work was supported in part by a CRA/NSF Computing Innovations Fellow Award, the DARPA Assured Autonomy project under Contract No. FA8750-18-C-0090, NSF Awards CCF-1910769 and CCF-1917852, ARO Award W911NF-20-1-0080, and ONR Award N00014-20-1-2115.
Copyright information
© 2022 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Alur, R., Bansal, S., Bastani, O., Jothimurugan, K. (2022). A Framework for Transforming Specifications in Reinforcement Learning. In: Raskin, JF., Chatterjee, K., Doyen, L., Majumdar, R. (eds) Principles of Systems Design. Lecture Notes in Computer Science, vol 13660. Springer, Cham. https://doi.org/10.1007/978-3-031-22337-2_29
DOI: https://doi.org/10.1007/978-3-031-22337-2_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22336-5
Online ISBN: 978-3-031-22337-2