
A Framework for Transforming Specifications in Reinforcement Learning

Chapter in: Principles of Systems Design

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13660)


Abstract

Reactive synthesis algorithms allow automatic construction of policies that control an environment modeled as a Markov Decision Process (MDP) and are optimal with respect to high-level temporal logic specifications. However, they assume that the MDP model is known a priori. Reinforcement learning (RL) algorithms, in contrast, are designed to learn an optimal policy when the transition probabilities of the MDP are unknown, but require the user to associate local rewards with transitions. The appeal of high-level temporal logic specifications has motivated research on RL algorithms that synthesize policies directly from such specifications. To understand the techniques in the growing body of resulting literature, and the nuanced variations in their theoretical guarantees, we develop a formal framework for defining transformations among RL tasks with different forms of objectives. We define the notion of a sampling-based reduction, which transforms a given MDP into another one that can be simulated even when the transition probabilities of the original MDP are unknown. We formalize the notions of preservation of optimal policies, convergence, and robustness of such reductions. We then use our framework to restate known results, establish new results that fill in some gaps, and identify open problems. In particular, we show that certain kinds of reductions from LTL specifications to reward-based ones do not exist, and we prove the non-existence of RL algorithms with PAC-MDP guarantees for safety specifications.
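
To make the notion of a sampling-based reduction concrete, the following sketch composes a black-box simulator of the unknown MDP with a deterministic automaton that monitors a specification, and emits reward 1 whenever the automaton is in an accepting state. This is only a minimal illustration under assumed interfaces; the simulator, automaton, and labeling function named below are hypothetical and do not correspond to the paper's exact construction.

class ProductSimulator:
    """Simulates the product of an unknown MDP (accessed only through its simulator)
    and a deterministic specification automaton."""

    def __init__(self, sim, automaton, labels):
        self.sim = sim        # black-box simulator: reset() and step(action) sample the unknown MDP
        self.aut = automaton  # deterministic automaton with initial_state, transition(q, props), is_accepting(q)
        self.labels = labels  # maps an MDP state to the atomic propositions it satisfies
        self.state = None

    def reset(self):
        s = self.sim.reset()
        self.state = (s, self.aut.initial_state)
        return self.state

    def step(self, action):
        s, q = self.state
        s_next = self.sim.step(action)                        # sample from the unknown P(s' | s, a)
        q_next = self.aut.transition(q, self.labels(s_next))  # deterministic automaton update
        self.state = (s_next, q_next)
        reward = 1.0 if self.aut.is_accepting(q_next) else 0.0
        return self.state, reward

Because the automaton update is deterministic, the product MDP can be sampled whenever the original simulator can, so a standard reward-based RL algorithm (for example, Q-learning) can be run on the transformed task without knowledge of the transition probabilities.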


Notes

  1. A distribution \(\eta \) over initial states can be modeled by adding a new state \(s_0\) from which taking any action leads to a state sampled from \(\eta \); a small code sketch of this construction follows below.
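
    As an illustration of this construction, here is a minimal sketch assuming a hypothetical simulator interface: the original MDP's sampler is a function from a state-action pair to a sampled successor, and \(\eta \) is given as a finite dictionary from states to probabilities. None of these names come from the paper.

    import random

    S0 = "s0"  # fresh state, assumed not to occur in the original MDP

    class MDPWithInitialDistribution:
        def __init__(self, step_fn, eta):
            self.step_fn = step_fn  # (state, action) -> sampled successor state in the original MDP
            self.eta = eta          # dict mapping each initial state to its probability

        def reset(self):
            # The transformed MDP always starts in the fresh state s0.
            return S0

        def step(self, state, action):
            if state == S0:
                # Every action taken in s0 leads to a state sampled from eta.
                states, probs = zip(*self.eta.items())
                return random.choices(states, weights=probs, k=1)[0]
            # Otherwise behave exactly like the original (unknown) MDP.
            return self.step_fn(state, action)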


Acknowledgements

We would like to thank Michael Littman, Sheila McIlraith, Ufuk Topcu, Ashutosh Trivedi and the anonymous reviewers for their feedback on an early version of this paper. This work was supported in part by a CRA/NSF Computing Innovations Fellow Award, the DARPA Assured Autonomy project under Contract No. FA8750-18-C-0090, NSF Awards CCF-1910769 and CCF-1917852, ARO Award W911NF-20-1-0080, and ONR Award N00014-20-1-2115.

Author information

Corresponding author

Correspondence to Kishor Jothimurugan.



Copyright information

© 2022 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Alur, R., Bansal, S., Bastani, O., Jothimurugan, K. (2022). A Framework for Transforming Specifications in Reinforcement Learning. In: Raskin, JF., Chatterjee, K., Doyen, L., Majumdar, R. (eds) Principles of Systems Design. Lecture Notes in Computer Science, vol 13660. Springer, Cham. https://doi.org/10.1007/978-3-031-22337-2_29


  • DOI: https://doi.org/10.1007/978-3-031-22337-2_29


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-22336-5

  • Online ISBN: 978-3-031-22337-2

  • eBook Packages: Computer Science, Computer Science (R0)
