
Differentially Private Reinforcement Learning with Linear Function Approximation

Published: 28 February 2022

Abstract

Motivated by the wide adoption of reinforcement learning (RL) in real-world personalized services, where users' sensitive and private information needs to be protected, we study regret minimization in finite-horizon Markov decision processes (MDPs) under the constraints of differential privacy (DP). Compared to existing private RL algorithms that work only on tabular finite-state, finite-action MDPs, we take the first step towards privacy-preserving learning in MDPs with large state and action spaces. Specifically, we consider MDPs with linear function approximation (in particular, linear mixture MDPs) under the notion of joint differential privacy (JDP), where the RL agent is responsible for protecting users' sensitive data. We design two private RL algorithms, based on value iteration and policy optimization respectively, and show that they enjoy sub-linear regret while guaranteeing privacy protection. Moreover, the regret bounds are independent of the number of states and scale at most logarithmically with the number of actions, making the algorithms suitable for privacy protection in today's large-scale personalized services. Our results are achieved via a general procedure for learning in linear mixture MDPs under changing regularizers, which not only generalizes previous results for non-private learning, but also serves as a building block for general private reinforcement learning.
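The abstract leans on three technical notions it does not spell out. The standard definitions from this literature (the symbols φ, θ*, and the JDP neighboring relation below follow the usual conventions and are not taken from this paper's own notation) can be sketched as:

```latex
% Linear mixture MDP: every stage-h transition kernel is linear in a
% known feature map \phi over (s, a, s'), with unknown parameter \theta_h^*:
\[
  P_h(s' \mid s, a) = \bigl\langle \phi(s' \mid s, a),\, \theta_h^* \bigr\rangle,
  \qquad \lVert \theta_h^* \rVert_2 \le B .
\]
% Regret over K episodes of horizon H, where \pi_k is the policy played
% in episode k and s_1^k is that episode's initial state:
\[
  \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}
    \Bigl( V_1^{*}(s_1^{k}) - V_1^{\pi_k}(s_1^{k}) \Bigr) .
\]
% Joint differential privacy (JDP): changing one user u's data changes the
% joint output M_{-u} sent to all OTHER users by at most a factor e^\varepsilon,
\[
  \Pr\bigl[ M_{-u}(D) \in E \bigr]
  \;\le\; e^{\varepsilon} \, \Pr\bigl[ M_{-u}(D') \in E \bigr],
\]
% for all measurable E and all neighboring D, D' differing only in user u.
```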

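The abstract's "changing regularizers" procedure is not specified on this page, so the following is only an illustrative sketch of the generic recipe behind much private linear-bandit/RL work: release ridge-regression estimates after perturbing their sufficient statistics with the Gaussian mechanism. The function name, parameters, and the row-norm bound are hypothetical, and this is not the paper's actual algorithm.

```python
import numpy as np

def private_ridge_estimate(features, targets, lam=1.0, eps=1.0,
                           delta=1e-6, bound=1.0, rng=None):
    """Illustrative Gaussian-mechanism sketch (not this paper's algorithm):
    perturb the ridge-regression sufficient statistics before solving, so
    the released estimate does not expose any single user's contribution."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = features.shape[1]
    # Non-private sufficient statistics: regularized Gram matrix and
    # feature-target vector.
    gram = features.T @ features + lam * np.eye(d)
    vec = features.T @ targets
    # Gaussian-mechanism noise scale for (eps, delta)-DP, assuming each
    # user's feature row has L2 norm at most `bound` (assumed bound).
    sigma = (bound ** 2) * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    noise = rng.normal(scale=sigma, size=(d, d))
    gram_noisy = gram + (noise + noise.T) / 2.0  # keep perturbation symmetric
    vec_noisy = vec + rng.normal(scale=sigma, size=d)
    return np.linalg.solve(gram_noisy, vec_noisy)
```

With a generous privacy budget the injected noise vanishes and the estimate approaches ordinary ridge regression; shrinking eps trades accuracy for privacy, which is the source of the extra privacy-dependent terms in regret analyses of this kind.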

Cited By

  • (2024) Differentially private no-regret exploration in adversarial Markov decision processes. Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, pages 235–272. DOI: 10.5555/3702676.3702687. Published 15-Jul-2024.
  • (2024) Security and Privacy Issues in Deep Reinforcement Learning: Threats and Countermeasures. ACM Computing Surveys, 56(6): 1–39. DOI: 10.1145/3640312. Published 12-Jan-2024.
  • (2024) Privacy-Preserving Deep Reinforcement Learning based on Differential Privacy. 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–8. DOI: 10.1109/IJCNN60899.2024.10650755. Published 30-Jun-2024.
  • (2024) Differentially Private Reward Functions for Markov Decision Processes. 2024 IEEE Conference on Control Technology and Applications (CCTA), pages 631–636. DOI: 10.1109/CCTA60707.2024.10666610. Published 21-Aug-2024.
  • (2023) Privacy-Engineered Value Decomposition Networks for Cooperative Multi-Agent Reinforcement Learning. 2023 62nd IEEE Conference on Decision and Control (CDC), pages 8038–8044. DOI: 10.1109/CDC49753.2023.10384184. Published 13-Dec-2023.



Published In

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 6, Issue 1
March 2022, 695 pages
EISSN: 2476-1249
DOI: 10.1145/3522731
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. differential privacy
  2. linear function approximations
  3. reinforcement learning

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 60
  • Downloads (last 6 weeks): 4
Reflects downloads up to 09 Jan 2025
