More Web Proxy on the site http://driver.im/

research-article

Free access

Gradient-free online learning in games with delayed rewards

AUTHORs:

Amélie Héliou,

Panayotis Mertikopoulos,

Zhengyuan ZhouAuthors Info & Claims

ICML'20: Proceedings of the 37th International Conference on Machine Learning

Article No.: 390, Pages 4172 - 4181

Published: 13 July 2020 Publication History

PDF eReader Publisher Site

Abstract

Motivated by applications to online advertising and recommender systems, we consider a gametheoretic model with delayed rewards and asynchronous, payoff-based feedback. In contrast to previous work on delayed multi-armed bandits, we focus on multi-player games with continuous action spaces, and we examine the long-run behavior of strategic agents that follow a no-regret learning policy (but are otherwise oblivious to the game being played, the objectives of their opponents, etc.). To account for the lack of a consistent stream of information (for instance, rewards can arrive out of order, with an a priori unbounded delay, etc.), we introduce a gradient-free learning policy where payoff information is placed in a priority queue as it arrives. In this general context, we derive new bounds for the agents' regret; furthermore, under a standard diagonal concavity assumption, we show that the induced sequence of play converges to Nash equilibrium (NE) with probability 1, even if the delay between choosing an action and receiving the corresponding reward is unbounded.

Supplementary Material

Additional material (3524938.3525328_supp.pdf)

Supplemental material.

Download
432.73 KB

References

[1]

Balandat, M., Krichene, W., Tomlin, C., and Bayen, A. Minimizing regret on reflexive Banach spaces and Nash equilibria in continuous zero-sum games. In NIPS '16: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016.

Digital Library

[2]

Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, NY, USA, 2 edition, 2017.

[3]

Bervoets, S., Bravo, M., and Faure, M. Learning with minimal information in continuous games. https://arxiv.org/abs/1806.11506, 2018.

[4]

Bistritz, I., Zhou, Z., Chen, X., Bambos, N., and Blanchet, J. Online exp3 learning in adversarial bandits with delayed feedback. In Advances in Neural Information Processing Systems, pp. 11345-11354, 2019.

[5]

Bravo, M., Leslie, D. S., and Mertikopoulos, P. Bandit learning in concave N-person games. In NIPS '18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.

Digital Library

[6]

Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

[7]

Bubeck, S. and Eldan, R. Multi-scale exploration of convex functions and bandit convex optimization. In COLT '16: Proceedings of the 29th Annual Conference on Learning Theory, 2016.

[8]

Bubeck, S. and Eldan, R. Kernel-based methods for bandit convex optimization. In STOC '17: Proceedings of the 49th annual ACM SIGACT symposium on the Theory of Computing, 2017.

Digital Library

[9]

Chapelle, O. Modeling delayed feedback in display advertising. In ACM SIGKDD '14 Proceedings of the 20th international conference on Knowledge discovery and data mining, 2014.

Digital Library

[10]

Debreu, G. A social equilibrium existence theorem. Proceedings of the National Academy of Sciences of the USA, 38(10): 886-893, October 1952.

[11]

Facchinei, F. and Kanzow, C. Generalized Nash equilibrium problems. 4OR, 5(3):173-210, September 2007.

[12]

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA '05: Proceedings of the 16th annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385- 394, 2005.

Digital Library

[13]

Fudenberg, D. and Tirole, J. Game Theory. The MIT Press, 1991.

[14]

Goodman, J. C. Note on existence and uniqueness of equilibrium points for concave N-person games. Econometrica, 48(1): 251, 1980.

[15]

Joulani, P., Gyorgy, A., and Szepesvári, C. Online learning under delayed feedback. In ICML '13: Proceedings of the 30th International Conference on Machine Learning, 2013.

[16]

Joulani, P., György, A., and Szepesvári, C. Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms. In AAAI '16: Proceedings of the 30th Conference on Artificial Intelligence, 2016.

[17]

Juditsky, A., Nemirovski, A. S., and Tauvel, C. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17-58, 2011.

[18]

Kelly, F. P., Maulloo, A. K., and Tan, D. K. H. Rate control for communication networks: shadow prices, proportional fairness and stability. Journal of the Operational Research Society, 49(3):237-252, March 1998.

[19]

Kleinberg, R. D. Nearly tight bounds for the continuum-armed bandit problem. In NIPS' 04: Proceedings of the 18th Annual Conference on Neural Information Processing Systems, 2004.

[20]

Krichene, S., Krichene, W., Dong, R., and Bayen, A. Convergence of heterogeneous distributed learning in stochastic routing games. In Allerton '15: Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing, 2015.

Digital Library

[21]

Laraki, R., Renault, J., and Sorin, S. Mathematical Foundations of Game Theory. Universitext. Springer, 2019.

[22]

Lin, T., Zhou, Z., Mertikopoulos, P., and Jordan, M. I. Finite-time last-iterate convergence for multi-agent learning in games. arXiv preprint arXiv:2002.09806, 2020.

[23]

Mertikopoulos, P. and Zhou, Z. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465-507, January 2019.

Digital Library

[24]

Mertikopoulos, P., Belmega, E. V., Negrel, R., and Sanguinetti, L. Distributed stochastic optimization via matrix exponential learning. IEEE Trans. Signal Process., 65(9):2277-2290, May 2017.

Digital Library

[25]

Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR '19: Proceedings of the 2019 International Conference on Learning Representations, 2019.

[26]

Nisan, N., Roughgarden, T., Tardos, É., and Vazirani, V. V. (eds.). Algorithmic Game Theory. Cambridge University Press, 2007.

[27]

Pang, J.-S., Scutari, G., Palomar, D. P., and Facchinei, F. Design of cognitive radio systems under temperature-interference constraints: A variational inequality approach. IEEE Trans. Signal Process., 58(6):3251-3271, June 2010.

Digital Library

[28]

Pike-Burke, C., Shipra, A., Szepesvári, C., and Grunewalder, S. Bandits with delayed, aggregated anonymous feedback. In ICML '18: Proceedings of the 35th International Conference on Machine Learning, 2018.

[29]

Quanrud, K. and Khashabi, D. Online learning with adversarial delays. In NIPS '15: Proceedings of the 29th International Conference on Neural Information Processing Systems, 2015.

[30]

Rosen, J. B. Existence and uniqueness of equilibrium points for concave N-person games. Econometrica, 33(3):520-534, 1965.

[31]

Sandholm, W. H. Population games and deterministic evolutionary dynamics. In Young, H. P. and Zamir, S. (eds.), Handbook of Game Theory IV, pp. 703-778. Elsevier, 2015.

[32]

Scutari, G., Facchinei, F., Palomar, D. P., and Pang, J.-S. Convex optimization, game theory, and variational inequality theory in multiuser communication systems. IEEE Signal Process. Mag., 27(3):35-49, May 2010.

[33]

Spall, J. C. A one-measurement form of simultaneous perturbation stochastic approximation. Automatica, 33(1):109-112, 1997.

Digital Library

[34]

Thune, T. S., Cesa-Bianchi, N., and Seldin, Y. Nonstochastic multiarmed bandits with unrestricted delays. In NeurIPS '19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.

Digital Library

[35]

Vernade, C., Cappé, O., and Perchet, V. Stochastic banit models for delayed conversions. In UAI' 17: Proceedings of the 33rd Annual Conference on Uncertainty in Artificial Intelligence, 2017.

[36]

Viossat, Y. and Zapechelnyuk, A. No-regret dynamics and fictitious play. Journal of Economic Theory, 148(2):825-842, March 2013.

[37]

Zhou, Z., Mertikopoulos, P., Bambos, N., Glynn, P.W., and Tomlin, C. Countering feedback delays in multi-agent learning. In NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017a.

Digital Library

[38]

Zhou, Z., Mertikopoulos, P., Moustakas, A. L., Bambos, N., and Glynn, P. W. Mirror descent learning in continuous games. In CDC '17: Proceedings of the 56th IEEE Annual Conference on Decision and Control, 2017b.

[39]

Zhou, Z., Mertikopoulos, P., Athey, S., Bambos, N., Glynn, P. W., and Ye, Y. Learning in games with lossy feedback. In NIPS '18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.

Digital Library

[40]

Zhou, Z., Xu, R., and Blanchet, J. Learning in generalized linear contextual bandits with stochastic delays. In Advances in Neural Information Processing Systems, pp. 5198-5209, 2019.

[41]

Zhou, Z., Mertikopoulos, P., Moustakas, A. L., Bambos, N., and Glynn, P. W. Robust power management via learning and game design. Operations Research, 2020.

[42]

Zimmert, J. and Seldin, Y. An optimal algorithm for adversarial bandits with arbitrary delays. In AISTATS '20: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020.

[43]

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In ICML '03: Proceedings of the 20th International Conference on Machine Learning, pp. 928-936, 2003.

Digital Library

Cited By

Qian YZhang ZZhao PZhou Z(2024)Learning with Asynchronous LabelsACM Transactions on Knowledge Discovery from Data10.1145/366218618:8(1-27)Online publication date: 3-May-2024
https://dl.acm.org/doi/10.1145/3662186
Lotidis KMertikopoulos PBambos NBlanchet JOh ANaumann TGloberson ASaenko KHardt MLevine S(2023)Payoff-based learning with matrix multiplicative weights in quantum gamesProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667774(37942-37974)Online publication date: 10-Dec-2023
https://dl.acm.org/doi/10.5555/3666122.3667774

Recommendations

Playing with Streakiness in Online Games: How Players Perceive and React to Winning and Losing Streaks in League of Legends
CHI '18: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems

Streakiness refers to observed tendency towards consecutive appearances of particular patterns. In video games, streakiness is oftentimes inevitable, where a player keeps winning or losing for a short period. However, the phenomenon remains understudied ...
Stationary Anonymous Sequential Games with Undiscounted Rewards

Stationary anonymous sequential games with undiscounted rewards are a special class of games that combine features from both population games (infinitely many players) with stochastic games. We extend the theory for these games to the cases of total ...
Massive online games and loot distribution: an elusive problem
SIMUTOOLS '12: Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques

Massively Multi-player Online Role-Playing Games (MMORPGs) and Massively Multiplayer Online games (MMOs) are complex socio-technical distributed systems. In these environments, a huge amount of players interact to have fun and develop their characters. ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings

ICML'20: Proceedings of the 37th International Conference on Machine Learning

July 2020

11702 pages

Editors:
Hal Daumé,
Aarti Singh

Copyright © 2020.

Publisher

JMLR.org

Publication History

Published: 13 July 2020

Qualifiers

Research-article
Research
Refereed limited

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
52
Total Downloads

Downloads (Last 12 months)50
Downloads (Last 6 weeks)10

Reflects downloads up to 31 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Qian YZhang ZZhao PZhou Z(2024)Learning with Asynchronous LabelsACM Transactions on Knowledge Discovery from Data10.1145/366218618:8(1-27)Online publication date: 3-May-2024
https://dl.acm.org/doi/10.1145/3662186
Lotidis KMertikopoulos PBambos NBlanchet JOh ANaumann TGloberson ASaenko KHardt MLevine S(2023)Payoff-based learning with matrix multiplicative weights in quantum gamesProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667774(37942-37974)Online publication date: 10-Dec-2023
https://dl.acm.org/doi/10.5555/3666122.3667774

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Figures

Tables

Media

View Table of Conten