[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.5555/3524938.3525328guideproceedingsArticle/Chapter ViewAbstractPublication PagesConference Proceedingsacm-pubtype
research-article
Free access

Gradient-free online learning in games with delayed rewards

Published: 13 July 2020 Publication History

Abstract

Motivated by applications to online advertising and recommender systems, we consider a gametheoretic model with delayed rewards and asynchronous, payoff-based feedback. In contrast to previous work on delayed multi-armed bandits, we focus on multi-player games with continuous action spaces, and we examine the long-run behavior of strategic agents that follow a no-regret learning policy (but are otherwise oblivious to the game being played, the objectives of their opponents, etc.). To account for the lack of a consistent stream of information (for instance, rewards can arrive out of order, with an a priori unbounded delay, etc.), we introduce a gradient-free learning policy where payoff information is placed in a priority queue as it arrives. In this general context, we derive new bounds for the agents' regret; furthermore, under a standard diagonal concavity assumption, we show that the induced sequence of play converges to Nash equilibrium (NE) with probability 1, even if the delay between choosing an action and receiving the corresponding reward is unbounded.

Supplementary Material

Additional material (3524938.3525328_supp.pdf)
Supplemental material.

References

[1]
Balandat, M., Krichene, W., Tomlin, C., and Bayen, A. Minimizing regret on reflexive Banach spaces and Nash equilibria in continuous zero-sum games. In NIPS '16: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016.
[2]
Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, NY, USA, 2 edition, 2017.
[3]
Bervoets, S., Bravo, M., and Faure, M. Learning with minimal information in continuous games. https://arxiv.org/abs/1806.11506, 2018.
[4]
Bistritz, I., Zhou, Z., Chen, X., Bambos, N., and Blanchet, J. Online exp3 learning in adversarial bandits with delayed feedback. In Advances in Neural Information Processing Systems, pp. 11345-11354, 2019.
[5]
Bravo, M., Leslie, D. S., and Mertikopoulos, P. Bandit learning in concave N-person games. In NIPS '18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[6]
Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
[7]
Bubeck, S. and Eldan, R. Multi-scale exploration of convex functions and bandit convex optimization. In COLT '16: Proceedings of the 29th Annual Conference on Learning Theory, 2016.
[8]
Bubeck, S. and Eldan, R. Kernel-based methods for bandit convex optimization. In STOC '17: Proceedings of the 49th annual ACM SIGACT symposium on the Theory of Computing, 2017.
[9]
Chapelle, O. Modeling delayed feedback in display advertising. In ACM SIGKDD '14 Proceedings of the 20th international conference on Knowledge discovery and data mining, 2014.
[10]
Debreu, G. A social equilibrium existence theorem. Proceedings of the National Academy of Sciences of the USA, 38(10): 886-893, October 1952.
[11]
Facchinei, F. and Kanzow, C. Generalized Nash equilibrium problems. 4OR, 5(3):173-210, September 2007.
[12]
Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA '05: Proceedings of the 16th annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385- 394, 2005.
[13]
Fudenberg, D. and Tirole, J. Game Theory. The MIT Press, 1991.
[14]
Goodman, J. C. Note on existence and uniqueness of equilibrium points for concave N-person games. Econometrica, 48(1): 251, 1980.
[15]
Joulani, P., Gyorgy, A., and Szepesvári, C. Online learning under delayed feedback. In ICML '13: Proceedings of the 30th International Conference on Machine Learning, 2013.
[16]
Joulani, P., György, A., and Szepesvári, C. Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms. In AAAI '16: Proceedings of the 30th Conference on Artificial Intelligence, 2016.
[17]
Juditsky, A., Nemirovski, A. S., and Tauvel, C. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17-58, 2011.
[18]
Kelly, F. P., Maulloo, A. K., and Tan, D. K. H. Rate control for communication networks: shadow prices, proportional fairness and stability. Journal of the Operational Research Society, 49(3):237-252, March 1998.
[19]
Kleinberg, R. D. Nearly tight bounds for the continuum-armed bandit problem. In NIPS' 04: Proceedings of the 18th Annual Conference on Neural Information Processing Systems, 2004.
[20]
Krichene, S., Krichene, W., Dong, R., and Bayen, A. Convergence of heterogeneous distributed learning in stochastic routing games. In Allerton '15: Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing, 2015.
[21]
Laraki, R., Renault, J., and Sorin, S. Mathematical Foundations of Game Theory. Universitext. Springer, 2019.
[22]
Lin, T., Zhou, Z., Mertikopoulos, P., and Jordan, M. I. Finite-time last-iterate convergence for multi-agent learning in games. arXiv preprint arXiv:2002.09806, 2020.
[23]
Mertikopoulos, P. and Zhou, Z. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465-507, January 2019.
[24]
Mertikopoulos, P., Belmega, E. V., Negrel, R., and Sanguinetti, L. Distributed stochastic optimization via matrix exponential learning. IEEE Trans. Signal Process., 65(9):2277-2290, May 2017.
[25]
Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR '19: Proceedings of the 2019 International Conference on Learning Representations, 2019.
[26]
Nisan, N., Roughgarden, T., Tardos, É., and Vazirani, V. V. (eds.). Algorithmic Game Theory. Cambridge University Press, 2007.
[27]
Pang, J.-S., Scutari, G., Palomar, D. P., and Facchinei, F. Design of cognitive radio systems under temperature-interference constraints: A variational inequality approach. IEEE Trans. Signal Process., 58(6):3251-3271, June 2010.
[28]
Pike-Burke, C., Shipra, A., Szepesvári, C., and Grunewalder, S. Bandits with delayed, aggregated anonymous feedback. In ICML '18: Proceedings of the 35th International Conference on Machine Learning, 2018.
[29]
Quanrud, K. and Khashabi, D. Online learning with adversarial delays. In NIPS '15: Proceedings of the 29th International Conference on Neural Information Processing Systems, 2015.
[30]
Rosen, J. B. Existence and uniqueness of equilibrium points for concave N-person games. Econometrica, 33(3):520-534, 1965.
[31]
Sandholm, W. H. Population games and deterministic evolutionary dynamics. In Young, H. P. and Zamir, S. (eds.), Handbook of Game Theory IV, pp. 703-778. Elsevier, 2015.
[32]
Scutari, G., Facchinei, F., Palomar, D. P., and Pang, J.-S. Convex optimization, game theory, and variational inequality theory in multiuser communication systems. IEEE Signal Process. Mag., 27(3):35-49, May 2010.
[33]
Spall, J. C. A one-measurement form of simultaneous perturbation stochastic approximation. Automatica, 33(1):109-112, 1997.
[34]
Thune, T. S., Cesa-Bianchi, N., and Seldin, Y. Nonstochastic multiarmed bandits with unrestricted delays. In NeurIPS '19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.
[35]
Vernade, C., Cappé, O., and Perchet, V. Stochastic banit models for delayed conversions. In UAI' 17: Proceedings of the 33rd Annual Conference on Uncertainty in Artificial Intelligence, 2017.
[36]
Viossat, Y. and Zapechelnyuk, A. No-regret dynamics and fictitious play. Journal of Economic Theory, 148(2):825-842, March 2013.
[37]
Zhou, Z., Mertikopoulos, P., Bambos, N., Glynn, P.W., and Tomlin, C. Countering feedback delays in multi-agent learning. In NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017a.
[38]
Zhou, Z., Mertikopoulos, P., Moustakas, A. L., Bambos, N., and Glynn, P. W. Mirror descent learning in continuous games. In CDC '17: Proceedings of the 56th IEEE Annual Conference on Decision and Control, 2017b.
[39]
Zhou, Z., Mertikopoulos, P., Athey, S., Bambos, N., Glynn, P. W., and Ye, Y. Learning in games with lossy feedback. In NIPS '18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[40]
Zhou, Z., Xu, R., and Blanchet, J. Learning in generalized linear contextual bandits with stochastic delays. In Advances in Neural Information Processing Systems, pp. 5198-5209, 2019.
[41]
Zhou, Z., Mertikopoulos, P., Moustakas, A. L., Bambos, N., and Glynn, P. W. Robust power management via learning and game design. Operations Research, 2020.
[42]
Zimmert, J. and Seldin, Y. An optimal algorithm for adversarial bandits with arbitrary delays. In AISTATS '20: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020.
[43]
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In ICML '03: Proceedings of the 20th International Conference on Machine Learning, pp. 928-936, 2003.

Cited By

View all
  • (2024)Learning with Asynchronous LabelsACM Transactions on Knowledge Discovery from Data10.1145/366218618:8(1-27)Online publication date: 3-May-2024
  • (2023)Payoff-based learning with matrix multiplicative weights in quantum gamesProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667774(37942-37974)Online publication date: 10-Dec-2023

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings
ICML'20: Proceedings of the 37th International Conference on Machine Learning
July 2020
11702 pages

Publisher

JMLR.org

Publication History

Published: 13 July 2020

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)50
  • Downloads (Last 6 weeks)10
Reflects downloads up to 31 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Learning with Asynchronous LabelsACM Transactions on Knowledge Discovery from Data10.1145/366218618:8(1-27)Online publication date: 3-May-2024
  • (2023)Payoff-based learning with matrix multiplicative weights in quantum gamesProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667774(37942-37974)Online publication date: 10-Dec-2023

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media