
Linearly Parameterized Bandits

Published: 01 May 2010

Abstract

We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an r-dimensional random vector Z ∈ R^r, where r ≥ 2. The objective is to minimize the cumulative regret and Bayes risk. When the set of arms corresponds to the unit sphere, we prove that the regret and Bayes risk are of order Θ(r √T), by establishing a lower bound for an arbitrary policy and by showing that a matching upper bound is obtained through a policy that alternates between exploration and exploitation phases. The phase-based policy is also shown to be effective if the set of arms satisfies a strong convexity condition. For the case of a general set of arms, we describe a near-optimal policy whose regret and Bayes risk admit upper bounds of the form O(r √T log^(3/2) T).
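The phase-based explore/exploit idea from the abstract can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions (a fixed exploration budget of r·√T rounds and coordinate-direction probes), not the paper's exact phase schedule or constants: arms are unit vectors u in R^r, a pull of u yields u·Z plus Gaussian noise, and after exploration the policy commits to the greedy arm Ẑ/‖Ẑ‖.

```python
import numpy as np

def run_phased_policy(r=4, T=2000, sigma=0.1, seed=0):
    """One run of a simple explore/exploit phased policy on the unit
    sphere in R^r (illustrative sketch, not the paper's policy).
    Exploration: pull each coordinate direction e_i in round-robin and
    average the observed rewards to estimate Z.
    Exploitation: play the greedy arm Z_hat / ||Z_hat||.
    Returns the cumulative (pseudo-)regret against the best arm."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=r)
    Z /= np.linalg.norm(Z)           # unknown parameter; normalized so ||Z|| = 1
    best_reward = 1.0                # max over unit vectors u of u . Z equals ||Z||

    n_explore = r * int(np.sqrt(T))  # O(r sqrt(T)) exploration rounds
    sums = np.zeros(r)
    counts = np.zeros(r)
    regret = 0.0
    for t in range(T):
        if t < n_explore:
            i = t % r
            u = np.eye(r)[i]                         # probe direction e_i
            x = u @ Z + sigma * rng.normal()         # noisy linear reward
            sums[i] += x
            counts[i] += 1
        else:
            Z_hat = sums / np.maximum(counts, 1)     # per-coordinate averages
            u = Z_hat / max(np.linalg.norm(Z_hat), 1e-12)  # greedy unit-norm arm
        regret += best_reward - u @ Z                # expected per-round regret
    return regret
```

With these parameters the exploration phase dominates the regret, and the total stays far below the Θ(T) cost of a non-learning policy, consistent with the sublinear (order r √T) behavior the abstract describes for the phased policy.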



Published In

Mathematics of Operations Research, Volume 35, Issue 2, May 2010, 241 pages

Publisher

INFORMS, Linthicum, MD, United States

Publication History

Received: 19 January 2009

Author Tags

1. adaptive control
2. multi-armed bandit
3. parametric model


Cited By

• "To Interfere or Not To Interfere," Operations Research 72(6), pp. 2391-2412, Nov 2024. doi:10.1287/opre.2023.0363
• "Online Learning and Pricing for Service Systems with Reusable Resources," Operations Research 72(3), pp. 1203-1241, May 2024. doi:10.1287/opre.2022.2381
• "Pricing and Positioning of Horizontally Differentiated Products with Incomplete Demand Information," Operations Research 72(6), pp. 2446-2466, Nov 2024. doi:10.1287/opre.2021.0093
• "Tiered Assortment," Management Science 70(8), pp. 5481-5501, Aug 2024. doi:10.1287/mnsc.2023.4940
• "Optimal Learning for Structured Bandits," Management Science 70(6), pp. 3951-3998, Jun 2024. doi:10.1287/mnsc.2020.02108
• "Influence Maximization via Graph Neural Bandits," Proc. 30th ACM SIGKDD Conf. Knowledge Discovery and Data Mining, pp. 771-781, Aug 2024. doi:10.1145/3637528.3671983
• "Analyzing Queueing Problems via Bandits With Linear Reward & Nonlinear Workload Fairness," IEEE Trans. Mobile Computing 23(4), pp. 3410-3423, Apr 2024. doi:10.1109/TMC.2023.3276158
• "Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits," IEEE Trans. Information Theory 70(1), pp. 372-388, Jan 2024. doi:10.1109/TIT.2023.3267732
• "Multi-armed bandits with dependent arms," Machine Learning 113(1), pp. 45-71, Jan 2024. doi:10.1007/s10994-023-06457-z
• "Multi-agent learning with heterogeneous linear contextual bandits," Proc. 37th Internat. Conf. Neural Information Processing Systems, pp. 78768-78790, Dec 2023. doi:10.5555/3666122.3669567
