
How to use expert advice

Published: 01 May 1997

Abstract

We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
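The protocol described above is easy to make concrete. The sketch below (Python; the weighted-average prediction rule, the fixed learning rate eta, and all names are illustrative assumptions rather than the paper's exact algorithm) keeps one nonnegative weight per expert, forecasts with the weighted mean of the experts' predictions, and charges the forecaster |p - y| on each bit y, which is its expected number of mistakes when it predicts 1 with probability p.

```python
# Minimal sketch of an exponentially weighted forecaster for binary
# prediction with expert advice. Illustrative only: eta and all names
# are assumptions, not the paper's exact algorithm or tuning.
import math
import random

def weighted_forecaster(expert_preds, outcomes, eta=0.5):
    """expert_preds: T lists, each holding N expert predictions in [0, 1].
    outcomes: T bits (0 or 1).
    Returns (expected mistakes of the forecaster, mistakes of the best expert)."""
    n = len(expert_preds[0])
    weights = [1.0] * n               # one nonnegative weight per expert
    alg_loss = 0.0                    # expected number of mistakes so far
    expert_loss = [0.0] * n           # cumulative absolute loss per expert
    for preds, y in zip(expert_preds, outcomes):
        total = sum(weights)
        p = sum(w * x for w, x in zip(weights, preds)) / total
        alg_loss += abs(p - y)        # predicting 1 with prob. p costs |p - y| in expectation
        for i, x in enumerate(preds):
            loss = abs(x - y)
            expert_loss[i] += loss
            weights[i] *= math.exp(-eta * loss)  # discount experts that err
    return alg_loss, min(expert_loss)

if __name__ == "__main__":
    rng = random.Random(0)
    T, N = 1000, 10
    outcomes = [rng.randint(0, 1) for _ in range(T)]
    # expert 0 is right 90% of the time; the other nine guess at random
    preds = [[y if rng.random() < 0.9 else 1 - y] +
             [rng.random() for _ in range(N - 1)] for y in outcomes]
    alg, best = weighted_forecaster(preds, outcomes)
    print(f"forecaster: {alg:.1f} expected mistakes, best expert: {best:.1f}")
```

With eta tuned to the best expert's eventual loss L, arguments of this kind bound the difference in expected mistakes by a term of order sqrt(L log N) for N experts, the square-root behavior the abstract refers to.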



Reviews

Giuseppina Carla Gini

While the title of this paper suggests the use of traditional expert systems technology, the research reported here concerns how to make predictions when a team of experts is available and their predictions are known. In particular, the authors define an algorithm to solve the following prediction problem: we want to predict whether an event will occur, and we may use the predictions of a finite set of experts; each expert sees a sequence of measurements and gives his or her opinion on the probability of the event as a real number between 0 and 1. The problem is to build an algorithm that uses the experts' advice to give the best possible estimate. This requires minimizing the loss, defined as the absolute value of the difference between the predicted number and the real outcome.

The authors define a family of algorithms, some taken from the literature, and study their performance. In particular, the optimal algorithms are discussed, and their lower and upper bounds are demonstrated. Since the optimal algorithms have complexity exponential in the number of experts, the authors provide other algorithms to generate the prediction and demonstrate their lower and upper bounds. The algorithm P is based on computing and updating a nonnegative weight for each expert that reflects the expert's prediction ability. A parameter controls when to drop poorly performing experts, and its choice uses a priori knowledge. An iterative variation, P*, can make predictions without a priori knowledge of the best expert's loss.

In the last part of the paper, the authors discuss an application of the algorithm in pattern recognition, where the problem is to predict the label (0 or 1) of examples drawn at random from an arbitrary distribution. The authors compare their solution with other, similar results; the comparison indicates that the new approach is better. The proofs of the main theorems are difficult and require a few lemmas. This paper is important for everyone working in the field. It covers almost every aspect of a loss function that minimizes the number of mistakes and can use any kind of expert output.
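The review's description of P and P* invites a small reconstruction. The sketch below is a hedged guess-and-double reading of the iterative idea attributed to P*, not the authors' exact algorithm: it keeps a guessed budget for the best expert's total loss, tunes a learning rate to that guess (the square-root tuning rule and all names here are assumptions for illustration), and doubles the budget and restarts the weights whenever the best expert's observed loss exceeds the guess.

```python
# Hypothetical "guess and double" reconstruction of the iterative idea the
# review attributes to P*: no a priori knowledge of the best expert's loss.
# The eta tuning rule and every name here are illustrative assumptions.
import math

def doubling_forecaster(expert_preds, outcomes):
    n = len(expert_preds[0])
    budget = 1.0                                 # guessed bound on the best expert's loss
    eta = math.sqrt(2.0 * math.log(n) / budget)  # learning rate tuned to the guess
    weights = [1.0] * n
    expert_loss = [0.0] * n
    alg_loss = 0.0
    for preds, y in zip(expert_preds, outcomes):
        total = sum(weights)
        p = sum(w * x for w, x in zip(weights, preds)) / total
        alg_loss += abs(p - y)                   # expected mistakes of the randomized forecast
        for i, x in enumerate(preds):
            expert_loss[i] += abs(x - y)
            weights[i] *= math.exp(-eta * abs(x - y))
        if min(expert_loss) > budget:            # the guess was too small:
            budget *= 2.0                        # double it, retune the rate,
            eta = math.sqrt(2.0 * math.log(n) / budget)
            weights = [1.0] * n                  # and restart the weights
    return alg_loss, min(expert_loss)
```

Because the budgets grow geometrically, only logarithmically many restarts can occur, and the loss wasted across restarts folds into the constant of the tuned bound; this is the standard route to guarantees that need no a priori knowledge.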


Published In

Journal of the ACM, Volume 44, Issue 3
May 1997
163 pages
ISSN: 0004-5411
EISSN: 1557-735X
DOI: 10.1145/258128
Editor: F. T. Leighton

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  • algorithms

Qualifiers

  • Article

Cited By

  • (2025) Bayesian Predictive Synthesis with Outcome-Dependent Pools. Statistical Science 40:1. DOI: 10.1214/24-STS954. Online publication date: 1-Jan-2025.
  • (2024) Non-stationary online convex optimization with arbitrary delays. Proceedings of the 41st International Conference on Machine Learning, pp. 49991-50011. DOI: 10.5555/3692070.3694115. Online publication date: 21-Jul-2024.
  • (2024) Monotone individual fairness. Proceedings of the 41st International Conference on Machine Learning, pp. 3266-3283. DOI: 10.5555/3692070.3692200. Online publication date: 21-Jul-2024.
  • (2024) An Approximate Dynamic Programming Approach to Repeated Games with Vector Losses. Operations Research 72:1, pp. 373-388. DOI: 10.1287/opre.2022.2334. Online publication date: 1-Jan-2024.
  • (2024) Dynamic Fairness-aware Recommendation Through Multi-agent Social Choice. ACM Transactions on Recommender Systems 3:2, pp. 1-35. DOI: 10.1145/3690653. Online publication date: 28-Sep-2024.
  • (2024) Causal Structure Learning for Recommender System. ACM Transactions on Recommender Systems 3:1, pp. 1-23. DOI: 10.1145/3680296. Online publication date: 31-Jul-2024.
  • (2024) Fast Swap Regret Minimization and Applications to Approximate Correlated Equilibria. Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pp. 1223-1234. DOI: 10.1145/3618260.3649691. Online publication date: 10-Jun-2024.
  • (2024) Evolve: Enhancing Unsupervised Continual Learning with Multiple Experts. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2355-2366. DOI: 10.1109/WACV57701.2024.00236. Online publication date: 3-Jan-2024.
  • (2024) Getting the Best Out of Both Worlds: Algorithms for Hierarchical Inference at the Edge. IEEE Transactions on Machine Learning in Communications and Networking 2, pp. 280-297. DOI: 10.1109/TMLCN.2024.3366501. Online publication date: 2024.
  • (2024) Adversarial Bandits With Multi-User Delayed Feedback: Theory and Application. IEEE Transactions on Mobile Computing 23:10, pp. 9383-9397. DOI: 10.1109/TMC.2024.3362237. Online publication date: 1-Oct-2024.
