DOI: 10.1145/3097983.3098024

A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments

Published: 13 August 2017

Abstract

Online controlled experiments (e.g., A/B tests) are now regularly used to guide product development and accelerate innovation in software. Product ideas are evaluated as scientific hypotheses and tested on web sites, mobile applications, desktop applications, services, and operating systems. One of the key challenges for organizations that run controlled experiments is to come up with the right set of metrics [1] [2] [3]. Having good metrics, however, is not enough.
In our experience of running thousands of experiments with many teams across Microsoft, we have observed again and again how an incorrect interpretation of a metric movement can lead to a wrong conclusion about an experiment's outcome; shipping a change on the basis of such a conclusion can hurt the business by millions of dollars. Inspired by Steven Goodman's twelve p-value misconceptions [4], in this paper we share twelve common metric interpretation pitfalls that we observed repeatedly in our experiments. We illustrate each pitfall with a puzzling example from a real experiment, and describe processes, metric design principles, and guidelines that can be used to detect and avoid each pitfall.
With this paper, we aim to increase experimenters' awareness of metric interpretation issues, leading to improved quality and trustworthiness of experiment results and better data-driven decisions.
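
To make the setting concrete: the pitfalls concern how experimenters read per-metric statistical results from a controlled experiment. The sketch below is purely illustrative and not from the paper (the metric, the simulated data, and the effect size are all assumptions); it shows the kind of two-sample significance test whose metric deltas and p-values the twelve pitfalls are about.

```python
# Illustrative sketch only (not the paper's method): a Welch two-sample
# t-test on a simulated per-user metric, the kind of per-metric result
# whose interpretation the twelve pitfalls address.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical metric: clicks per user, with a small simulated
# treatment effect (all numbers here are assumptions).
control = rng.normal(loc=1.00, scale=0.5, size=100_000)
treatment = rng.normal(loc=1.01, scale=0.5, size=100_000)

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
delta_pct = 100 * (treatment.mean() - control.mean()) / control.mean()
print(f"delta: {delta_pct:+.2f}%, p-value: {p_value:.4g}")

# A small p-value says the observed movement would be unlikely under the
# null hypothesis of no effect; it does not by itself establish the
# movement's size, cause, or business importance.
```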

Supplementary Material

MP4 File (gupta_dirty_dozen.mp4)

References

[1]
A. Deng and X. Shi, "Data-driven metric development for online controlled experiments: Seven lessons learned," in KDD, 2016.
[2]
W. Machmouchi and G. Buscher, "Principles for the Design of Online A/B Metrics," in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016.
[3]
P. Dmitriev and X. Wu, "Measuring Metrics," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 2016.
[4]
S. Goodman, "A Dirty Dozen: Twelve P-Value Misconceptions," in Seminars in Hematology, 2008.
[5]
R. Kohavi and R. Longbotham, "Online Controlled Experiments and A/B Tests," in Encyclopedia of Machine Learning and Data Mining, 2017.
[6]
"Microsoft Experimentation Platform," [Online]. Available: http://www.exp-platform.com.
[7]
R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu and N. Pohlmann, "Online Controlled Experiments at Large Scale," in KDD, 2013.
[8]
R. Kohavi, A. Deng, R. Longbotham and Y. Xu, "Seven Rules of Thumb for Web Site Experimenters," in KDD, 2014.
[9]
R. Kohavi, R. Longbotham, D. Sommerfield and R. M. Henne, "Controlled experiments on the web: survey and practical guide," Data Mining and Knowledge Discovery, vol. 18, no. 1, pp. 140--181, February 2009.
[10]
A. Deng, Y. Xu, R. Kohavi and T. Walker, "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data," in Sixth ACM WSDM, Rome, Italy, 2013.
[11]
P. Dmitriev, B. Frasca, S. Gupta, R. Kohavi and G. Vaz, "Pitfalls of Long-Term Online Controlled Experiments," in IEEE International Conference on Big Data, 2016.
[12]
H. Hohnhold, D. O'Brien and D. Tang, "Focusing on the Long-term: It's Good for Users and Business," in KDD, 2015.
[13]
V. F. Ridgway, "Dysfunctional Consequences of Performance Measurements," Administrative Science Quarterly, 1956.
[14]
R. W. Schmenner and T. E. Vollmann, "Performance Measures: Gaps, False Alarms and the 'Usual Suspects'," International Journal of Operations & Production Management, 1994.
[15]
R. S. Kaplan and D. Norton, "The Balanced Scorecard - Measures that Drive Performance," Harvard Business Review, 1992.
[16]
J. R. Hauser and G. M. Katz, "Metrics: you are what you measure!," European Management Journal, 1998.
[17]
A. Deng, J. Lu and S. Chen, "Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing," in DSAA, 2016.
[18]
A. Deng, "Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments," in Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion), 2015.
[19]
R. Johari, L. Pekelis and D. J. Walsh, "Always valid inference: Bringing sequential analysis to A/B testing," in submission; preprint available at arxiv.org/pdf/1512.04922, 2015.
[20]
R. Kohavi, "Lessons from running thousands of A/B tests," 2014. [Online]. Available: http://bit.ly/expLesssonsCode.
[21]
Z. Zhao, M. Chen, D. Matheson and M. Stone, "Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation," in DSAA, 2016.
[22]
R. Kohavi, A. Deng, B. Frasca, R. Longbotham, T. Walker and Y. Xu, "Trustworthy online controlled experiments: Five puzzling outcomes explained," in KDD, 2012.
[23]
Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Click-through_rate.
[24]
"Clickthrough rate (CTR): Definition," AdWords Google, 2016. [Online]. Available: https://support.google.com/adwords/answer/2615875.
[25]
R. Kohavi, D. Messner, S. Eliot, J. L. Ferres, R. Henne, V. Kannappan and J. Wang, "Tracking Users' Clicks and Submits: Tradeoffs between User Experience and Data Loss," October 2010. [Online]. Available: http://bit.ly/expTrackingClicks.
[26]
J. P. Ioannidis, "Why most discovered true associations are inflated," Epidemiology, vol. 19, no. 5, pp. 640--648, 2008.
[27]
R. H. Thaler, "Anomalies: The winner's curse," The Journal of Economic Perspectives, vol. 2, no. 1, pp. 191--202, 1988.
[28]
K. Pearson, "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling," Philosophical Magazine, vol. 50, no. 5, pp. 157--175, 1900.
[29]
R. A. Fisher, Statistical Methods for Research Workers, Edinburgh: Oliver & Boyd, 1925.
[30]
R. L. Wasserstein and N. A. Lazar, "The ASA's Statement on p-Values: Context, Process, and Purpose," The American Statistician, vol. 70, no. 2, pp. 129--133, 2016.
[31]
"Fisher's Method," [Online]. Available: https://en.wikipedia.org/wiki/Fisher%27s_method .
[32]
R. Kohavi, "Online Controlled Experiments: Lessons from Running A/B/n Tests for 12 years," 2015. [Online]. Available: http://bit.ly/KDD2015Kohavi.
[33]
R. Johari, L. Pekelis and D. J. Walsh, "Always valid inference: Bringing sequential analysis to A/B testing," 2015. [Online]. Available: https://arxiv.org/abs/1512.04922.
[34]
A. Deng, J. Lu and S. Chen, "Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing," in DSAA, 2016.
[35]
A. Deng, P. Zhang, S. Chen, D. Kim and J. Lu, "Concise Summarization of Heterogeneous Treatment Effect Using Total Variation Regularized Regression," in submission.
[36]
"Simpson's paradox," [Online]. Available: https://en.wikipedia.org/wiki/Simpson%27s_paradox.
[37]
"Multiple Comparisons problem," [Online]. Available: https://en.wikipedia.org/wiki/Multiple_comparisons_problem.
[38]
"Bonferroni correction," [Online]. Available: https://en.wikipedia.org/wiki/Bonferroni_correction].
[39]
"Mobile Patterns," [Online]. Available: https://mobilepatterns.wikispaces.com/Coach+Marks.



        Information & Contributors

        Information

        Published In

        KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
        August 2017
        2240 pages
        ISBN:9781450348874
        DOI:10.1145/3097983

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Author Tags

        1. a/b testing
        2. controlled experiments
        3. metrics
        4. online experiments

        Qualifiers

        • Research-article

        Conference

        KDD '17

        Acceptance Rates

KDD '17 Paper Acceptance Rate: 64 of 748 submissions, 9%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%


Cited By

• (2024) Powerful A/B-Testing Metrics and Where to Find Them. Proceedings of the 18th ACM Conference on Recommender Systems, 10.1145/3640457.3688036, pp. 816-818. Online publication date: 8-Oct-2024.
• (2024) Learning Metrics that Maximise Power for Accelerated A/B-Tests. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 10.1145/3637528.3671512, pp. 5183-5193. Online publication date: 25-Aug-2024.
• (2024) Cost-Effective A/B Testing: Leveraging Go and Python for Efficient Experimentation in Hermes Testing Platform. 2024 10th International Conference on Communication and Signal Processing (ICCSP), 10.1109/ICCSP60870.2024.10543437, pp. 1048-1050. Online publication date: 12-Apr-2024.
• (2024) A/B testing. Journal of Systems and Software, vol. 211, no. C, 10.1016/j.jss.2024.112011. Online publication date: 2-Jul-2024.
• (2024) Fairness issues, current approaches, and challenges in machine learning models. International Journal of Machine Learning and Cybernetics, vol. 15, no. 8, pp. 3095-3125, 10.1007/s13042-023-02083-2. Online publication date: 31-Jan-2024.
• (2023) Clustering-Based Imputation for Dropout Buyers in Large-Scale Online Experimentation. The New England Journal of Statistics in Data Science, pp. 415-425, 10.51387/23-NEJSDS33. Online publication date: 24-May-2023.
• (2023) On the Understanding of the Role of Continuous Experimentation in Technology-Based Startup. Proceedings of the XXXVII Brazilian Symposium on Software Engineering, 10.1145/3613372.3613414, pp. 21-30. Online publication date: 25-Sep-2023.
• (2023) The Price is Right: Removing A/B Test Bias in a Marketplace of Expirable Goods. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 10.1145/3583780.3615502, pp. 4681-4687. Online publication date: 21-Oct-2023.
• (2023) A/B Integrations: 7 Lessons Learned from Enabling A/B Testing as a Product Feature. Proceedings of the 45th International Conference on Software Engineering: Software Engineering in Practice, 10.1109/ICSE-SEIP58684.2023.00033, pp. 304-314. Online publication date: 17-May-2023.
• (2023) Statistical Challenges in Online Controlled Experiments: A Review of A/B Testing Methodology. The American Statistician, vol. 78, no. 2, pp. 135-149, 10.1080/00031305.2023.2257237. Online publication date: 18-Oct-2023.
