DOI: 10.1109/ICSE-SEIP.2019.00009

Three key checklists and remedies for trustworthy analysis of online controlled experiments at scale

Published: 27 May 2019

Abstract

Online Controlled Experiments (OCEs) are transforming the decision-making process of data-driven companies into an experimental laboratory. Despite their power to identify what customers actually value, OCEs are highly sensitive to data loss, skipped checks, wrong designs, and many other 'hiccups' in the analysis process. For this reason, experiment analysis has traditionally been done by experienced data analysts and scientists who closely monitor experiments throughout their lifecycle. Depending solely on scarce experts, however, is neither scalable nor bulletproof. To democratize experimentation, analysis should be streamlined and meticulously performed by engineers, managers, or others responsible for the development of a product. In this paper, based on the synthesized experience of companies that run thousands of OCEs per year, we examined how experts inspect online experiments. We reveal that most of the experiment analysis happens before OCEs are even started, and we summarize the key analysis steps in three checklists. The value of the checklists is threefold. First, they can increase the accuracy of experiment setup and of the decision-making process. Second, checklists can enable novice data scientists and software engineers to become more autonomous in setting up and analyzing experiments. Finally, they can serve as a basis to develop trustworthy platforms and tools for OCE setup and analysis.
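To make the flavor of such trustworthiness checks concrete, below is a minimal Python sketch of a Sample Ratio Mismatch (SRM) check, a standard guardrail in large-scale experimentation and the subject of one of the citing papers listed further down. The paper's actual checklists are in the full text; the function name srm_check, the 0.001 p-value threshold, and the example counts are illustrative assumptions, not details taken from the paper.

# Minimal sketch of a Sample Ratio Mismatch (SRM) check, a common
# trustworthiness guardrail for online controlled experiments.
# The function name and the 0.001 threshold are illustrative
# assumptions, not details taken from the paper.
from scipy.stats import chisquare

def srm_check(control_users, treatment_users, expected_control_share=0.5):
    """Test whether the observed traffic split deviates from the
    configured split by more than chance alone would explain."""
    total = control_users + treatment_users
    expected = [total * expected_control_share,
                total * (1 - expected_control_share)]
    _, p_value = chisquare([control_users, treatment_users], f_exp=expected)
    # A very small p-value suggests broken randomization or telemetry
    # loss, so the experiment's results should not be trusted as-is.
    return p_value < 0.001, p_value

# Example: a 50/50 experiment that logged 50,000 control users and
# 51,500 treatment users.
srm, p = srm_check(50_000, 51_500)
print(f"SRM detected: {srm} (p = {p:.2e})")

Because a failed SRM check invalidates downstream metric comparisons no matter how carefully they are performed, it is the kind of step that lends itself to a checklist and, eventually, to automated platform tooling, in line with the abstract's third point.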

Published In

ICSE-SEIP '19: Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice
May 2019
339 pages

Publisher

IEEE Press

Author Tags

  1. a/b testing
  2. experiment checklists
  3. online controlled experiments

Cited By

  • (2022) "Automated Sample Ratio Mismatch (SRM) detection and analysis," Proceedings of the 26th International Conference on Evaluation and Assessment in Software Engineering, pp. 268-269, 13 Jun 2022. https://doi.org/10.1145/3530019.3534982
  • (2021) "Important Experimentation Characteristics," Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 1-6, 11 Oct 2021. https://doi.org/10.1145/3475716.3484186
  • (2020) "Engineering for a science-centric experimentation platform," Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice, pp. 191-200, 27 Jun 2020. https://doi.org/10.1145/3377813.3381349
  • (2019) "Challenges, Best Practices and Pitfalls in Evaluating Results of Online Controlled Experiments," Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3189-3190, 25 Jul 2019. https://doi.org/10.1145/3292500.3332297
