
A Siren Song of Open Source Reproducibility, Examples from Machine Learning

Published: 28 June 2023

Abstract

As reproducibility becomes a greater concern, conferences have largely converged on a strategy of asking reviewers to indicate whether code was attached to a submission. This represents a broader pattern of implementing actions based on presumed ideals, without studying whether those actions will produce positive results. We argue that focusing on code as a means of reproduction is misguided if we want to improve the state of reproducible and replicable research. In this study, we find that this focus on code may be harmful: we should not force code to be submitted. Furthermore, there is a lack of evidence that conferences take effective actions to encourage and reward reproducibility. We argue that venues must take more action to advance reproducible machine learning research today.
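
The abstract's central claim, that attached code alone does not make a result reproducible, can be illustrated concretely. Below is a minimal, hypothetical Python sketch (not from the paper; it assumes scikit-learn is available) in which the "same" released script reports a different accuracy on each run, because the data split and the optimizer both draw uncontrolled randomness. Pinning a seed makes the run repeatable, but a reviewer checking a "code attached" box cannot tell which situation they are in.

    # Illustrative sketch only: the same code can yield different results per run.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import train_test_split

    def train_and_score(seed=None):
        # Hypothetical experiment: both the split and SGD consume randomness.
        X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
        clf = SGDClassifier(random_state=seed).fit(X_tr, y_tr)
        return clf.score(X_te, y_te)

    print(train_and_score())         # run 1: some accuracy, e.g. ~0.86
    print(train_and_score())         # run 2: same code, likely a different number
    print(train_and_score(seed=42))  # repeatable only once every seed is pinned

Even the seeded run only guarantees that one execution path repeats; it says nothing about whether the reported finding is robust, which is the kind of gap the abstract argues a code-attachment checkbox cannot detect.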




Published In

ACM REP '23: Proceedings of the 2023 ACM Conference on Reproducibility and Replicability
June 2023
127 pages
ISBN: 9798400701764
DOI: 10.1145/3589806

Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

  • Short-paper
  • Research
  • Refereed limited



Article Metrics

  • Downloads (last 12 months): 57
  • Downloads (last 6 weeks): 6
Reflects downloads up to 11 Jan 2025.

Cited By

  • Common Flaws in Running Human Evaluation Experiments in NLP. Computational Linguistics 50, 2 (2024), 795-805. https://doi.org/10.1162/coli_a_00508. Online publication date: 1-Jun-2024.
  • LogFlux: A Software Suite for Replicating Results in Automated Log Parsing. Proceedings of the 2nd ACM Conference on Reproducibility and Replicability (2024), 64-74. https://doi.org/10.1145/3641525.3663625. Online publication date: 18-Jun-2024.
  • Navigating the Landscape of Reproducible Research: A Predictive Modeling Approach. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (2024), 24-33. https://doi.org/10.1145/3627673.3679831. Online publication date: 21-Oct-2024.
  • Replicability of simulation studies for the investigation of statistical methods: the RepliSims project. Royal Society Open Science 11, 1 (2024). https://doi.org/10.1098/rsos.231003. Online publication date: 17-Jan-2024.
  • Does Starting Deep Learning Homework Earlier Improve Grades? Artificial Intelligence. ECAI 2023 International Workshops (2024), 381-396. https://doi.org/10.1007/978-3-031-50485-3_38. Online publication date: 25-Jan-2024.
  • Reproducibility in multiple instance learning. Proceedings of the 37th International Conference on Neural Information Processing Systems (2023), 13530-13544. https://doi.org/10.5555/3666122.3666718. Online publication date: 10-Dec-2023.
