DOI: 10.1145/3377812.3390803
poster

Restoring reproducibility of Jupyter notebooks

Published: 01 October 2020

Abstract

Jupyter notebooks, documents that contain live code, equations, visualizations, and narrative text, are now among the most popular means to compute, present, discuss, and disseminate scientific findings. In principle, Jupyter notebooks should make it easy to reproduce and extend scientific computations and their findings; in practice, this is not the case. The individual code cells in a Jupyter notebook can be executed in any order, so identifier usages may precede their definitions and results may precede their computations. In a sample of 936 published notebooks that would be executable in principle, we found that 73% would not be reproducible with straightforward approaches, requiring humans to infer (and often guess) the order in which the authors created the cells.
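The ordering problem can be illustrated with a minimal sketch. The three cells below are hypothetical and stand in for a notebook whose author ran the second cell before the first; `run_in_order` and `top_down_fails` are illustrative helper names, not part of Jupyter or of Osiris.

```python
# Hypothetical notebook cells, listed in their top-to-bottom position.
cells = [
    "total = sum(values)",         # position 0: uses `values`
    "values = [1, 2, 3]",          # position 1: defines `values`
    "mean = total / len(values)",  # position 2: uses both names
]


def run_in_order(cells, order):
    """Execute the cells in the given order in a fresh namespace."""
    namespace = {}
    for index in order:
        exec(cells[index], namespace)
    return namespace


def top_down_fails(cells):
    """Return True if naive top-to-bottom execution raises a NameError."""
    try:
        run_in_order(cells, range(len(cells)))
    except NameError:
        return True
    return False
```

Here the top-to-bottom run stops at the first cell because `values` is not yet defined, while the order 1, 0, 2 runs to completion.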
In this paper, we present an approach to (1) automatically satisfy dependencies between code cells to reconstruct possible execution orders of the cells; and (2) instrument code cells to mitigate the impact of non-reproducible statements (i.e., random functions) in Jupyter notebooks. Our Osiris prototype takes a notebook as input and outputs the possible execution schemes that reproduce the exact notebook results. In our sample, Osiris was able to reconstruct such schemes for 82.23% of all executable notebooks, which is more than three times better than the state of the art; the resulting reordered code is valid program code and is thus available for further testing and analysis.
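The two steps can be sketched roughly as follows. This is not the Osiris implementation, which the abstract does not describe in detail; it is a simplified illustration that assumes each name is defined in exactly one cell, uses Python's `ast` module and `graphlib.TopologicalSorter` (Python 3.9+), and introduces the hypothetical helpers `names_defined_and_used`, `candidate_order`, and `with_fixed_seed`.

```python
import ast
from graphlib import TopologicalSorter


def names_defined_and_used(source):
    """Collect names a cell defines and names it needs from other cells."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
    # Simplification: a name defined in this cell is assumed not to be
    # needed from another cell.
    return defined, used - defined


def candidate_order(cells):
    """Return one cell order in which every definition precedes its uses."""
    info = [names_defined_and_used(source) for source in cells]
    definer = {}
    for index, (defined, _) in enumerate(info):
        for name in defined:
            definer.setdefault(name, index)
    # Map each cell to the set of cells it depends on.
    graph = {index: set() for index in range(len(cells))}
    for index, (_, used) in enumerate(info):
        for name in used:
            if name in definer and definer[name] != index:
                graph[index].add(definer[name])
    return list(TopologicalSorter(graph).static_order())


def with_fixed_seed(source, seed=0):
    """Prepend a fixed seed to cells that mention the `random` module."""
    mentions_random = any(
        isinstance(node, ast.Name) and node.id == "random"
        for node in ast.walk(ast.parse(source))
    )
    if not mentions_random:
        return source
    return f"import random\nrandom.seed({seed})\n{source}"


cells = [
    "total = sum(values)",
    "values = [1, 2, 3]",
    "mean = total / len(values)",
]
print(candidate_order(cells))  # [1, 0, 2]
```

A real tool would also have to handle names redefined across cells, cyclic dependencies (which `TopologicalSorter` reports as a `CycleError`), and other sources of randomness such as NumPy; the sketch ignores all of these.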

Information & Contributors

Information

Published In

ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Companion Proceedings
June 2020
357 pages
ISBN: 9781450371223
DOI: 10.1145/3377812
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

In-Cooperation

  • KIISE: Korean Institute of Information Scientists and Engineers
  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Jupyter notebooks
  2. Osiris
  3. Python
  4. reproducibility

Qualifiers

  • Poster

Conference

ICSE '20
Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 43
  • Downloads (last 6 weeks): 4
Reflects downloads up to 22 Dec 2024

Citations

Cited By

  • (2025) Detecting and Explaining Python Name Errors. Information and Software Technology 178 (107592). DOI: 10.1016/j.infsof.2024.107592. Online publication date: Feb-2025
  • (2024) Inner External DQN LoRa SF Allocation Scheme for Complex Environments. Electronics 13:14 (2761). DOI: 10.3390/electronics13142761. Online publication date: 14-Jul-2024
  • (2024) Corpus-based discourse analysis: from meta-reflection to accountability. Corpus Linguistics and Linguistic Theory 20:3 (539-566). DOI: 10.1515/cllt-2023-0104. Online publication date: 16-Apr-2024
  • (2024) Reproducibility Debt: Challenges and Future Pathways. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (462-466). DOI: 10.1145/3663529.3663778. Online publication date: 10-Jul-2024
  • (2023) Taming the Diversity of Computational Notebooks. Proceedings of the 27th ACM International Systems and Software Product Line Conference - Volume A (27-33). DOI: 10.1145/3579027.3608974. Online publication date: 28-Aug-2023
  • (2023) Shifting Left for Early Detection of Machine-Learning Bugs. Formal Methods (584-597). DOI: 10.1007/978-3-031-27481-7_33. Online publication date: 3-Mar-2023
  • (2022) Error identification strategies for Python Jupyter notebooks. Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension (253-263). DOI: 10.1145/3524610.3529156. Online publication date: 16-May-2022
  • (2022) Eliciting Best Practices for Collaboration with Computational Notebooks. Proceedings of the ACM on Human-Computer Interaction 6:CSCW1 (1-41). DOI: 10.1145/3512934. Online publication date: 7-Apr-2022
  • (2022) A static analysis framework for data science notebooks. Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice (13-22). DOI: 10.1145/3510457.3513032. Online publication date: 21-May-2022
  • (2022) Provenance-enhanced Root Cause Analysis for Jupyter Notebooks. 2022 IEEE/ACM 15th International Conference on Utility and Cloud Computing (UCC) (327-333). DOI: 10.1109/UCC56403.2022.00058. Online publication date: Dec-2022
