research-article
Open access

Multiverse Notebook: Shifting Data Scientists to Time Travelers

Published: 29 April 2024

Abstract

Computational notebook environments are popular, de facto standard tools for programming in data science, yet computational notebooks are notorious in software engineering. The criticism stems from the fact that notebooks facilitate unrestricted dynamic patching of running programs, which makes exploratory coding quick but leaves the resulting code messy and inconsistent. In this work, we first reveal that, in data science programming on Kaggle, dynamic patching is a natural demand rather than merely a bad practice. We then develop Multiverse Notebook, a computational notebook engine for time-traveling exploration. It enables users to time-travel to any past state and restart from there with new code under state isolation. We present an approach to implementing time-traveling exploration efficiently. We empirically evaluate Multiverse Notebook on ten real-world tasks from Kaggle. Our experiments show that time-traveling exploration on Multiverse Notebook is reasonably efficient.
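To make "time-travel to any past state and restart with new code from there under state isolation" concrete, the following is a minimal toy sketch in Python. It is not the Multiverse Notebook implementation: the class `ToyMultiverse` and its `run` method are hypothetical names introduced here, and the sketch simply deep-copies the whole namespace before each cell runs, a naive strategy chosen only for clarity. It assumes cell state consists of plain, deep-copyable data (no imported modules or open files).

```python
import copy


class ToyMultiverse:
    """Hypothetical toy: every executed cell yields a new, isolated state."""

    def __init__(self):
        self.states = [{}]      # state 0 is the empty initial namespace
        self.parents = [None]   # parent state id of each state

    def run(self, parent_id, code):
        # Assumption: naive isolation by deep-copying the parent namespace,
        # so the parent state is never mutated by the new cell.
        namespace = copy.deepcopy(self.states[parent_id])
        exec(code, namespace)
        # exec() injects a __builtins__ entry; drop it before storing.
        namespace.pop("__builtins__", None)
        self.states.append(namespace)
        self.parents.append(parent_id)
        return len(self.states) - 1


nb = ToyMultiverse()
s1 = nb.run(0, "data = [3, 1, 2]")
s2 = nb.run(s1, "data.sort()")      # continue from s1
s3 = nb.run(s1, "data.reverse()")   # travel back to s1 and branch

print(nb.states[s1]["data"])  # [3, 1, 2]
print(nb.states[s2]["data"])  # [1, 2, 3]
print(nb.states[s3]["data"])  # [2, 1, 3]
```

Both branches start from the same past state, and neither sees the other's in-place mutation. Copying every state in full is the obvious cost of this naive version; the abstract's point is that such exploration can be implemented efficiently.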

Information & Contributors

Information

Published In

Proceedings of the ACM on Programming Languages, Volume 8, Issue OOPSLA1
April 2024
1492 pages
EISSN: 2475-1421
DOI: 10.1145/3554316
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 April 2024
Published in PACMPL Volume 8, Issue OOPSLA1

Author Tags

  1. Computational notebook
  2. Exploratory programming
  3. Memory management

Qualifiers

  • Research-article

Funding Sources

  • ACT-X, Japan Science and Technology Agency
