DOI: 10.1145/3524842.3528447
research-article

A large-scale comparison of Python code in Jupyter notebooks and scripts

Published: 17 October 2022 Publication History

Abstract

In recent years, Jupyter notebooks have grown in popularity in several domains of software engineering, such as data science, machine learning, and computer science education. Their popularity stems from their rich features for presenting and visualizing data; however, recent studies show that notebooks also share a number of drawbacks, such as a high number of code clones and low reproducibility. In this work, we compare Python code written in Jupyter notebooks with Python code written in traditional scripts. We compare the code from two perspectives: structural and stylistic. In the first part of the analysis, we report the differences in the number of lines, the usage of functions, and various complexity metrics. In the second part, we show the difference in the number of stylistic issues and provide an extensive overview of the 15 most frequent stylistic issues in the two mediums. Overall, we demonstrate that notebooks are characterized by lower code complexity; however, their code could be perceived as more entangled than that of scripts. As for style, notebooks tend to have 1.4 times more stylistic issues, but some of these issues are caused by coding practices specific to notebooks and should be considered false positives. With this research, we want to pave the way for studying the specific problems of notebooks that should be addressed by notebook-specific tools, and to provide insights that can be useful in this regard.
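
To make the structural part of the comparison concrete, the following is a minimal sketch of how line counts, function usage, and a rough complexity score could be computed for a piece of Python code. It is not the authors' tooling: it relies only on the standard-library `ast` module, the complexity score is a simplified approximation of cyclomatic complexity (one plus the number of branching nodes), and the names `structural_metrics`, `script`, and `cell` are hypothetical and introduced only for illustration.

```python
import ast

# Node types counted as decision points in this rough approximation
# (an assumption of this sketch, not a definition taken from the paper).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def structural_metrics(source: str) -> dict:
    """Return simple structural metrics for a string of Python source code."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    # Count non-blank lines that are not pure comment lines.
    code_lines = [
        line for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    functions = sum(isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) for n in nodes)
    branches = sum(isinstance(n, _BRANCH_NODES) for n in nodes)
    return {
        "code_lines": len(code_lines),
        "functions": functions,
        "approx_complexity": branches + 1,
    }


# Hypothetical script-style snippet: logic wrapped in a function.
script = '''
def mean(xs):
    if not xs:
        return 0.0
    return sum(xs) / len(xs)

print(mean([1, 2, 3]))
'''

# Hypothetical notebook-cell-style snippet: top-level statements, no functions.
cell = '''
xs = [1, 2, 3]
total = 0
for x in xs:
    total += x
total / len(xs)
'''

for name, source in (("script", script), ("cell", cell)):
    print(name, structural_metrics(source))
```

A real study would apply such metrics to many files and would need to extract code cells from the notebook format first; this sketch only shows the kind of per-snippet measurement that the structural comparison refers to.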



Published In

MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories
May 2022
815 pages
ISBN:9781450393034
DOI:10.1145/3524842
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Conference

MSR '22