research-article

A large-scale comparison of Python code in Jupyter notebooks and scripts

Authors: Konstantin Grotov, Vladimir Sotnikov, Yaroslav Golubev, Timofey Bryksin

MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories

Pages 353 - 364

https://doi.org/10.1145/3524842.3528447

Published: 17 October 2022

Abstract

In recent years, Jupyter notebooks have grown in popularity in several domains of software engineering, such as data science, machine learning, and computer science education. Their popularity has to do with their rich features for presenting and visualizing data; however, recent studies show that notebooks also share a number of drawbacks, such as a high number of code clones and low reproducibility. In this work, we carry out a comparison between Python code written in Jupyter notebooks and in traditional Python scripts. We compare the code from two perspectives: structural and stylistic. In the first part of the analysis, we report the difference in the number of lines, the usage of functions, and various complexity metrics. In the second part, we show the difference in the number of stylistic issues and provide an extensive overview of the 15 most frequent stylistic issues in the studied mediums. Overall, we demonstrate that notebooks are characterized by lower code complexity; however, their code could be perceived as more entangled than the code in scripts. As for style, notebooks tend to have 1.4 times more stylistic issues, but at the same time, some of these issues are caused by specific coding practices in notebooks and should be considered false positives. With this research, we want to pave the way to studying specific problems of notebooks that should be addressed by the development of notebook-specific tools, and to provide various insights that can be useful in this regard.
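
To make the structural perspective more concrete, the following is a minimal sketch of how line counts and function usage could be collected from a notebook. It is not the authors' pipeline. It assumes the nbformat 4 JSON layout, in which code cells carry a "source" field. All function names are hypothetical, and the "branches" count is only a crude stand-in for a real complexity metric.

```python
import ast
import json


def code_cells(notebook_json: str) -> list[str]:
    """Return the source of every code cell (assumes nbformat 4 layout)."""
    nb = json.loads(notebook_json)
    cells = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        # The source is stored either as one string or as a list of lines.
        cells.append("".join(src) if isinstance(src, list) else src)
    return cells


def strip_magics(source: str) -> str:
    """Drop IPython magics and shell escapes, which are not valid Python.

    Simplification: a block whose only statement is a magic would be left
    empty and fail to parse; this sketch does not handle that case.
    """
    return "\n".join(
        line for line in source.splitlines()
        if not line.lstrip().startswith(("%", "!"))
    )


def structural_metrics(source: str) -> dict[str, int]:
    """Count non-blank lines, function definitions, and branching nodes."""
    nodes = list(ast.walk(ast.parse(source)))
    return {
        "lines": sum(1 for line in source.splitlines() if line.strip()),
        "functions": sum(
            isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) for n in nodes
        ),
        # Hypothetical proxy for complexity, not cyclomatic complexity.
        "branches": sum(
            isinstance(n, (ast.If, ast.For, ast.While, ast.Try)) for n in nodes
        ),
    }


def notebook_metrics(notebook_json: str) -> dict[str, int]:
    """Concatenate all code cells and measure them as one script."""
    source = "\n".join(strip_magics(cell) for cell in code_cells(notebook_json))
    return structural_metrics(source)
```

Applying structural_metrics directly to the text of a .py file gives the script-side counterpart, so both mediums can be measured with the same function.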

Published In

MSR '22: Proceedings of the 19th International Conference on Mining Software Repositories

May 2022

815 pages

ISBN: 9781450393034

DOI: 10.1145/3524842

General Chair: David Lo (Singapore Management University, Singapore)

Program Chairs: Shane McIntosh (University of Waterloo, Canada), Nicole Novielli (University of Bari, Italy)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MSR '22: 19th International Conference on Mining Software Repositories

May 23 - 24, 2022

Pittsburgh, Pennsylvania
