DOI: 10.1145/3411763.3451617
poster

What Makes a Well-Documented Notebook? A Case Study of Data Scientists’ Documentation Practices in Kaggle

Published: 08 May 2021

Abstract

Many data scientists use computational notebooks to test and present their work, because a notebook can weave code and documentation together into a computational narrative and supports rapid iteration on code experiments. However, writing good documentation in a data science notebook is not easy, partly because data scientists lack a corpus of well-documented notebooks to follow as exemplars. To address this challenge, this work looks at Kaggle, a large online community where data scientists host and participate in machine learning competitions, and considers highly-voted Kaggle notebooks as a proxy for well-documented notebooks. Through a qualitative analysis at both the notebook level and the markdown-cell level, we find that these notebooks are indeed well documented with reference to previous literature. Our analysis also reveals nine categories of content that data scientists write in their documentation cells, and shows that these documentation cells often interplay with different stages of the data science lifecycle. We conclude the paper with design implications and future research directions.

Published In

CHI EA '21: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems
May 2021
2965 pages
ISBN: 9781450380959
DOI: 10.1145/3411763
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Computational notebooks
  2. Kaggle
  3. code documentation
  4. data science
  5. machine learning

Qualifiers

  • Poster
  • Research
  • Refereed limited

Conference

CHI '21
Acceptance Rates

Overall Acceptance Rate 6,164 of 23,696 submissions, 26%

Cited By

  • (2024) Key Insights from a Feature Discovery User Study. Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics, 1-5. DOI: 10.1145/3665939.3665961. Online publication date: 14-Jun-2024
  • (2024) NextPyter: Open-Source Research Collaborative Platform. Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 1-9. DOI: 10.1145/3626203.3670516. Online publication date: 17-Jul-2024
  • (2023) JuGaze: A Cell-based Eye Tracking and Logging Tool for Jupyter Notebooks. Proceedings of the 23rd Koli Calling International Conference on Computing Education Research, 1-11. DOI: 10.1145/3631802.3631824. Online publication date: 13-Nov-2023
  • (2023) A Survey of Data Quality Requirements That Matter in ML Development Pipelines. Journal of Data and Information Quality 15:2, 1-39. DOI: 10.1145/3592616. Online publication date: 22-Jun-2023
  • (2023) Taming the Diversity of Computational Notebooks. Proceedings of the 27th ACM International Systems and Software Product Line Conference - Volume A, 27-33. DOI: 10.1145/3579027.3608974. Online publication date: 28-Aug-2023
  • (2023) How Data Scientists Review the Scholarly Literature. Proceedings of the 2023 Conference on Human Information Interaction and Retrieval, 137-152. DOI: 10.1145/3576840.3578309. Online publication date: 19-Mar-2023
  • (2023) Causalvis: Visualizations for Causal Inference. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1-20. DOI: 10.1145/3544548.3581236. Online publication date: 19-Apr-2023
  • (2023) Code Code Evolution: Understanding How People Change Data Science Notebooks Over Time. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1-12. DOI: 10.1145/3544548.3580997. Online publication date: 19-Apr-2023
  • (2023) Evaluating Multi-Core Performance of Machine Learning Models Across Different Computing Environments. 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1-7. DOI: 10.1109/ICCCNT56998.2023.10307025. Online publication date: 6-Jul-2023
  • (2023) A Systematic Review of Online Learning Platforms for Computer Science Courses. 2023 IEEE World Engineering Education Conference (EDUNINE), 1-6. DOI: 10.1109/EDUNINE57531.2023.10102817. Online publication date: 12-Mar-2023
