DOI: 10.1145/3643991.3644876
research-article

Bidirectional Paper-Repository Tracing in Software Engineering

Published: 02 July 2024

Abstract

While computer science papers frequently include their associated code repositories, establishing a clear link between a paper and its corresponding implementation can be challenging because of the number of code repositories used in research publications. In this paper, we describe a lightweight method for identifying bidirectional links between papers and repositories from both LaTeX and PDF sources. We used our approach to analyze more than 14,000 PDF and LaTeX files in the Software Engineering category of arXiv, generating a dataset of more than 1,400 paper-code implementations and assessing current citation practices on it.
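
The abstract does not spell out how links are detected, so the following is only a minimal sketch of one way such a check might look. It assumes that a link counts as bidirectional when the paper text mentions a repository URL and that repository's README in turn mentions the paper (by arXiv identifier, DOI, or title). The regular expression, the function names, and the README-matching rule are all hypothetical illustrations, not the authors' method.

```python
import re

# Hypothetical pattern: GitHub/GitLab URLs of the form host/owner/name.
REPO_URL = re.compile(
    r"https?://(?:www\.)?(github\.com|gitlab\.com)/([\w.-]+)/([\w.-]+)",
    re.IGNORECASE,
)


def extract_repo_urls(paper_text: str) -> set[str]:
    """Collect normalized repository URLs mentioned in paper text (LaTeX or PDF-extracted)."""
    urls = set()
    for host, owner, name in REPO_URL.findall(paper_text):
        name = name.rstrip(".")          # drop a sentence-final period
        if name.endswith(".git"):
            name = name[:-4]             # drop a clone-style suffix
        urls.add(f"https://{host}/{owner}/{name}".lower())
    return urls


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching tolerates line breaks."""
    return " ".join(text.lower().split())


def repo_mentions_paper(readme_text, arxiv_id=None, doi=None, title=None) -> bool:
    """Assumed rule: the README points back if it contains any known paper identifier."""
    haystack = _normalize(readme_text)
    return any(_normalize(c) in haystack for c in (arxiv_id, doi, title) if c)


def bidirectional_links(paper_text, readmes, **paper_ids) -> set[str]:
    """Keep only repositories cited by the paper whose README cites the paper back.

    `readmes` maps normalized repository URLs to README text; fetching the
    READMEs is out of scope for this sketch.
    """
    return {
        url
        for url in extract_repo_urls(paper_text)
        if url in readmes and repo_mentions_paper(readmes[url], **paper_ids)
    }


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    paper = r"Our code is at \url{https://github.com/Example-Org/demo-tool}."
    readmes = {"https://github.com/example-org/demo-tool": "Code for arXiv:2401.00001."}
    print(bidirectional_links(paper, readmes, arxiv_id="2401.00001"))
```

Repositories that a paper mentions only as a dependency or baseline would typically fail the README check here, which is one plausible way to separate a paper's own implementation from the other repositories it references.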

Cited By

  • RepoFromPaper: An Approach to Extract Software Code Implementations from Scientific Publications. Natural Scientific Language Processing and Research Knowledge Graphs (2024), 100-113. DOI: 10.1007/978-3-031-65794-8_7. Online publication date: 26-May-2024

Information & Contributors

Information

Published In

MSR '24: Proceedings of the 21st International Conference on Mining Software Repositories
April 2024
788 pages
ISBN: 9798400705878
DOI: 10.1145/3643991
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. research software
  2. article analysis
  3. software citation
  4. open science

Qualifiers

  • Research-article

Conference

MSR '24
Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 45
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 09 Jan 2025
