research-article

Analyzing the Evolution and Maintenance of ML Models on Hugging Face

Authors:

Silverio Martínez-Fernández,

Justus Bogner

MSR '24: Proceedings of the 21st International Conference on Mining Software Repositories

Pages 607 - 618

https://doi.org/10.1145/3643991.3644898

Published: 02 July 2024

Abstract

Hugging Face (HF) has established itself as a crucial platform for the development and sharing of machine learning (ML) models. This repository mining study analyzes more than 380,000 models, using data gathered via the HF Hub API, to explore community engagement, evolution, and maintenance around models hosted on HF, aspects that have yet to be comprehensively explored in the literature. We first examine the overall growth and popularity of HF, uncovering trends in ML domains, framework usage, author groupings, and the evolution of the tags and datasets used. Through text analysis of model card descriptions, we also seek to identify prevalent themes and insights within the developer community. Our investigation further extends to maintenance: we evaluate the maintenance status of ML models, classify commit messages into categories (corrective, perfective, and adaptive), analyze how commit metrics evolve across development stages, and introduce a new classification system that estimates the maintenance status of models based on multiple attributes. This study aims to provide insights into ML model maintenance and evolution that could inform future model development strategies on platforms like HF.
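
To make the commit-classification step in the abstract concrete, the following is a minimal sketch of sorting commit messages into the corrective, perfective, and adaptive categories. It is a plain keyword heuristic for illustration only and should not be read as the classifier used in the study. The keyword lists, the `unknown` fallback label, and the function names `classify_commit` and `category_counts` are all assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical keyword lists, chosen only for illustration.
# They are NOT taken from the paper.
KEYWORDS = {
    "corrective": {"fix", "bug", "error", "typo", "revert"},
    "adaptive": {"add", "upload", "support", "migrate", "convert"},
    "perfective": {"update", "improve", "refactor", "clean", "readme"},
}


def classify_commit(message):
    """Return every category whose keywords occur in the message.

    A message may receive several labels; if no keyword matches,
    the assumed fallback label "unknown" is returned.
    """
    words = set(re.findall(r"[a-z]+", message.lower()))
    labels = [cat for cat, kws in KEYWORDS.items() if words & kws]
    return labels or ["unknown"]


def category_counts(messages):
    """Count how often each category is assigned across messages."""
    counts = Counter()
    for message in messages:
        counts.update(classify_commit(message))
    return counts


if __name__ == "__main__":
    history = ["Upload model weights", "Fix typo in README", "initial commit"]
    for msg in history:
        print(msg, "->", classify_commit(msg))
    print(dict(category_counts(history)))
```

Allowing several labels per message reflects that a single commit can mix maintenance intents, such as a fix that also touches documentation. A study at this scale would more plausibly rely on a trained classifier than on fixed keywords.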



Published In

MSR '24: Proceedings of the 21st International Conference on Mining Software Repositories

April 2024

788 pages

ISBN: 9798400705878

DOI: 10.1145/3643991

Chair: Diomidis Spinellis
Program Chair: Alberto Bacchelli
Program Co-chair: Eleni Constantinou

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MSR '24

Sponsor:

SIGSOFT

MSR '24: 21st International Conference on Mining Software Repositories

April 15 - 16, 2024

Lisbon, Portugal
