DOI: 10.1145/3643991.3644898
research-article

Analyzing the Evolution and Maintenance of ML Models on Hugging Face

Published: 02 July 2024

Abstract

Hugging Face (HF) has established itself as a crucial platform for the development and sharing of machine learning (ML) models. This repository mining study, which examines more than 380,000 models using data gathered via the HF Hub API, explores the community engagement, evolution, and maintenance around models hosted on HF, aspects that have yet to be comprehensively explored in the literature. We first examine the overall growth and popularity of HF, uncovering trends in ML domains, framework usage, author groupings, and the evolution of tags and datasets used. Through text analysis of model card descriptions, we also seek to identify prevalent themes and insights within the developer community. Our investigation further extends to the maintenance of models: we evaluate the maintenance status of ML models, classify commit messages into categories (corrective, perfective, and adaptive), analyze how commit metrics evolve across development stages, and introduce a new classification system that estimates the maintenance status of models based on multiple attributes. This study aims to provide insights into ML model maintenance and evolution that could inform future model development strategies on platforms like HF.
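
To make the three commit categories named in the abstract concrete, the following is a minimal sketch of a keyword-based labeler. It is not the classifier used in the study; the keyword lists, the `KEYWORDS` table, and the `classify_commit` function are hypothetical and chosen only for illustration, and the "unknown" fallback is an assumption for messages that match nothing.

```python
import re

# Hypothetical keyword stems per category (illustration only, not from the paper).
# Order matters: the first category with a matching stem wins.
KEYWORDS = {
    "corrective": ("fix", "bug", "error", "typo", "revert"),
    "adaptive": ("add", "support", "upgrade", "migrate", "convert"),
    "perfective": ("update", "improve", "refactor", "clean", "readme"),
}


def classify_commit(message: str) -> str:
    """Return the first category with a stem that prefixes a word in the message."""
    words = re.findall(r"[a-z]+", message.lower())
    for category, stems in KEYWORDS.items():
        if any(word.startswith(stem) for word in words for stem in stems):
            return category
    # Assumed fallback label for messages that match no keyword.
    return "unknown"


if __name__ == "__main__":
    for msg in ("Fix tokenizer bug", "Add safetensors weights", "Update README.md"):
        print(f"{msg!r} -> {classify_commit(msg)}")
```

A heuristic like this is easy to inspect but brittle; a learned classifier would usually be preferred for a large-scale study.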

Published In

MSR '24: Proceedings of the 21st International Conference on Mining Software Repositories
April 2024
788 pages
ISBN:9798400705878
DOI:10.1145/3643991
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. repository mining
  2. software evolution
  3. maintenance

Qualifiers

  • Research-article

Conference

MSR '24

Bibliometrics

  • Total Citations: 0
  • Total Downloads: 140
  • Downloads (Last 12 months): 140
  • Downloads (Last 6 weeks): 10

Reflects downloads up to 14 Dec 2024
