DOI: 10.1109/ICSE-SEIP58684.2023.00052

An Empirical Study on Quality Issues of Deep Learning Platform

Published: 20 September 2023

Abstract

In recent years, deep learning (DL) has been increasingly adopted in many application areas. To help deep learning developers better train and test their models, enterprises have built dedicated, multi-tenant platforms equipped with large numbers of computing devices such as GPUs. The service quality of these platforms plays a critical role in system efficiency and user experience. Nevertheless, diverse types of quality issues arise in practice that not only waste computing resources significantly but also severely slow down development productivity. In this paper, we present a comprehensive empirical study on quality issues of Platform-X, an internal production deep learning platform in Microsoft that serves hundreds of developers and researchers. We manually examined 360 real issues and investigated their common symptoms, root causes, and mitigation actions. Our major findings include: (1) 28.33% of the quality issues are caused by hardware faults (GPU, network, and compute node); (2) another 28.33% result from system-side faults (e.g., system defects and service outages); (3) user-side faults (e.g., user bugs and policy violations) account for more than two-fifths (43.34%) of all the common causes; (4) nearly three-fifths of all the quality issues can be mitigated simply by resubmitting jobs (34.72%) or improving user code (24.72%). Our study results provide valuable guidance on improving the service quality of deep learning platforms from both the development and maintenance perspectives, and further motivate possible research directions and tooling support.
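
The finding that over a third of the issues can be mitigated by simply resubmitting jobs suggests that transient hardware- and system-side faults are often recoverable with an automated retry policy. The sketch below is a minimal illustration of that idea, not Platform-X code: the job-client API (submit_job, wait_for_completion), the failure-cause labels, and the backoff values are hypothetical placeholders introduced here for illustration.

import time

# Hypothetical set of transient failure causes that a resubmission can plausibly
# fix (hardware and system-side faults); user-side faults need a code/config fix.
TRANSIENT_CAUSES = {"gpu_error", "node_lost", "network_timeout", "service_outage"}

def run_with_resubmission(client, job_spec, max_retries=3, backoff_s=60):
    """Submit a training job and resubmit it on transient failures."""
    for attempt in range(max_retries + 1):
        job_id = client.submit_job(job_spec)          # hypothetical platform API
        result = client.wait_for_completion(job_id)   # hypothetical platform API
        if result.state == "SUCCEEDED":
            return result
        if result.failure_cause not in TRANSIENT_CAUSES:
            # e.g., a user bug or policy violation: resubmission will not help.
            raise RuntimeError(f"non-transient failure: {result.failure_cause}")
        time.sleep(backoff_s * (attempt + 1))         # simple linear backoff
    raise RuntimeError("job still failing after retries; escalate to operators")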

Cited By

  • (2024) Contract-based Validation of Conceptual Design Bugs for Engineering Complex Machine Learning Software. Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, pp. 155-161. DOI: 10.1145/3652620.3688201. Online publication date: 22-Sep-2024.
  • (2024) Navigating the Complexity of Generative AI Adoption in Software Engineering. ACM Transactions on Software Engineering and Methodology, 33(5), pp. 1-50. DOI: 10.1145/3652154. Online publication date: 4-Jun-2024.

Published In

ICSE-SEIP '23: Proceedings of the 45th International Conference on Software Engineering: Software Engineering in Practice
May 2023
522 pages
ISBN: 9798350300376

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Publication History

Published: 20 September 2023

Author Tags

  1. deep learning
  2. deep learning platform
  3. quality issue
  4. empirical study

Qualifiers

  • Research-article

Conference

ICSE-SEIP '23

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
