DOI: 10.1145/3442442.3452056

Reliability of Large Scale GPU Clusters for Deep Learning Workloads

Published: 03 June 2021

Abstract

Recent advances in deep learning technologies have made GPU clusters popular as training platforms. In this paper, we study reliability issues, focusing on training job failures, by analyzing logs collected from deep learning workloads running on a large-scale production GPU cluster. Based on their sources, these failures fall largely into two categories, infrastructure and user, and reveal diverse root causes. With insights obtained from this failure analysis, we suggest several ways to improve the stability of shared GPU clusters designed for DL training and to optimize user experience by reducing failure occurrences.
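
To make the two-way grouping described above concrete, the sketch below (not the authors' actual analysis pipeline) classifies failure-log messages as infrastructure-side or user-side with simple keyword matching; the category keywords and sample log lines are illustrative assumptions, not data or tooling from the paper.

    # Illustrative sketch only: classify failure-log lines into the
    # "infrastructure" vs. "user" categories described in the abstract.
    # Keyword lists and sample messages are hypothetical.
    import re
    from collections import Counter

    # Hypothetical signatures of infrastructure-side failures (hardware,
    # storage, networking) and user-side failures (code bugs, misconfiguration).
    INFRA_PATTERNS = [
        r"ECC error", r"NVLink", r"InfiniBand", r"node not responding",
        r"HDFS .* unavailable", r"GPU has fallen off the bus",
    ]
    USER_PATTERNS = [
        r"CUDA out of memory", r"ModuleNotFoundError", r"AssertionError",
        r"KeyError", r"shape mismatch", r"permission denied",
    ]

    def classify_failure(log_line: str) -> str:
        """Return 'infrastructure', 'user', or 'unknown' for a single log line."""
        if any(re.search(p, log_line, re.IGNORECASE) for p in INFRA_PATTERNS):
            return "infrastructure"
        if any(re.search(p, log_line, re.IGNORECASE) for p in USER_PATTERNS):
            return "user"
        return "unknown"

    if __name__ == "__main__":
        # Hypothetical failure messages, for demonstration only.
        sample_logs = [
            "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB",
            "Xid 79: GPU has fallen off the bus",
            "ModuleNotFoundError: No module named 'apex'",
            "NCCL WARN NVLink transport failed between ranks 3 and 7",
        ]
        counts = Counter(classify_failure(line) for line in sample_logs)
        print(counts)  # per-category failure counts for the sample messages

In practice, such rule-based labeling would likely be combined with job exit codes and scheduler metadata rather than raw log text alone.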




Information

Published In

WWW '21: Companion Proceedings of the Web Conference 2021
April 2021
726 pages
ISBN: 9781450383134
DOI: 10.1145/3442442

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. DNN training workloads
  2. GPU clusters
  3. reliability

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '21: The Web Conference 2021
April 19 - 23, 2021
Ljubljana, Slovenia

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
