Article

Machine learning at the limit

Authors: John Canny, Huasha Zhao, Bobby Jaros, Ye Chen, Jiangchang Mao

BIG DATA '15: Proceedings of the 2015 IEEE International Conference on Big Data (Big Data)

Pages 233 - 242

https://doi.org/10.1109/BigData.2015.7363760

Published: 29 October 2015

Abstract

Many systems have been developed for machine learning at scale. Performance has steadily improved, but there has been relatively little work on explicitly defining or approaching the limits of performance. In this paper we describe the application of roofline design, an approach borrowed from computer architecture, to large-scale machine learning. In roofline design, one exposes ALU, memory, and network limits, and the constraints they imply for algorithms. Using roofline design, we have developed a system called BIDMach which has demonstrated the highest performance to date for many ML problems. On one GPU-accelerated node, it generally outperforms other single-machine toolkits and cluster toolkits running on 100s of nodes. This performance level is enabled by a relatively small number of rooflined matrix primitives. Such performance implies a dramatic reduction in the energy used to perform these calculations. Beyond matrix kernels, roofline design can be applied to the end-to-end design of machine learning algorithms which minimize memory usage to optimize speed. This approach offers a further 2x to 3x gain in performance. Roofline design can also be applied to network primitives. We describe recent work on a sparse allreduce primitive called Kylix. We have shown that Kylix approaches the practical network throughput limit for allreduce, a basic primitive for distributed machine learning. Using Kylix, we describe an efficient transformation from model-parallel to data-parallel calculations. This transformation uses a secondary storage roofline, with similar parameters to the network. Finally, we describe several deployments of these techniques on real-world problems in two large internet companies. Once again, single node rooflined design demonstrated substantial gains over alternatives on either single nodes or clusters.
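The roofline idea mentioned in the abstract can be illustrated with the standard roofline bound from computer architecture: attainable throughput is the smaller of the peak arithmetic rate and the memory bandwidth multiplied by the kernel's arithmetic intensity. The sketch below is only an illustration under stated assumptions. The function names, the device figures, and the kernel intensities are hypothetical placeholders, not measurements from the paper or from BIDMach.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Upper bound on throughput (GFLOP/s) under the standard roofline model."""
    if min(peak_gflops, bandwidth_gbs, intensity_flops_per_byte) < 0:
        raise ValueError("inputs must be non-negative")
    # A kernel can go no faster than the ALUs allow, and no faster than
    # memory can feed it (bytes/s times flops performed per byte moved).
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (flops/byte) where the two limits meet."""
    return peak_gflops / bandwidth_gbs


# Hypothetical device: 4000 GFLOP/s peak, 200 GB/s memory bandwidth.
PEAK, BW = 4000.0, 200.0

# Hypothetical kernels with assumed arithmetic intensities (flops per byte).
kernels = {"sparse-dense multiply": 0.25, "dense matrix multiply": 50.0}

for name, intensity in kernels.items():
    bound = roofline_gflops(PEAK, BW, intensity)
    limit = "memory" if intensity < ridge_point(PEAK, BW) else "compute"
    print(f"{name}: at most {bound:.0f} GFLOP/s ({limit}-bound)")
```

A kernel whose intensity falls below the ridge point is limited by memory traffic rather than arithmetic, which is why reducing memory usage can translate directly into speed. The same bound can be written for a network or storage link by substituting its bandwidth.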

Published In

BIG DATA '15 proceedings, October 2015, 3094 pages

ISBN: 9781479999262

Publisher: IEEE Computer Society, United States
