DOI: 10.1109/SC41406.2024.00055
Research article

Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching

Published: 17 November 2024

Abstract

Deep Learning Recommendation Models (DLRMs) are pivotal in many sectors, yet they are hindered by the high memory demands of embedding tables and by significant communication overhead in distributed training. Traditional approaches such as Tensor-Train (TT) decomposition compress these tables effectively but introduce a substantial computational burden, and existing distributed training frameworks are further limited by their excessive data-exchange requirements.
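To make this trade-off concrete, below is a minimal PyTorch sketch of a TT-compressed embedding lookup. The factorization sizes, ranks, and the `tt_embedding_lookup` helper are illustrative assumptions, not the paper's implementation; the point is that the table shrinks from N×D parameters to a few small cores, while every lookup now pays for extra tensor contractions.

```python
import torch

# Illustrative sizes (assumptions, not taken from the paper): a table of
# N = 50*50*50 = 125,000 rows and D = 4*4*4 = 64 dims is stored as three
# TT cores totaling ~58K parameters instead of N*D = 8M.
n = (50, 50, 50)     # factorization of the row-index space
d = (4, 4, 4)        # factorization of the embedding dimension
r = (1, 16, 16, 1)   # TT ranks, with boundary ranks fixed to 1

# Core k has shape (r_k, n_k, d_k, r_{k+1}) and is trainable.
cores = [torch.randn(r[k], n[k], d[k], r[k + 1], requires_grad=True)
         for k in range(3)]

def tt_embedding_lookup(indices: torch.Tensor) -> torch.Tensor:
    """Reconstruct embedding rows from TT cores for a batch of indices."""
    # Mixed-radix decomposition of each flat row index into (i1, i2, i3).
    i1 = indices // (n[1] * n[2])
    i2 = (indices // n[2]) % n[1]
    i3 = indices % n[2]
    # Slice each core at its digit; shapes become (B, r_k, d_k, r_{k+1}).
    s1 = cores[0][:, i1].permute(1, 0, 2, 3)   # (B, 1,  d1, r1)
    s2 = cores[1][:, i2].permute(1, 0, 2, 3)   # (B, r1, d2, r2)
    s3 = cores[2][:, i3].permute(1, 0, 2, 3)   # (B, r2, d3, 1)
    # Contract away the ranks; this is the extra compute TT introduces.
    out = torch.einsum('bxiy,byjz,bzkw->bijk', s1, s2, s3)
    return out.reshape(indices.shape[0], d[0] * d[1] * d[2])

emb = tt_embedding_lookup(torch.randint(0, 125_000, (8,)))  # shape (8, 64)
```

With these assumed sizes, the three cores hold roughly 58K parameters in place of an 8M-entry dense table, which is exactly the compression-for-compute trade described above.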
This paper proposes EcoRec, a library that accelerates DLRM training by tightly integrating TT decomposition with distributed training. EcoRec introduces a novel computation pattern that eliminates redundancy in TT operations, together with an efficient multiplication pathway, significantly reducing computation time. It also provides a micro-batching technique with sorted indices that decreases memory demands without additional computational cost, and a pipeline training system for embedding layers that ensures balanced data distribution and efficient communication. EcoRec, built on PyTorch and CUDA, was evaluated on a cluster of 32 GPUs. The results show that EcoRec significantly outperforms the existing EL-Rec system, achieving up to a 3.1× speedup and a 38.5% reduction in memory requirements. EcoRec marks a notable advancement in high-performance DLRM training.
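The micro-batching idea can be sketched as follows, under simple assumptions and reusing the hypothetical `tt_embedding_lookup` above; this is a generic illustration, not EcoRec's actual algorithm. Splitting the batch bounds the peak activation memory of the TT contractions, and deduplicating the sorted indices within each micro-batch avoids reconstructing any row twice.

```python
def embedding_forward(indices: torch.Tensor, micro_batch: int) -> torch.Tensor:
    """Process a large batch of lookups in memory-bounded micro-batches."""
    outputs = []
    for chunk in indices.split(micro_batch):
        # Sorting groups duplicate indices together; each unique row is
        # reconstructed from the TT cores once, then gathered back into
        # the original order via the inverse mapping.
        uniq, inverse = torch.unique(chunk, sorted=True, return_inverse=True)
        outputs.append(tt_embedding_lookup(uniq)[inverse])
    return torch.cat(outputs, dim=0)

# E.g., a 4,096-index batch processed 512 lookups at a time:
vecs = embedding_forward(torch.randint(0, 125_000, (4096,)), micro_batch=512)
```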

Supplemental Material

MP4 File
Recorded presentation of "Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching" at SC24.

Published In

SC '24: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2024
1758 pages
ISBN: 9798350352917

Publisher

IEEE Press

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SC '24

Acceptance Rates

Overall Acceptance Rate: 1,516 of 6,373 submissions (24%)
