DOI: 10.1109/SC41406.2024.00055
Research article

Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching

Published: 17 November 2024

Abstract

Deep Learning Recommendation Models (DLRMs) are pivotal in many sectors, yet they are hindered by the high memory demands of embedding tables and by significant communication overhead in distributed training. Traditional approaches such as Tensor-Train (TT) decomposition compress these tables effectively but introduce a substantial computational burden, and existing distributed training frameworks are further limited by their excessive data-exchange requirements.
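To make this trade-off concrete, below is a minimal PyTorch sketch of a TT-compressed embedding lookup. The factorization sizes, ranks, and the `tt_embedding_lookup` helper are illustrative assumptions, not the paper's implementation; the point is that the table shrinks from N×D parameters to a few small cores, while every lookup now pays for extra tensor contractions.

```python
import torch

# Illustrative sizes (assumptions, not taken from the paper): a table of
# N = 50*50*50 = 125,000 rows and D = 4*4*4 = 64 dims is stored as three
# TT cores totaling ~58K parameters instead of N*D = 8M.
n = (50, 50, 50)     # factorization of the row-index space
d = (4, 4, 4)        # factorization of the embedding dimension
r = (1, 16, 16, 1)   # TT ranks, with boundary ranks fixed to 1

# Core k has shape (r_k, n_k, d_k, r_{k+1}) and is trainable.
cores = [torch.randn(r[k], n[k], d[k], r[k + 1], requires_grad=True)
         for k in range(3)]

def tt_embedding_lookup(indices: torch.Tensor) -> torch.Tensor:
    """Reconstruct embedding rows from TT cores for a batch of indices."""
    # Mixed-radix decomposition of each flat row index into (i1, i2, i3).
    i1 = indices // (n[1] * n[2])
    i2 = (indices // n[2]) % n[1]
    i3 = indices % n[2]
    # Slice each core at its digit; shapes become (B, r_k, d_k, r_{k+1}).
    s1 = cores[0][:, i1].permute(1, 0, 2, 3)   # (B, 1,  d1, r1)
    s2 = cores[1][:, i2].permute(1, 0, 2, 3)   # (B, r1, d2, r2)
    s3 = cores[2][:, i3].permute(1, 0, 2, 3)   # (B, r2, d3, 1)
    # Contract away the ranks; this is the extra compute TT introduces.
    out = torch.einsum('bxiy,byjz,bzkw->bijk', s1, s2, s3)
    return out.reshape(indices.shape[0], d[0] * d[1] * d[2])

emb = tt_embedding_lookup(torch.randint(0, 125_000, (8,)))  # shape (8, 64)
```

With these assumed sizes, the three cores hold roughly 58K parameters in place of an 8M-entry dense table, which is exactly the compression-for-compute trade described above.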
This paper proposes EcoRec, a library that accelerates DLRM training by tightly integrating TT decomposition with distributed training. EcoRec introduces a novel computation pattern that eliminates redundancy in TT operations, together with an efficient multiplication pathway, significantly reducing computation time. It also provides a micro-batching technique with sorted indices that decreases memory demands without additional computational cost, and a pipeline training system for embedding layers that ensures balanced data distribution and efficient communication. EcoRec, built on PyTorch and CUDA, was evaluated on a cluster of 32 GPUs. The results show that EcoRec significantly outperforms the existing EL-Rec system, achieving up to a 3.1× speedup and a 38.5% reduction in memory requirements. EcoRec marks a notable advancement in high-performance DLRM training.
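The micro-batching idea can be sketched as follows, under simple assumptions and reusing the hypothetical `tt_embedding_lookup` above; this is a generic illustration, not EcoRec's actual algorithm. Splitting the batch bounds the peak activation memory of the TT contractions, and deduplicating the sorted indices within each micro-batch avoids reconstructing any row twice.

```python
def embedding_forward(indices: torch.Tensor, micro_batch: int) -> torch.Tensor:
    """Process a large batch of lookups in memory-bounded micro-batches."""
    outputs = []
    for chunk in indices.split(micro_batch):
        # Sorting groups duplicate indices together; each unique row is
        # reconstructed from the TT cores once, then gathered back into
        # the original order via the inverse mapping.
        uniq, inverse = torch.unique(chunk, sorted=True, return_inverse=True)
        outputs.append(tt_embedding_lookup(uniq)[inverse])
    return torch.cat(outputs, dim=0)

# E.g., a 4,096-index batch processed 512 lookups at a time:
vecs = embedding_forward(torch.randint(0, 125_000, (4096,)), micro_batch=512)
```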

Supplemental Material

MP4 File
Recorded presentation of "Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching" at SC24.

Published In

SC '24: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2024
1758 pages
ISBN: 9798350352917

Publisher

IEEE Press

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SC '24

Acceptance Rates

Overall Acceptance Rate: 1,516 of 6,373 submissions (24%)
