POSTER: Pattern-Aware Sparse Communication for Scalable Recommendation Model Training

Published: 20 February 2024
DOI: 10.1145/3627535.3638481

Abstract

Recommendation models are an important category of deep learning models, and their size has grown enormous. They consist of a sparse part with a memory footprint of terabytes and a dense part that demands PFLOPs of computing capability to train. Unfortunately, the high cost of the sparse communication needed to re-organize data between the two parts' different parallel strategies impedes training scalability.
Based on observations of sparse access patterns, we design a two-fold fine-grained parallel strategy to accelerate sparse communication. A performance model selects an optimal set of items to replicate across all GPUs, reducing all-to-all communication volume while keeping memory consumption acceptable. The all-to-all overhead is further reduced by parallel scheduling techniques. In our evaluation on 32 GPUs over real-world datasets, a 2.16--16.8× end-to-end speedup is achieved over the baselines.
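
To make the replication idea concrete, below is a minimal sketch of performance-model-driven item selection. It is not the authors' implementation: the function name `choose_replicated_items`, the greedy frequency-ordered heuristic, and all constants are illustrative assumptions, whereas the paper builds a fuller performance model to pick the replicated set. The sketch greedily replicates the hottest embedding rows on every GPU until a per-GPU memory budget is exhausted, since each replicated row removes its share of all-to-all traffic at the cost of one copy per GPU.

```python
# Illustrative sketch (not the paper's code): pick embedding rows to
# replicate on every GPU. A replicated row is read locally, so its
# all-to-all traffic disappears, at the cost of num_gpus copies in memory.

def choose_replicated_items(access_freq, row_bytes, num_gpus, mem_budget_bytes):
    """Greedy selection under a simple cost model.

    access_freq:      item id -> profiled accesses per batch
    row_bytes:        size of one embedding row in bytes
    num_gpus:         number of GPUs that would each hold a replica
    mem_budget_bytes: extra memory each GPU may spend on replicas
    Returns the set of item ids to replicate across all GPUs.
    """
    replicated, used = set(), 0
    # Hotter rows save more communication per byte of replica memory,
    # so visit items in descending access frequency.
    for item, freq in sorted(access_freq.items(), key=lambda kv: -kv[1]):
        if used + row_bytes > mem_budget_bytes:
            break  # per-GPU replica budget exhausted
        # Expected all-to-all bytes saved per batch: an access that would
        # have landed on a remote GPU now hits the local replica.
        saved = freq * row_bytes * (num_gpus - 1) / num_gpus
        if saved <= 0:
            break  # remaining items are never accessed
        replicated.add(item)
        used += row_bytes
    return replicated


if __name__ == "__main__":
    # Toy example: a skewed, power-law-like access distribution, which is
    # the typical pattern for recommendation embedding tables.
    freqs = {i: 1000 // (i + 1) for i in range(100)}
    hot = choose_replicated_items(freqs, row_bytes=512, num_gpus=32,
                                  mem_budget_bytes=16 * 1024)
    print(f"replicating {len(hot)} hot rows on all GPUs")
```

The greedy frequency order is a reasonable stand-in because embedding accesses in recommendation workloads are heavily skewed, so a small replicated set captures most of the traffic; a production system would weigh measured all-to-all bandwidth and the dense part's memory needs rather than a fixed byte budget.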

Published In

PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
March 2024, 498 pages
ISBN: 9798400704352
DOI: 10.1145/3627535
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. distributed deep learning
  2. parallelism

Qualifiers

  • Poster

Conference

PPoPP '24

Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions (23%)
