- research-article, November 2024
Reducing Energy Bloat in Large Model Training
SOSP '24: Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, Pages 144–159, https://doi.org/10.1145/3694715.3695970
Training large AI models on numerous GPUs consumes a massive amount of energy, making power delivery one of the largest limiting factors in building and operating datacenters for AI workloads. However, we observe that not all energy consumed during ...
Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
- Hao Ge,
- Fangcheng Fu,
- Haoyang Li,
- Xuanyu Wang,
- Sheng Lin,
- Yujie Wang,
- Xiaonan Nie,
- Hailin Zhang,
- Xupeng Miao,
- Bin Cui
SOSP '24: Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, Pages 178–194, https://doi.org/10.1145/3694715.3695969
Training of large-scale deep learning models necessitates parallelizing the model and data across numerous devices, and the choice of parallelism strategy substantially depends on the training workloads such as memory consumption, computation cost, and ...
- research-article, November 2024
ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
SOSP '24: Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, Pages 211–228, https://doi.org/10.1145/3694715.3695960
Training large Deep Neural Network (DNN) models requires thousands of GPUs over the course of several days or weeks. At this scale, failures are frequent and can have a big impact on training throughput. Utilizing spare GPU servers to mitigate ...
- research-article, October 2024
NeutronCache: An Efficient Cache-Enhanced Distributed Graph Neural Network Training System
CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Pages 3310–3319, https://doi.org/10.1145/3627673.3679815
As real-world graph data continues to grow larger and larger, training large graphs in a distributed environment is becoming increasingly prevalent. However, network transmission in a distributed environment can hinder subsequent training steps, ...
- research-article, August 2024
MSPipe: Efficient Temporal GNN Training via Staleness-Aware Pipeline
KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Pages 2651–2662, https://doi.org/10.1145/3637528.3671844
Memory-based Temporal Graph Neural Networks (MTGNNs) are a class of temporal graph neural networks that utilize a node memory module to capture and retain long-term temporal dependencies, leading to superior performance compared to memory-less ...
- research-article, August 2024
Dissecting Convolutional Neural Networks for Runtime and Scalability Prediction
ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing, Pages 168–178, https://doi.org/10.1145/3673038.3673107
Given the computational complexity of deep neural networks (DNN), accurate prediction of their training and inference time using performance modeling is crucial for efficient infrastructure planning and DNN development. However, existing methods often ...
MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud
ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference, Pages 679–690, https://doi.org/10.1145/3651890.3672252
Performance of collective communication is critical for distributed systems. Using libraries to implement collective communication algorithms is not a good fit for a multi-tenant cloud environment because the tenant is not aware of the underlying ...
- research-article, August 2024
RDMA over Ethernet for Distributed Training at Meta Scale
- Adithya Gangidi,
- Rui Miao,
- Shengbao Zheng,
- Sai Jayesh Bondu,
- Guilherme Goes,
- Hany Morsy,
- Rohit Puri,
- Mohammad Riftadi,
- Ashmitha Jeevaraj Shetty,
- Jingyi Yang,
- Shuqiang Zhang,
- Mikel Jimenez Fernandez,
- Shashidhar Gandham,
- Hongyi Zeng
ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference, Pages 57–70, https://doi.org/10.1145/3651890.3672233
The rapid growth in both computational density and scale in AI models in recent years motivates the construction of an efficient and reliable dedicated network infrastructure. This paper presents the design, implementation, and operation of Meta's Remote ...
Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs
ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference, Pages 707–720, https://doi.org/10.1145/3651890.3672228
Rapid advances in machine learning necessitate significant computing power and memory for training, which is accessible only to large corporations today. Small-scale players like academics often only have consumer-grade GPU clusters locally and can ...
- research-article, November 2024
SC-GNN: A Communication-Efficient Semantic Compression for Distributed Training of GNNs
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 168, Pages 1–6, https://doi.org/10.1145/3649329.3657383
Training big graph neural networks (GNNs) in distributed systems is quite time-consuming mainly because of the ubiquitous aggregate operations that involve a large amount of cross-partition communication for collecting embeddings/gradients during the ...
- research-article, November 2024
ScaleFold: Reducing AlphaFold Initial Training Time to 10 Hours
- Feiwen Zhu,
- Arkadiusz Nowaczynski,
- Rundong Li,
- Jie Xin,
- Yifei Song,
- Michal Marcinkiewicz,
- Sukru Burc Eryilmaz,
- Jun Yang,
- Michael Andersch
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 265, Pages 1–6, https://doi.org/10.1145/3649329.3657326
AlphaFold2 has been hailed as a breakthrough in protein folding. It can rapidly predict protein structures with lab-grade accuracy. However, its training procedure is prohibitively time-consuming, and gets diminishing benefits from scaling to more ...
- research-article, June 2024
System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
- Sam Ade Jacobs,
- Masahiro Tanaka,
- Chengming Zhang,
- Minjia Zhang,
- Reza Yazdani Aminadabi,
- Shuaiwen Leon Song,
- Samyam Rajbhandari,
- Yuxiong He
PODC '24: Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, Pages 121–130, https://doi.org/10.1145/3662158.3662806
Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the first three ...
- abstract, June 2024
FedQV: Leveraging Quadratic Voting in Federated Learning
SIGMETRICS/PERFORMANCE '24: Abstracts of the 2024 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, Pages 91–92, https://doi.org/10.1145/3652963.3655055
Federated Learning (FL) permits different parties to collaboratively train a global model without disclosing their respective local labels. A crucial step of FL, that of aggregating local models to produce the global one, shares many similarities with ...
Also Published in:
ACM SIGMETRICS Performance Evaluation Review, Volume 52, Issue 1
- research-article, May 2024
FedQV: Leveraging Quadratic Voting in Federated Learning
Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 8, Issue 2, Article No.: 22, Pages 1–36, https://doi.org/10.1145/3656006
Federated Learning (FL) permits different parties to collaboratively train a global model without disclosing their respective local labels. A crucial step of FL, that of aggregating local models to produce the global one, shares many similarities with ...
- research-article, May 2024
InArt: In-Network Aggregation with Route Selection for Accelerating Distributed Training
WWW '24: Proceedings of the ACM Web Conference 2024, Pages 2879–2889, https://doi.org/10.1145/3589334.3645394
Deep learning has brought about a revolutionary transformation in network applications, particularly in domains like e-commerce and online advertising. Distributed training (DT), as a critical means to expedite model training, has progressively emerged ...
- research-article, April 2024
Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training
ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, Pages 1095–1111, https://doi.org/10.1145/3620665.3640399
Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice is struggling with the trade-off between usability and performance. On one hand, DL frameworks such as ...
- research-article, April 2024
Training Job Placement in Clusters with Statistical In-Network Aggregation
ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, Pages 420–434, https://doi.org/10.1145/3617232.3624863
In-Network Aggregation (INA) offloads the gradient aggregation in distributed training (DT) onto programmable switches, where the switch memory could be allocated to jobs in either synchronous or statistical multiplexing mode. Statistical INA has ...
- poster, February 2024
POSTER: ParGNN: Efficient Training for Large-Scale Graph Neural Network on GPU Clusters
- Shunde Li,
- Junyu Gu,
- Jue Wang,
- Tiechui Yao,
- Zhiqiang Liang,
- Yumeng Shi,
- Shigang Li,
- Weiting Xi,
- Shushen Li,
- Chunbao Zhou,
- Yangang Wang,
- Xuebin Chi
PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Pages 469–471, https://doi.org/10.1145/3627535.3638488
Full-batch graph neural network (GNN) training is essential for interdisciplinary applications. Large-scale graph data is usually divided into subgraphs and distributed across multiple compute units to train GNN. The state-of-the-art load balancing ...
- research-article, August 2024
Semantic Privacy-Preserving for Video Surveillance Services on the Edge
SEC '23: Proceedings of the Eighth ACM/IEEE Symposium on Edge Computing, Pages 300–305, https://doi.org/10.1145/3583740.3626820
Intelligent video surveillance systems, leveraging edge computing, have become increasingly prevalent in various facilities, providing advanced monitoring and management capabilities. However, these systems can inadvertently compromise personally ...
- research-article, December 2023
Lightweight Workloads in Heterogeneous Federated Learning via Few-shot Learning
DistributedML '23: Proceedings of the 4th International Workshop on Distributed Machine Learning, Pages 21–26, https://doi.org/10.1145/3630048.3630185
With a growing variety of devices capable of generating data and due to their data holding diverse characteristics, hardware heterogeneity and data heterogeneity have become pressing issues in Federated Learning (FL). Many studies aim to address the ...