- research-article, November 2024
Reducing Energy Bloat in Large Model Training
SOSP '24: Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, Pages 144–159, https://doi.org/10.1145/3694715.3695970
Training large AI models on numerous GPUs consumes a massive amount of energy, making power delivery one of the largest limiting factors in building and operating datacenters for AI workloads. However, we observe that not all energy consumed during ...
Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
- Hao Ge,
- Fangcheng Fu,
- Haoyang Li,
- Xuanyu Wang,
- Sheng Lin,
- Yujie Wang,
- Xiaonan Nie,
- Hailin Zhang,
- Xupeng Miao,
- Bin Cui
SOSP '24: Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, Pages 178–194, https://doi.org/10.1145/3694715.3695969
Training of large-scale deep learning models necessitates parallelizing the model and data across numerous devices, and the choice of parallelism strategy substantially depends on the training workloads such as memory consumption, computation cost, and ...
- research-article, November 2024
ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
SOSP '24: Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, Pages 211–228, https://doi.org/10.1145/3694715.3695960
Training large Deep Neural Network (DNN) models requires thousands of GPUs over the course of several days or weeks. At this scale, failures are frequent and can have a big impact on training throughput. Utilizing spare GPU servers to mitigate ...
- research-article, October 2024
NeutronCache: An Efficient Cache-Enhanced Distributed Graph Neural Network Training System
CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Pages 3310–3319, https://doi.org/10.1145/3627673.3679815
As real-world graph data continues to grow larger and larger, training large graphs in a distributed environment is becoming increasingly prevalent. However, network transmission in a distributed environment can hinder subsequent training steps, ...
- research-article, August 2024
MSPipe: Efficient Temporal GNN Training via Staleness-Aware Pipeline
KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Pages 2651–2662, https://doi.org/10.1145/3637528.3671844
Memory-based Temporal Graph Neural Networks (MTGNNs) are a class of temporal graph neural networks that utilize a node memory module to capture and retain long-term temporal dependencies, leading to superior performance compared to memory-less ...
- research-article, August 2024
Dissecting Convolutional Neural Networks for Runtime and Scalability Prediction
ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing, Pages 168–178, https://doi.org/10.1145/3673038.3673107
Given the computational complexity of deep neural networks (DNN), accurate prediction of their training and inference time using performance modeling is crucial for efficient infrastructure planning and DNN development. However, existing methods often ...
MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud
ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference, Pages 679–690, https://doi.org/10.1145/3651890.3672252
Performance of collective communication is critical for distributed systems. Using libraries to implement collective communication algorithms is not a good fit for a multi-tenant cloud environment because the tenant is not aware of the underlying ...
- research-article, August 2024
RDMA over Ethernet for Distributed Training at Meta Scale
- Adithya Gangidi,
- Rui Miao,
- Shengbao Zheng,
- Sai Jayesh Bondu,
- Guilherme Goes,
- Hany Morsy,
- Rohit Puri,
- Mohammad Riftadi,
- Ashmitha Jeevaraj Shetty,
- Jingyi Yang,
- Shuqiang Zhang,
- Mikel Jimenez Fernandez,
- Shashidhar Gandham,
- Hongyi Zeng
ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference, Pages 57–70, https://doi.org/10.1145/3651890.3672233
The rapid growth in both computational density and scale in AI models in recent years motivates the construction of an efficient and reliable dedicated network infrastructure. This paper presents the design, implementation, and operation of Meta's Remote ...
Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs
ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference, Pages 707–720, https://doi.org/10.1145/3651890.3672228
Rapid advances in machine learning necessitate significant computing power and memory for training, which is accessible only to large corporations today. Small-scale players like academics often only have consumer-grade GPU clusters locally and can ...
- research-article, November 2024
SC-GNN: A Communication-Efficient Semantic Compression for Distributed Training of GNNs
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 168, Pages 1–6, https://doi.org/10.1145/3649329.3657383
Training big graph neural networks (GNNs) in distributed systems is quite time-consuming mainly because of the ubiquitous aggregate operations that involve a large amount of cross-partition communication for collecting embeddings/gradients during the ...
- research-article, November 2024
ScaleFold: Reducing AlphaFold Initial Training Time to 10 Hours
- Feiwen Zhu,
- Arkadiusz Nowaczynski,
- Rundong Li,
- Jie Xin,
- Yifei Song,
- Michal Marcinkiewicz,
- Sukru Burc Eryilmaz,
- Jun Yang,
- Michael Andersch
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 265, Pages 1–6, https://doi.org/10.1145/3649329.3657326
AlphaFold2 has been hailed as a breakthrough in protein folding. It can rapidly predict protein structures with lab-grade accuracy. However, its training procedure is prohibitively time-consuming, and gets diminishing benefits from scaling to more ...
- research-article, June 2024
System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
- Sam Ade Jacobs,
- Masahiro Tanaka,
- Chengming Zhang,
- Minjia Zhang,
- Reza Yazdani Aminadabi,
- Shuaiwen Leon Song,
- Samyam Rajbhandari,
- Yuxiong He
PODC '24: Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, Pages 121–130, https://doi.org/10.1145/3662158.3662806
Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the first three ...
- abstract, June 2024
FedQV: Leveraging Quadratic Voting in Federated Learning
SIGMETRICS/PERFORMANCE '24: Abstracts of the 2024 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, Pages 91–92, https://doi.org/10.1145/3652963.3655055
Federated Learning (FL) permits different parties to collaboratively train a global model without disclosing their respective local labels. A crucial step of FL, that of aggregating local models to produce the global one, shares many similarities with ...
Also Published in:
ACM SIGMETRICS Performance Evaluation Review, Volume 52, Issue 1
- research-article, May 2024
FedQV: Leveraging Quadratic Voting in Federated Learning
Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 8, Issue 2, Article No.: 22, Pages 1–36, https://doi.org/10.1145/3656006
Federated Learning (FL) permits different parties to collaboratively train a global model without disclosing their respective local labels. A crucial step of FL, that of aggregating local models to produce the global one, shares many similarities with ...
- research-article, May 2024
InArt: In-Network Aggregation with Route Selection for Accelerating Distributed Training
WWW '24: Proceedings of the ACM Web Conference 2024, Pages 2879–2889, https://doi.org/10.1145/3589334.3645394
Deep learning has brought about a revolutionary transformation in network applications, particularly in domains like e-commerce and online advertising. Distributed training (DT), as a critical means to expedite model training, has progressively emerged ...
- research-article, April 2024
Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training
ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, Pages 1095–1111, https://doi.org/10.1145/3620665.3640399
Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice is struggling with the trade-off between usability and performance. On one hand, DL frameworks such as ...
- research-article, April 2024
Training Job Placement in Clusters with Statistical In-Network Aggregation
ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, Pages 420–434, https://doi.org/10.1145/3617232.3624863
In-Network Aggregation (INA) offloads the gradient aggregation in distributed training (DT) onto programmable switches, where the switch memory could be allocated to jobs in either synchronous or statistical multiplexing mode. Statistical INA has ...
- poster, February 2024
POSTER: ParGNN: Efficient Training for Large-Scale Graph Neural Network on GPU Clusters
- Shunde Li,
- Junyu Gu,
- Jue Wang,
- Tiechui Yao,
- Zhiqiang Liang,
- Yumeng Shi,
- Shigang Li,
- Weiting Xi,
- Shushen Li,
- Chunbao Zhou,
- Yangang Wang,
- Xuebin Chi
PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Pages 469–471, https://doi.org/10.1145/3627535.3638488
Full-batch graph neural network (GNN) training is essential for interdisciplinary applications. Large-scale graph data is usually divided into subgraphs and distributed across multiple compute units to train GNN. The state-of-the-art load balancing ...
- research-article, August 2024
Semantic Privacy-Preserving for Video Surveillance Services on the Edge
SEC '23: Proceedings of the Eighth ACM/IEEE Symposium on Edge Computing, Pages 300–305, https://doi.org/10.1145/3583740.3626820
Intelligent video surveillance systems, leveraging edge computing, have become increasingly prevalent in various facilities, providing advanced monitoring and management capabilities. However, these systems can inadvertently compromise personally ...
- research-article, December 2023
Lightweight Workloads in Heterogeneous Federated Learning via Few-shot Learning
DistributedML '23: Proceedings of the 4th International Workshop on Distributed Machine Learning, Pages 21–26, https://doi.org/10.1145/3630048.3630185
With a growing variety of devices capable of generating data and due to their data holding diverse characteristics, hardware heterogeneity and data heterogeneity have become pressing issues in Federated Learning (FL). Many studies aim to address the ...