Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2407.04656 (cs)

[Submitted on 5 Jul 2024]

Title: Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement

Authors: Yongji Wu, Wenjie Qu, Tianyang Tao, Zhuang Wang, Wei Bai, Zhuohao Li, Yuan Tian, Jiaheng Zhang, Matthew Lentz, Danyang Zhuo


Abstract: The sparsely activated Mixture-of-Experts (MoE) architecture has increasingly been adopted to further scale large language models (LLMs) due to its sub-linear scaling of computation costs. However, frequent failures still pose significant challenges as training scales. The cost of even a single failure is significant: all GPUs must sit idle until the failure is resolved, and considerable training progress may be lost because training has to restart from a checkpoint. Existing solutions for efficient fault-tolerant training either lack elasticity or rely on building resiliency into pipeline parallelism, which cannot be applied to MoE models due to the expert parallelism strategy adopted by the MoE architecture.
We present Lazarus, a system for resilient and elastic training of MoE models. Lazarus adaptively allocates expert replicas to address the inherent imbalance in expert workload and speed up training, and it uses a provably optimal expert placement algorithm to maximize the probability of recovery upon failures. Through adaptive expert placement and a flexible token dispatcher, Lazarus can also fully utilize all available nodes after failures, leaving no GPU idle. Our evaluation shows that Lazarus outperforms existing MoE training systems by up to 5.7x under frequent node failures and 3.4x on a real spot instance trace.
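
The abstract does not spell out how replicas are allocated or placed, so the following is only an illustrative sketch of the two ideas it names, not the paper's algorithm. It assumes a simple greedy rule (give the next replica slot to the expert with the highest load per replica) and assumes that a failure is recoverable whenever every expert still has at least one replica on a surviving node. The function names `allocate_replicas` and `recovery_probability` and the example numbers are hypothetical.

```python
from itertools import combinations


def allocate_replicas(loads, total_slots):
    """Assumed greedy rule: one replica per expert, then give each
    remaining slot to the expert with the highest load per replica."""
    n = len(loads)
    if total_slots < n:
        raise ValueError("need at least one slot per expert")
    replicas = [1] * n
    for _ in range(total_slots - n):
        busiest = max(range(n), key=lambda e: loads[e] / replicas[e])
        replicas[busiest] += 1
    return replicas


def recoverable(placement, failed, num_experts):
    """Assumed criterion: every expert survives on some live node.
    `placement` is a list of sets of expert ids, one set per node."""
    alive = set()
    for node, experts in enumerate(placement):
        if node not in failed:
            alive |= experts
    return len(alive) == num_experts


def recovery_probability(placement, num_experts, num_failures):
    """Fraction of equally likely failure sets of the given size
    that leave every expert with at least one surviving replica."""
    cases = list(combinations(range(len(placement)), num_failures))
    ok = sum(recoverable(placement, set(c), num_experts) for c in cases)
    return ok / len(cases)


if __name__ == "__main__":
    # Hypothetical token loads for four experts and eight replica slots.
    print(allocate_replicas([60, 20, 10, 10], 8))
    # Two hypothetical placements of two experts on four single-slot nodes.
    skewed = [{0}, {0}, {0}, {1}]
    balanced = [{0}, {0}, {1}, {1}]
    print(recovery_probability(skewed, 2, 1))
    print(recovery_probability(balanced, 2, 1))
```

The sketch makes the tension visible: replicating a hot expert more often helps throughput, but a rarely replicated expert becomes a single point of failure, which is why where replicas are placed affects the chance of recovery.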

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2407.04656 [cs.DC]
	(or arXiv:2407.04656v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2407.04656

Submission history

From: Yongji Wu
[v1] Fri, 5 Jul 2024 17:13:41 UTC (783 KB)
