Research article · DOI: 10.1145/3698038.3698510

On-demand and Parallel Checkpoint/Restore for GPU Applications

Published: 20 November 2024 Publication History

Abstract

Leveraging serverless computing for cloud-based machine learning services is on the rise, promising the cost-efficiency and flexibility that are crucial for ML applications relying on high-performance GPUs and substantial memory. However, although modern serverless platforms handle diverse devices such as GPUs seamlessly on a pay-as-you-go basis, a longstanding challenge remains: startup latency, an issue well studied when serverless was CPU-centric. For example, initializing GPU applications with small models such as MobileNet takes several seconds. For more intricate models such as GPT-2, startup latency can escalate to around 10 seconds, vastly overshadowing the short computation time of GPU-based inference. Prior solutions tailored to CPU serverless setups, such as fork() and Checkpoint/Restore, cannot be directly and effectively applied due to differences between CPUs and GPUs.
This paper presents gCROP (GPU Checkpoint/Restore made On-demand and Parallel), the first GPU runtime that achieves sub-100ms startup latency for GPU applications with up to 774 million parameters (the 3.1GB GPT-2-Large model). The key insight behind gCROP is to selectively restore essential state on demand and in parallel during boot from a prepared checkpoint image. To this end, gCROP first introduces a global service, the GPU Restore Server, which breaks the existing barrier between restore stages and achieves parallel restore. In addition, gCROP leverages both CPU and GPU page faults to restore CPU and GPU data on demand, in a profile-guided order that mitigates the cost of fault handling. Moreover, gCROP designs a multi-checkpoint mechanism that increases the content shared among checkpoint images and uses deduplication to reduce storage costs. An implementation and evaluation on AMD GPUs show significant improvements in startup latency: 6.4x-24.7x over booting from scratch and 3.9x-23.5x over the state-of-the-art method (CRIU).
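The multi-checkpoint deduplication described above can be pictured as a content-addressed chunk store, where checkpoint images that share contents also share stored chunks. The sketch below is illustrative only: the chunk size, the hash function, and the in-memory layout are assumptions for the example, not gCROP's actual design.

```python
import hashlib

# Dedup granularity is an illustrative assumption; the abstract does not
# specify gCROP's actual chunk size.
CHUNK_SIZE = 64 * 1024


class ChunkStore:
    """Content-addressed store: a chunk shared by several checkpoint
    images is stored only once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk payload

    def add_image(self, image: bytes):
        """Split an image into fixed-size chunks and return the ordered
        list of digests (the 'recipe') needed to reconstruct it."""
        recipe = []
        for off in range(0, len(image), CHUNK_SIZE):
            chunk = image[off:off + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # store payload only once
            recipe.append(digest)
        return recipe

    def restore(self, recipe):
        """Reassemble a full image from its recipe."""
        return b"".join(self.chunks[d] for d in recipe)
```

With this layout, two checkpoint images that differ only in a small tail share all their common chunks, so total storage grows with the unique content rather than with the number of images.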



Published In

SoCC '24: Proceedings of the 2024 ACM Symposium on Cloud Computing
November 2024, 1062 pages
ISBN: 9798400712869
DOI: 10.1145/3698038

Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. Checkpoint and Restore
    2. Cloud Computing
    3. GPUs
    4. Startup Latency

Conference

SoCC '24: ACM Symposium on Cloud Computing
November 20-22, 2024, Redmond, WA, USA
Overall Acceptance Rate: 169 of 722 submissions, 23%
