DOI: 10.1145/3649329.3656507

FLAME: Fully Leveraging MoE Sparsity for Transformer on FPGA

Published: 07 November 2024

Abstract

The MoE (Mixture-of-Experts) mechanism has been widely adopted in transformer-based models to enable further expansion of model parameter size and to enhance generalization capabilities. However, practical deployment of MoE-based transformers on resource-constrained platforms such as FPGAs remains challenging due to the heavy memory footprint and impractical runtime costs that the MoE mechanism introduces. Diving into the MoE mechanism, we make two key observations: (1) expert weights are heavy but cold, making it ideal to leverage expert weight sparsity; (2) expert activation paths in the MoE layers of transformer-based models are highly skewed, making it feasible to conduct expert prediction and prefetching. Motivated by these two observations, we propose FLAME, the first algorithm-hardware co-optimized MoE acceleration framework designed to fully leverage MoE sparsity for efficient transformer deployment on FPGA. First, to leverage expert weight sparsity, we integrate an N:M pruning algorithm that prunes expert weights without significantly compromising model accuracy. Second, to exploit expert activation sparsity, we propose a circular expert prediction (CEPR) strategy that prefetches expert weights from external storage into the on-chip cache before the activated expert index is determined. Last, we co-optimize both forms of MoE sparsity by introducing an efficient pruning-aware expert buffering (PA-BUF) mechanism. Experimental results demonstrate that FLAME achieves 84.4% expert prediction accuracy with merely two on-chip expert caches. Compared with CPU and GPU, FLAME achieves 4.12× and 1.49× speedup, respectively.
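
To make the N:M pruning idea concrete, the following is a minimal NumPy sketch of magnitude-based N:M pruning (keep the N largest-magnitude weights in every group of M). The group size, the magnitude criterion, and the absence of any fine-tuning step here are illustrative assumptions, not the paper's exact pruning procedure.

```python
# Illustrative sketch of N:M magnitude pruning (assumed details, not FLAME's exact algorithm):
# within every group of m consecutive weights, keep the n largest-magnitude entries and zero the rest.
import numpy as np

def nm_prune(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Zero all but the n largest-magnitude values in every group of m weights."""
    flat = weights.reshape(-1, m)                              # group weights in chunks of m
    drop_idx = np.argsort(np.abs(flat), axis=1)[:, : m - n]    # (m - n) smallest magnitudes per group
    pruned = flat.copy()
    np.put_along_axis(pruned, drop_idx, 0.0, axis=1)           # zero the dropped positions
    return pruned.reshape(weights.shape)

# Example: a toy expert weight matrix whose size is a multiple of m.
expert_w = np.random.randn(8, 16).astype(np.float32)
sparse_w = nm_prune(expert_w, n=2, m=4)
assert (np.count_nonzero(sparse_w.reshape(-1, 4), axis=1) <= 2).all()
```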

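The expert prediction and prefetching idea can likewise be illustrated with a toy software model: a two-entry expert cache plus a simple most-frequent-successor predictor over a skewed activation trace. The predictor and the eviction policy below are hypothetical stand-ins, not the paper's actual CEPR or PA-BUF logic; the sketch only shows why highly skewed activation paths let a tiny on-chip expert cache achieve a high hit rate.

```python
# Toy model of expert prediction + prefetching with a two-entry expert cache.
# The most-frequent-successor predictor and FIFO eviction are assumptions for illustration.
from collections import defaultdict

def simulate_prefetch(activation_trace, num_cached=2):
    """Return the fraction of expert activations whose weights were already cached."""
    succ_counts = defaultdict(lambda: defaultdict(int))   # expert -> {next expert: count}
    cache, prev, hits = [], None, 0
    for expert in activation_trace:
        if expert in cache:
            hits += 1                        # prefetch hit: weights already on chip
        else:
            cache.append(expert)             # miss: fetch from external memory
            if len(cache) > num_cached:
                cache.pop(0)                 # evict the oldest cached expert
        if prev is not None:
            succ_counts[prev][expert] += 1   # record the observed activation path
        # predict the most frequent successor and prefetch it before the router decides
        if succ_counts[expert]:
            predicted = max(succ_counts[expert], key=succ_counts[expert].get)
            if predicted not in cache:
                cache.append(predicted)
                if len(cache) > num_cached:
                    cache.pop(0)
        prev = expert
    return hits / len(activation_trace)

# A skewed trace (expert 0 dominates) keeps the hit rate high with only two cache slots.
print(simulate_prefetch([0, 1, 0, 0, 2, 0, 0, 1, 0, 0] * 50))
```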


Information & Contributors

Information

Published In

DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference
June 2024
2159 pages
ISBN: 9798400706011
DOI: 10.1145/3649329
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Conference

DAC '24: 61st ACM/IEEE Design Automation Conference
June 23-27, 2024
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

