Author: Lau, Jason : Search

research-article

Open Access

CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 17, Issue 3, Article No.: 51, Pages 1–31, https://doi.org/10.1145/3686163

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have ...

research-article

Open Access

TAPA: A Scalable Task-parallel Dataflow Programming Framework for Modern FPGAs with Co-optimization of HLS and Physical Design

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 16, Issue 4, Article No.: 63, Pages 1–31, https://doi.org/10.1145/3609335

In this article, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient ...

research-article

Open Access

RapidStream 2.0: Automated Parallel Implementation of Latency–Insensitive FPGA Designs Through Partial Reconfiguration

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 16, Issue 4, Article No.: 59, Pages 1–30, https://doi.org/10.1145/3593025

Field-programmable gate arrays (FPGAs) require a much longer compilation cycle than conventional computing platforms such as CPUs. In this article, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end ...

research-article

TARO: Automatic Optimization for Free-Running Kernels in FPGA High-Level Synthesis

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCADICS), Volume 42, Issue 7, Pages 2423–2427, https://doi.org/10.1109/TCAD.2022.3216544

Streaming applications have become one of the key application domains for high-level synthesis (HLS) tools. For a streaming application, there is a potential to simplify the control logic by regulating each task with a stream of input and output data. ...

research-article

Open Access

CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture

FPGA '23: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Pages 153–164, https://doi.org/10.1145/3543622.3573210

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have ...

research-article

Open Access

FPGA HLS Today: Successes, Challenges, and Opportunities

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 15, Issue 4, Article No.: 51, Pages 1–42, https://doi.org/10.1145/3530775

The year 2011 marked an important transition for FPGA high-level synthesis (HLS), as it went from prototyping to deployment. A decade later, in this article, we assess the progress of the deployment of HLS technology and highlight the successes in several ...

research-article

Open Access

Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication

FPGA '22: Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Pages 65–77, https://doi.org/10.1145/3490422.3502357

Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of applications including scientific computing, graph processing, and deep learning. Architecting accelerators for SpMM is faced with three challenges - (1) the random ...

research-article

Open Access

Best Paper

AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs

FPGA '21: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Pages 81–92, https://doi.org/10.1145/3431920.3439289

Despite an increasing adoption of high-level synthesis (HLS) for its design productivity advantages, there remains a significant gap in the achievable clock frequency between an HLS-generated design and a handcrafted RTL one. A key factor that limits ...

research-article

Analysis and optimization of the implicit broadcasts in FPGA HLS to improve maximum frequency

DAC '20: Proceedings of the 57th ACM/EDAC/IEEE Design Automation Conference, Article No.: 35, Pages 1–6

Designs generated by high-level synthesis (HLS) tools typically achieve a lower frequency compared to manual RTL designs. In this work, we study the timing issues in a diverse set of realistic and complex FPGA HLS designs. (1) We observe that in almost ...

research-article

Open Access

HeteroRefactor: refactoring for heterogeneous computing with FPGA

ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, Pages 493–505, https://doi.org/10.1145/3377811.3380340

Heterogeneous computing with field-programmable gate-arrays (FPGAs) has demonstrated orders of magnitude improvement in computing efficiency for many applications. However, the use of such platforms so far is limited to a small subset of programmers with ...

poster

Analysis and Optimization of the Implicit Broadcasts in FPGA HLS to Improve Maximum Frequency

FPGA '20: Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Page 311, https://doi.org/10.1145/3373087.3375332

Designs generated by high-level synthesis (HLS) tools typically achieve a lower frequency compared to manual RTL designs. We study the timing issues in a diverse set of nine realistic HLS designs and observe that in most cases the frequency degradation ...
