HBM Connect: High-Performance HLS Interconnect for FPGA HBM

Authors:

Young-kyu Choi,

Nikola Samardzic,

Jason Cong

FPGA '21: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Pages 116 - 126

https://doi.org/10.1145/3431920.3439301

Published: 17 February 2021

Abstract

With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bound applications to benefit from FPGA acceleration. However, fully utilizing the available bandwidth is not always easy. When an application requires multiple processing elements to access multiple HBM channels, we observe a significant drop in effective bandwidth. The existing high-level synthesis (HLS) programming environment has limitations in producing an efficient communication architecture. To solve this problem, we propose HBM Connect, a high-performance customized interconnect for FPGA HBM boards. Novel HLS-based optimization techniques are introduced to increase the throughput of AXI bus masters and switching elements. We also present a high-performance customized crossbar that may replace the built-in crossbar. The effectiveness of HBM Connect is demonstrated using Xilinx's Alveo U280 HBM board. Based on bucket sort and merge sort case studies, we explore several design spaces and find the design point with the best resource-performance trade-off. The results show that HBM Connect improves the resource-performance metrics by 6.5X-211X.
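The many-to-many access pattern described in the abstract can be illustrated with a small software model. The sketch below is not the authors' HLS design. It assumes a bucket rule that picks the destination channel from the most significant bits of each key, and the names `channel_of` and `route` are hypothetical. It only shows why, in a bucket-sort-style workload, every processing element (PE) may need a path to every HBM channel.

```python
from collections import defaultdict


def channel_of(key, key_bits, num_channels):
    """Pick a channel from the key's most significant bits.

    Assumption: num_channels is a power of two and keys fit in key_bits bits.
    """
    channel_bits = num_channels.bit_length() - 1
    return key >> (key_bits - channel_bits)


def route(pe_inputs, key_bits, num_channels):
    """Distribute each PE's keys to channels and count PE-to-channel traffic."""
    channels = [[] for _ in range(num_channels)]
    traffic = defaultdict(int)  # (pe, channel) -> number of keys sent
    for pe, keys in enumerate(pe_inputs):
        for key in keys:
            c = channel_of(key, key_bits, num_channels)
            channels[c].append(key)
            traffic[(pe, c)] += 1
    return channels, dict(traffic)


if __name__ == "__main__":
    # Two PEs, four channels, 4-bit keys (illustrative values only).
    channels, traffic = route([[0, 5, 10, 15], [3, 4, 12, 9]],
                              key_bits=4, num_channels=4)
    print(channels)
    print(sorted(traffic))
```

In this toy input, both PEs send keys to all four channels, so all eight PE-to-channel links carry traffic. An interconnect that serves this pattern has to sustain throughput on every such link at once, which is the situation the abstract identifies as the source of the bandwidth drop.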

Supplementary Material

MP4 File (3431920.3439301.mp4)

Twenty-minute video for "HBM Connect: High-Performance HLS Interconnect for FPGA HBM" at FPGA '21


Index Terms

HBM Connect: High-Performance HLS Interconnect for FPGA HBM
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. High-level language architectures

Published In

FPGA '21: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

February 2021

240 pages

ISBN:9781450382182

DOI:10.1145/3431920

General Chair: Lesley Shannon (Simon Fraser University, Canada)
Program Chair: Michael Adler (Intel, USA)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Intel Corporation and NSF (National Science Foundation)
NSF (National Science Foundation)

Conference

FPGA '21

Sponsor:

SIGDA

FPGA '21: The 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays

February 28 - March 2, 2021

Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%
