DOI: 10.1145/3431920.3439301
research-article
Open access

HBM Connect: High-Performance HLS Interconnect for FPGA HBM

Published: 17 February 2021

Abstract

With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bound applications to benefit from FPGA acceleration. However, fully utilizing the available bandwidth may not be an easy task. If an application requires multiple processing elements to access multiple HBM channels, we observe a significant drop in the effective bandwidth. The existing high-level synthesis (HLS) programming environment is limited in its ability to produce an efficient communication architecture. To solve this problem, we propose HBM Connect, a high-performance customized interconnect for FPGA HBM boards. Novel HLS-based optimization techniques are introduced to increase the throughput of AXI bus masters and switching elements. We also present a high-performance customized crossbar that may replace the built-in crossbar. The effectiveness of HBM Connect is demonstrated using Xilinx's Alveo U280 HBM board. Based on bucket sort and merge sort case studies, we explore several design spaces and find the design point with the best resource-performance trade-off. The results show that HBM Connect improves the resource-performance metrics by 6.5X-211X.

Supplementary Material

MP4 File (3431920.3439301.mp4)
Twenty-minute video for "HBM Connect: High-Performance HLS Interconnect for FPGA HBM" at FPGA '21

Cited By

  • (2024) Cuper: Customized Dataflow and Perceptual Decoding for Sparse Matrix-Vector Multiplication on HBM-Equipped FPGAs. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1-6. DOI: 10.23919/DATE58400.2024.10546672. Online publication date: 25-Mar-2024
  • (2024) ScalaBFS2: A High-performance BFS Accelerator on an HBM-enhanced FPGA Chip. ACM Transactions on Reconfigurable Technology and Systems 17:2, 1-39. DOI: 10.1145/3650037. Online publication date: 29-Feb-2024
  • (2024) Enhancing Graph Random Walk Acceleration via Efficient Dataflow and Hybrid Memory Architecture. IEEE Transactions on Computers 73:3, 887-901. DOI: 10.1109/TC.2023.3347674. Online publication date: Mar-2024

    Published In

    FPGA '21: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
    February 2021
    240 pages
    ISBN:9781450382182
    DOI:10.1145/3431920
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. field-programmable gate array
    2. high bandwidth memory
    3. high-level synthesis
    4. on-chip network
    5. performance optimization

    Qualifiers

    • Research-article

    Conference

    FPGA '21

    Acceptance Rates

    Overall Acceptance Rate 125 of 627 submissions, 20%
