jake-ke/NoH-benchmarks

NoH Benchmarks

This repository contains four benchmark designs used in the NoH evaluation. Each benchmark has four configurations.

Workload   FPGA     Memory   # Dice   Most-used resource
MM         VPK180   DRAM     4        DSP
Jacobi3D   VPK180   DRAM     4        DSP
KNN        VHK158   HBM      2        LUT
SpMV       VHK158   HBM      2        BRAM

Each configuration has two RTL folders: rtl, the default version generated by TAPA, and rtl_pipelined, generated by AutoBridge, which should achieve a higher frequency.

AutoBridge creates a coarse-grained floorplan of the design to balance utilization and reduce congestion, and adds pipeline registers to long connections. constraint.tcl contains this floorplan, but its instance prefix (top_arm_i/dut_0/) must be changed to match the instance hierarchy in your project.
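Adapting the prefix can be scripted. The following is a minimal Python sketch and is not part of the repository: retarget_prefix is a hypothetical helper that swaps the shipped prefix top_arm_i/dut_0/ for the hierarchy of your own project. The sample constraint line and the my_bd_i/noh_0 prefix are made up for illustration, and the real constraint.tcl may look different.

```python
def retarget_prefix(tcl_text, new_prefix, old_prefix="top_arm_i/dut_0/"):
    """Return tcl_text with every occurrence of old_prefix replaced by new_prefix."""
    # Normalise so the new prefix always ends with exactly one slash.
    new_prefix = new_prefix.rstrip("/") + "/"
    return tcl_text.replace(old_prefix, new_prefix)


# Hypothetical constraint line; real constraint.tcl contents may look different.
sample = "add_cells_to_pblock [get_pblocks pb0] [get_cells top_arm_i/dut_0/inst_a]"
print(retarget_prefix(sample, "my_bd_i/noh_0"))
```

To apply it to a file, read constraint.tcl as text, pass the text through the helper, and write the result to a new file so that the original floorplan stays untouched.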

Note: AutoBridge dates from 2021. We recommend that interested users try Rapidstream, the latest optimized version, which is free for academic use.

Example Tcl to Import RTL into Vivado

proc import_ips_from_dir {dir} {
    # Import every .xci file found in the immediate subdirectories of $dir
    foreach subdir [glob -nocomplain -directory $dir -type d *] {
        # A subdirectory may hold zero or several .xci files; handle each one
        foreach ip_file [glob -nocomplain -directory $subdir *.xci] {
            puts "Importing IP: $ip_file"
            import_ip $ip_file
        }
    }
}

# Replace <rtl_folder> with the path to the rtl or rtl_pipelined folder
import_ips_from_dir <rtl_folder>
import_files <rtl_folder>

Directory Tree of Each Benchmark

root
└── benchmark
    └── configuration
        ├── tapa_src  # TAPA HLS
        ├── rtl
        ├── rtl_pipelined
        └── constraint.tcl  # coarse-grained placement
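
Before importing a configuration, it can help to confirm that the folder has the layout shown in the tree. The following is a minimal Python sketch under that assumption; missing_entries is a hypothetical helper, not something shipped with the repository.

```python
from pathlib import Path

# Entries listed in the directory tree for one configuration.
EXPECTED_ENTRIES = ("tapa_src", "rtl", "rtl_pipelined", "constraint.tcl")


def missing_entries(config_dir):
    """Return the expected entries that are absent from one configuration folder."""
    config_dir = Path(config_dir)
    return [name for name in EXPECTED_ENTRIES if not (config_dir / name).exists()]
```

An empty list means the configuration folder contains all four entries.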

Systolic-array Matrix-multiply (mm) Accelerators

Generated using AutoSA. The targeted device is xcvp1802-lsvc4072-2MP-e-S (VPK180 board). The target HLS frequency is 300 MHz. The top-level module is kernel0. We vary the systolic array width and height. The four configurations are:

  • 18x16
  • 18x17
  • 18x18
  • 18x19

Note: All pipelined versions failed to route because AutoBridge could not handle this many nodes (>200). Its ILP-based algorithm did not converge within 7 hours, so the resulting floorplan is sub-optimal, with a very high number of inter-die crossings.

Example AutoSA Command

Change array_part[0] and array_part[1] to vary the width and height respectively:

./autosa ./autosa_tests/large/mm/kernel.c \
--config=./autosa_config/autosa_config.json \
--target=autosa_tapa \
--output-dir=./autosa.tmp/output \
--sa-sizes="{kernel[]->space_time[3];kernel[]->array_part[144,128,64];kernel[]->latency[8,8];kernel[]->simd[32]}" \
--data-pack-sizes="{kernel[]->cin[64,64,64];kernel[]->cout[64,64,64];kernel[]->w[64,64,64]}" \
--simd-info=./autosa_tests/large/mm/simd_info.json \
--host-serialize \
--hls
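
In the example command, array_part is [144,128,64] and latency is [8,8]. Dividing the first two array_part entries by the latency entries gives 18 and 16, which matches the 18x16 configuration. Assuming the same relation holds for the other configurations (an inference from this single example, not something the repository states), the --sa-sizes strings could be generated as in the Python sketch below. sa_sizes_for and its parameter names are hypothetical.

```python
def sa_sizes_for(width, height, latency=(8, 8), k_tile=64, simd=32):
    """Build a --sa-sizes string, assuming array dimension = array_part / latency."""
    part0 = width * latency[0]   # assumed: array_part[0] = width * latency[0]
    part1 = height * latency[1]  # assumed: array_part[1] = height * latency[1]
    return (
        "{kernel[]->space_time[3];"
        f"kernel[]->array_part[{part0},{part1},{k_tile}];"
        f"kernel[]->latency[{latency[0]},{latency[1]}];"
        f"kernel[]->simd[{simd}]}}"
    )


for width, height in [(18, 16), (18, 17), (18, 18), (18, 19)]:
    print(f"{width}x{height}: {sa_sizes_for(width, height)}")
```

For 18x16 this reproduces the string used in the example command. The values for the other three configurations are only as reliable as the assumed relation.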

Stencil (jacobi3d) Accelerators

Generated using SODA. The targeted device is xcvp1802-lsvc4072-2MP-e-S (VPK180 board). The target HLS frequency is 300 MHz. The top-level module is jacobi3d_kernel. We vary the number of iterations to compute jacobi3d. The four configurations are:

  • iter109
  • iter115
  • iter121
  • iter124

The jacobi3d application has narrow connections between operations, only 512 bits wide. Therefore, the frequency difference between the baseline RTL and the pipelined version is minimal.

Example SODA Command

sodac tests/src/jacobi3d.soda --xocl-kernel src/jacobi3d.cpp --xocl-interface tapa::mmap --frt-host src/jacobi3d_soda.host.cpp

Example jacobi3d.soda configuration:

kernel: jacobi3d
burst width: 512
unroll factor: 16
input dram 0 float: t1(16, 16, *)
output dram 1 float: t0(0, 0, 0) = (t1(0, 0, 0)
    + t1(1, 0, 0) + t1(-1,  0,  0)
    + t1(0, 1, 0) + t1( 0, -1,  0)
    + t1(0, 0, 1) + t1( 0,  0, -1)
    ) * 0.142857142f
iterate: 50
border: ignore
cluster: coarse
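
To make the stencil concrete: each output point is the sum of a point and its six axis neighbours scaled by 0.142857142 (approximately 1/7), and a 512-bit burst of 32-bit floats carries 16 values, the same as the unroll factor. Below is a minimal pure-Python reference for one iteration, meant only for checking results on small grids. The function name, the nested-list layout, and the choice to copy border points unchanged are assumptions of this sketch, not SODA's defined behaviour for border: ignore.

```python
def jacobi3d_step(t1):
    """Apply one 7-point Jacobi iteration to a 3D nested list t1[x][y][z]."""
    nx, ny, nz = len(t1), len(t1[0]), len(t1[0][0])
    # Assumption: border points are copied unchanged.
    t0 = [[[t1[x][y][z] for z in range(nz)] for y in range(ny)] for x in range(nx)]
    for x in range(1, nx - 1):
        for y in range(1, ny - 1):
            for z in range(1, nz - 1):
                total = (
                    t1[x][y][z]
                    + t1[x + 1][y][z] + t1[x - 1][y][z]
                    + t1[x][y + 1][z] + t1[x][y - 1][z]
                    + t1[x][y][z + 1] + t1[x][y][z - 1]
                )
                t0[x][y][z] = total * 0.142857142
    return t0


# Floats per 512-bit burst, assuming 32-bit floats.
FLOATS_PER_BURST = 512 // 32
```

Calling jacobi3d_step repeatedly on its own output mimics the iterate setting in the configuration.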

KNN (knn) Accelerators

Generated using CHIP-KNN. The targeted device is xcvh1582-vsva3697-2MP-e-S (VHK158 board). The target HLS frequency is 300 MHz. The top-level module is Knn. We vary the number of HBM ports manually. The four configurations are:

  • knn27
  • knn36
  • knn45
  • knn54

Sparse Matrix-vector Multiplication (spmv) Accelerators

Generated using Serpens. The targeted device is xcvh1582-vsva3697-2MP-e-S (VHK158 board). The target HLS frequency is 300 MHz. The top-level module is Serpens. We vary the number of HBM ports manually in serpens.h (constexpr int NUM_CH_SPARSE = 56; //or, 32, 40, 48, 56 ). The four configurations are:

  • serpens32
  • serpens40
  • serpens48
  • serpens56
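
The manual edit of serpens.h can also be scripted. The following is a minimal Python sketch, assuming the header contains exactly one definition of the form constexpr int NUM_CH_SPARSE = <number>; and that the number in each configuration name is the channel count. set_num_ch_sparse and CONFIG_CHANNELS are hypothetical names, not part of Serpens.

```python
import re

# Channel counts taken from the configuration names serpens32 ... serpens56.
CONFIG_CHANNELS = {"serpens32": 32, "serpens40": 40, "serpens48": 48, "serpens56": 56}


def set_num_ch_sparse(header_text, channels):
    """Return header_text with the NUM_CH_SPARSE value replaced by channels."""
    pattern = r"(constexpr\s+int\s+NUM_CH_SPARSE\s*=\s*)\d+"
    new_text, count = re.subn(pattern, rf"\g<1>{channels}", header_text)
    if count != 1:
        # Fail loudly instead of silently leaving the header unchanged.
        raise ValueError(f"expected one NUM_CH_SPARSE definition, found {count}")
    return new_text
```

Only the numeric value is rewritten; any trailing comment on the line is left as it is.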
