This repository contains four benchmark designs used in the NoH evaluation. Each benchmark has four configurations.
| Workload | FPGA | Memory | # Dice | Most-used resource |
|---|---|---|---|---|
| MM | VPK180 | DRAM | 4 | DSP |
| Jacobi3D | VPK180 | DRAM | 4 | DSP |
| KNN | VHK158 | HBM | 2 | LUT |
| SpMV | VHK158 | HBM | 2 | BRAM |
Each configuration provides two RTL folders. `RTL` is the default version generated by TAPA. `RTL-pipelined` is generated by AutoBridge and should achieve a higher frequency. AutoBridge creates a coarse-grained floorplan of the design to balance utilization and reduce congestion, and adds pipeline registers to long connections. `constraint.tcl` contains this floorplan, but its instance prefix (`top_arm_i/dut_0/`) must be changed to match the instance hierarchy in your project.
Note: AutoBridge dates from 2021. We recommend that interested users try Rapidstream, its latest optimized version, which is free for academic use.
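The instance-prefix change mentioned above can be scripted rather than done by hand. The following is a minimal sketch, not part of the repository: the helper name `retarget_prefix`, the sample constraint line, and the new prefix `design_1_i/noh_0/` are all hypothetical and only illustrate the substitution.

```shell
# Rewrite the instance prefix of a floorplan read from stdin.
# Assumption: both prefixes contain only letters, digits, '_' and '/',
# so they are safe to use inside a sed expression with '#' as delimiter.
retarget_prefix() {
    old_prefix=$1
    new_prefix=$2
    sed "s#${old_prefix}#${new_prefix}#g"
}

# Demo on a made-up constraint line; "design_1_i/noh_0/" is a placeholder
# for the hierarchy of your own project.
printf 'add_cells_to_pblock pblock_X0Y0 [get_cells top_arm_i/dut_0/example_inst]\n' |
    retarget_prefix top_arm_i/dut_0/ design_1_i/noh_0/
```

On a real file, something like `retarget_prefix top_arm_i/dut_0/ <your_prefix> < constraint.tcl > constraint_local.tcl` would write the result to a new file and leave the original untouched.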
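To import the IPs and RTL sources into a Vivado project, run the following Tcl, replacing `<rtl_folder>` with the RTL folder of the chosen configuration:

```tcl
# Import every .xci file found in the immediate subdirectories of $dir.
proc import_ips_from_dir {dir} {
    foreach file [glob -nocomplain -directory $dir *] {
        if {[file isdirectory $file]} {
            # A subdirectory may contain zero or several .xci files,
            # so import them one at a time.
            foreach ip_file [glob -nocomplain -directory $file *.xci] {
                puts "Importing IP: $ip_file"
                import_ip $ip_file
            }
        }
    }
}

import_ips_from_dir <rtl_folder>
import_files <rtl_folder>
```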
The repository is organized as follows:

```
root
└── benchmark
    └── configuration
        ├── tapa_src        # TAPA HLS
        ├── rtl
        ├── rtl_pipelined
        └── constraint.tcl  # coarse-grained placement
```
The MM designs are generated using AutoSA. The target device is `xcvp1802-lsvc4072-2MP-e-S` (VPK180 board). The target HLS frequency is 300 MHz. The top-level module is `kernel0`. We vary the systolic array width and height. The four configurations are:
- 18x16
- 18x17
- 18x18
- 18x19
Note: The pipelined versions all failed to route because AutoBridge could not handle this many nodes (>200). Its ILP-based algorithm did not converge within 7 hours, so the resulting floorplan is sub-optimal and has a very high number of inter-die crossings.
Change `array_part[0]` and `array_part[1]` to vary the width and height, respectively:

```shell
./autosa ./autosa_tests/large/mm/kernel.c \
    --config=./autosa_config/autosa_config.json \
    --target=autosa_tapa \
    --output-dir=./autosa.tmp/output \
    --sa-sizes="{kernel[]->space_time[3];kernel[]->array_part[144,128,64];kernel[]->latency[8,8];kernel[]->simd[32]}" \
    --data-pack-sizes="{kernel[]->cin[64,64,64];kernel[]->cout[64,64,64];kernel[]->w[64,64,64]}" \
    --simd-info=./autosa_tests/large/mm/simd_info.json \
    --host-serialize \
    --hls
```
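The command above uses `array_part[144,128,64]` with `latency[8,8]`, which matches the 18x16 configuration if each array dimension equals `array_part` divided by `latency` (144/8 = 18, 128/8 = 16). Assuming that relationship also holds for the other three configurations, the sketch below prints a candidate `--sa-sizes` value for each. The helper name `sa_sizes` is hypothetical, and the exact values used to produce the designs in this repository are not confirmed here.

```shell
# Print a --sa-sizes value for a given systolic array width and height.
# Assumption: array dimension = array_part / latency, with latency fixed at 8
# and the third array_part entry, space_time and simd left as in the command above.
sa_sizes() {
    width=$1
    height=$2
    latency=8
    printf '{kernel[]->space_time[3];kernel[]->array_part[%d,%d,64];kernel[]->latency[%d,%d];kernel[]->simd[32]}\n' \
        "$((width * latency))" "$((height * latency))" "$latency" "$latency"
}

# One line per configuration listed in this section.
for height in 16 17 18 19; do
    echo "18x${height}: $(sa_sizes 18 "$height")"
done
```

The printed string can be passed as `--sa-sizes="$(sa_sizes 18 17)"` in place of the literal value.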
The Jacobi3D designs are generated using SODA. The target device is `xcvp1802-lsvc4072-2MP-e-S` (VPK180 board). The target HLS frequency is 300 MHz. The top-level module is `jacobi3d_kernel`. We vary the number of iterations used to compute jacobi3d. The four configurations are:
- iter109
- iter115
- iter121
- iter124
The jacobi3d application has narrow connections between operations, only 512 bits wide. Therefore, the frequency difference between the baseline RTL and the pipelined version is minimal.
The kernel is generated with `sodac`:

```shell
sodac tests/src/jacobi3d.soda --xocl-kernel src/jacobi3d.cpp --xocl-interface tapa::mmap --frt-host src/jacobi3d_soda.host.cpp
```

Example `jacobi3d.soda` configuration:

```
kernel: jacobi3d
burst width: 512
unroll factor: 16
input dram 0 float: t1(16, 16, *)
output dram 1 float: t0(0, 0, 0) = (t1(0, 0, 0)
  + t1(1, 0, 0) + t1(-1, 0, 0)
  + t1(0, 1, 0) + t1( 0, -1, 0)
  + t1(0, 0, 1) + t1( 0, 0, -1)
) * 0.142857142f
iterate: 50
border: ignore
cluster: coarse
```
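The configuration names suggest that `iterN` corresponds to `iterate: N` in the `.soda` file. That mapping is an assumption, not something stated here. Under it, the iteration count can be changed without editing the file by hand. The helper name `set_iterate` is hypothetical.

```shell
# Replace the value on the "iterate:" line of a SODA description read from stdin.
# Assumption: the line starts with "iterate:" as in the example configuration.
set_iterate() {
    sed "s/^iterate:.*/iterate: $1/"
}

# Demo on a two-line excerpt of the example configuration.
printf 'unroll factor: 16\niterate: 50\n' | set_iterate 109
```

For a real run, `set_iterate 109 < tests/src/jacobi3d.soda > jacobi3d_iter109.soda` would produce a separate description that can then be passed to `sodac`. The output file name is only an example.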
The KNN designs are generated using CHIP-KNN. The target device is `xcvh1582-vsva3697-2MP-e-S` (VHK158 board). The target HLS frequency is 300 MHz. The top-level module is `Knn`. We vary the number of HBM ports manually. The four configurations are:
- knn27
- knn36
- knn45
- knn54
The SpMV designs are generated using Serpens. The target device is `xcvh1582-vsva3697-2MP-e-S` (VHK158 board). The target HLS frequency is 300 MHz. The top-level module is `Serpens`. We vary the number of HBM ports manually in `serpens.h` (`constexpr int NUM_CH_SPARSE = 56; //or, 32, 40, 48, 56`). The four configurations are:
- serpens32
- serpens40
- serpens48
- serpens56
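Since the only change between these configurations is the `NUM_CH_SPARSE` constant in `serpens.h`, the manual edit can be sketched as a small filter. The helper name `set_num_ch_sparse` is hypothetical, and the sketch assumes the declaration appears on one line in exactly the form quoted above.

```shell
# Rewrite the NUM_CH_SPARSE value in header text read from stdin.
# Assumption: the declaration is written as
#   constexpr int NUM_CH_SPARSE = <number>;
set_num_ch_sparse() {
    sed "s/\(constexpr int NUM_CH_SPARSE = \)[0-9][0-9]*;/\1$1;/"
}

# Demo: show the declaration for each of the four configurations.
for channels in 32 40 48 56; do
    printf 'constexpr int NUM_CH_SPARSE = 56; //or, 32, 40, 48, 56\n' |
        set_num_ch_sparse "$channels"
done
```

Only the number before the semicolon is replaced, so the trailing comment listing the allowed values is left as it is.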