1 Introduction
Field Programmable Gate Arrays (FPGAs) are reconfigurable devices that can be programmed using a
hardware description language (HDL), such as Verilog or VHDL, to implement specialized accelerators. FPGAs have become a key component of heterogeneous high-performance computing systems, often coupled to other types of accelerators, such as GPUs, to accelerate data-intensive tasks or parts of applications that require low latency processing.
High-level synthesis (HLS) tools simplify the process of developing FPGA-based accelerators by automatically translating software descriptions (typically C or C++ code) into HDL. These tools promise to hide hardware-specific details from the user while still obtaining high
quality of results (QoR) in terms of latency, area usage, and clock frequency [
42].
However, existing HLS tools mainly target regular, arithmetic-intensive workloads operating on dense data structures typical of
digital signal processing (DSP) applications, which present an abundance of instruction-level parallelism. Solutions that generate accelerators from parallel high-level descriptions leverage data-parallel programming models that rely on regular computation and easily partitionable data sets (e.g., OpenCL), as they mainly replicate identical computational units and do not support synchronization-intensive programs well [
14]. Driven by the interest in big data and graph analytics, several approaches attempt to support the acceleration of parallel applications with irregular behaviour. These include solutions that target specific graph algorithms, approaches that try to optimize linear algebra operations (e.g., sparse matrix-vector and sparse matrix-matrix multiplication [
17]), and frameworks based on the
Gather-Apply-Scatter (GAS) paradigm [
33]. These solutions are highly specialized: they either focus on specific graph algorithms, or they use HLS only to generate and optimize computational engines that must be supported by a software framework; the actual execution of graph algorithms mapped onto such accelerators may thus have a higher computational complexity than vertex- or edge-centered formulations.
The OpenMP programming model for shared memory systems has often been successfully used to design parallel irregular algorithms [
7,
36,
38] thanks to its task-level and loop-level parallel constructs. Research on HLS previously explored the use of OpenMP annotations to synthesize applications with coarse-grained task-level parallelism [
15,
47]; however, these approaches employ multi-level designs driven by a microcontroller and cannot manage complex synchronization constructs. Furthermore, they still focus on generating parallel accelerators from programs with regular behaviours. The
Svelto methodology [
39] uses OpenMP annotations as a starting point to synthesize parallel multi-threaded accelerators, targeting the generation of accelerators from specific patterns of parallel graph algorithms. In this article, we propose a more general methodology for the
Synthesis of PARallel multi-threaded Accelerators (SPARTA), providing automated translation of parallel specifications written in OpenMP into efficient parallel accelerators.
SPARTA adopts a fundamentally different and more flexible approach than previous methodologies generating accelerators from OpenMP code. Instead of mapping OpenMP parallel regions marked by compiler directives (pragmas) to a pre-defined parallel architecture template, it translates pragmas into calls to OpenMP runtime primitives and implements corresponding low-level hardware components to be used in the synthesis backend. Accelerators generated with SPARTA can exploit both spatial and temporal parallelism, hiding the latency of external memory accesses through context switching (CS) across multiple execution threads. Moreover, SPARTA can instantiate an advanced hierarchical memory subsystem composed of a custom Network-on-Chip (NoC) that connects multiple external memory channels to each accelerator, memory-side caching, and on-chip private memories for each accelerator. These features allow SPARTA to deal with both regular and irregular behaviours in parallel algorithms and to generate efficient accelerators in terms of performance per area.
We evaluated SPARTA by synthesizing typical parallel graph algorithms, verifying the scalability and the QoR of the generated designs as the numbers of parallel accelerators and contexts change. We also compared accelerators for graph database queries generated with SPARTA to the ones generated with Svelto, obtaining, on average, a 2.29\(\times\) improvement in performance with similar architectural parameters (number of accelerators, contexts, and memory channels).
In summary, this article makes the following contributions:
—
A synthesis methodology to generate multi-threaded parallel accelerators (SPARTA) supporting OpenMP work-sharing and synchronization constructs through the implementation of low-level primitives from the OpenMP runtime system;
—
Support for both spatial and temporal parallelism, with multiple threads that can pause and resume execution on a core when needed;
—
A highly configurable memory system that mitigates the latency of external memory accesses.
The rest of the article is organized as follows:
Section 2 reviews related work on the synthesis of FPGA accelerators from parallel programming languages.
Section 3 describes the SPARTA synthesis flow and architecture.
Section 4 presents experimental results, and
Section 5 concludes the article.
3 SPARTA Synthesis Flow
The SPARTA design flow automatically generates parallel multi-threaded accelerators starting from OpenMP code through an enhanced HLS flow that we implemented within the open-source Bambu tool [
19] (
Figure 1). The input of the SPARTA synthesis methodology is a C/C++ description of a parallel application annotated with OpenMP directives, such as the one shown in
Figure 2(a). SPARTA includes a frontend derived from Clang, which parses the OpenMP annotations and performs initial HLS optimizations. OpenMP directives can be used to define parallel regions and shared data access policies, with many different work-sharing constructs (e.g.,
parallel for,
sections) and synchronization mechanisms [
12]. It is also possible to pass the LLVM IR exposing the OpenMP runtime calls directly to our enhanced Bambu flow, allowing SPARTA to process specifications generated through other frontends that support OpenMP (e.g., Flang, or domain-specific languages that leverage the OpenMP MLIR dialect).
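For illustration, the following is a minimal sketch (not the exact code of Figure 2(a)) of the kind of annotated C input that the flow accepts, with a parallel for directive distributing the iterations of a loop across a team of threads:

#include <stddef.h>

#define M 8   /* number of OpenMP threads requested for the parallel region */

void vector_add(const int *A, const int *B, int *res, size_t n)
{
    /* Work-sharing construct: the iterations are split among the M threads. */
    #pragma omp parallel for num_threads(M) schedule(static)
    for (size_t i = 0; i < n; ++i)
        res[i] = A[i] + B[i];
}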
A conventional compiler flow for parallel architectures translates an OpenMP-annotated description into a parallel application through several transformation passes. We will illustrate the transformations performed within the Clang LLVM compiler since we employ it as a frontend for the SPARTA methodology starting from C/C++ code. However, as previously highlighted, our approach is adaptable to other compiler infrastructures that support an OpenMP runtime library [
35]. Using a conventional OpenMP parallelization flow through Clang/LLVM greatly simplifies introducing support for new constructs that might be added to the OpenMP standard.
We can split the parallelization process into three phases: pre-processing of compiler directives, code transformations, and parallel API library linking. Pre-processing of compiler directives allows gathering information on how the input program should be parallelized, i.e., which code sections can be executed in parallel, how to distribute work across threads, data allocation policies, and more. After pre-processing, the compiler generates an intermediate representation (IR), instruments it with calls to OpenMP runtime functions, and links the OpenMP library, connecting each function call to its target-specific implementation.
In the SPARTA synthesis flow, the Clang-based Bambu frontend compiles the input description following the same steps, except that in the final phase, it maps calls to the OpenMP runtime onto corresponding calls to our custom HLS library. The SPARTA library in Bambu contains optimized Verilog primitives to handle synchronization, concurrency, and memory operations in a parallel architecture, functionally equivalent to the OpenMP runtime library primitives. Starting from the LLVM OpenMP runtime library, we streamlined and reduced it to its essential functionality to generate small and fast components suitable for synthesis on FPGAs. For example, we removed the logic that supports the reuse of threads, because a hardware thread cannot be reassigned to another core, which would require a different datapath and controller. OpenMP constructs that can be replaced with arithmetic and logic operations (e.g., those that compute the range of loop iterations assigned to a thread) are implemented with standard functional units.
Figure 2 shows how the SPARTA frontend translates OpenMP-annotated code (
Figure 2(a)) into an IR with calls to the SPARTA library functions (
Figure 2(b)). The code in
Figure 2(a) denotes a parallel region with
M threads where each thread executes an iteration of a for loop. The IR in
Figure 2(b) shows how the transformations outline the loop body into a separate function (
omp_outlined) and modify the OpenMP runtime library calls. The new function signature contains all the parameters needed to execute an iteration of the loop: the thread ID and references to the arrays
A,
B, and
res. The function body is instrumented with API calls to the SPARTA Bambu library functions that retrieve the thread ID and synchronize with other threads.
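As a purely illustrative sketch (the actual output in Figure 2(b) is LLVM IR, and the real SPARTA/Bambu library symbols differ), the outlined function has roughly the following shape, with placeholder names standing in for the runtime calls that are later mapped onto the HLS library:

/* Placeholder prototypes standing in for SPARTA's HLS-library primitives;
 * the actual symbols and signatures are internal to the Bambu frontend.   */
int  sparta_get_thread_id(int global_tid);
void sparta_barrier(int global_tid);

/* Shape of the outlined parallel-region body: the signature carries the
 * thread ID and references to the shared arrays A, B, and res.           */
static void omp_outlined(int global_tid, const int *A, const int *B, int *res)
{
    int tid = sparta_get_thread_id(global_tid); /* retrieve the thread ID        */
    res[tid] = A[tid] + B[tid];                 /* the iteration of this thread  */
    sparta_barrier(global_tid);                 /* synchronize with the team     */
}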
The instrumented IR containing the API calls that control the parallel execution is subsequently optimized with hardware-oriented code transformations that facilitate the generation of the parallel architecture in the HLS backend.
3.1 The SPARTA Architecture
SPARTA generates a modular architecture that follows the content of the input OpenMP specification and can be customized through user-defined parameters. The main components are depicted in
Figure 3: sequential parts of the computation go through the standard HLS flow, while parallel constructs are translated into a set of parallel processing units called
cores. Each core contains a
finite state machine with datapath (FSMD) implementing the computation described within an OpenMP parallel region (e.g., the loop body in an
omp parallel for loop), a CS manager in case multiple threads are assigned to the same core, and local memory for data belonging to those threads. The number of cores and the number of threads (contexts) assigned to each core are configurable. The parallel region of the accelerator can also contain synchronization components, arbiters, and data shared among threads if needed. The accelerator is connected to a NoC that relays messages to an external memory (with support for multi-banked or pipelined memories); the number of channels between the accelerator and the NoC, as well as between the NoC and the external memory, is configurable. The user encodes information about parallelism, work-sharing policies, and synchronization requirements through the OpenMP annotations in the code and configures the rest of the architecture through command-line options.
3.1.1 Spatial and Temporal Multithreading.
Cores in a SPARTA accelerator execute concurrently, using the FPGA resources to exploit coarse-grained parallelism; instruction-level parallelism is extracted by the HLS tool during the implementation of the FSMD within each core. SPARTA accelerators can also support temporal multithreading by interleaving the execution of multiple threads on the same core: instead of stalling execution and leaving the core underutilized when a thread needs to perform long-latency operations (typically external memory accesses or synchronization, which might take tens of clock cycles), it is possible to perform a CS and execute a different thread on the same datapath, e.g., a different iteration of the same omp for loop.
Each core with CS enabled features a CS manager module to handle the active context and switch to a new one when a memory or synchronization operation starts. For example, on a read/write request to the external memory, the CS manager saves the active context (i.e., the thread state) and starts the execution of a different context in the next clock cycle. Once the core receives the memory response, the CS manager flags the previously paused context as ready and eventually resumes its execution when the currently active context is paused or completed. The CS operation has no overhead: in the clock cycle before the switch, the core executes normally, and at the start of the next clock cycle, the new context restarts from the state where it should have been after it was suspended. The CS manager always performs a switch when the active context starts a wait synchronization operation. This avoids deadlocks when the awaited thread executes on the same core as the waiting thread, as the CS manager can flag the waiting thread as ready when the response from the awaited thread is received. The CS manager also has a set of registers for each context to store the signals received while threads are not ready. When responses arrive at a core, the thread in execution is generally not the one that performed the request: the value is saved and forwarded to the actual thread originating the request when its execution is resumed. In this way, the CS functionality is entirely transparent to the rest of the accelerator and does not force other components to send signals multiple times or stall their execution.
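The following is a behavioural software sketch (not the generated RTL, and with assumed field names) of the scheduling decisions the CS manager makes on a long-latency request and on the corresponding response:

#include <stdbool.h>
#include <stdint.h>

#define NUM_CONTEXTS 4

typedef struct {
    bool     ready;        /* context can be scheduled on the datapath      */
    uint64_t saved_value;  /* response received while the context is paused */
} context_t;

typedef struct {
    context_t ctx[NUM_CONTEXTS];
    int       active;      /* context currently owning the datapath         */
} cs_manager_t;

/* The active context issues a memory or synchronization request: pause it
 * and hand the datapath to another ready context (lowest ID first).        */
static void cs_on_request(cs_manager_t *m)
{
    m->ctx[m->active].ready = false;
    for (int c = 0; c < NUM_CONTEXTS; ++c)
        if (m->ctx[c].ready) { m->active = c; return; }
    /* no ready context: the core idles until a response arrives */
}

/* A response tagged for context `ctx` arrives: buffer the value and flag
 * the paused context as ready; it resumes when the datapath is free again. */
static void cs_on_response(cs_manager_t *m, int ctx, uint64_t value)
{
    m->ctx[ctx].saved_value = value;
    m->ctx[ctx].ready = true;
}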
In the core datapath, registers store intermediate signals between combinational regions; when many contexts share the same core, it is necessary to store a signal for each context. Each context thus needs its own set of registers and a control signal to select the active set of input and output registers; the new register set with the context selector control signal is commonly referred to as a register file. Since register files can consume considerably more resources than standard registers, it is fundamental to avoid them when possible. In the SPARTA architecture, CS can happen only during memory accesses or synchronization calls, which are known at compile time: the synthesis flow exploits this information to replace registers with register files only within control states where CS can happen, while other registers are left unchanged. As an example, consider a core executing a floating-point add operation: two values are loaded from memory, the sum of the two values is computed, and the result is stored back in memory. Load and store operations can trigger a CS to hide the memory access latency; thus, registers connected to load and store operations in this core must be replaced with register files to save signal states and later restore them. Instead, the floating-point sum can never be interrupted by a CS, and thus none of the registers present in the floating-point adder logic needs to be replaced with a register file.
3.2 Thread Synchronization
Three synchronization modules are key components of the SPARTA architecture: the barrier manager, the reduction manager, and the critical manager. They are only included in the parallel region if they are needed, and their size is proportional to the number of cores that use them.
Barriers are added by the OpenMP runtime at the end of a parallel region to ensure that the execution can proceed safely if the threads have different execution times. In the OpenMP paradigm, when a thread encounters a barrier, it notifies all other threads in the same team that it has reached the barrier and waits until all other threads also signal that they have reached the end of their execution. To implement barrier synchronization, the SPARTA architecture includes local components for each core that generate completion signals and a centralized barrier manager that collects completion signals, updates a synchronization register, and, when all required threads have reached the barrier, sends them a command to resume execution. Sharing a unique barrier manager among all parallel cores ensures that synchronization happens safely without the possibility of race conditions. Furthermore, the same barrier manager component can also be used to implement the wait_thread OpenMP operation, which waits for the completion of one specific thread without adding more logic or duplicating the synchronization state.
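A compact software model of the centralized barrier manager described above, assuming (for illustration) one bit per thread in the synchronization register, could look as follows:

#include <stdbool.h>
#include <stdint.h>

#define NUM_THREADS 8

typedef struct {
    uint32_t arrived;   /* bit i is set once thread i has reached the barrier */
} barrier_manager_t;

/* Collect a completion signal; returns true when all threads have arrived,
 * i.e., when the manager broadcasts the command to resume execution.        */
static bool barrier_arrive(barrier_manager_t *b, int thread_id)
{
    b->arrived |= 1u << thread_id;
    if (b->arrived == (1u << NUM_THREADS) - 1u) {
        b->arrived = 0;   /* reset the synchronization register for the next barrier */
        return true;
    }
    return false;         /* the calling core may context switch while it waits */
}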
In OpenMP reduction operations, the thread with ID 0 needs to retrieve partial reduction variables from all other threads and combine them. Each thread on a SPARTA core participating in a reduction communicates with the reduction manager to indicate the address of its reduction variable in its local memory; thread 0 then accesses the reduction manager to retrieve the addresses of all partial results it needs to merge. In fact, as will be explained in
Section 3.3, a thread can access the local memory of another thread if it knows the correct address. Such a scheme could have also been implemented by storing the addresses of partial reductions in the external shared memory but with the considerable latency overhead of additional memory requests. Instead, the SPARTA reduction manager implemented inside the accelerator allows one-cycle access to the information required to perform the reduction.
We chose to implement the reduction following a linear pattern: after thread 0 finishes its part of the computation, it collects all the values computed by the others and merges the results. Other reduction schemes may result in a different tradeoff between area occupation and speed. For example, using more threads to merge the results is possible at the cost of duplicating more logic. However, a linear reduction operation often requires only five or six cycles per merged thread, which is negligible compared with the duration of the computation. For this reason, when multiple contexts can be activated, the policy of the CS manager is to choose the thread with the lowest thread ID: this gives priority to thread 0, which is the one that could perform a reduction operation, and then to the other threads in order since the reduction proceeds following the thread IDs.
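The linear scheme can be summarized by the following software sketch (illustrative names; in hardware the address table lives in the reduction manager and is accessible in one cycle):

#define NUM_THREADS 8

typedef struct {
    const int *partial_addr[NUM_THREADS]; /* local-memory address of each thread's partial result */
} reduction_manager_t;

/* Every thread registers the address of its reduction variable. */
static void reduction_register(reduction_manager_t *r, int tid, const int *addr)
{
    r->partial_addr[tid] = addr;
}

/* Executed by thread 0 after finishing its own share of the computation:
 * walk the table and merge the partial results read from the other
 * threads' local memories (reachable through the local bus, Section 3.3). */
static int reduction_merge(const reduction_manager_t *r, int own_partial)
{
    int sum = own_partial;
    for (int t = 1; t < NUM_THREADS; ++t)
        sum += *r->partial_addr[t];
    return sum;
}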
In the OpenMP paradigm, the critical pragma is used to identify a set of instructions that can only be executed by a single thread at once. SPARTA adds a critical manager that behaves like a mutex and a local component for each thread that encounters the critical section. When a thread reaches the critical section, the local component sends a signal to request control of the lock, and the critical manager grants it only if it is not currently assigned to another thread. If the lock is not acquired by the current thread, the CS manager performs a CS. The critical manager stores the state of the requests and the current lock owner and, like the CS manager, prioritizes the thread with the lowest ID when multiple threads require access. When the thread finishes the execution of the critical section, it sends a signal to release the acquired lock to avoid deadlock and allow other threads to resume their execution.
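A behavioural sketch of the critical manager (again, a software model with assumed names rather than the generated hardware) is the following:

#include <stdbool.h>
#include <stdint.h>

#define NUM_THREADS 8

typedef struct {
    uint32_t pending;   /* bit i set while thread i is requesting the lock */
    int      owner;     /* -1 when the lock is free                        */
} critical_manager_t;

/* Request the lock; returns true if granted. If not granted, the caller's
 * core performs a context switch and retries when the lock is released.   */
static bool critical_acquire(critical_manager_t *c, int tid)
{
    c->pending |= 1u << tid;
    if (c->owner == -1)
        for (int t = 0; t < NUM_THREADS; ++t)        /* lowest ID first */
            if (c->pending & (1u << t)) { c->owner = t; break; }
    return c->owner == tid;
}

/* Release the lock at the end of the critical section so that another
 * pending thread can be granted access.                                   */
static void critical_release(critical_manager_t *c, int tid)
{
    c->pending &= ~(1u << tid);
    c->owner = -1;
}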
3.3 Memory Hierarchy
SPARTA exploits a hierarchical memory model that allows reducing the number of requests to the external memory and keeping data inside the accelerator as much as possible. There are three possibilities to store variables, visible in
Figure 3: external memory outside the accelerator (high-latency memory),
block RAM (BRAM) modules in the FPGA fabric inside the accelerator that are accessible by all cores, and private BRAM modules inside parallel cores. External memory is usually suitable for input and output parameters that cannot fit into on-chip BRAMs, while variables that are defined and used during the computation are better allocated inside the accelerator, whether in shared or private memory blocks, to lower access latency and improve performance.
The various levels of the memory hierarchy are connected through an ad hoc bus infrastructure, depicted in
Figure 4, that can handle multiple concurrent memory requests. The local bus has one channel and connects cores within the same parallel region, while the main accelerator bus connects parallel regions with the on-chip shared memory and with the NoC toward the external memory. Each memory request originating from one of the parallel cores goes through a filter; the filter interprets its address and properly routes the request to internal BRAMs of other cores through the local bus or outside the parallel region through the main bus. Since BRAMs have two channels, local memories can support concurrent access from both the owner core and the local bus. SPARTA can connect the computation units directly to the local memory without using the local bus if it can infer at compile time the address of the requests and the variable is allocated inside the same core.
SPARTA memory addresses encode the physical location of the memory module where data is allocated;
Figure 5 shows the structure of external and internal memory addresses, which have the same length because they travel on the same local bus. The size of the address depends on the external memory size (which is likely considerably larger than internal memory) plus one bit that is added to differentiate between internal and external addresses. This allows a very simple decoding logic: if the external bit is equal to 1, the request is propagated to the external memory. For internal memory, a dedicated pagination algorithm allocates each memory variable considering the parallel region and thread it belongs to. The address of internal variables is composed of three parts, representing the address of the variable in the memory module, the thread that owns the variable, and the parallel region in which the memory module is allocated, respectively. To build the address, SPARTA first finds the largest internal memory module and computes the number of bits needed to represent it, so this number of bits can be reserved to uniquely identify a variable inside each memory module. Similarly, the size of the second part of the address is set by looking at the largest number of threads assigned to the different parallel regions. Finally, a unique code is assigned to each parallel region, which is used by the filters to determine whether an address is local to the parallel region. Looking at the appropriate bits of the address, filters and arbiters in the memory interconnection system can properly forward a memory request to a different bus or component using combinatorial logic without adding delays to the requests. For example, if the parallel region ID in an address matches the one stored in the filter, the request is forwarded to the local bus, where the arbiter will find the correct core by looking at the thread ID. Instead, if the parallel region ID does not match or if the address has the external bit set to one, the request is sent to the main bus.
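The address layout and the filter decision can be illustrated with the following sketch, where the field widths are example values (the synthesis flow derives them from the largest local memory module, the largest thread team, and the number of parallel regions):

#include <stdint.h>

#define OFFSET_BITS 10   /* enough bits for the largest internal memory module */
#define THREAD_BITS  3   /* enough bits for the largest thread team            */
#define REGION_BITS  2   /* one code per parallel region                       */
#define EXTERNAL_BIT (1u << (REGION_BITS + THREAD_BITS + OFFSET_BITS))

/* Build an internal address: | external=0 | region | thread | offset | */
static uint32_t make_internal_addr(uint32_t region, uint32_t thread, uint32_t offset)
{
    return (region << (THREAD_BITS + OFFSET_BITS)) | (thread << OFFSET_BITS) | offset;
}

/* Filter decision: requests with the external bit set, or whose region ID
 * does not match the local one, go to the main accelerator bus; the rest
 * stay on the local bus of the parallel region.                            */
static int goes_to_local_bus(uint32_t addr, uint32_t my_region)
{
    if (addr & EXTERNAL_BIT)
        return 0;                                    /* external memory, via main bus and NoC */
    uint32_t region = (addr >> (THREAD_BITS + OFFSET_BITS)) & ((1u << REGION_BITS) - 1u);
    return region == my_region;
}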
If two requests arrive in the same cycle, the bus arbiter can only accept one of them, and the other must be stalled; in this case, SPARTA generates a back pressure signal between arbiters and cores to resend requests that could not be accepted in the current cycle. With only one context in execution on the core, execution has to be stalled as it needs the result of the memory operation to proceed, but when multiple contexts are assigned to the same core, another context can be executed in the meantime. A First-In-First-Out (FIFO) queue holds requests before the memory filter so that the CS manager can switch the running context at the start of each memory request instead of when an arbiter accepts it; this prevents the effects of the arbitration from having repercussions on the core execution, as the FIFO is responsible for resending the requests to the arbiter while the core can continue its operation with another context.
3.4 External Memory Interface
The parallel accelerator generated by SPARTA is connected to an external memory module with a fixed number of memory channels. Thus, memory requests need arbitration as they can originate from multiple sources (i.e., the parallel cores) and be directed to multiple destinations. Instead of using dedicated hardware modules or another arbiter, we introduced a custom NoC interconnect that allows maintaining a short clock period. The NoC interconnect is composed of two unidirectional NoC components (
Figure 6), one for the requests from the cores to the memory and the other for the opposite path.
The proposed NoC consists of a group of interconnected nodes following a butterfly structure, organized in \(n\) levels where each level has a number of nodes equal to \(2^{n}\), for a total number of nodes equal to \(n*2^{n}\). The number of nodes in the NoC must be at least equal to the maximum between the number of banks in the external memory and the number of cores in the parallel region (destination and source of the NoC, respectively). Each node in the NoC can be configured to be registered or combinatorial, providing a tradeoff between the frequency of the NoC and the number of cycles needed to traverse it: in fact, combinatorial nodes produce output values in the same cycle, increasing the critical path, while registered nodes introduce a delay. Nodes connected to external components are called I/O nodes and must be registered to avoid latches; for the rest of the nodes, the choice to use combinatorial nodes or have only registered ones is left to the user.
The number of cycles needed to traverse the NoC is equal to the number of registered nodes on the path of the request and, at most, is equal to the number of levels. To minimize the average travel time inside the NoC, SPARTA organizes the registered nodes by filling entire levels before moving to the next one. For example, consider a NoC connecting 16 memory banks to 16 parallel cores: the number of levels must be 3 (as \(2*2^{2}=8 < 16\)), but if we choose to use combinatorial nodes, it can be traversed in 2 cycles, as only the 16 I/O nodes must be registered and they can be arranged in the first two levels. The NoC is resilient to congestion: if two requests want to go to the same node, there is no need to stall, as the node can redirect one of them on another route to the same destination. The NoC can thus accept multiple requests: each node has two internal input and two internal output connections, and an I/O node can take a request if one of its internal inputs is not active in the current cycle. Registered nodes can also store a request, while combinatorial nodes forward requests immediately and do not store information. If both internal input connections are active, the I/O node of the NoC sends a back pressure signal to resend the request in the next cycle. When using multiple contexts, back pressure signals do not interfere with the execution of the cores, as they are intercepted by the FIFO modules, which take the role of resending the requests, as described in
Section 3.3.
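A small helper reproducing the sizing rule discussed above, assuming the stated butterfly organization (n levels of 2^n nodes each), is shown below:

#include <stdio.h>

/* Smallest n such that a butterfly with n levels of 2^n nodes
 * (n * 2^n nodes in total) covers max(#banks, #cores) endpoints. */
static unsigned noc_levels(unsigned banks, unsigned cores)
{
    unsigned need = banks > cores ? banks : cores;
    unsigned n = 1;
    while (n * (1u << n) < need)
        ++n;
    return n;
}

int main(void)
{
    /* 16 banks and 16 cores: 2 levels give 2*4 = 8 < 16 nodes, so 3 levels
     * are needed; with only the 16 I/O nodes registered, a request can
     * traverse the NoC in 2 cycles, as stated above.                       */
    printf("levels = %u\n", noc_levels(16, 16));   /* prints 3 */
    return 0;
}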
The SPARTA architecture allows the accelerator to connect to different types of external memory, with simple or pipelined memory access, and for both unified and banked multi-channel memories. The accelerator does not assume that the memory is immediately ready to accept a request and uses a protocol with back pressure that can be used to synchronize with it. Furthermore, SPARTA cores work under the assumption that answers from the memory could come out of order, as traffic in the NoC or the external memory could change the relative order of requests. Each request is tagged with an ID used to identify the thread that performed it and correctly forward the answer. If the external memory is banked, it is possible to add one cache in front of each bank to further reduce the average latency of the memory accesses. (There is no need to introduce a cache consistency mechanism, as each cache has its own set of addresses associated with a different memory bank.)
We introduced a set of custom pragmas and parameters to specify information about the accelerator interfaces for variables that will be stored in external memory and model their behaviour during the simulation of the generated accelerator (
Table 3). Specific variables can be allocated to specific banks to mimic access patterns of real memories. For example,
#pragma HLS_interface a banked bank_number = 16 bank_chunk = 256 specifies that the
a variable needs to be stored in a multi-banked external memory that has 16 banks of 256 bytes each.
SPARTA accelerators, by default, exchange data with other accelerators and the external memory using a simple and efficient protocol called
minimal memory interface, described in
Table 2. The protocol requires a limited number of signals and logic elements to perform memory operations, reducing the size of internal components (particularly the NoC). However, SPARTA also supports more complex protocols such as AXI4 [
4] by adding a bridge component before each memory bank that translates each request and response from the minimal protocol to AXI4. The translation is completely transparent to the accelerator, correctly propagating back pressure signals from and to the external memory.
3.5 Hardware-Oriented Transformations
We describe here two key transformations of the SPARTA methodology that reduce the impact of parallelization on resource utilization: thread ID propagation and parallel API pointer promotion.
Thread ID propagation. SPARTA generates an array of cores by replicating the datapaths of omp_outlined functions corresponding to the content of an OpenMP parallel region. While simply replicating the whole datapath provides a functional design, considering which parts are effectively used at runtime can reduce resource utilization. Because thread IDs are statically assigned to each core, our approach can use them to determine the resources of the datapath that each thread actually uses. In fact, in the OpenMP work-sharing model, it is possible to differentiate the computation assigned to threads in the same group, but only depending on the thread ID. Bambu features several code analyses and transformations that SPARTA can exploit to statically propagate thread IDs through the replicated datapaths and to automatically perform dead code elimination, leading to the generation of the minimum set of resources required for each of the cores. Static thread ID propagation also avoids the replication of sections of the datapath only used by one thread, such as resources that execute reduction operations.
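For example, in a fragment like the following (illustrative code), once the thread ID is propagated as a compile-time constant into each replicated core, dead code elimination keeps the reduction-merge logic only in the core hosting thread 0 and removes it from all the others:

/* Only the core whose (statically known) thread ID is 0 retains the body
 * of the if-branch in its datapath; in every other core the branch is
 * removed by dead code elimination after thread ID propagation.           */
static int merge_partials(int tid, const int *partial, int nthreads)
{
    int result = 0;
    if (tid == 0) {
        for (int t = 0; t < nthreads; ++t)
            result += partial[t];   /* reduction merge, instantiated once */
    }
    return result;
}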
Pointer promotion. The OpenMP API makes extensive use of pointers and multiple levels of indirection to exchange references to internal data structures and parameters between threads; such a programming model is generic and versatile, but it introduces overhead for memory accesses and pointer resolution. In fact, memory operations on FPGA typically need to access external memories and are thus significantly more expensive in terms of latency and energy with respect to arithmetic operations. Whenever possible, SPARTA leverages static resolution of multiple indirections to reduce the number of memory operations. Because the input application and the architectural parameters (e.g., number of contexts per core and memory architecture) are fixed, the SPARTA frontend can apply function versioning for each call to functions of the parallel API. After versioning, each function represents an instance of a hardware component, and static analysis can be performed on the IR to infer dynamically allocated data structures and memory accesses: simple and multiple indirections are solved with pointer promotion, and statically allocated data structures enable further optimizations during the HLS flow. Once parallel API versioning is completed, it is also possible to perform a more detailed alias analysis on pointers to understand what memory structures are accessed by each operation. In this way, it is possible to pinpoint shared memory variables whose location will be shared at runtime between threads and allocate them in shared BRAMs while the remaining private memory variables are allocated in the local memories.
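The effect of versioning and pointer promotion can be sketched as follows (illustrative code; the generic form mirrors the pointer-heavy calling convention of the OpenMP runtime, while the versioned form receives the promoted references directly):

#include <stdint.h>

/* Generic runtime-style convention: arguments packed behind void pointers,
 * forcing extra loads to resolve the indirections at run time.             */
static int load_generic(void **args)
{
    const int *A = *(const int **)args[0];  /* first indirection  */
    int        i = *(const int  *)args[1];  /* second indirection */
    return A[i];                            /* the actual access  */
}

/* After function versioning and static pointer resolution: the indirections
 * disappear and only the data access itself remains as a memory operation.  */
static int load_promoted(const int *A, int i)
{
    return A[i];
}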
4 Experimental Evaluation
We showcase the results that can be achieved with SPARTA through four experiments: the first consists of synthesizing three graph processing benchmarks to show speedup and scalability while varying the number of cores and contexts (
Section 4.1), the second is a direct comparison between SPARTA and
Svelto [
39] on parallel database queries (
Section 4.3), the third is a comparison between SPARTA accelerators and kernels from the graph acceleration library of the commercial tool Vitis HLS (
Section 4.4), and the last one (
Section 4.5) is a comparison with LegUp [
15]. We also include in
Section 4.2 a focus on the effect of different NoC configurations and back pressure conditions on the performance of the
Graph Algorithm Platform Benchmark Suite (GAPBS) accelerators.
We use a custom version of Bambu HLS to run the experiments, which is available as an executable binary.
The following considerations are common to all tests: the accelerators are connected to an external memory with 16 banks and one channel per bank, and the memory delay is assumed to be constant. We indicate with
core the number of parallel accelerator cores, and
C the number of contexts on each core. We report area utilization in terms of
lookup tables (LUTs) and registers. SPARTA components also require one DSP block for each core and one BRAM for every two contexts if reductions are present;
Table 4 reports the area utilization of the main architecture components to isolate the contribution of components that do not depend on the input application. Most of the designs meet the imposed timing constraint after place-and-route except for the largest configurations, where Vivado (the backend tool for AMD/Xilinx FPGAs) was not able to achieve the requested clock period after the placement phase due to high utilization of the device resources.
4.1 GAP Benchmark Suite
The GAPBS [
7] is a collection of benchmarks intended to standardize evaluation for graph research, and it uses OpenMP to parallelize execution. We tested SPARTA on a subset of GAPBS benchmarks, excluding the ones that use OpenMP primitives that have not yet been implemented in SPARTA. The target is an AMD/Xilinx Alveo U280 board with a clock frequency of 250 MHz. In all experiments, the input graph, in Compressed Sparse Row format [
49], is assumed to be stored outside the FPGA in an external DRAM memory.
The
connected components (CC) benchmark finds the largest connected subgraph in the undirected input graph; we synthesized it with configurations that vary from one to eight cores and from one to eight contexts per core. With a larger number of cores/contexts, the memory channel utilization saturates, and there can be no further improvement, as the primary source of latency for the CC application is the access to the external memory. As can be seen in
Figure 7(a), the increase in performance over a sequential version of the application (one core, one context) grows linearly with the number of cores, keeping the number of contexts constant; in fact, having more cores that perform memory requests increases the utilization of the memory channels and the overall performance. We can see the same effect with more contexts on the same core to hide the memory latency: the configuration with eight contexts on one core has the same speedup as the one with eight cores and a single context. The configuration with eight cores and eight contexts is 36 times faster than the sequential version; this is slower than the theoretical speedup because the memory access pattern is irregular, bringing multiple requests on the same memory bank at the same time and thus producing delays and congestion on the NoC. LUTs and registers scale differently with the number of cores/contexts, as can be seen in
Figure 7(b): the number of LUTs grows with the number of OpenMP threads (e.g., 7
\(\times\) increase with respect to the sequential version with eight cores, eight contexts), while liveness analysis during the synthesis flow identifies the minimum number of registers to be replicated for CS, so the number of registers is mostly linked to the number of cores (4
\(\times\) more than in the sequential version).
The second GAPBS kernel we evaluated is PageRank (PR), which assigns a value to each graph node based on its relative importance compared to the other nodes.
Figure 8(a) shows that the speedup over a sequential baseline proportionally increases with the number of cores, similar to the CC benchmark. However, in this case, increasing the number of contexts per core from four to eight does not lead to the same performance improvement because some sections of PR are computation-bound rather than memory-bound, leading to competition for parallel cores rather than stalls due to memory access operations. With the largest configuration, we achieve a 30
\(\times\) speedup with 6.5
\(\times\) the area occupation (considering LUTs, registers occupation only increases 4.5 times).
The last GAPBS benchmark we considered is
Triangle Count (TC), which counts the number of triangles in the input graph. After assessing how performance and resource consumption scale with varying numbers of cores and contexts (
Figure 9) on the Alveo target, we used this benchmark to perform a first comparison against the
Svelto synthesis methodology [
39]. There are a few differences between the two methods that need to be considered.
Svelto is based on a gcc/g++ frontend, while SPARTA uses Clang; the crossbar used in
Svelto has a different behaviour with respect to the NoC used in SPARTA, and the comparison is further complicated by the fact that
Svelto uses atomic-based synchronization while SPARTA implements OpenMP reductions. They also operate under different assumptions regarding the behaviour of the memory (i.e., SPARTA allows the use of asynchronous external memories that can accept multiple requests at the same time while
Svelto only sends one request at a time), and
Svelto dynamically assigns loop iterations to the cores while SPARTA uses static scheduling. Nevertheless,
Svelto remains the state-of-the-art methodology closest to SPARTA, as described in
Section 2.1, and it is worthwhile to examine the QoR of accelerators generated with the two methods. Furthermore, since both use Bambu as HLS backend, we can safely state that the comparison only refers to the differences in the implementation of parallel accelerators rather than being influenced by different choices in the scheduling, allocation, and code generation steps.
The results in
Figure 10(a) show the relative performance of SPARTA against the
Svelto baseline with both tools targeting a Virtex 7 FPGA at 100 MHz; bars higher than one thus represent configurations where SPARTA outperforms
Svelto. While
Svelto is faster on smaller configurations, SPARTA can achieve a higher level of parallelization as the number of cores and contexts grows. There are two main reasons for this: the first one is that the NoC in SPARTA is designed to be effective when there are many requests from different sources, which is achieved when the number of cores is higher; the second reason lies in the ability of SPARTA to allow multiple memory requests simultaneously, which is critical to exploit memory-level parallelism when multiple memory channels are available. In fact, SPARTA adds a tag to external memory requests, allowing it to send multiple pipelined, asynchronous requests, while
Svelto must wait for the memory to answer before performing the next request. With a number of OpenMP threads greater than the number of banks in the external memory, SPARTA can execute more requests in parallel than
Svelto, resulting in a higher speedup. Regarding area usage, as can be seen in the relative comparisons of
Figure 10(b) and (c), SPARTA consumes more resources than
Svelto on the TC benchmark. However, the
area-delay product (ADP) of SPARTA scales better with the number of cores/contexts: computing the ADP as the product of resource usage and execution time, we can see that, for example, the ADP of a SPARTA accelerator with 1 core and 1 context is 1.14 times the ADP of a
Svelto accelerator with the same configuration, and the ratio drops to 0.47 with 32 cores and 8 contexts (SPARTA is 3.8 times faster using only 1.7 times the LUTs and 2.4 times the registers). One reason for the higher resource consumption is that SPARTA is more general than
Svelto: SPARTA explicitly implements multiple OpenMP primitives and synchronization mechanisms, while
Svelto only supports a few patterns, and thus, it can make assumptions on the behaviour of the input program that simplify the underlying architecture. Moreover, SPARTA implements reduction operations that require fewer cycles and memory operations but more area than the atomic operations implemented in
Svelto. Finally,
Svelto has a centralized dispatcher that distributes loop iterations to the cores, while each core in SPARTA contains dedicated logic to compute the current iteration index.
4.2 NoC Evaluation
In this section, we evaluate the performance cost of the NoC and the effect of back pressure on the latency of memory operations.
Table 5 shows data collected by synthesizing and running the TC, PR, and CC benchmarks with 32 cores and eight contexts, as it is the configuration with the largest occurrence of back pressure signals. We tested two different configurations of the NoC: one that registers only the 32 I/O nodes and leaves the other 32 combinatorial, and one that uses only registered nodes. The difference between the two configurations of the NoC is the number of requests it can store, which is equal to the number of registered nodes, and the maximum latency of the NoC, which is two clock cycles in the first case and four in the second. The actual number of cycles needed to traverse the NoC equals the number of registered nodes on the request's path plus the effect of congestion (NoC latency).
BP represents the total number of times the NoC could not accept a memory request because of congestion, measured through the number of back pressure signals received by the FIFO modules in front of all cores, while the number of memory operations is the sum of the requests from all cores on all memory channels. We can compute the average delay spent by a single request in the FIFO modules before entering the NoC by dividing the total number of back pressure signals by the number of memory operations; this is the effect of the arbitration on the main memory bus, and it is very small when compared with the latency of the memory (20 cycles) or the latency of the NoC. The average NoC latency is measured as the average time that it takes for a request to traverse it.
Using a registered NoC, which can store more requests and thus reduce the effect of the arbitration delay, has a negative effect on the total number of cycles for two reasons: while a registered NoC reduces the number of back pressure signals, it increases the minimum time needed to traverse the NoC from two cycles to four, and it increases the internal congestion (visible in the higher average NoC latency). For these reasons, we ran all the experiments in
Section 4.1 with a NoC with only the I/O nodes registered.
4.3 Parallel Queries
In this section, we generate accelerators for the same parallel database queries used in [
39], which are SPARQL queries from the LUBM benchmark and dataset [
23] translated into C++ and parallelized with OpenMP. SPARQL is a query language for graph databases in the
Resource Description Framework (RDF). RDF databases are naturally described as single labelled graphs, and SPARQL queries are defined as (sub)graph matching operations. The resulting code features nested for loops with different behaviours (e.g., number of OpenMP threads, number of loop iterations assigned to each thread) and irregular memory accesses. We compared results obtained with SPARTA and
Svelto on several different parallel queries and also ran the same experiments, adding a cache in the SPARTA accelerators in front of each memory bank (512 lines, 512 bits per line).
Figure 11 shows the relative performance of SPARTA against the
Svelto baseline, with and without caches; as before, bars higher than one represent configurations where SPARTA outperforms
Svelto.
The results we obtained are different depending on the specific query considered and the characteristics of its workload. In Q1, SPARTA without caches is always slower than
Svelto, as it contains a small group of OpenMP threads, each performing a large number of operations, and we have shown how SPARTA performs best with memory-bound workloads. Q4 and Q5 terminate in a few tens of thousands of cycles and are not very indicative of the achievable speedup, as most cycles are spent in synchronization operations. Data for Q5 with 8 cores and 8 contexts is missing as
Svelto does not terminate in that configuration. The results observed align with the ones from TC: SPARTA without caches has worse performance when using few threads but improves when the number of cores and contexts grows. In Q2, Q3, and Q7, SPARTA drastically outperforms
Svelto, even considering the higher resource utilization (
Figure 12), with between 6 and 11.5 times faster execution. In Q1 and Q6 instead,
Svelto is faster, and this can be explained by a lack of parallelism in the query: much of the work has to be executed by a small subset of the threads, and the dynamic assignment of iterations in
Svelto is better suited for this than the static division of work in SPARTA. However, even in these cases, with enough cores and contexts, SPARTA can recover the disadvantage and at least reach the same performance, although using a larger area.
When using caches, SPARTA is almost always faster than Svelto, and this is particularly effective in the test cases where SPARTA previously struggled: in both Q1 and Q6, the introduction of caches in SPARTA makes the generated accelerators more than twice as fast as before. With the largest configurations in Q2, Q3, and Q7, adding the cache reduces the speedup compared with the base version of SPARTA: these tests are the ones in which static scheduling can divide the computations between multiple cores and keep all of them working. Under this circumstance, the main advantage of SPARTA is the ability to parallelize multiple requests running different contexts and hide the memory latency while advancing the state of all the contexts simultaneously; adding caches reduces the average memory latency, and it can cause a loss of parallelization when using CS. In fact, if the memory requests are answered faster, the execution of the different contexts can become serialized and, depending on the data access pattern, lead to worse results.
In
Figure 13, we investigate the influence of CS across various configurations. Due to space constraints, we focus on the speedup resulting from increasing the number of contexts from 1 to 2, 4, and 8 for query Q7. The speedup is visualized using bar charts and compared across different numbers of cores and tools (Svelto, SPARTA, and SPARTA with caches). When the number of contexts is fixed at 1, Svelto and SPARTA exhibit similar execution times, except in the case of 32 cores (see
Figure 11). As expected, introducing caches improves SPARTA's performance across all scenarios. Transitioning from spatial to temporal parallelism, the SPARTA and SPARTA-with-caches approaches demonstrate superior scalability compared to Svelto. In fact, in
Figure 13, the maximum speedup achieved by increasing the number of cores tends to decrease as expected, but the rate of decrease is lower for SPARTA and SPARTA with caches compared to Svelto. This limited scalability of Svelto is due to its interconnection, which scales less efficiently than the NoC used by SPARTA. Furthermore, SPARTA initiates a new memory operation before completing the previous one, enhancing channel utilization. Finally, with caches enabled in SPARTA, the impact of spatial parallelism is heightened while the impact of temporal parallelism is reduced. In
Figure 13, the maximum speedup with Svelto and SPARTA is around 6–7, while with SPARTA with caches, it is around 4.
4.4 Vitis HLS Graph Library
The commercial tool Vitis HLS from AMD/Xilinx provides a library of pre-optimized C++ kernels to implement graph processing accelerators.
The kernels from the Vitis library have been optimized for HLS through parallelization, loop pipelining, and the introduction of caches to reduce memory access times.
We utilized identical graphs to compare the results obtained through the Vitis Graph library with the SPARTA approach. Although our graphs are smaller in scale compared to those used with the Vitis Graph Library, they still provide a reasonable simulation and analysis time for conducting design space exploration. The graph characteristics are detailed in
Table 6. The graphs are stored in the Compressed Sparse Row matrix format as undirected. For TC and PR, directed graphs are required; thus, only the lower triangular part of the matrix is utilized. The SPARTA configurations for TC, CC, and PR are optimized to fully utilize spatial and temporal parallelism, using 32 cores and eight contexts each. However, due to issues in the place-and-route phase, SPARTA cannot achieve the same frequency as Vitis with such large designs. As a result, the post-routing frequency achieved is 200/220 MHz instead of the 250 MHz obtained with smaller configurations.
In the Vitis benchmarks, the delay of the HBM interface between the accelerator and the memory ranges from 100 to 140 ns, depending on the address of the request. To approximate this behaviour in the SPARTA benchmarks, we increased the memory delay to 32 cycles, which at 200/220 MHz is equivalent to 160/145 ns. Additionally, we utilized the AXI4 protocol instead of the minimal interface to exchange data between the accelerators and the external memory.
It is worth noting that both SPARTA and Vitis target an Alveo U250 FPGA, part of the Ultrascale family. The results for each algorithm are reported separately in
Tables 7,
8, and
9.
SPARTA and Vitis can achieve different maximum frequencies. To account for this difference, we calculated the kernel execution time as the achieved clock period times the number of cycles (Time in the Tables). In addition, we computed the AreaXTime metric to normalize the performance of the different designs.
In the CC benchmark, SPARTA outperforms Vitis in terms of execution time by a factor of 10, with only a 30% greater area occupation. Additionally, the AreaXTime metric favours SPARTA by almost an order of magnitude.
In the TC benchmark, SPARTA achieves, on average, a faster execution time. It requires 14% less time to complete the computation of the four graphs. However, it occupies four times the area of the Vitis design.
In the PR benchmark, SPARTA achieves a speedup between four and six times compared to Vitis, occupying only twice the area. Consequently, it achieves an AreaXTime metric at most half that of Vitis.
As performance results for the BFS graph library kernel are not available, we also used Vitis HLS to synthesize a C++ implementation of the BFS algorithm loosely based on the one available in MachSuite [
48]. Note that we are using the common frontier-based implementation rather than the direction-optimizing implementation [
6] (also provided in GAPBS). BFS is the prototypical irregular kernel [
1] where the number of iterations for each thread can be highly unbalanced and requires critical sections to synchronize the different threads.
We tried different configurations for SPARTA, varying the number of cores and contexts, using the accelerator produced by Vitis HLS as a baseline.
Table 10 compares the accelerators produced by Vitis HLS and SPARTA without parallelization (one core and one context). There is only a small difference between the number of cycles for the two tools, while the area is more difficult to compare: Bambu does not require BRAMs for this design as it maps all logic onto LUTs and registers, and almost half of the LUTs and registers used are dedicated to the 16-node NoC (as reported in
Table 4).
Figure 14(a) displays the speedup achieved using different configurations of SPARTA with respect to the number of cycles required by Vitis HLS. In the sequential version, most of the time is spent waiting for the external memory to provide data in both the SPARTA and Vitis implementations. In the SPARTA parallel accelerators, a similar pattern arises when the number of contexts per core is small: each core spends most of the time waiting for new data to process. As we increase the number of contexts, each core spends a greater fraction of time doing useful computations. Even when using multiple contexts, BFS does not achieve the same speedup as more regular applications such as the ones described in
Section 4.1: during the exploration of the graph, each thread can find a different number of elements, leading to a high load imbalance. Furthermore, all threads must be periodically synchronized to avoid exploring the same node of the graph multiple times and to compute the correct length of the shortest path from the origin. Despite this, SPARTA can achieve good speedup when using four contexts per core: with four cores, it achieves a speedup of almost five times, and with two cores, a speedup of three times compared to the Vitis HLS baseline.
Figure 14(b) compares the area utilization of SPARTA and Vitis HLS. When adding a core, SPARTA replicates almost the entire design and adds the necessary communication and synchronization logic. Hence, resource utilization increases a little more than linearly: with four cores and one context, the area occupied is six times that of the baseline. Note that with 4 cores, the NoC is larger than the entire design generated by Vitis HLS. However, its size does not increase when more contexts per core are added.
4.5 LegUp
We also compare SPARTA with the published results obtained by LegUp [
15]. Specifically, we use the Mandelbrot kernel from the latest open-source version of LegUp. This kernel is the Mandelbrot iterative mathematical benchmark, which produces a fractal image and is parallelized with OpenMP for pragmas. The results obtained by LegUp on this benchmark are reported in
Table 11, as published in [
15]. SPARTA achieves better performance across all metrics when using the same Mandelbrot kernel on a Stratix IV device (see
Table 12). This performance advantage can be attributed to several factors, including SPARTA’s use of external memory for storing the Mandelbrot picture, its independence from a soft processor for OpenMP parallelization, and its simpler architecture.
Results from other benchmarks in [
15] would only further widen the performance gap between SPARTA and LegUp. These other benchmarks rely on Pthread primitives in addition to the OpenMP pragmas. Pthread primitives are partially implemented in software to control hardware threads, increasing the overhead introduced by the soft processor and affecting the overall execution.