1 Introduction
Field Programmable Gate Arrays (FPGAs) are reconfigurable devices that can be programmed using a
hardware description language (HDL), such as Verilog or VHDL, to implement specialized accelerators. FPGAs have become a key component of heterogeneous high-performance computing systems, often coupled to other types of accelerators, such as GPUs, to accelerate data-intensive tasks or parts of applications that require low latency processing.
High-level synthesis (HLS) tools simplify the process of developing FPGA-based accelerators by automatically translating software descriptions (typically C or C++ code) into HDL. These tools promise to hide hardware-specific details from the user while still obtaining high
quality of results (QoR) in terms of latency, area usage, and clock frequency [
42].
However, existing HLS tools mainly target regular, arithmetic-intensive workloads operating on dense data structures typical of
digital signal processing (DSP) applications, which present an abundance of instruction-level parallelism. Solutions that generate accelerators from parallel high-level descriptions leverage data-parallel programming models that rely on regular computation and easily partitionable data sets (e.g., OpenCL), as they mainly replicate identical computational units and do not support synchronization-intensive programs well [
14]. Driven by the interest in big data and graph analytics, several approaches attempt to support the acceleration of parallel applications with irregular behaviour. These include solutions that target specific graph algorithms, approaches that try to optimize linear algebra operations (e.g., sparse matrix-vector and sparse matrix-matrix multiplication [
17]), and frameworks based on the
Gather-Apply-Scatter (GAS) paradigm [
33]. These solutions are highly specialized: they either focus on specific graph algorithms, or they use HLS only to generate and optimize computational engines that must be supported by a software framework; the actual execution of graph algorithms mapped onto such accelerators may thus have a higher computational complexity than vertex- or edge-centered formulations.
The OpenMP programming model for shared memory systems has often been successfully used to design parallel irregular algorithms [
7,
36,
38] thanks to its task-level and loop-level parallel constructs. Research on HLS previously explored the use of OpenMP annotations to synthesize applications with coarse-grained task-level parallelism [
15,
47]; however, these approaches employ multi-level designs driven by a microcontroller and cannot manage complex synchronization constructs. Furthermore, they still focus on generating parallel accelerators from programs with regular behaviours. The
Svelto methodology [
39] uses OpenMP annotations as a starting point to synthesize parallel multi-threaded accelerators, targeting the generation of accelerators from specific patterns of parallel graph algorithms. In this article, we propose a more general methodology for the
Synthesis of PARallel multi-threaded Accelerators (SPARTA), providing automated translation of parallel specifications written in OpenMP into efficient parallel accelerators.
SPARTA adopts a fundamentally different and more flexible approach than previous methodologies generating accelerators from OpenMP code. Instead of mapping OpenMP parallel regions marked by compiler directives (pragmas) to a pre-defined parallel architecture template, it translates pragmas into calls to OpenMP runtime primitives and implements corresponding low-level hardware components to be used in the synthesis backend. Accelerators generated with SPARTA can exploit both spatial and temporal parallelism, hiding the latency of external memory accesses through context switching (CS) across multiple execution threads. Moreover, SPARTA can instantiate an advanced hierarchical memory subsystem composed of a custom Network-on-Chip (NoC) that connects multiple external memory channels to each accelerator, memory-side caching, and on-chip private memories for each accelerator. These features allow SPARTA to deal with both regular and irregular behaviours in parallel algorithms and to generate efficient accelerators in terms of performance per area.
We evaluated SPARTA by synthesizing typical parallel graph algorithms, verifying the scalability and the QoR of the generated designs as the numbers of parallel accelerators and contexts change. We also compared accelerators for graph database queries generated with SPARTA to the ones generated with Svelto, obtaining, on average, a 2.29\(\times\) improvement in performance with similar architectural parameters (number of accelerators, contexts, and memory channels).
In summary, this article makes the following contributions:
—
A synthesis methodology to generate multi-threaded parallel accelerators (SPARTA) supporting OpenMP work-sharing and synchronization constructs through the implementation of low-level primitives from the OpenMP runtime system;
—
Support for both spatial and temporal parallelism, with multiple threads that can pause and resume execution on a core when needed;
—
A highly configurable memory system that mitigates the latency of external memory accesses.
The rest of the article is organized as follows:
Section 2 reviews related work on the synthesis of FPGA accelerators from parallel programming languages.
Section 3 describes the SPARTA synthesis flow and architecture.
Section 4 presents experimental results, and
Section 5 concludes the article.
3 SPARTA Synthesis Flow
The SPARTA design flow automatically generates parallel multi-threaded accelerators starting from OpenMP code through an enhanced HLS flow that we implemented within the open-source Bambu tool [
19] (
Figure 1). The input of the SPARTA synthesis methodology is a C/C++ description of a parallel application annotated with OpenMP directives, such as the one shown in
Figure 2(a). SPARTA includes a frontend derived from Clang, which parses the OpenMP annotations and performs initial HLS optimizations. OpenMP directives can be used to define parallel regions and shared data access policies, with many different work-sharing constructs (e.g.,
parallel for,
sections) and synchronization mechanisms [
12]. It is also possible to pass the LLVM IR exposing the OpenMP runtime calls directly to our enhanced Bambu flow, allowing SPARTA to process specifications generated through other frontends that support OpenMP (e.g., Flang, or domain-specific languages that leverage the OpenMP MLIR dialect).
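For illustration, the following is a minimal sketch (not the exact code of Figure 2(a)) of the kind of annotated C input that the flow accepts, with a parallel for directive distributing the iterations of a loop across a team of threads:

#include <stddef.h>

#define M 8   /* number of OpenMP threads requested for the parallel region */

void vector_add(const int *A, const int *B, int *res, size_t n)
{
    /* Work-sharing construct: the iterations are split among the M threads. */
    #pragma omp parallel for num_threads(M) schedule(static)
    for (size_t i = 0; i < n; ++i)
        res[i] = A[i] + B[i];
}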
A conventional compiler flow for parallel architectures translates an OpenMP-annotated description into a parallel application through several transformation passes. We will illustrate the transformations performed within the Clang LLVM compiler since we employ it as a frontend for the SPARTA methodology starting from C/C++ code. However, as previously highlighted, our approach is adaptable to other compiler infrastructures that support an OpenMP runtime library [
35]. Using a conventional OpenMP parallelization flow through Clang/LLVM greatly simplifies introducing support for new constructs that might be added to the OpenMP standard.
We can split the parallelization process into three phases: pre-processing of compiler directives, code transformations, and parallel API library linking. Pre-processing of compiler directives allows gathering information on how the input program should be parallelized, i.e., which code sections can be executed in parallel, how to distribute work across threads, data allocation policies, and more. After pre-processing, the compiler generates an intermediate representation (IR), instruments it with calls to OpenMP runtime functions, and links the OpenMP library, connecting each function call to its target-specific implementation.
In the SPARTA synthesis flow, the Clang-based Bambu frontend compiles the input description following the same steps, except that in the final phase, it maps calls to the OpenMP runtime onto corresponding calls to our custom HLS library. The SPARTA library in Bambu contains optimized Verilog primitives to handle synchronization, concurrency, and memory operations in a parallel architecture, functionally equivalent to the OpenMP runtime library primitives. Starting from the LLVM OpenMP runtime library, we streamlined and reduced it to its essential functionality to generate small and fast components suitable for synthesis on FPGAs. For example, we removed the logic that supports the reuse of threads, because a hardware thread cannot be reassigned to another core, which would require a different datapath and controller. OpenMP constructs that can be replaced with arithmetic and logic operations (e.g., those that compute the range of loop iterations assigned to a thread) are implemented with standard functional units.
Figure 2 shows how the SPARTA frontend translates OpenMP-annotated code (
Figure 2(a)) into an IR with calls to the SPARTA library functions (
Figure 2(b)). The code in
Figure 2(a) denotes a parallel region with
M threads where each thread executes an iteration of a for loop. The IR in
Figure 2(b) shows how the transformations outline the loop body into a separate function (
omp_outlined) and modify the OpenMP runtime library calls. The new function signature contains all the parameters needed to execute an iteration of the loop: the thread ID and references to the arrays
A,
B, and
res. The function body is instrumented with API calls to the SPARTA Bambu library functions that retrieve the thread ID and synchronize with other threads.
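As a purely illustrative sketch (the actual output in Figure 2(b) is LLVM IR, and the real SPARTA/Bambu library symbols differ), the outlined function has roughly the following shape, with placeholder names standing in for the runtime calls that are later mapped onto the HLS library:

/* Placeholder prototypes standing in for SPARTA's HLS-library primitives;
 * the actual symbols and signatures are internal to the Bambu frontend.   */
int  sparta_get_thread_id(int global_tid);
void sparta_barrier(int global_tid);

/* Shape of the outlined parallel-region body: the signature carries the
 * thread ID and references to the shared arrays A, B, and res.           */
static void omp_outlined(int global_tid, const int *A, const int *B, int *res)
{
    int tid = sparta_get_thread_id(global_tid); /* retrieve the thread ID        */
    res[tid] = A[tid] + B[tid];                 /* the iteration of this thread  */
    sparta_barrier(global_tid);                 /* synchronize with the team     */
}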
The instrumented IR containing the API calls that control the parallel execution is subsequently optimized with hardware-oriented code transformations that facilitate the generation of the parallel architecture in the HLS backend.
3.1 The SPARTA Architecture
SPARTA generates a modular architecture that follows the content of the input OpenMP specification and can be customized through user-defined parameters. The main components are depicted in
Figure 3: sequential parts of the computation go through the standard HLS flow, while parallel constructs are translated into a set of parallel processing units called
cores. Each core contains a
finite state machine with datapath (FSMD) implementing the computation described within an OpenMP parallel region (e.g., the loop body in an
omp parallel for loop), a CS manager in case multiple threads are assigned to the same core, and local memory for data belonging to those threads. The number of cores and the number of threads (contexts) assigned to each core are configurable. The parallel region of the accelerator can also contain synchronization components, arbiters, and data shared among threads if needed. The accelerator is connected to a NoC that relays messages to an external memory (with support for multi-banked or pipelined memories); the number of channels between the accelerator and the NoC, as well as between the NoC and the external memory, is configurable. The user encodes information about parallelism, work-sharing policies, and synchronization requirements through the OpenMP annotations in the code and configures the rest of the architecture through command-line options.
3.1.1 Spatial and Temporal Multithreading.
Cores in a SPARTA accelerator execute concurrently, using the FPGA resources to exploit coarse-grained parallelism; instruction-level parallelism is extracted by the HLS tool during the implementation of the FSMD within each core. SPARTA accelerators can also support temporal multithreading by interleaving the execution of multiple threads on the same core: instead of stalling execution and leaving the core underutilized when a thread needs to perform long-latency operations (typically external memory accesses or synchronization, which might take tens of clock cycles), it is possible to perform a CS and execute a different thread on the same datapath, e.g., a different iteration of the same omp for loop.
Each core with CS enabled features a CS manager module to handle the active context and switch to a new one when a memory or synchronization operation starts. For example, on a read/write request to the external memory, the CS manager saves the active context (i.e., the thread state) and starts the execution of a different context in the next clock cycle. Once the core receives the memory response, the CS manager flags the previously paused context as ready and eventually resumes its execution when the currently active context is paused or completed. The CS operation has no overhead: in the clock cycle before the switch, the core executes normally, and at the start of the next clock cycle, the new context restarts from the state where it should have been after it was suspended. The CS manager always performs a switch when the active context starts a wait synchronization operation. This avoids deadlocks when the awaited thread executes on the same core as the waiting thread, as the CS manager can flag the waiting thread as ready when the response from the awaited thread is received. The CS manager also has a set of registers for each context to store the signals received while threads are not ready. When responses arrive at a core, the thread in execution is generally not the one that performed the request: the value is saved and forwarded to the actual thread originating the request when its execution is resumed. In this way, the CS functionality is entirely transparent to the rest of the accelerator and does not force other components to send signals multiple times or stall their execution.
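The following is a behavioural software sketch (not the generated RTL, and with assumed field names) of the scheduling decisions the CS manager makes on a long-latency request and on the corresponding response:

#include <stdbool.h>
#include <stdint.h>

#define NUM_CONTEXTS 4

typedef struct {
    bool     ready;        /* context can be scheduled on the datapath      */
    uint64_t saved_value;  /* response received while the context is paused */
} context_t;

typedef struct {
    context_t ctx[NUM_CONTEXTS];
    int       active;      /* context currently owning the datapath         */
} cs_manager_t;

/* The active context issues a memory or synchronization request: pause it
 * and hand the datapath to another ready context (lowest ID first).        */
static void cs_on_request(cs_manager_t *m)
{
    m->ctx[m->active].ready = false;
    for (int c = 0; c < NUM_CONTEXTS; ++c)
        if (m->ctx[c].ready) { m->active = c; return; }
    /* no ready context: the core idles until a response arrives */
}

/* A response tagged for context `ctx` arrives: buffer the value and flag
 * the paused context as ready; it resumes when the datapath is free again. */
static void cs_on_response(cs_manager_t *m, int ctx, uint64_t value)
{
    m->ctx[ctx].saved_value = value;
    m->ctx[ctx].ready = true;
}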
In the core datapath, registers store intermediate signals between combinational regions; when many contexts share the same core, it is necessary to store a signal for each context. Each context thus needs its own set of registers and a control signal to select the active set of input and output registers; the new register set with the context selector control signal is commonly referred to as a register file. Since register files can consume considerably more resources than standard registers, it is fundamental to avoid them when possible. In the SPARTA architecture, CS can happen only during memory accesses or synchronization calls, which are known at compile time: the synthesis flow exploits this information to replace registers with register files only within control states where CS can happen, while other registers are left unchanged. As an example, consider a core executing a floating-point add operation: two values are loaded from memory, the sum of the two values is computed, and the result is stored back in memory. Load and store operations can trigger a CS to hide the memory access latency; thus, registers connected to load and store operations in this core must be replaced with register files to save signal states and later restore them. Instead, the floating-point sum can never be interrupted by a CS, and thus none of the registers present in the floating-point adder logic needs to be replaced with a register file.
3.2 Thread Synchronization
Three synchronization modules are key components of the SPARTA architecture: the barrier manager, the reduction manager, and the critical manager. They are only included in the parallel region if they are needed, and their size is proportional to the number of cores that use them.
Barriers are added by the OpenMP runtime at the end of a parallel region to ensure that the execution can proceed safely if the threads have different execution times. In the OpenMP paradigm, when a thread encounters a barrier, it notifies all other threads in the same team that it has reached the barrier and waits until all other threads also signal that they have reached the end of their execution. To implement barrier synchronization, the SPARTA architecture includes local components for each core that generate completion signals and a centralized barrier manager that collects completion signals, updates a synchronization register, and, when all required threads have reached the barrier, sends them a command to resume execution. Sharing a unique barrier manager among all parallel cores ensures that synchronization happens safely without the possibility of race conditions. Furthermore, the same barrier manager component can also be used to implement the wait_thread OpenMP operation, which waits for the completion of one specific thread without adding more logic or duplicating the synchronization state.
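A compact software model of the centralized barrier manager described above, assuming (for illustration) one bit per thread in the synchronization register, could look as follows:

#include <stdbool.h>
#include <stdint.h>

#define NUM_THREADS 8

typedef struct {
    uint32_t arrived;   /* bit i is set once thread i has reached the barrier */
} barrier_manager_t;

/* Collect a completion signal; returns true when all threads have arrived,
 * i.e., when the manager broadcasts the command to resume execution.        */
static bool barrier_arrive(barrier_manager_t *b, int thread_id)
{
    b->arrived |= 1u << thread_id;
    if (b->arrived == (1u << NUM_THREADS) - 1u) {
        b->arrived = 0;   /* reset the synchronization register for the next barrier */
        return true;
    }
    return false;         /* the calling core may context switch while it waits */
}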
In OpenMP reduction operations, the thread with ID 0 needs to retrieve partial reduction variables from all other threads and combine them. Each thread on a SPARTA core participating in a reduction communicates with the reduction manager to indicate the address of its reduction variable in its local memory; thread 0 then accesses the reduction manager to retrieve the addresses of all partial results it needs to merge. In fact, as will be explained in
Section 3.3, a thread can access the local memory of another thread if it knows the correct address. Such a scheme could have also been implemented by storing the addresses of partial reductions in the external shared memory but with the considerable latency overhead of additional memory requests. Instead, the SPARTA reduction manager implemented inside the accelerator allows one-cycle access to the information required to perform the reduction.
We chose to implement the reduction following a linear pattern: after thread 0 finishes its part of the computation, it collects all the values computed by the others and merges the results. Other reduction schemes may result in a different tradeoff between area occupation and speed. For example, using more threads to merge the results is possible at the cost of duplicating more logic. However, a linear reduction operation often requires only five or six cycles per merged thread, which is negligible compared with the duration of the computation. For this reason, when multiple contexts can be activated, the policy of the CS manager is to choose the thread with the lowest thread ID: this gives priority to thread 0, which is the one that could perform a reduction operation, and then to the other threads in order since the reduction proceeds following the thread IDs.
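The linear scheme can be summarized by the following software sketch (illustrative names; in hardware the address table lives in the reduction manager and is accessible in one cycle):

#define NUM_THREADS 8

typedef struct {
    const int *partial_addr[NUM_THREADS]; /* local-memory address of each thread's partial result */
} reduction_manager_t;

/* Every thread registers the address of its reduction variable. */
static void reduction_register(reduction_manager_t *r, int tid, const int *addr)
{
    r->partial_addr[tid] = addr;
}

/* Executed by thread 0 after finishing its own share of the computation:
 * walk the table and merge the partial results read from the other
 * threads' local memories (reachable through the local bus, Section 3.3). */
static int reduction_merge(const reduction_manager_t *r, int own_partial)
{
    int sum = own_partial;
    for (int t = 1; t < NUM_THREADS; ++t)
        sum += *r->partial_addr[t];
    return sum;
}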
In the OpenMP paradigm, the critical pragma is used to identify a set of instructions that can only be executed by a single thread at once. SPARTA adds a critical manager that behaves like a mutex and a local component for each thread that encounters the critical section. When a thread reaches the critical section, the local component sends a signal to request control of the lock, and the critical manager grants it only if it is not currently assigned to another thread. If the lock is not acquired by the current thread, the CS manager performs a CS. The critical manager stores the state of the requests and the current lock owner and, like the CS manager, prioritizes the thread with the lowest ID when multiple threads require access. When the thread finishes the execution of the critical section, it sends a signal to release the acquired lock to avoid deadlock and allow other threads to resume their execution.
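A behavioural sketch of the critical manager (again, a software model with assumed names rather than the generated hardware) is the following:

#include <stdbool.h>
#include <stdint.h>

#define NUM_THREADS 8

typedef struct {
    uint32_t pending;   /* bit i set while thread i is requesting the lock */
    int      owner;     /* -1 when the lock is free                        */
} critical_manager_t;

/* Request the lock; returns true if granted. If not granted, the caller's
 * core performs a context switch and retries when the lock is released.   */
static bool critical_acquire(critical_manager_t *c, int tid)
{
    c->pending |= 1u << tid;
    if (c->owner == -1)
        for (int t = 0; t < NUM_THREADS; ++t)        /* lowest ID first */
            if (c->pending & (1u << t)) { c->owner = t; break; }
    return c->owner == tid;
}

/* Release the lock at the end of the critical section so that another
 * pending thread can be granted access.                                   */
static void critical_release(critical_manager_t *c, int tid)
{
    c->pending &= ~(1u << tid);
    c->owner = -1;
}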
3.3 Memory Hierarchy
SPARTA exploits a hierarchical memory model that allows reducing the number of requests to the external memory and keeping data inside the accelerator as much as possible. There are three possibilities to store variables, visible in
Figure 3: external memory outside the accelerator (high-latency memory),
block RAM (BRAM) modules in the FPGA fabric inside the accelerator that are accessible by all cores, and private BRAM modules inside parallel cores. External memory is usually suitable for input and output parameters that cannot fit into on-chip BRAMs, while variables that are defined and used during the computation are better allocated inside the accelerator, whether in shared or private memory blocks, to lower access latency and improve performance.
The various levels of the memory hierarchy are connected through an ad hoc bus infrastructure, depicted in
Figure 4, that can handle multiple concurrent memory requests. The local bus has one channel and connects cores within the same parallel region, while the main accelerator bus connects parallel regions with the on-chip shared memory and with the NoC toward the external memory. Each memory request originating from one of the parallel cores goes through a filter; the filter interprets its address and properly routes the request to internal BRAMs of other cores through the local bus or outside the parallel region through the main bus. Since BRAMs have two channels, local memories can support concurrent access from both the owner core and the local bus. SPARTA can connect the computation units directly to the local memory without using the local bus if it can infer at compile time the address of the requests and the variable is allocated inside the same core.
SPARTA memory addresses encode the physical location of the memory module where data is allocated;
Figure 5 shows the structure of external and internal memory addresses, which have the same length because they travel on the same local bus. The size of the address depends on the external memory size (which is likely considerably larger than internal memory) plus one bit that is added to differentiate between internal and external addresses. This allows a very simple decoding logic: if the external bit is equal to 1, the request is propagated to the external memory. For internal memory, a dedicated pagination algorithm allocates each memory variable considering the parallel region and thread it belongs to. The address of internal variables is composed of three parts, representing the address of the variable in the memory module, the thread that owns the variable, and the parallel region in which the memory module is allocated, respectively. To build the address, SPARTA first finds the largest internal memory module and computes the number of bits needed to represent it, so this number of bits can be reserved to uniquely identify a variable inside each memory module. Similarly, the size of the second part of the address is set by looking at the largest number of threads assigned to the different parallel regions. Finally, a unique code is assigned to each parallel region, which is used by the filters to determine whether an address is local to the parallel region. Looking at the appropriate bits of the address, filters and arbiters in the memory interconnection system can properly forward a memory request to a different bus or component using combinatorial logic without adding delays to the requests. For example, if the parallel region ID in an address matches the one stored in the filter, the request is forwarded to the local bus, where the arbiter will find the correct core by looking at the thread ID. Instead, if the parallel region ID does not match or if the address has the external bit set to one, the request is sent to the main bus.
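The address layout and the filter decision can be illustrated with the following sketch, where the field widths are example values (the synthesis flow derives them from the largest local memory module, the largest thread team, and the number of parallel regions):

#include <stdint.h>

#define OFFSET_BITS 10   /* enough bits for the largest internal memory module */
#define THREAD_BITS  3   /* enough bits for the largest thread team            */
#define REGION_BITS  2   /* one code per parallel region                       */
#define EXTERNAL_BIT (1u << (REGION_BITS + THREAD_BITS + OFFSET_BITS))

/* Build an internal address: | external=0 | region | thread | offset | */
static uint32_t make_internal_addr(uint32_t region, uint32_t thread, uint32_t offset)
{
    return (region << (THREAD_BITS + OFFSET_BITS)) | (thread << OFFSET_BITS) | offset;
}

/* Filter decision: requests with the external bit set, or whose region ID
 * does not match the local one, go to the main accelerator bus; the rest
 * stay on the local bus of the parallel region.                            */
static int goes_to_local_bus(uint32_t addr, uint32_t my_region)
{
    if (addr & EXTERNAL_BIT)
        return 0;                                    /* external memory, via main bus and NoC */
    uint32_t region = (addr >> (THREAD_BITS + OFFSET_BITS)) & ((1u << REGION_BITS) - 1u);
    return region == my_region;
}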
If two requests arrive in the same cycle, the bus arbiter can only accept one of them, and the other must be stalled; in this case, SPARTA generates a back pressure signal between arbiters and cores to resend requests that could not be accepted in the current cycle. With only one context in execution on the core, execution has to be stalled as it needs the result of the memory operation to proceed, but when multiple contexts are assigned to the same core, another context can be executed in the meantime. A First-In-First-Out (FIFO) queue holds requests before the memory filter so that the CS manager can switch the running context at the start of each memory request instead of when an arbiter accepts it; this prevents the effects of the arbitration from having repercussions on the core execution, as the FIFO is responsible for resending the requests to the arbiter while the core can continue its operation with another context.
3.4 External Memory Interface
The parallel accelerator generated by SPARTA is connected to an external memory module with a fixed number of memory channels. Thus, memory requests need arbitration as they can originate from multiple sources (i.e., the parallel cores) and be directed to multiple destinations. Instead of using dedicated hardware modules or another arbiter, we introduced a custom NoC interconnect that allows maintaining a short clock period. The NoC interconnect is composed of two unidirectional NoC components (
Figure 6), one for the requests from the cores to the memory and the other for the opposite path.
The proposed NoC consists of a group of interconnected nodes following a butterfly structure, organized in \(n\) levels where each level has a number of nodes equal to \(2^{n}\), for a total number of nodes equal to \(n*2^{n}\). The number of nodes in the NoC must be at least equal to the maximum between the number of banks in the external memory and the number of cores in the parallel region (destination and source of the NoC, respectively). Each node in the NoC can be configured to be registered or combinatorial, providing a tradeoff between the frequency of the NoC and the number of cycles needed to traverse it: in fact, combinatorial nodes produce output values in the same cycle, increasing the critical path, while registered nodes introduce a delay. Nodes connected to external components are called I/O nodes and must be registered to avoid latches; for the rest of the nodes, the choice to use combinatorial nodes or have only registered ones is left to the user.
The number of cycles needed to traverse the NoC is equal to the number of registered nodes on the path of the request and, at most, is equal to the number of levels. To minimize the average travel time inside the NoC, SPARTA organizes the registered nodes by filling entire levels before moving to the next one. For example, consider a NoC connecting 16 memory banks to 16 parallel cores: the number of levels must be 3 (as \(2*2^{2}=8 < 16\)), but if we choose to use combinatorial nodes, it can be traversed in 2 cycles, as only the 16 I/O nodes must be registered and they can be arranged in the first two levels. The NoC is resilient to congestion: if two requests want to go to the same node, there is no need to stall, as the node can redirect one of them on another route to the same destination. The NoC can thus accept multiple requests: each node has two internal input and two internal output connections, and an I/O node can take a request if one of its internal inputs is not active in the current cycle. Registered nodes can also store a request, while combinatorial nodes forward requests immediately and do not store information. If both internal input connections are active, the I/O node of the NoC sends a back pressure signal to resend the request in the next cycle. When using multiple contexts, back pressure signals do not interfere with the execution of the cores, as they are intercepted by the FIFO modules, which take the role of resending the requests, as described in
Section 3.3.
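A small helper reproducing the sizing rule discussed above, assuming the stated butterfly organization (n levels of 2^n nodes each), is shown below:

#include <stdio.h>

/* Smallest n such that a butterfly with n levels of 2^n nodes
 * (n * 2^n nodes in total) covers max(#banks, #cores) endpoints. */
static unsigned noc_levels(unsigned banks, unsigned cores)
{
    unsigned need = banks > cores ? banks : cores;
    unsigned n = 1;
    while (n * (1u << n) < need)
        ++n;
    return n;
}

int main(void)
{
    /* 16 banks and 16 cores: 2 levels give 2*4 = 8 < 16 nodes, so 3 levels
     * are needed; with only the 16 I/O nodes registered, a request can
     * traverse the NoC in 2 cycles, as stated above.                       */
    printf("levels = %u\n", noc_levels(16, 16));   /* prints 3 */
    return 0;
}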
The SPARTA architecture allows the accelerator to connect to different types of external memory, with simple or pipelined memory access, and for both unified and banked multi-channel memories. The accelerator does not assume that the memory is immediately ready to accept a request and uses a protocol with back pressure that can be used to synchronize with it. Furthermore, SPARTA cores work under the assumption that answers from the memory could come out of order, as traffic in the NoC or the external memory could change the relative order of requests. Each request is tagged with an ID used to identify the thread that performed it and correctly forward the answer. If the external memory is banked, it is possible to add one cache in front of each bank to further reduce the average latency of the memory accesses. (There is no need to introduce a cache consistency mechanism, as each cache has its own set of addresses associated with a different memory bank.)
We introduced a set of custom pragmas and parameters to specify information about the accelerator interfaces for variables that will be stored in external memory and model their behaviour during the simulation of the generated accelerator (
Table 3). Specific variables can be allocated to specific banks to mimic access patterns of real memories. For example,
#pragma HLS_interface a banked bank_number = 16 bank_chunk = 256 specifies that the
a variable needs to be stored in a multi-banked external memory that has 16 banks of 256 bytes each.
SPARTA accelerators, by default, exchange data with other accelerators and the external memory using a simple and efficient protocol called
minimal memory interface, described in
Table 2. The protocol requires a limited number of signals and logic elements to perform memory operations, reducing the size of internal components (particularly the NoC). However, SPARTA also supports more complex protocols such as AXI4 [
4] by adding a bridge component before each memory bank that translates each request and response from the minimal protocol to AXI4. The translation is completely transparent to the accelerator, correctly propagating back pressure signals from and to the external memory.
3.5 Hardware-Oriented Transformations
We describe here two key transformations of the SPARTA methodology that reduce the impact of parallelization on resource utilization: thread ID propagation and parallel API pointer promotion.
Thread ID propagation. SPARTA generates an array of cores by replicating the datapaths of omp_outlined functions corresponding to the content of an OpenMP parallel region. While simply replicating the whole datapath provides a functional design, considering which parts are effectively used at runtime can reduce resource utilization. Because thread IDs are statically assigned to each core, our approach can use them to determine the resources of the datapath that each thread actually uses. In fact, in the OpenMP work-sharing model, it is possible to differentiate the computation assigned to threads in the same group, but only depending on the thread ID. Bambu features several code analyses and transformations that SPARTA can exploit to statically propagate thread IDs through the replicated datapaths and to automatically perform dead code elimination, leading to the generation of the minimum set of resources required for each of the cores. Static thread ID propagation also avoids the replication of sections of the datapath only used by one thread, such as resources that execute reduction operations.
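For example, in a fragment like the following (illustrative code), once the thread ID is propagated as a compile-time constant into each replicated core, dead code elimination keeps the reduction-merge logic only in the core hosting thread 0 and removes it from all the others:

/* Only the core whose (statically known) thread ID is 0 retains the body
 * of the if-branch in its datapath; in every other core the branch is
 * removed by dead code elimination after thread ID propagation.           */
static int merge_partials(int tid, const int *partial, int nthreads)
{
    int result = 0;
    if (tid == 0) {
        for (int t = 0; t < nthreads; ++t)
            result += partial[t];   /* reduction merge, instantiated once */
    }
    return result;
}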
Pointer promotion. The OpenMP API makes extensive use of pointers and multiple levels of indirection to exchange references to internal data structures and parameters between threads; such a programming model is generic and versatile, but it introduces overhead for memory accesses and pointer resolution. In fact, memory operations on FPGA typically need to access external memories and are thus significantly more expensive in terms of latency and energy with respect to arithmetic operations. Whenever possible, SPARTA leverages static resolution of multiple indirections to reduce the number of memory operations. Because the input application and the architectural parameters (e.g., number of contexts per core and memory architecture) are fixed, the SPARTA frontend can apply function versioning for each call to functions of the parallel API. After versioning, each function represents an instance of a hardware component, and static analysis can be performed on the IR to infer dynamically allocated data structures and memory accesses: simple and multiple indirections are solved with pointer promotion, and statically allocated data structures enable further optimizations during the HLS flow. Once parallel API versioning is completed, it is also possible to perform a more detailed alias analysis on pointers to understand what memory structures are accessed by each operation. In this way, it is possible to pinpoint shared memory variables whose location will be shared at runtime between threads and allocate them in shared BRAMs while the remaining private memory variables are allocated in the local memories.
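The effect of versioning and pointer promotion can be sketched as follows (illustrative code; the generic form mirrors the pointer-heavy calling convention of the OpenMP runtime, while the versioned form receives the promoted references directly):

#include <stdint.h>

/* Generic runtime-style convention: arguments packed behind void pointers,
 * forcing extra loads to resolve the indirections at run time.             */
static int load_generic(void **args)
{
    const int *A = *(const int **)args[0];  /* first indirection  */
    int        i = *(const int  *)args[1];  /* second indirection */
    return A[i];                            /* the actual access  */
}

/* After function versioning and static pointer resolution: the indirections
 * disappear and only the data access itself remains as a memory operation.  */
static int load_promoted(const int *A, int i)
{
    return A[i];
}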
4 Experimental Evaluation
We showcase the results that can be achieved with SPARTA through four experiments: the first consists of synthesizing three graph processing benchmarks to show speedup and scalability while varying the number of cores and contexts (
Section 4.1), the second is a direct comparison between SPARTA and
Svelto [
39] on parallel database queries (
Section 4.3), the third is a comparison between SPARTA accelerators and kernels from the graph acceleration library of the commercial tool Vitis HLS (
Section 4.4), and the last one (
Section 4.5) is a comparison with LegUp [
15]. We also include in
Section 4.2 a focus on the effect of different NoC configurations and back pressure conditions on the performance of the
Graph Algorithm Platform Benchmark Suite (GAPBS) accelerators.
We use a custom version of Bambu HLS to run the experiments, which is available as an executable binary.
The following considerations are common to all tests: the accelerators are connected to an external memory with 16 banks and one channel per bank, and the memory delay is assumed to be constant. We indicate with
core the number of parallel accelerator cores, and
C the number of contexts on each core. We report area utilization in terms of
lookup tables (LUTs) and registers. SPARTA components also require one DSP block for each core and one BRAM for every two contexts if reductions are present;
Table 4 reports the area utilization of the main architecture components to isolate the contribution of components that do not depend on the input application. Most of the designs meet the imposed timing constraint after place-and-route except for the largest configurations, where Vivado (the backend tool for AMD/Xilinx FPGAs) was not able to achieve the requested clock period after the placement phase due to high utilization of the device resources.
4.1 GAP Benchmark Suite
The GAPBS [
7] is a collection of benchmarks intended to standardize evaluation for graph research, and it uses OpenMP to parallelize execution. We tested SPARTA on a subset of GAPBS benchmarks, excluding the ones that use OpenMP primitives that have not yet been implemented in SPARTA. The target is an AMD/Xilinx Alveo U280 board with a clock frequency of 250 MHz. In all experiments, the input graph, in Compressed Sparse Row format [
49], is assumed to be stored outside the FPGA in an external DRAM memory.
The
connected components (CC) benchmark finds the largest connected subgraph in the undirected input graph; we synthesized it with configurations that vary from one to eight cores and from one to eight contexts per core. With a larger number of cores/contexts, the memory channel utilization saturates, and there can be no further improvement, as the primary source of latency for the CC application is the access to the external memory. As can be seen in
Figure 7(a), the increase in performance over a sequential version of the application (one core, one context) grows linearly with the number of cores, keeping the number of contexts constant; in fact, having more cores that perform memory requests increases the utilization of the memory channels and the overall performance. We can see the same effect with more contexts on the same core to hide the memory latency: the configuration with eight contexts on one core has the same speedup as the one with eight cores and a single context. The configuration with eight cores and eight contexts is 36 times faster than the sequential version; this is slower than the theoretical speedup because the memory access pattern is irregular, bringing multiple requests on the same memory bank at the same time and thus producing delays and congestion on the NoC. LUTs and registers scale differently with the number of cores/contexts, as can be seen in
Figure 7(b): the number of LUTs grows with the number of OpenMP threads (e.g., 7
\(\times\) increase with respect to the sequential version with eight cores, eight contexts), while liveness analysis during the synthesis flow identifies the minimum number of registers to be replicated for CS, so the number of registers is mostly linked to the number of cores (4
\(\times\) more than in the sequential version).
The second GAPBS kernel we evaluated is PageRank (PR), which assigns a value to each graph node based on its relative importance compared to the other nodes.
Figure 8(a) shows that the speedup over a sequential baseline proportionally increases with the number of cores, similar to the CC benchmark. However, in this case, increasing the number of contexts per core from four to eight does not lead to the same performance improvement because some sections of PR are computation-bound rather than memory-bound, leading to competition for parallel cores rather than stalls due to memory access operations. With the largest configuration, we achieve a 30
\(\times\) speedup with 6.5
\(\times\) the area occupation (considering LUTs, registers occupation only increases 4.5 times).
The last GAPBS benchmark we considered is
Triangle Count (TC), which counts the number of triangles in the input graph. After assessing how performance and resource consumption scale with varying numbers of cores and contexts (
Figure 9) on the Alveo target, we used this benchmark to perform a first comparison against the
Svelto synthesis methodology [
39]. There are a few differences between the two methods that need to be considered.
Svelto is based on a gcc/g++ frontend, while SPARTA uses Clang; the crossbar used in
Svelto has a different behaviour with respect to the NoC used in SPARTA, and the comparison is further complicated by the fact that
Svelto uses atomic-based synchronization while SPARTA implements OpenMP reductions. They also operate under different assumptions regarding the behaviour of the memory (i.e., SPARTA allows the use of asynchronous external memories that can accept multiple requests at the same time while
Svelto only sends one request at a time), and
Svelto dynamically assigns loop iterations to the cores while SPARTA uses static scheduling. Nevertheless,
Svelto remains the state-of-the-art methodology closest to SPARTA, as described in
Section 2.1, and it is worthwhile to examine the QoR of accelerators generated with the two methods. Furthermore, since both use Bambu as HLS backend, we can safely state that the comparison only refers to the differences in the implementation of parallel accelerators rather than being influenced by different choices in the scheduling, allocation, and code generation steps.
The results in
Figure 10(a) show the relative performance of SPARTA against the
Svelto baseline with both tools targeting a Virtex 7 FPGA at 100 MHz; bars higher than one thus represent configurations where SPARTA outperforms
Svelto. While
Svelto is faster on smaller configurations, SPARTA can achieve a higher level of parallelization as the number of cores and contexts grows. There are two main reasons for this: the first one is that the NoC in SPARTA is designed to be effective when there are many requests from different sources, which is achieved when the number of cores is higher; the second reason lies in the ability of SPARTA to allow multiple memory requests simultaneously, which is critical to exploit memory-level parallelism when multiple memory channels are available. In fact, SPARTA adds a tag to external memory requests, allowing it to send multiple pipelined, asynchronous requests, while
Svelto must wait for the memory to answer before performing the next request. With a number of OpenMP threads greater than the number of banks in the external memory, SPARTA can execute more requests in parallel than
Svelto, resulting in a higher speedup. Regarding area usage, as can be seen in the relative comparisons of
Figure 10(b) and (c), SPARTA consumes more resources than
Svelto on the TC benchmark. However, the
area-delay product (ADP) of SPARTA scales better with the number of cores/contexts: computing the ADP as the product of resource usage and execution time, we can see that, for example, the ADP of a SPARTA accelerator with 1 core and 1 context is 1.14 times the ADP of a
Svelto accelerator with the same configuration, and the ratio drops to 0.47 with 32 cores and 8 contexts (SPARTA is 3.8 times faster using only 1.7 times the LUTs and 2.4 times the registers). One reason for the higher resource consumption is that SPARTA is more general than
Svelto: SPARTA explicitly implements multiple OpenMP primitives and synchronization mechanisms, while
Svelto only supports a few patterns, and thus, it can make assumptions on the behaviour of the input program that simplify the underlying architecture. Moreover, SPARTA implements reduction operations that require fewer cycles and memory operations but more area than the atomic operations implemented in
Svelto. Finally,
Svelto has a centralized dispatcher that distributes loop iterations to the cores, while each core in SPARTA contains dedicated logic to compute the current iteration index.
4.2 NoC Evaluation
In this section, we evaluate the performance cost of the NoC and the effect of back pressure on the latency of memory operations.
Table 5 shows data collected by synthesizing and running the TC, PR, and CC benchmarks with 32 cores and eight contexts, as it is the configuration with the largest occurrence of back pressure signals. We tested two different configurations of the NoC: one that registers only the 32 I/O nodes and leaves the other 32 combinatorial, and one that uses only registered nodes. The difference between the two configurations of the NoC is the number of requests it can store, which is equal to the number of registered nodes, and the maximum latency of the NoC, which is two clock cycles in the first case and four in the second. The actual number of cycles needed to traverse the NoC equals the number of registered nodes on the request's path plus the effect of congestion (NoC latency).
BP represents the total number of times the NoC could not accept a memory request because of congestion, measured through the number of back pressure signals received by the FIFO modules in front of all cores, while the number of memory operations is the sum of the requests from all cores on all memory channels. We can compute the average delay spent by a single request in the FIFO modules before entering the NoC by dividing the total number of back pressure signals by the number of memory operations; this is the effect of the arbitration on the main memory bus, and it is very small when compared with the latency of the memory (20 cycles) or the latency of the NoC. The average NoC latency is measured as the average time that it takes for a request to traverse it.
Using a registered NoC, which can store more requests and thus reduce the effect of the arbitration delay, has a negative effect on the total number of cycles for two reasons: while a registered NoC reduces the number of back pressure signals, it increases the minimum time needed to traverse the NoC from two cycles to four, and it increases the internal congestion (visible in the higher average NoC latency). For these reasons, we ran all the experiments in
Section 4.1 with a NoC with only the I/O nodes registered.
4.3 Parallel Queries
In this section, we generate accelerators for the same parallel database queries used in [
39], which are SPARQL queries from the LUBM benchmark and dataset [
23] translated into C++ and parallelized with OpenMP. SPARQL is a query language for graph databases in the
Resource Description Framework (RDF). RDF databases are naturally described as single labelled graphs, and SPARQL queries are defined as (sub)graph matching operations. The resulting code features nested for loops with different behaviours (e.g., number of OpenMP threads, number of loop iterations assigned to each thread) and irregular memory accesses. We compared results obtained with SPARTA and
Svelto on several different parallel queries and also ran the same experiments, adding a cache in the SPARTA accelerators in front of each memory bank (512 lines, 512 bits per line).
Figure 11 shows the relative performance of SPARTA against the
Svelto baseline, with and without caches; as before, bars higher than one represent configurations where SPARTA outperforms
Svelto.
The results we obtained are different depending on the specific query considered and the characteristics of its workload. In Q1, SPARTA without caches is always slower than
Svelto, as it contains a small group of OpenMP threads, each performing a large number of operations, and we have shown how SPARTA performs best with memory-bound workloads. Q4 and Q5 terminate in a few tens of thousands of cycles and are not very indicative of the achievable speedup, as most cycles are spent in synchronization operations. Data for Q5 with 8 cores and 8 contexts is missing as
Svelto does not terminate in that configuration. The results observed align with the ones from TC: SPARTA without caches has worse performance when using few threads but improves when the number of cores and contexts grows. In Q2, Q3, and Q7, SPARTA drastically outperforms
Svelto, even considering the higher resource utilization (
Figure 12), with between 6 and 11.5 times faster execution. In Q1 and Q6 instead,
Svelto is faster, and this can be explained by a lack of parallelism in the query: much of the work has to be executed by a small subset of the threads, and the dynamic assignment of iterations in
Svelto is better suited for this than the static division of work in SPARTA. However, even in these cases, with enough cores and contexts, SPARTA can recover the disadvantage and at least reach the same performance, although using a larger area.
When using caches, SPARTA is almost always faster than Svelto, and this is particularly effective in the test cases where SPARTA previously struggled: in both Q1 and Q6, the introduction of caches in SPARTA makes the generated accelerators more than twice as fast as before. With the largest configurations in Q2, Q3, and Q7, adding the cache reduces the speedup compared with the base version of SPARTA: these tests are the ones in which static scheduling can divide the computations between multiple cores and keep all of them working. Under this circumstance, the main advantage of SPARTA is the ability to parallelize multiple requests running different contexts and hide the memory latency while advancing the state of all the contexts simultaneously; adding caches reduces the average memory latency, and it can cause a loss of parallelization when using CS. In fact, if the memory requests are answered faster, the execution of the different contexts can become serialized and, depending on the data access pattern, lead to worse results.
In
Figure 13, we investigate the influence of CS across various configurations. Due to space constraints, we focus on the speedup resulting from increasing the number of contexts from 1 to 2, 4, and 8 for query Q7. The speedup is visualized using bar charts and compared across different numbers of cores and tools (Svelto, SPARTA, and SPARTA with caches). When the number of contexts is fixed at 1, Svelto and SPARTA exhibit similar execution times, except in the case of 32 cores (see
Figure 11). As expected, introducing caches improves SPARTA's performance across all scenarios. Transitioning from spatial to temporal parallelism, the SPARTA and SPARTA-with-caches approaches demonstrate superior scalability compared to Svelto. In fact, in
Figure 13, the maximum speedup achieved by increasing the number of cores tends to decrease as expected, but the rate of decrease is lower for SPARTA and SPARTA with caches compared to Svelto. This limited scalability of Svelto is due to its interconnection, which scales less efficiently than the NoC used by SPARTA. Furthermore, SPARTA initiates a new memory operation before completing the previous one, enhancing channel utilization. Finally, with caches enabled in SPARTA, the impact of spatial parallelism is heightened while the impact of temporal parallelism is reduced. In
Figure 13, the maximum speedup with Svelto and SPARTA is around 6–7, while with SPARTA with caches, it is around 4.
4.4 Vitis HLS Graph Library
The commercial tool Vitis HLS from AMD/Xilinx provides a library of pre-optimized C++ kernels to implement graph processing accelerators.
The kernels from the Vitis library have been optimized for HLS through parallelization, loop pipelining, and the introduction of caches to reduce memory access times.
We utilized identical graphs to compare the results obtained through the Vitis Graph library with the SPARTA approach. Although our graphs are smaller in scale compared to those used with the Vitis Graph Library, they still provide a reasonable simulation and analysis time for conducting design space exploration. The graph characteristics are detailed in
Table 6. The graphs are stored in the Compressed Sparse Row matrix format as undirected. For TC and PR, directed graphs are required; thus, only the lower triangular part of the matrix is utilized. The SPARTA configurations for TC, CC, and PR are optimized to fully utilize spatial and temporal parallelism, using 32 cores and eight contexts each. However, due to issues in the place-and-route phase, SPARTA cannot achieve the same frequency as Vitis with such large designs. As a result, the post-routing frequency achieved is 200/220 MHz instead of the 250 MHz obtained with smaller configurations.
In the Vitis benchmarks, the delay of the HBM interface between the accelerator and the memory ranges from 100 to 140 ns, depending on the address of the request. To approximate this behaviour in the SPARTA benchmarks, we increased the memory delay to 32 cycles, which at 200/220 MHz is equivalent to 160/145 ns. Additionally, we utilized the AXI4 protocol instead of the minimal interface to exchange data between the accelerators and the external memory.
It is worth noting that both SPARTA and Vitis target an Alveo U250 FPGA, part of the Ultrascale family. The results for each algorithm are reported separately in
Tables 7,
8, and
9.
SPARTA and Vitis can achieve different maximum frequencies. To account for this difference, we calculated the kernel execution time as the achieved clock period times the number of cycles (Time in the Tables). In addition, we computed the AreaXTime metric to normalize the performance of the different designs.
In the CC benchmark, SPARTA outperforms Vitis in terms of execution time by a factor of 10, with only a 30% greater area occupation. Additionally, the AreaXTime metric favours SPARTA by almost an order of magnitude.
In the TC benchmark, SPARTA achieves, on average, a faster execution time. It requires 14% less time to complete the computation of the four graphs. However, it occupies four times the area of the Vitis design.
In the PR benchmark, SPARTA achieves a speedup between four and six times compared to Vitis, occupying only twice the area. Consequently, it achieves an AreaXTime metric at most half that of Vitis.
As performance results for the BFS graph library kernel are not available, we also used Vitis HLS to synthesize a C++ implementation of the BFS algorithm loosely based on the one available in MachSuite [
48]. Note that we are using the common frontier-based implementation rather than the direction-optimizing implementation [
6] (also provided in GAPBS). BFS is the prototypical irregular kernel [
1] where the number of iterations for each thread can be highly unbalanced and requires critical sections to synchronize the different threads.
We tried different configurations for SPARTA, varying the number of cores and contexts, using the accelerator produced by Vitis HLS as a baseline.
Table 10 compares the accelerators produced by Vitis HLS and SPARTA without parallelization (one core and one context). There is only a small difference between the number of cycles for the two tools, while the area is more difficult to compare: Bambu does not require BRAMs for this design as it maps all logic onto LUTs and registers, and almost half of the LUTs and registers used are dedicated to the 16-node NoC (as reported in
Table 4).
Figure 14(a) displays the speedup achieved using different configurations of SPARTA with respect to the number of cycles required by Vitis HLS. In the sequential version, most of the time is spent waiting for the external memory to provide data in both the SPARTA and Vitis implementations. In the SPARTA parallel accelerators, a similar pattern arises when the number of contexts per core is small: each core spends most of the time waiting for new data to process. As we increase the number of contexts, each core spends a greater fraction of time doing useful computations. Even when using multiple contexts, BFS does not achieve the same speedup as more regular applications such as the ones described in
Section 4.1: during the exploration of the graph, each thread can find a different number of elements, leading to a high load imbalance. Furthermore, all threads must be periodically synchronized to avoid exploring the same node of the graph multiple times and to compute the correct length of the shortest path from the origin. Despite this, SPARTA can achieve good speedup when using four contexts per core: with four cores, it achieves a speedup of almost five times, and with two cores, a speedup of three times compared to the Vitis HLS baseline.
Figure 14(b) compares the area utilization of SPARTA and Vitis HLS. When adding a core, SPARTA replicates almost the entire design and adds the necessary communication and synchronization logic. Hence, resource utilization increases a little more than linearly: with four cores and one context, the area occupied is six times that of the baseline. Note that with 4 cores, the NoC is larger than the entire design generated by Vitis HLS. However, its size does not increase when more contexts per core are added.
4.5 LegUp
We also compare SPARTA with the published results obtained by LegUp [
15]. Specifically, we use the Mandelbrot kernel from the latest open-source version of LegUp. This kernel is the Mandelbrot iterative mathematical benchmark, which produces a fractal image and is parallelized with OpenMP for pragmas. The results obtained by LegUp on this benchmark are reported in
Table 11, as published in [
15]. SPARTA achieves better performance across all metrics when using the same Mandelbrot kernel on a Stratix IV device (see
Table 12). This performance advantage can be attributed to several factors, including SPARTA’s use of external memory for storing the Mandelbrot picture, its independence from a soft processor for OpenMP parallelization, and its simpler architecture.
Results from other benchmarks in [
15] would only further widen the performance gap between SPARTA and LegUp. These other benchmarks rely on Pthread primitives in addition to the OpenMP pragmas. Pthread primitives are partially implemented in software to control hardware threads, increasing the overhead introduced by the soft processor and affecting the overall execution.