
SPARTA: High-Level Synthesis of Parallel Multi-Threaded Accelerators

Published: 25 December 2024

Abstract

This article presents a methodology for the Synthesis of PARallel multi-Threaded Accelerators (SPARTA) from OpenMP annotated C/C++ specifications. SPARTA extends an open-source HLS tool, enabling the generation of accelerators that provide latency tolerance for irregular memory accesses through multithreading, support fine-grained memory-level parallelism through a hot-potato deflection-based network-on-chip (NoC), support synchronization constructs, and can instantiate memory-side caches. Our approach is based on a custom runtime OpenMP library, providing flexibility and extensibility. Experimental results show high scalability when synthesizing irregular graph kernels. The accelerators generated with our approach are, on average, 2.29\(\times\) faster than state-of-the-art HLS methodologies.

1 Introduction

Field Programmable Gate Arrays (FPGAs) are reconfigurable devices that can be programmed using a hardware description language (HDL), such as Verilog or VHDL, to implement specialized accelerators. FPGAs have become a key component of heterogeneous high-performance computing systems, often coupled to other types of accelerators, such as GPUs, to accelerate data-intensive tasks or parts of applications that require low latency processing. High-level synthesis (HLS) tools simplify the process of developing FPGA-based accelerators by automatically translating software descriptions (typically C or C++ code) into HDL. These tools promise to hide hardware-specific details from the user while still obtaining high quality of results (QoR) in terms of latency, area usage, and clock frequency [42].
However, existing HLS tools mainly target regular, arithmetic-intensive workloads operating on dense data structures typical of digital signal processing (DSP) applications, which present an abundance of instruction-level parallelism. Solutions that generate accelerators from parallel high-level descriptions leverage data-parallel programming models that rely on regular computation and easily partitionable data sets (e.g., OpenCL), as they mainly replicate identical computational units and do not well support synchronization-intensive programs [14]. Driven by the interest in big data and graph analytics, several approaches attempt to support the acceleration of parallel applications with irregular behaviour. These include solutions that target specific graph algorithms, approaches that try to optimize linear algebra operations (e.g., sparse matrix-vector and sparse matrix-matrix multiplication [17]), and frameworks based on the Gather-Apply-Scatter (GAS) paradigm [33]. These solutions are highly specialized: they focus on specific graph algorithms, or they only try to generate and optimize with HLS computational engines that need to be supported by a software framework; the actual execution of graph algorithms mapped on such accelerators might thus increase in computational complexity with respect to vertex- or edge-centered formulations.
The OpenMP programming model for shared memory systems has often been successfully used to design parallel irregular algorithms [7, 36, 38] thanks to its task-level and loop-level parallel constructs. Research on HLS previously explored the use of OpenMP annotations to synthesize applications with coarse-grained task-level parallelism [15, 47]; however, these approaches employ multi-level designs driven by a microcontroller and cannot manage complex synchronization constructs. Furthermore, they still focus on generating parallel accelerators from programs with regular behaviours. The Svelto methodology [39] uses OpenMP annotation as a starting point to synthesize parallel multi-threaded accelerators, targeting the generation of accelerators starting from specific patterns of parallel graph algorithms. In this article, we propose a more general methodology for the Synthesis of PARallel multi-threaded Accelerators (SPARTA), providing automated translation of parallel specifications written in OpenMP into efficient parallel accelerators.
SPARTA adopts a fundamentally different and more flexible approach than previous methodologies generating accelerators from OpenMP code. Instead of mapping OpenMP parallel regions marked by compiler directives (pragmas) to a pre-defined parallel architecture template, it translates pragmas into calls to OpenMP runtime primitives and implements corresponding low-level hardware components to be used in the synthesis backend. Accelerators generated with SPARTA can exploit both spatial and temporal parallelism, hiding the latency of external memory accesses through context switching (CS) across multiple execution threads. Moreover, SPARTA can instantiate an advanced hierarchical memory subsystem composed of a custom Network-on-Chip (NoC) that connects multiple external memory channels to each accelerator, memory-side caching, and on-chip private memories for each accelerator. These features allow SPARTA to deal with both regular and irregular behaviours in parallel algorithms and to generate efficient accelerators in terms of performance per area.
We evaluated SPARTA by synthesizing typical parallel graph algorithms, verifying the scalability and the QoR of the generated designs as the numbers of parallel accelerators and contexts change. We also compared accelerators for graph database queries generated with SPARTA to the ones generated with Svelto, obtaining, on average, a 2.29\(\times\) improvement in performance with similar architectural parameters (number of accelerators, contexts, and memory channels).
In summary, this article makes the following contributions:
A synthesis methodology to generate multi-threaded parallel accelerators (SPARTA) supporting OpenMP work-sharing and synchronization constructs through the implementation of low-level primitives from the OpenMP runtime system;
Support for both spatial and temporal parallelism, with multiple threads that can pause and resume execution on a core when needed;
A highly configurable memory system that mitigates the latency of external memory accesses.
The rest of the article is organized as follows: Section 2 reviews related work on the synthesis of FPGA accelerators from parallel programming languages. Section 3 describes the SPARTA synthesis flow, and Section 3.1 the generated architecture. Section 4 presents experimental results, and Section 5 concludes the article.

2 Related Work

SPARTA generates efficient, high-throughput parallel accelerators starting from OpenMP code. This approach is particularly suited to irregular parallel applications such as graph algorithms, thanks to the shared memory abstraction, the enhanced memory subsystem, and the combined use of spatial and temporal multithreading. The work touches on several areas related to the use of reconfigurable architectures and HLS methodologies for parallel applications.

2.1 HLS and Parallel Specifications

HLS is an established paradigm and an active area of research [16]; significant effort is dedicated to improving the QoR of generated accelerators, supporting flexible optimization and design space exploration, and introducing more and more system-level design features [9, 46, 50, 57]. Another important research topic is the synthesis of parallel accelerator designs starting from specifications written in user-friendly parallel programming application interfaces (APIs) such as OpenCL, CUDA, OpenMP, or Pthread.
Hardware vendors lead OpenCL efforts [32], but they mainly target data-parallel workloads with regular computation and often require significant restructuring of the code to match one of the supported parallel patterns. FCUDA [43] starts from kernels written using the CUDA API, designed to target NVIDIA GPUs, and presents similar issues. The LegUP HLS tool [10] exploits OpenMP and Pthread specifications to generate architectures built of custom accelerators driven by a general-purpose microcontroller, only supporting two levels of nested parallelism (the external level through Pthread and the internal one through OpenMP) [15]. BŁYSK [47] extracts OpenMP tasks and maps them onto parallel accelerators that can share resources, also leveraging a microcontroller to perform the actual task scheduling. The CHAT methodology [24] extends the ROCCC HLS tool [54] to generate temporally multithreaded accelerators starting from loop constructs described in C without any support for synchronization. The Nymble HLS compiler supports multiple execution contexts [27] in spatially and temporally multithreaded accelerators generated from OpenMP loops [51]; the accelerators, however, can only access private memory and thus cannot share data across threads nor synchronize.
Svelto [39] is the synthesis methodology nearest to SPARTA: it synthesizes spatially and temporally multithreaded accelerators starting from OpenMP code, parallel cores access a shared memory through a multi-ported memory controller, and it supports synchronization through atomic memory operations. The main limitation of Svelto is that it can only map a small subset of OpenMP constructs to an architectural template, limiting the set of supported OpenMP pragmas (e.g., atomic memory operations are supported, but critical sections and barrier synchronization are not). Moreover, supported parallel constructs become kernel functions that do not allow inter-functional optimizations, local memories within each parallel core are not supported, and the multi-ported memory controller in the template has limited scalability.
Table 1 summarizes the discussion in this section to highlight the differences between SPARTA and previous works. Looking at the supported input languages, we tried to identify how much a user would have to modify an input software implementation to use a certain tool: for example, FCUDA requires heavily reorganizing the code into data management and compute sections, Svelto requires the body of the OpenMP for loop to be outlined in a separate function, and SPARTA can process standard, unmodified OpenMP code. Methodologies that target a host+FPGA system use the processor to coordinate the execution of the accelerators, while SPARTA implements scheduling and synchronization within the accelerator itself. Finally, SPARTA supports different work-sharing and synchronization constructs, as well as nested parallelism; the addition of new constructs is simplified by using the low-level primitives of the OpenMP runtime.
Tool | Input language | Target architecture | Work-sharing | Synchronization | Nested parallelism
OpenCL [32] | OpenCL* | host \(+\) GPU/FPGA | - | - | -
FCUDA [43] | CUDA*** | FPGA | - | - | -
LegUp [15] | Pthreads \(+\) OpenMP** | host \(+\) FPGA | for, master | critical, atomic | no
BŁYSK [47] | OpenMP | host \(+\) FPGA | task | taskwait | no
CHAT [24] | C loops** | host \(+\) FPGA | - | - | -
Nymble [51] | OpenMP | host \(+\) FPGA | for | no | no
Svelto [39] | OpenMP* | FPGA | for | atomic | no
SPARTA (this work) | OpenMP | FPGA | for, sections | critical, barrier | yes
Table 1. Qualitative Comparison between SPARTA and State-of-the-Art Methodologies for the Synthesis of Parallel Specifications
Asterisks represent how much the input code needs to be modified with respect to a software implementation.

2.2 Accelerator Designs for Graph Algorithms

Graph processing is an important element of data analytics, and it presents unique challenges (large datasets, unpredictable fine-grained memory accesses, unbalanced task-level parallelism) that are not well addressed by commodity processors relying on advanced cache architectures, data locality, and regular computation patterns, and might thus benefit from domain-specific accelerators [22].
Alongside proposals for processing-in-memory [3], flash-based [29], ASIC [25], and ReRAM-based [52] accelerators, multiple designs have targeted FPGAs to implement graph processing primitives. The FPGA co-processors on the hybrid Convey HC systems have been used to accelerate breadth-first search (BFS) [8], while [58] exploits an FPGA-Hybrid Memory Cube platform. Optimized implementations of PageRank (PR) and random walk are proposed in [59] and [20], respectively. GraphOps [45] provides a series of building blocks to implement graph analytics applications following a dataflow approach. Several designs are based on arrays of parallel processing engines [8, 13, 55]; some of them support the implementation of vertex-centric [44] or edge-centric [60, 61] graph algorithms. ScalaGraph [56] proposes a tile-based architecture that improves the scalability of the system as the number of processing elements increases. GraphLily [26] supports graph algorithms by accelerating their linear algebra formulation (i.e., by implementing custom Sparse-Matrix-Vector and Sparse-Matrix-Matrix multiplications units) and [2] provides a template for a GAS accelerator, requiring users to rewrite their applications and pre-partition the data.
All these works propose fixed or templated designs available as register-transfer level components, manually developed either with HDL or through HLS. Specializing the HLS process for the acceleration of graphs is not considered, thus similar approaches can only provide a limited library of primitives, and input applications may need to be restructured to allow mapping the computation onto the accelerators.

2.3 Synthesis Methodologies for Graph Algorithms

ElasticFlow [53] extends HLS with a methodology for the synthesis of irregular loop nests into a pipelined architecture. The approach is thus well suited to graph kernels that often feature irregular memory accesses and imbalanced computation; however, it only uses small on-chip memories to store the graphs, and it does not support atomic memory operations, relying on a reorder buffer to provide consistency. Castellana et al. [11] start from OpenMP-annotated nested loops to synthesize graph database queries expressed as graph pattern matching routines. The methodology leverages a distributed controller to map loop iterations onto a parallel array of accelerators, and it was later extended with dynamic scheduling [40]. All these solutions exploit task parallelism only by replicating the accelerator datapaths with limited resource reuse. The aforementioned Svelto [39] methodology has been designed to support the synthesis of irregular parallel applications, in particular graph kernels and graph database queries.

2.4 Architectural Features

The unique features of graph applications require architectural innovations to allow efficient execution under latency, throughput, or power consumption constraints. Particular care must be dedicated to the design of memory hierarchies [5] and to the interconnections between memory and computational elements, as graphs are large data structures that cannot be easily partitioned in a way that balances the work for each partition and require fine-grained, unpredictable, data accesses with poor spatial and temporal locality.
In SPARTA, we use a unidirectional, bufferless, deflection-routed NoC to connect the different parallel components. The idea of bufferless deflection-routed NoCs (BLESS) was proposed in [41] to reduce energy consumption and chip area by eliminating the need for routing or flow control buffers. While deflection-based routing lowers the area and energy costs of the NoC, it might increase latency under high injection rates; however, SPARTA accelerator cores can tolerate a higher data access latency through asynchronous memory operations and CS. SPARTA uses a butterfly topology [18], where the last stage is connected to the first one (see Figure 6). While adopting a different topology, the bufferless, deflection-based approach of SPARTA’s NoC is also similar to the one used in the Data Vortex network [34]. The Data Vortex network was originally devised for optical network interconnects due to its bufferless nature and was later implemented as a conventional network interconnect for high-performance computing systems. Its ability to support high injection rates of fine-grained transactions (memory word size) makes it effective for irregular applications [21]. The design was later proposed as an intellectual property for a NoC [28], validated on FPGAs. Hoplite [30] is another noteworthy example of a bufferless deflection-routed NoC. Hoplite uses a torus topology with a very simple node switch implementation, similar to SPARTA’s; however, SPARTA’s butterfly topology allows logarithmic hop counts. Kapre and Gray [31] thoroughly examine the cost of Hoplite’s switches, showing that the deflection routing algorithm can reach high clock frequencies while utilizing a minimal amount of FPGA fabric resources. Malik and Kapre [37] discuss an enhanced version of Hoplite with a butterfly topology; in that design, the parallel components are connected to the lower level of the network, while SPARTA can inject/receive requests through any of the network switches.

3 SPARTA Synthesis Flow

The SPARTA design flow automatically generates parallel multi-threaded accelerators starting from OpenMP code through an enhanced HLS flow that we implemented within the open-source Bambu tool [19] (Figure 1). The input of the SPARTA synthesis methodology is a C/C++ description of a parallel application annotated with OpenMP directives, such as the one shown in Figure 2(a). SPARTA includes a frontend derived from Clang, which parses the OpenMP annotations and performs initial HLS optimizations. OpenMP directives can be used to define parallel regions and shared data access policies, with many different work-sharing constructs (e.g., parallel for, sections) and synchronization mechanisms [12]. It is also possible to directly pass to our enhanced Bambu flow the LLVM IR exposing the OpenMP runtime calls, allowing SPARTA to process specifications generated through other frontends that support OpenMP (e.g., Flang and other domain-specific languages that leverage the OpenMP MLIR dialect).
Fig. 1. Enhanced SPARTA-Bambu HLS flow.
Fig. 2. Translation of OpenMP pragmas into calls to the SPARTA library functions.
A conventional compiler flow for parallel architectures translates an OpenMP-annotated description into a parallel application through several transformation passes. We will illustrate the transformations performed within the Clang LLVM compiler since we employ it as a frontend for the SPARTA methodology starting from C/C++ code. However, as previously highlighted, our approach is adaptable to other compiler infrastructures that support an OpenMP runtime library [35]. Using a conventional OpenMP parallelization flow through Clang/LLVM greatly simplifies introducing support for new constructs that might be added to the OpenMP standard.
We can split the parallelization process into three phases: pre-processing of compiler directives, code transformations, and parallel API library linking. Pre-processing of compiler directives allows gathering information on how the input program should be parallelized, i.e., which code sections can be executed in parallel, how to distribute work across threads, data allocation policies, and more. After pre-processing, the compiler generates an intermediate representation (IR), instruments it with calls to OpenMP runtime functions, and links the OpenMP library, connecting each function call to its target-specific implementation.
In the SPARTA synthesis flow, the Clang-based Bambu frontend compiles the input description following the same steps, except that in the final phase, it maps calls to the OpenMP runtime onto corresponding calls to our custom HLS library. The SPARTA library in Bambu contains optimized Verilog primitives to handle synchronization, concurrency, and memory operations in a parallel architecture, functionally equivalent to the OpenMP runtime library primitives. Starting from the LLVM OpenMP runtime library, we streamlined and reduced it to its essential functionality to generate small and fast components suitable for synthesis on FPGA. For example, we removed the logic that supports the reuse of threads, because hardware threads cannot be reassigned to other cores that require a different datapath and controller. OpenMP constructs that can be replaced with arithmetic and logic operations (e.g., those that compute the range of loop iterations assigned to a thread) are implemented with standard functional units.
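As an illustration of such work-sharing arithmetic, the following sketch (ours, not code from the SPARTA library) computes the iteration range a thread executes under a block-static schedule:

```c
/* Hypothetical helper: derive the half-open iteration range [lo, hi) that the
 * thread with ID tid executes under a block-static schedule. SPARTA maps this
 * kind of computation to ordinary functional units instead of runtime calls. */
static void static_bounds(int tid, int nthreads, int n, int *lo, int *hi)
{
    int chunk = (n + nthreads - 1) / nthreads;   /* ceil(n / nthreads) */
    int lo_ = tid * chunk;
    int hi_ = lo_ + chunk;
    *lo = lo_ < n ? lo_ : n;                     /* clamp empty trailing ranges */
    *hi = hi_ < n ? hi_ : n;                     /* clamp the last chunk */
}
```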
Figure 2 shows how the SPARTA frontend translates OpenMP-annotated code (Figure 2(a)) into an IR with calls to the SPARTA library functions (Figure 2(b)). The code in Figure 2(a) denotes a parallel region with M threads where each thread executes an iteration of a for loop. The IR in Figure 2(b) shows how the transformations outline the loop body into a separate function (omp\(\_\)outlined) and modify the OpenMP runtime library calls. The new function signature contains all parameters needed to execute an iteration of the loop: the thread ID and references to the arrays A, B, and res. The function body is instrumented with API calls to the SPARTA Bambu library functions that retrieve the thread ID and synchronize with other threads.
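Since Figure 2 is not reproduced here, the following sketch (ours; the loop body and the value of M are illustrative) gives the flavour of such an input and of the outlining it undergoes:

```c
#define M 8  /* number of threads in the parallel region (illustrative) */

void kernel(const int *A, const int *B, int *res)
{
    /* Parallel region with M threads: each thread executes one iteration. */
    #pragma omp parallel for num_threads(M)
    for (int i = 0; i < M; i++)
        res[i] = A[i] + B[i];
}

/* The frontend outlines the loop body into a function roughly of the form
 *   omp_outlined(thread_id, A, B, res)
 * and instruments it with SPARTA library calls that retrieve the thread ID
 * and synchronize with the other threads at the implicit barrier. */
```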
The instrumented IR containing the API calls that control the parallel execution is subsequently optimized with hardware-oriented code transformations that facilitate the generation of the parallel architecture in the HLS backend.

3.1 The SPARTA Architecture

SPARTA generates a modular architecture following the content of the input OpenMP specification that can be customized through user-defined parameters. The main components are depicted in Figure 3: sequential parts of the computation go through the standard HLS flow, while parallel constructs are translated into a set of parallel processing units called cores. Each core contains a finite state machine with datapath (FSMD) implementing the computation described within an OpenMP parallel region (e.g., the loop body in an omp parallel for loop), a CS manager in case multiple threads are assigned to the same core, and local memory for data belonging to those threads. The number of cores and the number of threads (contexts) assigned to each core are configurable. The parallel region of the accelerator can also contain synchronization components, arbiters, and data shared among threads if needed. The accelerator is connected to a NoC that relays messages to an external memory (with support for multi-banked or pipelined memories); the number of channels between the accelerator and the NoC, as well as between the NoC and the external memory, is configurable. The user encodes information about parallelism, work-sharing policies, and synchronization requirements through the OpenMP annotations in the code and configures the rest of the architecture through command-line options.
Fig. 3. SPARTA accelerator architecture.

3.1.1 Spatial and Temporal Multithreading.

Cores in a SPARTA accelerator execute concurrently, using the FPGA resources to exploit coarse-grained parallelism; instruction-level parallelism is extracted by the HLS tool during the implementation of the FSMD within each core. SPARTA accelerators can also support temporal multithreading by interleaving the execution of multiple threads on the same core. Instead of stalling execution and leaving the core underutilized when a thread needs to perform long-latency operations (typically external memory accesses or synchronization, which might take tens of clock cycles), the core can perform a CS and execute a different thread with the same datapath, e.g., a different iteration of the same omp for loop.
Each core with CS enabled features a CS manager module to handle the active context and switch to a new one when a memory or synchronization operation starts. For example, on a read/write request to the external memory, the CS manager saves the active context (i.e., the thread state) and starts the execution of a different context in the next clock cycle. Once the core receives the memory response, the CS manager flags the previously paused context as ready and eventually resumes its execution when the currently active context is paused or completed. The CS operation has no overhead: in the clock cycle before the switch, the core executes normally, and at the start of the next clock cycle, the new context restarts from the state where it should have been after it was suspended. The CS manager always performs a switch when the active context starts a wait synchronization operation. This avoids deadlocks when the waited thread executes on the same core as the waiting thread, since the CS manager can flag the waiting thread as ready when the response from the waited thread is received. The CS manager also has a set of registers for each context to store the signals received while threads are not ready. When responses arrive at a core, the thread in execution is generally not the one that performed the request: the value is saved and forwarded to the thread that originated the request when its execution is resumed. In this way, the CS functionality is entirely transparent to the rest of the accelerator and does not force other components to send signals multiple times or stall their execution.
In the core datapath, registers store intermediate signals between combinational regions; when many contexts share the same core, it is necessary to store a signal for each context. Each context thus needs its own set of registers and a control signal to select the active set of input and output registers; the resulting register set with the context selector control signal is commonly referred to as a register file. Since register files can consume considerably more resources than standard registers, it is fundamental to avoid them when possible. In the SPARTA architecture, CS can happen only during memory accesses or synchronization calls, which are known at compile time: the synthesis flow exploits this information to replace registers with register files only within control states where CS can happen, while other registers are left unchanged. As an example, consider a core executing a floating-point add operation: two values are loaded from memory, the sum of the two values is computed, and the result is stored back in memory. Load and store operations can trigger a CS to hide the memory access latency; thus, registers connected to load and store operations in this core must be replaced with register files to save signal states and later restore them. Instead, the floating-point sum can never be interrupted by a CS, and thus none of the registers in the floating-point adder logic needs to be replaced with a register file.

3.2 Thread Synchronization

Three synchronization modules represent a key component of the SPARTA architecture: the barrier manager, the reduction manager, and the critical manager. They are only included in the parallel region if they are needed, and their size is proportional to the number of cores that use them.
Barriers are added by the OpenMP runtime at the end of a parallel region to ensure that the execution can proceed safely if the threads have different execution times. In the OpenMP paradigm, when a thread encounters a barrier, it notifies all other threads in the same team that it has reached the barrier and waits until all other threads also signal that they have reached the end of their execution. To implement barrier synchronization, the SPARTA architecture includes local components for each core that generate completion signals and a centralized barrier manager that collects completion signals, updates a synchronization register, and, when all required threads have reached the barrier, sends them a command to resume execution. Sharing a unique barrier manager among all parallel cores ensures that synchronization happens safely without the possibility of race conditions. Furthermore, the same barrier manager component can also be used to implement the wait_thread OpenMP operation, which waits for the completion of one specific thread without adding more logic or duplicating the synchronization state.
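For reference, a minimal sketch of code that exercises this mechanism (ours; the two phases are illustrative):

```c
#include <omp.h>

void two_phase(int *tmp, int *out, int n)
{
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        /* Phase 1: each thread fills its slice of tmp. */
        for (int i = tid; i < n; i += 4)
            tmp[i] = i;
        /* No thread starts phase 2 until every thread has finished phase 1:
         * each local component signals completion to the barrier manager,
         * which releases all threads once the last signal arrives. */
        #pragma omp barrier
        /* Phase 2: consume values produced by other threads. */
        for (int i = tid; i < n; i += 4)
            out[i] = tmp[(i + 1) % n];
    }
}
```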
In OpenMP reduction operations, the thread with ID 0 needs to retrieve partial reduction variables from all other threads and combine them. Each thread on a SPARTA core participating in a reduction communicates with the reduction manager to indicate the address of its reduction variable in its local memory; thread 0 then accesses the reduction manager to retrieve the addresses of all partial results it needs to merge. In fact, as will be explained in Section 3.3, a thread can access the local memory of another thread if it knows the correct address. Such a scheme could have also been implemented by storing the addresses of partial reductions in the external shared memory but with the considerable latency overhead of additional memory requests. Instead, the SPARTA reduction manager implemented inside the accelerator allows one-cycle access to the information required to perform the reduction.
We chose to implement the reduction following a linear pattern: after thread 0 finishes its part of the computation, it collects all the values computed by the others and merges the results. Other reduction schemes may result in a different tradeoff between area occupation and speed; for example, using more threads to merge the results is possible at the cost of duplicating more logic. However, a linear reduction often requires five or six cycles per thread to be merged, which is negligible compared with the duration of the computation. For this reason, when multiple contexts can be activated, the policy of the CS manager is to choose the thread with the lowest thread ID: this gives priority to thread 0, which is the one that could perform a reduction operation, and then to the other threads in order, since the reduction proceeds following the thread IDs.
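A hedged example of the reduction pattern described above (the kernel is illustrative, not taken from the article):

```c
/* Each thread accumulates a private copy of `sum` in its local memory and
 * publishes its address to the reduction manager; thread 0 then reads the
 * partial values and merges them linearly, a few cycles per thread. */
long degree_sum(const int *degree, int n)
{
    long sum = 0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++)
        sum += degree[i];
    return sum;
}
```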
In the OpenMP paradigm, the critical pragma is used to identify a set of instructions that can only be executed by a single thread at once. SPARTA adds a critical manager that behaves like a mutex and a local component for each thread that encounters the critical section. When a thread reaches the critical section, the local component sends a signal to request control of the lock, and the critical manager grants it only if it is not currently assigned to another thread. If the lock is not acquired by the current thread, the CS manager performs a CS. The critical manager stores the state of the requests and the current lock owner and, like the CS manager, prioritizes the thread with a lower ID when multiple threads require access. When the thread finishes the execution of the critical section, it sends a signal to release the acquired lock to avoid deadlock and allow other threads to resume their execution.
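A corresponding sketch for the critical construct (again illustrative):

```c
/* Only one thread at a time may update the shared histogram; a thread that
 * does not obtain the lock from the critical manager is context-switched out
 * instead of busy-waiting. */
void label_histogram(const int *label, int *hist, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical
        hist[label[i]]++;
    }
}
```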

3.3 Memory Hierarchy

SPARTA exploits a hierarchical memory model that allows reducing the number of requests to the external memory and keeping data inside the accelerator as much as possible. There are three possibilities to store variables, visible in Figure 3: external memory outside the accelerator (high-latency memory), block RAM (BRAM) modules in the FPGA fabric inside the accelerator that are accessible by all cores, and private BRAM modules inside parallel cores. External memory is usually suitable for input and output parameters that cannot fit into on-chip BRAMs, while variables that are defined and used during the computation are better allocated inside the accelerator, whether in shared or private memory blocks, to lower access latency and improve performance.
The various levels of the memory hierarchy are connected through an ad hoc bus infrastructure, depicted in Figure 4, that can handle multiple concurrent memory requests. The local bus has one channel and connects cores within the same parallel region, while the main accelerator bus connects parallel regions with the on-chip shared memory and with the NoC toward the external memory. Each memory request originating from one of the parallel cores goes through a filter; the filter interprets its address and properly routes the request to internal BRAMs of other cores through the local bus or outside the parallel region through the main bus. Since BRAMs have two channels, local memories can support concurrent access from both the owner core and the local bus. SPARTA can connect the computation units directly to the local memory without using the local bus if it can infer at compile time the address of the requests and the variable is allocated inside the same core.
Fig. 4. Paths of memory requests from Core 0 to (1) the local memory of Core N, (2) on-chip shared memory, and (3) external memory.
SPARTA memory addresses encode the physical location of the memory module where data is allocated; Figure 5 shows the structure of external and internal memory addresses, which have the same length because they travel on the same local bus. The size of the address depends on the external memory size (which is likely considerably larger than internal memory) plus one bit that is added to differentiate between internal and external addresses. This allows a very simple decoding logic: if the external bit is equal to 1, the request is propagated to the external memory. For internal memory, a dedicated paging algorithm allocates each memory variable considering the parallel region and thread it belongs to. The address of internal variables is composed of three parts, representing the address of the variable in the memory module, the thread that owns the variable, and the parallel region in which the memory module is allocated, respectively. To build the address, SPARTA first finds the largest internal memory module and computes the number of bits needed to represent it, so this number of bits can be reserved to uniquely identify a variable inside each memory module. Similarly, the size of the second part of the address is set by looking at the largest number of threads assigned to the different parallel regions. Finally, a unique code is assigned to each parallel region, which is used by the filters to determine whether an address is local to the parallel region. Looking at the appropriate bits of the address, filters and arbiters in the memory interconnection system can properly forward a memory request to a different bus or component using combinatorial logic without adding delays to the requests. For example, if the parallel region ID in an address matches the one stored in the filter, the request is forwarded to the local bus, where the arbiter will find the correct core by looking at the thread ID. Instead, if the parallel region ID does not match or if the address has the external bit set to one, the request is sent to the main bus.
Fig. 5. Memory address format for an architecture with P parallel regions, T threads per region, X bytes of local memory per thread, and Y bits of external memory. The external bit is set if the addressed item is allocated outside the accelerator.
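A sketch of how such an address could be assembled and decoded; the field widths and helper names are our assumptions, while SPARTA derives the widths automatically as described above:

```c
#include <stdint.h>

typedef struct {
    unsigned offset_bits;  /* bits for the largest internal memory module   */
    unsigned thread_bits;  /* bits for the largest thread count in a region */
    unsigned region_bits;  /* bits for the number of parallel regions       */
} addr_layout;

/* Internal address, MSB to LSB: external bit = 0 | region ID | thread ID | offset. */
static uint64_t pack_internal(addr_layout l, uint64_t region,
                              uint64_t thread, uint64_t offset)
{
    return (region << (l.thread_bits + l.offset_bits))
         | (thread << l.offset_bits)
         | offset;
}

/* The filter only needs combinatorial bit tests to route a request. */
static int is_external(addr_layout l, uint64_t addr)
{
    return (addr >> (l.region_bits + l.thread_bits + l.offset_bits)) & 1u;
}
```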
If two requests arrive in the same cycle, the bus arbiter can only accept one of them, and the other must be stalled; in this case, SPARTA generates a back pressure signal between arbiters and cores to resend requests that could not be accepted in the current cycle. With only one context in execution on the core, execution has to be stalled as it needs the result of the memory operation to proceed, but when multiple contexts are assigned to the same core, another context can be executed in the meantime. A First-In-First-Out (FIFO) queue holds requests before the memory filter so that the CS manager can switch the running context at the start of each memory request instead of when an arbiter accepts it; this prevents the effects of the arbitration from having repercussions on the core execution, as the FIFO is responsible for resending the requests to the arbiter while the core can continue its operation with another context.

3.4 External Memory Interface

The parallel accelerator generated by SPARTA is connected to an external memory module with a fixed number of memory channels. Thus, memory requests need arbitration, as they can originate from multiple sources (i.e., the parallel cores) and be directed to multiple destinations. Instead of using dedicated hardware modules or another arbiter, we introduced a custom NoC interconnect that allows maintaining a short clock period. The NoC interconnect is composed of two unidirectional NoC components (Figure 6), one for the requests from the cores to the memory and the other for the responses on the opposite path.
Fig. 6. The architecture of a unidirectional butterfly NoC component with 24 nodes organized in 3 levels.
The proposed NoC consists of a group of interconnected nodes following a butterfly structure, organized in \(n\) levels where each level has a number of nodes equal to \(2^{n}\), for a total number of nodes equal to \(n*2^{n}\). The number of nodes in the NoC must be at least equal to the maximum between the number of banks in the external memory and the number of cores in the parallel region (destination and source of the NoC, respectively). Each node in the NoC can be configured to be registered or combinatorial, providing a tradeoff between the frequency of the NoC and the number of cycles needed to traverse it: in fact, combinatorial nodes produce output values in the same cycle, increasing the critical path, while registered nodes introduce a delay. Nodes connected to external components are called I/O nodes and must be registered to avoid latches; for the rest of the nodes, the choice to use combinatorial nodes or have only registered ones is left to the user.
The number of cycles needed to traverse the NoC is equal to the number of registered nodes on the path of the request and, at most, is equal to the number of levels. To minimize the average travel time inside the NoC, SPARTA organises the registered nodes by filling entire levels before moving to the next one. For example, consider a NoC connecting 16 memory banks to 16 parallel cores: the number of levels must be 3 (as \(2*2^{2}=8 {\lt} 16\)), but if we choose to use combinatorial nodes, it can be traversed in 2 cycles as only the 16 I/O nodes must be registered and they can be arranged in the first two levels. The NoC is resilient to congestion: if two requests want to go to the same node, there is no need to stall, as the node can redirect one of them on another route to the same destination. The NoC can thus accept multiple requests: each node has two internal input and two internal output connections, and an I/O node can take a request if one of its internal inputs is not active in the current cycle. Registered nodes can also store a request, while combinatorial nodes forward requests immediately and do not store information. If both internal input connections are active, the I/O node of the NoC sends a back pressure signal to resend the request in the next cycle. When using multiple contexts, back pressure signals do not interfere with the execution of the cores as they are intercepted in the FIFO modules, which take the role of resending the requests, as described in Section 3.3.
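The sizing rule above can be summarized with a small helper (ours, for illustration only):

```c
/* Smallest number of butterfly levels n such that the total node count
 * n * 2^n covers max(memory banks, parallel cores); with 16 banks and 16
 * cores this returns 3 (24 nodes), matching the example in the text. */
static int noc_levels(int banks, int cores)
{
    int endpoints = banks > cores ? banks : cores;
    int n = 1;
    while (n * (1 << n) < endpoints)
        n++;
    return n;
}
```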
The SPARTA architecture allows the accelerator to connect to different types of external memory, with simple or pipelined memory access, and for both unified and banked multi-channel memories. The accelerator does not assume that the memory is immediately ready to accept a request and uses a protocol with back pressure that can be used to synchronize with it. Furthermore, SPARTA cores work under the assumption that answers from the memory could come out of order, as traffic in the NoC or the external memory could change the relative order of requests. Each request is tagged with an ID used to identify the thread that performed it and correctly forward the answer. If the external memory is banked, it is possible to add one cache in front of each bank to further reduce the average latency of the memory accesses. (There is no need to introduce a cache consistency mechanism, as each cache has its own set of addresses associated with a different memory bank.)
We introduced a set of custom pragmas and parameters to specify information about the accelerator interfaces for variables that will be stored in external memory and to model their behaviour during the simulation of the generated accelerator (Table 3). Specific variables can be allocated to specific banks to mimic access patterns of real memories. For example, #pragma HLS\(\_\)interface a banked bank\(\_\)number = 16 bank\(\_\)chunk = 256 specifies that variable a needs to be stored in a multi-banked external memory with 16 banks of 256 bytes each.
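As a usage sketch (kernel and variable names are ours; only the banked pragma quoted above is taken from the article, and its placement next to the kernel declaration is an assumption about the tool's conventions):

```c
/* Store the CSR row-pointer array of the input graph in a banked external
 * memory with 16 banks of 256 bytes each. */
#pragma HLS_interface row_ptr banked bank_number = 16 bank_chunk = 256
void bfs_kernel(const int *row_ptr, const int *col_idx, int *dist, int n);
```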
SPARTA accelerators, by default, exchange data with other accelerators and the external memory using a simple and efficient protocol called minimal memory interface, described in Table 2. The protocol requires a limited number of signals and logic elements to perform memory operations, reducing the size of internal components (particularly the NoC). However, SPARTA also supports more complex protocols such as AXI4 [4] by adding a bridge component before each memory bank that translates each request and response from the minimal protocol to AXI4. The translation is completely transparent to the accelerator, correctly propagating back pressure signals from and to the external memory.
Signal | Meaning
Mout_oe_ram | Read operation (1 bit)
Mout_we_ram | Write operation (1 bit)
Mout_addr_ram | Address of the memory operation
Mout_Wdata_ram | Value to write
Mout_data_ram_size | Size in bits of the value to write
Mout_accelerator_context | Tag of the memory operation (parallel region ID and context ID)
Mout_back_pressure | The source has to resend the request (1 bit)
Table 2. Minimal Memory Interface Protocol
Parameter | Usage
Banked | Specify that the memory is banked
Bank number | Specify the number of banks
Bank allocation | Force a variable to be allocated on a specified group of banks
Bank chunk | Specify the size in bytes of the bank blocks
Cache | Add caches to the accelerator
Cache size | Specify the size of the cache word
Cache depth | Specify the number of cache lines
AXI | Specify that the memory protocol is AXI4
Table 3. SPARTA Parameters Used to Describe External Memory Interfaces

3.5 Hardware-Oriented Transformations

We describe here two key transformations of the SPARTA methodology that reduce the impact of parallelization on resource utilization: thread ID propagation and parallel API pointer promotion.
Thread ID propagation. SPARTA generates an array of cores by replicating the datapaths of omp\(\_\)outlined functions corresponding to the content of an OpenMP parallel region. While simply replicating the whole datapath provides a functional design, considering which parts are effectively used at runtime can reduce resource utilization. Because thread IDs are statically assigned to each core, our approach can use them to determine the resources of the datapath that each thread actually uses. In fact, in the OpenMP work-sharing model, it is possible to differentiate the computation assigned to threads in the same group, but only depending on the thread ID. Bambu features several code analyses and transformations that SPARTA can exploit to statically propagate thread IDs through the replicated datapaths and to automatically perform dead code elimination, leading to the generation of the minimum set of resources required for each of the cores. Static thread ID propagation also avoids the replication of sections of the datapath only used by one thread, such as resources that execute reduction operations.
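A simplified illustration of the effect (sketch; function and variable names are ours):

```c
#define M 4  /* replicated cores / threads (illustrative) */

static int local_work(int tid, int n) { return tid * n; }  /* stand-in body */

void outlined_sketch(int tid, int *partial, int *result, int n)
{
    partial[tid] = local_work(tid, n);  /* kept in every replicated core */
    if (tid == 0) {
        /* After the statically known thread ID is propagated into each core,
         * dead code elimination removes this merge branch from cores 1..M-1,
         * so only core 0 pays for the extra loads and the adder. */
        int sum = 0;
        for (int t = 0; t < M; t++)
            sum += partial[t];
        *result = sum;
    }
}
```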
Pointer promotion. The OpenMP API makes extensive use of pointers and multiple levels of indirection to exchange references to internal data structures and parameters between threads; such a programming model is generic and versatile, but it introduces overhead for memory accesses and pointer resolution. In fact, memory operations on FPGA typically need to access external memories and are thus significantly more expensive in terms of latency and energy with respect to arithmetic operations. Whenever possible, SPARTA leverages static resolution of multiple indirections to reduce the number of memory operations. Because the input application and the architectural parameters (e.g., number of contexts per core and memory architecture) are fixed, the SPARTA frontend can apply function versioning for each call to functions of the parallel API. After versioning, each function represents an instance of a hardware component, and static analysis can be performed on the IR to infer dynamically allocated data structures and memory accesses: simple and multiple indirections are solved with pointer promotion, and statically allocated data structures enable further optimizations during the HLS flow. Once parallel API versioning is completed, it is also possible to perform a more detailed alias analysis on pointers to understand which memory structures are accessed by each operation. In this way, it is possible to pinpoint shared memory variables whose location will be shared at runtime between threads and allocate them in shared BRAMs, while the remaining private memory variables are allocated in the local memories.
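A before/after sketch of the effect of promotion (the struct layout and names are assumptions, not the actual OpenMP runtime data structures):

```c
/* Before promotion: shared variables reach the outlined function through a
 * struct of pointers, so reading the loop bound costs dependent memory
 * accesses that would become external memory operations on the FPGA. */
struct shared_args { int *bound; int *data; };

static int read_bound_indirect(struct shared_args *ctx)
{
    return *ctx->bound;        /* pointer chase resolved at runtime */
}

/* After function versioning and pointer promotion, the indirection is
 * resolved statically and the bound becomes a scalar that the HLS flow can
 * keep in a register or fold into a constant. */
static int read_bound_promoted(int bound)
{
    return bound;
}
```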

4 Experimental Evaluation

We showcase the results that can be achieved with SPARTA through four experiments: the first consists of synthesizing three graph processing benchmarks to show speedup and scalability while varying the number of cores and contexts (Section 4.1), the second is a direct comparison between SPARTA and Svelto [39] on parallel database queries (Section 4.3), the third is a comparison between SPARTA accelerators and kernels from the graph acceleration library of the commercial tool Vitis HLS (Section 4.4), and the last one (Section 4.5) is a comparison with LegUp [15]. We also include in Section 4.2 a focus on the effect of different NoC configurations and back pressure conditions on the performance of the Graph Algorithm Platform Benchmark Suite (GAPBS) accelerators.
We use a custom version of Bambu HLS to run the experiments, which is available as an executable binary.1 The following considerations are common to all tests: we assume the accelerators are connected to an external memory with 16 banks and one channel per bank, and the memory delay is assumed to be constant. We indicate with cores the number of parallel accelerator cores and with C the number of contexts on each core. We report area utilization in terms of lookup tables (LUTs) and registers. SPARTA components also require one DSP block for each core and one BRAM for every two contexts if reductions are present; Table 4 reports the area utilization of the main architecture components to isolate the contribution of components that do not depend on the input application. Most of the designs meet the imposed timing constraint after place-and-route, except for the largest configurations, where Vivado (the backend tool for AMD/Xilinx FPGAs) was not able to achieve the requested clock period after the placement phase due to high utilization of the device resources.
Component name | LUTs | Registers | BRAMs
NoC (16/24 nodes) | 6,543 | 6,274 | 0
NoC (32/64 nodes) | 12,051 | 10,048 | 0
CS manager (2 C) | 122 | 109 | 0
CS manager (4 C) | 254 | 182 | 0
CS manager (8 C) | 527 | 327 | 0
Cache (32 kB) | 2,844 | 755 | 15
MinimalAXI4Adapter | 46 | 161 | 0
Table 4. Resource Consumption of the Main SPARTA Architecture Components
For the NoC, we report the number of I/O nodes and the total number of nodes.

4.1 GAP Benchmark Suite

The GAPBS [7] is a collection of benchmarks intended to standardize evaluation for graph research, and it uses OpenMP to parallelize execution. We tested SPARTA on a subset of GAPBS benchmarks, excluding the ones that use OpenMP primitives that have not yet been implemented in SPARTA. The target is an AMD/Xilinx Alveo U280 board with a clock frequency of 250 MHz. In all experiments, the input graph, in Compressed Sparse Row format [49], is assumed to be stored outside the FPGA in an external DRAM memory.
The connected components (CC) benchmark finds the largest connected subgraph in the undirected input graph; we synthesized it with configurations that vary from one to eight cores and from one to eight contexts per core. With a larger number of cores/contexts, the utilization of the memory channels saturates and there can be no further improvement, as the primary source of latency for the CC application is the access to the external memory. As can be seen in Figure 7(a), the increase in performance over a sequential version of the application (one core, one context) grows linearly with the number of cores, keeping the number of contexts constant; in fact, having more cores that perform memory requests increases the utilization of the memory channels and the overall performance. We can see the same effect when adding more contexts on the same core to hide the memory latency: the configuration with eight contexts on one core has the same speedup as the one with eight cores and a single context. The configuration with eight cores and eight contexts is 36 times faster than the sequential version; this is lower than the theoretical speedup (64\(\times\) with 64 threads) because the memory access pattern is irregular, bringing multiple requests to the same memory bank at the same time and thus producing delays and congestion on the NoC. LUTs and registers scale differently with the number of cores/contexts, as can be seen in Figure 7(b): the number of LUTs grows with the number of OpenMP threads (e.g., 7\(\times\) increase with respect to the sequential version with eight cores, eight contexts), while liveness analysis during the synthesis flow identifies the minimum number of registers to be replicated for CS, so the number of registers is mostly linked to the number of cores (4\(\times\) more than in the sequential version).
Fig. 7. CC benchmark synthesized varying the number of cores and contexts.
The second GAPBS kernel we evaluated is PR, which assigns a value to each graph node based on its relative importance compared to the other nodes. Figure 8(a) shows that the speedup over a sequential baseline increases proportionally with the number of cores, similar to the CC benchmark. However, in this case, increasing the number of contexts per core from four to eight does not lead to the same performance improvement because some sections of PR are computation-bound rather than memory-bound, leading to competition for parallel cores rather than stalls due to memory access operations. With the largest configuration, we achieve a 30\(\times\) speedup with 6.5\(\times\) the area occupation (considering LUTs; register occupation only increases 4.5 times).
Fig. 8. Benchmark synthesized varying the number of cores and contexts.
The last GAPBS benchmark we considered is Triangle Count (TC), which counts the number of triangles in the input graph. After assessing how performance and resource consumption scale with varying numbers of cores and contexts (Figure 9) on the Alveo target, we used this benchmark to perform a first comparison against the Svelto synthesis methodology [39]. There are a few differences between the two methods that need to be considered. Svelto is based on a gcc/g++ frontend, while SPARTA uses Clang; the crossbar used in Svelto behaves differently from the NoC used in SPARTA, and the comparison is further complicated by the fact that Svelto uses atomic-based synchronization while SPARTA implements OpenMP reductions. They also operate under different assumptions regarding the behaviour of the memory (i.e., SPARTA allows the use of asynchronous external memories that can accept multiple requests at the same time, while Svelto only sends one request at a time), and Svelto dynamically assigns loop iterations to the cores while SPARTA uses static scheduling. Nevertheless, Svelto remains the state-of-the-art methodology closest to SPARTA, as described in Section 2.1, and it is worthwhile to examine the QoR of accelerators generated with the two methods. Furthermore, since both use Bambu as the HLS backend, we can safely state that the comparison only refers to the differences in the implementation of parallel accelerators rather than being influenced by different choices in the scheduling, allocation, and code generation steps.
Fig. 9. TC benchmark synthesized varying the number of cores and contexts.
The results in Figure 10(a) show the relative performance of SPARTA against the Svelto baseline with both tools targeting a Virtex 7 FPGA at 100 MHz; bars higher than one thus represent configurations where SPARTA outperforms Svelto. While Svelto is faster on smaller configurations, SPARTA can achieve a higher level of parallelization as the number of cores and contexts grows. There are two main reasons for this: the first one is that the NoC in SPARTA is designed to be effective when there are many requests from different sources, which is achieved when the number of cores is higher; the second reason lies in the ability of SPARTA to allow multiple memory requests simultaneously, which is critical to exploit memory-level parallelism when multiple memory channels are available. In fact, SPARTA adds a tag to external memory requests, allowing it to send multiple pipelined, asynchronous requests, while Svelto must wait for the memory to answer before performing the next request. With a number of OpenMP threads greater than the number of banks in the external memory, SPARTA can execute more requests in parallel than Svelto, resulting in a higher speedup. Regarding area usage, as can be seen in the relative comparisons of Figure 10(b) and (c), SPARTA consumes more resources than Svelto on the TC benchmark. However, the area-delay product (ADP) of SPARTA scales better with the number of cores/contexts: using the formula
\begin{align*}\frac{ADP_{SPARTA}}{ADP_{Svelto}}=\frac{LUTs_{SPARTA}*Cycles_{SPARTA}}{LUTs_{Svelto}*Cycles_{Svelto}},\end{align*}
we can see that, for example, the ADP of a SPARTA accelerator with 1 core and 1 context is 1.14 times the ADP of a Svelto accelerator with the same configuration, and the ratio drops to 0.47 with 32 cores and 8 contexts (SPARTA is 3.8 times faster using only 1.7 times the LUTs and 2.4 times the registers). One reason for the higher resource consumption is that SPARTA is more general than Svelto: SPARTA explicitly implements multiple OpenMP primitives and synchronization mechanisms, while Svelto only supports a few patterns and can thus make assumptions on the behaviour of the input program that simplify the underlying architecture. Moreover, SPARTA implements reduction operations that require fewer cycles and memory operations but more area than the atomic operations implemented in Svelto. Finally, Svelto has a centralized dispatcher that distributes loop iterations to the cores, while each core in SPARTA contains dedicated logic to compute the current iteration index.
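As a sanity check, plugging the rounded figures quoted above for the 32-core, 8-context configuration into the formula gives
\begin{align*}\frac{ADP_{SPARTA}}{ADP_{Svelto}}\approx\frac{1.7\,LUTs_{Svelto}*\frac{Cycles_{Svelto}}{3.8}}{LUTs_{Svelto}*Cycles_{Svelto}}=\frac{1.7}{3.8}\approx 0.45,\end{align*}
consistent with the reported 0.47 once unrounded LUT and cycle counts are used.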
Fig. 10. Comparison between accelerators generated through SPARTA and Svelto on the TC benchmark.

4.2 NoC Evaluation

In this section, we evaluate the performance cost of the NoC and the effect of back pressure on the latency of memory operations. Table 5 shows data collected by synthesizing and running the TC, PR, and CC benchmarks with 32 cores and eight contexts, the configuration with the largest occurrence of back pressure signals. We tested two different configurations of the NoC: one that registers only the 32 I/O nodes and leaves the other 32 combinatorial, and one that uses only registered nodes. The difference between the two configurations is the number of requests the NoC can store, which is equal to the number of registered nodes, and its maximum latency, which is two clock cycles in the first case and four in the second. The actual number of cycles needed to traverse the NoC equals the number of registered nodes on the request’s path plus the effect of congestion (NoC latency).
NoC (only I/O nodes registered)
Benchmark | Cycles | BP | #Mem. Operations | Avg. BP Delay (cycles) | Avg. NoC Latency (cycles)
TC_32Co_8cs | 329159 | 964697 | 759812 | 1.27 | 7.29
PR_32Co_8cs | 129064 | 168259 | 467939 | 0.36 | 4.42
CC_32Co_8cs | 246424 | 2281643 | 1172090 | 1.95 | 7.67
NoC Registered (all nodes registered)
Benchmark | Cycles | BP | #Mem. Operations | Avg. BP Delay (cycles) | Avg. NoC Latency (cycles)
TC_32Co_8cs | 338701 | 508814 | 759812 | 0.67 | 12.00
PR_32Co_8cs | 132046 | 148133 | 467939 | 0.32 | 8.03
CC_32Co_8cs | 264787 | 1338848 | 1172090 | 1.14 | 12.26
Table 5. Congestion Overhead of the NoC on GAPBS Accelerators in Two Configurations
BP indicates the total number of back pressure signals received by the FIFO in front of the cores.
BP represents the total number of times the NoC could not accept a memory request because of congestion, measured through the number of back pressure signals received by the FIFO modules in front of all cores, while the number of memory operations is the sum of the requests from all cores on all memory channels. Dividing the total number of back pressure signals by the number of memory operations gives the average delay a single request spends in the FIFO modules before entering the NoC; this is the effect of the arbitration on the main memory bus, and it is very small when compared with the latency of the memory (20 cycles) or the latency of the NoC. The average NoC latency is the average number of cycles a request needs to traverse the NoC.
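For example, taking the TC row of the I/O-registered configuration in Table 5:
\begin{align*}\text{Avg. BP Delay}_{TC}=\frac{BP}{\#\text{Mem. Operations}}=\frac{964{,}697}{759{,}812}\approx 1.27\ \text{cycles}.\end{align*}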
Using a fully registered NoC, which can store more requests and thus reduce the arbitration delay, has a negative effect on the total number of cycles: although it reduces the number of back pressure signals, it increases the minimum time needed to traverse the NoC from two cycles to four, and it increases the internal congestion (visible in the higher average NoC latency). For these reasons, we ran all the experiments in Section 4.1 with a NoC in which only the I/O nodes are registered.

4.3 Parallel Queries

In this section, we generate accelerators for the same parallel database queries used in [39]: SPARQL queries from the LUBM benchmark and dataset [23], translated into C++ and parallelized with OpenMP. SPARQL is a query language for graph databases in the Resource Description Framework (RDF). RDF databases are naturally described as single labelled graphs, and SPARQL queries are defined as (sub)graph matching operations. The resulting code features nested for loops with different behaviours (e.g., number of OpenMP threads, number of loop iterations assigned to each thread) and irregular memory accesses. We compared the results obtained with SPARTA and Svelto on several different parallel queries, and we also repeated the experiments adding a cache in front of each memory bank in the SPARTA accelerators (512 lines, 512 bits per line). Figure 11 shows the relative performance of SPARTA against the Svelto baseline, with and without caches; as before, bars higher than one represent configurations where SPARTA outperforms Svelto.
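As a point of reference for the kind of code being synthesized, the following is a minimal, hypothetical sketch (not the actual translated LUBM queries, whose source is not listed here) of a nested OpenMP loop over CSR-like adjacency arrays with data-dependent, irregular reads:

```cpp
#include <cstdint>

// Hypothetical graph-pattern query: count neighbours carrying a given label.
// The inner trip count depends on the vertex degree, so work per thread is
// unbalanced, and the reads of col_idx/labels are irregular (data dependent).
void count_labelled_neighbours(const uint32_t* row_ptr, const uint32_t* col_idx,
                               const uint32_t* labels, uint32_t num_vertices,
                               uint32_t target_label, uint64_t* match_count) {
  uint64_t count = 0;
  #pragma omp parallel for reduction(+ : count) schedule(static)
  for (uint32_t v = 0; v < num_vertices; ++v) {
    for (uint32_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
      if (labels[col_idx[e]] == target_label) {
        count++;
      }
    }
  }
  *match_count = count;
}
```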
Fig. 11. Speedup of accelerators generated by SPARTA with respect to Svelto on the parallel queries.
The results we obtained differ depending on the specific query and the characteristics of its workload. In Q1, SPARTA without caches is always slower than Svelto: the query uses a small group of OpenMP threads, each performing a large number of operations, whereas we have shown that SPARTA performs best on memory-bound workloads with enough threads to hide memory latency. Q4 and Q5 terminate in a few tens of thousands of cycles and are not very indicative of the achievable speedup, as most cycles are spent in synchronization operations. Data for Q5 with 8 cores and 8 contexts is missing because Svelto does not terminate in that configuration. The results align with the ones from TC: SPARTA without caches performs worse when using few threads but improves as the number of cores and contexts grows. In Q2, Q3, and Q7, SPARTA drastically outperforms Svelto, even considering the higher resource utilization (Figure 12), executing between 6 and 11.5 times faster. In Q1 and Q6, instead, Svelto is faster, and this can be explained by a lack of parallelism in the query: much of the work has to be executed by a small subset of the threads, and the dynamic assignment of iterations in Svelto is better suited to this than the static division of work in SPARTA (the two scheduling policies are contrasted in the sketch below). However, even in these cases, with enough cores and contexts, SPARTA can close the gap and at least reach the same performance, although using a larger area.
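To make the distinction concrete, the following minimal sketch (illustrative only; `process` and the loop bounds are placeholders, not code from the benchmarks) contrasts the two OpenMP scheduling policies:

```cpp
#include <cmath>

// Placeholder per-iteration work whose cost varies with i, creating imbalance.
static double process(int i) {
  double x = 0.0;
  for (int k = 0; k < (i % 1024); ++k) x += std::sqrt(static_cast<double>(k + i));
  return x;
}

void static_vs_dynamic(int n, double* out) {
  // Static: iterations are partitioned into fixed chunks when the loop starts,
  // as in SPARTA, where each core computes its own iteration indices.
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < n; ++i) out[i] = process(i);

  // Dynamic: idle threads fetch one iteration at a time at runtime, similar in
  // spirit to Svelto's centralized dispatcher, which tolerates load imbalance.
  #pragma omp parallel for schedule(dynamic, 1)
  for (int i = 0; i < n; ++i) out[i] = process(i);
}
```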
Fig. 12. Relative comparison of resource consumption between SPARTA and Svelto for query Q2. Results for the other queries are omitted as they show very similar behaviour.
When using caches, SPARTA is almost always faster than Svelto, and the improvement is particularly large in the test cases where SPARTA was previously struggling: in both Q1 and Q6, introducing caches makes the generated accelerators more than twice as fast as before. With the largest configurations in Q2, Q3, and Q7, adding the cache reduces the speedup compared with the base version of SPARTA: these are the tests in which static scheduling can divide the computation among multiple cores and keep all of them busy. Under this circumstance, the main advantage of SPARTA is the ability to keep multiple requests from different contexts in flight and hide the memory latency while advancing the state of all contexts simultaneously; adding caches reduces the average memory latency, and it can cause a loss of parallelism when using CS. In fact, if memory requests are answered faster, the execution of the different contexts can become serialized and, depending on the data access pattern, lead to worse results.
In Figure 13, we investigate the influence of CS across various configurations. Due to space constraints, we focus on the speedup obtained by increasing the number of contexts from 1 to 2, 4, and 8 for query Q7; the speedup is shown as bar charts for different numbers of cores and tools (Svelto, SPARTA, and SPARTA with caches). When the number of contexts is fixed at 1, Svelto and SPARTA exhibit similar execution times, except in the case of 32 cores (see Figure 11). As expected, introducing caches improves SPARTA's performance in all scenarios. Moving from spatial to temporal parallelism, SPARTA and SPARTA with caches demonstrate superior scalability compared to Svelto. In fact, in Figure 13, the maximum speedup tends to decrease as the number of cores grows, as expected, but the rate of decrease is lower for SPARTA and SPARTA with caches than for Svelto. This limited scalability of Svelto is due to its interconnection, which scales less efficiently than the NoC used by SPARTA. Furthermore, SPARTA initiates a new memory operation before completing the previous one, enhancing channel utilization. Finally, with caches enabled in SPARTA, the impact of spatial parallelism is heightened while the impact of temporal parallelism is reduced: in Figure 13, the maximum speedup with Svelto and SPARTA is around 6–7, while with SPARTA with caches it is around 4.
Fig. 13. Impact on the speedup of various core counts and tools (Svelto, SPARTA, and SPARTA with caches) for benchmark query Q7. The baseline is the execution time of query Q7 when a single context is used for a given number of cores; the bars represent the speedup achieved by increasing the number of contexts to 2, 4, and 8.

4.4 Vitis HLS Graph Library

The commercial Vitis HLS tool from AMD/Xilinx provides a library of pre-optimized C++ kernels to implement graph processing accelerators. The kernels from the Vitis library have been optimized for HLS through parallelization, loop pipelining, and the introduction of caches to reduce memory access times.
We used identical input graphs to compare the results obtained with the Vitis Graph Library and with SPARTA. Although our graphs are smaller than those typically used with the Vitis Graph Library, they still allow a reasonable simulation and analysis time for design space exploration. The graph characteristics are detailed in Table 6. The graphs are stored as undirected in the Compressed Sparse Row (CSR) matrix format; since TC and PR require directed graphs, only the lower triangular part of the matrix is used for them. The SPARTA configurations for TC, CC, and PR are optimized to fully exploit spatial and temporal parallelism, using 32 cores and eight contexts each. However, due to issues in the place and route phase, SPARTA cannot achieve the same frequency as Vitis with such large designs; as a result, the post-routing frequency achieved is 200/220 MHz instead of the 250 MHz obtained with smaller configurations.
| Name | Vertices | Edges |
| Graph_1 | 4,096 | 65,390 |
| Graph_2 | 8,192 | 98,230 |
| Graph_3 | 8,192 | 131,036 |
| Graph_4 | 16,384 | 131,036 |
Table 6. Characteristics of the Graphs Used to Compare Vitis and SPARTA
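For clarity, a minimal sketch of the CSR layout assumed above is given below (field names are illustrative, not the data structures used inside SPARTA); for an undirected graph every edge is stored in both rows, while for TC and PR only the lower triangular part of the adjacency matrix is kept:

```cpp
#include <cstdint>
#include <vector>

// Compressed Sparse Row (CSR) graph: row_ptr has num_vertices + 1 entries, and
// the neighbours of vertex v are col_idx[row_ptr[v]] .. col_idx[row_ptr[v+1]-1].
struct CSRGraph {
  uint32_t num_vertices;
  std::vector<uint32_t> row_ptr;
  std::vector<uint32_t> col_idx;
};

// The degree of a vertex is the difference between consecutive row pointers.
inline uint32_t degree(const CSRGraph& g, uint32_t v) {
  return g.row_ptr[v + 1] - g.row_ptr[v];
}
```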
In the Vitis benchmarks, the delay of the HBM interface between the accelerator and the memory ranges from 100 to 140 ns, depending on the address of the request. To approximate this behaviour in the SPARTA benchmarks, we increased the memory delay to 32 cycles, which at 200/220 MHz is equivalent to 160/145 ns. Additionally, we utilized the AXI4 protocol instead of the minimal interface to exchange data between the accelerators and the external memory.
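Explicitly, the emulated memory latency is
\begin{align*}\frac{32\ \text{cycles}}{200\ \text{MHz}}=160\ \text{ns},\qquad\frac{32\ \text{cycles}}{220\ \text{MHz}}\approx 145\ \text{ns}.\end{align*}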
It is worth noting that both SPARTA and Vitis target an Alveo U250 FPGA, part of the UltraScale+ family. The results for each algorithm are reported separately in Tables 7, 8, and 9.
| Graph | Cycles | Frequency | Time (ms) | LUTs | AreaXTime |
| Graph_1 SPARTA | 148,599 | 220 MHz | 0.675 | 137,407 | 92,700 |
| Graph_1 Vitis | 2,358,543 | 280 MHz | 8.423 | 103,923 | 875,031 |
| Graph_2 SPARTA | 255,000 | 220 MHz | 1.158 | 137,407 | 159,076 |
| Graph_2 Vitis | 3,960,784 | 280 MHz | 14.14 | 103,923 | 1,469,471 |
| Graph_3 SPARTA | 322,363 | 220 MHz | 1.464 | 137,407 | 201,098 |
| Graph_3 Vitis | 4,661,624 | 280 MHz | 16.64 | 103,923 | 1,729,486 |
| Graph_4 SPARTA | 422,074 | 220 MHz | 1.912 | 137,407 | 263,301 |
| Graph_4 Vitis | 6,456,582 | 280 MHz | 23.05 | 103,923 | 2,395,425 |
Table 7. Comparison between SPARTA and Vitis on the CC Benchmark
| Graph | Cycles | Frequency | Time (ms) | LUTs | AreaXTime |
| Graph_1 SPARTA | 359,552 | 220 MHz | 1.632 | 91,449 | 149,278 |
| Graph_1 Vitis | 456,756 | 300 MHz | 1.521 | 21,001 | 31,942 |
| Graph_2 SPARTA | 378,962 | 220 MHz | 1.720 | 91,449 | 157,336 |
| Graph_2 Vitis | 560,060 | 300 MHz | 1.865 | 21,001 | 39,166 |
| Graph_3 SPARTA | 552,599 | 220 MHz | 2.509 | 91,449 | 201,098 |
| Graph_3 Vitis | 916,216 | 300 MHz | 3.051 | 21,001 | 64,074 |
| Graph_4 SPARTA | 292,759 | 220 MHz | 1.329 | 91,449 | 263,301 |
| Graph_4 Vitis | 590,690 | 300 MHz | 1.967 | 21,001 | 41,308 |
Table 8. Comparison between SPARTA and Vitis on the TC Benchmark
| Graph | Cycles | Frequency | Time (ms) | LUTs | AreaXTime |
| Graph_1 SPARTA | 692,441 | 200 MHz | 3.462 | 205,444 | 711,289 |
| Graph_1 Vitis | 3,703,603 | 300 MHz | 12.333 | 88,372 | 1,089,891 |
| Graph_2 SPARTA | 783,614 | 200 MHz | 3.918 | 205,444 | 804,943 |
| Graph_2 Vitis | 5,659,159 | 300 MHz | 18.845 | 88,372 | 1,665,370 |
| Graph_3 SPARTA | 971,440 | 200 MHz | 4.857 | 205,444 | 997,882 |
| Graph_3 Vitis | 7,509,609 | 300 MHz | 25.007 | 88,372 | 2,209,918 |
| Graph_4 SPARTA | 1,088,871 | 200 MHz | 5.444 | 205,444 | 1,118,518 |
| Graph_4 Vitis | 7,614,714 | 300 MHz | 25.375 | 88,372 | 2,240,848 |
Table 9. Comparison between SPARTA and Vitis on the PR Benchmark
SPARTA and Vitis can achieve different maximum frequencies. To account for this difference, we calculated the kernel execution time as the achieved clock period times the number of cycles (Time in the Tables). In addition, we computed the AreaXTime metric to normalize the performance of the different designs.
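In formula form (with Time expressed in milliseconds, which is consistent with the AreaXTime entries reported in Tables 7–9):
\begin{align*}\text{Time}=\frac{Cycles}{f_{achieved}},\qquad\text{AreaXTime}=LUTs\times\text{Time}.\end{align*}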
In the CC benchmark, SPARTA outperforms Vitis in terms of execution time by a factor of 10, with only a 30% greater area occupation. Additionally, the AreaXTime metric favours SPARTA by almost an order of magnitude.
In the TC benchmark, SPARTA achieves, on average, a faster execution time: it requires 14% less time to complete the computation over the four graphs. However, it occupies more than four times the area of Vitis.
In the PR benchmark, SPARTA achieves a speedup between four and six times compared to Vitis while occupying only about twice the area. Consequently, its AreaXTime metric is roughly half that of Vitis.
As performance results for the BFS graph library kernel are not available, we also used Vitis HLS to synthesize a C++ implementation of the BFS algorithm loosely based on the one available in MachSuite [48]. Note that we are using the common frontier-based implementation rather than the direction-optimizing implementation [6] (also provided in GAPBS). BFS is the prototypical irregular kernel [1] where the number of iterations for each thread can be highly unbalanced and requires critical sections to synchronize the different threads.
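The following is a minimal sketch of such a frontier-based BFS (our own illustration, not the MachSuite or SPARTA source): threads expand the current frontier in parallel, and a critical section protects insertion into the next frontier so that each vertex is visited only once and depths remain consistent.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

void bfs_frontier(const std::vector<uint32_t>& row_ptr,
                  const std::vector<uint32_t>& col_idx,
                  uint32_t source, std::vector<int32_t>& depth) {
  std::fill(depth.begin(), depth.end(), -1);
  std::vector<uint32_t> frontier{source};
  std::vector<uint32_t> next;
  depth[source] = 0;
  for (int32_t level = 0; !frontier.empty(); ++level) {
    next.clear();
    // Work per iteration is the degree of the frontier vertex: highly unbalanced.
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < frontier.size(); ++i) {
      uint32_t v = frontier[i];
      for (uint32_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
        uint32_t u = col_idx[e];
        if (depth[u] == -1) {
          // Critical section: re-check under the lock so u is enqueued only once.
          #pragma omp critical
          {
            if (depth[u] == -1) {
              depth[u] = level + 1;
              next.push_back(u);
            }
          }
        }
      }
    }
    // The implicit barrier at the end of the parallel loop synchronizes levels.
    frontier.swap(next);
  }
}
```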
We tried different configurations of SPARTA, varying the number of cores and contexts, using the accelerator produced by Vitis HLS as a baseline. Table 10 compares the accelerators produced by Vitis HLS and SPARTA without parallelization (one core and one context). There is only a small difference between the number of cycles for the two tools, while the area is more difficult to compare: Bambu does not require BRAMs for this design, as it maps all on-chip storage onto LUTs and registers, and almost half of the LUTs and registers used are dedicated to the 16-node NoC (as reported in Table 4).
| Results | SPARTA | Vitis HLS |
| Cycles | 32,622,452 | 33,287,532 |
| LUTs | 13,163 | 4,852 |
| Registers | 13,944 | 8,164 |
| BRAM | 0 | 263 |
Table 10. Comparison between SPARTA and Vitis HLS Accelerators for the Sequential BFS Algorithm
SPARTA results include the cost of the NoC and the adapter from the minimal protocol to AXI4.
Figure 14(a) displays the speedup achieved using different configurations of SPARTA with respect to the number of cycles required by Vitis HLS. In the sequential version, most of the time is spent waiting for the external memory to provide data in both the SPARTA and Vitis implementations. In the SPARTA parallel accelerators, a similar pattern arises when the number of contexts per core is small: each core spends most of the time waiting for new data to process. As we increase the number of contexts, each core spends a greater fraction of time doing useful computations. Even when using multiple contexts, BFS does not achieve the same speedup as more regular applications such as the ones described in Section 4.1: during the exploration of the graph, each thread can find a different number of elements, leading to a high load imbalance. Furthermore, all threads must be periodically synchronized to avoid exploring the same node of the graph multiple times and to compute the correct length of the shortest path from the origin. Despite this, SPARTA can achieve good speedup when using four contexts per core: with four cores, it achieves a speedup of almost five times, and with two cores, a speedup of three times compared to the Vitis HLS baseline. Figure 14(b) compares the area utilization of SPARTA and Vitis HLS. When adding a core, SPARTA replicates almost the entire design and adds necessary communication and synchronization logic. Hence, resource utilization increases a little more than linearly: with four cores and one context, the area occupied is six times more than the baseline. Note that with 4 cores, the NoC is larger than the entire design generated by Vitis HLS. However, its size does not increase when more contexts per core are added.
Fig. 14. BFS benchmark synthesized with varying numbers of cores and contexts: (a) speedup with respect to the Vitis HLS baseline, (b) area utilization.

4.5 LegUp

We also compare SPARTA with the published results obtained by LegUp [15]. Specifically, we use the Mandelbrot kernel from the latest open-source version of LegUp. The kernel computes the Mandelbrot set, an iterative mathematical benchmark that produces a fractal image, and is parallelized with OpenMP for pragmas. The results obtained by LegUp on this benchmark are reported in Table 11, as published in [15]. SPARTA achieves better performance across all metrics when synthesizing the same Mandelbrot kernel for a Stratix IV device (see Table 12). This performance advantage can be attributed to several factors, including SPARTA's use of external memory for storing the Mandelbrot picture, its independence from a soft processor for OpenMP parallelization, and its simpler architecture.
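As a reference for the kind of code involved, the following is an illustrative OpenMP Mandelbrot kernel in the spirit of the LegUp benchmark (image size, escape radius, iteration bound, and coordinate scaling are our assumptions, not the values used in [15]); image rows are independent, so the outer loop parallelizes with a single for pragma:

```cpp
#include <cstdint>

void mandelbrot(uint8_t* img, int width, int height, int max_iter) {
  #pragma omp parallel for schedule(static)
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      // Map the pixel to a point c in the complex plane.
      double cr = (x - width / 2.0) * 4.0 / width;
      double ci = (y - height / 2.0) * 4.0 / width;
      double zr = 0.0, zi = 0.0;
      int it = 0;
      // Iterate z = z^2 + c until divergence or the iteration bound is reached.
      while (zr * zr + zi * zi < 4.0 && it < max_iter) {
        double tmp = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = tmp;
        ++it;
      }
      img[y * width + x] = static_cast<uint8_t>((255 * it) / max_iter);
    }
  }
}
```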
| Accelerators | Cycles | Frequency | LUTs | M9Ks | DSPs |
| 1 | 2,523,455 | 135 MHz | 22,111 | 134 | 28 |
| 4 | 627,059 | 127 MHz | 26,869 | 134 | 88 |
| 8 | 355,224 | 124 MHz | 34,187 | 158 | 168 |
| 12 | 241,805 | 127 MHz | 37,476 | 158 | 248 |
| 16 | 185,863 | 127 MHz | 40,890 | 158 | 328 |
| 32 | 126,828 | 112 MHz | 61,251 | 182 | 648 |
| 48 | 113,593 | 104 MHz | 81,224 | 206 | 968 |
Table 11. The LegUp Results for the Mandelbrot Kernel
Accelerators indicates the number of accelerators used, and M9K is the number of memory components.
| Accelerators | Cycles | Frequency | LUTs | M9Ks | DSPs |
| 1 | 2,294,063 | 200 MHz | 4,878 | 1 | 12 |
| 2 | 1,151,243 | 200 MHz | 5,198 | 2 | 24 |
| 4 | 579,830 | 200 MHz | 5,928 | 4 | 48 |
| 8 | 291,057 | 197 MHz | 7,553 | 8 | 96 |
| 16 | 139,607 | 196 MHz | 9,939 | 16 | 192 |
| 32 | 73,039 | 163 MHz | 19,793 | 32 | 384 |
Table 12. The SPARTA Results for the Mandelbrot Kernel
Accelerators indicates the number of accelerators used, and M9K is the number of memory components.
Results from the other benchmarks in [15] would only further widen the performance gap between SPARTA and LegUp: those benchmarks rely on Pthread primitives in addition to OpenMP pragmas, and in LegUp the Pthread primitives are partially implemented in software to control the hardware threads, increasing the overhead introduced by the soft processor and affecting the overall execution.

5 Conclusion

Automating the design of complex parallel architectures can greatly improve developers’ productivity and lower the access barrier to FPGA acceleration for non-experts. In this article we presented SPARTA, a methodology to support the synthesis of parallel applications written in OpenMP through HLS. SPARTA exploits a compiler frontend to translate OpenMP pragmas into low-level calls to primitive functions and a library of corresponding custom hardware components that are allocated in the HLS tool backend. With SPARTA, it is possible to synthesize a variety of OpenMP work-sharing and synchronization constructs and to assign multiple threads to a single accelerator core to hide the latency of slower operations. The SPARTA architecture is modular and flexible, featuring multiple parallel cores, a hierarchical memory system, and an efficient NoC interface toward external memories. We tested our methodology on parallel graph applications, showing linear scaling in both performance and area consumption when the number of parallel cores increases, and we compared our results against a state-of-the-art methodology for the synthesis of parallel accelerators, obtaining on average a 2.29\(\times\) speedup. Future developments could include support for OpenMP tasks, dynamic scheduling, atomic memory operations, and the introduction of a broadcasting mechanism for memory responses; the latter would greatly reduce latency for patterns such as matrix-vector multiplication, where multiple cores require the same information stored in external memory.

References

[1]
Graph500. 2023. Retrieved from https://graph500.org
[2]
Hamzeh Ahangari, Muhammet Mustafa Özdal, and Özcan Öztürk. 2023. HLS-Based High-Throughput and Work-Efficient Synthesizable Graph Processing Template Pipeline. ACM Transactions on Embedded Computing Systems 22, 2 (2023), 1–24.
[3]
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA). 105–117.
[4]
ARM Developers. 2020. AMBA AXI and ACE Protocol Specification. Retrieved from https://developer.arm.com/documentation/ihi0022/e/AMBA-AXI3-and-AXI4-Protocol-Specification
[5]
Abanti Basak, Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao, Xiaowei Jiang, and Yuan Xie. 2019. Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA). 373–386.
[6]
Scott Beamer, Krste Asanovic, and David A. Patterson. 2012. Direction-Optimizing Breadth-First Search. In Proceedings of the SC Conference on High Performance Computing Networking, Storage and Analysis (SC ’12). 12.
[7]
Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. Retrieved from http://arxiv.org/abs/1508.03619
[8]
B. Betkaoui, Yu Wang, D.B. Thomas, and W. Luk. 2012. A Reconfigurable Computing Approach for Efficient and Scalable Parallel Graph Exploration. In Proceedings of the IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors (ASAP ’12). 8–15.
[9]
Nicolas Bohm Agostini, Serena Curzel, Vinay Amatya, Cheng Tan, Marco Minutoli, Vito Giovanni Castellana, Joseph Manzano, David Kaeli, and Antonino Tumeo. 2022. An MLIR-based Compiler Flow for System-Level Design and Hardware Acceleration. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design (ICCAD). 1–9.
[10]
Andrew Canis, Jongsok Choi, Blair Fort, Ruolong Lian, Qijing Huang, Nazanin Calagar, Marcel Gort, Jia Jun Qin, Mark Aldham, Tomasz Czajkowski, Stephen Brown, and Jason Anderson. 2013. From Software to Accelerators with LegUp High-Level Synthesis. In Proceedings of the 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES). 1–9.
[11]
Vito Giovanni Castellana, Marco Minutoli, Alessandro Morari, Antonino Tumeo, Marco Lattuada, and Fabrizio Ferrandi. 2015. High Level Synthesis of RDF Queries for Graph Analytics. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 323–330.
[12]
Rohit Chandra, Leo Dagum, Ramesh Menon, David Kohr, Dror Maydan, and Jeff McDonald. 2001. Parallel Programming in OpenMP. Morgan Kaufmann.
[13]
Xinyu Chen, Hongshi Tan, Yao Chen, Bingsheng He, Weng-Fai Wong, and Deming Chen. 2021. ThunderGP: HLS-Based Graph Processing Framework on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). 69–80.
[14]
Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, and Jason Cong. 2021. Extending High-Level Synthesis for Task-Parallel Programs. In Proceedings of the IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 204–213.
[15]
Jongsok Choi, Stephen Brown, and Jason Anderson. 2013. From Software Threads to Parallel Hardware in High-Level Synthesis for FPGAs. In Proceedings of the 2013 International Conference on Field-Programmable Technology (FPT). 270–277.
[16]
Jason Cong, Jason Lau, Gai Liu, Stephen Neuendorffer, Peichen Pan, Kees Vissers, and Zhiru Zhang. 2022. FPGA HLS Today: Successes, Challenges, and Opportunities. ACM Transactions on Reconfigurable Technology and Systems 15, 4 (2022), 1–42.
[17]
Yixiao Du, Yuwei Hu, Zhongchun Zhou, and Zhiru Zhang. 2022. High-Performance Sparse Linear Algebra on HBM-Equipped FPGAs Using HLS: A Case Study on SpMV. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). 54–64.
[18]
José Duato, Sudhakar Yalamanchili, and Lionel Ni. 2003. Interconnection Networks: An Engineering Approach. Morgan Kaufmann, 359–444.
[19]
F. Ferrandi, V. G. Castellana, S. Curzel, P. Fezzardi, M. Fiorito, M. Lattuada, M. Minutoli, C. Pilato, and A. Tumeo. 2021. Bambu: An Open-Source Research Framework for the High-Level Synthesis of Complex Applications. In Proceedings of the 58th ACM/IEEE Design Automation Conference (DAC). 1327–1330.
[20]
Yingxue Gao, Teng Wang, Lei Gong, Chao Wang, Xi Li, and Xuehai Zhou. 2023. FastRW: A Dataflow-Efficient and Memory-Aware Accelerator for Graph Random Walk on FPGAs. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE). 1–6.
[21]
Roberto Gioiosa, Antonino Tumeo, Jian Yin, Thomas Warfel, David Haglin, and Santiago Betelu. 2017. Exploring DataVortex Systems for Irregular Applications. In Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 409–418. DOI:
[22]
Chuang-Yi Gui, Long Zheng, Bingsheng He, Cheng Liu, Xin-Yu Chen, Xiao-Fei Liao, and Hai Jin. 2019. A Survey on Graph Processing Accelerators: Challenges and Opportunities. Journal of Computer Science and Technology 34 (2019), 339–371.
[23]
Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005. LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics 3, 2–3 (2005), 158–182.
[24]
R. J. Halstead and W. Najjar. 2013. Compiled Multithreaded Data Paths on FPGAs for Dynamic Workloads. In Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES). 1–10.
[25]
T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi. 2016. Graphicionado: A high-Performance and Energy-Efficient Accelerator for Graph Analytics. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–13.
[26]
Yuwei Hu, Yixiao Du, Ecenur Ustun, and Zhiru Zhang. 2021. GraphLily: Accelerating Graph Linear Algebra on HBM-Equipped FPGAs. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design (ICCAD). 1–9.
[27]
J. Huthmann, J. Oppermann, and A. Koch. 2014. Automatic High-Level Synthesis of Multi-Threaded Hardware Accelerators. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL). 1–4.
[28]
Data Vortex Inc. 2023. Data Vortex Network on Chip IP Block. Retrieved from https://www.datavortex.com/dv-noc-brief
[29]
Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu, and Arvind. 2018. GraFBoost: Using Accelerated Flash Storage for External Graph Analytics. In Proceedings of the ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). 411–424.
[30]
Nachiket Kapre and Jan Gray. 2015. Hoplite: Building Austere Overlay NoCs for FPGAs. In Proceedings of the 25th International Conference on Field Programmable Logic and Applications (FPL). 1–8.
[31]
Nachiket Kapre and Jan Gray. 2017. Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs. ACM Transactions on Reconfigurable Technology and Systems 10, 2 (Mar 2017), 1–24.
[32]
Khronos Group. [n. d.]. OpenCL - Open Standard for Parallel Programming of Heterogeneous Systems. Retrieved from https://www.khronos.org/opencl/
[33]
Kartik Lakhotia, Rajgopal Kannan, Sourav Pati, and Viktor Prasanna. 2019. GPOP: A Cache and Memory-Efficient Framework for Graph Processing over Partitions. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. 393–394.
[34]
Odile Liboiron-Ladouceur, Assaf Shacham, Benjamin A. Small, Benjamin G. Lee, Howard Wang, Caroline P. Lai, Aleksandr Biberman, and Keren Bergman. 2008. The Data Vortex Optical Packet Switched Interconnection Network. Journal of Lightwave Technology 26, 13 (2008), 1777–1789. DOI:
[35]
LLVM developers. [n. d.]. LLVM OpenMP Runtime Library. Retrieved from https://raw.githubusercontent.com/llvm/llvm-project/main/openmp/runtime/doc/Reference.pdf
[36]
Hao Lu, Mahantesh Halappanavar, and Ananth Kalyanaraman. 2015. Parallel Heuristics for Scalable Community Detection. Parallel Computing 47, C (Aug. 2015), 19–37.
[37]
Gurshaant Singh Malik and Nachiket Kapre. 2019. Enhancing Butterfly Fat Tree NoCs for FPGAs with Lightweight Flow Control. In Proceedings of the 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 154–162. DOI:
[38]
Fredrik Manne and Mahantesh Halappanavar. 2014. New Effective Multithreaded Matching Algorithms. In Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS). 519–528.
[39]
Marco Minutoli, Vito Giovanni Castellana, Nicola Saporetti, Stefano Devecchi, Marco Lattuada, Pietro Fezzardi, Antonino Tumeo, and Fabrizio Ferrandi. 2022. Svelto: High-Level Synthesis of Multi-Threaded Accelerators for Graph Analytics. IEEE Transactions on Computers 71, 3 (2022), 520–533.
[40]
Marco Minutoli, Vito Giovanni Castellana, Antonino Tumeo, Marco Lattuada, and Fabrizio Ferrandi. 2016. Efficient Synthesis Of Graph Methods: A Dynamically Scheduled Architecture. In Proceedings of the 35th International Conference on Computer-Aided Design (ICCAD).
[41]
Thomas Moscibroda and Onur Mutlu. 2009. A Case for Bufferless Routing in On-Chip Networks. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA). 196–207.
[42]
Razvan Nane, Vlad Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, Andrew Canis, Yu Ting Chen, Hsuan Hsiao, Stephen Dean Brown, Fabrizio Ferrandi, Jason Helge Anderson, and Koen Bertels. 2016. A Survey and Evaluation of FPGA High-Level Synthesis Tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 10 (2016), 1591–1604.
[43]
Tan Nguyen, Yao Cheny, Kyle Rupnow, Swathi Gurumani, and Deming Chen. 2016. SoC, NoC and Hierarchical Bus Implementations of Applications on FPGAs Using the FCUDA Flow. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI). 661–666.
[44]
E. Nurvitadhi, G. Weisz, Y. Wang, S. Hurkat, M. Nguyen, J. C. Hoe, J. F. Martínez, and C. Guestrin. 2014. GraphGen: An FPGA Framework for Vertex-Centric Graph Computation. In Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM). 25–28.
[45]
Tayo Oguntebi and Kunle Olukotun. 2016. GraphOps: A Dataflow Library for Graph Analytics Acceleration. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). 111–117.
[46]
Debjit Pal, Yi-Hsiang Lai, Shaojie Xiang, Niansong Zhang, Hongzheng Chen, Jeremy Casas, Pasquale Cocchini, Zhenkun Yang, Jin Yang, Louis-Noël Pouchet, and Zhiru Zhang. 2022. Accelerator Design with Decoupled Hardware Customizations: Benefits and Challenges. In Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC). 1351–1354.
[47]
Artur Podobas and Mats Brorsson. 2016. Empowering OpenMP with Automatically Generated Hardware. In Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS). IEEE.
[48]
Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. MachSuite: Benchmarks for Accelerator Design and Customized architectures. In Proceedings of the 2014 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 110–119.
[49]
Yousef Saad. 2003. Iterative Methods for Sparse Linear Systems (2nd ed.). SIAM.
[50]
Atefeh Sohrabizadeh, Yunsheng Bai, Yizhou Sun, and Jason Cong. 2022. Automated Accelerator Optimization Aided by Graph Neural Networks. In Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC). 55–60.
[51]
L. Sommer, J. Oppermann, J. Hofmann, and A. Koch. 2017. Synthesis of Interleaved Multithreaded Accelerators from OpenMP Loops. In Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig). 1–7.
[52]
Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. 2018. GraphR: Accelerating Graph Processing Using ReRAM. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA). 531–543.
[53]
Mingxing Tan, Gai Liu, Ritchie Zhao, Steve Dai, and Zhiru Zhang. 2015. ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 78–85.
[54]
J. Villarreal, A. Park, W. Najjar, and R. Halstead. 2010. Designing Modular Hardware Accelerators in C with ROCCC 2.0. In Proceedings of the 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 127–134.
[55]
S. Windh, P. Budhkar, and W. A. Najjar. 2015. CAMs as Synchronizing Caches for Multithreaded Irregular Applications on FPGAs. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 331–336.
[56]
Pengcheng Yao, Long Zheng, Yu Huang, Qinggang Wang, Chuangyi Gui, Zhen Zeng, Xiaofei Liao, Hai Jin, and Jingling Xue. 2022. ScalaGraph: A Scalable Accelerator for Massively Parallel Graph Processing. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA). 199–212.
[57]
Hanchen Ye, Cong Hao, Jianyi Cheng, Hyunmin Jeong, Jack Huang, Stephen Neuendorffer, and Deming Chen. 2022. ScaleHLS: A New Scalable High-Level Synthesis Framework on Multi-Level Intermediate Representation. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA). 741–755.
[58]
Jialiang Zhang, Soroosh Khoram, and Jing Li. 2017. Boosting the Performance of FPGA-Based Graph Processor Using Hybrid Memory Cube: A Case for Breadth First Search. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). 207–216.
[59]
Shijie Zhou, Charalampos Chelmis, and Viktor K. Prasanna. 2015. Optimizing Memory Performance for FPGA Implementation of Pagerank. In Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig). 1–6.
[60]
Shijie Zhou, Rajgopal Kannan, Viktor K. Prasanna, Guna Seetharaman, and Qing Wu. 2019. HitGraph: High-Throughput Graph Processing Framework on FPGA. IEEE Transactions on Parallel and Distributed Systems 30, 10 (2019), 2249–2264.
[61]
Shijie Zhou, Rajgopal Kannan, Hanqing Zeng, and Viktor K. Prasanna. 2018. An FPGA Framework for Edge-Centric Graph Processing. In Proceedings of the ACM International Conference on Computing Frontiers (CF). 69–77.
