
ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors

Published: 14 September 2024

Abstract

Systolic array architectures have significantly accelerated deep neural networks (DNNs). A systolic array comprises multiple processing elements (PEs) that perform multiply-accumulate (MAC) operations. Traditionally, a systolic array processes, in each cycle, an amount of tensor data that matches its dimensions. However, the hyper-parameters of DNN models differ across layers, resulting in tensors of various sizes. Mapping these irregular tensors onto the systolic array while fully utilizing all of its PEs is challenging. Furthermore, modern DNN systolic accelerators typically employ a single dataflow, which is not optimal for every DNN model.
This work proposes ReSA, a reconfigurable dataflow architecture that aims to minimize the execution time of a DNN model by mapping tiny tensors onto a spatially partitioned systolic array. Unlike conventional systolic array architectures, the ReSA data path controller enables the execution of the input-, weight-, and output-stationary dataflows on its PEs. ReSA also decomposes the coarse-grained systolic array into multiple small ones to reduce fragmentation in tensor mapping. Each small systolic sub-array relies on our data arbiter to dispatch tensors to one another through a simple interconnection network. Furthermore, ReSA reorders memory accesses to overlap the memory-load and execution stages, hiding memory latency when tackling tiny tensors. Finally, ReSA splits the tensors of each layer into multiple small ones and searches for the best dataflow for each tensor on the host side. ReSA then encodes the chosen dataflow in our proposed instruction to notify the systolic array to switch dataflows correctly. As a result, our optimizations on the systolic array architecture achieve a geometric mean speedup of 1.87× over the weight-stationary systolic array architecture across nine different DNN models.

1 Introduction

Deep neural networks (DNNs) provide practical solutions to problems such as classification [42], recognition [17], and language translation [35]. The success of DNNs has stimulated significant interest in finding efficient ways to execute these workloads quickly. For instance, many recent end-user programs include DNN inference functions that demand low latency and high energy efficiency. Recent work [7, 25, 27] shows that dataflow processing is an effective approach to improving the performance of DNN inference. DNN accelerators [1, 6, 8, 20, 21, 22] typically consist of a 2D array of processing elements (PEs) and orchestrate the data of DNN models to fit their dataflow architectures. However, the execution time of a DNN model varies with the dataflow used by the accelerator.
The systolic architecture [24] is widely used to accelerate DNN operations. Its simple and regular dataflows pass data quickly between PEs and maximize data reuse. Weight-Stationary (WS) [4, 5, 22, 32, 49], Input-Stationary (IS) [34], and Output-Stationary (OS) [8, 41] are the dataflows commonly used in systolic arrays. If the size of a systolic array were infinite, each dataflow would perform similarly in terms of latency. In practice, however, the number of PEs is finite and fixed. Figure 1 presents the latency of different DNN models normalized to a conventional weight-stationary systolic array with \(128 \times 128\) PEs. Although all three dataflows can be applied to the systolic architecture, they lead to very different execution latencies because a systolic array has a limited number of PEs. The systolic array therefore often divides a large tensor into multiple small tiles that fit its size, and each dataflow produces tiles of different sizes, which leads to the observed variation in PE utilization and performance.
Fig. 1.
Fig. 1. Performance of different dataflows on \(128 \times 128\) systolic array.
The systolic array usually transfers data across PEs in a rigid direction, so it is difficult to fully utilize all PEs when the granularity of tensor data varies. Recent work [13, 39, 40, 47] has changed existing hardware and software to improve PE utilization in modern DNN accelerators. One way to raise the PE utilization of a spatial DNN accelerator is a spatially partitioned architecture that alleviates under-utilization of the 2D computing array. For instance, Simba [39] distributes data of DNN models to its multi-chip-module-based (MCM) architecture. Reference [40] maximizes CNN accelerator efficiency by mapping tensors onto each partition element of an FPGA. Planaria [13] decomposes a systolic array into multiple sub-arrays interconnected with a sophisticated network for multi-tenant DNN workloads and uses an omnidirectional dataflow to improve utilization through fine-grained tensor mapping. CMSA [47] leverages a multi-dataflow design and folds small tensors, deploying them to idle PEs. However, the sophisticated on-chip network increases the complexity and power consumption of the spatially partitioned systolic architecture.
To further improve the performance of the systolic array, this article presents ReSA, a reconfigurable systolic architecture that concurrently executes multiple tiny tensors with different dataflows on a systolic array. Mapping these tiny tensors onto the systolic array while fully utilizing it is challenging. Thus, ReSA employs a simple sub-array fission with PEs that can support different dataflows. First, to alleviate the under-utilization problem, ReSA spatially partitions a \(128 \times 128\) systolic array into multiple sub-arrays. ReSA also splits the tensors in each layer into different sizes to fit these sub-arrays well. Each sub-array contains a dataflow controller that switches the sub-array's dataflow to satisfy the needs of each tensor, and ReSA modifies the PE structure in each sub-array to work with different dataflows. The sub-arrays are connected through a simple interconnection network, so ReSA's data arbiter can distribute tensor data to a group of sub-arrays. In addition, the systolic array often uses coarse-grained commands to fetch a large amount of data from main memory. Unlike a conventional systolic array, ReSA breaks the tensors of a DNN model into multiple tiny tensors, which increases the number of memory transactions. To improve the efficiency of the memory system, ReSA overlaps the fine-grained data loads with the execution stage to hide the memory access latency. As a result, ReSA achieves a geometric mean speedup of 1.87\(\times\) over the weight-stationary systolic array by employing different dataflows on its sub-array architecture with an interconnection network and an efficient memory system.
In the end, this article makes the following contributions:
Hardware-aware layer splitting: ReSA defines the dataflow and the tile size of each DNN model layer through offline profiling. Then, the ReSA host scheduler issues each layer of DNN models to the sub-arrays. As a result, ReSA achieves a geometric mean speedup of 1.87\(\times\) over the weight-stationary systolic array-based accelerator across nine different DNN models.
Dynamic sub-array fission: This work divides the systolic array into multiple sub-arrays and offers numerous combinations of sub-array fission. Each sub-array has its own on-chip buffer space. Furthermore, the size of the sub-array fission varies dynamically with the tile size of each layer of a DNN model. The ReSA data router distributes data across PEs in a sub-array. As a result, ReSA adapts to the varying layer sizes of each DNN model thanks to the connectivity among sub-arrays.
Micro-architecture design for heterogeneous dataflows: This work devises the micro-architecture of the data path controller to dynamically switch the data path of different dataflows on the systolic array. Furthermore, this article also changes the PE micro-architecture to be compatible with different dataflows. Hence, ReSA enables heterogeneous dataflows on a systolic array DNN accelerator.
Memory arrangement for workloads with many concurrent tiny tensors: ReSA employs a memory access policy for workloads that contain multiple concurrent tiny tensors, called multi-tile workloads in the following paragraphs, to make the most of the double-buffering design, reducing the extra memory wait cycles compared to Planaria [13].

2 Characterizations On Systolic Array

This section indicates the opportunities and challenges of using reconfigurable dataflows. It starts by comparing the execution latency of different dataflows on a systolic array. Then, we compare the PE utilization of the systolic accelerator when using different dataflows on DNN models. Finally, we point out the memory system problem that arises when a systolic architecture, such as ReSA, contains multiple sub-arrays.

2.1 Latency on Different DNN Dataflows

Figure 2 illustrates the workflow of the weight-, input-, and output-stationary dataflows on a systolic array, assuming that the height and width of the input, output, and weight matrices are smaller than those of the systolic array. Thus, this simple example in Figure 2 does not consider temporal folding. The number of cycles required by each dataflow can be determined using the parameters provided in Table 1. In Figure 2, the size of the systolic array is represented as \(N \times N\). Note that the weight-stationary and input-stationary dataflows require initially fetching the weight matrix and the input matrix, respectively, from the SRAM buffer and distributing them to a group of PEs. The preloaded data can be reused during the systolic execution. However, hiding the latency of this SRAM buffer access is challenging; it takes \((kw \times kh \times C)\) cycles, equal to the number of occupied PE rows. The systolic array can reduce this latency through a dedicated network, but implementing such a network may require high bandwidth or a sophisticated topology to dispatch data to multiple PEs efficiently. For simplicity in the network-on-chip (NoC) design of the systolic array, we assume that the maximum number of cycles required to dispatch the input matrix or the weight matrix is N, the width of the PE array, as defined in Table 1. Unlike input-stationary and weight-stationary, output-stationary does not require initialization and produces the output matrix during execution. However, after the systolic processing, the resulting data remains in the PE array, necessitating its transfer back to the SRAM buffer, which takes \((Ow \times Oh)\) cycles, equal to the number of occupied PE rows, as previously assumed based on the NoC bandwidth.
Table 1.
Parameter | Description
Ow | The width of the output feature map
Oh | The height of the output feature map
kw | The width of the filter
kh | The height of the filter
F | The number of filters
C | The number of channels
N | The width and height of the systolic array
Table 1. Parameters of CNN Models
Fig. 2.
Fig. 2. The execution workflow of weight, input, and output-stationary.
Figure 2(a) describes the execution time of a convolution layer using the weight-stationary dataflow on a systolic array. We assume it takes \(kw \times kh \times C\) cycles to load the weight matrix into the PEs’ registers. Then, the input matrix requires \((Ow \times Oh) + (kw \times kh \times C)\) cycles to flow into the systolic array horizontally. Subsequently, the input matrix needs an additional \(F\) cycles to pass through each filter. As a result, the number of cycles taken by the weight-stationary dataflow is \((kw \times kh \times C) + (Ow \times Oh) + (kw \times kh \times C) + F\).
Figure 2(b) illustrates the execution time of a convolution layer using input-stationary on a systolic array. Like the weight-stationary dataflow, loading the input matrix into the PEs’ registers initially takes \((kw \times kh \times C)\) cycles. In Figure 2(b), the weight matrix streams into the systolic array horizontally and takes \((kw \times kh \times C) + F\) cycles. Subsequently, the weight matrix requires \((Ow \times Oh)\) cycles to pass through the entire input matrix. Therefore, the total number of cycles taken by the input-stationary dataflow is \((kw \times kh \times C) + (kw \times kh \times C) + F + (Ow \times Oh)\).
Figure 2(c) shows the execution time of a convolution layer using output-stationary on a systolic array. In Figure 2(c), the input matrix and weight matrix are pushed into the systolic array, taking \((kw \times kh \times C) + (Ow \times Oh)\) cycles to complete. Subsequently, it requires \(F\) cycles to pass through the output matrix horizontally. Finally, it takes \((Ow \times Oh)\) cycles to retrieve the result from the PEs. Hence, the number of cycles taken by the output-stationary dataflow is \((kw \times kh \times C) + (Ow \times Oh) + F + (Ow \times Oh)\).
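To make these cycle counts concrete, the following is a minimal sketch (not the authors' code) that evaluates the three single-tile latency formulas above, assuming the tensor fits the \(N \times N\) array without temporal folding:

# Single-tile cycle counts derived in Section 2.1 (no temporal folding).
def dataflow_cycles(Ow, Oh, kw, kh, C, F):
    ws = (kw * kh * C) + (Ow * Oh) + (kw * kh * C) + F   # weight-stationary
    is_ = (kw * kh * C) + (kw * kh * C) + F + (Ow * Oh)  # input-stationary
    os_ = (kw * kh * C) + (Ow * Oh) + F + (Ow * Oh)      # output-stationary
    return {"WS": ws, "IS": is_, "OS": os_}

# Example for a layer with (Ow, Oh, kw, kh, C, F) = (56, 56, 1, 1, 64, 256).
print(dataflow_cycles(56, 56, 1, 1, 64, 256))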

2.2 Latency Analysis on Different Dataflows

Figure 3 compares the performance of selected FastRCNN [14] convolution layers on a \(128 \times 128\) systolic array with different dataflows. DNN models often contain multiple layers with different shapes, and this irregularity leads to different latencies across dataflows. For instance, layers A1, B1, and C1 perform best with the weight-, output-, and input-stationary policies, respectively. No single dataflow dominates across all layers of a DNN model.
Fig. 3.
Fig. 3. Performance comparison of DNN model layers on the systolic array. Parameters of each layer (Ow, Oh, kw, kh, C, F): layer A1 (56, 56, 1, 1, 64, 256), layer B1 (26, 26, 3, 3, 128, 128), layer C1 (14, 14, 1, 1, 512, 1024). The size of the systolic array is \(128 \times 128\).
The typical systolic architecture cuts a tensor into small tiles when the tensor is larger than the systolic array. For instance, on a systolic array with \(128 \times 128\) PEs under the weight-stationary (WS) dataflow, the tensor is divided once the value of \(kw \times kh \times C\) or \(F\) exceeds 128. In contrast, input-stationary (IS) checks the values of \(kw \times kh \times C\) and \(Ow \times Oh\), and output-stationary (OS) checks the values of \(Ow \times Oh\) and \(F\), to determine when to cut a tensor.
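As a hypothetical illustration of this tiling rule, the sketch below counts how many tiles each dataflow produces on an \(N \times N\) array; the two dimensions checked per dataflow follow the description above:

import math

def num_tiles(Ow, Oh, kw, kh, C, F, N=128):
    rows = kw * kh * C
    return {
        "WS": math.ceil(rows / N) * math.ceil(F / N),
        "IS": math.ceil(rows / N) * math.ceil((Ow * Oh) / N),
        "OS": math.ceil((Ow * Oh) / N) * math.ceil(F / N),
    }

# Layer C1 of Figure 3: WS tiles along C and F, IS along C and Ow*Oh, OS along Ow*Oh and F.
print(num_tiles(14, 14, 1, 1, 512, 1024))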

2.3 The PE Utilization of Systolic Array

The utilization of PEs impacts the throughput of the systolic array. The size of the systolic array, the shape of a tensor, and the choice of dataflow determine the PE utilization in the systolic array. Figure 4 illustrates the PE utilization when the size of the systolic array varies across different dataflows in MobileNet [18]. The utilization of PEs in a systolic array is calculated using Equation (1).
\begin{equation} \frac{\sum \left[\left(\frac{\#\ active\ PEs\ of\ a\ layer}{\#\ total\ PEs}\right) \times (elapsed\ cycles\ of\ a\ layer)\right]}{(elapsed\ cycles\ of\ a\ model)} \times 100\% \qquad (1) \end{equation}
Fig. 4.
Fig. 4. Average PE utilization of different dataflows on each size of systolic array when using the MobileNet [18].
Figure 4 presents the average PE utilization on MobileNet. We measure the number of active PEs in each layer of MobileNet and report the average PE utilization as the number of PEs in the systolic array varies. We follow the parameters of MobileNet to generate the tensor data and feed the generated tensors into our systolic array simulator. Each bar in Figure 4 presents the average PE utilization of the WS/OS/IS dataflow. The WS/OS/IS dataflow implementations shown in Figure 4 are based on Timeloop [33] and SCALE-Sim [37], and the dataflows used in the Figure 4 experiment follow the description in Section 2.1.
In Figure 4, the utilization of PEs varies across dataflows because the granularity of each tile in a tensor depends on the chosen dataflow. For instance, in the weight-stationary dataflow, the size of the weight matrix, \(F \times (kw \times kh \times C)\), determines the number of active PEs. In the input-stationary dataflow, the size of the input matrix, \((Ow \times Oh) \times (kw \times kh \times C)\), determines the number of active PEs, and in the output-stationary dataflow, the number of active PEs is determined by the size of the output matrix, \(F \times (Ow \times Oh)\). Tile sizes vary, and a tile derived from a given dataflow does not always fit the dimensions of the systolic array. Consequently, PE utilization worsens as the size of the systolic array increases. As shown in Figure 4, the utilization of PEs is generally lower for the \(256 \times 256\) systolic array than for other sizes. Therefore, this work proposes a configurable systolic array to enhance PE utilization when the size of tensors varies.
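The sketch below is a minimal rendering of Equation (1), assuming per-layer statistics (active PEs and elapsed cycles) collected from a simulator such as the one described in Section 6:

# Cycle-weighted average PE utilization over a model's layers (Equation (1)).
def model_pe_utilization(layers, total_pes):
    weighted = sum((l["active_pes"] / total_pes) * l["cycles"] for l in layers)
    total_cycles = sum(l["cycles"] for l in layers)
    return weighted / total_cycles * 100.0

# For weight-stationary, a layer's active PEs would be bounded by the array:
#   active_pes = min(kw * kh * C, N) * min(F, N)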

2.4 Memory System for Multiple Tiny Tensors

Conventionally, the systolic architecture fetches tensors sized to fill the entire systolic array while hiding the memory latency through an on-chip SRAM buffer with a double-buffering policy. A recent systolic accelerator for multi-tenant workloads, Planaria [13], divides the SRAM buffer and the systolic array among DNN model tasks, each referred to as a thread. Planaria partitions a systolic array into multiple small sub-arrays, each containing \(32 \times 32\) PEs, interconnected through an on-chip network. Each thread can thus request a different number of sub-arrays from Planaria's scheduler, and Planaria allocates SRAM buffer space to each model task based on the number of sub-arrays the thread obtains. Planaria gathers partial tensor data from different threads, loading and writing back all of them between DRAM and SRAM. In the initial state, Planaria loads partial weight and input data from off-chip DRAM into the on-chip SRAM buffer. During the following middle state, Planaria overlaps computation with memory loads and stores. In the final state, Planaria completes its computational work and the output data is written back from the SRAM buffer to DRAM. However, such a memory system is inefficient: PEs sit idle while waiting for the initial and final states to complete. The number of memory wait cycles is \(initial + \max (0, middle - computation) + final\). As depicted in Figure 5, the percentages of memory wait cycles are 18.7%, 25.7%, 32.6%, and 35.8% when using 1, 2, 4, 8, and 16 threads, respectively, which influences the total latency significantly. To increase the overlap of memory access and computation during model execution, our work shortens the initial and final states by reducing the amount of tensor data handled in them.
Fig. 5.
Fig. 5. Ratio of memory wait cycles to total cycles of each model using different number of threads.

2.5 Conventional Tensor Mapping on Systolic Architecture

To tackle large tensors efficiently, the conventional systolic architecture often consists of a large PE array, such as 128 \(\times\) 128 or 256 \(\times\) 256. However, such a large PE array exposes a challenging problem when the size of tensors becomes small and varies. Figure 6(a) depicts the tensor mapping of a GoogleNet convolution layer [42] on a 128 \(\times\) 128 systolic array that uses the weight-stationary dataflow policy. The parameters of this layer are \((Ow, Oh, kw, kh, C, F) = (56, 56, 1, 1, 64, 64)\). Since the size of the weight is \((1 \times 1 \times 64 \times 64)\) as \((kw, kh, C, F)\), only \(64 \times 64\) PEs out of the \(128 \times 128\) PE array are used, as shown in Figure 6(a). Thus, the PE utilization is \(\frac{64 \times 64}{128 \times 128} = 25\%\). In this layer of GoogleNet [42], the weight tensor is small and cannot fill the entire systolic array. Moreover, under the weight-stationary dataflow policy, the utilization of the systolic array does not change even when the input size \((Ow \times Oh)\) becomes large.
Fig. 6.
Fig. 6. Comparison of conventional and partitioned systolic array.

2.6 Tensor Mapping on Spatially Partitioned Systolic Architecture

One solution to the under-utilization problem of the systolic array when tensor sizes vary is to partition the entire systolic array into multiple small sub-arrays. Each sub-array can perform operations independently. Furthermore, the systolic architecture can combine multiple sub-arrays into a larger one through an interconnected on-chip network to fit tensors of different sizes. To resolve the under-utilization problem shown in Figure 6(a), as illustrated in Figure 6(b), the weight data is duplicated and each copy is distributed in advance to a sub-array for data reuse under the weight-stationary dataflow. Next, the input tensor is divided into four chunks and each chunk is processed in a different sub-array concurrently. Thus, in Figure 6(b), the spatially partitioned systolic architecture can fully utilize the entire systolic array by intelligently splitting and distributing tensors to each sub-array. However, this spatially partitioned systolic architecture does not always fully utilize the entire systolic array, and the sophisticated on-chip network that connects multiple tiny sub-arrays also increases the power and area of the systolic architecture. Unlike the typical spatially partitioned systolic architecture, our work employs multiple dataflows on the spatially partitioned systolic architecture to reduce the overhead of the on-chip network while improving application latency and the PE utilization of the systolic array.

3 ReSA System Architecture

This section presents the overview of ReSA system architecture and introduces the ReSA’s tensor splitting and mapping policy.

3.1 Overview of ReSA’s Architecture

The reconfigurable dataflow architecture, ReSA, aims to improve the performance of DNN inference applications through different dataflows and the spatial partitioning of PEs. Figure 7 presents an overview of ReSA's architecture. ReSA divides the entire systolic array into multiple sub-arrays of a predetermined size. For instance, the systolic array in Figure 7 is divided into \(4 \times 4\) (16) sub-arrays. The host processor issues a series of DNN model layers to the accelerator, and ReSA determines the dataflow of each model layer through offline profiling. First, our profiling tool examines the latency of each model layer on different combinations of sub-arrays. For instance, the possible combinations of sub-arrays in Figure 7 include \((1 \times 1), (1 \times 2), (2 \times 1), (2 \times 2), (1 \times 4), (4 \times 1), (2 \times 4)\), \((4 \times 2)\), and \((4 \times 4)\). Note that all sub-arrays are connected using the same combination, so, given a combination of sub-arrays, the number of independent PE arrays in the accelerator is determined. For instance, if the combination is \((2 \times 4)\), then there are \(\frac{(4 \times 4)}{(2 \times 4)} = 2\) independent PE arrays. We call these independent PE arrays threads in the following paragraphs. Second, our profiling tool finds the best tiling size for the parameters \(Ow, Oh, F, C\) of a model layer together with the best-fit dataflow and sub-array fission. We place the profiling results in the memory of the host processor. The host processor issues a command to notify the systolic array controller, which then directs the execution on the systolic array accordingly.
Fig. 7.
Fig. 7. ReSA architecture overview.
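The following small sketch, assuming the \(4 \times 4\) grid of sub-arrays in Figure 7, shows how a chosen sub-array combination determines the number of independent PE arrays (threads):

# Valid sub-array combinations for a 4x4 grid, as enumerated above.
COMBINATIONS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 4), (4, 1), (2, 4), (4, 2), (4, 4)]

def num_threads(r, s, grid=(4, 4)):
    assert (r, s) in COMBINATIONS
    return (grid[0] * grid[1]) // (r * s)

print(num_threads(2, 4))  # 16 / 8 = 2 independent PE arrays, as in the example above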
The systolic array controller decodes the command from the host processor and manages the dataflow on different PE array combinations. Figure 8 shows the micro-architecture of a sub-array. Each sub-array has its own data path controller. The ReSA data path controller redirects the data path of input and weight data in a sub-array to cater to different dataflows. Furthermore, the conventional systolic array contains a unified on-chip buffer shared by the entire systolic array. Thus, the data must stream from the left and the top of the systolic array into the systolic array. Unlike the traditional systolic array-based accelerators, ReSA decomposes the unified on-chip buffer into multiple small buffers. Multiple sub-arrays share an on-chip buffer in ReSA. ReSA uses the on-chip buffer to hide the memory access latency through double buffering. Moreover, ReSA’s data arbiter connects to a group of sub-arrays through the dedicated network to enhance the connectivity of sub-arrays. Hence, ReSA can multi-cast data to multiple sub-arrays that belong to the same group.
Fig. 8.
Fig. 8. Micro-architecture of a ReSA’s sub-array.
Due to the connectivity of ReSA’s sub-arrays, ReSA can allocate multiple sub-arrays to a model layer. As shown in Figure 7, ReSA still streams data in the left-to-right and the top-to-bottom direction. Unlike Planaria [13], ReSA simplifies the NoCs of the systolic array but still can assign different kinds of sub-array combinations to a model layer.
Furthermore, ReSA adds multiplexers (MUX) and demultiplexers (DEMUX) in each sub-array. ReSA's MUXes and DEMUXes select whether a sub-array's data originates from its neighbor or the on-chip buffer and help bypass the accumulator. Moreover, each sub-array has its own accumulator that handles results and sends them back to the on-chip buffer. This distributed accumulator shortens the path to the output buffer compared to the conventional systolic array when PEs are not fully utilized. Each ReSA sub-array executes the dataflow defined by software, and the fine-grained sub-arrays also improve the utilization of the systolic array.

3.2 The ReSA Control Command

ReSA requires a command to convey its profiling results to the systolic architecture, which uses this information to serve each split tensor at runtime. Figure 9 illustrates the fields of ReSA's command, which is 64 bits long. ReSA allocates 10 bits for each of the \((Ow, Oh, C, F)\) parameters and 4 bits for each of the \((kw, kh)\) parameters, based on the values of these parameters in common DNN models. ReSA contains multiple sub-arrays, each responsible for one tile of a split tensor; thus, ReSA empirically assigns 10 bits to the \(Tiles\) field that indicates the number of tiles in a split tensor. In addition, ReSA has three dataflow options, requiring 2 bits to represent the dataflow used by a split tensor, and supports 9 different sub-array fissions, which necessitates 4 bits. Finally, ReSA's controller decodes this command and uses the information to guide the operation of each split tensor on the systolic architecture.
Fig. 9.
Fig. 9. The field of ReSA’s splitting command.
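As an illustrative sketch, the 64-bit command can be packed as below; the field order and the encodings of the dataflow and fission fields are assumptions, not the exact layout mandated by Figure 9:

# Hypothetical packing of ReSA's 64-bit splitting command:
# 10 bits each for Ow, Oh, C, F; 4 bits each for kw, kh;
# 10 bits for Tiles; 2 bits for dataflow; 4 bits for sub-array fission.
FIELDS = [("Ow", 10), ("Oh", 10), ("C", 10), ("F", 10),
          ("kw", 4), ("kh", 4), ("Tiles", 10), ("dataflow", 2), ("fission", 4)]

def encode_command(values):
    word, shift = 0, 0
    for name, width in FIELDS:
        assert 0 <= values[name] < (1 << width)
        word |= values[name] << shift
        shift += width
    assert shift == 64
    return word

cmd = encode_command({"Ow": 56, "Oh": 56, "C": 64, "F": 256,
                      "kw": 1, "kh": 1, "Tiles": 4, "dataflow": 0, "fission": 3})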

3.3 The ReSA Host Layer Scheduler

The ReSA host layer scheduler manages the different DNN models executed on ReSA. Figure 10 presents the ReSA host scheduler and its profiling table stored in main memory. In Figure 10, the model ID identifies the DNN model and the layer ID indicates the index of the layer in that model. The tile field records the values of the tile, Ow, Oh, kw, kh, C, and F parameters, as shown in Figure 9. The sub-array field maintains the values of R (the height of the sub-array combination) and S (the width of the sub-array combination). Finally, the dataflow ID records the dataflow used by a layer. The number of DNN model layers determines the size of the profiling table; for instance, the profiling table of ResNet [17] with 53 CONV layers takes 3.24 KB. The ReSA scheduler scans the entries of the profiling table to find the one that matches the ready DNN model layer. Note that the value of \(R \times S\) must be one of the fissions shown in Figure 11. Then, the ReSA scheduler issues the dataflow of that ready layer to the accelerator, configures the combination of PE arrays, and cuts the layer into tiles based on the matching entry. As a result, with the profiling table, the accelerator can quickly determine how to configure the hardware and software at runtime.
Fig. 10.
Fig. 10. ReSA’s host scheduler and profiling table.
Fig. 11.
Fig. 11. Illustration of possible ReSA’s sub-array fission.
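A minimal sketch of a profiling-table entry and the scheduler lookup described above is shown below; the field names and the dataflow encoding are illustrative assumptions rather than the exact layout in Figure 10:

from dataclasses import dataclass

@dataclass
class ProfileEntry:
    model_id: int
    layer_id: int
    tile: tuple        # (tiles, Ow, Oh, kw, kh, C, F), as in the command fields
    sub_array: tuple   # (R, S) sub-array combination, one of the fissions in Figure 11
    dataflow_id: int   # assumed encoding: 0 = WS, 1 = IS, 2 = OS

def lookup(table, model_id, layer_id):
    # Scan the profiling table for the entry matching the ready layer.
    for entry in table:
        if entry.model_id == model_id and entry.layer_id == layer_id:
            return entry
    raise KeyError("layer not profiled")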

4 Hardware-aware Layer Splitting and Data Mapping

This section discusses the ReSA hardware-aware tensor data splitting policy and the combinations of the sub-array fission where the size of the sub-array fission can be changed dynamically.

4.1 The ReSA Tensor Splitting Policy

ReSA aims to optimize the latency of DNN inference applications, where the model parameters and input shapes are predefined and never change during inference. Thus, ReSA performs offline profiling on the systolic architecture to determine the size of each split tensor and its associated dataflow, and then uses the profiling results to execute tensors at runtime. Model parameters such as \(Ow, Oh, C,\) and \(F\) determine the size of each split tensor. To fit the size of the sub-array fission, ReSA scans through combinations of these four parameters whose sizes grow exponentially from (1, 1, 1, 1) to \((Ow, Oh, C, F)\). As illustrated in Algorithm 1, ReSA takes \(O((\log _2 n)^4)\) time to work out the size of the split tensor. ReSA allows tensors to use the WS, IS, and OS dataflows on its systolic architecture and therefore considers all three dataflows in the selection of the split tensor, as illustrated in line 16 of Algorithm 1. Although finding the best configuration for each split tensor takes time, ReSA's profiling runs only once, and its results are reused many times during DNN inference. Finally, ReSA creates a table that takes only several KB and stores the profiling results of each split tensor in main memory (see Section 3.3). As a result, ReSA can quickly refer to the profiling results stored in host memory and issue commands to notify the systolic architecture accordingly.
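The sketch below outlines the search suggested by this description and Algorithm 1; it is an assumed reconstruction, with the `simulate` argument standing in for the profiler's latency model of one split tensor:

def exp_candidates(limit):
    # Candidate tile sizes grow exponentially: 1, 2, 4, ..., up to the full dimension.
    v = 1
    while v < limit:
        yield v
        v *= 2
    yield limit

def profile_layer(Ow, Oh, C, F, kw, kh, simulate):
    best = None
    for ow in exp_candidates(Ow):
        for oh in exp_candidates(Oh):
            for c in exp_candidates(C):
                for f in exp_candidates(F):
                    for dataflow in ("WS", "IS", "OS"):
                        lat = simulate(ow, oh, kw, kh, c, f, dataflow)
                        if best is None or lat < best[0]:
                            best = (lat, (ow, oh, c, f), dataflow)
    return best  # (latency, split-tensor size, dataflow) for the profiling table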

4.2 The ReSA Sub-array Fission

To decrease the overhead of the on-chip network on the spatially partitioned systolic architecture, ReSA divides a \(128 \times 128\) systolic array into 16 \(32 \times 32\) sub-arrays interconnected through a simple on-chip network. ReSA can thus provide several sub-array fissions for tensors of different sizes and shapes. Figure 11 presents the nine possible sub-array fissions for tensor data allocation on each layer of a DNN model. As illustrated in Figure 12, unlike Planaria [13], ReSA removes the upward and right-to-left on-chip network connections in the systolic array; Planaria uses twice as many network connections as ReSA's topology. Planaria can indeed create more sub-array fissions via its omnidirectional network; for instance, it can offer a 1 \(\times\) 8 sub-array fission that is not included in ReSA's set of fissions. However, in our evaluation, ReSA's sub-array fissions satisfy the needs of most tensors while cutting the on-chip network connections by roughly 50% compared to Planaria. Hence, ReSA provides flexible tensor mapping through multiple sub-array fissions connected with a simple on-chip network.
Fig. 12.
Fig. 12. Planaria omni-directional NoC vs. ReSA uni-directional NoC.

5 ReSA Micro-architecture and Design

Section 3 presented ReSA's methodology for splitting tensors into multiple small pieces and mapping each piece onto a sub-array fission. This section introduces the micro-architectures ReSA uses to execute commands from its software scheduler and to optimize the processing of each tiny tensor. It describes the design details of ReSA's multi-dataflow PEs, data path controller, tensor data router, and optimized memory access policy.

5.1 Micro-architecture of the ReSA Heterogeneous Dataflow PE

The PE in the systolic array performs a multiply-accumulate (MAC) operation. However, the MAC logic circuit differs across dataflows. Figure 13 compares the logic circuit design of the MAC operation for different dataflows in a systolic array. Weight-stationary loads weight data and places it in the buffers of the PEs. Then, in Figure 13(a), this weight data is multiplied with the input data and accumulated with the partial sum from the neighboring PE, and a register stores the result of the MAC operation in the PE. Thus, weight-stationary completes the MAC operation \((PartialSum^{\prime } = Input \times Weight + PartialSum)\) in Figure 13(a), where \(PartialSum\) is the partial sum received from the previous PE and \(PartialSum^{\prime }\) is the value passed to the next PE. Unlike weight-stationary, input-stationary stores input data in the buffers of the PEs, as illustrated in Figure 13(b). In contrast to weight-stationary and input-stationary, output-stationary reuses the output data and conducts the MAC operation \((ACC^{\prime } = ACC + Weight \times Input)\) in Figure 13(c), where \(ACC\) is the old value and \(ACC^{\prime }\) is the updated value. Output-stationary stores the partial sum in the accumulator and keeps updating its value every cycle, so it does not need a buffer to store input or weight data as weight-stationary and input-stationary do. To make PEs compatible with weight-stationary, input-stationary, and output-stationary, ReSA adds two MUXes to choose the correct data path for each dataflow, as shown in Figure 14. Table 2 lists the data mapped to each port under each dataflow, where X means the port is unused (a don't-care) under that dataflow. Furthermore, ReSA's PE maintains a buffer to save the input or weight data. Hence, ReSA adds only a small amount of circuitry to adapt to changes in dataflows.
Table 2.
Port | Weight-stationary | Input-stationary | Output-stationary
preload | \(Weight\) | \(Input\) | X
Input_0 | X | X | \(Weight\)
Input_1 | \(Partial Sum\) | \(Partial Sum\) | X
Input_2 | \(Input\) | \(Weight\) | \(Input\)
Output_0 | X | X | \(Weight\)
Output_1 | \(Partial Sum^{\prime }\) | \(Partial Sum^{\prime }\) | X
Output_2 | \(Input\) | \(Weight\) | \(Input\)
Table 2. Port Mapping in Each Dataflow
Fig. 13.
Fig. 13. Comparison of PE micro-architecture.
Fig. 14.
Fig. 14. ReSA’s multi-dataflow PE micro-architecture.
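The following behavioral sketch (not RTL, and only an assumed reading of Figure 14 and Table 2) summarizes how a multi-dataflow PE acts under each dataflow:

class MultiDataflowPE:
    def __init__(self, dataflow):
        self.dataflow = dataflow   # "WS", "IS", or "OS"
        self.stationary = 0        # preloaded weight (WS) or input (IS); unused in OS
        self.acc = 0               # accumulator used by output-stationary

    def preload(self, value):
        self.stationary = value

    def step(self, input_0, input_1, input_2):
        if self.dataflow in ("WS", "IS"):
            # WS: Input_2 = input, Input_1 = partial sum; IS: Input_2 = weight.
            return input_2 * self.stationary + input_1   # PartialSum' to the next PE
        # OS: Input_0 = weight, Input_2 = input; the result stays in the accumulator.
        self.acc += input_0 * input_2
        return self.acc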

5.2 Micro-architecture of the ReSA Data Path Controller

The ReSA data path controller handles the direction of data flowing into sub-arrays. Figure 15 presents the micro-architecture of the ReSA data path controller, which is included in each sub-array of ReSA. The data path controller utilizes a crossbar to direct the input and weight matrices to the ports of a sub-array. In Figure 15, the x-axis of the crossbar represents the flow-in direction, while the y-axis represents the flow-out direction after redirection by the crossbar. The data flowing into the data path controller consists of the weight data, input data, and null (represented by 0). The flow-out directions are the preload, Input_0, and Input_2 ports of the sub-array, which correspond to the preload and Input_0 ports of the PEs in the top row and the Input_2 ports of the PEs in the left-most column, respectively. Therefore, a \(3 \times 3\) crossbar is used for the data path controller. ReSA adjusts the configuration of the crossbar depending on the selected dataflow. For example, in the weight-stationary dataflow, the weight data must be loaded into the buffers of the PEs through the preload ports, so ReSA directs the weight to the preload ports. Additionally, in the weight-stationary dataflow, the input data streams into the sub-array horizontally through the Input_2 ports, so ReSA directs the input data to the Input_2 ports. Since the weight-stationary dataflow does not use Input_0, the data path controller directs 0 to Input_0, as depicted in Figure 15(c).
Fig. 15.
Fig. 15. The ReSA data path controller micro-architecture.
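As a sketch, the crossbar configuration implied by Table 2 and the weight-stationary example above can be expressed as a per-dataflow routing map; the assumption that every unused port is driven with 0 mirrors the Input_0 case described above:

# Assumed 3x3 crossbar configurations: which source drives each sub-array port.
CROSSBAR_CONFIG = {
    "WS": {"preload": "weight", "Input_0": "zero",   "Input_2": "input"},
    "IS": {"preload": "input",  "Input_0": "zero",   "Input_2": "weight"},
    "OS": {"preload": "zero",   "Input_0": "weight", "Input_2": "input"},
}

def route(dataflow, weight, input_, zero=0):
    sources = {"weight": weight, "input": input_, "zero": zero}
    return {port: sources[src] for port, src in CROSSBAR_CONFIG[dataflow].items()}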

5.3 The ReSA Distributed On-chip Buffer

Traditional systolic array-based accelerators, such as the TPU [22], only stream data from the left and the top of the systolic array. This design exhibits under-utilization when the data of a model layer cannot occupy the entire systolic array. Planaria [13] decomposes a systolic array into multiple sub-arrays connected by a sophisticated network, so its dynamic PE fission increases the freedom to pass data within the systolic array. Unlike Planaria, ReSA simplifies the NoCs and uses a data router to dispatch data to a group of sub-arrays.
Figure 7(a) presents ReSA's on-chip buffer and data router architecture. In Figure 7(a), four sub-arrays share an on-chip buffer. ReSA's on-chip buffer stores the input, weight, and output data; it is a banked SRAM with one read-and-write port and applies a write-back policy. Thus, each sub-array writes output data back to the on-chip buffer, and the data returns to off-chip memory afterward. ReSA uses the on-chip buffer to hide the memory access latency of the systolic array through double buffering. The ReSA data router then distributes data to a group of sub-arrays. To bound the complexity of the data router, ReSA groups four sub-arrays into a sub-array group, so the data router is a \(4 \times 4\) crossbar switch [45]. Within a sub-array group, each sub-array connects to its corresponding data router and on-chip buffer through a dedicated network. ReSA also provides connections that link different sub-array groups, so it can perform sub-array fission across groups to increase the freedom of sub-array allocation and satisfy the needs of different DNN model layers.

5.4 The ReSA Interleaving Memory Access Policy

The conventional multi-tenant systolic architecture executes multiple models on the systolic array simultaneously. To shorten the runtime scheduling overhead, previous multi-tenant work [12, 13] tends to pre-load and write back partial data of each running model between DRAM and SRAM at the beginning (initial state) and at the end (final state). However, PEs are idle during the initial and final states, and the problem worsens as the number of running models increases, because the amount of data from the different models grows. Unlike the prior multi-tenant work [12, 13], ReSA serves only a single model at a time but can still execute multiple tiny tensors concurrently on the systolic array. As illustrated in Algorithm 2, to decrease the idle time of PEs during the initial and final states, ReSA fetches only the amount of data needed to start the systolic computation. Then, ReSA enters the middle state, which overlaps memory access with computation. Additionally, only one tile of output data is written back in the final state.
Figure 16 compares the memory arrangement patterns of Planaria and ReSA from the perspective of the SRAM memory. Planaria's policy is illustrated in Figure 16(a). Since the size of the weight, input, and output data is smaller than the allocated SRAM space, all weight and input data are read into SRAM in the initial state, and all output data are written back to DRAM in the final state. Consequently, no memory access latency is hidden by computation latency in the middle state, resulting in a memory wait of 4,308 cycles (2,626 + 1,308 cycles in the initial and final states shown in Figure 16(a)).
Fig. 16.
Fig. 16. Example of memory arrangement pattern of SRAM. The figure illustrates the memory arrangement pattern in the SRAM memory during each state. The SRAM is divided equally to buffer input, weight, and output data, as indicated by the dashed line. The three states, namely, the initial, middle, and final, are shown from left to right. In (a), which represents the Planaria memory arrangement pattern, the memory system loads all input data (green) and weight data (orange) of the layer, totaling 940,032 bytes. During the middle state, there is no communication between DRAM and the SRAM memory. In the final state, it writes back all output data (blue) to DRAM, amounting to 602,112 bytes. In (b), representing the ReSA memory arrangement pattern, the system only loads input data and one tile of weight data in the initial state, totaling 416,128 bytes. In the middle state, parallel to computation, the system continues loading weight data and writes back two tiles of output data. In the final state, the system writes back the last tile, which is 200,704 bytes in size.
Figure 16(b) illustrates the ReSA memory arrangement policy. The ReSA limits the loading of input and weight data to only one tile in the initial state and immediately starts computing in the middle state. In the final state, only one tile of output data remains to be written back. The initial and final states take 1,436 and 561 cycles, respectively. The DRAM access time during the middle state is 2,312 cycles, while the computation latency is 2,034 cycles. Consequently, the memory wait is calculated as \(1{,}436 + (2{,}312 - 2{,}034) + 561 = 2{,}275\) cycles, which is significantly smaller than that of the Planaria [13].
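A worked example of this memory-wait model, using ReSA's cycle counts from Figure 16(b), is sketched below:

# memory wait = initial + max(0, DRAM traffic in the middle state - compute) + final
def memory_wait(initial, middle_dram, compute, final):
    return initial + max(0, middle_dram - compute) + final

# ReSA in Figure 16(b): 1,436 + (2,312 - 2,034) + 561 = 2,275 cycles.
resa_wait = memory_wait(1436, 2312, 2034, 561)
print(resa_wait)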

6 Methodology

Hardware modeling. Table 3 presents the parameters we use to model ReSA's systolic array. We implement our proposed reconfigurable dataflow systolic sub-array in Verilog and synthesize it with Synopsys Design Compiler using the FreePDK 45 nm standard cell library [9] to extract power and area. Our proposed PE includes a 64 B register file. In addition, we model the on-chip SRAM buffer with CACTI [31], which provides its energy and area, and estimate the energy consumption of the off-chip DRAM with the Samsung 2D-DRAM model within CACTI [31]. The energy per off-chip memory access is 12.5 pJ/bit and its bandwidth is 16,384 bits per cycle, from CACTI [31]. Finally, the power consumption of the NoC is estimated with McPAT 1.3 [28] at 0.61 pJ/bit per hop. The baseline is a conventional systolic-based accelerator with an inseparable \(128 \times 128\) systolic array, similar to TPUv1 [22] but with a smaller systolic array. As a result, the baseline, Planaria, and ReSA have the same systolic array size.
Table 3.
Number of PEs | 128 \(\times\) 128
Technology of PEs | 45 nm
PE Clock Frequency | 700 MHz
Number of PEs on a Sub-array | 32 \(\times\) 32
Total Size of On-chip Buffer | 12 MB
Input/Weight Precision | 16 bits
Energy per On-chip Memory Access | 0.61 pJ/bit
Energy per Off-chip Memory Access | 12.5 pJ/bit
On-chip Memory Bandwidth | 16,384 bits/cycle
Off-chip Memory Bandwidth | 2,864 bits/cycle
Table 3. ReSA Micro-architecture Parameters
DNN benchmarks. We evaluate ReSA using nine state-of-the-art DNN models from the domains of image classification [14, 17, 18, 42], object detection [2, 29, 36], translation [46], and recommendation systems [16]. We refer to the workloads found in the SCALE-Sim tool [3, 37, 38] to evaluate our work. ResNet50 [17], GoogleNet [42], FastRCNN [14], YOLO [36], and the transformer [46] consist of many CONV layers. Unlike small NNs such as MobileNet [18] and YOLO-tiny [2], UNet2D [29] contains large input data. In addition, we also examine DNN models that include fully connected (FC) and multi-level perceptron (MLP) operations, such as ResNet50 [17], GoogleNet [42], DLRM [16], and the transformer [46]. The batch size of each DNN model is one in our evaluations.
Simulation infrastructure for ReSA. We develop a cycle-accurate simulator that provides cycle counts and statistics for each DNN model. We build our systolic array simulator on top of the Planaria simulator [12, 13] and refer to SCALE-Sim [37, 38] for calculating the computation latency of the systolic array. In each cycle, a PE stores the incoming data in its register file and forwards it through its uni-directional link in the next cycle. In addition, we model the on-chip memory as double buffers to hide the SRAM access latency.

7 Evaluation

This section evaluates the performance, energy, and other metrics of ReSA against other systolic array-based architectures.

7.1 Performance of DNN Inference

Figure 17 compares the latency of nine DNN models across different systolic array-based architectures. The baseline is a systolic array-based architecture using weight-stationary. Planaria [13] supports dynamic fission with an omnidirectional data path and also uses the weight-stationary dataflow. ReSA\(^P\) denotes ReSA with Planaria's memory access policy. ReSA, ReSA\(^P\), and Planaria achieve geometric mean speedups of 1.87\(\times\), 1.66\(\times\), and 1.52\(\times\) over the baseline. Because Planaria uses the weight-stationary dataflow, which is not always the best choice for the layers of different DNN models, ReSA outperforms Planaria even though Planaria enables dynamic fission. Furthermore, ReSA is better than ReSA\(^P\) because ReSA's memory access policy reduces the number of memory wait cycles. In Figure 17, small models such as MobileNet [18] gain significant speedups on ReSA compared to the baseline; ReSA achieves a speedup of 5.35\(\times\) over the baseline on MobileNet. MobileNet is composed of multiple depthwise convolution layers that contain many \(1 \times 1\) kernels. These layers do not fully occupy the entire systolic array and exhibit the under-utilization problem. A DNN model's layer is divided into multiple tiles, and the baseline processes only one tile at a time. Unlike the baseline, ReSA can distribute these tiles to sub-arrays to improve the utilization of the systolic array. Thus, ReSA is a good fit for DNN models composed of small kernels.
Fig. 17.
Fig. 17. Performance of DNN models on different systolic array architecture.

7.2 Analysis of Memory Wait Latency

Figure 18 compares the memory wait cycles of ReSA to Planaria when the number of concurrent tiny tensors (threads) varies. In Figure 18, both use the same method to determine the size of each tiny tensor in a DNN model. The geometric mean reductions in memory wait cycles for the different numbers of threads are 24%, 23%, 19%, 20%, and 21%, respectively. Unlike Planaria, ReSA accesses only a tiny (split) tensor in the initial and final states, which increases the amount of overlapped processing and reduces the memory wait cycles. However, in Figure 18(h), ReSA reduces the memory wait cycles by only roughly 1% on the UNet2D model compared to Planaria, because the split tensors of UNet2D tend to be large. Since the size of the SRAM memory is limited, ReSA and Planaria load and write back a similar amount of data in the initial and final states when a split tensor becomes large. On the contrary, ReSA reduces memory wait cycles by over 50% on the DLRM model when using 1 and 2 threads, as depicted in Figure 18(g), since its split tensors are small. Note that the amount of memory access is the same for ReSA and Planaria. Unlike Planaria, ReSA hides the memory access latency better by overlapping computation and memory access in the middle state, improving the performance of DNN inference applications.
Fig. 18.
Fig. 18. Comparison of improvement of memory wait cycles.

7.3 Analysis of Energy Efficiency

Figure 19 presents the energy efficiency of the baseline, Planaria [13], and the ReSA accelerator. ReSA and Planaria achieve geometric mean energy efficiencies of 1.28\(\times\) and 1.25\(\times\) over the baseline, respectively. In Figure 19, the energy consumption of the DNN accelerators is the sum of the energy spent on PEs, DRAM, SRAM, and the interconnection network that links the on-chip buffers in ReSA and Planaria. In ReSA, the PEs, SRAM, DRAM, and interconnection network take 1.3%, 13.8%, 84.6%, and 0.03% of the total energy consumption, on average. Memory access accounts for a large portion of the energy consumption, and ReSA and Planaria incur the same amount of memory access. As a result, the energy consumption of ReSA is only about 3% better than Planaria's.
Fig. 19.
Fig. 19. Comparison of energy efficiency on DNN systolic accelerators.

7.4 Analysis of Different ReSA’s Sub-array Size

Figure 20 shows the performance of ReSA with sub-array sizes ranging from \(16 \times 16\) to \(128 \times 128\) PEs. The total number of PEs in the accelerator is fixed at \(128 \times 128\); thus, there are 64 sub-arrays in ReSA\(16 \times 16\) and only 1 sub-array in ReSA\(128 \times 128\). The number of threads allowed in each case is 64, 16, 4, and 1 for \(16 \times 16\), \(32 \times 32\), \(64 \times 64\), and \(128 \times 128\), respectively. In Figure 20, ReSA\(16 \times 16\), ReSA\(32 \times 32\), ReSA\(64 \times 64\), and ReSA\(128 \times 128\) achieve geometric mean speedups of 1.91\(\times\), 1.87\(\times\), 1.74\(\times\), and 1.19\(\times\) over the baseline. ReSA's profiling method determines the number of sub-arrays for each DNN model layer, so fine-grained sub-arrays reduce the internal fragmentation within a sub-array. Fine-grained sub-arrays also improve PE utilization by increasing the number of active sub-arrays. For instance, MobileNet [18] contains multiple depthwise convolution layers composed of tiny \(1 \times 1\) kernels and only a few filters. These layers do not occupy the entire systolic array, or even a single sub-array, so fine-grained sub-arrays reduce the latency of MobileNet by reducing the number of idle PEs. Thus, MobileNet achieves a 6.21\(\times\) speedup in ReSA\(16 \times 16\), but only 1.01\(\times\) in ReSA\(128 \times 128\), over the baseline. However, fine-grained sub-arrays do not significantly improve large DNN models such as UNet2D. To balance the complexity of the NoCs and the performance of the systolic array, ReSA chooses a sub-array size of \(32 \times 32\) PEs.
Fig. 20.
Fig. 20. Performance of ReSA with different sub-array sizes.

7.5 Analysis of Area of ReSA’s Sub-array

The sub-array was synthesized with the FreePDK 45 nm standard cell library [9]. As illustrated in Table 4, the areas of the ReSA and Planaria [13] sub-arrays are \(53{,}586{,}325.428\ \mu m^2\) and \(5{,}280{,}739.909\ \mu m^2\), respectively. The sub-array area of ReSA is 2% larger than Planaria's. Planaria employs a sophisticated on-chip network to realize dynamic sub-array fission, but its PEs can only execute the weight-stationary dataflow. Unlike Planaria [13], ReSA adds circuitry to each PE to support operations of different dataflows. As a result, ReSA achieves a geometric mean speedup of 1.23\(\times\) over Planaria through its simple sub-array fission and the use of different dataflows on each layer of a DNN model, with only 2% area overhead over Planaria.
Table 4.
 | Area (\(\mu m^2\)) | Normalized Area | Normalized Speedup
Planaria [13] | 5,280,739.909 | 1 | 1
ReSA | 53,586,325.428 | 1.02 | 1.23
Table 4. Comparison of Area and Speedup

8 Related Work

This section discusses prior research work on reconfigurable dataflow and dataflow mapping.
Reconfigurable Dataflows: DNN models include various hyper-parameters that change the shape of the models and impose different requirements on hardware resources. Prior work [30, 44, 47, 48, 50] on configurable dataflow aims to increase the utilization and energy efficiency of domain-specific accelerators. FlexFlow [30] presents different mapping strategies for feature-map, neuron, and synapse-level parallelism. The DNA [43] accelerator focuses on placing input, output, and weight reuse patterns on the same fabric. FlexFlow [30] and DNA [43] only work on CNN models. Previous work leverages novel data mappings through the compiler [50] or presents multiple customized dataflows for specific operations [30, 44, 47]. Unlike this work, ReSA enables the execution of different dataflows on a systolic array to cater to the needs of each DNN model layer while minimizing the end-to-end latency of DNN models.
Dataflow Mapping: Modern DNN accelerators often have rigid PE and network-on-chip (NoC) implementations, which makes it hard to map irregular data patterns onto the same substrate while maximizing data reuse and performance. Thus, prior work leverages NoCs [1, 11, 15, 27] and novel scheduling policies [10, 19, 23, 25, 26, 51] to help spatial DNN accelerators keep up with rapidly evolving neural networks. MAERI [27] demonstrates a DNN accelerator that consists of multiple modular and configurable functional blocks. MAERI partitions convolution, RNN, pooling, and fully connected layers into several tiny functional blocks connected via switches in a tree-based network. MAERI achieves better PE utilization than its baseline through a flexible dataflow and supports cross-layer mapping and sparsity. MAERI is composed of a special tree structure rather than the common spatial architecture of multiple PEs. ReSA, in contrast, relies on offline profiling to determine the optimal dataflow of each DNN model layer and changes the implementation of the systolic array to enable heterogeneous dataflow execution.

9 Conclusion

DNN models evolve rapidly. The variations in the hyper-parameters of DNN models make it challenging to map tensor data onto modern DNN accelerators while fully utilizing all PEs. To minimize the execution time of DNN inference applications, ReSA presents a reconfigurable dataflow DNN accelerator that can execute different dataflows on a systolic array. Unlike a conventional systolic architecture, ReSA divides a systolic array into multiple sub-arrays interconnected with an on-chip network and employs PEs that support different dataflows to keep pace with the evolution of DNN models. In addition, ReSA changes the existing memory access policy to increase the overlap of memory access and computation, improving the performance of the systolic architecture on concurrent tiny tensors. As a result, ReSA achieves a geometric mean speedup of 1.87\(\times\) over the baseline systolic architecture.

References

[1]
Dennis Abts, Jonathan Ross, Jonathan Sparling, Mark Wong-VanHaren, Max Baker, Tom Hawkins, Andrew Bell, John Thompson, Temesghen Kahsai, Garrin Kimmell, Jennifer Hwang, Rebekah Leslie-Hurd, Michael Bye, E. R. Creswick, Matthew Boyd, Mahitha Venigalla, Evan Laforge, Jon Purdy, Purushotham Kamath, Dinesh Maheshwari, Michael Beidler, Geert Rosseel, Omar Ahmad, Gleb Gagarin, Richard Czekalski, Ashay Rane, Sahil Parmar, Jeff Werner, Jim Sproch, Adrian Macias, and Brian Kurtz. 2020. Think fast: A tensor streaming processor (TSP) for accelerating deep learning workloads. In Proceedings of the International Symposium on Computer Architecture (ISCA’20). 145–158.
[2]
Pranav Adarsh, Pratibha Rathi, and Manoj Kumar. 2020. YOLO v3-Tiny: Object detection and recognition using one stage improved model. In Proceedings of the 6th International Conference on Advanced Computing and Communication Systems (ICACCS’20). 687–694.
[3]
Anand Samajdar, Jan Moritz Joseph, Yuhao Zhu, Paul Whatmough, Tushar Krishna, Vineet Nadella, and Sachit Kuhar. 2021. scale-sim-v2. Retrieved from: https://github.com/scalesim-project/scale-sim-v2
[4]
Lukas Cavigelli, David Gschwend, Christoph Mayer, Samuel Willi, Beat Muheim, and Luca Benini. 2015. Origami: A convolutional network accelerator. In Proceedings of the 25th Edition on Great Lakes Symposium on VLSI. 199–204.
[5]
Srimat Chakradhar, Murugan Sankaradas, Venkata Jakkula, and Srihari Cadambi. 2010. A dynamically configurable coprocessor for convolutional neural networks. In Proceedings of the International Symposium on Computer Architecture (ISCA’10). 247–257.
[6]
Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the International Symposium on Computer Architecture (ISCA’16). 367–379.
[7]
Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2017. Using dataflow to optimize energy efficiency of deep neural network accelerators. IEEE Micro 37, 3 (2017), 12–21.
[8]
Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting vision processing closer to the sensor. In Proceedings of the International Symposium on Computer Architecture (ISCA’15). 92–104.
[9]
FreePDK45. 2011. Retrieved from: https://eda.ncsu.edu/freepdk/freepdk45/
[10]
Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, and Christos Kozyrakis. 2019. TANGRAM: Optimized coarse-grained dataflow for scalable NN accelerators. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operation Systems (ASPLOS’19). 807–820.
[11]
Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, Albert Ou, Colin Schmidt, Samuel Steffl, John Wright, Ion Stoica, Jonathan Ragan-Kelley, Krste Asanovic, Borivoje Nikolic, and Yakun Sophia Shao. 2021. Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration. In Proceedings of the 58th Annual Design Automation Conference (DAC’21).
[12]
Soroush Ghodrati. 2021. planaria.code. Retrieved from: https://github.com/he-actlab/planaria.code
[13]
Soroush Ghodrati, Byung Hoon Ahn, Joon Kyung Kim, Sean Kinzer, Brahmendra Reddy Yatham, Navateja Alla, Hardik Sharma, Mohammad Alian, Eiman Ebrahimi, Nam Sung Kim, et al. 2020. Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks. In Proceedings of the International Symposium on Microarchitecture (MICRO’20). 681–697.
[14]
Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 1440–1448.
[15]
Sumanth Gudaparthi, Surya Narayanan, Rajeev Balasubramonian, Edouard Giacomin, Hari Kambalasubramanyam, and Pierre-Emmanuel Gaillardon. 2019. Wire-aware architecture and dataflow for CNN accelerators. In Proceedings of the International Symposium on Microarchitecture (MICRO’19). 1–13.
[16]
Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Mark Hempstead, Bill Jia, et al. 2020. The architectural implications of Facebook’s DNN-based personalized recommendation. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA’20). 488–501.
[17]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR’16). 770–778.
[18]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[19]
Qijing Huang, Aravind Kalaiah, Minwoo Kang, James Demmel, Grace Dinh, John Wawrzynek, Thomas Norell, and Yakun Sophia Shao. 2021. COSA: Scheduling by constrained optimization for spatial accelerators. In Proceedings of the International Symposium on Computer Architecture (ISCA’21). 554–566.
[20]
Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, et al. 2021. Ten lessons from three generations shaped Google’s TPUv4i: Industrial product. In Proceedings of the International Symposium on Computer Architecture (ISCA’21). 1–14.
[21]
Norman P. Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson. 2020. A domain-specific supercomputer for training deep neural networks. Commun. ACM 63, 7 (2020), 67–78.
[22]
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the International Symposium on Computer Architecture (ISCA’17). 1–12.
[23]
Sheng-Chun Kao and Tushar Krishna. 2022. MAGMA: An optimization framework for mapping multiple DNNs on multiple accelerator cores. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA’22). 814–830.
[24]
H. T. Kung. 1982. Why systolic architectures? Computer 15, 1 (1982), 37–46.
[25]
Hyoukjun Kwon, Prasanth Chatarasi, Michael Pellauer, Angshuman Parashar, Vivek Sarkar, and Tushar Krishna. 2019. Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach. In Proceedings of the International Symposium on Microarchitecture (MICRO’19). 754–768.
[26]
Hyoukjun Kwon, Liangzhen Lai, Michael Pellauer, Tushar Krishna, Yu-Hsin Chen, and Vikas Chandra. 2021. Heterogeneous dataflow accelerators for multi-DNN workloads. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA’21). 71–83.
[27]
Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operation Systems (ASPLOS’18). 461–475.
[28]
Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the International Symposium on Microarchitecture (MICRO’09). 469–480.
[29]
Xiaomeng Li, Hao Chen, Xiaojuan Qi, Qi Dou, Chi-Wing Fu, and Pheng-Ann Heng. 2018. H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Trans. Med. Imag. 37, 12 (2018), 2663–2674.
[30]
Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. 2017. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA’17). 553–564.
[31]
Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. 2009. CACTI 7.0: A tool to model large caches. HP Laboratories 27 (2009), 28.
[32]
NVIDIA Corp. 2022. NVDLA Open Source Project. Retrieved from: http://nvdla.org
[33]
Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A. Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W. Keckler, and Joel Emer. 2019. Timeloop: A systematic approach to DNN accelerator evaluation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’19). 304–315.
[34]
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. In Proceedings of the International Symposium on Computer Architecture (ISCA’17). 27–40.
[35]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.
[36]
Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[37]
Ananda Samajdar, Jan Moritz Joseph, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2020. A systematic methodology for characterizing scalability of DNN accelerators using SCALE-sim. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’20). IEEE, 58–68.
[38]
Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-sim: Systolic CNN accelerator simulator. arXiv preprint arXiv:1811.02883 (2018).
[39]
Yakun Sophia Shao, Jason Clemons, Rangharajan Venkatesan, Brian Zimmer, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, et al. 2019. Simba: Scaling deep-learning inference with multi-chip-module-based architecture. In Proceedings of the International Symposium on Microarchitecture (MICRO’19). 14–27.
[40]
Yongming Shen, Michael Ferdman, and Peter Milder. 2017. Maximizing CNN accelerator efficiency through resource partitioning. In Proceedings of the International Symposium on Computer Architecture (ISCA’17). 535–547.
[41]
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2020. Efficient processing of deep neural networks. Synth. Lect. Comput. Archit. 15, 2 (2020), 1–341.
[42]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR’15). 1–9.
[43]
F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei. 2017. Deep convolutional neural network architecture with reconfigurable computation patterns. IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 25, 8 (2017), 2220–2233.
[44]
Jian Weng, Sihao Liu, Zhengrong Wang, Vidushi Dadu, and Tony Nowatzki. 2020. A hybrid systolic-dataflow architecture for inductive matrix algorithms. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA’20). 703–716.
[45]
Panduka Wijetunga. 2003. High-performance crossbar design for system-on-chip. In Proceedings of the 3rd IEEE International Workshop on System-on-Chip for Real-Time Applications. IEEE, 138–143.
[46]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 38–45.
[47]
Rui Xu, Sheng Ma, Yaohua Wang, Xinhai Chen, and Yang Guo. 2021. Configurable multi-directional systolic array architecture for convolutional neural networks. ACM Trans. Archit. Code Optim. 18, 4, Article 42 (2021), 24 pages.
[48]
Jiaqi Yang, Hao Zheng, and Ahmed Louri. 2022. Adapt-flow: A flexible DNN accelerator architecture for heterogeneous dataflow implementation. In Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI’22). Association for Computing Machinery, New York, NY, 287–292.
[49]
Hoi-Jun Yoo, Seongwook Park, Kyeongryeol Bong, Dongjoo Shin, Jinmook Lee, and Sungpill Choi. 2015. A 1.93 TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big data applications. In Proceedings of the International Solid State Circuits Conference (ISSCC’15). 80–81.
[50]
Yaqi Zhang, Nathan Zhang, Tian Zhao, Matt Vilim, Muhammad Shahbaz, and Kunle Olukotun. 2021. SARA: Scaling a reconfigurable dataflow accelerator. In Proceedings of the International Symposium on Computer Architecture (ISCA’21). 1041–1054.
[51]
Zhongyuan Zhao, Hyoukjun Kwon, Sachit Kuhar, Weiguang Sheng, Zhigang Mao, and Tushar Krishna. 2019. MRNA: Enabling efficient mapping space exploration for a reconfiguration neural accelerator. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS’19). 282–292.
