4.1 Instruction Set Architecture (ISA)
In this section, we describe the PIMSAB ISA, including Compute, Data Transfer, and Synchronization instructions.
Compute Instructions. Compute instructions support arithmetic and logical operations (add, mult, or, and, xor, max, and min), operate on data in the CRAMs, and are vectorized across bitlines. We also support inter-bitline instructions, like shifting data across bitlines (shift). Instructions to reduce data within a CRAM (reduce_cram) and across the CRAMs in a tile (reduce_tile) are also provided. We also provide an instruction, set_mask, which copies the data of a wordline into the mask latches in the PEs, to enable predicating operations per bitline. Additionally, each instruction has a field to specify what should be used for predication—the mask latch or the carry latch (Section 4.2 describes the PE architecture, including the latches). The precision of each operand can be expressed in the instruction through the pr* fields. Exposing precision through the ISA gives the programmer more control over the benefits of adaptive precision. In most cases, compute instructions are executed across all the CRAMs in a tile, but we also provide a field (called size) to specify the number of bitlines involved in the operation across the tile.
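For concreteness, the following C++ sketch models the fields that a compute instruction carries; the struct, enum, and field names are our own illustration, not PIMSAB's actual encoding.

```cpp
#include <cstdint>

// Hypothetical model of a PIMSAB compute instruction's fields
// (names and widths are illustrative, not the actual encoding).
enum class Opcode : uint8_t { Add, Mult, Or, And, Xor, Max, Min, Shift,
                              ReduceCram, ReduceTile, SetMask };
enum class PredSrc : uint8_t { None, MaskLatch, CarryLatch };

struct ComputeInstr {
  Opcode   op;        // arithmetic/logical operation, vectorized across bitlines
  uint16_t src1_row;  // wordline address of operand 1 in the CRAM
  uint16_t src2_row;  // wordline address of operand 2 in the CRAM
  uint16_t dst_row;   // wordline address of the result
  uint8_t  pr_src1;   // operand precisions in bits (the pr* fields)
  uint8_t  pr_src2;
  uint8_t  pr_dst;
  PredSrc  pred;      // predicate on the mask latch or the carry latch
  uint32_t size;      // number of bitlines involved across the tile
};
```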
Operations with Scalars or Constants. For the multiplication operation, a special instruction called mul_const is provided, where one operand comes from the RF (a scalar or constant) instead of being replicated in the CRAM. This instruction skips zero bits in the constant operand in the RF, reducing execution time.
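As a rough illustration of why zero-skipping helps (this is our own simplified cycle model, not the paper's exact timing), a bit-serial multiply issues roughly one shifted addition per set bit of the constant:

```cpp
#include <cstdint>
#include <iostream>

// Rough cycle model: one shifted addition is issued per set bit of the
// constant, so zero bits of the RF operand cost nothing.
uint32_t mulConstAdditions(uint32_t constant, unsigned precision_bits) {
  uint32_t additions = 0;
  for (unsigned b = 0; b < precision_bits; ++b)
    if ((constant >> b) & 1u)   // skip zero bits of the constant
      ++additions;
  return additions;
}

int main() {
  // Multiplying by 0b00010001 needs only 2 shifted additions instead of 8.
  std::cout << mulConstAdditions(0b00010001, 8) << "\n";
}
```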
Data Transfer Instructions. These instructions are used to move data between the DRAM, the CRAMs, and the RF. Specifically, we support bidirectional data transfer between DRAM and CRAMs (load and store), as well as between DRAM and the RF (load_rf and store_rf). We support transferring data between CRAMs within a tile (cram_tx_rx), and transferring data between tiles (tile_tx and tile_rx). Such communication blocks the receiver until the data arrives. Broadcasting from one tile to other tiles is supported by a library function called tile_bcast. Two modes of broadcasting are supported—(1) one_to_many, in which one tile sends data to all receivers, and (2) systolic, in which each tile receives data from one neighbor and passes it to another neighbor.
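The difference between the two broadcast modes can be summarized with the following sketch (our own abstraction of the data movement, assuming tile 0 is the source and the tiles form a linear chain):

```cpp
#include <cstdio>

// Illustrative comparison of the two tile_bcast modes.
void broadcastOneToMany(int num_tiles) {
  for (int t = 1; t < num_tiles; ++t)
    std::printf("tile 0 -> tile %d\n", t);          // source sends to every receiver
}

void broadcastSystolic(int num_tiles) {
  for (int t = 0; t + 1 < num_tiles; ++t)
    std::printf("tile %d -> tile %d\n", t, t + 1);  // each tile forwards to its neighbor
}

int main() {
  broadcastOneToMany(4);
  broadcastSystolic(4);
}
```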
Data Shuffling Instructions. When loading data from DRAM into CRAMs, shuffling can be enabled by using the load_shf instruction. The shf argument specifies the shuffling pattern, and the bcast bit specifies whether broadcast is enabled (see the Shuffle Logic subsection of Section 4.2 for details). Furthermore, we provide the capability to shuffle data that is already stored within a CRAM, using the cram_local_shf instruction.
Synchronization Instructions. These instructions coordinate data transfers and computations among tiles. Two synchronization instructions are provided: signal and wait. signal sends a message from a source CRAM to a destination CRAM and is non-blocking. A CRAM can wait for a message (blocking) from a source CRAM using the wait instruction.
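A minimal sketch of these semantics, modeled in C++ as a counting channel (our own software abstraction; the hardware does not use such a primitive), is shown below:

```cpp
#include <condition_variable>
#include <mutex>

// signal is non-blocking; wait blocks until at least one message has arrived.
class SyncChannel {
  std::mutex m_;
  std::condition_variable cv_;
  unsigned pending_ = 0;
 public:
  void signal() {                       // non-blocking: post a message
    { std::lock_guard<std::mutex> lk(m_); ++pending_; }
    cv_.notify_one();
  }
  void wait() {                         // blocking: consume one message
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return pending_ > 0; });
    --pending_;
  }
};
```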
Transposing Data. In load and store instructions, besides the source address, destination address, size, and precision, there is an additional tr field specifying whether the data is transposed. This is useful when, e.g., an immediate/constant operand read from the main memory does not need to be transposed.
Program Example. A simple elementwise vector multiplication is shown in Listing 1. The program generates an instruction stream for all tiles in the chip (NUM_TILES). Two operand arrays, each with elements of precision int8, are loaded from the main memory. vec_width is specified to be the full width of a tile. A multiplication instruction then generates a result with precision int16. The result is then stored back in the main memory.
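Since Listing 1 is not reproduced here, the following is a hedged sketch of what such a program could look like; the intrinsic names and signatures are our own stand-ins (declared as stubs so the sketch is self-contained), not PIMSAB's actual API.

```cpp
#include <cstdint>

// Stand-in intrinsics, declared as stubs for illustration only.
void pim_load (int tile, uint64_t dram_addr, int cram_row, int width, int pr) {}
void pim_mul  (int tile, int src1, int src2, int dst, int pr_src, int pr_dst) {}
void pim_store(int tile, int cram_row, uint64_t dram_addr, int width, int pr) {}

constexpr int NUM_TILES = 64;    // assumed number of tiles, illustration only
constexpr int vec_width = 256;   // assumed full width of a tile

void elementwise_mul(uint64_t a, uint64_t b, uint64_t c) {
  for (int t = 0; t < NUM_TILES; ++t) {
    pim_load(t, a, /*cram_row=*/0, vec_width, /*pr=*/8);          // int8 operand A
    pim_load(t, b, /*cram_row=*/8, vec_width, /*pr=*/8);          // int8 operand B
    pim_mul (t, 0, 8, /*dst=*/16, /*pr_src=*/8, /*pr_dst=*/16);   // int16 result
    pim_store(t, 16, c, vec_width, /*pr=*/16);                    // write back
  }
}
```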
4.2 Microarchitecture
Here we discuss PIMSAB’s microarchitecture. Table 2 provides a list of hardware parameters.
CRAMs. We employ dual-ported compute-enabled RAMs (called CRAMs) similar to CoMeFa [5]. A CRAM has two modes: compute and memory. In compute mode, a data word written into the memory is treated as a micro-op. Each micro-op takes 1 cycle, during which two wordlines are read (one on each port), computation is performed in the PEs, and the result is written into a wordline. In memory mode, the CRAM behaves like a normal RAM. CRAMs are grouped into tiles; all CRAMs in a tile execute in lock-step in a SIMD fashion (except when executing CRAM-to-CRAM data transfers or inter-CRAM intra-tile reductions). CRAMs in a tile are connected using the intra-tile network. In addition, there is a single-wire ring interconnect between all CRAMs in a tile to facilitate shift instructions.
Processing Element (PE). PIMSAB adopts the PE architecture from CoMeFa [5], as shown in Figure 5. Each PE can perform any logical operation between two operands, using the TR mux. The TR mux makes the PE more flexible compared to [11]. With the addition of an XOR gate (X), it can perform a 1-bit full adder operation. A carry latch (C) is used to store the carry-out, which can be used as the carry-in for the next timestep. The output of the TR mux can be stored in the mask latch (M). Predication based on mask bits and carry bits is supported, through the predication mux (P). There are as many PEs in a CRAM as there are bitlines. The operation performed by the PE is governed by the micro-op received by the CRAM from the instruction controller. The micro-op bits are decoded by the CRAM’s sequencing logic to generate the various control signals in the PE. The write select muxes W1 and W2 select what to write back to the bitlines using the write drivers: data from the left or right PEs, or the sum or carry output calculated by this PE. In each cycle, the PE receives two operand bits from the sense amplifiers and performs the operation specified by the micro-op. The result of the operation is then written back into the CRAM through the write drivers (unless predicated off).
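The per-cycle behavior of a full-adder micro-op can be captured with the following behavioral sketch (our own simplification of the CoMeFa-style PE; the interface and the exact predication timing are assumptions):

```cpp
// Behavioral model of one PE executing one cycle of a bit-serial add.
struct PE {
  bool carry = false;   // C latch: carry-out, reused as carry-in next cycle
  bool mask  = true;    // M latch: per-bitline predicate, loaded by set_mask

  // Process one bit position of operands a and b; 'use_mask' selects whether
  // predication uses the mask latch or the carry latch (simplified).
  bool addCycle(bool a, bool b, bool use_mask, bool& write_enable) {
    write_enable = use_mask ? mask : carry;        // predication mux P
    bool partial = a ^ b;                          // TR mux configured as XOR
    bool sum     = partial ^ carry;                // extra XOR gate X -> sum bit
    carry        = (a & b) | (partial & carry);    // new carry-out into C latch
    return sum;                                    // W1/W2 would select sum (or carry)
  }
};
```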
Instruction Controller. Instructions are received from the host over PCIe. Each tile has an instruction controller to decode instructions and farm out execution to the corresponding units. For compute instructions (add, multiply, reduction, etc.), it generates micro-ops for the CRAMs every cycle. For data transfer instructions (CRAM-to-CRAM transfers, tile-to-tile transfers, DRAM transfers), it reads the CRAMs and sends data into the static network’s switches, and also writes data coming in from the switches into the CRAMs.
Inter-Tile Dynamic Network. The inter-tile network uses a standard wormhole-switched dynamic NoC with X-Y routing. Each router, shown in Figure 7, has a crossbar connecting five input and output ports—local tile, north, south, west, and east. Routers connected to DRAM have an extra input/output port to receive/send data from/to DRAM. The transferred data is broken into flits (flow control units). Each input port has multiple circular queues to buffer incoming flits into multiple virtual channels. Upon sending, header and data flits are pushed into a circular queue of the local tile port one after another. Each router performs wormhole switching on the incoming flits. Upon decoding the flit header, the router controller performs minimal routing to route incoming flits towards their destination. Upon receiving, data flits are popped from one of the input queues selected by the crossbar. Due to the simple flow control and routing strategy, and the small flit and queue sizes, the area and power overheads of the routers are minimized.
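Dimension-ordered X-Y routing, as used by these routers, can be summarized with the following sketch (our own illustration; the port naming and coordinate conventions are assumed):

```cpp
#include <utility>

// A flit first travels along X until the column matches, then along Y.
enum class Port { Local, North, South, West, East };

Port routeXY(std::pair<int,int> here, std::pair<int,int> dest) {
  if (dest.first  > here.first)  return Port::East;   // move along X first
  if (dest.first  < here.first)  return Port::West;
  if (dest.second > here.second) return Port::North;  // then along Y
  if (dest.second < here.second) return Port::South;
  return Port::Local;                                  // arrived: eject to the tile
}
```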
We choose a mesh topology to connect the tiles, as opposed to a ring topology. A mesh topology helps with scalability and reduces the burden on the compiler. Section 7.7 shows the advantage of the mesh topology compared to the ring topology.
Intra-Tile Static Network. The intra-tile network is a static circuit-switched network using an H-Tree topology. This is similar to a hierarchical FPGA [2, 39], but with a much smaller configuration overhead because of the coarser granularity (word-level instead of bit-level).
Figure 6 provides the details of the microarchitecture of a switch in the intra-tile network. Each switch is a buffered crossbar with five ingress (input) and egress (output) ports. Each output port can be driven by the other four input ports—three from other directions at the same hierarchical level and the fourth from the next level of hierarchy (shown using separate colors in the figure). There is one switch at the top of the tile that interfaces with the NoC router of the inter-tile network. For a tile with 256 CRAMs, there are four levels of switches, for a total of 1 + 4 + 16 + 64 = 85 switches. The last set of 64 switches is connected to the 256 CRAMs. The intra-tile network reconfigures its switches when an incoming data transfer instruction indicates a new communication pattern.
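The switch count follows directly from the branching factor of 4, as the short check below shows (assuming, as stated above, 256 CRAMs per tile):

```cpp
#include <iostream>

// H-Tree with branching factor 4: 256 CRAMs need four switch levels,
// 1 + 4 + 16 + 64 = 85 switches in total.
int main() {
  int crams = 256, switches = 0;
  for (int level_width = 1; level_width * 4 <= crams; level_width *= 4) {
    switches += level_width;
    std::cout << "level with " << level_width << " switches\n";
  }
  std::cout << "total switches: " << switches << "\n";  // prints 85
}
```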
Reductions. Reductions can be performed on data within a single CRAM, using an algorithm similar to [11]. We refer to this as intra-CRAM reduction. This method requires iteratively shifting values across bitlines and adding them. Moving bits to adjacent bitlines takes 1 cycle, but moving bits to a bitline N bitlines away takes N cycles, because each bit has to be shifted cycle by cycle, based on the connectivity across PEs provided in the CRAMs. The number of cycles consumed in the reduction operation is therefore linearly related to the distance (in bitlines). Furthermore, the number of bitlines utilized decreases as the reduction operation progresses, and the result is available in the first bitline of the CRAM.
The H-Tree topology in PIMSAB facilitates pairwise reduction across multiple CRAMs. We refer to this as inter-CRAM (or intra-tile) reduction. Data to be reduced are transferred across pairs of CRAMs through the levels of the H-Tree and added at each level. Therefore, the reduction time is logarithmically related to the number of CRAMs that the operand occupies. As a result, inter-CRAM reduction is more efficient than intra-CRAM reduction and is prioritized by our compiler. The number of utilized CRAMs in a tile decreases as the reduction operation progresses, and the results are available in the first CRAM of each tile.
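The contrast between the two styles can be made concrete with a back-of-the-envelope model (our own simplification, not the paper's exact cycle counts):

```cpp
#include <cstdio>

// Intra-CRAM reduction pays shift cycles proportional to the bitline distance
// at every step; inter-CRAM reduction needs only log2(#CRAMs) pairwise add
// levels over the H-Tree.
unsigned intraCramShiftCycles(unsigned elements) {
  unsigned cycles = 0;
  for (unsigned stride = 1; stride < elements; stride *= 2)
    cycles += stride;      // shifting by 'stride' bitlines costs 'stride' cycles
  return cycles;           // ~elements-1 shift cycles, plus the adds
}

unsigned interCramLevels(unsigned crams) {
  unsigned levels = 0;
  while (crams > 1) { crams /= 2; ++levels; }  // one pairwise add per H-Tree level
  return levels;
}

int main() {
  std::printf("intra-CRAM shifts for 256 elements: %u cycles\n",
              intraCramShiftCycles(256));                        // 255
  std::printf("inter-CRAM levels for 256 CRAMs:    %u\n",
              interCramLevels(256));                             // 8
}
```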
Shuffle Logic. Operations like GEMM and convolution can greatly benefit from custom data layout patterns. For example, we may need a data element to be duplicated in each bitline or repeated every four bitlines in a CRAM. These custom layout patterns can be achieved by data duplication in the CRAM, thereby avoiding unnecessary traffic to/from DRAM. We refer to this duplication of data in various layouts as shuffling. We implement dedicated hardware in PIMSAB to enable efficient shuffling of data. This hardware is implemented in two parts. The first part is implemented at the input of each tile, through careful modifications to the structure of the top-level intra-tile switch. This hardware provides the capability to broadcast data received at the top of the tile to each CRAM in the tile. The second part is implemented at the input of each CRAM, where additional multiplexing hardware enables common data patterns observed in the DL benchmarks.
Figure 8 shows the modifications made to the top-level intra-tile switch to support shuffling. The data coming from the NoC (sky-blue circle) goes to all the ports as shown previously in Figure 6, but an additional red multiplexer is now added on each port. The first input of all the red multiplexers is data bits 255:0. Thus, the least significant 256 bits of the 1,024 bits received at the NoC router are broadcast to all four ports and flow through the intra-tile network of switches to the CRAMs. The second input of the red multiplexer is the normal path, through which a different set of bits received from the NoC router is sent downstream to the CRAMs through the intra-tile network. The selection between these inputs is controlled by a broadcast enable bit. If broadcast is enabled, all ports (and hence all the CRAMs in the tile) receive the same 256 bits of data. This broadcast enable is exposed to the compiler through the bcast argument in the load_shf instruction.
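The effect of the broadcast muxes can be summarized with the following sketch (our own model of the datapath, with byte arrays standing in for the 256-bit and 1,024-bit buses):

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// With bcast enabled, every port gets bits 255:0 of the 1,024-bit NoC word;
// otherwise each port gets its own 256-bit slice (the normal path).
using Word256  = std::array<uint8_t, 32>;    // 256 bits
using Word1024 = std::array<uint8_t, 128>;   // 1,024 bits

std::array<Word256, 4> topSwitchOutputs(const Word1024& from_noc, bool bcast) {
  std::array<Word256, 4> ports{};
  for (int p = 0; p < 4; ++p) {
    const uint8_t* src = bcast ? from_noc.data()            // bits 255:0 to all ports
                               : from_noc.data() + 32 * p;  // per-port slice
    std::memcpy(ports[p].data(), src, 32);
  }
  return ports;
}
```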
Figure 9 illustrates the additional hardware designed to shuffle bits at the periphery of each CRAM. This unit enables data shuffling in four distinct patterns. The source selector multiplexer, shown in green, chooses between the data coming from the leaf-level intra-tile switch and the output of the CRAM itself. The former is used when data originating from outside the CRAM (in DRAM, another tile, or another CRAM) is being written into the CRAM. Currently, we only support shuffling data loaded from DRAM, through the load_shf instruction. The latter is used when data already present in a CRAM needs to be shuffled and written back to the same CRAM. This path is supported through the cram_local_shf instruction. The shuffle pattern selector multiplexer, shown in blue, allows selecting between four data patterns generated by shuffling the bits of the data output by the source selector. For instance, the first pattern duplicates the 0th bit 256 times, while the fourth pattern duplicates bits 7:0 32 times. The functionality of this multiplexer is exposed to the compiler through the shf argument in the load_shf and cram_local_shf instructions. Finally, the shuffle enable bit of the orange multiplexer selects whether shuffling is enabled, and the resulting data is written into the CRAM. Shuffling is disabled when the load_shf and cram_local_shf instructions are not used.
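The two shuffle patterns that are described explicitly can be modeled as follows (the bit-indexing convention is our assumption, and the remaining two patterns are not specified here, so the sketch covers only these two):

```cpp
#include <bitset>

// Pattern 1: duplicate bit 0 across all 256 bit positions.
std::bitset<256> shufflePattern1(const std::bitset<256>& in) {
  return in[0] ? std::bitset<256>{}.set() : std::bitset<256>{};
}

// Pattern 4: repeat bits 7:0 thirty-two times across the 256 bits.
std::bitset<256> shufflePattern4(const std::bitset<256>& in) {
  std::bitset<256> out;
  for (int rep = 0; rep < 32; ++rep)
    for (int b = 0; b < 8; ++b)
      out[rep * 8 + b] = in[b];
  return out;
}
```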
DRAM Interface and Transpose Unit. All tiles in the top row of the mesh NoC are connected to DRAM controllers. The data from DRAM must be transposed before being stored into the CRAMs, so that bit-serial arithmetic can be performed. Results need to be un-transposed when written back. In PIMSAB, transpose units are integrated within the DRAM controllers. This unit can be disabled if not needed (through the tr field of the DRAM load/store instructions). Common situations where transposing is not required include loading the RF and reading/writing spilled data during operations. Another example is deep learning, where we enable transposing for input activations but disable it for weights, as weights can be stored pre-transposed in DRAM. We use the transpose unit shown in Figure 10, similar to CoMeFa’s [5]. It employs a ping-pong FIFO. Data enters from one side into the ping part in non-transposed format. When the ping part is full, the transposed output is obtained by reading bit slices of the loaded elements, while the pong part is filled with new data. When the pong part is full, the roles are reversed and the process repeats. The bandwidth of each DRAM channel in PIMSAB is 1,024 bits per cycle. There are 32 transpose units for each DRAM channel, and 32 bits can be read from one transpose unit in 1 cycle. At the highest achievable bandwidth, the transpose unit adds a latency of 32 cycles. Data of different precisions can be handled by the transpose unit.
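The transpose step itself (ignoring the ping-pong double buffering) amounts to reading out bit slices, as in the sketch below; the 32x32 block size follows from the 32-bit, 32-unit numbers above, while the exact bit ordering is our assumption:

```cpp
#include <array>
#include <cstdint>

// A block of 32 x 32-bit elements is read out as 32 bit-slices, so that bit i
// of every element ends up in the same output word, ready for bit-serial use.
std::array<uint32_t, 32> transpose32(const std::array<uint32_t, 32>& elems) {
  std::array<uint32_t, 32> slices{};
  for (int bit = 0; bit < 32; ++bit)
    for (int e = 0; e < 32; ++e)
      slices[bit] |= ((elems[e] >> bit) & 1u) << e;  // element e contributes one bit
  return slices;
}
```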
Register File and Operations with Constants or Scalars. Many applications, including DL, rely heavily on constant operations like vector-scalar multiplication. Instead of replicating the constant operand in all bitlines as in Figure 11(a), PIMSAB holds the constant operand in a register file (RF) present in every tile. Figure 11(b) shows the operation of the mul_const instruction. After the instruction is decoded ➊, the instruction controller fetches the scalar operand from the RF ➋ and sends micro-ops to the CRAMs according to the bits of the constant that are set. When a bit of the constant is zero, the corresponding computations and micro-ops can be skipped ➌. Finally, the CRAMs execute the micro-ops to perform the computation ➍. Besides exploiting such bit-level sparsity, constant operations also save CRAM space and reduce data spilling to DRAM. The RF is flip-flop-based and does not have any port restrictions; any number of registers can be read and written in each cycle. The instructions load_rf and store_rf are provided to load/store the RF from/to DRAM.