1 Introduction
Modern-day applications such as machine learning, image processing, and encryption are widely employed on real-time embedded systems as well as back-end data centers. Since the amount of data that has to be processed by these types of applications has increased considerably (e.g., BERT has around 110 million parameters [
15]), the performance and power consumption of the systems that execute them have become crucial and draw increasing attention. Traditional von-Neumann architectures have been stretched to attain adequate performance at reasonable energy levels, but are clearly showing limitations for further improvements. The main limitation is the conceptual separation of the processing unit and its memory, which makes the data movement between memory and processing unit the main performance and energy bottleneck [
25,
36]. For instance, communication between off-chip DRAM and the processor takes up to 200 cycles [
62] and consumes more than two orders of magnitude more energy than a floating-point operation [
1,
14,
17]. Consequently, new architectures need to be sought to achieve the performance levels required for these new applications by means of mitigating this costly communication.
To address this challenge, early research efforts [
12,
47] suggested integrating more on-chip memory next to the processing unit, which imposes significant fabrication costs on the system. One of the promising solutions to this challenge is performing computation within or near the memory rather than moving the data to the processing unit. Several valuable works focus on near-memory computing, especially after 3D stacking technology was introduced, in which computation is performed in the DRAM chip by adding extra logic next to the memory structure [
2,
3,
10,
19,
34,
45]. However, recently,
computation in-memory (CIM), as a concept, has gained huge interest from the research community, as it combines memory storage and actual computing in the same memory array. This is achieved by exploiting the special characteristics of emerging non-volatile memories called memristors, such as
resistive RAM (ReRAM) [
26],
phase change memory (PCM) [
40], and
spin-transfer torque magnetic RAM (STT-RAM) [
30]. Regardless of which technology is used for fabrication, memristor devices offer great scalability, high density, near-zero standby power, and non-volatility [
26,
63,
74]. Accordingly, memristor technology with the aforementioned characteristics opens up new horizons toward new ways of computing and computer architectures.
Until now, the main focus of researchers has been to enhance the characteristics of memristor devices, such as latency and endurance [
8,
29,
32,
51,
74]. Researchers have already proposed different innovative circuit designs based on memristor devices to exploit their capabilities of co-locating computation and storage together [
7,
31,
37,
42,
43,
44,
46,
61,
76]. Moreover, huge parallelism can be flexibly achieved within a single memory array as well as at the inter-array level, as each memory tile becomes a powerful computation core. It was demonstrated that, due to these two main features of memristor-based designs, significant energy and performance improvements can be gained [
28]. It is widely accepted that the dot-product (and, in turn, the matrix-matrix multiplication) operation is the one most suited to memristor-based designs. Consequently, convolutional and deep neural networks are the potential applications that have been most widely studied by researchers to exploit memristive crossbar structures [
9,
57,
64,
66,
67,
73]. However, researchers have also proposed other types of operations, e.g., Boolean operations [
56,
60,
72] or arithmetic operations like additions [
58]. More information on existing research regarding device characteristics and potential applications for in-memory computing can be found in References [
49,
53].
Having said this, there is no work in the research community that allows for easy comparison of these supported operations at the application-kernel level between different technologies (e.g., ReRAM, PCM, and STT-RAM), nor is it possible to emulate complex operations, e.g., matrix-matrix multiplication, when the underlying technology does not allow for a direct implementation. Furthermore, the interaction between the analog memory array and its supporting digital periphery is largely overlooked.
In this article, we generalize existing CIM-tile architectures (a CIM tile comprises the analog array itself and all necessary analog and digital periphery) to introduce a generic execution model that allows operations in the digital periphery and the analog array to proceed in parallel through pipelining. In this model, a new instruction set architecture (ISA) is introduced with the twin objectives of orchestrating the digital and analog components of memristor tiles and bridging the gap between high-level programming languages and the CIM architecture. By rescheduling the proposed fine-grained instructions, maximum flexibility in executing a program targeting the CIM tile can be attained. Our compiler, written in C++, lowers high-level kernels that are intended to execute on the CIM tile down to in-memory instructions at compile time. The compiler is aware of the architecture configuration, the technology constraints, and the datatype size requested by the application. Based on this information, the required in-memory instructions are generated in the proper sequence.
The execution model is designed to work with different memristor technologies as well as different configurations/circuits for the peripheral devices. To enable design-space exploration targeting performance and energy, we designed a modular simulator written in SystemC in which the designer can track all the data and control signals. Timing and power numbers of the modules within the CIM architecture are obtained from low-level models. These numbers configure specific parameters of the simulator.
In conclusion, the contributions of our work are summarized below:
•
we generalize existing CIM tile architectures and execution models for in-memory computing, which enables an application to leverage a variety of in-memory operations. Additionally, we propose a pipelining approach for an architecture that has digital and analog modules, which can have widely varying (unbalanced) latencies. Finally, the architecture can be easily extended with additional peripheral modules.
•
we define a new and generic in-memory ISA for the aforementioned design to obtain maximum flexibility while dealing with different constraints, configurations, and requirements. Furthermore, we develop a compiler that is able to translate higher-level operations (e.g., defined in higher-level programming languages) into a sequence of instructions in our ISA. The ISA can also be extended when additional peripheral modules are added.
•
we develop a fully parameterized simulator that is capable of executing the newly introduced instructions and simulating our CIM tile architecture. Design parameters (e.g., height/width of the memory array, type and number of peripheral modules like the Analog-to-Digital Converter (ADC)) as well as technology-dependent parameters (e.g., ReRAM/PCM/STT-RAM, latencies, number of bits per cell, power utilization per module, clock frequency of the digital periphery) are supported.
•
we perform an initial set of design-space explorations based on well-accepted design parameters and current-day memristor technologies and present the results.
3 CIM-tile Architecture
A CIM tile can either be employed as a standalone accelerator or integrated into a conventional computer architecture. Accelerators are designed to lessen the execution time of certain functionalities that are frequently used by the intended applications. With respect to their degree of flexibility, they can be designed and implemented for dedicated functionalities or they can be flexible enough to support a wider range of tasks [
5,
11,
65]. In the context of in-memory computing, however, the accelerators are storage units enabled to perform certain functionalities. This is aligned with the main goal in this context, which is reducing data movement. Figure
2 depicts one potential way in which a CIM tile can be seen as an off-/on-chip accelerator from the CPU. In this architecture, CIM tiles are not integrated into the memory hierarchy of the system and are allocated to a different address space. In this section, we focus on the architecture of our CIM tile and how it is organized using an in-memory instruction-set architecture. Afterwards, we present our compiler, which is capable of translating higher-level programs, e.g., matrix-matrix multiplication, into a sequence of instructions of the newly introduced ISA. Finally, we propose the execution model that allows pipelining at the tile level between the analog and digital peripheries.
3.1 Overview
As mentioned earlier, we focus on the CIM-P 1T1R structure in which the (computational) results of the (memory) array operations are captured in the (digital) periphery. Figure
3 depicts the architecture of the CIM tile, which includes the required components and signals that can control digital or analog data. The operations that can be executed on the crossbar are divided into two categories: (1) write and (2) read and computational operations. The computational operations include
addition, multiplication, and
logical operations. In the following, we will describe our tile architecture and its main modules considering these two categories (the discussion of the control signals is left to the subsequent section):
(1)
Write operation. To write data to the memristor crossbar, we have to specify in which row and column the data has to be written. Therefore, three registers are employed to capture this information. The data itself has to be written to the Write Data (WD) register whose length depends on the width of the crossbar as well as the number of levels supported by the memristor cells. Considering endurance issues and potential energy savings, it is not always necessary to write data to all the array columns. For this purpose, the Write Data Select (WDS) register is used to select which columns should be activated. This is especially relevant when considering implementing a write-verify operation. Finally, the Row Select (RS) register is employed to activate the row in which data has to be written. A more detailed description is presented in the following.
The voltages that have to be applied to the crossbar depend on the crossbar technology and are usually different from the voltages used for the digital part of the system. Therefore, we need a device that converts the information from the digital to the analog domain, called the
Digital Input Modular (DIM). Selecting a row requires two different voltage levels to be provided for the source and gate lines of the target row, as depicted in Figure
1(b), which means two DIMs (
Source/Gate DIM) are required to drive both of them. Therefore, considering one DIM for the crossbar columns (
Write DIM), we need three DIMs in this architecture in total. Based on the operation and the data stored in the RS and WD registers, the DIMs can apply the proper voltage levels to the crossbar. In addition, the data in the WDS register is used by the
Mask unit to prevent unnecessary switching of the cells into which no data has to be written.
(2)
Read and computational operations. In this category, the operations generate an output that has to be read by the periphery circuits in the architecture. The generated output can be the outcome of either a normal memory read or a computational operation. In contrast to the write operation, there is no need to fill the WD and WDS registers. The RS register is again used for row activation. However, among the computational operations,
matrix-matrix multiplication (MMM) is a little different from the others in the sense that the RS register not only has to indicate the active rows but can also be considered as the data for one of the matrices. When the operation is performed inside the crossbar, the generated analog output has to be captured by the
Sample & Hold (S&H) unit. This allows for a clear separation between the execution within the array and the read-out circuitry, which can be used for pipelining of the system (explained in Section
3.4).
After the S&H module has captured the result from the array, the ADCs (or the sense amplifiers) can be used to convert the analog results into the digital domain. Since ADCs consume considerable energy and area, it is usually not possible to allocate one ADC per column. Therefore, we need analog multiplexers to share one ADC among several columns. Besides, certain high-level operations, e.g., the integer MMM, require additional processing steps, and these are performed in the
Addition units that are specific to the integer MMM. Our scheme for this unit is proposed in Reference [
75], where the design utilizes a minimum-size adder to impose as little latency/power as possible on the system. The design considers technology-, circuit-, and application-driven restrictions such as the maximum number of active crossbar rows, the number of ADCs, and the datatype size. When other (high-level) operations are needed in the future, this unit can be altered or substituted with others.
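To illustrate what the Addition units conceptually have to compute for an integer MMM, the following C++ sketch shows the weighted accumulation of bit-sliced partial results, assuming one bit per cell and a bit-serial input. This is our own simplification for illustration, not the circuit of Reference [75]:

```cpp
// Illustrative sketch: combining bit-sliced partial dot-products into an integer result.
// partial[i][j] is assumed to be the ADC read-out for input bit-slice i and stored
// multiplicand bit-slice j; the real Addition unit of [75] realizes this differently.
#include <cstdint>
#include <vector>

int64_t combinePartials(const std::vector<std::vector<int32_t>>& partial,
                        int multiplierBits,     // number of input bit-slices
                        int multiplicandBits) { // number of cell bit-slices per number
    int64_t result = 0;
    for (int i = 0; i < multiplierBits; ++i)        // bit position of the input vector
        for (int j = 0; j < multiplicandBits; ++j)  // bit position of the stored operand
            result += static_cast<int64_t>(partial[i][j]) << (i + j);  // weighted accumulation
    return result;
}
```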
3.2 Instruction-set Architecture
As discussed in Section
3.1, a complex sequence of steps needs to be performed in the CIM tile, and this sequence can differ depending on the (higher-level) CIM tile operation, e.g., read/write, dot-matrix multiplication, Boolean operations, and integer matrix-matrix multiplication. Similar to the concept of microcode, we introduce an instruction-set architecture for our CIM tile that allows for different schedules for different CIM tile operations. The “Controller” in Figure
3 is responsible for translating these instructions to the actual control signals (highlighted in green). The list of instructions is presented in Table
1. In the following, we discuss each instruction:
•
Row Select (RS): the RS instruction is responsible for setting up the RS register (see Figure
3) that is subsequently used to correctly control the source and gate drivers to provide the right voltage levels for the crossbar. At this moment, the number of bits to set the RS register is as large as the height of the crossbar (meaning the input precision is 1 bit). As the crossbar size increases, this becomes impractical for hardware implementations and is left as a future optimization. However, for simulation purposes, the impact is negligible and it actually allows for investigating the utilization patterns of the RS register.
•
Write Data (WD): the WD instruction is responsible for setting up the WD register that is used to write data into the crossbar. Similar to the RS instruction, and with the same reasoning, the size of the instruction is as large as the size of the WD register. We envision this instruction being replaced when a more hardware-friendly mechanism is chosen to load data into the crossbar. For simulation purposes, it is currently the only way to load data into the crossbar.
•
Write Data Select (WDS): the WDS instruction sets up the WDS register to control which bits of the WD register need to be written. This allows data to be written into the crossbar in a flexible manner in light of the potential endurance issues associated with current-day memristor technologies. It is especially useful when a write-verify operation needs to be performed while writing data into the crossbar. Correctly written bits can be masked out, which helps to improve the endurance of the crossbar.
•
Function Select (FS): the FS instruction is needed for several reasons. Functionally, the DIMs need to be set up differently when writing data into the crossbar than when reading out data or performing compute in the crossbar. Furthermore, it is envisioned that future periphery will need additional control signals. These are now conceptually captured in the FS instruction. For example, the “Addition units” require more than one control signal, i.e., more than one instruction is needed to control these units. The details are intentionally left out.
•
Do Array (DoA): the DoA instruction is used to actually initiate the DIMs after they have been set up using the RS, WD, and FS instructions. This instruction will have a variable latency depending on the operation that is performed in the crossbar. These delays are specified as parameters in the simulator. More importantly, this instruction allows for a clear (conceptual) separation between the setup stage and the execute stage (see Figure
3) that would allow for pipelining (discussed in Section
3.4) within the CIM tile.
•
Do Sample (DoS): the DoS instruction is used to signal the S&H module to start copying the result from the crossbar into its own internal storage. It must be noted that this module still operates in the analog domain. The introduction of the DoS instruction allows for a conceptual separation of the execute stage and the read stage. After the values are copied into the S&H module, the crossbar can basically be issued the next DoA instruction.
•
Column Select (CS): the CS instruction is used to set up the CS register that controls the multiplexers (in the MUX module) in the read stage. In case each column of the crossbar can be associated with its own ADC/SA, there is no need for the CS instruction. In all other cases, the CS register flexibly controls which column (in the S&H module) is connected to an ADC/SA. The length of the CS instruction is related to the number of columns of the crossbar. For the same reasons as with the RS and WDS (and WD) instructions, it is purposely defined as it is now and implemented as such in the simulator to allow for further investigation in the future. It is expected to be optimized or replaced when a hardware implementation is considered.
•
Do Read (DoR): the DoR instruction initiates the “ADC/SA” modules to start converting the output of the S&H module (via the MUX) into a digital representation. It is expected that multiple iterations of the CS and DoR instructions need to be issued in order to completely read out all the columns of the crossbar. However, depending on the complexity of the module, the latency can vary and thus a “done” signal is needed (see Figure
3) to signal the end of the readout before the next DoR instruction can be issued.
It is important to note that the crossbar, the S&H modules, and the ADC modules (related to the DoA, DoS, and DoR instructions, respectively) need to be able to signal to the controller that their operation is finished. Only after this signal can subsequent instructions be issued by the controller. If this turns out to be impossible in a real hardware implementation, then we envision the need to set up counters that are initialized according to the latencies of the specific operations to achieve the same functionality. Both approaches are already supported in our simulator.
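As an illustration, the following C++ sketch models the instructions of Table 1 and a typical instruction sequence for one compute operation followed by reading out all column groups. The encoding, field widths, and helper names are assumptions on our part and only approximate the actual instruction format:

```cpp
// Illustrative model of the CIM-tile ISA (opcode names follow Table 1; the
// encoding is hypothetical, not the format used by the compiler/simulator).
#include <cstdint>
#include <vector>

enum class Opcode { RS, WD, WDS, FS, DoA, DoS, CS, DoR };

struct CimInstruction {
    Opcode op;
    std::vector<uint8_t> operand;  // e.g., RS/WD/WDS/CS bit masks sized to the crossbar
};

// Hypothetical stream: one crossbar activation, then read-out of all column groups
// (several columns share one ADC), mirroring the description in Section 3.2.
std::vector<CimInstruction> makeComputeSequence(int columns, int numAdcs,
                                                const std::vector<uint8_t>& rowMask) {
    std::vector<CimInstruction> prog;
    prog.push_back({Opcode::RS,  rowMask});   // select the active rows
    prog.push_back({Opcode::FS,  {1}});       // configure the DIMs for a compute/read operation
    prog.push_back({Opcode::DoA, {}});        // activate the crossbar (variable latency)
    prog.push_back({Opcode::DoS, {}});        // sample the analog results into the S&H unit
    const int groups = (columns + numAdcs - 1) / numAdcs;
    for (int g = 0; g < groups; ++g) {
        prog.push_back({Opcode::CS,  {static_cast<uint8_t>(g)}});  // select the next column group
        prog.push_back({Opcode::DoR, {}});                         // convert via the ADCs
    }
    return prog;
}
```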
3.3 Compiler
The compiler translates high-level operations intended for the CIM tile into a sequence of instructions to be executed within the CIM tile. The high-level operations (e.g., MMM) are provided by the front-end compiler, which is responsible for searching for the operations within the application program that can be performed using the memristor crossbar (see Figure
4). The front-end compiler receives the application in
Tensor Comprehensions representation. This is then converted to a polyhedral representation to identify computational patterns suitable for acceleration by employing
Loop Tactics. The front-end compiler is developed by our partners in the MNEMOSENE project and more information can be found in Reference [
18]. Based on the requirements or constraints that come from either the tile architecture or technology side, our back-end compiler translates high-level operations to in-memory instructions. As depicted in Figure
4, this information is written to the configuration file and passed to the compiler. For example, there might be a constraint on the number of rows that can be activated at once. This constraint can come either from the precision of the ADCs or from the technology capability. Therefore, if an operation needs to activate more rows, the compiler splits it into several steps and takes care of other changes that might be needed. The flexibility brought by our in-memory instructions helps to cope with such constraints, requirements, and sparse patterns. Figure
5 illustrates an example of a VMM operation where four ADCs have to read the columns in a special pattern (every 8 bit-lines share one ADC). This example demonstrates how the instructions deal with these kinds of patterns. It is important to note that the sequence of instructions generated by the compiler changes whenever the tile configuration changes. Therefore, by putting this complexity into the compiler, we try to keep the tile controller as simple as possible.
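As a minimal sketch of this splitting step (the configuration fields and function names are assumptions, not the actual back-end compiler code), the row activations of a VMM that exceed the maximum number of simultaneously active rows could be chunked as follows; each returned mask then becomes the payload of one RS instruction with its own DoA/DoS/read-out sequence, and the partial results are later combined in the Addition unit:

```cpp
// Hypothetical helper: split the active rows of an input vector into chunks of
// at most cfg.maxActiveRows, as required by ADC precision or technology limits.
#include <cstddef>
#include <vector>

struct TileConfig {
    std::size_t maxActiveRows;   // constraint from ADC precision or technology
    std::size_t rows;            // crossbar height
};

std::vector<std::vector<bool>> splitRowActivations(const std::vector<bool>& inputRows,
                                                   const TileConfig& cfg) {
    std::vector<std::vector<bool>> steps;
    std::vector<bool> current(cfg.rows, false);
    std::size_t active = 0;
    for (std::size_t r = 0; r < cfg.rows && r < inputRows.size(); ++r) {
        if (!inputRows[r]) continue;
        current[r] = true;
        if (++active == cfg.maxActiveRows) {        // chunk is full: emit one activation step
            steps.push_back(current);
            current.assign(cfg.rows, false);
            active = 0;
        }
    }
    if (active > 0) steps.push_back(current);        // remaining rows form the last step
    return steps;
}
```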
3.4 Pipelining
The operations in the digital periphery and the analog array can be divided into the following stages (indicated by different colors in Figure
6):
(1)
Set up stage (digital): all the control registers (and write data register) are initialized
(2)
Execution stage (analog): perform the actual operation in the analog array
(3)
Read out stage (analog): convert the analog results into digital values
(4)
Addition stage (digital): perform the necessary operations for the integer matrix-matrix multiplication
These stages sequentially follow each other while performing higher-level operations translated to a sequence of instructions in our ISA. It should be clear that the pipelining described here is different from traditional instruction pipelining. In the latter, the latencies of the stages should be matched with each other to obtain a balanced pipeline. In the CIM tile, the latency of the operation performed in the analog array is expected to be much longer than the latency of a single clock cycle in the digital periphery. Therefore, it is important that the right signaling is performed between the stages to enable pipelining. The introduced execution model to pipeline the operations within the CIM tile allows for trade-off investigations between different NVM technologies and the (speed of the) digital periphery. Considering the aforementioned stages, the designer should evaluate the latency of each stage (which depends on the configuration of the tile, the memristor technology, etc.) to determine the contribution of each stage to the total latency of the tile and merge some of them in case one stage has far less latency than the others. This analog/digital behavior of the CIM tile restricts the choices and their effectiveness regarding the pipelining stages. It is worth mentioning that the two analog stages (Execution and Read out) cannot be split into more stages. The same holds for the Set up stage, since the registers initialized there cannot be changed before activating the crossbar to ensure the correct functionality of the system.
In contrast to traditional processors, where the execution of instructions is split into different stages to enable pipelining, our tile architecture associates different instructions with each tile stage. In the first stage, registers should be filled with new data and the drivers have to be configured. In the second stage, to activate the crossbar using the
DoA instruction, the operation latency of the previous activation must have elapsed. The latency of the crossbar, which depends on its technology as well as the operation to be executed, can be captured by either a counter or a done signal generated by the circuit itself. Depending on whether the operation is a
write or a
read/computational one, the done signal issued by the crossbar or the S&H unit, respectively, should be used to synchronize the first two stages. The latency of the third (Read out) stage depends not only on the latency of the ADCs, but also on the number of columns that have to be read. Accordingly, groups of columns are read by the ADCs sequentially, and this stage only becomes available for the next S&H activation after the last columns have been translated to the digital domain. Figure
7 depicts the instruction memory, which is divided into four sections, each working on one stage of the tile in parallel.
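The following conceptual C++ sketch (our own simplification, not the controller implementation) captures this handshaking between unbalanced stages: a stage only hands its result to the next stage when that stage is free, and a stage's "done" flag is raised elsewhere by the corresponding module's done signal or counter:

```cpp
// Conceptual sketch of tile-level pipelining with unbalanced stages.
#include <array>

struct Stage {
    bool busy = false;   // stage is currently processing an operation
    bool done = false;   // set by the module's done signal or counter when it finishes
};

void stepPipeline(std::array<Stage, 4>& st, bool newOperationAvailable) {
    // Hand over results from the last stage backwards so each handover frees its producer.
    for (int s = 3; s > 0; --s) {
        if (st[s - 1].done && !st[s].busy) {          // handshake between stage s-1 and s
            st[s].busy = true;                        // consumer starts (latency modeled elsewhere)
            st[s - 1].busy = st[s - 1].done = false;  // producer becomes free again
        }
    }
    if (newOperationAvailable && !st[0].busy) st[0].busy = true;  // set-up stage accepts new work
}
```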
4 CIM Simulator
The proposed CIM architecture is generalized, making it capable of targeting different technologies with different configurations of the peripheral circuits. The simulator, written in SystemC, models the architecture presented in Section
3.1 and generates performance and energy numbers by executing applications. The simulator takes as input the program generated by the compiler (presented in Section
3.3), which is currently stored as simple (human-readable) text. Besides the program, to simplify design-space exploration, the configuration of the architecture has to be sent to the simulator via a configuration file in which the user is able to specify many parameters. The simulator produces as output the following: (1) energy and performance numbers, (2) the content of the crossbar (over time), (3) waveforms of all control signals, and (4) the computational results. All outputs are written into text files to be used for further evaluation. Figure
8 illustrates the control and data flow of the simulator considering just 2-stage pipelining. To synchronize the stages, the controller can insert a stall to hold the execution of each stage. When an instruction is decoded, the data embedded in it, along with the control signals, is passed to the component in the data path related to that specific instruction. The status is then returned to the controller to indicate that the execution of the instruction has finished.
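To give an impression of how such a controller/data-path handshake can be modelled in SystemC, the following is a minimal sketch; the module, port, and helper names are illustrative and do not correspond one-to-one to our simulator's source:

```cpp
// Minimal SystemC sketch of a controller that stalls until the data path reports done.
#include <systemc.h>

SC_MODULE(TileController) {
    sc_in<bool>        clk;
    sc_in<bool>        stage_done;   // status returned by the data-path component
    sc_out<bool>       stall;        // hold the stage until the component finishes
    sc_out<sc_uint<8>> ctrl;         // decoded control signals for the data path

    void decode() {
        // Stall while the previously issued instruction is still executing;
        // otherwise decode the next instruction and drive the control signals.
        if (!stage_done.read()) {
            stall.write(true);
        } else {
            stall.write(false);
            ctrl.write(next_opcode());   // hypothetical helper: next instruction's opcode
        }
    }

    sc_uint<8> next_opcode() { return 0; }  // placeholder for instruction fetch/decode

    SC_CTOR(TileController) {
        SC_METHOD(decode);
        sensitive << clk.pos();
    }
};
```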
The simulator has been written in a modular way, which helps us to easily modify or replace the components shown in the tile architecture with new designs/circuits. Moreover, new attributes can be easily added to the component models and their impact is captured at the kernel level (e.g., read/write variability for the crossbar). In this fully parameterized simulator, each component has its own characteristics, such as energy, latency, and precision, written into the configuration file. Table
2 shows all the parameters that can be set in the file to be used for early-stage design-space exploration. Furthermore, a first-of-its-kind feature of our simulator is the ability to calculate energy numbers according to the data provided by the application. Existing simulators estimate an average energy number regardless of the data. Our simulator takes into account the data stored in the array to estimate the energy consumption in the crossbar and its drivers. This is achieved by taking into account the data stored as crossbar cell resistance levels, the number of activated rows, and the equivalent resistance of the crossbar.
The power consumption of the crossbar and read drivers regarding read/compute operations is given in Equation (
1), where
\(R_{rc}\) is the resistance level of the memristor cell located in row “r” and column “c,”
\(P_{DIM_{read}}\) is the read drivers power, and
\(V(read)\) is the read voltage. In general,
\(R_{rc}\) and
\(V(read)\) are members of two sets containing the possible resistance and voltage levels, respectively. Furthermore,
\(activation_r\) is a binary value that indicates whether row “r” is activated and contributes to the power of the crossbar or not. Considering read and compute operations, the summation is performed over the selected rows and all the columns. In addition, for simplicity, the resistance of the access transistors, as well as that of the bit-lines, is ignored. The power consumption of write operations is shown in Equation (
2), where
\(P_{DIM_{write}}\) is the write drivers power.
\(V(write)\) and
\(I(write)\) are the write voltage and programming current, respectively. In general,
\(V(write)\) is a member of a set including different write voltage levels. Finally,
\(activation_c\) determines whether the column “c” is activated and would contribute to the crossbar energy or not. The summation is performed over the selected columns in just one activated row. The energy consumption of read/computational as well as write operations is shown in Equations (
3) and (
4), respectively, in which
\(T_{Xbar(write)}\) as well as
\(T_{Xbar(read)}\) are the latencies of the crossbar for write and read/computational operations, respectively. The latency of the crossbar for read/computational operations also depends on the peripheral circuits used to capture or read the analog values generated by the crossbar (S&H). When the S&H unit is used to capture the result, its capacitance is charged with a different gradient according to the equivalent resistance of the crossbar. Therefore, the result should be captured at the right time, when there is a maximum voltage difference on the capacitance of the S&H unit for the different crossbar equivalent resistances, which helps the ADC to distinguish them easily (implying that the crossbar latency also depends on the ADC capability). It is worth mentioning that the equations are data-dependent and provide the worst-case energy numbers for the crossbar and its drivers:
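A plausible form of Equations (1)–(4), reconstructed here only from the definitions above (the exact expressions in the original may differ), is:

\[ P_{Xbar(read)} = \sum_{r}\sum_{c} activation_r \cdot \frac{V(read)^2}{R_{rc}} + P_{DIM_{read}}, \qquad (1) \]
\[ P_{Xbar(write)} = \sum_{c} activation_c \cdot V(write) \cdot I(write) + P_{DIM_{write}}, \qquad (2) \]
\[ E_{Xbar(read)} = P_{Xbar(read)} \cdot T_{Xbar(read)}, \qquad (3) \]
\[ E_{Xbar(write)} = P_{Xbar(write)} \cdot T_{Xbar(write)}. \qquad (4) \]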
Due to the execution model and the flexibility of our instructions, the simulator is able to add the energy of the ADCs that are active during program execution to the total energy of the tile. Besides the hardware implementation of our digital controller, which can provide an accurate number for this unit, a more advanced model that incorporates more attributes of the crossbar is our main focus for future work. Considering that only a limited number of bits can be stored in a single cell, the data for programming the crossbar has to be distributed over multiple cells, depending on the datatype size. In addition, in the case of MMM, due to the limitation on the number of levels that can be supported by the drivers and ADCs, the data for the multiplier has to be sent to the crossbar’s rows in several steps, depending on the datatype size. Therefore, extra processing in the “
Addition unit” (see Figure
3) is required to complete the analog computations when calculating over integer numbers. An efficient structure tailored to the CIM tile was proposed in Reference [
75]. Thanks to our in-memory ISA and the flexibility provided by rescheduling its instructions using our back-end compiler, the architecture, and subsequently the simulator, can perform computations with different integer datatype sizes at the same time.
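Relating back to Equations (1) and (3), the following C++ sketch indicates how such a data-dependent crossbar read/compute power could be computed from the stored resistance levels. It is an illustrative simplification under the same assumptions (access-transistor and bit-line resistances ignored), not the simulator's actual code:

```cpp
// Illustrative data-dependent worst-case read/compute power of the crossbar and its drivers.
#include <cstddef>
#include <vector>

double crossbarReadPower(const std::vector<std::vector<double>>& R,  // R[r][c]: cell resistance
                         const std::vector<bool>& activation,        // activated rows (same size as R)
                         double vRead, double pDimRead) {
    double p = pDimRead;                        // read-driver (DIM) power
    for (std::size_t r = 0; r < R.size(); ++r) {
        if (!activation[r]) continue;           // only selected rows contribute
        for (double Rrc : R[r]) p += vRead * vRead / Rrc;  // per-cell dissipation
    }
    return p;   // energy follows as p * T_Xbar(read), cf. Equation (3)
}
```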
5 Evaluation and Discussion
The defined CIM tile architecture, ISA, compiler, and simulator allow for various design-space explorations. These explorations can give insight into future CIM-tile fabrication and into the influence of different parameters on the performance and energy of the tile considering different technologies. In this section, we will present several of these explorations that are currently possible with our tools.
5.1 Simulation Setup
Energy and performance model. The values used in our experiments regarding the (technology) parameters are summarized in Table
3. The needed values related to the digital periphery were obtained by using Cadence Genus targeting the standard cell 90 nm UMC library. The values related to the three targeted technologies (ReRAM, PCM, and STT-MRAM) were taken from References [
20,
21,
24,
38,
48,
70]. For all the experiments, we assume the size of the crossbar is 256 by 256 and its input precision is one bit. In addition, the cycles required to fill the tile registers are computed based on the crossbar size, and we assume the databuses to be 32 bits wide.
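For example, under these assumptions (and assuming one bus transfer per cycle, which is our own simplification), filling one 256-bit register such as RS or WDS over a 32-bit databus takes eight cycles:

```cpp
// Register-fill cycle count implied by the setup above (ceiling division).
constexpr int crossbarWidth = 256, busWidth = 32;
constexpr int fillCycles = (crossbarWidth + busWidth - 1) / busWidth;   // = 8
static_assert(fillCycles == 8, "256-bit register over a 32-bit databus");
```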
The power and latency values for the ADCs were taken from Reference [
54]. Using Equation (
5), where “b” is the ADC resolution, we can derive the energy and latency of the ADC in performing a single conversion for different resolutions. These values are utilized in our simulations. Finally, the power model for the read and write drivers (DIMs) is obtained from Reference [
52]:
Benchmark. As a benchmark, the linear-algebra kernel “GEMM” from the Polybench/C benchmark suite was chosen. In this kernel, first, the multiplicands are written into the crossbar (write operation) and then the actual multiplication (compute operation) is performed. The datatype size for both the multiplier and the multiplicand is considered to be 8 bits, meaning that 8 cells have to be employed for each number (assuming each cell stores one bit). This benchmark was chosen as it intensively utilizes the memory array as well as its digital periphery, given that we want to perform DSE targeting different technologies for the memory array.
5.2 Simulation Results
In this section, we will present several design-space explorations that are currently possible using our simulator. The insights obtained from these analyses will help designers make better decisions for the actual implementation.
Performance versus number of ADCs. In Figure
9, we plotted the (normalized) execution time of running the GEMM benchmark targeting three different technologies for the crossbar array, namely, PCM, ReRAM, and STT-MRAM. The simulations were performed assuming a 1 GHz clock frequency for the digital periphery and an 8-bit ADC resolution.
We can clearly observe that the number of ADCs greatly impacts the execution time. By adding more ADCs, the total execution time can be reduced, as fewer cycles are needed to read out the data from the crossbar array. Although STT-MRAM has a faster write time, the improvement in execution time is negligible because the number of write operations needed to program the crossbar is small compared to the number of computational operations. An interesting observation is that the performance does not improve much when moving from 32 to 64 ADCs. This can be explained by the fact that at some point the latency of the read stage is no longer dominant and further reducing the read-out time has little impact on the total execution time. Finally, regardless of the number of ADCs, the energy consumption is almost constant (with small fluctuations due to the data randomness), since the number of conversions is always fixed.
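A back-of-the-envelope computation of the read-out steps per crossbar activation (assuming 256 columns and one conversion per ADC per step, as in our setup) illustrates why the gain from 32 to 64 ADCs is small:

```cpp
// Read-out steps per activation for different ADC counts (ceiling division).
#include <cstdio>

int main() {
    const int columns = 256;
    const int adcCounts[] = {4, 8, 16, 32, 64};
    for (int adcs : adcCounts)
        std::printf("%2d ADCs -> %3d read-out steps per activation\n",
                    adcs, (columns + adcs - 1) / adcs);
    // Beyond the point where the read stage stops dominating, halving the step
    // count (e.g., 32 -> 64 ADCs) barely changes the total execution time.
    return 0;
}
```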
Performance versus frequency of digital periphery. The latency of the operations in the (analog) crossbar array is a constant number. Therefore, it is interesting to determine how fast the digital periphery should be clocked to “match” this latency to make the pipeline more balanced. In the following investigation, we have fixed the number of ADCs to 16 and ran the GEMM benchmark at different frequencies.
Figure 10 clearly shows that performance improvements can be gained by raising the frequency of the digital periphery. However, increasing the clock frequency beyond 1 GHz does not result in much better execution times, as the analog circuits are (relatively) becoming the bottleneck. In addition, the performance improvement due to pipelining is reduced, since the stages become more unbalanced. This DSE allows a designer to make the different stages of the tile more balanced. A positive side effect is that pipelining more balanced stages will usually lead to better performance improvements over a non-pipelined design.
Energy contribution per module. Figure
11 depicts the relative energy spent in the different modules when running the GEMM benchmark for 16 ADCs and using an 8-bit datatype. We can clearly observe that the largest energy consumer is still the crossbar and its drivers. Although the power consumption of the ADC is high, due to the low latency of this component (around 1 ns [
54]), the energy consumption of the ADC does not dominate. In the PCM case, the relative energy consumptions of the ADCs and the crossbar are close to each other (compared to the other technologies). The reason is that the power consumption of PCM is lower compared to the ReRAM technology due to the higher cell resistance (see Table
3). This in turn increases the relative energy consumption of the ADCs.
It is worth mentioning that the energy consumption of the crossbar depends on the input and programming data. As more devices are programmed to LRS and more rows are activated, clearly more energy is consumed during the computation. Since the simulator can actually execute kernels, the energy number obtained for the crossbar is data-dependent. Figure
12 depicts the energy consumption of the crossbar for different levels of sparsity. The sparsity in this figure refers to the occurrence of logic value 1 in the matrix-matrix multiplication. For this experiment, it is assumed that the multiplier (input) and the multiplicand (programmed) matrix have the same sparsity. In addition, a logic value 1 (0) as an input implies that the corresponding row is activated (deactivated), and as programming data it indicates that the memristive device is programmed to LRS (HRS). The simulation is performed for ReRAM and PCM devices, and an interesting observation is that, although the ratio of HRS to LRS is lower in ReRAM devices, PCM devices have a higher LRS than ReRAM, so the sparsity leads to less variation in the energy consumption of a crossbar made with PCM.
Relative latency contribution per stage. Figure
13 depicts the relative time that the GEMM application spends in each of the four stages, plotted against the frequency of the digital periphery. Since several columns share an ADC, reading all the columns has to be performed in multiple steps. In addition, after each ADC activation, digital processing is required in the
“addition unit.” Therefore, we can clearly observe that at a low frequency, the read and addition stages completely dominate the total latency. Using an efficient structure and a minimum-sized adder, the latency of the addition unit is no longer than that of the read stage. At a low clock frequency, the latency of the analog circuits is hidden within one clock period. However, as the clock frequency increases further, the relative latency of the analog components starts to rise. Consequently, we can observe that the read stage, which is composed of analog (ADC) and digital (decoding) latency, as well as the execute stage, impose relatively more latency. This information can be used to determine the number of pipeline stages for the actual implementation.
Figure
14 depicts the relative time that the GEMM application spends in each of the four stages, plotted against the number of utilized ADCs. It should be clear that by increasing the number of ADCs, the number of cycles spent in the read-out stage is greatly reduced. Consequently, we can observe that the relative contribution of the setup stage to the total latency grows accordingly. The contribution of the other stages to the total latency is almost negligible. With the advent of advanced ADC designs that allow more ADCs per tile, this figure gives insight into the implications and helps with future design decisions.