In recent years, we are witnessing a trend toward in-memory computing for future generations of computers that differs from traditional von-Neumann architecture in which there is a clear distinction between computing and memory units. Considering that data movements between the central processing unit (CPU) and memory consume several orders of magnitude more energy compared to simple arithmetic operations in the CPU, in-memory computing will lead to huge energy savings as data no longer needs to be moved around between these units. In an initial step toward this goal, new non-volatile memory technologies, e.g., resistive RAM (ReRAM) and phase-change memory (PCM), are being explored. This has led to a large body of research that mainly focuses on the design of the memory array and its peripheral circuitry. In this article, we mainly focus on the tile architecture (comprising a memory array and peripheral circuitry) in which storage and compute operations are performed in the (analog) memory array and the results are produced in the (digital) periphery. Such an architecture is termed compute-in-memory-periphery (CIM-P). More precisely, we derive an abstract CIM-tile architecture and define its main building blocks. To bridge the gap between higher-level programming languages and the underlying (analog) circuit designs, an instruction-set architecture is defined that is intended to control and, in turn, sequence the operations within this CIM tile to perform higher-level more complex operations. Moreover, we define a procedure to pipeline the CIM-tile operations to further improve the performance. To simulate the tile and perform design space exploration considering different technologies and parameters, we introduce the fully parameterized first-of-its-kind CIM tile simulator and compiler. Furthermore, the compiler is technology-aware when scheduling the CIM-tile instructions. Finally, using the simulator, we perform several preliminary design space explorations regarding the three competing technologies, ReRAM, PCM, and STT-MRAM concerning CIM-tile parameters, e.g., the number of ADCs. Additionally, we investigate the effect of pipelining in relation to the clock speeds of the digital periphery assuming the three technologies. In the end, we demonstrate that our simulator is also capable of reporting energy consumption for each building block within the CIM tile after the execution of in-memory kernels considering the data-dependency on the energy consumption of the memory array. All the source codes are publicly available.

1 Introduction

Modern-day applications such as machine learning, image processing, and encryption are widely employed on real-time embedded systems as well as back-end data centers. Nowadays, since the amount of data has to be processed for these types of applications increased considerably (e.g., BERT has around 110 million parameters [15]), performance and power consumption of the systems that fulfill these processes has become crucial and draw increasingly more attention. Currently, traditional von-Neumann architectures have been stretched to attain adequate performance at reasonable energy levels, but are clearly showing limitations for further improvements. The main limitation is the conceptual separation of the processing unit and its memory, which makes the data movement between memory and processing unit the main performance and energy bottleneck [25, 36]. For instance, communication between off-chip DRAM and the processor takes up to 200 cycles [62] and it consumes power more than 2 orders of magnitude than a floating-point operation [1, 14, 17]. Consequently, new architectures need to be sought to achieve the performance levels required for these new applications by means of mitigating this costly communication.

To address this challenge, early researches [12, 47] suggested integrating more on-chip memory next to the processing unit, which imposes significant fabrication costs to the system. One of the promising solutions to tackle this challenge is performing computation within or near the memory rather than moving the data to the processing unit. Some valuable works focus on near-memory computing, especially after 3D stacking technology was introduced, in which computation was performed in the DRAM chip by adding extra logic next to the memory structure [2, 3, 10, 19, 34, 45]. However, recently, computation in-memory (CIM), as a concept, has gained a huge interest from the research community in combining memory storage and actual computing in the same memory array. This is achieved by exploiting special characteristics of emerging non-volatile memories called memristors such as resistive RAM (ReRAM) [26], phase change memory (PCM) [40], and spin-transfer torque magnetic RAM (STT-RAM) [30]. No matter which technology is used for fabrication, memristor technology has great scalability, high density, near-zero standby power, and non-volatility [26, 63, 74]. Accordingly, memristor technology with the aforementioned characteristics opens up new horizons toward new ways of computing and computer architectures.

Until now, the main focus of researchers was to enhance the characteristics of memristor devices such as latency and endurance [8, 29, 32, 51, 74]. Researchers have already proposed different innovative circuit designs based on memristor devices to exploit their capabilities of co-locating computation and storage together [7, 31, 37, 42, 43, 44, 46, 61, 76]. Moreover, within a single memory array as well as at inter-array level huge parallelism can be flexibly achieved as each memory tile becomes a powerful computation core. It was demonstrated that due to these two main features of memristor-based designs, significant energy and performance improvement can be gained [28]. It is widely accepted that the dot-product (and, in turn, the matrix-matrix multiplication) operation is the most suited for the memristor-based designs. Consequently, convolutional and deep neural network are the potential applications that have been widely studied by the researchers to exploit memristive crossbar structures [9, 57, 64, 66, 67, 73]. However, researchers have also proposed other types of operations, e.g., Boolean operations [56, 60, 72] or arithmetic operations like additions [58]. More information on available researches regarding device characteristics and potential applications for in-memory computing can be found in References [49, 53].

Having said this, there is no work in the research community that allows for easy comparison of these supported operations at the application-kernel level between different technologies (e.g., ReRAM, PCM, and STT-RAM) nor is it possible to emulate complex operations, e.g., matrix-matrix multiply, when the underlying technology does not allow for direct implementation. Furthermore, the interactions between the analog memory array and its supporting digital periphery is largely overlooked.

In this article, we generalized the existing CIM-tile architectures (a CIM-tile comprises the analog array itself and all necessary analog and digital periphery) to introduced a generic execution model that allows operations in the digital periphery and the analog array to operate in parallel through pipelining. In this model, a new instruction set architecture (ISA) is introduced with the twin objectives of orchestrating digital and analog components of memristor tiles and bridging the gap between high-level programming languages and the CIM architecture. By rescheduling the proposed fine-grained instructions, maximum flexibility on executing a program targeting the CIM-tile can be attained. Our compiler written in C++ performs instruction lowering from high-level kernels, which are supposed to execute on the CIM tile down to in-memory instructions at compile time. The compiler is aware of the architecture configuration, the technology constraints, and the datatype size requested by the application. Subsequently, based on this information, the required in-memory instructions are generated in a proper sequence.

The execution model is aimed to work with different memristor technologies as well as different configurations/circuits for peripheral devices. To accomplish design space exploration targeting performance and energy, we designed a modular simulator written in SystemC in which the designer can track all the data and control signals. Timing and power numbers of the modules within the CIM architecture are obtained from low-level models. These numbers configure specific parameters of the simulator.

In conclusion, the contributions of our work are summarized below:

•

we generalize existing CIM tile architectures and execution models for in-memory computing, which enables an application to leverage a variety of in-memory operations. Additionally, we proposed a pipelining approach for an architecture that has digital and analog modules, which can have widely varying (unbalanced) latencies. Finally, the architecture can be easily extended with additional peripheral modules.

•

we define a new and generic in-memory ISA for the aforementioned design to obtain maximum flexibility while dealing with different constraints, configurations, and requirements. Furthermore, we developed a compiler that is able to translate higher-level operations (e.g., defined in higher-level programming languages) into a sequence of instructions in our ISA. The ISA can also be extended when additional peripheral modules are added.

•

we develop a fully parameterized simulator that is capable of executing the newly introduced instructions and simulate our CIM tile architecture. Design parameters (e.g., height/width of memory array, type and number of peripheral modules like the Analog-to-Digital Converter (ADC)) as well as technology-dependent parameters (e.g., ReRAM/PCM/STT-RAM, latencies, number of bits per cell, power utilization per module, clock frequency of the digital periphery) are supported.

•

we perform an initial set of design-space explorations based on well-accepted design parameters and current-day memristor technologies and present the results.

2 Background AND Related Work

This section first provides a preliminary background on memristor devices and the different ways that they employed to perform computation. Subsequently, the existing architectures exploiting in-memory computing using memristor devices as well as available simulators will be discussed.

2.1 Memristor Devices

Memristor devices are categorized as non-volatile memory in which data can be represented as resistance levels. Figure 1(a) depicts the current-voltage behavior of a bipolar memristor device. For a unipolar device, SET and RESET would occur at the same voltage polarity. The resistance states are altered by the application of suitable voltage or current pulses during a write or programming operation. Usually the term SET is used to denote the transition from a low resistive state (LRS) to a high resistive state (HRS) while the term RESET means the transition from a HRS to a LRS. Since ReRAM and STT-MRAM are both bipolar devices, one voltage polarity SETs the device while the other RESETs it. The polarity depends on the definition of the reference potential. PCM is a unipolar device, which means that SET and RESET occur for the same voltage polarity but at different amplitudes [69]. By utilizing the LRS/HRS resistance states we can store the logic levels “1” and “0,” respectively [22, 29]. In the case that the device can be programmed to multiple resistance levels, obviously it can hold more than one bit of data [35, 74]. Reading resistance state of the device requires a small voltage applied to the device, which prevents data disturbance, and then the current through or the voltage across the device can be sensed.

Fig. 1.

2.2 Memristor-based Computation Circuits

Several papers proposed circuit designs that employ memristive devices to be used as either computation or storage unit. In addition to their appealing characteristics in terms of area and energy, some of these designs have demonstrated the ability to enable both computing and storing data at the same location; i.e., data is not shuttled back and forth between the memory and the processor [27]. Most of these novel CIM circuit designs can be classified based on the location of the results generated by the memory core [39], which consists of a memory array and peripheral circuits. Figure 2 illustrates this classification. According to the classification, if the result of computation is produced outside the memory core, then it is referred to as Computation-Outside-Memory (COM). In this case, the computation can take place either in extra logic circuit inside the memory System-in-Packages (SiP) or like a traditional architecture in computational cores (CPU or GPU), which are referred to COM-Near (COM-N) or COM-Far (COM-F), respectively. The two main classes of CIM circuit designs using memristive devices where the results produced inside the memory core are described in the following:

Fig. 2.

CIM-Array (CIM-A): The result of the computation is produced within the array and represented by resistances [7, 37, 42, 55, 59, 61]. CIM-A circuit designs have generally the following advantages in common:

•

They achieve the maximum bandwidth of “transferring” data between the computation and storage units, as both computation and storage happen physically within the same memory array.

•

They perform computation independently from the sense amplifiers. This can provide high parallelism.

•

They can potentially enable logic cascading of universal functions.

However, they share some limitations as well such as:

•

Frequent write operations can lead to endurance and energy issues.

•

Performance overhead due to the high latency of device programming and logic cascading to achieve complex functions.

•

Significant design efforts both for the memory array (e.g., the memory array must support wider bit-lines to handle the excess of currents when multiple values are written simultaneously) and its controller (e.g., the controller requires complex state machines to apply a long sequence of different control voltages to the memory array).

CIM-Periphery (CIM-P): The result of the computation is produced within the periphery and represented by voltage levels [33, 54, 72]. CIM-P circuit designs generally have the following advantages in common:

•

They do not exacerbate the endurance of the memory array as the memory states do not change during and after the computation.

•

They require relatively less redesign efforts in the memory array; i.e., the currents during computation are lower or close to the write current.

•

They allow high to maximum bandwidth depending on the complexity and area of the periphery.

However, they share some limitations such as:

•

They may have to share sense amplifiers or analog-to-digital converters per multiple bit-lines, which can degrade the performance.

•

They cannot cascade functions without read/write operations.

The pros/cons listed for the two classes point us to CIM-P as a prime candidate to be exploited in our CIM-tile organization. In this decision, we considered parameters like endurance, performance, and the complexity of the controller administrates the memory array. Among different types of operations, vector logical operations, as well as vector-matrix multiplication, have a great potential to be executed on the crossbar structure. Figure 1(b) depicts the structure of 1T1R memristor crossbar for the CIM-P circuit class. For example, to perform logical operations using this structure, first the desired rows have to be activated and the read voltage has to be dropped across each cell. Subsequently, according to the scouting logic [72], just by changing the reference input of the sense amplifiers (SAs), different Boolean operations (AND, OR, XOR) can be performed. For vector-matrix multiplication, the matrix has to be stored inside the crossbar and then the vector encoded to analog voltages is applied to source lines of the corresponding rows. In the end, the analog result is received for all the columns in a single shot.

2.3 In-memory Architectures and Simulators

Some other studies are proposing in-memory accelerator designs using memristor devices that mostly focused on accelerating Neural Networks (NN). PRIME [13] designed a new architecture based on ReRAM devices for NN applications in which some portion of crossbars can be configured for computation and the rest for storage. By reducing the communication between memory and processing units, their estimation demonstrates significant improvement in energy and performance. Similar to PRIME, ISAAC [54] proposed a full-fledged accelerator with memristor crossbars for CNNs, which exploits pipelining in the level of application. Recently, new in-memory instructions were proposed [23]. In this promising work, in-memory instructions are complete operations that have to be performed on crossbars based on the Single Instruction, Multiple Data (SIMD) execution model to enjoy massive data parallelism. The basic operations supported in the crossbar are multiplication, addition, and subtraction. However, to perform logical operations, it was just assumed that Look Up Tables (LUTs) can be placed beside the crossbars. Similar to ISAAC, PUMA [4] proposed an accelerator tailored for vector-matrix-multiplication. The designed architecture has three piplined stages (fetch, decode, and execute) and the new in-memory instructions defined there are mainly responsible for communicating data between memory units or performing scalar operations in digital peripheries. Finally, ReVAMP [6] proposed a general-purpose Very Long Instruction Word (VLIW) architecture platform based on ReRAM devices. In this work, two high-level instructions are defined to indicate read and compute operations and control the architecture based on them. Although some abstract architectures are proposed in the literature, there is still a lack of studies on the detailed execution model of in-memory architectures. In this work, we propose new in-memory instructions to be able to orchestrate the analog and digital circuits inside the tile. The proposed instructions are defined to be as generic as possible, which enables the programmers to build other operations on top of them.

Many factors have to be considered when an in-memory memristor-based architecture has to be designed, since they can have a non-negligible effect on the performance, energy, area, or even accuracy of the system. Therefore, the necessity of optimization and design space exploration at reasonable simulation time derive the researchers to develop high-level simulators for this new type of architecture. There are a few simulators among which NVSim [16] and NVMain [50] are the first memory-oriented simulators designed for non-volatile memories. These tools, which are inspired by CACTI [68], can estimate the access time, energy, and area of non-volatile memories. Similar to NVSim, MNSIM [71] proposed a behavioral simulation platform for neuromorphic accelerators that estimates design parameters using analytical equations. Recently, another system-level simulator [41] was proposed, in which by using probability functions to capture the accuracy of ReRAM cells, the behavior of an application in terms of accuracy can be evaluated. Contrary to existing works that estimate execution times, our execution model allows us to build a tile-level cycle-accurate simulator by actually executing in-memory instructions—after translating high-level kernels to our ISA. Furthermore, our simulator allows the user to track all the control signals and the content of crossbar/registers and produces more accurate performance and energy consumption results. Finally, due to the modular programming of the simulator, the user can easily investigate different memristor technologies, circuit designs, and more advanced crossbar modeling (e.g., considering read/write variability).

3 CIM-tile Architecture

A CIM tile can be either employed as a standalone accelerator or integrated to the conventional computer architecture. Accelerators are designed to lessen the execution time of certain functionalities, which are frequently used by the intended applications. With respect to their degree of flexibility, they can be designed and implemented for dedicated functionalities or they can be flexible enough to support wider range of tasks [5, 11, 65]. However, in the context of in-memory computing, the accelerators are storage units enabled to perform certain functionalities. This is aligned with the main goal of this context, which is data movement reduction. Figure 2 depicts one potential way in which a CIM tile can be seen as an off-/on-chip accelerator from the CPU. In this architecture, CIM-tiles are not integrated to the memory hierarchy of the system and allocated into different address space. In this section, we focus on the architecture of our CIM tile and how it is organized using an in-memory instructions set architecture. Afterwards, we present our compiler, which is capable of translating higher-level programs, e.g., matrix-matrix multiplication, into a sequence of the newly introduced ISA. Finally, we propose the execution model that allows pipelining at the tile-level between the analog and digital peripheries.

3.1 Overview

As mentioned earlier, we focus on the CIM-P 1T1R structure in which the (computational) results of the (memory) array operations are captured in the (digital) periphery. Figure 3 depicts the architecture of the CIM tile, which includes the required components and signals that can control digital or analog data. The operations that can be executed on the crossbar are divided into two categories: (1) write and (2) read and computational operations. The computational operations include addition, multiplication, and logical operations. In the following, we will describe our tile architecture and its main modules considering these two categories (the discussion of the control signals are left to the subsequent section):

Fig. 3.

(1)

Write operation. To write data to the memristor crossbar, we have to specify in which row and column the data has to be written. Therefore, three registers are employed to capture this information. The data itself has to be written to the Write Data (WD) register whose length depends on the width of the crossbar as well as the number of levels supported by the memristor cells. Considering endurance issues and potential energy savings, it is not always necessary to write data to all the array columns. For this purpose, the Write Data Select (WDS) register is used to select which columns should be activated. This is especially relevant when considering implementing a write-verify operation. Finally, the Row Select (RS) register is employed to activate the row in which data has to be written. A more detailed description is presented in the following.

Voltages that have to be applied to the crossbar depend on the crossbar technology and they are usually different than the voltage used for digital part of the system. Therefore, we need a device to convert the information from digital to analog domain called Digital Input Modular (DIM). Selecting a row requires that two different voltage levels have to be provided for both source and gate line of the target row as depicted in Figure 1(b), which means two DIMs (Source/Gate DIM) are required to drive both of them. Therefore, considering one DIM for the crossbar columns (Write DIM), we need three DIMs in this architecture in total. Based on the operation and the data stored in the RS and WD registers, DIMs can apply proper voltage levels to the crossbar. In addition, the data in the WDS register is used by the Mask unit to prevent extra switches for the cells that the data has not to be written into them.

(2)

Read and computational operations. In this category, the operations generate an output and it has to be read by the periphery circuits in the architecture. The generated output can be the outcome of either a normal memory read or computational operation. In contrast to the write operation, there is no need to fill the WD and WDS registers. The RS register again is used for row activation. However, among computation operations, matrix-matrix multiplication (MMM) is a little different than others in the sense that the RS not only has to indicate the active rows but also can be considered as the data for one of the matrices. When the operation performed inside the crossbar, the generated analog output has to be captured by the Sample & Hold(S&H) unit. This allows for a clear separation of the execution within the array and the read-out circuitry, which can be used for pipelining of the system (explained in Section 3.4).

After the S&H module has captured the result from the array, the ADCs (or the sense amplifiers) can be used to convert the analog results into the digital domain. Since ADCs consume much energy and area, usually it is not possible to allocate one ADC per column. Therefore, we need analog multiplexers to share several columns with one ADC. Besides, certain high-level operations, e.g., the integer MMM, require additional processing steps and these are performed in the Addition units that are specific to the integer MMM. Our scheme for this unit is proposed in Reference [75] where the design utilizes minimum size adder to impose as less latency/power as possible to the system. The design considers technology/circuit/application-driven restrictions such as the maximum number of active crossbar rows, number of ADCs, and datatype size. When other (high-level) operations are needed in the future, this unit can be altered or substituted with others.

3.2 Instruction-set Architecture

As discussed in Section 3.1, a complex sequence of steps need to be performed in the CIM tile that can be different depending on the (higher-level) CIM tile operation, e.g., read/write, dot-matrix multiplication, Boolean operations, and integer matrix-matrix multiplication. Similar to the concept of microcode, we introduce an instruction-set architecture for our CIM tile that would allow for different schedules for different CIM tile operations. The “Controller” in Figure 3 is responsible for translating these instructions to the actual control signals (highlighted in green). The list of instructions is presented in Table 1. In the following, we discuss each instruction:

Table 1.

Instruction	Semantic	Operand
Row select	RS	data to fill RS register
Write data	WD	data to fill WD register
Write data select	WDS	data to fill WDS register
Function select	FS	data to configure the drivers/ADCs
Do array	DoA	—
Do sample	DoS	—
Columns select	CS	data for the select of MUX
Do read	DoR	—

Table 1. List of Instructions

•

Row Select (RS): the RS instruction is responsible for setting up the RS register (see Figure 3) that is subsequently used to correctly control the source and gate drivers to provide the right voltage levels for the crossbar. At this moment, the number of bits to set the RS register is as large as the height of the crossbar (means the input precision is 1 bit). As the size of crossbar increased, this is impractical for hardware implementations and left as a future optimization. However, for simulation purposes, the impact is negligible and actually allows for investigations to utilization patterns of the RS register.

•

Write Data (WD): the WD instruction is responsible for setting up the WD register that is used to write data into the crossbar. Similar to the RS instruction and with the same reasoning, the size of the instruction is as large as the size of the WD register. We envision that this instruction to be replaced when another mechanism is chosen to load data into the crossbar that is more hardware friendly. For simulation purposes, it is now the only way to load data into the crossbar.

•

Write Data Select (WDS): the WDS instruction sets up the WDS register to control which bits of the WD register need to be written. This allows for a flexible manner to write data into the crossbar in the light of potential endurance issues associated with current-day memristor technologies. It is especially useful when a write-verify operation needs to be performed when writing data into the crossbar. Correctly, written bits can be masked out, which helps to improve the endurance of the crossbar.

•

Function Select (FS): the FS instruction is needed for several reasons. Functionally, the DIMs need to set up differently when writing data into the crossbar or when reading out data or performing compute in the crossbar. Furthermore, it is envisioned that future periphery needs additional control signals. These are now conceptually captured in the FS instruction. For example, the control signals needed to control the “Addition units” are more than one, i.e., more than one instruction is needed to control these units. The details are intentionally left out.

•

Do Array (DoA): the DoA instruction is used to actually initiate the DIMs after they have been set up using the RS, WD, and FS instructions. This instruction will have a variable latency depending on the operation that is performed in the crossbar. These delays are specified as parameters in the simulator. More importantly, this instruction allows for a clear (conceptual) separation between the setup stage and the execute stage (see Figure 3) that would allow for pipelining (discussed in Section 3.4) within the CIM tile.

•

Do Sample (DoS): the DoS instruction is used to signal the S&H module to start copying the result from crossbar into its own internal storage. It must be noted that this module still operates in the analog domain. The introduction of the the DoS instruction allows for a conceptual separation of the execute stage and read stage. After the values are copied into the S&H module, the crossbar can basically be issued the next DoA instruction.

•

Column Select (CS): the CS instruction is used to set up the CS register that controls the multiplexers (in the MUX module) in the read stage. In case that each column of the crossbar can be associated with its own ADC/SA, there is no need to have the CS instruction. In all other cases, the CS register flexibly controls which column (in the S&H module) is connected to a ADC/SA. The length of the CS instruction is related to the number of columns of the crossbar. For the same reasons as with the RS and WDS (and WD) instructions, it is purposely defined as it is now and implemented as such in the simulator to allow for further investigation in the future. It is expected to be optimized or replaced when a hardware implementation is considered.

•

Do Read (DoR): the DoR instruction initializes the modules “ADC/SA” to start converting the output of the S&H module (via the MUX) into a digital representation. It is expected that multiple iterations of the CS and DoR instructions need to be issued in order completely read out all the columns of the crossbar. However, depending on the complexity of the module, the latency can vary and thus that a “done”-signal is needed (see Figure 3) to signal the end of the readout before the next DoR instruction can be issued.

It is important to note that the crossbar, the S&H modules, and the ADC modules (related to the DoA, DoS, DoR instructions, respectively) need to be able to signal to the controller that their operation is finished. It is only after this signal, that subsequently instructions can be issued by the controller. If this turns out to be impossible in a real hardware implementation, then we envision the need to setup counters that are initialized to the specific operations to achieve the same functionality. Both approaches are already supported in our simulator.

3.3 Compiler

The compiler translates high-level operations intended for the CIM tile into a sequence of instructions to be executed within the CIM tile. The high-level operations (e.g., MMM) are provided by the front-end compiler, which is responsible for searching for the operations within the application program that can be performed using the memristor crossbar (see Figure 4). The front-end compiler receives the application in Tensor Comprehensions representation. This is then converted to a polyhedral representation to identify computational patterns suitable for acceleration by employing Loop Tactics. The front-end compiler is written by our partners in MNEMOSENE project and more information can be found in Reference [18]. Based on the requirements or constraints that come from either the tile architecture or technology side, our back-end compiler translates high-level operations to in-memory instructions. As depicted in Figure 4, this information is written to the configuration file and passed to the compiler. For example, there might be a constraint on the number of rows that can be activated at once. This constraint can come either from the precision of ADCs or even technology capability. Therefore, if an operation wants to activate more rows, the compiler splits it into several steps and takes care of other changes that might be needed. The flexibility brought by our in-memory instructions helps to overcome constraints, requirements, and sparse patterns. Figure 5 illustrates an example for VMM operation where four ADCs have to read columns in a special pattern (each 8 bit-lines share one ADC). This example demonstrates how the instructions deal with these kinds of patterns. It is important to note that the sequence of instructions generated by the compiler changes whenever the tile configuration changes. Therefore, by putting this complexity into the compiler, we try to keep the tile controller as simple as possible.

Fig. 4.

Fig. 5.

3.4 Pipelining

The operations in the digital periphery and the analog array can be divided into the following stages: (indicated by different colors in Figure 6)

Fig. 6.

(1)

Set up stage (digital): all the control registers (and write data register) are initialized

(2)

Execution stage (analog): perform the actual operation in the analog array

(3)

Read out stage (analog): convert the analog results into digital values

(4)

Addition stage (digital): perform the necessary operations for the integer matrix-matrix multiplication

These stages sequentially follow each other while performing higher level operations translated to a sequence of instructions in our ISA. It should be clear that the pipelining described here is different from the traditional instruction pipelining. In the latter, the latency of each stage should be matched with each other to have a balanced pipeline. In the CIM tile, the latency of the operation performed in the analog array is expected to be much longer than the latency of a single clock cycle in the digital periphery. Therefore, it is important that the right signaling is performed between the stages to enable pipelining. The introduced execution model to pipeline the operations within the CIM tile, will allow for trade-off investigations between different NVM technologies and the (speed of the) digital periphery. Considering the aforementioned stages, the designer should evaluate the latency of each stage (which depends on the configuration of the tile, memristor technology, etc.) to realize the contribution of each stage into the total latency of the tile and merge some of them, in case there is a stage that has far less latency than others. This analog/digital behavior of the CIM-tile restricts the choices and their effectiveness regarding the pipelining stages. It is worth mentioning that, the two analog stages (Execution and Read out stages) cannot be split into more stages. This is the same for the Set up Stage as well, since the registers initialized here cannot be changed before activating the crossbar to ensure the correct functionality of the system.

In contrast to traditional processors where the execution of instructions is split into different stages to enable pipelining, our tile architecture associates different instructions to each tile stage. In the first stage, registers should be filled with new data and the drivers have to be configured. In the second stage, to activate the crossbar using DoA instruction, the operation latency for the previous activation must elapse. The latency of the crossbar, which depends on its technology as well as the operation supposed to be executed, can be captured by either a counter or done signal generated by the circuit itself. Based on the operation, which can be write and read/computational, done signals issued by the crossbar and S&H unit should be used to synchronize the first two stages, respectively. In the third stage, the latency of the Read stage depends not only on the latency of ADCs, but on the number of columns that have to be read as well. Accordingly, groups of columns are read by ADCs sequentially and this stage would be available for the next S&H activation after the last columns already translated to the digital domain. Figure 7 depicts the instruction memory where it is divided into four sections each working on one stage of the tile in parallel.

Fig. 7.

4 CIM Simulator

The proposed CIM architecture is generalized making it capable of targeting different technologies with different configurations of the peripheral circuit. The simulator, written in SystemC, models the architecture presented in Section 3.1 and generates performance and energy numbers by executing applications. The simulator takes as input the program generated by the compiler (presented in Section 3.3), which is currently stored as simple (human-readable) text. Besides the program, to simplify design space exploration, the configuration of architecture has to be sent to the simulator via a configuration file in which the user is able to specify many parameters. The simulator produces as output the following: (1) energy and performance numbers, (2) content of the crossbar (over time), (3) waveforms of all control signals, and (4) the computational results. All outputs are written into text files to be used for further evaluation. Figure 8 illustrates the control and data flow of the simulator considering just 2-stage pipelining. To synchronize the stages, the controller can insert a stall to hold the execution of each stage. By decoding an instruction, the data embedded in it along with the control signals are passed to the corresponding component related to that specific instruction in the data path. Then, the status will be returned to the controller to indicate the execution of the instruction is finished.

Fig. 8.

The simulator has been written in a modular way, which helps us to easily modify or replace the components shown in the tile architecture with new designs/circuits. Moreover, new attributes can be easily added to the components model and their impacts are captured in the kernel level (e.g., read/write variability for the crossbar). In this fully parameterized simulator, each components has its own characteristics like energy, latency, and precision written into the configuration file. Table 2 shows all the parameters that can be set in the file to be used for the early-stage design space exploration. Furthermore, the first of its kind feature of our simulator is the ability to calculate energy number according to the data provided by the application. Existing simulators estimate average energy number regardless of data. Our simulator takes into account the data stored in the array to estimate the energy consumption in the crossbar and its drivers. This is achieved by taking into account the data stored in the crossbar cell resistance level, the number of activated rows, and the equivalent resistance of the crossbar.

Table 2.

crossbar	drivers/analog peripheries	digital peripheries
- number of rows/columns - cell levels	- number of ADCs - precision of ADCs	- clock frequency - datatype size
- cell resistances - cell read/write voltages	- ADC power - read/write drivers power	- energy per adder in addition unit
- write latency - read latency	- ADC latency per conversion - SH latency	- RS/WD/WDS/CS filling cycle - instruction decoding cycle - latency per adder

Table 2. List of Parameters Used in the Configuration File

The power consumption of the crossbar and read drivers regarding read/compute operations is given in Equation (1), where \(R_{rc}\) is the resistance level of the memristor cell located in row “r” and column “c,” \(P_{DIM_{read}}\) is the read drivers power, and \(V(read)\) is the read voltages. In general, \(R_{rc}\) and \(V(read)\) are members of two sets contain possible resistance and voltage levels, respectively. Furthermore, \(activation_r\) is a binary value that indicates whether row “r” is activated and contributes to the power of the crossbar or not. Considering read and compute operations, the summation is performed for the selected rows and all the columns. In addition, for simplicity, the resistance of access transistors, as well as bit-lines, are ignored. The power consumption of write operations is shown in Equation (2), where \(P_{DIM_{write}}\) is the write drivers power. \(V(write)\) and \(I(write)\) are the write voltage and programming current, respectively. In general, \(V(write)\) is a member of a set including different write voltage levels. Finally, \(activation_c\) determines whether the column “c” is activated and would contribute to the crossbar energy or not. The summation is performed over the selected columns in just one activated row. The energy consumption of read/computational as well write operations are shown in Equations (3) and (4), respectively, in which \(T_{Xbar(write)}\) as well as \(T_{Xbar(read)}\) are the latency of the crossbar for write and read/computational operations. The latency of the crossbar for read/computational operations also depends on the peripheral circuits used to capture or read the analog values generated by the crossbar (S&H). Using S&H unit to capture the result, its capacitance is charged with different gradient according to the equivalent resistance of the crossbar. Therefore, the result should be captured at the right time when there is a maximum voltage difference on the capacitance of S&H unit for different crossbar equivalent resistances, which helps to be distinguished by the ADC easily (implies that the crossbar latency also depends on ADC capability). It is worth mentioning that the equations are data-dependent and provide the worst-case energy numbers for the crossbar and its drivers:

\begin{equation} \begin{split} & P_{(read,compute)}= \sum _{r=1}^{\#rows}\left[\left(\sum _{c=1}^{\#columns} \frac{V^2(read)}{R_{rc}}\right)+P_{DIM_{read}}\right]*activation_{r} \\ & \quad \, R_{rc} \in \lbrace L1,L2,\ldots , Ln\rbrace \, \, \, activation_r \in \lbrace 0,1\rbrace \\ & \quad \, V(read) \in \lbrace V(r1),V(r2),\ldots , V(rn)\rbrace , \end{split} \end{equation}

(1)

\begin{equation} \begin{split} & P_{(write)_{r}}= \sum _{c=1}^{\#columns} (V(write)*I(write)*activation_c+P_{DIM_{write}}) \\ & \quad \, V(write) \in \lbrace V(w1),V(w2),\ldots , V(wn)\rbrace , \end{split} \end{equation}

(2)

\begin{equation} \begin{split} E_{(read,compute)} = P_{(read,compute)}*T_{Xbar(read,compute)}, \end{split} \end{equation}

(3)

\begin{equation} \begin{split} E_{(write)} = P_{(write)}*T_{Xbar(write)}. \end{split} \end{equation}

(4)

Due to the execution model and flexibility of our instructions, the simulator is able to add the energy of ADCs, which are active during the program execution to the total energy of the tile. Besides the hardware implementation of our digital controller, which can provide an accurate number for this unit, a more advanced model that incorporates more attributes of the crossbar are our main focus for future work. Considering that limited bits can be stored on a single cell, the data for the crossbar programming has to be distributed over multiple cells, depending on datatype size. In addition, in the case of MMM, due to the limitation on the number of levels which can be supported by the drivers and ADCs, the data for the multiplier should be sent to the crossbar’s rows in several steps, depending on datatype size. Therefore, extra processing in “Addition unit” (see Figure 3) required to fulfill analog computations for calculation over integer numbers. An efficient structure tailored for the CIM-tile was proposed in Reference [75]. Thanks to our in-memory ISA and the flexibility provided by rescheduling them using our back-end compiler, the architecture and subsequently the simulator can perform computation with different integer datatype sizes at the same time.

5 Evaluation AND Discussion

The defined CIM tile architecture, ISA, compiler, and simulator allows for various design-space explorations. These explorations can give insight into the future CIM-tile fabrication and the influence of different parameters on the performance and energy of the tile considering different technologies. In this section, we will present several of these explorations that are currently possible with our tools.

5.1 Simulation Setup

Energy and performance model. The values used in our experiments regarding the (technology) parameters are summarized in Table 3. The needed values related to the digital periphery were obtained by using Cadence Genus targeting the standard cell 90 nm UMC library. The values related to the three targeted technologies (ReRAM, PCM, and STT-MRAM) were taken from References [20, 21, 24, 38, 48, 70]. For all the experiments, we assume the size of the crossbar is 256 by 256 and its input precision is one bit. In addition, the cycles required to fill the tile registers are computed based on the crossbar size and we assumed the databuses to be 32 bits wide.

Table 3.

The power and latency values for the ADCs were taken from Reference [54]. Using Equation (5), where “b” is the ADC resolution, we can derive the energy and latency of the ADC in performing a single conversion for different resolutions. These values are utilized in our simulations. Finally, the power model for the read and write drivers (DIMs) are obtained from Reference [52]:

\begin{equation} \begin{split} power_{ADC} &= \frac{energy_{ADC}}{latency_{ADC}} = \frac{[(64*34fJ) (2^{b-8})]}{([(1.2GS/sec) * (2^{8-b})])^{-1}}. \\ \end{split} \end{equation}

(5)

Benchmark. As a benchmark, the linear-algebra kernel “GEMM” from the Polybench/C benchmark suite was chosen. In this kernel, first, the multiplicands are written into the crossbar (write operation) and then the actual multiplication (compute operation) is performed. The datatype size for both multiplier and multiplicand are considered 8 bits meaning that for each number 8 cells have to be employed (assuming each cell stores one bit). This benchmark was chosen as it intensively utilizes the memory array as well as its digital peripheries given that we want to perform DSE targeting different technologies for the memory array.

5.2 Simulation Results

In this section, we will present several design-space explorations that are currently possible using our simulator. The insights obtained from these analyses will lead designers to take better decisions for the actual implementation.

Performance versus number of ADCs. In Figure 9, we plotted the (normalized) execution time of running the GEMM benchmark targeting three different technologies for the crossbar array, namely, PCM, ReRAM, and STT-MRAM. The simulations were performed assuming a 1 GHz clock frequency for the digital periphery and an 8-bit ADC resolution.

Fig. 9.

We can clearly observe that the number of ADCs greatly impacts the execution time. By adding more ADCs, the total execution time can be reduced as the cycles needed to read out the data from the crossbar array can be reduced. Although STT-MRAM has faster write time, due to the less number of write operations to program the crossbar compared to the computational operations, the improvement on the execution time is negligible. An interesting observation is that the performance does not improve much when moving from 32 to 64 ADCs. This can be explained by the fact that at some point, the latency of the read stage is no longer dominant and further reducing the readout time has little impact on the total execution time. Finally, regardless of the number of ADCs, the energy consumption is almost constant (small fluctuation due to the data randomness), since the number of conversions is always fixed.

Performance versus frequency of digital periphery. The latency of the operations in the (analog) crossbar array is a constant number. Therefore, it is interesting to determine how fast the digital periphery should be clocked to “match” this latency to make the pipeline more balanced. In the following investigation, we have fixed the number of ADCs to 16 and ran the GEMM benchmark at different frequencies.

Figures 10 clearly shows that performance improvements can be gained by raising the frequency of the digital periphery. However, increasing the clock frequency beyond 1 GHz does not result in much better execution times as the analog circuits (relatively) are becoming the bottleneck. In addition, the performance improvement due to the pipelining will be reduced, since the stages are more unbalanced. This DSE allows a designer to make the different stages of the tile more balanced. A positive side-effect is that pipelining more balanced stages will usually lead to better performance improvements over a non-pipelined design. Energy contribution per module. Figure 11 depicts the relative energy spent in the different modules when running the GEMM benchmark for 16 ADCs and using an 8-bit datatype. We can clearly observe that the largest energy consumer is still the crossbar and its drivers. Although the power consumption of ADC is high, due to the low latency of this component (around 1 ns [54]), the energy consumption of ADC does not dominate. In the PCM case, the relative energy consumption of the ADCs and crossbar are close to each other (compared to other technologies). The reason for this is that the power consumption of PCM is relatively lower compared to the ReRAM technology due to the higher cell resistance (see Table 3). This in turn increases the relative energy consumption of the ADCs.

Fig. 10.

Fig. 11.

It is worth mentioning that the energy consumption of the crossbar depends on the input and programming data. As more devices are programmed to LRS and more rows are activated, clearly more energy is consumed during the computation. Since the simulator can actually execute kernels, the energy number obtained for the crossbar is data-dependent. Figure 12 depicts the energy consumption of the crossbar with different levels of sparsity. This figure illustrates the sparsity of logic value 1 for matrix-matrix multiplication. For this experiment, it is considered that the multiplier (input) and the multiplicand (programmed) matrix have the same sparsity. In addition, logic value 1 (0) as an input implies the corresponding row is activated (deactivated) and as programming data indicates the memristive device is programmed to LRS (HRS). The simulation is performed for ReRAM and PCM devices and an interesting observation is that although the ratio of HRS to LRS is lower in ReRAM devices, since PCM devices have higher LRS compared to ReRAM, the sparsity leads to less variation in the energy consumption of the crossbar made with PCM. Relative latency contribution per stage. Figure 13 depicts the relative time that the GEMM application spends in each of the 4 stages plotted against the frequency of the digital periphery. Since several columns share an ADC, reading all the columns should be performed in multi-steps. In addition, after each ADC activation, digital processing is required in “addition unit.” Therefore, we can clearly observe that with a low frequency, the read and addition stages are completely dominant in the total latency. Using an efficient structure and minimum-sized adder, the latency of the addition unit is no longer than the read stage. With low clock frequency, the latency of the analog circuits is hidden in one clock period. However, As the clock frequency is increased more, the latency of the analog components starts to rise (relatively). Consequently, we can observe that the latency of the read stage, which is composed of analog (latency of ADC) and digital (decoding latency) latency, as well as the execute stage impose more latency. This information can be used to determine the number of pipeline stages for the actual implementation.

Fig. 12.

Fig. 13.

Figure 14 depicts the relative time that the GEMM application spends in each of the four stages plotted against the number of utilized ADCs. It should be clear that increasing number of ADCs, the number cycles spend in the read out stage is greatly reduced. Consequently, we can observe that the relative contribution of the setup stage to the total latency grows accordingly. The contribution of the other stages to the total latency is almost negligible. With the advent of advanced ADC design to have more ADCs per tile, this figure brings an insight into its implications and helps for future design decisions.

Fig. 14.

6 Conclusions

In this article, we proposed a general architecture and execution model that captures how a memristor-based compute-in-memory tile would function when it produces its results in the periphery (instead of in the crossbar array). This architecture is defined to accommodate current and future NVM technologies used to implement the crossbar array. Example technologies are PCM, ReRAM, and STT-RAM. Our abstraction allowed for the definition of an ISA that describes/captures the functionalities within the CIM tile and allows for sequencing of these instructions to perform higher-level operations achievable with such a CIM tile, e.g., integer matrix-matrix multiplication. In addition, a compiler was developed that can translate such high-level operations into a sequence of instructions that can be executed by the CIM tile. Subsequently, we developed a parameterized cycle-accurate simulator for our CIM architecture that would allow for design space exploration. The parameters are either technology-dependent parameters or design parameters of our architecture. Finally, we ported the GEMM benchmark and demonstrated in this article how various design space explorations for this benchmark can be performed targeting different technologies and other design parameters. In future work, we will extend the simulator to comprise more CIM tiles. This will help us to explore the communication between tiles, implement complex applications, and evaluate the scalability of our simulator.

References

[1]

Micron.com. 2017. Calculating Memory Power for DDR4 SDRAM. Retrieved from https://www.micron.com/-/media/client/global/documents/products/technical-note/dram/tn4007_ddr4_power_calculation.pdf.

Abstract

1 Introduction

2 Background AND Related Work

2.1 Memristor Devices

2.2 Memristor-based Computation Circuits

2.3 In-memory Architectures and Simulators

3 CIM-tile Architecture

3.1 Overview

3.2 Instruction-set Architecture

3.3 Compiler

3.4 Pipelining

4 CIM Simulator

5 Evaluation AND Discussion

5.1 Simulation Setup

5.2 Simulation Results

6 Conclusions

References

Cited By

Index Terms

Recommendations

Memristor-Based (ReRAM) Data Memory Architecture in ASIP Design

A multilevel memristor-CMOS memory cell as a ReRAM

System-level impacts of persistent main memory using a search engine

Comments

Information

Published In

Publisher

Journal Family

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

HTML Format

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations