1 Introduction
Today’s data-driven applications that use graph processing [30, 53, 56, 79], machine learning [15, 29], or privacy-preserving paradigms [3, 19, 82] demand memory sizes on the order of hundreds of GBs and bandwidths on the order of TB/s. The widely used main memory technology, DRAM, is facing critical technology scaling challenges and fails to meet the increasing bandwidth and capacity demands of these data-driven applications [37, 40, 41, 48, 58, 95].
Phase Change Memory (PCM) is emerging as a class of non-volatile memory (NVM) that is a promising alternative to DRAM [33, 39, 46, 47, 71, 72]. PCMs outperform other NVM candidates owing to their higher reliability, increased bit density, and better write endurance [13, 16, 61, 92].
In PCMs, data is stored in the state of the phase change material, i.e., crystalline (logic 1) or amorphous (logic 0) [64, 93]. A SET operation triggers a transition to the crystalline state, and a RESET operation triggers a transition to the amorphous state. PCMs also enable multi-level cells (MLC) using the partially crystalline states. Higher MLC capacity enables increased bit density (\(bits/mm^2\)). PCM cells are typically controlled electrically (we refer to them as EPCM cells), where different PCM states have distinct resistance values. EPCM cells are SET or RESET by passing the corresponding current through the phase change material (via the bitline) to trigger the desired state transition. The state of the EPCM cells is read out by passing a read current and measuring the voltage on the bitline. Main memory systems using EPCM cells are designed using the same microarchitecture and read/write access protocol as DRAM systems [44, 85]. EPCM systems, however, experience resistance drift over time and so are limited to 2 \(bits/cell\) [13, 17], have \(3\!-\!4\times\) higher write latency than DRAM leading to lower performance [5, 44], consume high power due to the need for large on-chip charge pumps [35, 66, 90], and have lower lifetime than DRAM due to faster cell wearout [70].
Recent advances in device research have demonstrated optically controlled PCM cells (we refer to them as OPCM cells) [18, 26, 27, 78]. OPCM cells exhibit higher MLC capacity than EPCM cells (up to 5 \(bits/cell\) [52]). Moreover, high-bandwidth-density silicon-photonic links [84, 87], which are being developed for processor-to-memory communication, can directly access these OPCM cells, thereby yielding higher throughput and lower energy-per-access than EPCM. These two factors make OPCM a more attractive candidate for main memory than EPCM.
Since the optical signals in silicon-photonic links directly access the OPCM cells, the traditional row-buffer-based memory microarchitecture and read/write access protocol face critical design challenges when adapted for OPCM. We need a complete redesign of the memory microarchitecture and a novel access protocol tailored to the OPCM cell technology.
In this article, we propose a COmbined System of Optical Phase Change Memory and Optical LinkS, COSMOS, which integrates the OPCM technology and the silicon-photonic link technology, thereby providing seamless high-bandwidth access from the processor to a high-density memory. Figure 1 shows a computing system with COSMOS. COSMOS includes a hierarchical multi-banked OPCM array, E-O-E control unit, silicon-photonic links, and laser sources. The multi-banked OPCM array uses 3D optical integration to stack multiple banks vertically, with 1 bank/layer. The cells in the OPCM array are directly accessed using silicon-photonic links that carry optical signals, thereby eliminating the need for electrical-optical (E-O) and optical-electrical (O-E) conversion in the OPCM array. These optical signals are generated by an E-O-E control unit that serves as an intermediary between the memory controller (MC) in the processor and the OPCM array. This E-O-E control unit is responsible for mapping the standard DRAM protocol commands sent by the MC onto optical signals and then sending these optical signals to the OPCM array.
The major contributions of our work are as follows:
(1) We architect COSMOS, which consists of a hierarchical multi-banked OPCM array, where the cells are accessed directly using optical signals in silicon-photonic links. The OPCM array design combines the wavelength-division-multiplexing (WDM) and mode-division-multiplexing (MDM) properties of optical signals to deliver high memory bandwidth. Moreover, the OPCM array contains only passive optical elements and does not consume power, thus providing cost and efficiency advantages.
(2) We propose a novel mechanism for read and write operation of cache lines in COSMOS. A cache line is interleaved across multiple banks in the OPCM array to enable high-throughput access. The write data is encoded in the intensity of optical signals that uniquely address the OPCM cell. The readout of an OPCM cell uses a three-step operation that measures the attenuation of the optical signal transmitted through the cell, where the attenuation corresponds to a predetermined bit pattern. Since the read operation is destructive, we design an opportunistic writeback operation of the read data to restore the OPCM cell state.
(3) We design an E-O-E control unit to interface COSMOS with the processor. This E-O-E control unit receives standard DRAM commands from the processor and converts them into the OPCM-specific address, data, and control signals that are mapped onto optical signals. These optical signals are then used to read/write data from/to the OPCM array. The responses from the OPCM array are converted by the E-O-E control unit back into standard DRAM protocol commands that are sent to the processor.
Evaluation of a 2.5D system with a multi-core processor and COSMOS demonstrates \(2.15\times\) higher write throughput and \(2.09\times\) higher read throughput compared to an equivalent system with EPCM. This increased memory throughput in COSMOS reduces the memory latency by \(33\%\). For graph and high performance computing (HPC) workloads, when compared to EPCM, COSMOS has \(2.14\times\) better performance, \(3.8\times\) lower read energy-per-bit, and \(5.97\times\) lower write energy-per-bit. Moreover, COSMOS provides a scalable and non-volatile alternative to DDR5 DRAM systems, with similar performance and energy consumption for read and write accesses. With DRAM technology undergoing critical scaling challenges, COSMOS presents the first non-volatile main memory system with improved scalability, increased bit density, high area efficiency, and performance and energy consumption comparable to DDR5 DRAM.
3 Motivation
In this section, we motivate the need for a novel memory microarchitecture and access protocol for OPCM by first describing the typical EPCM architecture and then explaining why such an architectural design is impractical for OPCM arrays. Figure 3 shows the architecture of EPCM [39, 44]. The EPCM array is a hierarchical organization of banks, blocks, and sub-blocks [44]. During read or write operations, the EPCM first receives a row address. The row address decoder reads the appropriate row from the EPCM array into a row buffer. The EPCM next receives the column address, and the column address multiplexer selects the appropriate data block from the row buffer. The bitlines of the selected data block are connected to the write drivers for write operation or to the sense amplifiers for read operation. For write operation, the charge pumps supply the required drive voltage to the write drivers, which corresponds to SET or RESET operation. For read operation, a read current is first passed through the GST element in the EPCM cell through an access transistor [44]. Then, sense amplifiers determine the voltage on the bitline to read out logic 0 or logic 1.
Naively adapting the EPCM architecture for OPCM by simply replacing the EPCM cells with OPCM cells raises latency, energy, and thermal concerns, rendering such a design impractical. To understand these concerns, let us consider an OPCM array that uses the EPCM architecture from Figure 3 with either an optical row buffer or an electrical row buffer. Such an OPCM array architecture has the following limitations:
Limitations with optical row buffer: An optical row buffer can be designed using a row of GST elements whose states are controlled using optical signals. When a row is read from the OPCM array using an optical signal, the data is encoded in the signal’s intensity. This intensity is not large enough to update the state of the GST elements in the optical row buffer. So, the read value first needs to be converted into an electrical signal. Based on this value, an optical signal with the appropriate intensity is generated to write the value into the optical row buffer. Essentially, we perform an extra O-E and E-O conversion. This necessitates the use of photodetectors, receivers, transmitters, and optical pulse generators, which adds to the energy and latency of a memory access. Hence, an optical row buffer is not a viable option.
Limitations with electrical row buffer: An electrical row buffer can be designed either using capacitor cells as in DRAM or using phase change materials controlled using electrical current as in EPCM. In both cases, the row buffer is accessed using electrical signals (assuming electrical links between the processor and memory). This increases the access latency and energy and creates thermal issues as follows:
(1) Impact on read latency: Upon receiving a row address from the MC on electrical links, the address first needs to be converted to an optical pulse, which is then used to read data from the OPCM cells. After optical readout of an entire row from the OPCM array, the data has to be converted back into the electrical domain to store it in the row buffer. These two operations require an E-O and an O-E conversion, respectively, inside the OPCM array. These E-O/O-E conversions add a latency of \(25\!-\!30\) cycles for each read access [6].
(2) Impact on write latency: When writing data from the row buffer to the OPCM array, a set of sense amplifiers reads the data from the electrical row buffer. This row buffer data is then mapped onto optical signals with appropriate intensities using pulse generation circuitry within memory. The optical signals are then used to write the data to the OPCM cells. Therefore, the write operation requires three E-O/O-E conversions, which adds a latency of \(40\!-\!45\) cycles for each write access [6].
(3) Impact on read/write energy: The energy spent in the peripheral circuitry for optical signal generation and readout, as well as in the circuitry for E-O-E conversion, increases the active power dissipation within memory [6, 60, 63]. Since each read/write operation encounters multiple E-O-E conversions, the energy per read and write access rises considerably (\(\gt \!200~pJ/bit\)) [24].
(4) Thermal issues: The MRRs used in the OPCM array are highly sensitive to thermal variations [65]. Thermal variations due to active electrical circuits within memory lower the reliability of MRR operation. Such a design calls for active thermal and power management in OPCM, which contributes to a power overhead of \(10\!-\!30~W\) [2].
Furthermore, with this EPCM-style architecture, using silicon-photonic links in combination with OPCM requires additional E-O and O-E conversions at the MC and at the OPCM array, which exacerbates the problems discussed above. Hence, we argue for the need to redesign the microarchitecture and the read/write access mechanisms so that they are tailored to the properties of the OPCM cell technology and the associated silicon-photonic link technology.
5 Access Protocol in COSMOS
To enable high-throughput access of OPCM cells within the OPCM array, we propose a novel read and write access protocol for COSMOS. When the MC issues a read or write operation, the row address and column address are entered into the Row Address Queue and Column Address Queue, respectively, and the write data is entered into the Data Buffer in the E-O-E control unit.
5.1 Writing a Cache Line to OPCM Array
To write a cache line to the OPCM array, the E-O-E control unit identifies the bank ID, the row ID and column ID of the tile, and the row ID and column ID of the cell within a tile using the address mapping. In our example with a \(32 \times 32\) array of cells in a tile, writing a 128-bit chunk of a cache line updates all the cells in a row (any misaligned accesses are handled on the processor side). Hence, for writes at cache line granularity, the column ID within a tile is not used. The E-O-E control unit determines the optical intensity required at each OPCM cell in the row to write the 128-bit chunk of the cache line. It then breaks down this intensity into two signals: one with a constant intensity \(I_0\) and the other with a data-dependent intensity \(I_i\), where \(i=1,2,\ldots ,128\). The E-O-E control unit modulates the constant intensity \(I_0\) onto the optical signal corresponding to the row (selected by the row ID of the cell) within a tile. It then modulates the data-dependent optical intensities (i.e., \(I_1\), \(I_2,\ldots ,I_{128}\)) onto the optical signals corresponding to the 4 tiles spread across 4 banks, with 32 columns per tile. The E-O-E control unit transmits the row signal \(I_0\) and the column optical signals \(I_1, I_2,\ldots ,I_{128}\) in parallel to write the cache line in the OPCM array. The superposition of the optical signals, i.e., \(I_0\text{+}I_1\), \(I_0\text{+}I_2,\ldots ,I_0\text{+}I_{128}\), updates the state of the OPCM cells. Note that, since a cache line is spread across 4 banks, the E-O-E control unit modulates data onto optical signals to write to an OPCM tile in each of these 4 banks. None of the optical signals individually carries sufficient intensity to trigger a state transition at any cell, so no other cells along the row or column are affected.
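To make the decomposition concrete, the sketch below splits a 512-bit cache line into 128 MLC symbols and derives the constant row intensity \(I_0\) plus the data-dependent column intensities \(I_i\). It is a minimal illustration: the 4-bits/cell depth, the intensity units, and the linear symbol-to-intensity mapping are our own assumptions, not the exact COSMOS encoding.

```python
# Sketch of the cache-line write decomposition; constants and the linear
# symbol-to-intensity mapping are illustrative assumptions.

BANKS = 4                 # a cache line is interleaved across 4 banks
COLS_PER_TILE = 32        # 32 column signals per tile row
BITS_PER_CELL = 4         # assumed MLC depth
LEVELS = 2 ** BITS_PER_CELL
I_0 = 0.5                 # constant intensity carried by the row signal

def cache_line_to_symbols(bits):
    """Split a 512-bit cache line into 128 per-cell MLC symbols."""
    assert len(bits) == BANKS * COLS_PER_TILE * BITS_PER_CELL
    return [int("".join(map(str, bits[i:i + BITS_PER_CELL])), 2)
            for i in range(0, len(bits), BITS_PER_CELL)]

def column_intensities(symbols):
    """Data-dependent intensities I_1..I_128, one per column signal.
    Each I_i alone stays below the state-transition threshold; only the
    superposition I_0 + I_i at the addressed cell crosses it."""
    return [0.4 + 0.4 * s / (LEVELS - 1) for s in symbols]

# Usage: one 512-bit cache line -> 4 banks x 32 columns of superposed intensities.
line = [1, 0] * 256
cols = column_intensities(cache_line_to_symbols(line))
per_bank = [cols[b * COLS_PER_TILE:(b + 1) * COLS_PER_TILE] for b in range(BANKS)]
cell_intensity = [I_0 + i for i in cols]   # what each addressed cell receives
```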
5.2 Reading a Cache Line from OPCM Array
To read a cache line from the OPCM array, the E-O-E control unit transmits sub-ns optical pulses along all the columns in a tile that contain the cache line and measures the pulse attenuation. However, there are multiple OPCM cells along each column, and so the output intensity of the optical signals is attenuated by all cells in that column. It is, therefore, not possible to determine the OPCM cell values using a one-pulse readout. Hence, we use a three-step process for the read operation of the OPCM array in COSMOS. ➊ To read a cache line, the E-O-E control unit first determines the bank ID, row ID, and column ID of the tile, and the row ID and column ID of the cell. The E-O-E control unit transmits a read pulse \(RD_1\) through all the columns in a tile containing the cache line. Note that, since a cache line is spread across 4 banks, the E-O-E control unit transmits \(RD_1\) on the 4 different optical modes corresponding to the 4 banks. Each read pulse is attenuated by all the OPCM cells in the column. The attenuated pulses are received by the E-O-E control unit, which records the intensities of these attenuated pulses as \(I_{1,1}\), \(I_{2,1},\ldots ,I_{128,1}\). These intensities are converted into electrical voltages and stored as \(V_{1,1}\), \(V_{2,1},\ldots ,V_{128,1}\). ➋ The E-O-E control unit then transmits a RESET pulse to the OPCM cells of the cache line, i.e., all the cells along a row within a tile. All the cells along the row are now amorphized and have \(100\%\) optical transmission. ➌ The E-O-E control unit then sends a second read pulse \(RD_2\) through all the columns of a tile containing the cache line. Each read pulse is again attenuated by all OPCM cells in the column. Given that step 2 amorphized all OPCM cells of the cache line, the output pulse intensities are different from those in step 1. The attenuated pulses are received by the E-O-E control unit, which records the intensities of these attenuated pulses as \(I_{1,2}\), \(I_{2,2},\ldots ,I_{128,2}\). These intensities are converted into electrical voltages and stored as \(V_{1,2}\), \(V_{2,2},\ldots ,V_{128,2}\). The E-O-E control unit computes the difference of the stored voltages of steps 1 and 3, i.e., \(V_{1,1}\!-\!V_{1,2},V_{2,1}\!-\!V_{2,2},\ldots ,V_{128,1}\!-\!V_{128,2}\). This difference is used to determine the cache line data stored in the OPCM cells.
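The sketch below illustrates the differential decode in step 3. Because the RESET in step 2 leaves the target cell fully transmissive while all other cells in the column are unchanged between the two read pulses, the voltage difference isolates the target cell's attenuation. The full-scale attenuation value and the linear level mapping are assumptions made for illustration only.

```python
# Sketch of the differential decode in step 3 of the read protocol.
# FULL_SCALE and the linear level mapping are illustrative assumptions.

LEVELS = 16          # assumed 4 bits/cell
FULL_SCALE = 0.32    # assumed max per-cell attenuation (volts, after O-E conversion)

def decode_cell(v_first, v_second):
    """Decode one cell from the two recorded voltages.
    v_first  : V_{i,1}, recorded with the target cell in its stored state
    v_second : V_{i,2}, recorded after the RESET step amorphized the cell
               (~100% transmission), so all other cells in the column cancel."""
    delta = v_first - v_second          # V_{i,1} - V_{i,2}, as in the protocol
    level = round(abs(delta) / FULL_SCALE * (LEVELS - 1))
    return max(0, min(LEVELS - 1, level))

# Usage: decode three example columns of a cache line.
v1 = [0.48, 0.62, 0.80]                 # V_{i,1} from read pulse RD_1
v2 = [0.80, 0.80, 0.80]                 # V_{i,2} from read pulse RD_2 (after RESET)
data = [decode_cell(a, b) for a, b in zip(v1, v2)]   # -> levels [15, 8, 0]
```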
5.3 Opportunistic Writeback After Read
The RESET operation in step 2 of the read operation destroys the original data in the OPCM cells. We, therefore, perform an opportunistic writeback of the cache line to the OPCM cells. After completing the three steps of the read operation, the read data and the address are saved into a holding buffer in the E-O-E control unit. When there are no pending read or write operations from the MC, the E-O-E control unit reads the data and its address from the holding buffer and writes the data back to the OPCM array. This writeback operation therefore does not block any critical pending read and write operations coming from the MC. The dependencies between read and write requests involving the holding buffer and the data buffer are handled in the E-O-E control unit. For a Read-After-Read case, the second read operation reads the data from the holding buffer if present. If the data is not in the holding buffer, then the second read operation uses the three-step process plus writeback (described above) to complete the read operation. For a Write-After-Read case, if the write address matches the read address and there is an entry for that read in the holding buffer, then the corresponding entry in the holding buffer is invalidated. The write data is entered into the data buffer and then written into the OPCM array.
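The sketch below shows one way this holding-buffer bookkeeping could be organized; the class and method names are illustrative, not taken from the COSMOS implementation.

```python
# Sketch of the opportunistic-writeback bookkeeping; names are assumptions.

class HoldingBuffer:
    """Parks data from destructive reads until it can be written back."""

    def __init__(self):
        self.entries = {}                    # address -> cache-line data

    def park(self, addr, data):
        # Called after the three-step read completes.
        self.entries[addr] = data

    def lookup(self, addr):
        # Read-After-Read: serve the second read from the holding buffer if
        # present; otherwise the caller performs the normal three-step read
        # followed by an opportunistic writeback.
        return self.entries.get(addr)

    def invalidate(self, addr):
        # Write-After-Read: a newer write supersedes the pending writeback;
        # the write data instead goes through the data buffer to the array.
        self.entries.pop(addr, None)

    def drain_one(self):
        # Issued only when no reads/writes are pending from the MC, so the
        # writeback never blocks critical requests.
        return self.entries.popitem() if self.entries else None
```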
6 E-O-E Control Unit Design
Our proposed E-O-E control unit provides the interface between the processor and the OPCM array. The MC sends standard DRAM access protocol commands to the E-O-E control unit. The E-O-E control unit maps these commands onto optical signals that read/write the data from/to the OPCM array.
Though we can design a COSMOS-specific MC and the associated read/write protocol, our goal is to enable the COSMOS operation with a standard MC in any processor. The E-O-E control unit uses the following five sub-units to read from and write to the OPCM array: data modulation unit (DMU), address mapping unit (AMU), pulse selector unit (PSU), pulse amplification unit (PAU), and pulse filtering unit (PFU). Each OPCM bank has a dedicated set of these five sub-units in the E-O-E control unit. Figure 5(a) shows the design of the E-O-E control unit in COSMOS and the internals of these sub-units.
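For orientation, the structural sketch below mirrors this organization in code. The interfaces are illustrative assumptions; only the division of labor among the five sub-units follows the description above.

```python
# Structural sketch of one bank's slice of the E-O-E control unit.
# Interfaces are assumed for illustration.

class AddressMappingUnit:          # AMU
    def map(self, row_addr, col_addr):
        """Decode bank ID, tile row/column ID, and cell row/column ID."""

class PulseSelectorUnit:           # PSU
    def select(self, mapped_addr):
        """Pick the row signal and the 32 column signals for the access."""

class DataModulationUnit:          # DMU
    def bias_currents(self, data, mapped_addr):
        """Turn write data (or RD/RESET pulses) into per-signal SOA bias currents."""

class PulseAmplificationUnit:      # PAU
    def amplify(self, signals, currents):
        """Drive the SOAs so each optical signal reaches its target intensity."""

class PulseFilteringUnit:          # PFU
    def to_voltages(self, attenuated_signals):
        """Convert received optical intensities into voltages for readout."""

class BankControl:
    """Each OPCM bank has its own dedicated set of the five sub-units."""
    def __init__(self):
        self.amu = AddressMappingUnit()
        self.psu = PulseSelectorUnit()
        self.dmu = DataModulationUnit()
        self.pau = PulseAmplificationUnit()
        self.pfu = PulseFilteringUnit()
```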
Figure 5(b) illustrates the sequence of operations in the E-O-E control unit for a write operation to a bank containing \(512 \times 512\) tiles with \(32 \times 32\) cells per tile (the same design as that used in Figure 4(e)). The AMU in the E-O-E control unit first receives the row address and then the column address from the MC (Step 1). Depending on the addresses, the PSU in the E-O-E control unit selects the appropriate optical signals using the address mapping explained in Section 4.4 (Step 2). The PSU selects one optical signal for the row and 32 optical signals for the 32 columns in the row to write to 32 cells in a tile. In parallel with the write address, the DMU in the E-O-E control unit receives the write data from the MC (Step 3). The DMU generates a unique bias current for each of the 32 optical signals depending on the write data and applies the currents to the semiconductor optical amplifiers (SOAs) in the PAU (Step 4). The SOAs amplify the optical signals to the required intensities. These amplified signals and the optical signal corresponding to the row traverse the silicon-photonic links to the appropriate OPCM cells in the bank and SET/RESET the cells (Step 5). The E-O-E control unit incurs a latency of \(T_{EO}\) cycles to map the address and data onto optical signals, resulting in a peak throughput of \(1/T_{EO}\). It should be noted that the physical location of a cell in the OPCM array in COSMOS determines the losses experienced by an optical signal that writes to the cell. These losses in turn dictate the amplification of that optical signal in the E-O-E control unit. To address this, the E-O-E control unit uses the address mapping (refer to Figure 4(e)) to map the physical address to the corresponding OPCM cell that needs to be written. Based on the physical location of the cell, the DMU in the E-O-E control unit looks up a pre-programmed LUT, which holds the amplification factor required for each cell. The DMU applies a bias current as a function of this amplification factor to the PAU, which amplifies the optical signals to the required level.
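The sketch below illustrates this LUT-based loss compensation. The table contents, the keying scheme, and the current-per-gain constant are assumed values for illustration; in COSMOS the table is pre-programmed from the optical losses along each cell's path.

```python
# Sketch of the loss-compensation lookup in the DMU; values are assumptions.

# Amplification factor per physical cell location,
# keyed here by (bank, tile_row, tile_col, cell_row).
AMP_LUT = {
    (0, 0, 0, 0): 1.6,          # cell close to the bank's optical input
    (0, 511, 511, 31): 3.1,     # distant cell: more loss, more amplification
}

def soa_bias_current(cell_loc, required_intensity, amps_per_gain=2.0e-3):
    """Bias current (A) for the SOA driving one column signal, so that the
    intensity arriving at the cell equals required_intensity despite path loss."""
    amp_factor = AMP_LUT.get(cell_loc, 1.0)
    return amps_per_gain * amp_factor * required_intensity

# Usage: the same data-dependent intensity needs more drive for a farther cell.
near = soa_bias_current((0, 0, 0, 0), required_intensity=0.9)      # ~2.9 mA
far = soa_bias_current((0, 511, 511, 31), required_intensity=0.9)  # ~5.6 mA
```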
Figure 5(c) illustrates the sequence of operations in the E-O-E control unit for the three-step read operation from a bank. In the first step, the AMU receives the row and column addresses from the MC and selects the appropriate 32 optical signals in the PSU using the address mapping explained in Section 4.4 (Step 1.1). The DMU generates a low-intensity readout pulse (\(RD_1\)), and the PAU modulates this pulse on the 32 optical signals (Step 1.2). The optical signals traverse the silicon-photonic links and then the columns in the tile. The optical signals lose intensity as they pass through all the OPCM cells in their associated columns (Step 1.3). The intensities of these attenuated signals are recorded by the PFU (Step 1.4). The PFU then converts the optical intensities into electrical voltages, \(V_{1,1}\), \(V_{2,1},\ldots ,V_{32,1}\) (Step 1.5). In the second step, the DMU generates the RESET pulse. This RESET pulse is mapped onto the appropriate optical signals, and these signals are sent to the OPCM array (Step 2.1). The signals traverse the silicon-photonic links and amorphize the OPCM cells corresponding to the read address (Step 2.2). In the third step, the DMU generates another readout pulse (\(RD_2\)), and the PAU modulates this pulse on a set of 32 optical signals (Step 3.1). These signals traverse the silicon-photonic links and then the appropriate columns in the tile. These signals, too, lose intensity as they pass through all the OPCM cells in their associated columns (Step 3.2). The PFU records these attenuated signals (Step 3.3) and converts these optical signals into electrical voltages \(V_{1,2}\), \(V_{2,2},\ldots ,V_{32,2}\) (Step 3.4). Finally, the PFU computes \(V_{1,1}\!-\!V_{1,2}\), \(V_{2,1}\!-\!V_{2,2},\ldots ,V_{32,1}\!-\!V_{32,2}\) to determine the data (Step 3.5) and sends the data to the MC. The PFU also writes this data back to the holding buffer in the DMU (Step 3.6).
10 Conclusion
EPCM systems suffer from long write latencies and high write energies, yielding poor performance and high energy consumption for data-intensive applications. In contrast, OPCM technology provides the opportunity to design high-performance and low-energy memory systems due to its higher MLC capacity and the direct cell access via high-bandwidth-density and low-latency silicon-photonic links. Adapting the current EPCM architecture for OPCM systems, however, raises major latency, energy, and thermal concerns, rendering such a design impractical. We are the first to architect a complete memory system, COSMOS, which consists of an OPCM array microarchitecture, a read/write access protocol tailored for OPCM technology, and an E-O-E control unit that interfaces the OPCM array with the MC. Our evaluations show that, compared to an EPCM system, our proposed COSMOS system provides \(2.09\times\) higher read throughput and \(2.15\times\) higher write throughput, thereby reducing the execution time by \(2.14\times\), read energy by \(1.24\times\), and write energy by \(4.06\times\).
We show that COSMOS designed with state-of-the-art technology provides performance and energy similar to DDR5. This is a significant finding, as future higher-density OPCM cells are expected to provide better performance. Our promising first version of the COSMOS architecture opens doors for new architecture-level, circuit-level, and system-level methods to enable practical integration of OPCM-based main memory in future computing systems. Moreover, the high-throughput and scalable OPCM technology ushers in interesting research opportunities in persistent memory, in-memory computing, and accelerator-specific memory designs.