research-article

Open access

POEM: Performance Optimization and Endurance Management for Non-volatile Caches

Authors:

Preeti Ranjan PandaAuthors Info & Claims

ACM Transactions on Design Automation of Electronic Systems, Volume 29, Issue 5

Article No.: 79, Pages 1 - 36

https://doi.org/10.1145/3653452

Published: 04 September 2024 Publication History

PDF eReader

Abstract

Non-volatile memories (NVMs), with their high storage density and ultra-low leakage power, offer promising potential for redesigning the memory hierarchy in next-generation Multi-Processor Systems-on-Chip (MPSoCs). However, the adoption of NVMs in cache designs introduces challenges such as NVM write overheads and limited NVM endurance. The shared NVM cache in an MPSoC experiences requests from different processor cores and responses from the off-chip memory when the requested data is not present in the cache. Besides, upon evictions of dirty data from higher-level caches, the shared NVM cache experiences another source of write operations, known as writebacks. These sources of write operations—writebacks and responses—further exacerbate the contention for the shared bandwidth of the NVM cache and create significant performance bottlenecks. Uncontrolled write operations can also affect the endurance of the NVM cache, posing a threat to cache lifetime and system reliability. Existing strategies often address either performance or cache endurance individually, leaving a gap for a holistic solution. This study introduces the Performance Optimization and Endurance Management (POEM) methodology, a novel approach that aggressively bypasses cache writebacks and responses to alleviate the NVM cache contention. Contrary to the existing bypass policies that do not pay adequate attention to the shared NVM cache contention and focus too much on cache data reuse, POEM’s aggressive bypass significantly improves the overall system performance, even at the expense of data reuse. POEM also employs effective wear leveling to enhance the NVM cache endurance by careful redistribution of write operations across different cache lines. Across diverse workloads, POEM yields an average speedup of 34% over a naïve baseline and 28.8% over a state-of-the-art NVM cache bypass technique while enhancing the cache endurance by 15% over the baseline. POEM also explores diverse design choices by exploiting a key policy parameter that assigns varying priorities to the two system-level objectives.

1 Introduction

High-performance and energy-efficient Multi-Processor Systems-On-Chip (MPSoCs) have become increasingly popular in recent times [19, 61]. MPSoCs, which integrate multiple processors and accelerators onto a single chip, are used in parallel processing, AI [20], real-time data processing, multimedia applications, and so on. Modern MPSoC performance is limited by the memory wall [75], a critical performance gap between the compute and the memory resources. Next generation data-intensive workloads further aggravate this issue [69], demanding more compute power and larger on-chip storage, especially larger last level cache (LLC), which is typically shared by all cores. Conventional memory technologies, such as SRAM, suffer from several inherent limitations such as substantial cache area footprint and consumption of high leakage power, which become even more critical with technology scaling [17].

Non-volatile memories (NVMs) have experienced a surge in attention, due to their distinct advantages over traditional technologies such as SRAM. The non-volatility of NVMs is attractive for ensuring data integrity during power cycles and reducing the need for time-consuming data re-fetches. NVMs also address the problem of memory wall with their high storage density and ultra-low leakage power [66]. While NVMs demonstrate good potential, their wide-spread adoption as a replacement for SRAM has several limitations, such as inefficient write operations, non-uniform access delays, limited reliability, and endurance. These characteristics of emerging NVMs create opportunities for innovation and breakthroughs in designing the future memory hierarchy for next generation computing systems.

Contrary to charge-based memory technologies such as SRAM and DRAM, NVMs exploit electron spin, physical state, dielectric properties of materials, and so on, to store digital information. For caches and main memories, spin-transfer torque magnetic random-access memory (STT-MRAM) [36, 53, 60, 74], phase change memory (PCM) [62], racetrack memory [77], resistive RAM (ReRAM) [6], and so on, have emerged as various potential NVM candidates. Nevertheless, due to relatively high access speed and better endurance property over the other NVMs, STT-MRAM gained the highest attention, particularly for its application in on-chip caches [46]. However, despite its promises, STT-MRAM still faces critical challenges when compared to the state-of-the-art SRAM technology in the form of significantly slower write performance [3, 7, 12, 45, 58, 68] and orders of magnitude lower write endurance [6].

Employing STT-MRAM in the design of shared caches in future MPSoCs makes the aforementioned issues particularly significant, especially for write intensive workloads. The processor cores send read operations to the NVM shared cache as requests. In case the requested data is not present in the NVM cache, the main memory provides the data to the NVM cache as responses, which are then written to the cache for future reuse and forwarded to the higher-level caches in the hierarchy. In addition, when the dirty data is evicted out of the higher-level caches, it is presented to the NVM cache as another source of writes, known as the writebacks. These sources of write operations, i.e., writebacks and responses, create significant contention in the shared NVM cache, delaying the service of more critical reads from different processor cores and thereby affecting the overall system performance. Moreover, frequent write operations in the form of writebacks and responses threaten the NVM cache’s lifetime, when the increased write activity surpasses the allowed endurance limit, causing permanent damage. Addressing these two major problems presents a significant research challenge. Various research efforts have been made to tackle these issues partially; however, a complete solution has not yet been achieved.

Previous studies have predominantly focused on improving either the performance [9, 35, 36] or the lifetime of STT-MRAM cache [6, 38, 59], but neither addressed both aspects, nor explored the tradeoffs between them. Certain prior works proposed overly proactive wear leveling strategies emphasizing on uniform cache write distribution without adequately considering the performance implications [6]. Prior NVM cache bypass policies [9, 35] applied conservative bypass decisions, failing to mitigate the NVM cache contention efficiently. Besides being inadequate for mitigating contention, conservative bypassing of NVM writes also has limited impact on the volume and distribution of NVM write traffic. In contrast, adopting an aggressive bypass approach could be highly effective in alleviating contention while also significantly curtailing the NVM write traffic in favor of enhanced NVM lifetime. These observations serve as the motivation for our current research, which aims to address the issue of NVM cache contention as well as NVM cache endurance, exploiting their inherent tradeoffs.

Based on these observations, this study is the first work to tackle both the performance and endurance aspects of NVM-based shared cache in future MPSoCs and explore their inter-play. In this work, we propose Performance Optimization and Endurance Management policy for NVM shared cache called “POEM”. The proposed policy consists of a request and response bypass and endurance strategies. In the POEM, request and response bypass policy can optimize the overall system performance, and the endurance policy can enhance the NVM lifetime. Our experimental results demonstrate significant improvements in both performance and endurance compared to state-of-the-art policies. Across a set of diverse SPEC workloads, POEM improves the overall system performance by 34% and 29% over a naïve no-bypass baseline and the state-of-the-art NVM cache bypass technique, respectively. POEM improves the cache endurance by 15% and 11% over the baseline in terms of two endurance metrics, significantly outperforming the state-of-the-art endurance policy, which is agnostic to the overall system performance. Our key contributions are summarized as follows:

—

We propose an eficient NVM cache controller policy in which we aggressively bypass both sources of NVM write operations: writebacks from the higher-level caches and responses from the main memory. Contrary to the existing approaches that emphasize on cache data reuse and prefer writing the cache more frequently, POEM employs aggressive bypassing to alleviate the shared NVM cache contention while also leveraging the cache data reuse by writing very selectively. We show that such an aggressive bypass is useful in improving the overall system performance, even at the expense of cache locality. Aggressive bypassing of NVM writes is also shown to be beneficial for enhancing NVM endurance due to considerable reduction of NVM write traffic.

—

Based on the patterns of NVM cache write operations, we propose the migration of data between hot and cold cache lines to attain a more uniform write distribution across different cache lines while enhancing the cache lifetime and the overall system reliability. While such data migrations hamper normal cache accesses incurring significant performance bottlenecks, POEM exercises careful control over data migration to maximize its effectiveness while reducing associated performance overheads.

—

As a comprehensive policy, the proposed POEM achieves a balance between the two crucial system-level objectives: enhancing overall system performance and improving NVM cache endurance. POEM provides a flexible methodology to the system designers by assigning different priorities to the two system-level objectives through parameterized control of the endurance threshold.

—

In our research, we extend the capabilities of gem5 [14], a widely used architectural simulator, to reasonably model shared cache contention in the context of NVMs. By default, simulators like gem5 make simplified assumptions regarding the timing model of cache accesses and also do not model the read-write asymmetry of NVMs. To bridge this gap and facilitate further research in the field, we plan to make our implementation open source.

The remaining structure of this article is as follows: Section 2 and 3 provide the related work and the motivation, respectively. Section 4 presents the details of our POEM methodology, while Section 5 evaluates POEM, comparing its performance and endurance results against existing policies. Section 6 discusses the overheads associated to the hardware implementation of POEM. Section 7 concludes this work.

2 Related Work

Several studies have focused on performance and energy optimization techniques for NVM caches. We present a review of the related works, discussing hardware- and software-based techniques with an emphasis on architectural approaches such as cache bypass [9, 35, 71, 78], hybrid cache organization [8, 28, 36, 67, 74], cache data encoding [30], and cache replacement policies [55]. We also present an overview of the state-of-the-art addressing the NVM cache endurance concerns [6, 48, 50, 59].

Korgaonkar et al. [35] proposed Write Congestion Aware Bypass (WCAB), a runtime bypass policy that selectively bypasses writebacks from higher-level caches based on the contention level in the NVM cache request queue and the predicted reuse potential (liveness) of writebacks. During high contention phases, characterized by high occupancy of the request queue, WCAB employs an aggressive writeback bypass. However, its emphasis on cache data reuse limits its aggressiveness of bypass, and its oversight of the influence of responses further restricts its ability to address the contention issue. We include WCAB as one of our baselines due to its consideration of cache contention in bypass decision-making. Based on a theoretical model and classification for redundant cache writes, Ahn et al. [9] proposed a strategy to bypass different categories of redundant writes to the NVM cache to improve the overall energy efficiency. This approach utilizes a state-of-the-art sampling predictor with new signatures to identify the underlying redundancies among different sources of cache writes. To mitigate the overheads of NVM write operations, researchers proposed simple analytical models to assess the advantage of writing a particular data into the NVM cache from the perspective of the average memory access latency and selectively bypass write operations that are evaluated to be of less utility [71]. Zhang et al. [78] proposed a read-write asymmetry-aware analytical model of cache data reuse and employed runtime bypass decisions to reduce the redundant NVM cache writes. Cheng et al. [16] reduced the cache energy consumption by modifying the inclusion property and replacement policy of an NVM cache. State-of-the-art methods extensively investigated the performance and energy bottlenecks of the NVM write operations with emphasis on utilizing NVM write operations as potential targets for cache bypass policies. However, state-of-the-art NVM cache bypass policies primarily focused on cache data reuse. They attempted to eliminate NVM cache writes that are redundant from the reuse perspective, often overlooking the critical issue of cache bandwidth contention. Recent research has demonstrated that even for SRAM shared caches of modern MPSoCs, the shared cache bandwidth contention created by requests from multiple processor cores and their responses from the off-chip memory becomes a significant performance bottleneck [70]. In the context of NVM shared cache with additional write overheads, such contention issues become even more critical.

To exploit the advantages of both SRAM and NVM technologies, prior research proposed hybrid caches consisting of both SRAM and NVM partitions. Several architectural management techniques are proposed in the context of hybrid caches, with most of them employing different techniques to dynamically identify data with high write intensity and strategically placing them within the SRAM partition to mitigate the overall write overheads [4, 8, 28, 74]. Prior works also attempted reducing the write overheads of NVMs by trading off their retention times. These works proposed designing hybrid NVM caches with multiple cache regions of different retention times and allocating applications to appropriate regions based on their access patterns [13, 36, 37, 67]. NVM-aware data encoding techniques, such as the encoding of zero-valued NVM cache lines, are proposed by prior works [30]. Replacement algorithms are also modified to apply different promotion and insertion policies for reads and writes in the context of NVM caches [55]. Zhou et al. [79] introduced a circuit-level early termination mechanism for NVM write operations, exploiting the proactive sensing capability to detect redundant write operations that attempt to store the same data value to a particular memory address. Such redundant NVM cache writes are efficiently identified and terminated proactively to reduce the NVM write overheads.

Previous researches also employed software-based techniques to manage NVM caches efficiently. To minimize the frequency of refresh operations necessary to preserve the data integrity for a hybrid cache consisting of multiple NVM cache partitions with different retention times, Li et al. [40] analyzed the patterns of program-induced fill operations and reorganized the layout of cache data. Compiler techniques were used to design efficient hybrid SRAM-NVM scratch-pad memories [25]. The overhead of data migrations between SRAM and NVM technologies in a hybrid cache architecture was mitigated through an efficient cache data layout [41]. Sayed et al. [60] explored profiling-based techniques to effectively map program data onto a region with an appropriate retention time in a hybrid STT-MRAM cache. Through compiler analysis, silent stores are identified and eliminated to prevent redundant writes to memory cells [15].

To address the reliability concerns of NVMs, researchers proposed various techniques to address limited write endurance and read disturbance errors. By tracking write intensities of cache lines, prior works employed dynamic endurance policies to restrict write operations to hot cache lines [6, 48, 50]. Agarwal et al. [6] proposed a dynamic write restriction (WR) strategy that designates a certain number of write-intensive cache ways from a cache set as read-only in an execution interval. Writebacks and responses to the read-only cache ways are preferably redirected to the invalid cache lines from the remaining normal cache ways or to the LRU victim among the normal cache ways if no invalid cache line exists in the cache set. The write counters associated to each cache line are updated appropriately to avoid consecutive selection of the same cache ways as read-only. While WR aims to achieve a more uniform cache write distribution across different cache lines, its aggressive write-redirection invoked from the very beginning of the execution overlooks associated performance implications. We adopt this approach as one of our baselines. To exploit the tradeoff between write performance and retention time in NVMs, Saraf et al. [59] proposed a skewed set-associative NVM cache architecture along with a modified least recently used (LRU) replacement policy. This approach maintained the retention state of different cache blocks to prioritize the evictions of data that is more likely to expire. To achieve a uniform distribution of cache writes, prior arts also modified the set-mapping techniques and displaced hot cache lines to victim lines within the same cache set [29, 72]. To exploit the difference in write frequencies between program data and instructions, researchers developed effective ways to partition the NVM cache into separate data and instruction ways in such a way that the allocation of the instruction cache ways are rotated periodically to ensure a balanced write distribution [63]. Farbeh et al. [23] critically analyzed the impact of error correcting codes (ECCs) on the endurance of an NVM cache and distributed the ECC write activities uniformly by periodically relocating the ECC bits inside the cache lines. To mitigate the NVM read disturbance errors, Hosseini et al. [24] proposed compiler techniques to selectively instrument restore operations after speculated vulnerable reads. Overall, the prior NVM cache endurance enhancement techniques could be classified into two categories: intra-set [6, 48, 50, 63, 64, 65] and inter-set [5, 47, 59, 72, 73] techniques. Intra-set cache endurance management techniques try to reduce the skewness of writes within cache sets (across different cache ways in a cache set) essentially by migrating data between hot and cold cache ways inside a cache set, while inter-set cache endurance management techniques attempt to redistribute writes across different cache sets. In general, intra-set techniques are easier to adopt than inter-set techniques, because migrating data between different cache ways inside the same cache set is much easier than migrating data between cache ways from different cache sets.

In summary, prior research addressed the challenges associated with the design of on-chip caches with NVM technologies. These works can be broadly divided into two categories. The first category focused solely on addressing the performance and energy bottlenecks associated with NVM writes [4, 8, 9, 13, 15, 16, 25, 28, 30, 35, 36, 37, 40, 55, 60, 67, 71, 74, 78, 79], while the second category attempted to mitigate the NVM endurance issues [6, 23, 29, 43, 48, 50, 59, 63, 72]. While both the problems addressed by the two categories are pertinent, there is a need for a holistic solution that addresses both problems and exploits their inherent tradeoffs.

3 Motivation

Researchers have extensively explored cache bypass techniques to improve NVM cache performance. While some works targeted specific sources of NVM writes [35], others covered all sources, focusing, however, on cache data reuse [9]. Contention for the shared LLC bandwidth is a critical bottleneck in modern MPSoCs [11, 51, 70], and this issue is exacerbated in an NVM-based LLC due to NVM’s slow write speed. Existing works do not adequately address this contention problem, presenting an opportunity for us to introduce a reasonable model of LLC contention and propose a novel bypass technique to improve the overall system performance by exploiting the tradeoff between contention and data reuse. However, while prior bypass policies remain conservative due to their strong emphasis on cache reuse, a smarter bypass policy can afford being more aggressive, curtailing the overall NVM write stress in favor of its endurance.

Endurance is a critical reliability concern for caches in modern computing systems. NVMs have significantly lower endurance compared to conventional memory technologies such as SRAM and DRAM. For instance, state-of-the-art SRAM can endure over \(10^{15}\) writes, while the state-of-the-art STT-MRAM, the most promising NVM candidate for its application in on-chip caches, offers an endurance limit of only \(4 \times 10^{12}\) writes [26, 49]. Other NVM technologies suffer from even more critical endurance challenges. ReRAM, for instance, can endure only \(10^{11}\) writes per memory cell [33], and PCM’s endurance is limited only to \(10^{8}\) writes [54]. A single worn-out NVM bit-cell can have a devastating impact on system-level reliability, especially if it contains critical data. Previous research [6, 50] addressed the endurance issue by redistributing NVM writes more uniformly across different cache lines. However, their strong emphasis on overly aggressive write redistribution incurred major performance overheads.

Figure 1 illustrates the limitations of the state-of-the-art NVM cache policies. Such visualizations underscore the necessity for a comprehensive policy that effectively addresses both aspects together by exploiting the interactions between system performance and NVM endurance. In Figure 1, we use experimental results obtained across a diverse set of SPEC CPU 2006 workloads, details of which are outlined in Table 3, Section 5.1. Along with WCAB [35] and WR [6], we include a new baseline, denoted by WCAB+WR, which bypasses writebacks to the NVM cache according to WCAB [35] while following WR [6] in uniformly distributing writes across different cache lines. In Figure 1, NBB represents a naïve No Bypass Baseline, which is devoid of any performance or endurance enhancement strategies. The Y-axis of Figure 1 captures the overall performance gain, termed Speedup, which is calculated as the percentage improvement of a policy’s overall throughput compared to the naïve NBB. The endurance gain, displayed on the X-axis, denotes the percentage reduction in the maximum write frequency of NVM cache lines throughout the entire workload execution compared to the NBB.

Table 1.

System Unit	Configuration
CPU	8 Intel X86 @ 2 GHz out-of-order (OOO) CPU cores
L1 (SRAM)	Private to each core, 32 KB L1-D and 32 KB L1-I, 64 B cache line, 8-way set-associative, parallel-access, tag latency 1 cycle, data access latency 1 cycle, MSHR queue size 4
L2 (SRAM)	Private to each core, 256 KB, 64 B cache line, 8-way set-associative, parallel-access, tag latency 1 cycle, data access latency 2 cycles, MSHR queue size 8.
L3 (STT-MRAM)	Shared across all cores, 8 MB, 64 B cache line, 16-way set-associative, parallel-access, non-inclusive, tag latency 2 cycles, read access latency 7 cycles, write access latency 23 cycles, MSHR and response queue size 64
Main Memory (DRAM)	DDR3, 1,600 MHz, 8 GB, single channel, 2 ranks/channel, 8 banks/rank, page size 1 KB.

Table 1. System Configurations

Table 2.

Characteristics	Benchmarks
Low Write Sensitive (LWS)	sjeng, gobmk, wrf, bzip2, gromacs, h264ref, povray, dealII, namd, sphinx3, astar
Writeback Write Sensitive (WWS)	lbm, libquantum, omnetpp, soplex, hmmer, gamess, GemsFDTD
Response Write Sensitive (RWS)	mcf, zeusmp, bwaves, leslie3d, milc

Table 2. SPEC Workload Categorization for NVM-based Cache

Table 3.

Mix no.	Type	Benchmarks
1	WWS	soplex, soplex, gamess, omnetpp, omnetpp, omnetpp, omnetpp, omnetpp
2	WWS	omnetpp, omnetpp, omnetpp, soplex, omnetpp, omnetpp, omnetpp, gamess
3	WWS	soplex, hmmer, lbm, hmmer, libquantum, omnetpp, soplex, libquantum
4	WWS, RWS	hmmer, gamess, hmmer, gamess, mcf, bwaves, bwaves, bwaves
5	WWS, RWS	soplex, gamess, GemsFDTD, gamess, lbm, zeusmp, zeusmp, mcf
6	WWS, RWS	soplex, gamess, omnetpp, gamess, gamess, zeusmp, milc, milc
7	WWS, LWS	libquantum, hmmer, hmmer, dealII, lbm, libquantum, omnetpp, wrf
8	LWS, RWS	gromacs, h264ref, sjeng, gobmk, wrf, zeusmp, zeusmp, bwaves
9	WWS, RWS, LWS	gamess, gamess, bzip2, povray, dealII, libquantum, lbm, mcf
10	WWS, RWS, LWS	gromacs, sphinx3, gamess, soplex, gamess, lbm, zeusmp, libquantum
11	WWS, RWS, LWS	dealII, libquantum, astar, astar, namd, milc, zeusmp, bwaves
12	WWS, RWS, LWS	zeusmp, leslie3d, sjeng, sjeng, sphinx3, hmmer, wrf, wrf

Table 3. SPEC Workloads Used in the Experimental Evaluation

Fig. 1.

In Figure 1, NBB serves as the baseline at the origin (0,0), against which the performance and endurance gains of other policies are measured. On average across our workloads, WCAB, which bypasses writebacks based on request queue occupancy and writeback data reuse, provides a 4% speedup over NBB. WCAB does not employ any policy for addressing NVM endurance problems. However, it leads to 6% improvement in the cache endurance over NBB as a by-product of an overall write traffic reduction via the cache bypass decisions. WR dynamically redirects wear from hot cache lines in a cache set to the LRU victims among the remaining cache lines, resulting in an average endurance gain of 14%. However, the aggressive wear leveling of WR incurs performance overhead, leading to a 1% overall degradation in speedup even with respect to the naïve NBB. A combination of WCAB and WR achieves an average speedup of 3%, falling between the performance gains offered by WCAB and WR individually. However, it provides an average endurance gain of 15%, surpassing the individual endurance gains of WCAB and WR. Therefore, while combining state-of-the-art solutions for performance and endurance demonstrates synergistic effects on endurance, its overall performance is limited by the speedup of the state-of-the-art bypass solution. Our objective is to simultaneously enhance both performance and endurance gains, motivating the need for a more intelligent and comprehensive policy. In Figure 1, the shaded green region (\(X \ge 15\), \(Y \ge 4\)) represents the desired co-optimization zone, encompassing potential solutions that balance both performance and endurance. In the next section, we present the POEM methodology, designed to achieve this dual objective.

4 POEM: Performance Optimization and Endurance Management for Non-volatile Caches

We propose a novel approach called Performance Optimization and Endurance Management (POEM) to improve the overall system performance and the lifespan of NVM-based cache. The POEM framework comprises two cache controller policies: (1) Request and Response Bypass, whose primary objective is to enhance the system performance (Performance Optimization - PO); and (2) Endurance Policy, which addresses the skew in NVM write distribution to prolong NVM’s lifespan (Endurance Management - EM), discussed in detail in Sections 4.1 and 4.2, respectively.

4.1 Request and Response Bypass Policy

As shown in Figure 2, the Request and Response Bypass Policy of POEM can be integrated into the NVM cache controller. To alleviate the contention, this policy takes independent bypass decisions for both sources of NVM write operations. The Write Request Decision Maker dynamically determines whether a write request (writeback) from the private caches should be bypassed. If the decision is to bypass a write request, then POEM sends the write request to the main memory interconnect and subsequently to the main memory to perform the concerned data write. However, if the decision is to perform the cache write, then the write request is inserted into the request queue. The mechanism is slightly different for the responses. As soon as a response becomes available after completing the main memory access, a copy is transferred to the private cache interconnect and is subsequently delivered to the processor core waiting for the response data. This architectural optimization is known as Response Forwarding. The original response, however, is enqueued into the response queue only if the Response Decision Maker decides to write the response data into the NVM cache. Both bypass decisions consider the impact of cache contention as well as the potential for future data reuse in the NVM cache.

Fig. 2.

We divide the execution into equal-sized time intervals, known as epochs, and collect necessary performance counters in each epoch. Such an epoch-based paradigm helps us identify the dynamic changes in the program behavior during its execution. To capture such a notion of contention dynamically, we collect average lengths of the read request queue in every epoch. To obtain the reuse potential of different cache data, we incorporate hardware prediction structures [31, 35, 39] that store the reuse history of a predefined number of recent cache writes. The underlying principle of the predictor is that if the cache data accessed by a specific instruction has exhibited high reuse recently, then other data accessed by instructions with the same signature are likely to demonstrate high levels of reuse as well. We utilize the least significant 9 bits of the instruction program counter (PC) as the instruction signature, which is also used to index into the predictors to update and maintain the reuse history of writebacks and responses.

Figure 3 illustrates the main idea of POEM’s bypass policy. While the first plot of Figure 3 shows the variation of the average read queue length across different epochs, the later one unfolds each epoch, showcasing the variation of predicted cache reuse of different write accesses in each epoch. Real-world applications often change their behavior dynamically during execution. While during application phases with critical NVM cache contention we prefer an aggressive bypass approach to effectively alleviate the increased contention, we make conservative bypass decisions when the NVM contention is not significant. This adaptive approach helps strike a fine-grained balance between the cache contention and the cache data reuse. To assess the severity of contention at any epoch, we compare the average read queue length of the previous epoch against the running average of the read queue lengths collected across previous \(K\) epochs and identify the contention as critical only if the former is greater than the latter. For example, in Figure 3, the contention at epoch A is not perceived critical despite incurring a previous spike in the average read queue length. The use of running average as the dynamic threshold in the contention evaluation criterion facilitates ignoring such transient program behaviors, allowing us to focus on significant phase changes. So, at epoch A, we decide to write data from a writeback \(\mathbf {W_1}\) into the NVM cache, and hence, enqueue \(\mathbf {W_1}\) into the request queue (Step ①). In epochs B and C, we observe the read queue length to grow consistently and identify the contention scenarios at these two epochs as critical. At epoch B, when a writeback \(\mathbf {W_2}\) arrives at the NVM cache, the writeback reuse predictor is consulted (Step ②). Because \(\mathbf {W_2}\) is predicted to have a reuse count of 5, which is higher than the predefined reuse threshold (Reuse Threshold = 3), \(\mathbf {W_2}\) is decided to be written into the NVM cache, despite the presence of critical contention. Hence, in Step ③, \(\mathbf {W_2}\) is inserted into the request queue for later write. At epoch C, similarly, the response reuse predictor is consulted for the response \(\mathbf {W_3}\) (Step ④). Because \(\mathbf {W_3}\) is anticipated to have a reuse count of 2 (less than Reuse Threshold), \(\mathbf {W_3}\) is not inserted into the response queue and is bypassed (Step ⑤). In this way, through adaptive exploitation of the tradeoffs between cache contention and cache data reuse, POEM could enhance the overall system performance.

Fig. 3.

Algorithm 1 shows the steps involved in Request and Response Bypass of POEM in two parts, with lines

–

and lines

–

discussing Write Request Decision Maker and Response Decision Maker, respectively. Upon the arrival of a writeback (\(\mbox{W}_{Acc}\)) at the NVM cache, we assess the severity of shared cache contention (Line

). This is achieved by comparing the average length of the read request queue in the previous epoch (\(\mbox{AvgL}_{i-1}\)) with the running average of the read request queue length computed across previous \(K\) epochs (\(\mbox{KavgL}_{i-1,..,i-K}\)). The recent average being higher than the scaled running average (\(\alpha\) is an empirical tolerance; details discussed in Section 5.10) indicates phases of high cache contention. The scaling prunes out sudden fluctuations in the queue length while paying more attention to significant changes in the program behavior. In such scenarios of increased contention, we consider the predicted reuse of the writeback (\(\mbox{W}_{Acc}\)), computed in Line

. If an entry for \(\mbox{W}_{Acc}\) is already present in the prediction structure and the predicted reuse count (\(\mbox{RC}\)) is substantially high (exceeding \(\mbox{Th}_{High\_wb}\)), we decide to write the data of \(\mbox{W}_{Acc}\) into the cache and insert it into the request queue (Line

). Conversely, if no matching entry for \(\mbox{W}_{Acc}\) is found or the predicted reuse count is below \(\mbox{Th}_{High\_wb}\), then we bypass \(\mbox{W}_{Acc}\) by sending it to the lower-level memory (Line

). It is important to highlight that bypassing a writeback even in the absence of a matching entry in the predictor is a strategic approach to achieve a high level of bypass aggressiveness, necessary to mitigate contention and enhance the overall system performance. While bypassing, it is essential to be cautious for writebacks that already have their target data present in the cache (\(\mbox{isCached} == True\)). In such cases, we invalidate the target data (\(\mbox{W}_{Blk}\)) inside the cache (Line

) to prevent subsequent cache reads from accessing a stale value. In scenarios of low contention (Line

), similar steps are followed as already discussed (Line

–Line

), with the only exception being the choice of a different reuse threshold. When contention is not significant, we give higher preference to writing data over bypassing them. To achieve this, a lower threshold on the writeback data reuse (\(\mbox{Th}_{Low\_wb}\)) is used (Line

), encouraging the NVM cache to focus on reusing data more frequently.

At a high-level, the working of the Response Decision Maker, described in lines

–

, is similar to that of Write Request Decision Maker. When a response (\(\mbox{W}_{Acc}\)) from the lower-level memory reaches the NVM shared cache, we assess the status of the shared cache contention by comparing the recent average against the running average of the read request queue length (Line

). Regardless of the contention severity, the response data is always forwarded to the higher-level caches to ensure the progress of cores (Lines

and

). However, when contention is high, a response is written to the cache and inserted into the response queue (Line

) only if the predicted reuse count is significantly high (greater than \(\mbox{Th}_{High\_resp}\)). The responses and writebacks in the NVM cache exhibit different data reuse characteristics, necessitating the use of distinct reuse thresholds for each of them. In case the predicted reuse count is lower than \(\mbox{Th}_{High\_resp}\), no additional bypass steps are required, as the response (\(\mbox{W}_{Acc}\)) is already forwarded. When the contention is not so significant (Line

), the policy prioritizes writing responses, employing a lower threshold on the data reuse count (\(\mbox{Th}_{Low\_resp}\)). This adaptive approach allows the policy to optimize cache behavior and efficiently manage response handling. The tolerance parameters (\(\alpha\)), running average window size (\(K\)), and four different reuse thresholds used in Request and Response Bypass are determined empirically, with details provided in Section 5.10. Collectively, the Write Request and Response Decision Makers of POEM facilitate a highly selective approach to NVM cache writes, resulting in improved system performance through aggressive cache bypass mechanisms.

4.2 Endurance Policy

While POEM’s bypass policy effectively controls the overall write traffic to the NVM cache, it does not address the issue of underlying write distribution across different NVM cache lines. As a consequence of this, certain cache lines may be subject to frequent writes either from private caches or from the main memory, while other cache lines may receive much fewer writes, accelerating the risk of specific cache lines prematurely crossing their endurance limits. To address this problem, we integrate an endurance policy that swaps data between the hot cache lines (those with high write frequency) and cold cache lines (those with minimal writes) into the POEM framework. When any cache line receives excessive write operations (beyond a certain threshold), we initiate swap operations to redistribute the write traffic more uniformly. However, the NVM cache becomes inaccessible during any such swap operation, aggravating the shared cache contention. So, while invoking the swap operations, we must carefully consider the tradeoffs between their performance overheads and their endurance impacts.

We introduce a threshold (\(\mbox{E}_{Th}\)) to invoke POEM’s endurance policy when the write count of a hot NVM cache line reaches \(\mbox{E}_{Th}\). In real systems, the actual limit on the number of write operations a cache line can withstand before wearing out depends on the execution environment and the specifics of the NVM technology [26, 33, 49, 54]. We denote such an actual endurance limit for a particular NVM technology in a particular execution environment as \(\mbox{L}_{NVM}\). In a real system, the \(\mbox{E}_{Th}\) can serve as a proactive threshold, as described in Equation (1).

\begin{equation} \mbox{E}_{Th} \le \dfrac{1}{1+\eta } \times \mbox{L}_{NVM} \end{equation}

(1)

Here, \(\eta\) is a scaling factor (\(\eta \ge 1\)), with a higher value indicating a more aggressive endurance policy and a lower value indicating a more conservative endurance mitigation approach. The parameter \(\eta\) might depend on various statistical properties of the underlying distribution of cache writes. A more formal treatment of \(\eta\) merits a separate study and is beyond the scope of the current research. We assume \(\eta\) to be of a sufficiently large value to ensure a broad enough safety margin between the invocation threshold (\(\mbox{E}_{Th}\)) and the actual critical limit (\(\mbox{L}_{NVM}\)).

Figure 4 presents a running example demonstrating the steps followed by POEM’s endurance policy. An NVM cache set with \(K\) cache lines is shown in Figure 4. The cache controller incorporates hardware counters known as Write Count for each cache line, tracking the number of write operations performed on the cache line throughout the execution. Each cache line is also associated with a Swap Bit, which indicates whether the cache line has undergone a recent swap operation (Swap Bit = 1) or not (Swap Bit = 0). For instance, in Figure 4, cache data blocks \(D_1\) and \(D_2\) have recently undergone swap operations, while \(D_3\) and \(D_K\) have not been swapped recently. The Write Count of cache lines plays a central role in the endurance policy, helping us to identify hot and cold cache data for a potential swap operation. The Swap Bit, however, minimizes unnecessary swap operations, thereby mitigating associated performance overheads. In the first instance of the cache set shown in Figure 4, cache blocks \(D_1\) and \(D_2\) have relatively lower Write Counts; however, their Swap Bits are already set. This signifies that blocks \(D_1\) and \(D_2\) presently contain hot data, which was recently moved from other cache lines to their respective positions by previous swap operations. Because these blocks contain hot data, their Write Counts might soon attain higher values, despite currently being relatively less. If the endurance policy looks only at the Write Counts of cache lines, then cache blocks such as \(D_1\) and \(D_2\) could very well be swapped again with other hot data. In this way, the Swap Bits helps prevent such types of redundant swap operations to save associated performance overheads. So, while initiating a swap operation, we take into account not only the Write Counts of the lines but also the status of their Swap Bits. The Write Count of cache lines are updated each time a writeback or a response data is written into that particular cache line. The Swap Bit of a cache line is set when the cache line undergoes a swap operation and is dynamically reset based on predefined thresholds on the Write Count of that cache line. For instance, if the Swap Bit of \(D_i\) was set to 1 when its Write Count was \(WC_i\) (because \(D_i\) was swapped with some other block), the Swap Bit of \(D_i\) is reset when \(WC_i\) crosses a threshold, say, \(\mbox{T}\), where \(\mbox{T} \gt WC_i\). This prevents \(D_i\) from being swapped between the time its Write Counts are \(WC_i\) and \(\mbox{T}\) and makes it eligible for another swap operation only when \(D_i\) is written beyond \(\mbox{T}\) times. The process could be repeated for \(n\) such monotonic thresholds on the Write Count of each cache line throughout the program execution (\(0 \lt \mbox{T}_{1} \lt \mbox{T}_{2} \lt \ldots \lt \mbox{T}_{n} \lt \mbox{E}_{Th}\)) so the POEM cache controller could smartly determine when a cache line can or cannot be swapped. To perform in-place swap operations between the hot and the cold cache lines, we introduce an SRAM buffer within the NVM cache controller to store the intermediate swap data, as shown in Figure 4.

Fig. 4.

So, we introduce two kinds of thresholds on the Write Count of any cache line for two different purposes, discussed as follows:

—

Endurance Threshold (\(\mbox{E}_{Th}\)): When the Write Count associated to any cache line reaches this threshold (which, in the examples of Figures 4 and 5, is taken as 400), the cache line is considered hot and vulnerable, and the endurance policy is invoked to swap that hot line with a suitable cold cache line.

—

Swap Bit Reset Thresholds: We reset the Swap Bit corresponding to a cache line multiple times (\(n\)) to regulate its eligibility for swapping based on a set of pre-defined thresholds (\(\mbox{T}_1\), \(\mbox{T}_2\), \(\ldots\), \(\mbox{T}_n\)) on the Write Count of that cache line. Specifically, in POEM, we consider three (\(n=3\)) such thresholds: \(\mbox{T}_{1} = 0.25\times \mbox{E}_{Th}\), \(\mbox{T}_2 = 0.50\times \mbox{E}_{Th}\), and \(\mbox{T}_3 = 0.75\times \mbox{E}_{Th}\).

Fig. 5.

The endurance threshold (\(\mbox{E}_{Th}\)) is a parameter of POEM, serving as a proactive safety threshold and dependent on the specific NVM technology. It helps us exploit the tradeoff between system performance and NVM endurance, as discussed in Section 5.5. The value of \(\mbox{E}_{Th}\) is uniform across all cache lines, assuming equal criticality of all cache data. Therefore, any cache line with a Write Count surpassing \(\mbox{E}_{Th}\) is considered hot and vulnerable, potentially triggering a swap operation. However, we acknowledge the possibility of employing different Swap Bit reset thresholds for distinct cache lines. To assess the overall sensitivity, we conduct experiments with various choices for these thresholds (refer to Section 5.10) but observe marginal impact on NVM endurance. Therefore, we recognize a more sophisticated treatment of these thresholds as a direction for future research.

Figure 5 illustrates the timeline of a cache line \(L\) from the beginning of the execution (marked by \(t=0\)) until its completion (marked by \(t=600\)). Initially, at \(t=0\), both the Write Count (WC) and Swap Bit (SB) of \(L\) are set to 0. At time \(t=150\), POEM’s endurance policy is activated to swap \(L\) with a hot cache line (whose write count reaches \(\mbox{E}_{Th}=400\)) within the same cache set. After a successful swap operation, \(L\)’s Write Count becomes 51, and its Swap Bit is set to 1 to indicate that it has recently undergone a swap operation. As long as \(L\)’s Swap Bit remains set, it is not considered eligible for any further swap operation. After some time, when the Write Count of \(L\) reaches \(\mbox{T}_1 = (0.25\times \mbox{E}_{Th}) = 100\) (at time \(t=200\)), its Swap Bit is reset, making \(L\) once again eligible for a swap operation. Thus, during time \(t=[150, 200]\), \(L\) remains ineligible for a swap operation. At time \(t=250\), another cache line within the same cache set becomes hot, and we swap that line with \(L\), causing \(L\)’s Swap Bit to be set once again. Consequently, \(L\) becomes ineligible for swap until time \(t=350\) when its Write Count reaches the second threshold \(\mbox{T}_2 = (0.50\times \mbox{E}_{Th}) = 200\), and its Swap Bit is reset. Thus, during time \(t=[250, 350]\), \(L\) remains ineligible for swap operations, and at time \(t=350\), it regains its eligibility to be a swap candidate. For simplicity, only two Swap Bit reset thresholds (e.g., \(\mbox{T}_1\) and \(\mbox{T}_2\)) are shown in the example of Figure 5. The fundamental concept behind these Swap Bit reset operations is to prevent a cache line from being swapped shortly after a previous swap. The thresholds (\(\mbox{T}_1\), \(\mbox{T}_2\)) control the duration after a swap operation on a particular cache line during which it is not eligible for another swap.

In Figure 4, POEM’s endurance policy is invoked when the Write Count of a cache line (i.e., \(D_3\)) reaches the endurance threshold (\(E_{Th}\), assumed to be 400 for this example), and the Swap Bit of that cache line (i.e., \(D_3\)) is not set (Step 1). At this point, \(D_3\) is identified as hot and becomes eligible for a swap operation. In Step 2, we transfer \(D_3\) to the swap buffer. To determine the most suitable swap candidate for \(D_3\), we look for the coldest cache line (i.e., the one with the minimum Write Count) that also has its Swap Bit reset (Step 3). In this example, \(D_K\) satisfies these conditions and is selected as the swap candidate for \(D_3\). Although \(D_1\) and \(D_2\) currently have lower Write Counts than \(D_K\), they are not chosen for the current swap operation, because they have recently undergone swap operations and might actually contain hot data. In Step 4, we transfer \(D_K\) to the cache way where \(D_3\) was located and increment the Write Count associated with that cache line. Finally, Step 5 concludes the swap operation by transferring \(D_3\) from the swap buffer to the original cache way of \(D_K\) and updating its Write Count. Steps 4 and 5 also set the Swap Bits associated with \(D_K\) and \(D_3\), which are currently swapped.

Algorithm 2 outlines the NVM cache endurance policy of POEM. Given that \(\mbox{W}_{\mbox{curr}}\) represents the cache line currently being written, the algorithm checks whether the current write count of \(\mbox{W}_{\mbox{curr}}\), denoted by \(\mbox{WC}[\mbox{W}{\mbox{curr}}]\), has reached the predefined endurance threshold (\(\mbox{E}_{Th}\)) and whether the swap bit of \(\mbox{W}_{\mbox{curr}}\) is reset (Line

). If both conditions are satisfied, then the algorithm searches within the current cache set for a cache block (\(\text{Min_WC_Blk}\)) that has the least write count and whose swap bit is not set (Lines

–

). Once such a suitable cache line is found, \(\mbox{W}_{\mbox{curr}}\) is swapped with that cache line (Line

) and their write counts and swap bits are updated accordingly (Lines

–

). If no suitable swap candidate is found, then the endurance policy performs no swap operation. Thus, when the write frequency of a hot cache line exceeds a predefined endurance threshold, our novel endurance policy strategically swaps the cache data between hot and cold cache lines inside a cache set. The NVM cache controller maintains additional cache metadata to capture the write frequency and the swapping history of each cache line. While the write frequencies of cache lines are used to identify hot and cold cache lines, the swap bits help the controller to avoid redundant swap operations and reduce associated swap overheads. Unless a cache line is written beyond the endurance threshold, the endurance policy of POEM is not activated, with the normal cache accesses being served as usual according to POEM’s bypass policy. While the Swap Bit set operations are performed after each swap operation as part of Algorithm 2 (Lines

and

), the reset operations are performed in the background independently of the endurance policy, and thus, not shown as a part of Algorithm 2.

POEM’s endurance policy, discussed in Algorithm 2, falls within the category of intra-set cache endurance management techniques, which remain effective until cold cache lines are available within the cache sets (or as long as there is write skew inside the cache sets). When all cache lines in a pathological cache set become hot, intra-set swap operations cannot address the issue, however, it can affect the overall system performance. POEM could identify such extreme scenarios with the help of swap-bits and avoid unnecessary swap overheads. For such extreme cache sets, inter-set endurance management techniques could identify a suitable cold cache set and distribute writes from the hot set to the cold one. Because these two techniques are mostly independent, any inter-set technique can be applied in parallel with POEM, incurring, however, much higher overheads than POEM. Therefore, inter-set swap operations should be invoked carefully, assessing when they are absolutely crucial for improving endurance. This demands a separate study in itself and is planned as one of the future extensions of POEM.

5 Experiments and Results

5.1 Experimental Setup

In this study, we use gem5 simulator [14] to conduct the experimental evaluation of POEM and three other baseline policies. Because the shared cache contention is a crucial aspect of contemporary MPSoCs and simulators such as gem5 mostly overlook such contention, we modify gem5 to reasonably model the shared cache contention. Inside gem5, we incorporate request and response queues for the NVM L3 cache along with the forwarding mechanisms for the responses from the main memory. The various timing-related parameters for the gem5 L3 cache are also modified to appropriately model the read-write asymmetry, which is a unique characteristic of NVM technologies. The specific configuration of the simulated system architecture is presented in Table 1. Higher-level caches (L1 and L2) are performance-critical and hence are constructed using SRAM technology, which offers faster accesses. However, the shared L3 cache is designed using the STT-RAM technology. To reduce the performance overheads, the tag directory of the shared L3 cache is built using SRAM, while the data portion is implemented with STT-MRAM [8, 18, 32, 34]. CACTI [2] and NVSim [1], popular modeling utilities for SRAM and NVM caches, respectively, are used to obtain the cache latency parameters for 22 nm technology node.

In the experimental evaluation, we employ gem5’s system call emulation (SE) mode, which offers limited support for SPEC CPU 2017 benchmarks. Therefore, we use SPEC CPU 2006 benchmarks, known for possessing comparable memory access characteristics to SPEC CPU 2017 benchmarks [42]. For each workload, we initiate the simulation with a fast forward of 1 billion instructions, followed by a warm-up phase of 150 million instructions, and terminate it once any core completes 250 million instructions subsequently. As mentioned earlier, the shared L3 cache encounters two distinct sources of write operations: dirty writebacks from L2 caches and responses from DRAM. Considering the latency overhead associated with both these write accesses, we classify the SPEC CPU 2006 benchmarks into three distinct categories based on which of these write accesses has the most significant impact on the overall system performance, as shown in Table 2.

The characterization outlined in Table 2 is performed based on standalone executions of applications with their NVM-specific behaviors under consideration. Benchmarks categorized as Low Write Sensitive (LWS) exhibit only marginal performance degradation, less than 5%, in comparison to their execution on an SRAM L3 cache-based system, where read and write latencies are identical. Benchmarks with more significant performance implications (more than 5%) are further classified into two groups, depending on whether the NVM latency overhead of writebacks has more negative impact on the overall system performance than that of responses (referred to as Writeback Write Sensitive or WWS) or vice versa (referred to as Response Write Sensitive or RWS). For these sensitive benchmarks, both sources of write operations lead to substantial contention in the shared NVM L3 cache. However, in the case of WWS benchmarks, the writebacks are responsible for causing more contention compared to the responses. Conversely, for RWS benchmarks, it is the responses that lead to more contention in the shared NVM L3 cache, affecting the overall system performance more critically.

The overall experimental evaluation encompasses two distinct sets of workloads. Initially, a collection of 13 sensitivity workloads is employed to conduct an analysis of each policy parameter’s sensitivity. This is achieved by executing the sensitivity workloads while varying the values of the respective parameters. Once the parameter values are established based on the outcomes from the sensitivity workloads, a separate set of 12 evaluation workloads (outlined in Table 3) is used to present the final results. Both sets of workloads effectively capture a spectrum of diverse scenarios of shared L3 cache contention, aggravated disproportionately by different sources of NVM writes. For the experimental evaluation of POEM, we use four baselines: WCAB [35] and WR [6] as state-of-the-art solutions for performance and endurance, respectively, a combination of WCAB and WR, denoted as WCAB+WR, and the naïve No-Bypass Baseline (NBB), which serves as the common baseline for all.

In Section 5.2, we conduct a comprehensive analysis of POEM’s performance, comparing it with other baseline policies. Section 5.3 discusses the effects of cache bypassing on cache miss rates, while Section 5.4 provides detailed results on the NVM endurance across POEM and other baseline policies. The tradeoff between performance and endurance objectives is explored in Section 5.5. Sensitivity analyses of POEM’s performance across different NVM access latencies and number of processor cores are discussed in Sections 5.6 and 5.7, respectively. While Section 5.8 discusses the result of POEM’s adoption in a distributed LLC, Section 5.9 presents POEM’s throughput gains for a multi-port LLC. The sensitivities of various parameters and thresholds of POEM are presented in Section 5.10. Finally, we analyze POEM’s overheads in Section 6 and conclude the manuscript in Section 7.

5.2 Overall System Performance

The overall system performance is measured using the overall system throughput, which is quantified in terms of Instructions Per Cycle (IPC). The overall system throughput of a particular multi-core workload (or mix) is defined as the summation of IPCs of applications executing on all eight cores. We measure the speedup of a policy as follows:

\begin{equation} \mbox{Speedup}_{pol} = \dfrac{\mbox{IPC}_{pol}}{\mbox{IPC}_{NBB}} . \end{equation}

(2)

Here, pol can be any one of POEM, WCAB, WR, or WCAB+WR. Speedup is a metric where higher values indicate better performance. Since NBB represents a naïve policy that applies no bypass at all and writes all writebacks and responses, it serves as the common reference for defining speedup in Equation (2).

In Figure 6, we present the speedup results of proposed POEM along with three other baseline policies, WCAB, WR, and WCAB+WR, across a set of 12 evaluation workloads. Since NBB is the common reference for all the policies, we refrain from showing NBB results explicitly in Figure 6. Among the three baselines, WCAB offers the highest average speedup of 1.04 across the workloads. Specifically, for mixes 1–3, which consist entirely of WWS applications, WCAB achieves a significant average speedup of 1.11 because of WCAB’s focus on bypassing writebacks. For mixes 4–6, which consist of combinations of WWS and RWS applications, WCAB leads to an average speedup of 1.03. However, the performance of WCAB diminishes for the remaining mixes, characterized by a more diverse distribution of applications. For these workloads (mixes 7–12), WCAB leads to marginal speedup converging towards unity. These results could be attributed to WCAB’s runtime bypass decisions covering only writebacks and its excessive emphasis on cache data reuse, overlooking the contention created by responses from the DRAM.

Fig. 6.

As shown in Figure 6, WR incurs an average degradation of 1% in the overall system performance across the set of 12 evaluation workloads. This is consistently observed across almost all mixes. In the most unfavorable scenario (e.g., mix 11), the performance degradation reaches up to 3%. Unlike WCAB, WR involves writing all writebacks and responses to the NVM L3 cache, failing to mitigate the cache contention. Furthermore, WR’s aggressive wear leveling strategy, which initiates cache write redistribution from the very beginning of the workload execution, introduces performance overheads, generating an average speedup of 0.99.

WCAB+WR attains an average speedup of 1.03 over the naïve NBB across 12 evaluation workloads. The offered speedup is lower than that of WCAB due to the performance overheads associated with wear-leveling. However, it exceeds the speedup offered by WR because of the mitigation of LLC contention through writeback bypass. For instance, while mixes 1–3, comprising WWS applications, show 11% average speedup with WCAB and no speedup with WR, they demonstrate 10% average speedup with WCAB+WR. For mixes 4–6, comprising both WWS and RWS applications, WCAB and WR exhibit 3% improvement and 1% degradation in the overall system performance, respectively, whereas a combination of both, i.e, WCAB+WR leads to an average speedup of 2%.

Across 12 evaluation workloads, POEM consistently outperforms both WCAB and WR, delivering average performance improvements of 34% over NBB, \(28.8\%\) over WCAB, the state-of-the-art bypass baseline for NVM cache, and \(30.1\%\) over WCAB+WR, a combination of the state-of-the-art bypass and endurance solutions. POEM achieves substantial speedups for mixes 1–3, consisting entirely of WWS applications, with an average speedup of 1.57. For mixes 1–3, POEM significantly outperforms the state-of-the-art bypass solution WCAB, which offers its best results for these particular WWS mixes. The highest performance improvement for POEM is observed in case of mix 2 (comprising all WWS applications), where POEM demonstrates a remarkable speedup worth 1.9. For mixes 4–6, consisting of a combination of WWS and RWS applications, POEM attains a considerable average speedup of 1.46. Through its aggressive bypass decisions for writebacks and responses, POEM effectively alleviates the contention in the shared NVM L3 cache, leading to efficient processing of critical reads. Even in case of mix 8, which contains more number of LWS applications than RWS applications, POEM intelligently employs runtime bypass decisions for responses, resulting in a decent speedup of 1.05. Across the most heterogeneous workloads (mixes 9–12), comprising all three types of applications (i.e., LWS, WWS, and RWS), POEM delivers an average speedup of 1.21, with the highest speedup reaching up to 1.5 for mix 9.

5.3 NVM Cache Miss Rate

Figure 7 presents the overall NVM cache miss rate for proposed POEM, the state-of-the-art bypass solution WCAB, and NBB to estimate the impact of cache bypass decisions on the overall cache locality. Since the cache miss rates are themselves ratios, Figure 7 shows absolute values for the NVM cache miss rates of the three policies. Given our focus on capturing the influence of cache bypass on cache locality, we have omitted WR from the results depicted in Figure 7, as it does not implement any bypass mechanism. We also have omitted WCAB+WR, because the insights are captured in the WCAB results itself. The wide range of cache miss rate values across our workloads in Figure 7 indicates the diversity of the workloads not only in terms of cache contention behavior, but also in terms of cache access (or reuse) patterns. While WWS applications in mixes 1 and 2, namely, omnetpp, soplex, and gamess, exhibit high data reuse in the L3 cache, the majority of RWS applications, such as mcf, zeusmp, milc, and bwaves, demonstrate relatively poorer L3 data reuse.

Fig. 7.

Across the set of 12 evaluation workloads, WCAB and POEM exhibit an average increase in the NVM cache miss rate of \(0.33\%\) and \(8.41\%\), respectively, compared to NBB. In mixes 1–3, which consist entirely of WWS applications, WCAB and POEM increase the NVM cache miss rate by \(0.67\%\) and \(7.34\%\), respectively, over NBB. In the context of mixes 4–6, a combination of WWS and RWS applications, the limited bypass strategy of WCAB increases the cache miss rate only by \(0.34\%\), whereas the more aggressive bypass approach adopted by POEM, targeting both writebacks and responses, leads to a \(9.67\%\) increase in the cache miss rate. In the context of the most heterogeneous workloads (mixes 9–12), WCAB maintains almost similar cache miss rate as NBB, while POEM leads to an increase worth \(6.75\%\).

The overall system performance is influenced by both cache locality and cache contention. While previous bypass policies primarily emphasized on enhancing cache locality through bypassing of redundant (or dead) cache data, they did not pay enough attention to the contention aspect. Contrary to these policies, POEM employs bypass decisions to alleviate the NVM shared cache contention. On an average across our workloads, POEM ends up bypassing 60% of the cache writebacks and 66% of DRAM responses, because of its appropriate emphasis on the aspect of cache contention. Such aggressive bypass decisions effectively mitigate the contention but might also affect the cache locality. However, even in phases of high contention and associated aggressive bypass decisions, POEM’s selective reuse-aware cache writes enable it to offer performance improvements for all types of mixes, despite the side effects on cache locality.

5.4 NVM Cache Endurance

In real hardware, the NVM endurance is predetermined, often represented by a fixed value such as \(4\times 10^{12}\) for a particular NVM technology such as STT-MRAM [26, 49]. However, due to practical limitations in simulation environments, which prevent us from executing real-world workloads for a sufficiently long duration required to actually reach such endurance limits, we use two statistical metrics to estimate the NVM cache endurance in our simulation setup. Also, in an actual hardware, the risk of wear-out is attributed to individual NVM bit-cells. However, we monitor the write frequency of individual cache ways. This is a reasonable approximation to gain insights into the endurance problem without incurring excessive hardware overheads for tracking individual bit-cells [6, 48, 59, 73].

The first metric (\(\mbox{M}_{max}\)) considers the maximum write frequency incurred by any cache way, over the whole NVM cache, throughout the entire workload execution, as expressed in Equation (3) as follows:

\begin{equation} \mbox{M}_{max} = \max _{i=0}^{N-1} \mbox{WC}[i]. \end{equation}

(3)

Here, \(N\) represents the total number of cache ways, and \(\mbox{WC}[i]\) indicates the write frequency of cache way \(i\) over the course of execution, measured by hardware counters available as a part of additional metadata for every cache way. By reducing the value of \(\mbox{M}_{max}\), a policy is considered to improve the NVM cache endurance, effectively lowering the risk of the most vulnerable cache line crossing the endurance limit. Given the limitations of replicating real-world wear conditions in shorter simulations, we also incorporate another metric, denoted as \(\mbox{M}_{var}\), to quantify the imbalance or skewness in the distribution of write accesses across all cache ways. This metric is defined in Equation (4) as follows:

\begin{equation} \mbox{M}_{var} = \frac{\sqrt {\frac{1}{N-1} \sum _{i=0}^{N-1} \left(\mbox{WC}[i] - \frac{1}{N} \sum _{i=0}^{N-1} \mbox{WC}[i] \right)^2}}{\frac{1}{N} \sum _{i=0}^{N-1} \mbox{WC}[i]} . \end{equation}

(4)

The numerator in Equation (4) represents the standard deviation of write frequencies across all cache ways, while the denominator evaluates the average write frequency of a cache way. This metric is also referred to as the co-efficient of variation, which is a statistical measure for the overall variation of writes across the entire NVM cache [59]. Because of its global nature, \(\mbox{M}_{var}\) encompasses notions of both intra-set and inter-set cache write variations. A higher value of \(\mbox{M}_{var}\) signifies that certain cache blocks experience a disproportionate number of writes, increasing their vulnerability to wear out, while other cache blocks receive significantly fewer writes. Such a skewed distribution of cache writes underscores the need for an endurance policy that can effectively redistribute the write frequency across various cache lines, thereby reducing the risk of wear-out. A lower value of \(\mbox{M}_{var}\) indicates a more uniform write distribution across different cache ways. In an ideal scenario where all cache ways are written the exact same number of times, \(\mbox{M}_{var}\) would be 0. However, if all cache ways are written uniformly and also heavily, then \(\mbox{M}_{var}\) would still be 0, failing to capture the endurance threat of the scenario. That is why \(\mbox{M}_{max}\) is a crucial metric for understanding the cache endurance. However, \(\mbox{M}_{max}\) has not been adopted extensively by the state-of-the-art, which focused only on variation metrics such as \(\mbox{M}_{var}\) [6]. Nevertheless, we believe that both these metrics are equally essential for a comprehensive evaluation of the NVM endurance issue, and we incorporate both into our analysis.

5.4.1 Worst-case Write Frequency.

Figure 8 presents a comparative analysis of the cache worst-case write frequency (\(\mbox{M}_{max}\)) for the proposed POEM and other baseline policies. The value of this metric for a particular mix within a specific policy, as illustrated in Figure 8, is normalized with respect to the value of this metric for that particular mix under the NBB. Across the set of 12 evaluation workloads, WCAB, WR, WCAB+WR, and proposed POEM reduce the worst-case write frequency of cache blocks by 6%, 14%, 15%, and 15%, respectively, over NBB.

Fig. 8.

With the help of its writeback bypass, WCAB reduces the worst-case cache write frequency by 10% on an average across mixes 1–3 consisting of all WWS applications. However, because WCAB does not explicitly address the underlying cache write distribution, the reductions in the value of \(\mbox{M}_{max}\) are significantly less compared to policies such as WR and POEM, which incorporate specific endurance managements. WR is an ultra-proactive wear leveling policy that initiates redistributing NVM writes from the beginning of the execution. While WR incurs performance overheads for such aggressive wear leveling, it reduces \(\mbox{M}_{max}\) by 14% on an average across our workloads, with the highest reduction over NBB reaching up to 40%. On an average across 12 workloads, the WCAB+WR baseline demonstrates a notable reduction of 15% in the value of \(\mbox{M}_{max}\). This surpasses the average reductions of 6% and 14% achieved by WCAB and WR, respectively. For the most heterogeneous workloads (mixes 9–12), while WCAB yields an average reduction of 6% and WR achieves a significant reduction of 22% in the \(\mbox{M}_{max}\) value over NBB, their integration results in a higher reduction of 25% in the value of \(\mbox{M}_{max}\) over NBB. For mixes 4–6 comprising WWS and RWS applications, while WCAB and WR individually reduce \(\mbox{M}_{max}\) by 5% and 4%, respectively, WCAB+WR offers an average reduction of 6% in the value of \(\mbox{M}_{max}\). These findings underscore the synergistic benefits of integrating the bypass and wear-leveling strategies. However, WR and WCAB+WR both start redistributing the write traffic across cache lines ultra-proactively, incurring significant performance overheads.

However, POEM’s endurance policy is triggered only when a cache block’s write frequency surpasses an endurance threshold (\(\mbox{E}_{Th}=400\)), thereby avoiding unnecessary performance stalls. Despite being much less aggressive in wear leveling than WR and WCAB+WR, POEM achieves slightly higher reduction in \(\mbox{M}_{max}\) than WR and similar reduction in \(\mbox{M}_{max}\) as WCAB+WR. However, the endurance gains vary across different workloads. For mixes 1–3 consisting entirely of WWS applications, POEM demonstrates 10%, 8%, and 9% more reduction in the value of \(\mbox{M}_{max}\) compared to WCAB, WR, and WCAB+WR, respectively. For mixes 4–6, comprising combinations of WWS and RWS applications, POEM leads to 8%, 9%, and 7% more reductions in \(\mbox{M}_{max}\) compared to WCAB, WR, and their combination, respectively. So, for these mixes that stress the NVM cache most significantly, POEM beats WR and WCAB+WR, despite being much less aggressive in terms of wear leveling. However, for other more diverse workloads such as mixes 9–12, WR and WCAB+WR achieve 7% and 10% higher reductions in the value of \(\mbox{M}_{max}\) than POEM, respectively. The overall efficiency of POEM in reducing the maximum cache block write frequency could be attributed to the fact that POEM can potentially swap between the hottest and coldest data. In contrast, WR (and WCAB+WR) redirects writes from the most-written cache ways to the least recently used (LRU) victim from the remaining cache ways, and the LRU victims might not necessarily contain the coldest data, making them less promising swap candidates.

5.4.2 Cache Write Variation.

Figure 9 presents a comparative analysis of the cache write variation (\(\mbox{M}_{var}\)) for POEM and other baseline policies. The value of this metric for a particular mix within a specific policy, as illustrated in Figure 9, is normalized with respect to the value of this metric for that particular mix under the NBB.

Fig. 9.

On average across our evaluation workloads, WCAB, WR, WCAB+WR, and POEM demonstrate a reduction in the cache write variation (\(\mbox{M}_{var}\)) by 3%, 9%, 10%, and 11%, respectively, in comparison to NBB. While WCAB could reduce the value of the worst-case cache write frequency (\(\mbox{M}_{max}\)) by curtailing the overall NVM write traffic through its dynamic writeback bypasses, it fails to have significant impacts on the cache write variation. For example, while WCAB reduces the value of \(\mbox{M}_{max}\) by 10% across mixes 1–3 comprising all WWS applications, the average reduction of \(\mbox{M}_{var}\) for these mixes under WCAB turns out to be 2% only.

WR redirects write stress from hot cache ways to LRU victims rather than explicitly selecting coldest victims, and, as a result, the overall reduction in cache write skew offered by WR is less prominent compared to POEM, which incorporates careful swap operations between hot and cold cache lines. Mixes 4–6 comprising WWS and RWS applications, exhibit 6% and 8% average reductions in \(\mbox{M}_{var}\) with WR and WCAB+WR while demonstrating a significant reduction of 15% in the value of \(\mbox{M}_{var}\) with POEM. For the most heterogeneous workloads, i.e., mixes 9–12, POEM and WCAB+WR both lead to 12% reductions in the value of \(\mbox{M}_{var}\), with WCAB and WR offering reductions of only 4% and 8%, respectively, over NBB. However, for mixes 1–3 consisting entirely of WWS applications, WR and WCAB+WR achieve 15% and 13% reductions in the value of \(\mbox{M}_{var}\) over NBB, respectively, with POEM offering an overall reduction of 5% across these mixes. While the combination of state-of-the-art bypass and endurance strategies proves to be more effective than the individual ones, it is crucial to emphasize that the aggressive wear leveling of WR (and WCAB+WR) comes at the expense of significant performance overheads. In contrast, POEM, despite applying a much less aggressive wear leveling approach compared to WR (or WCAB+WR), achieves a more uniform distribution of NVM writes, surpassing the endurance gain of either method individually or in combination. Across our workloads, there is an interesting interplay between the overall speedup and the improvement in endurance, which is further detailed in Section 5.5.

While the two metrics, i.e., \(\mbox{M}_{max}\) and \(\mbox{M}_{var}\), provide adequate insights into the NVM endurance and lifetime, we also consider the average write frequency per cache line as another relevant statistical property of the underlying NVM write distribution. This metric could help gather a broader understanding of how different policies influence the overall NVM write behavior. Across our workloads, WCAB, WCAB+WR, and POEM reduce the average cache block write frequency by 2%, 2%, and 11%, respectively, over NBB, which does not involve cache write bypass at all. WR also does not apply any cache write bypass, maintaining an average block write frequency similar to that of NBB. This underscores the fact that cache write bypass policies such as WCAB, WCAB+WR, and POEM can reduce the average frequency of cache block write. However, wear leveling alone, e.g., WR, does not influence the average frequency of cache block write, because it primarily involves redistributing writes from hot cache ways to other cache ways.

5.5 Performance vs. Endurance Tradeoffs

While Sections 5.2 and 5.4 present the performance and endurance results separately for POEM and other baseline policies, the current section aims to explore the tradeoff between these two crucial system-level objectives by examining various interesting variations of POEM. While discussing the individual performance and endurance results in the previous sections, the value of \(\mbox{E}_{Th}\) is assumed to be 400. However, by adjusting the value of \(\mbox{E}_{Th}\), it is possible to modulate the emphasis placed on the two system-level objectives: boosting overall speedup and enhancing NVM endurance. This flexibility empowers system designers to explore different tradeoffs between performance gains and endurance improvement, as shown in Figure 10.

Fig. 10.

We present the results in two separate figures, Figures 10(a) and 10(b), with the X-axis depicting the two different endurance metrics, i.e., \(\mbox{M}_{max}\) and \(\mbox{M}_{var}\), respectively, and the Y-axis representing the achieved speedup. Both figures present the speedup and the reduction in the two endurance metrics as percentages. We vary the endurance threshold of POEM across a range of values, specifically: 0, 200, 400, 600, and 800, represented by POEM-0, POEM-200, POEM-400, POEM-600, and POEM-800, respectively. We also include the results for a policy denoted as PO, which essentially represents the POEM strategy without the endurance management (EM). PO is an interesting variant of POEM that, despite no explicit wear leveling, could potentially enhance the NVM endurance by curtailing the NVM write traffic significantly through its aggressive bypass approach.

In Figure 10(a), the NBB policy, which is the common baseline for all other policies, is positioned at the origin (0,0). WCAB demonstrates a 4% speedup, coupled with a 6% reduction in the worst-case block write frequency (\(\mbox{M}_{max}\)), and WR enhances the NVM endurance by 14% but at the expense of 1% degradation in the overall performance. A combination of WCAB and WR offers a 15% reduction in the worst-case block write frequency while offering a speedup worth 3% over NBB. In summary, the combination of state-of-the-art bypass and endurance strategies proves beneficial for cache endurance but is significantly limited by its performance. POEM-0 activates the wear leveling from the start of the simulation (similar to WR), generating a significant (i.e., 15%) reduction in the worst-case cache block write frequency. However, while WR incurs performance overheads because of its ultra-proactive wear leveling, POEM-0 manages to enhance the overall system performance by 10% due to its strategic bypass decisions to mitigate the NVM cache contention. As we gradually increase the value of POEM’s endurance threshold (\(\mbox{E}_{Th}\)), the emphasis shifts from wear leveling to overall speedup. This trend is evident in the data points for POEM-200, POEM-400, POEM-600, and POEM-800, which yield speedups of 14%, 34%, 39%, and 39%, respectively, with endurance enhancements of 15%, 15%, 14%, and 12% compared to NBB. The PO policy achieves the highest performance gain, because it does not perform wear leveling. While WCAB, with its conservative bypassing of NVM writes, shows limited endurance gain (only 6% over NBB), PO, with its aggressive bypass approach, is capable of producing much higher endurance gain, even without applying the wear leveling strategy at all. When the endurance threshold is significantly increased (e.g., 800), POEM aligns closely with PO in terms of performance and endurance improvements because of incurring minimal number of swap operations. Table 4 details the total number of swap operations incurred by POEM-0, POEM-200, POEM-400, POEM-600, and POEM-800 during the entire execution, along with the associated performance overheads. By comparing the speedups of these POEM variants against PO, which does not invoke any swap operation, we can estimate the performance overhead of wear leveling. Table 4 shows that the increment in the endurance threshold corresponds to a reduction in the total number of swap operations as well as the associated performance overheads.

Table 4.

Policy	POEM-0	POEM-200	POEM-400	POEM-600	POEM-800
Number of Swap Operations	3,711,388	3,037,028	631,333	518	14.75
Swap Overhead (Speedup degradation over PO)	22%	19%	5%	1%	1%

Table 4. Number of Swap Operations and Associated Performance Overheads across Different Variants of POEM with Different Endurance Thresholds

Figure 10(b) provides insights into the tradeoffs between speedup and the reduction in cache write variation (\(\mbox{M}_{var}\)) across different variants of POEM. The state-of-the-art baselines, WCAB, WR, and WCAB+WR, are clustered near the origin (NBB), indicating their sub-optimal performance and endurance enhancements compared to the variants of proposed POEM. Among the variants of POEM, POEM-0 achieves the highest reduction in the cache write variation (21%) due to its most aggressive wear leveling strategy. However, its speedup is relatively lower (10%). For system designers aiming to achieve a more uniform cache write distribution, even if it comes at the expense of sub-optimal performance, POEM-0 emerges as one of the suitable design options, apart from POEM-200, which enhances the cache endurance similarly while offering more speedup (14%). However, POEM-400, POEM-600, and POEM-800 increasingly prioritize performance improvement over write redistribution, as previously indicated in the trends captured by Figure 10(a). POEM-400, POEM-600, and POEM-800 lead to speedups worth 34%, 39%, and 39%, respectively, while at the same time, reducing the imbalance in NVM writes by 11%, 8%, and 8% compared to NBB. POEM-800, with the lowest swap overheads, nearly matches the performance and endurance gains of PO. The adoption of aggressive bypass decisions for NVM write operations could improve both the endurance metrics (\(\mbox{M}_{max}\) and \(\mbox{M}_{var}\)), as are shown in Figures 10(a) and 10(b). While PO and WCAB improve metric \(\mbox{M}_{max}\) by 12% and 6%, respectively, over NBB, the improvements in metric \(\mbox{M}_{var}\) are 8% and 3% over NBB, respectively. The impact of NVM write bypass is more pertinent in the worst-case write frequency metric (\(\mbox{M}_{max}\)). Wear leveling policy explicitly addresses the skew in the underlying NVM write distribution, leading to additional endurance gain by re-distributing wear across various cache lines. That is why policies such as POEM-200, which employ explicit wear leveling, improve metrics \(\mbox{M}_{max}\) and \(\mbox{M}_{var}\) by \(2.6\%\) and 12%, respectively, over PO, which only performs NVM write bypass.

5.6 Speedup across Different NVM Access Latencies

Prior work consistently acknowledges that STT-MRAM write latency is significantly higher than the read latency. The reported ratio of write and read latencies varies across a range of 2–7 [3, 7, 10, 12, 21, 27, 35, 44, 45, 56, 57, 58, 68, 76]. The widely used NVM cache modeling utility, NVSim [1], provides read and write access latencies for our STT-MRAM L3 cache as 7 and 23 cycles, respectively, with a ratio of 3.29 between them, falling within the reported range. However, to account for the variation in read-write symmetry, we conduct experiments to measure how the overall speedup varies across different NVM access latencies.

Figure 11 summarizes the findings of the experiment by reporting the average speedup of POEM across the set of 12 workloads, corresponding to each choice of different write latency values. As shown in Figure 11, POEM’s average speedup becomes 34%, 31%, 26%, 23%, 18%, 14%, 12%, 5%, and 2% for write latency values of 23, 21, 19, 17, 15, 13, 11, 9, and 7 cycles, respectively. POEM stands out as a robust and effective NVM cache controller policy, strategically exploiting the tradeoff between cache contention and reuse, delivering significant overall speedup (greater than or equal to 12%), even in scenarios where the NVM write latency is significantly reduced by up to 12 cycles. When the write latency is set equal to the read latency, transforming the LLC essentially into an SRAM cache, bandwidth contention still remains a performance bottleneck [11, 51, 70], although being much less severe than in the case of an NVM LLC. Even for an SRAM LLC, POEM allocates the available shared LLC bandwidth more to critical reads, achieving some throughput gain. Even though the average throughput gain across the whole workload is comparatively lower (2%), POEM offers an average speedup worth 8% for the SRAM LLC across mixes 1–6, which, consisting of WWS and RWS applications, contend more for the LLC bandwidth. These observations indicate POEM’s robustness in delivering adequate speedup, even when the read write asymmetry is significantly reduced.

Fig. 11.

In Figure 12, we illustrate the variation of the average speedup offered by POEM across different read and write latency values, maintaining similar ratios between them. As both the cache access latencies are reduced, we observe POEM to offer less throughput gains due to the overall reduction in the NVM LLC queuing. For the cache read and write latency values of 7 and 23 cycles, 6 and 20 cycles, 5 and 17 cycles, and 4 and 14 cycles, POEM offers speedups worth 34%, 29%, 20%, and 14%, respectively. Even with a significant access time reduction of 3 cycles for read latency and 9 cycles for write latency, POEM achieves a significant average speedup of 14% over NBB across the workloads, with the highest speedup reaching up to 52% for mix2. These findings indicate that despite improvements in NVM access latencies, as long as their asymmetry persists, POEM remains effective by exploiting the asymmetry and mitigating contention with its aggressive write bypass decisions. These experiments measuring POEM’s performance gains with reduced LLC access latencies can also be interpreted as applying to an LLC where multiple accesses are overlapped in time, effectively relieving some of the contention.

Fig. 12.

5.7 Speedup across Different Core Counts

To quantify the impact of varying core counts on POEM’s speedup, we create sets of 6- and 4-core workloads using the benchmarks from the original 8-core workloads, depicted in Table 3. While creating the new workload sets, we ensure that they maintain diversity similar to the original 8-core workloads. For brevity, we do not provide the specific compositions of the new 6- and 4-core workloads. The mixes (mix-\(i\), for \(i \in [1,12]\)) across these three heterogeneous workload sets (8-, 6-, and 4-core) are not identical and should not be compared across experiments. Nevertheless, we can derive an overall perspective by comparing POEM’s average speedup across these three sets of workloads.

Figure 13 illustrates the average speedup achieved by POEM across sets of 8-, 6-, and 4-core workloads. On average, POEM improves the overall system throughput by 34%, 30%, and 23% for the 8-, 6-, and 4-core workloads, respectively. The decrease in speedup with a reduction in system core count is attributed to the natural reduction in the shared LLC contention. We collect the average LLC request queue length for each set of workloads to quantify the notion of contention. The average LLC request queue lengths are found to be 32.53, 24.38, and 12.6 across the sets of 8-, 6-, and 4-core workloads, respectively. Despite 6- and 4-core workloads resulting in \(25.05\%\) and \(61.27\%\) reductions in the LLC contention (request queue length) compared to the original 8-core workloads, POEM offers significant speedup throughout.

Fig. 13.

5.8 Speedup in Distributed LLC

As gem5’s classic memory model cannot support the implementation of a distributed LLC over a predefined network topology, we use gem5’s Ruby memory model to determine POEM’s performance gains for a distributed LLC. We conduct experiments on our original set of 12 workloads on an 8 MB LLC distributed into 1, 2, 4, and 8 slices over a \(4\times 4\) mesh NoC. For each case, we extract the values of access latency for individual slices from NVSim [1] and incorporate them into our experimental setup.

On an average across our original 8-core workloads (excluding mixes 7 and 8 due to their incomplete executions), POEM achieves speedups of 30%, 22%, 10%, and 5% when the LLC slice counts are 1, 2, 4, and 8, respectively, as shown in Figure 14. POEM operates independently in each LLC slice, and as the slice count increases, the contention in each slice decreases, leading to a reduction in POEM’s speedup. This trend is observed across different categories of workloads. For instance, mixes 1–3 consisting of WWS applications result in average speedup of 7%, 6%, 3%, and 0% for cases when LLC consists of 1, 2, 4, and 8 slices, respectively. For mixes 4–6 comprising WWS as well as RWS applications, POEM offers considerable speedup in all cases, with 63%, 47%, 20%, and 11% performance improvements over NBB with 1, 2, 4, and 8 LLC slices, respectively. For more heterogeneous workloads, such as mixes 9–12, POEM’s speedup becomes 28%, 18%, 8%, and 3% when the LLC is distributed into 1, 2, 4, and 8 slices, respectively. As we distribute the LLC more aggressively, the average queue length for individual slices decreases, with the slices handling multiple accesses concurrently. On average across the workloads, the average lengths of the request queue per LLC slice are found to be 24.29, 9.08, 3.31, and 1.55, respectively, for the monolithic, 2-sliced, 4-sliced, and 8-sliced LLC. Despite the distribution of overall LLC congestion across different request queues corresponding to individual slices through this architectural enhancement, POEM maintains its non-trivial speedup (\(\ge 5\%\)).

Fig. 14.

5.9 Speedup in Multi-port LLC

Multi-port cache architectures are more prevalent in the context of smaller caches, such as private L1 and L2 caches, due to their inherent performance and energy overheads. Nevertheless, some state-of-the-art LLCs in high-performance MPSoCs are anticipated to support two access ports for parallel handling of requests and responses, with one port dedicated to read/write operations while the other exclusively for writes. This architectural enhancement effectively mitigates contention between requests and responses, thereby resolving the contention between two sources of NVM write operations: writebacks and responses.

We modify the LLC model in gem5 to support two access ports, as discussed above, and execute NBB and POEM for our evaluation workloads. As depicted in Figure 15, POEM with a dual-port LLC achieves a considerable average speedup of 17% over NBB, which also runs on a dual-port LLC, across our workloads (excluding mixes 6 and 7 that failed to complete execution). Specifically, for mixes 1–3 consisting entirely of WWS applications, POEM achieves an average speedup of 47%, whereas we observe an average throughput gain of 9% across mixes 4 and 5, including both WWS and RWS applications. For more diverse workloads (mixes 9–12), POEM achieves an average speedup of 6%. These experiments indicate that POEM is capable of providing significant improvements in overall system throughput, even in the context of a dual-port LLC.

Fig. 15.

5.10 Parameter Selection

For the selection of parameters associated with POEM, we rely on a separate set of workloads than the ones used to obtain the final results. Across the set of 13 sensitivity workloads, we experiment with different combinations of values for each of the parameters, as outlined below.

5.10.1 Reuse Thresholds.

POEM uses four thresholds on the predicted reuse counts, two while bypassing writebacks, \(\mbox{Th}_{High\_wb}\), and \(\mbox{Th}_{Low\_wb}\), and the other two while bypassing responses: \(\mbox{Th}_{High\_resp}\), and \(\mbox{Th}_{Low\_resp}\). By varying each threshold between 1 and 4, we experiment with several combination of values for \(\mbox{Th}_{High\_wb}\), \(\mbox{Th}_{Low\_wb}\), \(\mbox{Th}_{High\_resp}\), \(\mbox{Th}_{Low\_resp}\), such as: \(\lbrace 1,1,1,1\rbrace\), \(\lbrace 2,1,2,1\rbrace\), \(\lbrace 4,2,2,1\rbrace\), \(\lbrace 2,1,4,2\rbrace\), \(\lbrace 4,3,2,1\rbrace\), \(\lbrace 4,1,4,1\rbrace\), and so on. Despite observing minor variation in the achieved speedups, we opt for \(\mbox{Th}_{High\_wb}=4\), \(\mbox{Th}_{Low\_wb}=2\), \(\mbox{Th}_{High\_resp}=2\), \(\mbox{Th}_{Low\_resp}=1\), as it yields the highest speedup and the lowest cache miss rate among all other options.

5.10.2 Contention Tolerance.

POEM uses empirical tolerance (\(\alpha\)) while evaluating the severity of contention by comparing the recent average of the read queue length with its running average. We vary \(\alpha\) across \(\lbrace 0, 0.1, 0.2, 0.4, 0.6, 0.8\rbrace\) and observe that the overall speedup remains almost identical across these choices. However, in comparison to NBB, these choices lead to increases in the cache miss rate by 26%, \(25.7\%\), \(25.3\%\), 16%, \(14.7\%\), and \(13.9\%\), respectively. So, we select the value of \(\alpha\) to be 0.8, which incurs minimal cache miss rate.

5.10.3 Reuse Predictor Size.

POEM employs reuse predictors to monitor patterns in recent instances of cache data reuse. We investigate various sizes for the reuse prediction tables: 512, 1,024, 2,048, and 4,096. We observe marginal variations in the attained speedup among these alternatives and choose the predictor size to be 512 entries, primarily due to its minimal storage overhead.

5.10.4 Epoch Length.

POEM collects performance counters at the end of each epoch. We conduct experiments with epoch duration of 50k, 100k, 200k, and 400k cycles, observing negligible performance variations. Although an epoch of 50k cycles results in a slightly higher speedup for POEM compared to the other choices, we adopt an epoch duration of 400k cycles to minimize the overhead associated to the collection of various performance counters.

5.10.5 Running Average Window.

While assessing the criticality of contention, POEM uses the running average of the read queue lengths across previous \(K\) epochs. We try different values for \(K\), such as 4, 8, 16, and 32, and observe that the achieved speedups are quite comparable. Therefore, we settle on a choice of \(K=4\), as it offers the advantage of minimal storage (registers) requirement for the computation of the running average.

5.10.6 Thresholds for Swap Bit Reset.

The swap bit of a cache line is reset multiple times during the program execution to control its swap eligibility. We use three thresholds on the block write frequency: \(\mbox{T}_1\), \(\mbox{T}_2\), and \(\mbox{T}_3\). We experiment with the following three combinations: (\(\mbox{T}_1 = 0.25 \times \mbox{E}_{Th}\), \(\mbox{T}_2 = 0.50 \times \mbox{E}_{Th}\), \(\mbox{T}_3 = 0.75 \times \mbox{E}_{Th}\)), (\(\mbox{T}_1 = 0.50 \times \mbox{E}_{Th}\), \(\mbox{T}_2 = 0.625 \times \mbox{E}_{Th}\), \(\mbox{T}_3 = 0.75 \times \mbox{E}_{Th}\)), and (\(\mbox{T}_1 = 0.125 \times \mbox{E}_{Th}\), \(\mbox{T}_2 = 0.25 \times \mbox{E}_{Th}\), \(\mbox{T}_3 = 0.50 \times \mbox{E}_{Th}\)). We observe almost identical behavior in terms of the two endurance metrics and choose the most intuitive combination: \(\mbox{T}_1 = 0.25 \times \mbox{E}_{Th}\), \(\mbox{T}_2 = 0.50 \times \mbox{E}_{Th}\), \(\mbox{T}_3 = 0.75 \times \mbox{E}_{Th}\).

6 Overhead

To assess the energy, area, and performance overheads of the different components (adders, comparators, decoders, registers, buffers, etc.) in the hardware implementation of POEM, we referred to Synopsys 22 nm technology library. In POEM, the request and response bypass policy determines whether incoming writebacks and responses should be inserted into their respective controller queues. This does not interfere with the actual processing of requests and responses in the NVM cache. Thus, in general, the bypass policy does not lie on the critical path. The only scenario where the bypass decision might impact the critical path is when there are no pending requests or responses in any of the controller queues, which never occurs across our simulations. The latency overhead of a single swap operation is calculated by adding together the latencies of NVM read and write accesses and the access overhead associated to the swap buffer. A swap operation obstructs regular cache accesses, thereby resulting in aggravated contention. We have incorporated these overheads into our cache model and quantified the performance overheads of swap operations in Table 4, Section 5.5.

Each NVM cache line is equipped with additional 12 bits of metadata, out of which 11 bits are utilized to keep track of the write frequency for each cache line, and the remaining one is used as the swap-bit. Our workloads recorded an average execution time of \(1.72\times 10^{8}\) cycles, which corresponds to 0.086 second for a 2 GHz clock. According to the estimates from NVSim [22] and CACTI [52], the default (without POEM’s modifications) energy consumption of the LLC averages to \(6.24\times 10^{7}\) nJ across our workloads. Out of this, \(1.42\times 10^7\) nJ is attributed to the SRAM tag array, while \(4.82\times 10^7\) nJ is consumed by the STT-MRAM data array. The further breakdown of the energy consumption of the SRAM tag array reveals a dynamic energy component of \(1.79\times 10^6\) nJ and a leakage energy component of \(1.24\times 10^7\) nJ. Notably, the leakage component accounts for \(87.43\%\) of the overall tag energy, contributing to \(19.94\%\) of the total LLC energy. The additional energy overheads associated to the bypass and endurance policies in POEM are \(7.22\times 10^4\) nJ and \(4.64\times 10^6\) nJ, respectively, indicating consumption of \(0.12\%\) and \(7.44\%\) additional energies over the default LLC energy. Within the energy overhead of POEM’s endurance policy, \(3.32\times 10^6\) nJ is attributed to the leakage from the additional 12 bits of SRAM meta-data (write counters and swap bits), \(1.32\times 10^6\) nJ to the swap operations, and the rest to the additional controller logic. Considering SRAM’s high leakage, \(71.5\%\) of the energy overhead associated to POEM’s endurance policy is specifically attributed to the leakage from additional SRAM metadata. In the absence of POEM, the default LLC tag array incurs a leakage energy consumption of \(1.24\times 10^7\) nJ. When the LLC tag array is modified according to POEM, we measure an increased leakage energy consumption of \(1.58\times 10^7\) nJ. The dynamic energy component of the unmodified LLC tag array amounts to \(1.79\times 10^6\) nJ, while the LLC tag array modified according to POEM incurs a dynamic energy expenditure of \(1.81\times 10^6\) nJ.

The default size of the LLC tag array is 720 KB, whereas the LLC tag array modified according to POEM has a size of 912 KB. According to the estimates from NVSim [1] and CACTI [2], the bypass and endurance policies of POEM introduce area overheads of 0.052 \(mm^2\) and 0.25 \(mm^2\), respectively, incurring \(1.67\%\) and \(8.06\%\) additional area footprint over the default LLC area of 3.10 \(mm^2\).

7 Conclusion and Future Work

While non-volatile memories (NVMs) emerge as promising alternatives to SRAM in designing cache memories for next-generation MPSoCs, they critically suffer from write inefficiency and limited endurance. Existing research addresses these issues separately, lacking a comprehensive approach. This study introduces Performance Optimization and Endurance Management (POEM), a novel NVM cache controller policy that dynamically bypasses writebacks and responses to alleviate the NVM cache contention and also distributes the wear across different cache lines efficiently. Across SPEC workloads, POEM achieves average speedups of 34% and \(28.8\%\) over the naïve baseline and state-of-the-art NVM bypass technique, respectively. POEM reduces the worst-case NVM write frequency by 15% over the baseline, achieving 11% more uniformity in the underlying NVM write distribution. Through appropriate configuration of the endurance threshold parameter, POEM effectively exploits inherent tradeoffs between system performance and NVM endurance, with a lower threshold favoring wear leveling and a higher one prioritizing performance improvement.

While POEM currently confines its swap operations within the cache set, future plans include leveraging more flexible swap operations across different cache sets with disproportionate write intensities. POEM’s aggressive bypass mitigates the NVM cache contention, enhancing the overall system throughput. However, such aggressive bypass impacts the cache locality adversely, with interesting implications for the energy consumption of the whole memory hierarchy. Future research involves exploring energy-aware adaptations of POEM. Additionally, hybrid cache architectures, featuring the availability of a fast SRAM memory along with slower NVMs, introduce novel optimization opportunities that we also would like to investigate in the future.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments.

References

[1]

2013. NVSim - A performance, energy and area estimation tool for non-volatile memory (NVM). Retrieved from https://github.com/SEAL-UCSB/NVSim

Abstract

1 Introduction

2 Related Work

3 Motivation

4 POEM: Performance Optimization and Endurance Management for Non-volatile Caches

4.1 Request and Response Bypass Policy

4.2 Endurance Policy

5 Experiments and Results

5.1 Experimental Setup

5.2 Overall System Performance

5.3 NVM Cache Miss Rate

5.4 NVM Cache Endurance

5.4.1 Worst-case Write Frequency.

5.4.2 Cache Write Variation.

5.5 Performance vs. Endurance Tradeoffs

5.6 Speedup across Different NVM Access Latencies

5.7 Speedup across Different Core Counts

5.8 Speedup in Distributed LLC

5.9 Speedup in Multi-port LLC

5.10 Parameter Selection

5.10.1 Reuse Thresholds.

5.10.2 Contention Tolerance.

5.10.3 Reuse Predictor Size.

5.10.4 Epoch Length.

5.10.5 Running Average Window.

5.10.6 Thresholds for Swap Bit Reset.

6 Overhead

7 Conclusion and Future Work

Acknowledgements

References

Cited By

Index Terms

Recommendations

Endurance-aware cache line management for non-volatile caches

Composing lifetime enhancing techniques for non-volatile main memories

Reducing traffic generated by conflict misses in caches

Comments

Information

Published In

Publisher

Journal Family

Publication History

Check for updates

Author Tags

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Login options

Full Access

Share

Share this Publication link

Share on social media

Affiliations