1 Introduction
In recent years, the demand for memory has grown rapidly due to the prevalence of big data applications, such as in-memory databases and graph processing [
36]. However, the slowdown of DRAM device scaling [
38] has constrained the capacity of local memory. As a result, various far memory techniques such as CXL-based memory expansion, disaggregated memory, and
Non-Volatile Main Memory (
NVMM) have emerged to help address the rising demand for larger, cost-effective memory capacities. In this article, far memory refers to all alternative memory technologies that provide higher capacity while maintaining the standard direct load/store access semantics used by applications. This allows memory-hungry workloads to easily extend the size of their working sets.
While providing high capacity at low cost, far memory also introduces latencies that are significantly longer and more variable than local DRAM. In a system with heterogeneous far memory devices (Figure
1), the latencies may range from 200 ns to over 5 μs. Such latencies bring considerable challenges to performance optimization, as applications and modern processors have been heavily optimized under the assumption of DRAM-level latency.
Traditional techniques for tolerating latency include caching, bulk data transfer, and memory access overlapping. Due to the poor temporal and spatial locality of many big data workloads, the first two methods are usually inadequate. On the other hand, big data programs often exhibit a large number of independent memory operations. Therefore, overlapping more memory requests to achieve a high degree of memory-level parallelism (MLP) is the primary way to effectively hide the latency of far memory.
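To make the contrast concrete, the sketch below (illustrative C++, not taken from the article) compares a pointer-chasing loop, where each access depends on the previous one, with a hash-probe-style loop whose accesses are independent and can therefore overlap:

```cpp
// Illustrative C++ (not from the article): dependent vs. independent accesses.
#include <cstdint>
#include <cstddef>

struct Node { uint64_t key; Node* next; };

// Pointer chasing: each load address depends on the previous load,
// so at most one far-memory access can be in flight.
uint64_t sum_list(const Node* n) {
  uint64_t s = 0;
  for (; n != nullptr; n = n->next) s += n->key;
  return s;
}

// Independent probes: every table[idx[i]] address is known up front,
// so the accesses can, in principle, all overlap (high MLP).
uint64_t sum_probes(const uint64_t* table, const uint32_t* idx, size_t n) {
  uint64_t s = 0;
  for (size_t i = 0; i < n; ++i) s += table[idx[i]];
  return s;
}
```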
Some domain-specific accelerators and many-core processors are capable of exploiting high MLP by providing a large number of lightweight hardware threads. However, these solutions are typically designed for specific applications and are unsuitable for workloads with complex control flow. Meanwhile, general-purpose out-of-order (OoO) processors remain the mainstream in data centers due to their balanced cost and performance. With far memory adoption on the rise, the MLP achievable by OoO processors becomes increasingly important for overall performance.
However, the MLP that modern OoO processors can achieve is limited by their restricted instruction window, which is insufficient to hide the latency of far memory. OoO cores use complex hardware structures and logic such as the reorder buffer (ROB), load/store queue (LSQ), and miss status holding registers (MSHRs) to extract potential MLP and track outstanding memory requests. Hence, instruction windows are highly constrained by hardware. Meanwhile, memory operations that miss in the cache can hold these hardware resources for a long time, easily leading to resource exhaustion and pipeline stalls.
The even longer latencies of far memory exacerbate this issue. Memory-bound workloads operating on far memory experience significant performance degradation compared to utilizing local DRAM. As Figure
2 shows, typical memory-bound workloads suffer 3–4× slowdowns when the far memory latency increases to 1 μs, which corresponds to state-of-the-art network latency [
1].
While increasing hardware resources such as the ROB and MSHRs can improve MLP, the scalability of these resources is limited by their complex hardware logic [
9]. Over the past few decades, extensive efforts have been made to overcome these scalability issues [
11,
44,
51,
61,
64]. However, the instruction windows of current state-of-the-art commercial processors are still limited to a few hundred entries, while achieving sufficient MLP to hide far memory latency requires several times that scale. Thus, focusing solely on increasing hardware resources is inadequate.
The key factor contributing to this problem is that the synchronous semantics of load/store instructions cause them to occupy critical hardware resources for a longer time when facing long latencies. This prevents other independent memory operations in programs from being issued, restricting the MLP that can be achieved.
Addressing this challenge requires asynchronous memory access techniques that separate memory request issuing from memory response. Instructions that invoke an asynchronous memory request can be retired right after issuing the request, leading to the immediate release of associated hardware resources rather than occupying resources for a full access period, as with traditional synchronous load/store instructions.
Prefetching and external memory access engines are two typical asynchronous mechanisms in modern processors. However, existing asynchronous memory access techniques still have limitations for far memory scenarios. While prefetching enables asynchronous memory request invoking, it lacks support for tracking request completion and managing returned data. This limits its effectiveness for complex scenarios with highly varying far memory latencies. Meanwhile, offloading to external memory engines incurs high startup and notification overhead, hindering its applicability to the fine-grained and irregular memory accesses common in big data applications such as graph computing and in-memory databases. In short, OoO cores still lack an asynchronous mechanism that can achieve high MLP in the face of long and variable far memory latencies.
This article proposes a set of novel Asynchronous Memory Access Instructions (AMI) and an Asynchronous Memory Access Unit (AMU) inside a modern OoO CPU core. The AMI provides mechanisms for initiating asynchronous memory accesses and notifying completion, aiming to reduce critical resource occupancy during accesses and achieve full software management flexibility. The AMU provides efficient support for the AMI as well as management for massive outstanding requests in a cost-efficient manner.
The AMU achieves low hardware cost and efficient execution of AMI through tight integration within the processor core. It dynamically reserves a portion of the L2 cache as Scratchpad Memory (SPM). The SPM serves two critical functions. First, it provides program-managed data storage to address register pressure issues. Second, the SPM offers sufficient state storage to track outstanding requests without costly content-addressable memory (CAM)-based queues.
AMI introduces aload and astore instructions for invoking asynchronous memory access requests and getfin for polling asynchronous responses. None of the AMI instructions holds general-purpose registers for long, avoiding pipeline stalls. AMIs only move data between the SPM and far memory, leaving the normal synchronous load and store instructions to move data between registers and the SPM.
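As a rough illustration of this division of labor, the sketch below wraps the instructions in hypothetical C-level intrinsics (ami_aload, ami_getfin) and assumes a memory-mapped SPM view and an ID bound; the article defines the instructions themselves, not this C interface:

```cpp
// Hypothetical C-level wrappers around the AMI; the instruction names come
// from the article, but this ABI, the ID bound, and the memory-mapped SPM
// view are assumptions for illustration only.
#include <cstdint>

extern "C" {
  int32_t ami_aload(const void* far_addr, uint32_t spm_off, uint32_t bytes); // returns a request ID
  int32_t ami_getfin();                                                      // finished ID, or -1 if none
}

constexpr int kMaxIds = 1024;            // assumed bound on hardware request IDs

// Gather n 8-byte elements from far memory (n assumed to fit in the SPM):
// issue all requests up front, then consume completions in arrival order.
// No general-purpose register is tied up while a request is in flight.
void gather(const uint64_t* far_array, const uint32_t* idx, uint64_t* out,
            int n, volatile uint64_t* spm /* assumed SPM mapping */) {
  int slot_of_id[kMaxIds];
  for (int i = 0; i < n; ++i) {
    int32_t id = ami_aload(&far_array[idx[i]], i * 8u, 8u);  // far memory -> SPM slot i
    slot_of_id[id] = i;
  }
  for (int done = 0; done < n; ) {
    int32_t id = ami_getfin();           // poll; never blocks the pipeline
    if (id < 0) continue;                // nothing finished yet; real code would yield
    int i = slot_of_id[id];
    out[i] = spm[i];                     // ordinary load: SPM -> register, cache-hit latency
    ++done;
  }
}
```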
While using L2-SPM instead of various registers enables easy scaling, it introduces three new performance issues that must be solved. First, metadata access latency increases from register level to L2 level. Second, when a pipeline roll-back occurs due to failed speculation, state consistency in L2-SPM must also be maintained. Third, memory disambiguation must be supported for massive outstanding memory operations without CAM-based hardware. In this article, we present corresponding solutions: metadata caching, AMU speculation, and software-based disambiguation.
To address the programming complexity introduced by AMI, we also develop a coroutine-based framework implemented in C++. This framework abstracts away the low-level details of instruction scheduling and SPM management, providing a simplified programming model.
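The following minimal C++20 sketch shows the kind of coroutine pattern such a framework can build on. All names here (FarTask, FarRead, run_all) are hypothetical, and the completion backend is simulated in software so the example is self-contained; the real framework would issue aload in await_suspend and resume handles as getfin reports finished request IDs:

```cpp
// Minimal, self-contained sketch (C++20) of the coroutine pattern; names and
// the simulated completion backend are assumptions, not the article's API.
#include <coroutine>
#include <cstdint>
#include <cstdio>
#include <deque>
#include <vector>

struct Pending { std::coroutine_handle<> h; const uint64_t* addr; uint64_t* dst; };
static std::deque<Pending> pending;         // stands in for the IDs tracked via getfin

struct FarTask {
  struct promise_type {
    FarTask get_return_object() { return {}; }
    std::suspend_never initial_suspend() { return {}; }
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
};

struct FarRead {                            // awaitable: one asynchronous far-memory read
  const uint64_t* addr; uint64_t value = 0;
  bool await_ready() { return false; }
  void await_suspend(std::coroutine_handle<> h) {
    pending.push_back({h, addr, &value});   // real version: ami_aload(...) here
  }
  uint64_t await_resume() { return value; }
};

static FarTask lookup(const uint64_t* table, uint32_t key, uint64_t* out) {
  *out = co_await FarRead{&table[key]};     // suspend instead of stalling the core
}

static void run_all() {                     // real version: poll getfin in this loop
  while (!pending.empty()) {
    Pending p = pending.front(); pending.pop_front();
    *p.dst = *p.addr;                       // simulated completion (direct read)
    p.h.resume();
  }
}

int main() {
  std::vector<uint64_t> table(1024);
  for (size_t i = 0; i < table.size(); ++i) table[i] = i * 7;
  uint64_t r0 = 0, r1 = 0;
  lookup(table.data(), 3, &r0);             // many lookups can be launched back to back
  lookup(table.data(), 42, &r1);
  run_all();
  std::printf("%llu %llu\n", (unsigned long long)r0, (unsigned long long)r1);
}
```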
Overall, we present a hardware-software cooperative solution to achieve high MLP for OoO cores, allowing for efficient utilization of abundant far memory resources. Our contributions can be summarized as follows:
A set of novel AMI that models memory requests and responses as separate instructions. This reduces critical resource occupation and exposes more opportunities for software to exploit MLP.
A novel AMU architecture that efficiently executes AMI. AMU repurposes cache as metadata storage and program-managed data storage to support massive outstanding AMI requests.
An optimized micro-architecture design for AMU, including metadata caching and an AMU speculation mechanism.
An efficient and software-only memory disambiguation mechanism to substitute the non-scalable CAM hardware.
A coroutine-based C++ framework to address programming complexity from AMI. The framework encapsulates low-level scheduling and SPM management.
A cycle-accurate model of AMU is built on the gem5 simulator. The evaluation results show that for memory-bound benchmarks, AMU achieves an average speedup of 2.42× when the additional latency introduced by far memory is 1 μs. For the random access benchmark from HPCC, our technique offers a speedup of 26.86× when the far memory latency is 5 μs, with the average number of in-flight memory requests of a single core exceeding 130.
2 Background and Motivation
2.1 Far Memory and MLP
Far memory refers to a variety of memory technologies that offer higher capacity compared with DRAM, including device memory/remote memory based on
Non-Volatile Main Memory (
NVMM) [
46], cache-coherent interconnect-based memory/devices (e.g., CXL [
60]), and Disaggregated Memory. While offering much larger memory capacities, far memory presents challenges because of its long and highly variable latency.
Non-volatile Main Memory (NVMM) provides higher storage density and persistence by using new materials [
6,
35,
58]. When used as main memory, NVMM has a higher access latency than conventional DRAM. One example is the Intel Optane DC Persistent Memory Module, which exhibits an access latency that ranges from 200 to 300 nanoseconds [
46]. Another example is ultra-low latency flash (ULL-Flash) based devices [
33], which achieve a typical read latency of 3 μs and a write latency of 100 μs. To date, NVMM has seen limited commercial adoption, although various research efforts are still underway.
Cache-Coherent Interconnect Based Memory/Devices enable CPUs to directly access memory and storage devices without the overhead of extra software mechanisms like page swapping. Typical emerging coherent interconnects include
Compute Express Link (
CXL) [
60], Open Coherent Accelerator Processor Interface (OpenCAPI) [
3] and Gen-Z [
4]. The latency of direct-attached CXL-enabled memory is approximately 200 ns [
60], which is still roughly three times that of conventional local DRAM. For memory-semantic SSDs, the access latency can be 35–738% higher than local DRAM, depending on SSD contention [
34]. CXL-enabled memory extensions have received growing attention in recent years. Even so, local expansion within a node still has capacity limitations.
Disaggregated Memory allows programs to directly leverage memory resources on remote nodes over high-speed interconnects. Both academia [
40] and industry [
5] have developed remote memory prototypes and products based on emerging interconnects. For upcoming CXL-enabled disaggregated memory systems, the latency depends on propagation delays, including the latency of switches and ports, and can exceed 300 ns [
40]. For network-attached remote memory across multiple switches, the latency can reach 1–2 μs even with state-of-the-art networks [
1]. Furthermore, the newly released CXL 3.0 specification introduces support for
Global Fabric Attached Memory (
GFAM), enabling the construction of memory pools that span multiple nodes.
Beyond the differences among far memory technologies, latency variability also arises from other factors. Due to queueing delays and contention, accesses to the same level of the memory hierarchy can exhibit noticeable latency variations [
19]. The potentially complex hierarchical structures inside far memory devices can also cause significant differences in memory access latencies.
Long and highly variable latencies further widen the gap between processor and memory, exacerbating the traditional memory wall problem. Therefore, applications that want to benefit from the large capacity of far memory must first find ways to tolerate its latency.
Traditional techniques for tolerating latency include caching, which reduces the average access latency by keeping hot data locally, and bulk data transfers, which move large contiguous blocks to minimize transaction count. However, both methods require memory locality. Big data applications often have extremely large working sets that are not cache-friendly. Some of them, such as graph computing and in-memory databases, exhibit fine-grained random access patterns with poor temporal and spatial locality. Consequently, adjacent memory locations may not be related, making direct large-block data transfers inefficient [
7].
Fortunately, these workloads often comprise numerous memory operations with weak dependencies, which provides potential opportunities for MLP extraction. However, in order to fully exploit the potential MLP, further mechanisms are required.
2.2 The Limitation of Out-of-Order CPU for MLP
The traditional approach to extracting MLP from workloads relies on aggressive OoO execution of CPU cores. By providing a wider instruction window and leveraging aggressive OoO execution, cores can issue more in-flight load/store instructions for higher MLP.
The limitation of OoO execution is that it relies on complex hardware logic to track in-flight instructions and memory requests. This results in poor scalability. For instance, structures like the LSQ and MSHRs are typically built using CAM, and increasing CAM capacity raises power consumption and latency rapidly. Although there has been a significant amount of work [
11,
44,
51,
61,
64] addressing these scalability issues in the past, simply scaling hardware is not a cost-effective solution. The MLP achievable with OoO execution is still constrained by certain critical hardware resources.
To overcome these constraints, previous research explores some approaches to utilize existing resources better to improve OoO aggressiveness, such as Runahead Execution [
27,
50,
51,
52,
53] and CLEAR [
31]. Runahead Execution allows processors to enter a “Runahead” mode when the ROB is exhausted, speculatively executing subsequent instructions to prefetch future memory accesses. CLEAR speculatively retires long-latency loads early to free up resources for other instructions. However, such techniques address only shortages of individual resources like the ROB and cannot handle the exhaustion of other resources such as MSHRs. As a result, they introduce more complex hardware logic without fundamentally solving the scalability issue for far memory.
The MLP demand of far memory has significantly exceeded the potential of contemporary OoO CPUs. Assuming a far memory access latency of 1 μs and a clock frequency of 2 GHz, a single far memory access equates to 2,000 cycles. As memory operations comprise roughly 30% of instructions [
43], avoiding all stalls would require thousands of ROB entries and hundreds of MSHRs, which is impractical with current technology: state-of-the-art processor cores have only a few hundred ROB entries and a few tens of MSHRs.
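A back-of-the-envelope version of this estimate, assuming the core would otherwise sustain an IPC of 1 to 2:

```latex
\[
\begin{aligned}
\text{miss latency} &= 1\,\mu\text{s} \times 2\,\text{GHz} = 2{,}000 \text{ cycles},\\
\text{required window} &\approx \text{IPC} \times 2{,}000 \approx 2{,}000\text{--}4{,}000 \text{ ROB entries},\\
\text{memory operations in window} &\approx 0.3 \times \text{window} \approx 600\text{--}1{,}200.
\end{aligned}
\]
```

Even if only a fraction of those memory operations miss all the way to far memory, hundreds of outstanding misses must be tracked concurrently, far beyond the tens of MSHRs available.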
The key reason here is, in fact, the synchronous semantics of traditional load/store instructions. A synchronous operation occupies critical pipeline resources until it finishes. Due to the significantly longer latencies of far memory, processor resources are exhausted more rapidly, leading to more frequent pipeline stalls and preventing further memory operations from being dispatched.
2.3 Asynchronous Memory Access Techniques
To address the resource occupation issue, asynchronous memory accesses that separate memory request issuing from memory response processing are needed. Asynchronous memory accesses allow the corresponding instructions to be committed once the memory request is issued, instead of blocking for the response. Thus, resources like ROB entries and physical registers can be freed earlier rather than being occupied for the full latency duration. Additionally, asynchronous memory access enables software to become further involved, exposing more opportunities for MLP extraction.
Prefetching is a typical and widely used asynchronous memory access technique [
13,
14,
16,
20,
30,
32,
63,
71], which can be considered a form of asynchronous load. Prefetch instructions do not hold ROB resources after the load request is issued, but prefetch requests still occupy MSHRs until the data are returned and placed in the cache on a best-effort basis. As an asynchronous mechanism, prefetching commonly faces timeliness issues [
37]. The prefetched data can either be evicted from caches before use (i.e., early prefetches) or arrive after the demand loads have occurred (i.e., late prefetches). Thus, the benefits of prefetching are limited without sophisticated optimization.
Figure
3 demonstrates this weakness, comparing the performance of a GUPS benchmark using GP [
16]-based prefetching to a baseline. GUPS involves random memory updates. The GP-based GUPS variant prefetches all addresses to be updated in a group before executing each update. As shown in Figure
3, the effectiveness of GP heavily depends on the group size. Different group sizes can cause the GP-based GUPS to outperform or underperform the baseline on the same hardware configuration and latency conditions. Moreover, the group size yielding the best performance varies greatly with hardware configurations. This example shows that prefetching would struggle to adapt to unpredictable latencies.
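For reference, a GP-style GUPS update loop looks roughly like the sketch below (illustrative C++ using the GCC/Clang __builtin_prefetch builtin and a placeholder index generator); GROUP is exactly the knob whose best value shifts with hardware and latency:

```cpp
// Illustrative C++ sketch of group prefetching (GP) for a GUPS-style loop.
// __builtin_prefetch is the standard GCC/Clang builtin; the index generator
// is a placeholder LCG, and `updates` is assumed to be a multiple of GROUP.
#include <cstdint>
#include <cstddef>
#include <vector>

void gups_gp(uint64_t* table, size_t table_size, size_t updates, int GROUP) {
  uint64_t key = 1;
  std::vector<size_t> idx(GROUP);
  for (size_t u = 0; u < updates; u += GROUP) {
    // Phase 1: generate the group's indices and prefetch the target lines.
    for (int g = 0; g < GROUP; ++g) {
      key = key * 6364136223846793005ULL + 1442695040888963407ULL;
      idx[g] = key % table_size;
      __builtin_prefetch(&table[idx[g]], /*rw=*/1);
    }
    // Phase 2: perform the updates, hoping every line has arrived by now.
    // Too small a GROUP issues late prefetches; too large evicts early ones.
    for (int g = 0; g < GROUP; ++g)
      table[idx[g]] ^= key;
  }
}
```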
The limitation of prefetching is caused mainly by a lack of support for tracking responses and managing returned data. Software cannot obtain information about request completion, making it difficult to ensure that the prefetched data is available when needed. As a result, prefetching is not well-suited for scenarios with far memories that have highly variable latencies.
External memory engines also provide asynchronous memory access via offloading expensive memory operations to external engines for asynchronous execution. With an I/O-like interface, it provides response notification and explicit management of data storage. Typical techniques include Intel I/O Acceleration Technology (I/OAT), and Intel Data Streaming Accelerator.
However, these engines face high startup overhead. They generally require setting up and enqueuing descriptors for the engine to initiate memory requests. Although advanced engines allow directly converting writes into remote memory requests (e.g., Cray FMA), this still takes several tens of cycles because the engines are attached to the NoC or I/O bus. Therefore, this approach is typically only suitable for applications that transfer large, contiguous blocks of data, not for applications with fine-grained, random accesses.
To reduce overhead, some works [
49,
54] have proposed aggregating many small messages issued by user-level threads into a larger packet to lower the average overhead. However, software-based aggregation increases the latency of individual memory operations, which requires applications to have inherently high MLP to fully hide the increased latencies.
Additionally, some works [
17,
55] have proposed programmable memory engines. These engines are also located on the NoC, allowing complex memory access patterns through compiler or manual programming. However, these engines typically adopt a produce/consume model for interaction with the host core. The host core will send a non-speculative notification signal to the memory engine after consuming the data. These synchronous operations eliminate the advantages of OoO cores with aggressive speculative execution. Consequently, such works are more suitable for processors with simple cores such as in-order cores, VLIW, or many-core processors with many small cores rather than high-performance OoO designs.
Integrating external memory access engines inside the core is not straightforward. The engines would need to be redesigned to support the cancellation of requests for handling mis-speculation. Relying on the existing accelerator interfaces like
Rocket Chip Coprocessor (
RoCC) [
10] is also not feasible, because such interfaces lack speculative execution support, which makes them unsuitable for implementing tightly coupled coprocessors [
56]. Therefore, directly integrating external memory engines into an OoO pipeline presents difficulties.
In summary, existing asynchronous techniques still have limitations in far memory scenarios. Prefetching lacks a notification mechanism, making it difficult to apply to far memories with highly variable latencies. Offloading to external memory access engines incurs high startup overhead, making it challenging to apply to big data applications involving word-sized random memory access patterns, such as graph computing workloads.
2.4 Motivation
In this article, we aim to design a built-in asynchronous memory access mechanism for an OoO core that meets the MLP demand of far memory without sacrificing the performance of local operations. Based on the previous discussion, an asynchronous memory access technique targeting far memory scenarios should support asynchronous request issuing, response notification, and explicit storage management.
Our first design choice is where to put the storage. To meet the typical demands of far memory, up to several hundred simultaneous memory operations should be supported per OoO core. The data and metadata for these operations require tens of KBs of space, which exceeds the capacity of register files and even the L1 cache. We therefore choose to dynamically partition a portion of the L2 cache as SPM for massive far memory operations. While L2 partitioning has some impact, it is acceptable compared with the performance gains of AMU. The dynamically software-adjustable SPM size avoids affecting non-AMU applications. Additionally, big data applications typically underutilize L2 caches [
39,
72]. Occupying part of the L2 cache as SPM thus has a limited negative impact on performance.
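A back-of-the-envelope estimate of the space requirement mentioned above, with assumed per-request costs of one 64 B cache line of data plus roughly 16 B of metadata:

```latex
\[
512 \ \text{outstanding requests} \times (64\,\text{B data} + 16\,\text{B metadata}) \approx 40\,\text{KB}.
\]
```

This is far beyond a register file and comparable to an entire L1 data cache, which is why the L2 is the nearest level with capacity to spare (the evaluation in Section 6 fixes the SPM at 64 KB).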
Secondly, we propose a set of new asynchronous instructions for accessing far memory, covering both request initiation and response notification. The key design rule is that an asynchronous instruction should not hold any registers for long. Once a memory request is sent out, the initiation instruction should be retired at once. The later response-check instruction should not block and wait either. Between the two instructions, the data and metadata for a pending memory operation are stored in the SPM rather than in hardware registers. Just as register allocation is handled by modern compilers, we do not add hardware instructions for SPM data allocation and leave it to software. The new asynchronous instructions are only in charge of moving data between the SPM and far memory; moving data between GPRs and the SPM is done by normal synchronous load/store instructions. These operations then have short, fixed latency since no cache misses can occur. Asynchronous and synchronous instructions work cooperatively to achieve high MLP.
Thirdly, we propose adding a special function unit for the new asynchronous instructions. The new asynchronous instructions can share the same fetching, decoding, and dispatching stage as normal instructions in an OoO pipeline. However, since asynchronous instructions have rather different requirements for resource scheduling and interact heavily with SPM, we propose to add a new function unit for their execution phase. To some extent, it is like a vector unit for vector instructions and vector registers.
Fourth, we choose to handle in-thread data consistency through software instead of non-scalable hardware. There are two possible sources of data conflicts and inconsistencies within an instruction sequence. One is between asynchronous memory accesses and traditional load/store instructions; the other is conflicts among asynchronous instructions themselves. For consistency between load/store and asynchronous instructions, we choose dynamic partitioning to avoid hardware complexity. With partitioning, we can assume that asynchronous instructions and load/store instructions do not access the same memory region simultaneously. Cache flush operations are required when a region transitions between the two modes.
Consistency among asynchronous instructions themselves amounts to memory disambiguation for these new memory operations. Handling disambiguation in hardware would be too costly given the large number of outstanding asynchronous memory requests. On the other hand, big data programs often employ data parallelism or have large datasets, so the probability of data conflicts is low. Therefore, we perform conflict detection in software when necessary. For efficiency, the data structures for disambiguation can be held flexibly in either the SPM or the local cache.
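One possible realization (illustrative only; the article's exact scheme may differ) is a small table of in-flight cache-line addresses that is checked before issuing a potentially conflicting asynchronous access and cleared when the matching completion is observed:

```cpp
// Illustrative C++ sketch of software disambiguation: a small direct-mapped
// table of in-flight cache-line addresses. The table (which could live in the
// SPM or the local cache) is checked before issuing an asynchronous access
// and cleared when the matching completion is observed. Size and collision
// policy are assumptions, not the article's exact scheme.
#include <cstdint>

struct InflightTable {
  static constexpr uint32_t kSlots = 512;        // assumed; power of two
  uintptr_t line[kSlots] = {};                   // 0 means "slot empty"

  // Returns false if a potentially conflicting request is still in flight,
  // in which case the caller should yield (e.g., switch coroutines) and retry.
  bool try_acquire(uintptr_t addr) {
    uintptr_t l = addr >> 6;                     // cache-line granularity
    uint32_t h = static_cast<uint32_t>(l) & (kSlots - 1);
    if (line[h] != 0) return false;              // occupied: treat as a conflict
    line[h] = l;
    return true;
  }

  // Called when getfin reports that the request touching `addr` has finished.
  void release(uintptr_t addr) {
    uintptr_t l = addr >> 6;
    line[static_cast<uint32_t>(l) & (kSlots - 1)] = 0;
  }
};
```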
We name the new asynchronous instructions and the supporting functional unit as AMI and AMU. Together with AMI and AMU, software can get full explicit control over massive memory requests to exploit enough MLP for far memory accesses.
4 Micro-architecture Design
4.1 Implementation of the Basic Functionalities
Figure
5 presents the structure of ALSU as well as an execution example. The
aload/
astore instructions are first decoded into two micro-ops (details in Section
4.2). The first micro-op is used for ID allocation, while the second generates the actual asynchronous memory request. The micro-ops are executed in the ALSU. For ID allocation, the execution unit first checks for available IDs. If a free ID exists, it performs allocation locally. Otherwise, it requests free IDs from the ASMC. After ID allocation, the micro-op for issuing the asynchronous request is executed. The ALSU constructs the asynchronous memory request and passes it to the ASMC. This request is handled similarly to a store request to the cache: it is buffered in the store buffer until the instruction commits. At this point, the instruction has completed execution and waits in the ROB for retirement.
The ALSU concentrates on the instruction execution and the mis-speculation handling (details in Section
4.3). Therefore, the ASMC does not need to consider the intricacies introduced by speculative execution. Communication overhead between the ALSU and ASMC is also reduced through batching of ID transfers (details in Section
4.2). This lowers the overall design complexity by dividing different concerns between the two units.
Figure
6 illustrates the design of ASMC. Several modifications are made to the cache controller. First, the control logic is added to repurpose a portion of the cache area as SPM. This implementation is straightforward and is supported by several commercial processors. Second, several new memory commands are defined to support ID-related requests and asynchronous memory access requests. These extensions to the cache controller allow it to serve its original caching functionality while also managing metadata and coordinating memory operations for asynchronous memory access requests. Third, the protocol between L1 cache and L2 cache is expanded to support the newly defined commands. Supporting these new commands is straightforward as they do not interfere with maintaining cache consistency.
The ASMC utilizes three key data structures stored in the SPM metadata area to support the new memory commands: a finished list, a free list, and an Asynchronous Memory Access Request Table (AMART). For each asynchronous memory request, the ASMC indexes the AMART using a request ID to access the corresponding entry. Each entry contains metadata like the SPM address, memory address, request status, and other implementation-specific information. The request is then converted to a standard memory request sent to the MC. Upon response from memory, the ASMC re-indexes the AMART using the request ID to update the status. Once an asynchronous request is completed, its ID is written to the finished list by the ASMC. To reduce SPM accesses and improve performance, the ASMC caches a subset of IDs from the finished and free lists in on-chip registers.
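The sketch below gives one possible layout for this metadata (illustrative C++; field widths and list capacities are assumptions, as the article only specifies that each AMART entry holds the SPM address, the memory address, and the request status):

```cpp
// Illustrative C++ layout of the SPM-resident metadata described above.
// Field widths, the ring-buffer representation, and the table capacity are
// assumptions; the article specifies only that each AMART entry records the
// SPM address, the far-memory address, and the request status.
#include <cstdint>

enum class ReqStatus : uint8_t { Free, Issued, Completed };

struct AmartEntry {                  // indexed by the 16-bit request ID
  uint64_t  mem_addr;                // far-memory address of the request
  uint32_t  spm_addr;                // source/destination offset inside the SPM
  ReqStatus status;                  // lifecycle of the asynchronous request
  uint8_t   is_store;                // aload vs. astore (assumed encoding)
  uint16_t  remaining_subreqs;       // cache-line sub-requests still pending (assumed)
};

struct IdList {                      // ring buffer of 16-bit request IDs
  uint16_t head = 0, tail = 0;
  uint16_t ids[1024];                // capacity is an assumption
};

struct SpmMetadata {
  IdList     free_list;              // IDs available for allocation
  IdList     finished_list;          // IDs whose requests have completed
  AmartEntry amart[1024];            // Asynchronous Memory Access Request Table
};
```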
The ASMC enables the transfer of large contiguous data blocks by dividing large requests into cache-line-sized sub-requests. It relies on a dedicated state machine to split large-granularity memory requests, and the states of these requests are also tracked in the metadata region of the SPM.
4.2 Metadata Batching
To reduce overhead from frequent ALSU-ASMC communication, we develop a metadata batching approach.
The key idea involves using vector registers as buffers to aggregate metadata accesses. The major interaction between ALSU and ASMC involves pulling and pushing request IDs from/to the finished and free lists managed by the ASMC. “List vector registers” are thus introduced, holding portions of the IDs in these lists. As shown in Figure
5, each register contains a pointer to the next unused ID entry as well as multiple stored 16-bit IDs. The width of the list vector register matches the width of the physical vector registers (512 bits in this article). Thus, the ALSU can request a batch of IDs from the ASMC to store in the list vector register, or write back all the IDs in the register with a single request to the ASMC.
To minimize software complexity, the list vector registers are not directly accessible to software. Instead, they are accessed through internally generated micro-ops. Each asynchronous memory access instruction is decoded into two micro-ops: one performs conditional fetching/writing of IDs between the vector registers and the ASMC; the other handles the actual functionality of the instruction. Figure
5 shows a detailed example. The ID allocation micro-op first checks the list vector register for an available ID. If there are no IDs, it issues a request to the ASMC to fetch a batch of free IDs. Upon receiving the IDs from the ASMC, the allocation continues.
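As a behavioural sketch (not RTL), a 512-bit list vector register and the ID-allocation micro-op can be modelled as follows; with one 16-bit pointer field, 31 buffered 16-bit IDs fit in 512 bits (16 + 31 × 16 = 512), matching the 31-ID capacity mentioned in Section 4.3:

```cpp
// Behavioural model (not RTL) of a 512-bit list vector register and the
// ID-allocation micro-op. One 16-bit field serves as the "next unused entry"
// pointer, leaving room for 31 buffered 16-bit IDs (16 + 31 * 16 = 512 bits).
// The ASMC interaction is abstracted as a callback.
#include <cstdint>
#include <functional>

struct ListVectorReg {
  uint16_t count = 0;                // number of valid IDs currently buffered
  uint16_t ids[31] = {};             // buffered 16-bit request IDs
};

// fetch_batch stands in for the single batched free-ID request to the ASMC.
uint16_t alloc_id(ListVectorReg& r,
                  const std::function<void(uint16_t*, int)>& fetch_batch) {
  if (r.count == 0) {
    fetch_batch(r.ids, 31);          // one ASMC round trip refills all 31 slots
    r.count = 31;
  }
  return r.ids[--r.count];           // common case: purely local, register speed
}
```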
Figure
5 shows the detailed architecture for executing the micro-ops within the ALSU. The process contains two stages: execution and commit.
Execution The micro-ops are sent to the LSQ and dispatched to the corresponding execution unit. Two additional execution units are added to support asynchronous memory access. One unit handles aload/astore requests sent to the ASMC, while the other handles ID management micro-operations. The aload/astore request is similar to a conventional memory write request with command, data, and address, but with different semantics. The command field specifies aload/astore, the data field carries the request identifier and SPM data address encoded in the original instruction, and the address field contains the memory address also from the instruction.
Commit The asynchronous memory request is sent to the ASMC when the instruction reaches the head of the ROB, resembling atomic instructions or non-speculative instructions. However, ID management micro-ops can be speculatively executed for higher performance. Their requests may be forwarded to the ASMC prior to the older asynchronous memory access micro-ops.
4.3 Speculative Execution Support
The primary challenge of supporting speculative execution for AMU is managing the effects of micro-ops that modify metadata, such as the finished/free lists in the SPM. For example, if a micro-op is speculatively executed but later squashed, the speculatively fetched IDs could be lost prematurely. There are three cases in which AMI-decoded micro-ops support out-of-order execution:
The first case is micro-ops responsible for initiating
aload/astore requests (e.g., the
ALoadExec micro-op shown in Figure
5). These micro-ops read register data to construct asynchronous memory requests passed to the ASMC. Because asynchronous memory requests can be regarded as a special type of store request, they are handled similarly, buffered in the store buffer until the instruction commits, just like the
store instruction.
The second case involves ID management micro-ops like retrieving completed IDs or allocating free IDs. When the list vector register still contains IDs, these micro-ops simply move an ID into a general-purpose register without issuing a batch ID fetch request to the ASMC. This allows them to behave like regular register-to-register instructions, which can thus be handled via the traditional register renaming mechanism.
The third case involves the ID management micro-ops when the list vector register is empty. In this case, these micro-ops issue a batch ID fetch request to the ASMC, changing the state of the queues in the SPM. To address this problem, the ALSU uses an “uncommitted ID register” to isolate ID updates from the ASMC. From the perspective of the ASMC, the IDs taken by the ALSU can be safely removed from the queue in the SPM without considering rollback. As shown in Figure
5, the uncommitted ID register keeps the value of the list vector register corresponding to the micro-op that issued the batch ID fetch request. The value in the uncommitted ID register can be regarded as a checkpoint for this request. When a mis-prediction occurs, the register continues holding the IDs retrieved by the canceled instruction. Subsequent micro-ops that issue batch ID fetch requests obtain IDs from the uncommitted ID register rather than the ASMC, effectively restoring the previously obtained IDs to the list vector register. Therefore, IDs fetched from the ASMC are not lost on mispredictions. The uncommitted ID register is cleared only on instruction commit, so it can store the result of only one batch ID fetch request, requiring a second ID fetch to stall until the previous one commits. However, this stall is infrequent, as the list vector register can hold 31 IDs before needing a refill.
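The behavioural sketch below (illustrative C++, not RTL) summarizes the role of the uncommitted ID register: a checkpoint that keeps a speculatively fetched batch of IDs from being lost when the fetching micro-op is squashed:

```cpp
// Behavioural sketch (illustrative C++, not RTL) of the uncommitted ID
// register: a checkpoint holding the one batch of IDs fetched speculatively
// from the ASMC, so a squash cannot lose them. A second batch fetch while the
// checkpoint is still occupied would stall in hardware; that path is elided.
#include <cstdint>
#include <functional>
#include <optional>

struct IdBatch { uint16_t ids[31]; };

struct UncommittedIdReg {
  std::optional<IdBatch> checkpoint;   // at most one in-flight speculative batch

  // Batch-fetch micro-op: replay from the checkpoint if a squashed fetch left
  // IDs behind; otherwise really ask the ASMC (and remember the result).
  IdBatch fetch(const std::function<IdBatch()>& fetch_from_asmc) {
    if (!checkpoint) checkpoint = fetch_from_asmc();
    return *checkpoint;
  }

  void on_commit() { checkpoint.reset(); }  // retired: the IDs are now architectural
  void on_squash() { /* keep the checkpoint: fetched IDs must not be lost */ }
};
```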
6 Evaluation
6.1 Evaluation Methodology
The proposed AMU is implemented by modifying the RISC-V OoO CPU model within the GEM5 simulator [
47]. The simulated system architecture, as depicted in Figure
7, includes a remote memory node connected to the CPU via CXL as a far memory. Since this work is only concerned with memory access latency, CXL is modeled using the serial link model of gem5. It models the packet delay, which is dependent on the size of the packet, as well as the bandwidth limitation of the interconnect. Internal details of CXL such as coherence protocols are not simulated. We integrate McPAT [
42] to estimate power consumption. Additionally, an RTL model was developed in Chisel HDL to evaluate the logic resource overhead.
Evaluation is conducted under four configurations. The
Baseline configuration, mimicking an Intel Golden Cove processor, is shown in Table
2. Moreover, we set up a configuration called
CXL Ideal (with BOP) which is an “ideal” configuration with an L2 best-offset hardware prefetcher [
48]. In this configuration, the maximum number of in-flight memory requests is significantly increased by setting the number of MSHR entries to 256 at each cache level. This configuration provides a useful comparison target, approximating prior works that boosted hardware resources or pure hardware designs aimed at increasing OoO execution potential. Although impractical with current technology, it establishes an upper bound on the performance attainable through conventional means. The
AMU configuration refers to the proposed AMU architecture. In the evaluations, the size of the SPM is fixed at 64 KB. The
AMU (DMA-mode) is a limited
AMU configuration, simulating the performance of external memory access engines. This configuration limits the number of IDs that the list vector registers can buffer to 1 and forbids the micro-ops from being executed speculatively.
Due to simulation time constraints, some simplifications are made; they do not change the general conclusions of this article. First, cache and dataset sizes are decreased to ensure a reasonable data loading time. Despite the reduced cache capacity, the setup still reflects the fact that applications benefiting from far memory tend to have working sets far too large for the cache to cover. Second, the evaluation uses single-core configurations, focusing on the ability of AMU to sustain a high number of outstanding memory requests.
6.2 Benchmarks
Multiple memory-bound benchmarks from various suites were chosen. The selected benchmarks include random access (GUPS), STREAM,
binary search (
BS),
hash join (
HJ) [
15],
hash tables (
HT) [
18],
link list (
LL) [
28] and
skip lists (
SL) [
18], BFS,
Integer Sort (
IS), Redis and HPCG.
The benchmarks were modified with AMI to exploit the MLP, following the programming paradigm described in Section
5.2. The details of the benchmarks are shown in Table
3. Besides Redis and GUPS, which are hand-written, the other benchmarks are ported using the proposed coroutine framework to reduce programming complexity. BS, HT, LL, SL, and Redis exploit RLP by launching multiple coroutines (256 for most, 128 for SL), each executing an independent key/value lookup task. Each coroutine sequentially generates random keys. The data structures being looked up are allocated in far memory, and these workloads use AMI to initiate asynchronous memory access requests when accessing them. The coroutines are executed in an interleaved manner by the framework to hide far memory latencies. On the other hand, GUPS, HJ, HPCG, IS, and STREAM exploit LLP, as they contain iterations that are independent of each other; each coroutine executes a portion of the iterations. Most benchmarks have small (less than 64 B) access granularity, except STREAM, IS, and HPCG, which were evaluated for the benefits of large granularity. As these three benchmarks access contiguous memory, performance is improved by loading 512 B or more of contiguous data into the SPM with each
aload/
astore.
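For illustration, a STREAM-style scale kernel using large-granularity asynchronous accesses might be structured as below (hypothetical intrinsics and SPM mapping; a real port would overlap many such blocks via coroutines instead of busy-waiting):

```cpp
// Illustrative C++ sketch of a STREAM-style kernel with 512 B-granularity
// asynchronous accesses. The intrinsics and the SPM mapping are hypothetical;
// a real port would interleave many such blocks across coroutines rather
// than busy-wait on a single request.
#include <cstdint>
#include <cstddef>

extern "C" {
  int32_t ami_aload (const void* far_src, uint32_t spm_off, uint32_t bytes);
  int32_t ami_astore(void* far_dst, uint32_t spm_off, uint32_t bytes);
  int32_t ami_getfin();                      // non-blocking completion poll
}

constexpr uint32_t kBlock = 512;             // bytes moved per asynchronous request

// c[i] = a[i] * scalar, with a and c resident in far memory; n is assumed to
// be a multiple of the block size in elements.
void stream_scale(const double* a, double* c, double scalar, size_t n,
                  volatile double* spm /* assumed SPM mapping, >= kBlock bytes */) {
  const size_t per_block = kBlock / sizeof(double);
  for (size_t base = 0; base < n; base += per_block) {
    int32_t lid = ami_aload(&a[base], 0, kBlock);       // far memory -> SPM
    while (ami_getfin() != lid) { /* yield to other coroutines here */ }
    for (size_t j = 0; j < per_block; ++j)              // compute out of the SPM
      spm[j] *= scalar;
    int32_t sid = ami_astore(&c[base], 0, kBlock);      // SPM -> far memory
    while (ami_getfin() != sid) { /* yield before reusing the SPM buffer */ }
  }
}
```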
6.3 Performance and Power Evaluation
Figure
8 shows the normalized execution time of the benchmarks (the lower, the better). The far memory latency is adjusted to various values to simulate different far memory devices. For most cases, AMU maintains relatively constant performance as latency increases, demonstrating its ability to mitigate latency impacts.
AMU exhibits performance advantages in most benchmarks when the additional latency caused by far memory exceeds 500 ns. For BS, BFS, GUPS, HT, and LL, AMU already performs well when the additional latency is only 0.2 μs. These workloads primarily involve random memory accesses, which prevents them from benefiting from caches, so their performance is limited by the random access capability of the hardware. Additionally, irregular access patterns diminish hardware prefetcher effectiveness, further hurting performance. As AMU enables issuing huge numbers of outstanding requests, it alleviates these bottlenecks. For IS, which accesses memory sequentially more often, AMU only outperforms the other configurations when the additional latency exceeds 1 μs.
The AMU demonstrates significant advantages over traditional external memory access engines, simulated here as AMU (DMA-mode). Engines located outside the core cannot leverage out-of-order execution and suffer higher latency and overhead; AMU avoids these disadvantages through its specialized micro-architectural design within the processor. By supporting out-of-order execution and the proposed batching mechanisms inside the core, the AMU significantly reduces overhead, making fine-grained asynchronous memory access practical.
Figure
9 shows the average number of in-flight memory requests (MLP), which directly demonstrates a key benefit of the AMU. As latency increases, AMU-based workloads exhibit a corresponding rise in average MLP, as asynchronous memory access instructions enable applications to schedule more overlapping independent instruction streams for execution. With higher latency, more coroutines can be interleaved due to the ability to asynchronously initiate additional memory requests. The improvement in MLP effectively mitigates the impact of increased latency. In contrast, the performance of original codes relies heavily on short latency due to the limitations of hardware resources and synchronous semantics. Their MLP lacks scalability and does not improve as far memory latency rises.
Figure
10 shows the IPC of the benchmarks. It can be seen that adopting AMI significantly improves IPC. This demonstrates that the AMI, unlike traditional load/store instructions, does not stall for a long time in the ROB. Instead, they are committed very rapidly. This confirms that the proposed design is effective at reducing the consumption of critical hardware resources.
Figure
11 shows a breakdown of the power consumption. The additional power consumption of AMU is primarily due to the maintenance of metadata in the SPM and the increased instruction execution overhead caused by software-based instruction scheduling. Meanwhile, the performance improvement offered by the AMU reduces overall execution time, decreasing static power consumption and the number of accesses to power-hungry hardware resources such as the ROB/IQ, which yields significant power benefits. When the far memory latency is 500 ns, the geometric mean of AMU's power consumption relative to the baseline is 1.3, indicating that the benefits brought by the performance improvement do not entirely cover the additional overhead. However, when the latency reaches 1 μs, this ratio drops to 0.9, and the extra overhead is effectively compensated by the power benefits achieved through the performance improvement.
Table
4 compares the performance of a software prefetching scheme versus AMU. The software prefetching scheme utilizes a compiler-based software prefetching [
62]. The data shows that software prefetching requires careful tuning of prefetch aggressiveness due to the lack of feedback from hardware on the availability of prefetched data. This makes it difficult for software prefetching to adapt effectively to scenarios with highly variable memory access latencies like far memory.
Additionally, Table
4 presents a preliminary evaluation of the AMU LLVM pass. The evaluation indicates that the compiler-directed optimizations significantly outperformed the manually ported versions for some benchmarks (e.g., GUPS and HJ). On the other hand, for the compiler-based STREAM, performance was notably lower than the hand-optimized version using large-granularity asynchronous memory accesses, because the current compiler only supports 8 B-granularity asynchronous memory accesses. This suggests that, with continued advancement in compiler techniques, AMU still has the potential for higher performance gains.
Table
5 gives the overhead of memory disambiguation for two typical workloads. For HJ, the cost of memory disambiguation remains fairly stable at around 5%. For HT, the percentage of time spent on memory disambiguation is higher when the remote latency is low. However, as latency increases, this portion rapidly declines. Additionally, while there is a noticeable cost for memory disambiguation, the benefits from asynchronous memory access outweigh these overheads. Therefore, this overhead is acceptable. In future work, hardware-assisted techniques [
73] could be explored to further reduce the cost and unlock more of the potential of AMU.
6.4 Hardware Overhead
The on-chip storage overhead introduced by AMU is relatively low, as AMU reuses existing hardware resources. First, the list vector registers can employ the existing register renaming mechanism and use the general-purpose physical vector registers. Second, the metadata maintained by ASMC are stored in SPM, which is a part of the existing cache. Thus, no additional storage overhead is required for metadata.
The additional storage overhead of AMU consists mainly of internal state registers and relatively short queues. First, each state machine of AMU requires a 32-entry pending queue and several internal state registers. Second, the ASMC keeps two buffers, each the length of a list vector register, as caches of the corresponding lists. Third, there are two uncommitted ID registers in the ALSU. Therefore, the total overhead is only a few KB and does not grow as the required MLP increases.
We implemented the AMU on NanHu-G, the second generation of the open-source high-performance OoO RISC-V processor XiangShan [
70]. NanHu-G is a 4-issue OoO core with speculative execution and 96 ROB entries. The AMU design was evaluated on an FPGA platform to analyze its hardware cost. Furthermore, the area overhead of integrating the AMU was evaluated using Synopsys Design Compiler with the TSMC 28 nm HPC+ process technology. Table
6 shows the resource utilization compared to NanHu-G. The results indicate that the AMU can be efficiently implemented on modern processor cores with modest resource overheads.