1 Introduction
Hardware-software cooperative techniques offer a powerful approach to improving the performance and efficiency of general-purpose processors. These techniques involve communicating key application and semantic information from the software to the architecture to enable more powerful optimizations and resource management in hardware. Recent research proposes many such cross-layer techniques for various purposes, e.g., performance,
quality of service (QoS), memory protection, programmability, and security. For example, Whirlpool [
104] identifies and communicates regions of memory that have similar properties (i.e., data structures) in the program to the hardware, which uses this information to more intelligently place data in a
non-uniform cache architecture (NUCA) system. RADAR [
96] and EvictMe [
166] communicate which cache blocks will no longer be used in the program, such that cache policies can evict them. These are just a few examples in an increasingly large space of cross-layer techniques proposed in the form of hints implemented as new ISA instructions to aid cache replacement, prefetching, memory management, and so on [
22,
26,
58,
66,
96,
116,
117,
126,
136,
137,
159,
166,
179], program annotations/directives to convey program semantics [
3,
47,
58,
87,
104,
163], or interfaces to communicate an application’s QoS requirements for efficient partitioning and prioritization of shared hardware resources [
62,
93].
While cross-layer approaches have been demonstrated to be highly effective, such proposals are challenging to evaluate on real hardware as they require cross-layer changes to the hardware,
operating system (OS), application software, and
instruction-set architecture (ISA). Existing open-source infrastructures for implementing cross-layer techniques in real hardware include PARD [
62,
93] for QoS and Cheri [
174] for fine-grained memory protection and security. Unfortunately, these open-source infrastructures are not designed to provide key features required for
performance optimizations: (i) rich dynamic hardware-software interfaces, (ii) low-overhead metadata management, and (iii) interfaces to numerous hardware components such as prefetchers, caches, memory controllers, and so on.
In this work, we introduce
MetaSys (
Metadata Management
System for Cross-layer Performance Optimization), a full-system FPGA-based infrastructure, with a prototype in the RISC-V Rocket Chip system [
10], to enable rapid implementation and evaluation of diverse cross-layer techniques in real hardware. MetaSys comprises three key components: (1) A rich
hardware-software interface to communicate a
general and extensible set of application information to the hardware architecture at runtime. We refer to this additional application information as
metadata. Examples of metadata include memory access pattern information for prefetching, data reuse information for cache management, address bounds for hardware bounds checking, and so on. The interface is implemented as new instructions in the RISC-V ISA and is wrapped with easy-to-use software library abstractions. (2)
Metadata management support in the OS and hardware to store and access the communicated metadata. Hardware components performing optimizations can then efficiently query for the metadata. We use a
tagged memory-based design for metadata management where each memory address is tagged with an ID. This ID points to metadata that describes the data contained in the location specified by the memory address. (3)
Modularized components to quickly implement various cross-layer optimizations with interfaces to the metadata management support, OS, core, and memory system. Our FPGA-based infrastructure provides flexible modules that can be easily extended to implement different cross-layer optimizations.
The closest work to our proposed system is XMem [
164]. XMem proposes a general metadata management system that can communicate semantic information at compile time. This limits the use cases supported by XMem. MetaSys has the following benefits over XMem: First, MetaSys offers a richer interface that communicates a flexible amount of metadata at
runtime, rather than being limited to statically available program information. This enables a wider set of use cases and more powerful cross-layer techniques (as explained in Section
3.8). Second, MetaSys has a more optimized system design that is
lightweight in terms of hardware complexity and changes to the ISA, without sacrificing versatility (Section
3.8). MetaSys incurs only a small area overhead of 0.02% (including 17 KB of additional SRAM), 0.2% memory overhead in DRAM, and adds only eight new instructions to the RISC-V ISA. Third, MetaSys is open-source and freely available, whereas XMem is neither implemented nor evaluated in real hardware with full-system support.
Use cases. Cross-layer techniques that can be implemented with MetaSys include performance optimizations such as cache management, prefetching, memory scheduling, data compression, and data placement; cross-layer techniques for QoS; and lightweight techniques for memory protection (see Section
7). To demonstrate the versatility and ease-of-use of MetaSys in implementing new cross-layer techniques, we implement and evaluate three hardware-software cooperative techniques: (i) prefetching for graph analytics applications; (ii) bounds checking in memory-unsafe languages; and (iii) return address protection in stack frames. These techniques were quick to implement with MetaSys, each requiring only an additional ~100 lines of Chisel [
13] code on top of MetaSys’s hardware codebase (~1,800 lines of code).
Characterizing a general metadata management system. Using MetaSys, we perform the first detailed experimental characterization and limit study of the performance overheads of using a single common metadata management system to enable multiple diverse cross-layer techniques in a general-purpose processor. We make four new observations from our characterization across 24 applications and four microbenchmarks that were designed to stress MetaSys.
First, the performance overheads from the cross-layer interface and metadata system itself are low: 2.7% on average and up to 27% for the most intensive microbenchmark. Second, there is no performance loss from supporting multiple techniques that simultaneously query the shared metadata system. This indicates that MetaSys can be designed to be a scalable substrate. Third, the most critical factor in determining the performance overhead is the fundamental spatial and temporal locality in the accesses to the metadata itself. This determines the effectiveness of the metadata caches and the additional memory accesses to retrieve metadata. Fourth, we identify TLB misses from the required address translation when metadata is retrieved from memory as an important factor in performance overhead.
Conclusions from characterization. From our detailed characterization and implemented use cases on real hardware, we make the following conclusions: First, using a
single general metadata management system is a promising low-overhead approach to implement
multiple cross-layer techniques in future general-purpose processors. The significance of using a single framework is in enabling a wide range of cross-layer techniques with a single change to the hardware-software interface [
93,
164] and
consolidating common metadata management support, thus making the adoption of new cross-layer techniques in future processors significantly easier. Second, we demonstrate that a common framework can simultaneously and scalably support multiple cross-layer optimizations. For our implemented use cases, we observe low performance overheads from using the general MetaSys system: 0.2% for prefetching, 14% for bounds checking, and 1.2% for return address protection.
This work makes the following major contributions.
•
We introduce MetaSys, the first full-system open-source FPGA-based infrastructure of a lightweight metadata management system. MetaSys provides a rich hardware-software interface that can be used to implement a diverse set of cross-layer techniques. We implement a prototype of MetaSys in a RISC-V system providing the required support in the hardware, OS, and the ISA to enable quick implementation and evaluation of new hardware-software cooperative techniques in real hardware.
•
We propose a new hardware-software interface that enables
dynamically communicating information and a more streamlined system design that can support a richer set of cross-layer optimizations than prior work [
164].
•
We present the first detailed experimental characterization of the performance and area overheads of a general hardware-software interface and lightweight metadata management system designed to enable multiple and diverse cross-layer performance optimizations. We identify key sources of inefficiencies and bottlenecks of a general metadata system on real hardware, and we demonstrate its effectiveness as a common substrate for enabling cross-layer techniques in CPUs.
•
We demonstrate the versatility and ease-of-use of the MetaSys infrastructure by implementing and evaluating three hardware-software cooperative techniques: (i) prefetching for graph analytics applications; (ii) efficient bounds checking for memory-unsafe languages; and (iii) return address protection for stack frames. We highlight other use cases that can be implemented with MetaSys.
2 Background and Related Work
Hardware-software cooperative techniques in CPUs. Cross-layer performance optimizations communicate
additional information across the application-system boundary. We refer to this information as
metadata. Metadata that is typically useful for performance optimization include program properties such as access patterns, read-write characteristics, data locality/reuse, data types/layouts, data “hotness,” and working set size. This metadata enables more intelligent hardware/system optimizations such as cache management, data placement, thread scheduling, memory scheduling, data compression, and approximation [
162,
163,
164]. For QoS optimizations, metadata includes application priorities and prioritization rules for allocation of resources such as memory bandwidth and cache space [
48,
62,
76,
93,
106,
107,
152,
153]. Memory safety optimizations may communicate base/bounds addresses of data structures [
43,
45].
A
general framework is a promising approach as it enables many cross-layer techniques with a single change to the hardware-software interface and enables
reusing the metadata management support across multiple optimizations. Such systems were recently proposed for performance [
163,
164], memory protection and security [
45,
174], and QoS [
62,
93].
A general framework to support a wide range of cross-layer optimizations—specifically for
performance—requires (i) a rich and dynamic hardware-software interface to communicate a diverse set of metadata at runtime, (ii) lightweight and
low-overhead metadata management [
164], and (iii) interfaces to numerous hardware components. Even small overheads imposed as a result of the system’s generality may overshadow the performance benefits of a cross-layer technique. General metadata systems may also impose significant complexity, performance, and power overheads on the processor. While prior work has demonstrated the significant benefits of cross-layer approaches, no previous work has characterized the efficiency and capacity limits of a general metadata system for cross-layer optimizations in CPUs.
Tagged architectures. MetaSys is inspired by the metadata management and interfaces proposed in XMem [
164] and the large body of work on tagged memory [
45,
53,
68,
173,
182] and capability-based systems [
27,
85,
168,
174]. We compare against the closest prior work, XMem, qualitatively in Section
3.8 and quantitatively in Section
5. Unlike all of the above works, our goal is to provide an open-source framework to implement and evaluate these prior cross-layer approaches in real hardware and to perform a detailed real-system characterization of such metadata systems for performance optimization.
Infrastructure for evaluating cross-layer techniques. Evaluating the overheads and feasibility of a newly proposed cross-layer technique is non-trivial. Fully characterizing the performance and area overheads either with a full-system cycle-accurate simulator or an FPGA implementation requires implementing: (i) Hardware support for the mechanism; (ii) OS support for OS-based cross-layer optimizations and to characterize the context-switch and system overheads of saving and handling a process’ metadata; and (iii) Compiler support and ISA modifications to add and recognize new instructions to communicate metadata.
Recent works propose general systems that are designed to enable cross-layer techniques for QoS (PARD [
62,
93]) or fine-grained memory protection and security (Cheri [
174]). PARD enables tagging of components and applications with IDs that are propagated with memory requests and enforcing QoS requirements in hardware. Cheri [
174] is a capability-based system that provides hardware support and ISA extensions to enable fine-grained memory protection. Neither system supports the (i) communication of diverse metadata at runtime, (ii) flexible granularity tagging of memory to enable efficient metadata lookups from multiple components, or (iii) interfaces to numerous hardware components (such as the prefetcher, caches, memory controllers) that are needed for
performance optimization.
Our Goal. Our goal in this work is twofold. First, we aim to develop an efficient and flexible open-source framework that enables rapid implementation of new cross-layer techniques to evaluate the associated performance, area, and power overheads, and thus their benefits and feasibility, in real hardware.
Second, we aim to perform the first detailed limit study to characterize and experimentally quantify the overheads associated with general metadata systems to determine their practicality for performance optimization in future CPUs.
3 MetaSys: Enabling and Evaluating Cross-layer Optimizations
To this end, we develop MetaSys, an open-source full-system FPGA-based infrastructure to implement and evaluate new cross-layer techniques in real hardware. MetaSys includes: (i) a rich hardware-software interface to dynamically communicate a flexible amount of metadata at runtime from the application to the hardware, using new RISC-V instructions; (ii) a tagged memory-based [
45,
53,
68,
173,
182] implementation of metadata management in the system and OS; and (iii) flexible modules to add new hardware optimizations with interfaces to the metadata, processor, memory, and OS. We build a prototype of MetaSys in the RISC-V Rocket Chip [
10] system.
We choose an FPGA implementation over a full-system simulator for three reasons: (i) It forces a focus on feasibility, as all components need to be fully implemented (e.g., ports, wires, buffers) and their impact on area, cycle time, power, and scalability is quickly visible. (ii) FPGAs are much faster, running full applications in a few minutes/hours as opposed to many days on a full-system simulator, making them a better fit for quick experimentation. (iii) The generated RTL can be used for more accurate area and power calculations and potential future synthesis on other systems.
Figure
1 depicts an overview of the major hardware components in MetaSys and their operation: The mapping management unit ❶, the optimization client ❷, and the metadata lookup unit ❸.
3.1 Tagged Memory-based Metadata Management
Similar to prior systems for taint-tracking, security, and performance optimization, MetaSys implements
tagged memory-based [
53,
68,
173,
182] metadata management. MetaSys associates metadata with memory address ranges of arbitrary sizes by tagging each memory address with an 8-bit (configurable) ID or tag. Each tag is a unique pointer to metadata that describes the data at the memory address. Hardware optimizations (e.g., in the cache, memory controller, or core) can query for the tag associated with any memory address and the metadata associated with the tag.
The mapping between each memory address and the corresponding ID is saved in a table in memory referred to as Metadata Mapping Table (MMT): ❹ in Figure
1. This table is allocated by the OS for each process and is saved in memory. In MetaSys (similar to XMem [
164] and Cheri [
174]), we tag
physical addresses. As a result, any virtual address has to be translated before indexing the MMT to retrieve the tag ID. To enable fast retrieval of IDs, we implement a cache for the MMT in hardware that stores frequently accessed mappings, referred to as the
Metadata Mapping Cache (MMC) ❺. MMC misses lead to memory accesses to retrieve mappings from the MMT in memory.
MetaSys can be configured to tag memory at flexible granularities. In Section
9.1, we evaluate the performance impact of the tagging granularity, which also determines the size of the MMT. For a 512 B mapping granularity, the MMT requires 0.2% of physical memory (16 MB in an 8 GB system). The MMC holds 128 entries, where each entry stores a physical-address-to-tag mapping (a 30-bit address and an 8-bit tag), and is 608 B in size.
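These sizes follow directly from the structure parameters; the helpers below are an illustrative sanity check, not part of MetaSys:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sizing helpers, assuming one 1-byte (8-bit) tag per
 * tagging-granularity chunk of physical memory. */
static uint64_t mmt_bytes(uint64_t phys_mem_bytes, uint64_t granularity_bytes) {
    return phys_mem_bytes / granularity_bytes;     /* one 1-byte tag per chunk */
}

static uint64_t mmc_bytes(uint64_t entries, uint64_t tag_bits, uint64_t addr_bits) {
    return entries * (tag_bits + addr_bits) / 8;   /* total SRAM bits -> bytes */
}
```

With 8 GB of memory and 512 B granularity, `mmt_bytes` gives 16 MB (about 0.2% of memory); 128 entries of a 30-bit address plus an 8-bit tag give the 608 B MMC.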
We implement dedicated mapping tables for tag IDs rather than use the page table or TLBs for the following reasons: First, doing so obviates the need to modify the latency-critical address translation structures. Second, MetaSys associates physical addresses with Tag IDs rather than virtual addresses (to enable the memory controller and LLCs to look up metadata). Thus, a page table or TLB cannot be directly used to save Tag IDs as they are indexed with virtual addresses.
The actual metadata associated with any ID is saved in special SRAM caches that are private to each hardware component or optimization. For example, the prefetcher would separately save access pattern information, while a hardware bounds checker would privately save data structure boundary information. We refer to these stores as Private Metadata Tables (PMTs) ❻. The PMTs are saved near each component (private to each component) and are loaded/updated by MetaSys. The metadata (e.g., locality/“hotness”) is encoded such that it can be directly interpreted by the component, e.g., a prefetcher.
3.2 The Hardware-Software Interface
Communicating application information with MetaSys requires (i) associating memory address ranges with a tag or ID of configurable size (8 bits by default) and (ii) associating each ID with the relevant metadata. The metadata could include program properties that describe the memory range, such as data locality/reuse, access patterns, read-write characteristics, data “hotness,” and data types/layouts. We use two operators (described below) that can be called in programs to dynamically communicate metadata.
To associate memory address ranges with an ID, we provide the
MAP/
UNMAP interface ❼ (similar to XMem [
164]).
MAP and
UNMAP are implemented as new RISC-V instructions that are interpreted by the
Mapping Management Unit (MMU) to map a range of memory addresses (from a given virtual address up to a certain length) to the provided ID. These mappings are saved by the MMU in the MMT. We also implement 2D and 3D versions of
MAP to efficiently map two-/three-dimensional address ranges in a multi-dimensional data structure with a single instruction.
To associate each ID with metadata, we provide the
CREATE interface.
CREATE ❽ takes three inputs from the application: the tag ID, the 8-bit ID for the
hardware component (e.g., prefetcher, bounds checker), called the Module ID, and 512 B of metadata.
CREATE directly populates the PMT of the appropriate hardware component with up to 512 B of metadata. Each PMT (private to the optimization client) has 256 entries assuming 8-bit tag IDs. The
CREATE operator overwrites the metadata at the entry indexed by the tag ID at the PMT specified by the module ID. All
CREATE and
MAP instructions are associated with the
next load/store instruction in program order to avoid inaccuracies due to out-of-order execution. In other words, an implicit dependence is created in hardware between these instructions and the next load/store, and they are committed together. This enables associating information with the next load/store and not just the memory region associated with it, e.g., in the bounds checking use case described in Section
6.1.
Table
1 lists the new instructions along with their arguments.
3.3 Metadata Lookup
Each optimization component is triggered by a hardware event ❾ (e.g., a cache miss). A component then retrieves the physical address corresponding to the virtual address associated with the event (e.g., the virtual address that misses in the cache) from the TLB ❿ (in case of L1 optimizations) and queries the MMC with the physical address to retrieve the associated tag ID. On a miss in the MMC, the mapping is retrieved from the MMT in memory. The optimization client uses the retrieved tag ID to obtain the appropriate metadata from the PMT. The optimization client is designed to flexibly implement a wide range of use cases and can be designed based on the optimization at hand. For example, the optimization client used to build the prefetcher use case in Section
5 has interfaces to the prefetcher, caches, memory controller, and TLBs to make implementing optimizations easier. Each client has a static ID (clientID) and a PMT that is updated by the
CREATE operator.
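The lookup path above can be summarized with a small software model; the structure sizes and names here are illustrative, not the exact hardware design:

```c
#include <assert.h>
#include <stdint.h>

/* A minimal software model of the MetaSys lookup path: a hardware event
 * yields a physical address, which is mapped to a tag ID (via the MMC,
 * falling back to the MMT in memory), which indexes the client's PMT. */
#define GRANULARITY 512
#define MMC_ENTRIES 128
#define PMT_ENTRIES 256

static uint8_t mmt[1 << 20];               /* MMT: one tag per 512 B chunk */
typedef struct { uint64_t chunk; uint8_t tag; int valid; } mmc_entry_t;
static mmc_entry_t mmc[MMC_ENTRIES];       /* MMC, modeled as direct-mapped */
static uint8_t pmt[PMT_ENTRIES][512];      /* PMT of one optimization client */

/* Physical address -> tag ID. */
static uint8_t lookup_tag(uint64_t paddr) {
    uint64_t chunk = paddr / GRANULARITY;
    mmc_entry_t *e = &mmc[chunk % MMC_ENTRIES];
    if (e->valid && e->chunk == chunk)
        return e->tag;                     /* MMC hit */
    uint8_t tag = mmt[chunk];              /* MMC miss: memory access to the MMT */
    *e = (mmc_entry_t){ chunk, tag, 1 };   /* fill the MMC */
    return tag;
}

/* Tag ID -> metadata in the client's PMT. */
static const uint8_t *lookup_metadata(uint64_t paddr) {
    return pmt[lookup_tag(paddr)];
}
```

An MMC miss falls back to the MMT in memory and fills the cache; repeated lookups to the same region are then served by the MMC, which is what makes metadata access locality the critical overhead factor.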
3.4 Operating System Support
We add OS support for metadata management in the RISC-V proxy kernel [
128], which can be booted on our Rocket RISC-V prototype: First, we add support to manage the MMT in memory, where the OS allocates the MMT in the physical address space and communicates the pointer to the MAP hardware support. Second, we add support to flush the PMTs during a context switch (similar to how the TLB is flushed). Third, if the OS changes the virtual to physical address mapping of a page, then to ensure consistency of the metadata, the MMT is updated by the OS to reflect the correct physical-address-to-tag-ID mapping and the corresponding MMC entries are invalidated. We modify the page allocation mechanism in the OS to do this. In addition, we also provide support to implement optimizations performed by the OS or with OS cooperation. To do so, MetaSys enables trapping into the OS to perform customized checks or optimizations (e.g., protection checks or altering virtual-to-physical mappings) based on specific hardware trigger events (using interrupt routines). We describe one such use case in Section
6.
3.5 Coherence/Consistency of Metadata in Multicore Systems
MetaSys can be flexibly extended to multicore processors. Metadata is maintained at the process level; therefore, threads within the same process cannot have different metadata for the same data structure. The MMC is a per-core structure, while the Private Metadata Tables (PMTs) are per-component structures (e.g., at the memory controller, LLC, prefetcher). The two dynamic operators (CREATE and MAP) may cause challenges in coherence and consistency of metadata in multicore systems. CREATE directly updates metadata associated with the per-process tag ID, which is saved at the per-component PMTs. The PMTs are shared by all cores when the optimization component is also shared (and thus any updates by CREATE are automatically coherent). The PMTs for private components (e.g., L1 cache) are not coherent and can only be updated by the corresponding thread. MAP updates the mapping in the MMC, which is private to each core. To ensure coherence of the MMC mappings, a MAP update invalidates the corresponding MMC entry (if present) in other MMCs by broadcasting updates with a snoopy protocol. If the use case requires consistency of the metadata, i.e., ordering between a CREATE/MAP instruction and when it is visible to other cores, then barriers and fence instructions are used to enforce any required ordering between threads for updates to metadata.
3.6 Timing Sensitivity of Metadata
MetaSys supports three modes: (i) Force stall, where the instruction triggering a metadata lookup cannot commit until the optimization completes (e.g., for security use cases); (ii) No stall, where metadata lookups do not stall the core but are always resolved (e.g., for page placement, cache replacement); and (iii) Best effort, where lookups may be dropped to minimize performance overheads (e.g., for prefetcher training).
3.7 Software Library
We develop a software library that can be included in user programs to facilitate the use of MetaSys primitives
CREATE and
MAP (Table
2). The library exposes three functions: (i)
CREATE populates an entry indexed by the tag ID (
TagID) in the PMT of a hardware optimization client (
ClientID) with the corresponding metadata; (ii)
MAP updates the MMT by assigning tag IDs to memory addresses of the range (
start, end); (iii)
UNMAP resets the tag IDs of the corresponding address range in the MMT. While the operators can be directly used via the provided software library, their use can be simplified by using wrapper libraries that abstract away the need to directly manage tag IDs and their mappings.
3.8 Comparison to the XMem Framework [164]
MetaSys implements a tagged-memory-based system with a metadata cache similar to XMem [
164]. MetaSys however has three major benefits over XMem. First, MetaSys enables communicating metadata at
runtime using a more powerful
CREATE operator that is implemented as a new instruction. In XMem, metadata is communicated only
statically at compile time (
CREATE is hence a compiler pragma). MetaSys thus enables a wider set of optimizations including fine-grained memory safety, protection, prefetching, and so on, and enables communicating metadata that is dependent on program input and metadata that can be accurately known only at runtime (e.g., access patterns, data “hotness,” etc.). MetaSys was designed to efficiently handle these dynamic metadata updates. Second, the dynamic and more expressive
CREATE operator obviates the need for additional interfaces (
ACTIVATE/
DEACTIVATE) to track the validity of statically communicated metadata. This enables a more streamlined metadata system in MetaSys with fewer new instructions, tables, and lookups. Third, MetaSys allows the application programmer to directly select which cross-layer optimization to enable/disable and communicate metadata to, via the
CREATE operator. XMem, however, does not allow control of hardware optimizations from the application. Table
3 summarizes the MetaSys operators and compares to the corresponding operators in XMem.
Of the three MetaSys use cases we evaluate in this article, only return address protection (Section 6.2) can be implemented with XMem.
3.9 FPGA-based Infrastructure
We build a full system prototype of MetaSys on an FPGA with the Rocket Chip RISC-V system [
10] and add the necessary support in the compiler, libraries, OS, ISA, and hardware. The modularized MetaSys components can also be ported to other RISC-V cores. We used the RoCC accelerator [
10] in the Rocket Chip to implement the metadata management system. RoCC is a customizable module that enables interfacing with the core and memory. The hardware support implemented in RoCC comprises (i) the control logic to handle
MAPs and
CREATEs, (ii) control logic to perform metadata lookups by components that implement optimizations, and (iii) the memory for metadata caches (MMC and PMTs). We extended the RISC-V ISA with eight instructions (six for
MAP/
CREATE and two for OS operations). To implement all the hardware modules of MetaSys, we modified/added 1,781 lines of Chisel code in the Rocket Chip. As we demonstrate later, since the MetaSys hardware modules can be flexibly reused across multiple hardware-software optimizations, the techniques in our use cases only required 87–103 additional lines of Chisel code. The full MetaSys infrastructure is open-sourced [
57], including the Chisel code for the MetaSys hardware support, the RISC-V OS with the required modifications, and the software libraries that expose the MetaSys primitives.
3.10 Implementing a Hardware-Software Cooperative Technique with MetaSys
To implement a new hardware technique with the baseline MetaSys code, we provide a flexible module (❶ in Figure
2) with a PMT and interfaces to the metadata lookup unit, to the core (to receive triggers), and interfaces to the cache controller. The interface to the lookup unit ❷ provides dynamic access to the metadata communicated by the
CREATE and
MAP operators. The interfaces to the core ❸ and the memory system ❹ can be used as
trigger events for optimization and lookups (e.g., a cache miss). The different components within the MetaSys logic itself (i.e., the metadata caches, logic to access the Metadata Mapping Table in memory, and the lookup logic) can be flexibly reconfigured.
3.11 Dynamically Typed or Managed Languages
Even in C/C++, MetaSys relies on function calls/libraries that abstract away the low-level details of invoking the MetaSys instructions. With managed and dynamically typed languages, the metadata associated with data structures/objects would be provided by the user with additional class/object member functions. The metadata could also be directly embedded within object/class definitions (e.g., a list or map in Python would by definition have certain access properties). Other properties (e.g., data types) would be provided by the interpreter (in the case of dynamically typed languages) and the mapping/remapping calls to memory addresses would be handled by the runtime during memory (de)allocation.
3.12 Comparison to Specialized Cross-layer Solutions
In comparison to specialized cross-layer solutions, MetaSys offers the following benefits: (i) Generality: it supports a large number of use cases, including more complex use cases such as specialized prefetching (Section
5), which amortizes the overall hardware cost; (ii) Flexibility and versatility in the implemented instructions: specialized cross-layer solutions must add new instructions for each optimization, which creates forward/backward-compatibility challenges and requires changes across the stack. With MetaSys, the instructions are designed to be agnostic to the optimization and require only a one-off change to the hardware-software interface; (iii) Infrastructure for evaluation: MetaSys can be used to implement many specialized cross-layer techniques in real hardware, which would otherwise be a challenging programming task (as demonstrated in Sections
5 and
6). In Sections
5 and
6, we evaluate MetaSys’s ability to implement several cross-layer techniques.
5 Use Case 1: HW-SW Cooperative Prefetching
Hardware-software cooperative prefetching techniques have been widely proposed to handle challenging access patterns such as in graph processing [
4,
5,
6,
18,
103,
113,
157,
181,
185], pointer-chasing [
7,
30,
49,
131,
132,
187], linear algebra computation [
32], and other applications [
9,
120,
165,
167]. In this section, we demonstrate how MetaSys can be flexibly used to implement and evaluate such prefetching techniques. We design a new prefetcher for graph applications that leverages knowledge of the semantics of graph data structures using MetaSys. Graph applications typically involve irregular pointer-chasing-based memory access patterns. The data-dependent non-sequential accesses in these workloads are challenging for spatial [
14,
21,
23,
52,
54,
55,
64,
71,
75,
80,
82,
99,
115,
125,
139,
140,
148,
150], temporal [
15,
19,
31,
33,
35,
51,
61,
65,
70,
73,
147,
169,
170,
171,
175,
176], and learning-based hardware prefetchers [
20,
59,
121,
122,
141,
142,
143,
184] that rely either on (i) program context information (e.g., program counter, cache line address) or (ii) memorizing long sequences of cache line addresses to generate accurate prefetch requests.
To implement the hardware support for our prefetcher, we only needed to add 87 lines of Chisel code to the baseline MetaSys codebase, all within the provided module for new optimization components.
5.1 Hardware-Software Cooperative Prefetching for Graph Analytics with MetaSys
Vertex-centric graph analytics typically involves first traversing a work list containing vertices to be visited (❶ in Figure 3, left). For each vertex, the application accesses the vertex list ❷ to retrieve the neighboring vertex IDs from the edge list ❸. To perform computation on the graph, the application then operates on the properties of these neighboring vertices (retrieved from the property list ❹). Graph processing thus involves a series of memory accesses that depend on the contents of the work, vertex, and edge lists.
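This chain of data-dependent accesses can be sketched as follows. The snippet below is a minimal Python model of the traversal pattern (array names and the CSR-style layout are illustrative; the actual Ligra/BFS code differs):

```python
# Minimal model of the data-dependent access chain in vertex-centric
# graph traversal. Names are illustrative, not taken from MetaSys/Ligra.

def traverse(work_list, vertex_list, edge_list, properties):
    """For each frontier vertex, touch the properties of its neighbors."""
    touched = []
    for v in work_list:                      # (1) work list: vertices to visit
        start, end = vertex_list[v], vertex_list[v + 1]  # (2) CSR offsets
        for e in range(start, end):          # (3) edge list: neighbor IDs
            neighbor = edge_list[e]
            touched.append(properties[neighbor])  # (4) property list access
    return touched
```

Note that each access at level (n+1) depends on a value loaded at level (n), which is exactly why prefetchers that operate only on addresses struggle with these workloads.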
In this use case, we design a prefetcher that can interpret the contents of each of the above data structures and appropriately compute the next data-dependent memory address to prefetch. To capture the required application information for each data structure, we use MetaSys’s CREATE interface to communicate the following metadata: (i) base address of the data structure that is indexed using the current data structure’s contents (64 bits); (ii) base address of the current data structure (64 bits); (iii) data type (32 bits) and size (32 bits) to determine the index of the next access; and (iv) the prefetching stride (6 bits). MAP then associates the address range of each data structure with the appropriate tag.
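The per-tag metadata above and its association with address ranges can be pictured roughly as follows. This is a hedged sketch: the field names, the `create()`/`map_range()` helpers, and the dictionary-based PMT/MMT are illustrative stand-ins for the actual MetaSys software library and hardware tables.

```python
# Sketch of the per-tag metadata communicated via CREATE and the
# tag-to-address-range association made by MAP. All names here are
# illustrative; the real MetaSys library API may differ.
from dataclasses import dataclass

@dataclass
class PrefetchMetadata:
    next_base: int   # base address of the structure indexed by this one (64b)
    own_base: int    # base address of the current data structure (64b)
    elem_type: int   # data type, used to interpret loaded values (32b)
    elem_size: int   # element size in bytes (32b)
    stride: int      # prefetch look-ahead distance (6b)

pmt = {}   # models the Private Metadata Table: tag ID -> metadata
mmt = {}   # models the Metadata Mapping Table: address range -> tag ID

def create(tag, md):
    """CREATE: store the metadata for a tag in the PMT."""
    pmt[tag] = md

def map_range(base, size, tag):
    """MAP: associate the address range [base, base+size) with a tag."""
    mmt[(base, base + size)] = tag
```

For example, `create(1, PrefetchMetadata(next_base=vl_base, own_base=wl_base, elem_type=0, elem_size=4, stride=2))` followed by `map_range(wl_base, wl_bytes, 1)` would tag a work list whose contents index a vertex list.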
Listing 1 shows a detailed end-to-end example of how metadata is created in the application (BFS), how metadata tags are associated with the data structures of BFS, and how the prefetcher operates. Lines 2-10 (incorporated into the code of the BFS application) use the MetaSys software libraries to create metadata (with CREATE) and associate it with the corresponding data structures (using MAP). CREATE saves the metadata in the PMT, and MAP updates the MMT. Lines 13-21 (incorporated into the hardware optimization client responsible for prefetching) describe the algorithm behind the MetaSys-based prefetcher. The prefetcher is implemented as an optimization client (ClientID = 0). The prefetcher essentially: (i) snoops every memory request from the core and retrieves the associated tag ID using MetaSys; (ii) queries the PMT to retrieve the communicated metadata (listed above); and (iii) uses the metadata to identify dependencies between the data structures of the application.
We now walk through how the prefetcher operates during the execution of the BFS application, using Figure 3 and Listing 1. In Figure 3 (left), when the prefetcher snoops a memory request that targets the work list at index 0, it looks ahead (depending on the prefetching stride) to retrieve the contents of the work list at index 1. At this point, it also prefetches the contents of the vertex, edge, and property lists based on the computed index at each level. In graph applications where the work list is ordered, the prefetcher is configured to simply stream through the contents of the vertex and edge lists to prefetch the data-dependent memory locations in the property list. The \(snoop\_mem\_request(address)\) function (Line 13) is executed for each request sent by the core to the memory hierarchy. For every memory request, the prefetcher accesses the MMC using the address to retrieve the tag ID (using MetaSys's lookup functionality). Next, it indexes the PMT using the tag ID to retrieve the metadata associated with the memory request. Using the metadata, the prefetcher determines whether the request comes from one of the data structures of the application (Line 17). If so, the prefetcher first prefetches ahead (Line 18) according to the stride and waits until it receives the value of the prefetched request (Line 19). Using this value, it calculates the address in the data-dependent data structure (e.g., the value of WorkList used as an index into VertexList) and looks up the metadata for the newly formed address. This procedure repeats until no further data dependency is found (Line 16).
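The walkthrough above can be condensed into the following sketch. This is an illustrative Python model of the client logic, not the Chisel implementation: `Meta`, `lookup()`, and the flat-dictionary memory stand in for the PMT entry format, the MMC tag lookup, and the memory hierarchy.

```python
# Illustrative model of the MetaSys-based graph prefetcher client.
from collections import namedtuple

# Simplified metadata record (a subset of the fields named in the text).
Meta = namedtuple("Meta", "next_base elem_size stride")

def lookup(mmt, addr):
    """Return the tag for addr, or None (models the MMC tag lookup)."""
    for (lo, hi), tag in mmt.items():
        if lo <= addr < hi:
            return tag
    return None

def snoop_mem_request(addr, mmt, pmt, mem, prefetched):
    """Invoked for every memory request the core issues."""
    tag = lookup(mmt, addr)                       # (i) retrieve the tag ID
    while tag is not None:                        # stop when no dependency left
        md = pmt[tag]                             # (ii) metadata from the PMT
        ahead = addr + md.stride * md.elem_size   # (iii) prefetch ahead
        prefetched.append(ahead)
        value = mem.get(ahead, 0)                 # wait for the prefetched value
        addr = md.next_base + value * md.elem_size  # data-dependent address
        tag = lookup(mmt, addr)                   # follow the chain downward
    return prefetched
```

In this model, a snooped work-list access triggers a chain of prefetches that ends as soon as an address falls outside every mapped (tagged) region.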
The prefetcher can be flexibly configured by the user (by associating metadata with data structures, Lines 2–10 in Listing 1) based on the specific properties of any data structure and algorithm, and the desired aggressiveness of the prefetcher.
5.2 Evaluation and Methodology
We evaluate the MetaSys-based prefetcher using eight graph analytics workloads from the Ligra framework [145] on the Rocket Chip prototype of MetaSys, with the system parameters listed in Table 4. We evaluate three configurations: (i) the baseline system with a hardware stride prefetcher [55]; (ii) GraphPref, a customized hardware prefetcher that implements the same idea described above without the generalized MetaSys support (similar to prior work [5, 157]); and (iii) the MetaSys-based graph prefetcher. In the case of GraphPref, all the required metadata (e.g., base and bound addresses, stride) is provided directly to the prefetcher using specialized instructions. Thus, GraphPref can access metadata at low latency and does not access the memory hierarchy. The prefetcher works in the same way as the MetaSys-based prefetcher; in the case of MetaSys, however, the general CREATE/MAP instructions are used to communicate information, and the metadata lookups access the MMC (which may lead to additional memory accesses when there is an MMC miss). Figure 3 (right) depicts the corresponding speedups, normalized to the baseline. We observe that the MetaSys graph prefetcher improves performance by 11.2% on average (up to 14.3%) over the baseline by accurately prefetching data-dependent memory accesses. It also significantly outperforms the stride prefetcher, which is unable to capture the irregular access patterns in graph workloads. Compared to GraphPref, the MetaSys-based prefetcher performs almost as well: within 0.2% on average (within 0.8% for BFS). The additional overheads of MetaSys come from MMC misses and the larger number of instructions executed. In terms of area, MetaSys requires 17 KB of SRAM (1 KB for the MMC and 16 KB for the Private Metadata Table), whereas the custom hardware prefetcher requires 8 KB of SRAM for the metadata. The custom prefetcher also requires two additional instructions and additional logic to perform metadata lookups and create/update metadata. We found the area complexity to be slightly lower for the custom solution because its SRAM requirements are lower (\(\sim\)0.01% for custom hardware versus \(\sim\)0.02% for MetaSys, relative to a 22 nm Intel CPU core [144]). However, MetaSys's overhead can be amortized over multiple use cases, whereas a custom solution is specific to a single use case.
We conclude that MetaSys can be used to flexibly implement and evaluate hardware-software cooperative techniques for prefetching by leveraging MetaSys’s metadata support and interfaces, incurring only small overheads from MetaSys’s general metadata management.
7 Other Use Cases of MetaSys
We briefly discuss various other cross-layer techniques that can be implemented with MetaSys (but would be challenging to implement with prior approaches like XMem [164]).
Performance optimization techniques. MetaSys provides a low-overhead framework and a rich cross-layer interface to implement a diverse set of performance optimizations, including cache management, prefetching, page placement in memory, approximation, data compression, DRAM cache management, and memory management in NUMA and NUCA systems [2, 22, 26, 40, 41, 58, 66, 96, 116, 117, 126, 136, 137, 159, 166, 179]. MetaSys can flexibly implement the range of cross-layer optimizations supported by XMem [164] and the Locality Descriptor [163]. MetaSys's dynamic interface for metadata communication enables even more powerful optimizations than XMem, including memory optimizations for dynamic data structures such as graphs. We already demonstrate one performance optimization in Section 5.
Techniques to enforce cross-layer quality of service (QoS). MetaSys can be used to implement cross-layer techniques that enforce the QoS requirements of applications in shared environments [48, 62, 76, 93, 106, 107, 152, 153]. MetaSys allows communicating an application's QoS requirements to hardware components (e.g., the last-level cache, memory controllers) to enable optimizations for partitioning and allocating shared resources such as cache space and memory bandwidth.
Hardware support for debugging and monitoring. MetaSys can be used to implement cross-layer techniques for performance debugging and bug detection by providing efficient mechanisms to track memory access patterns using its memory tagging and metadata lookup support. This includes efficient detection of memory safety violations [123, 161] or concurrency bugs [88, 89, 90, 91, 108, 188] such as data races, deadlocks, and atomicity violations.
Security and protection. MetaSys provides a substrate to implement low-overhead hardware techniques for security and protection: the tagged memory support can be used to implement spatial memory safety [42, 127, 172, 183], defenses against cache timing side channels [78], and stack protection [86, 129]. For example, using MetaSys, software can tag memory accesses as security-critical or safe. Based on the metadata received for every access, MetaSys can activate/deactivate the corresponding side-channel defense technique at runtime for that specific access (e.g., protect from or undo speculation [17, 25, 74, 134, 178]). We already demonstrate two security techniques in Section 6.
Garbage collection. MetaSys offers an efficient mechanism to track dead memory regions, unreachable objects, or young objects in managed languages. MetaSys is hence a natural substrate to implement hardware-software cooperative approaches to garbage collection (such as prior work [69, 94, 95]). For example, HAMM [69], a hardware-software cooperative technique for reference counting, tracks the number of references to any object in hardware. It has many of the same metadata management components as MetaSys. HAMM uses a multi-level metadata cache to manage the large amounts of metadata associated with reference counting for each object. MetaSys was designed with modular interfaces that enable adding more levels to the metadata cache for such use cases.
OS optimizations. MetaSys can be used to implement OS optimizations that require hardware performance monitoring of memory access patterns, contention, reuse, and so on [29, 47, 105, 118, 147, 148]. The metadata support in MetaSys can be used to implement this monitoring and then inform OS optimizations such as thread scheduling, I/O scheduling, and page allocation/mapping [40, 56, 77, 102, 105].
Cache optimizations. MetaSys enables various cache optimizations such as cache scrubbing [136, 166] and cache prioritization [22, 26, 58, 66, 96, 116, 117, 126, 136, 137, 159, 166, 179]. To implement such optimizations with MetaSys, the CREATE operator is used to specify the expected reuse of a data object at runtime. For example, objects can be tagged as having no reuse (e.g., once all threads have completed operations on them). Thus, upon encountering a cache miss (the trigger event), the cache controller can look up the expected reuse of different cache lines using MetaSys's lookup mechanism and then evict a dead cache line. A similar mechanism can be used to retain cache lines that have high expected reuse.
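The scrubbing decision just described can be sketched as follows. This is a simplified illustrative model (the tag encoding, the `pick_victim` helper, and the per-line dictionary are assumptions, not the MetaSys client implementation):

```python
# Sketch of cache scrubbing with reuse metadata. Tag 1 marks a line as
# "no further reuse" (dead); untagged lines may still be reused.
# Encoding and helper names are illustrative assumptions.

def pick_victim(cache_lines, reuse_tag):
    """On a miss, prefer evicting a line whose metadata marks it dead."""
    for line in cache_lines:
        if reuse_tag.get(line) == 1:   # metadata lookup: dead line found
            return line
    return cache_lines[0]              # else fall back to the default policy
```

On each miss (the trigger event), the controller consults the metadata of the resident lines and scrubs a dead one first, falling back to its default replacement policy (e.g., LRU, modeled here as the first entry) when no line is marked dead.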
Compressing sparse data structures. MetaSys can be used to support techniques that efficiently compress sparse data structures and accelerate sparse workloads [72, 138]. For example, SMASH [72] is a hardware-software cooperative technique that efficiently compresses sparse matrices using a hierarchy of bitmaps to encode non-zero cache lines and accelerates the discovery of the non-zero elements of a sparse matrix. Instead of using specialized hardware, SMASH could access the hierarchy of bitmaps and identify non-zero elements using MetaSys's metadata support.
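The bitmap-hierarchy idea can be illustrated with a simplified two-level version (the group size and function names are assumptions for illustration; SMASH's actual encoding has more levels and different parameters):

```python
# Simplified two-level bitmap over the cache lines of a sparse structure,
# inspired by SMASH's hierarchy of bitmaps. Parameters are illustrative.

GROUP = 8  # number of lower-level bits summarized by one upper-level bit

def build_bitmaps(nonzero_lines, num_lines):
    """Lower bitmap: one bit per cache line; upper bitmap: one bit per group."""
    lower = [1 if i in nonzero_lines else 0 for i in range(num_lines)]
    upper = [1 if any(lower[g:g + GROUP]) else 0
             for g in range(0, num_lines, GROUP)]
    return upper, lower

def nonzero(upper, lower, line):
    """Check a line, skipping whole groups the upper bitmap marks empty."""
    if upper[line // GROUP] == 0:
        return False                   # entire group known-zero: skip it
    return lower[line] == 1
```

The upper level lets a traversal skip entire groups of zero cache lines with a single bit check, which is the source of SMASH's speedup in discovering non-zero elements.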
Heterogeneous reliability memory optimizations. MetaSys's metadata support can be used by techniques that exploit the heterogeneous reliability characteristics of memory devices to improve performance, power consumption, and system cost [81, 87, 92, 135]. These techniques typically require support for dynamically looking up the error tolerance characteristics of data structures to place them in memory such that a target bit error rate is satisfied. MetaSys's metadata support is a natural candidate for providing these techniques with a means to query the reliability characteristics of data structures.
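Such a placement decision can be sketched as follows. The region names and error rates below are invented for illustration; a real system would expose whatever reliability domains its memory devices provide.

```python
# Sketch of reliability-aware placement: each data structure carries a
# tolerable bit error rate (BER) as metadata, and the allocator picks the
# least reliable (cheapest) region that still satisfies it.
# Region names and error rates are illustrative assumptions.

REGIONS = [                      # (name, bit error rate), cheapest first
    ("low-cost", 1e-4),
    ("standard", 1e-9),
    ("ecc", 1e-15),
]

def place(tolerable_ber):
    """Return the first (cheapest) region meeting the target BER."""
    for name, ber in REGIONS:
        if ber <= tolerable_ber:
            return name
    return "ecc"                 # most reliable region as a fallback
```

Here the metadata lookup supplies `tolerable_ber` per data structure, so error-tolerant data (e.g., approximate or easily recomputed values) lands in cheaper memory while critical data stays in reliable memory.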