SATURDAY, MARCH 1
11:30 am – 1:00 pm |
1:00 pm – 2:00 pm | Acacia C
Customizing the OS Storage and Memory Stacks With eBPF
Asaf Cidon (Columbia University);
Speaker: Asaf Cidon
Abstract: In the 1980s and '90s, there was a surge of exciting operating system designs focused on how applications can customize and extend the kernel for their specific needs. While these designs never achieved direct widespread adoption, with the introduction of the eBPF framework we can now customize and extend a monolithic OS like Linux in ways similar to those envisioned by these earlier works. In this talk, I will cover our group's work on customizing the storage and memory stacks of Linux using eBPF to achieve much higher performance on common datacenter workloads. I will present our work on XRP and BPF-oF, which allow applications to execute user-defined storage functions, such as index lookups or aggregations, deep within the kernel, safely bypassing most of Linux's storage stack and speeding up systems like RocksDB by ~3×. I will also describe our ongoing work on customizing Linux's memory management, including frameworks that allow applications to customize the behavior of the Linux page cache and page fault handling mechanisms.
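A toy model of the kernel-bypass idea behind XRP (illustrative only; the real system uses eBPF programs hooked into the Linux NVMe driver, not this interface): a B-tree lookup from user space pays one syscall and one full storage-stack traversal per node visited, while an in-kernel lookup function chases the pointers itself and pays for a single submission.

```python
# Illustrative cost model of pushing an index lookup into the kernel.
# "Kernel crossings" stand in for syscall + storage-stack overhead.
# Function names and costs are hypothetical, not the XRP API.

def lookup_user_space(tree_depth):
    """User-space traversal: one read() round trip per B-tree level."""
    return tree_depth          # kernel crossings

def lookup_in_kernel(tree_depth):
    """XRP-style traversal: one submission, resubmissions stay in-kernel."""
    return 1                   # kernel crossings

depth = 4                      # a typical on-disk index depth
print(lookup_user_space(depth))   # 4 crossings
print(lookup_in_kernel(depth))    # 1 crossing
```

The gap grows linearly with index depth, which is why the savings compound for deep on-disk structures.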
Speaker bio: Asaf Cidon is an associate professor of EE and CS (jointly affiliated) at Columbia University. He has broad research interests in software systems and security. His group's research has been adopted in commercial systems used by many companies, including Meta, Apple, and Snowflake, and was recognized by best paper awards at OSDI, USENIX Security, CIDR, and ATC, and by the NSF CAREER and ARO Young Investigator awards. Prior to joining Columbia, Asaf spent several years in industry, where his last role was SVP and co-GM of Email Protection at Barracuda Networks. He joined Barracuda via the acquisition of Sookasa, a startup where he was the CEO and co-founder. He obtained a PhD and MS from Stanford and a BS from the Technion.
Chair: Hung-Wei Tseng
2:10 pm – 3:10 pm | Acacia C
2:10 pm – 3:10 pm | Acacia B
Chair: Yuanchao Xu
Architecting An Efficient Memory-Semantic CXL-based SSD with OS and Hardware Co-design
Haoyang Zhang (University of Illinois Urbana-Champaign);Yuqi Xue (University of Illinois Urbana-Champaign);Yirui Eric Zhou (University of Illinois Urbana-Champaign);Shaobo Li (University of Illinois Urbana-Champaign);Jian Huang (University of Illinois Urbana-Champaign);
Abstract: The CXL-based solid-state drive (CXL-SSD) provides a promising approach to scaling main memory capacity at low cost. However, the CXL-SSD faces performance challenges due to long flash access latency and unpredictable events such as garbage collection inside the SSD, which stall the host processor and waste compute cycles. Moreover, although the CXL interface enables byte-granular access to the SSD, flash chips are still accessed at page granularity due to physical limitations. This mismatch in access granularity causes significant unnecessary I/O traffic to the flash chips, worsening performance. In this work, we present SkyByte, an efficient CXL-based SSD that addresses these challenges holistically by co-designing the host operating system (OS) and the SSD controller. To alleviate long memory stalls when accessing the CXL-SSD, SkyByte revisits the OS context switch mechanism and enables opportunistic context switches upon detection of long access delays. To accommodate byte-granular accesses, SkyByte architects the internal DRAM of the SSD controller into a cacheline-level write log and a page-level data cache, and coalesces data upon log cleaning to reduce I/O traffic to the flash chips. SkyByte also employs optimizations including adaptive page migration, which exploits fast host memory by promoting hot pages from the CXL-SSD to the host. We implement SkyByte in a CXL-SSD simulator and evaluate its efficiency with various data-intensive applications. Our experiments show that SkyByte outperforms a current CXL-based SSD by 6.11×, and reduces I/O traffic to flash chips by 23.08× on average. SkyByte also reaches 75% of the performance of an ideal case with unlimited host DRAM capacity, offering an attractive, cost-effective solution.
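A minimal sketch of the coalescing effect described above (illustrative, not the actual controller logic): byte-granular CXL stores land in a cacheline-level log, and at log-cleaning time all entries destined for the same flash page are merged into one page-granularity flash write.

```python
# Sketch: counting flash-page writes with and without log coalescing.
# Addresses and the 4 KiB page size are illustrative assumptions.

PAGE = 4096

def flash_writes_naive(write_addrs):
    # one flash page write per cacheline-granularity store
    return len(write_addrs)

def flash_writes_coalesced(write_addrs):
    # one flash page write per distinct dirty page at cleaning time
    return len({addr // PAGE for addr in write_addrs})

writes = [0, 64, 128, 4096, 4160, 8192]   # six cacheline addresses
print(flash_writes_naive(writes))         # 6 page writes
print(flash_writes_coalesced(writes))     # 3 (only 3 distinct pages)
```

When small writes cluster within pages, the coalesced count approaches the number of touched pages rather than the number of stores, which is the source of the I/O-traffic reduction.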
Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link
Dong Xu (University of California, Merced);Dong Li (University of California, Merced);
Abstract: Deep learning (DL) models are becoming bigger, easily exceeding the memory capacity of a single accelerator. Recent progress in large DL training uses CPU memory as an extension of accelerator memory, offloading tensors to CPU memory to save accelerator memory. This solution transfers tensors between the two memories, creating a major performance bottleneck. We identify two problems during tensor transfers: (1) coarse-grained tensor transfer makes it difficult to hide transfer overhead, and (2) redundant transfers unnecessarily migrate value-unchanged bytes from CPU to accelerator. We introduce a cache-coherent interconnect based on Compute Express Link (CXL) to build a cache coherence domain between CPU memory and accelerator memory. By slightly extending CXL to support an update cache-coherence protocol and avoiding unnecessary data transfers, we reduce training time by 33.7% (up to 55.4%) without changing model convergence or accuracy, compared with state-of-the-art work in DeepSpeed.
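The redundant-transfer observation can be made concrete with a small sketch (illustrative only; the actual system does this in hardware via an update-based coherence protocol): when a tensor is brought back to the accelerator, only the bytes whose values changed on the CPU actually need to move.

```python
# Sketch: full-tensor transfer vs. transferring only value-changed bytes.
# The byte buffers below are toy stand-ins for offloaded tensor data.

def bytes_to_transfer(old, new):
    """Return (full-transfer bytes, changed-bytes-only transfer)."""
    full = len(new)
    changed = sum(1 for a, b in zip(old, new) if a != b)
    return full, changed

old = bytes([0, 1, 2, 3, 4, 5, 6, 7])
new = bytes([0, 1, 9, 3, 4, 5, 6, 8])   # two bytes were modified
print(bytes_to_transfer(old, new))       # (8, 2)
```

In training, gradients often touch only part of an offloaded tensor between transfers, so the changed-byte count can be far below the tensor size.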
FreqTier: Lightweight Adaptive Tiering for CXL Memory Systems
Kevin Song (University of Toronto);Jiacheng Yang (University of Toronto);Zixuan Wang (University of California San Diego);Jishen Zhao (University of California San Diego);Sihang Liu (University of Waterloo);Gennady Pekhimenko (University of Toronto / CentML);
Abstract: Modern workloads demand increasingly large memory capacity. Compute Express Link (CXL)-based memory tiering has emerged as a promising solution to this problem, using traditional DRAM alongside slower-tier CXL memory devices. We analyze prior tiering systems and observe two challenges for high-performance memory tiering: adapting to skewed but dynamically varying data hotness distributions while minimizing the memory and cache overhead of tiering. To address these challenges, we propose FreqTier, an adaptive and lightweight tiering system for CXL memory. FreqTier tracks both long-term data access frequency and short-term access momentum simultaneously to accurately capture and adapt to shifting hotness distributions. FreqTier reduces metadata memory overhead by tracking data accesses probabilistically, obtaining higher memory efficiency by trading off a small amount of tracking inaccuracy that has a negligible impact on application performance. To reduce cache overhead, FreqTier tracks data hotness with lightweight data structures optimized for data locality. Our evaluations show that FreqTier outperforms prior tiering systems by up to 91% (19% geomean), while incurring 2.0–7.8× less memory overhead and 1.7–3.5× fewer cache misses.
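To illustrate the probabilistic-tracking trade-off in the simplest possible form (this is a generic sampling counter, not FreqTier's actual data structure): recording only one in every S accesses shrinks the counter update traffic and metadata footprint by roughly S×, while the scaled-back estimate stays close to the true count for hot pages.

```python
# Sampling-based access counting: record 1 in SAMPLE accesses, then
# scale the counter back up when estimating hotness. Illustrative only.

import random

SAMPLE = 8   # sampling rate: an assumed, tunable parameter

def track(accesses, counters):
    for page in accesses:
        if random.randrange(SAMPLE) == 0:      # keep ~1/8 of events
            counters[page] = counters.get(page, 0) + 1

def estimated_count(counters, page):
    return counters.get(page, 0) * SAMPLE       # unbiased estimate

random.seed(0)
counters = {}
track([7] * 10000, counters)           # page 7 accessed 10,000 times
print(estimated_count(counters, 7))    # close to 10,000 in expectation
```

Cold pages may round to zero under sampling, but those are exactly the pages whose misclassification costs the least, which is why the accuracy loss has little performance impact.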
Chair: Dong Li
Scalable and Robust DNA-based Storage via Coding Theory and Deep Learning
Daniella Bar-Lev (University of California San Diego);Itai Orr (Technion - Israel Institute of Technology);Omer Sabary (Technion - Israel Institute of Technology);Tuvi Etzion (Technion - Israel Institute of Technology);Eitan Yaakobi (Technion - Israel Institute of Technology);
Abstract: The global data sphere is expanding exponentially, projected to hit 180 zettabytes by 2025, whereas current storage technologies are not anticipated to scale at nearly the same rate. DNA-based storage emerges as a crucial solution to this gap, enabling digital information to be archived in DNA molecules. This method enjoys major advantages over magnetic and optical storage solutions, such as exceptional information density, enhanced data durability, and negligible power consumption to maintain data integrity. To access the data, an information retrieval process is employed, whose main bottlenecks are scalability and accuracy, which naturally trade off against each other. Here we show a modular and holistic approach that combines deep neural networks (DNNs) trained on simulated data, tensor-product (TP) based error-correcting codes (ECC), and a safety-margin mechanism into a single coherent pipeline. We demonstrated our solution on 3.1 MB of information using two different sequencing technologies. Our work improves upon the current leading solutions by up to a 3200× increase in speed and a 40% improvement in accuracy, and offers a code rate of 1.6 bits per base in a high-noise regime. In a broader sense, our work shows a viable path toward commercial DNA storage solutions, which are currently hindered by the information retrieval process.
Cover Your Bases: How to Minimize the Sequencing Coverage in DNA Storage Systems
Daniella Bar-Lev (University of California San Diego);Omer Sabary (Technion - Israel Institute of Technology);Ryan Gabrys (University of California San Diego);Eitan Yaakobi (Technion - Israel Institute of Technology);
Abstract: Although the expenses associated with DNA sequencing have been rapidly decreasing, the current cost of sequencing information stands at roughly $120/GB, which is dramatically more expensive than reading from existing archival storage solutions today. In this work, we aim to reduce not only the cost but also the latency of DNA storage by introducing the DNA coverage depth problem, which seeks to reduce the number of reads required to retrieve information from the storage system. Under this framework, our main goal is to understand the effect of error-correcting codes and retrieval algorithms on the required sequencing coverage depth. We establish that the expected number of reads required for information retrieval is minimized when the channel follows a uniform distribution. We also derive upper and lower bounds on the probability distribution of this number of required reads and provide comprehensive upper and lower bounds on its expected value. We further prove that for a noiseless channel and uniform distribution, MDS codes are optimal in terms of minimizing the expected number of reads. Additionally, we study the DNA coverage depth problem under the random-access setup, in which the user aims to retrieve just a specific information unit from the entire DNA storage system. We prove that the expected retrieval time is at least k for [n,k] MDS codes as well as for other families of codes. Furthermore, we present explicit code constructions that achieve expected retrieval times below k and evaluate their performance through analytical methods and simulations. Lastly, we provide lower bounds on the maximum expected retrieval time. Our findings offer valuable insights for reducing the cost and latency of DNA storage.
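The uncoded, noiseless baseline of the coverage-depth question is the classic coupon-collector problem: drawing reads uniformly at random from n distinct strands, the expected number of reads until every strand has been seen is n·H_n, where H_n is the n-th harmonic number. (This is only the textbook baseline; the paper's bounds cover coded and noisy settings.)

```python
# Coupon-collector expectation: E[reads to see all n strands] = n * H_n.
# Exact rational arithmetic via fractions, then printed as a float.

from fractions import Fraction

def expected_reads(n):
    harmonic = sum(Fraction(1, i) for i in range(1, n + 1))
    return n * harmonic

print(float(expected_reads(100)))   # ~518.7 reads for 100 strands
```

The ~5.2× oversampling factor for n = 100 is exactly the kind of coverage overhead that erasure coding (e.g., the MDS constructions studied here) is meant to shrink, since a coded system needs only enough reads to cover any k of n strands.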
DNA Tails for Information Storage, Watermarking and Data Hiding
Chao Pan (Google);Kasra Tabatabaei (New England Biolabs);Jin Sima (UIUC);Alvaro Hernandez (UIUC);Charles Schroeder (UIUC);Olgica Milenkovic (UIUC);
Abstract: We describe the first prototype of a DNA-based data storage system that stores information by extending nicked locations on native double-stranded substrates into single-stranded oligos (tails), whose lengths represent the values of the symbols at the given sites. The extension process is based on enzymatic synthesis, with nicking and extension steps staggered to ensure proper control of the extension lengths. The system may be viewed as a molecular form of flash memory in which nicked locations serve as cells and the added nucleotides correspond to electrons, with the total cell charges approximated by the tail lengths. The tail-encoding paradigm can also be used for watermarking and data hiding, with combinations of tail locations encoding owner information. In addition to experimental results, we also describe typical tail-length errors and propose a new error-control coding scheme (based on rank modulation) that deals with the issue of inadequate (stumped) tail growth.
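The rank-modulation idea mentioned at the end can be sketched in a few lines (illustrative only; the paper's actual code design is more involved): information is carried by the relative ordering of tail lengths rather than their absolute values, so stumped growth that shrinks tails while preserving their order leaves the decoded permutation intact.

```python
# Rank-modulation decoding sketch: recover the permutation induced by
# sorting tail positions by length. Lengths below are made up.

def decode_ranks(tail_lengths):
    """Positions ordered from shortest to longest tail."""
    return tuple(sorted(range(len(tail_lengths)),
                        key=lambda i: tail_lengths[i]))

print(decode_ranks([30, 10, 20]))   # (1, 2, 0)
print(decode_ranks([28, 9, 18]))    # (1, 2, 0): same ranks after shrink
```

This order-based robustness is the same reason rank modulation was originally proposed for flash cell charges, which the abstract's flash-memory analogy makes a natural fit.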
3:10 pm – 3:30 pm |
3:30 pm – 4:50 pm | Acacia C
Chair: Jie Ren
FlexMem: Adaptive Page Profiling and Migration for Tiered Memory
Dong Xu (University of California, Merced);Dong Li (University of California, Merced);
Abstract: Tiered memory, which combines multiple memory components with different performance and capacity, provides a cost-effective way to increase memory capacity and improve memory utilization. Existing system software for managing tiered memory often has limitations: (1) rigid memory profiling methods that cannot capture emerging memory access patterns in time or that lose profiling quality, (2) rigid page demotion (i.e., the number of pages demoted is driven by an invariant requirement on free memory space), and (3) a rigid warm page range (i.e., emerging hot pages) that leads to unnecessary page demotion from fast to slow memory. To address these limitations, we introduce FlexMem, a page profiling and migration system for tiered memory. FlexMem combines performance counter-based and page-hinting-fault-based profiling to improve profiling quality, dynamically decides how many pages to demote based on the need to accommodate hot pages (i.e., frequently accessed pages), and dynamically decides the warm page range based on how often pages in the range are promoted to hot pages. We evaluate FlexMem with common memory-intensive benchmarks. Compared to the state of the art (Tiering-0.8, TPP, and MEMTIS), FlexMem improves performance by 32%, 23%, and 27% on average, respectively.
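The contrast between rigid and need-driven demotion can be shown with a toy policy comparison (illustrative assumptions throughout; this is not FlexMem's actual algorithm): a fixed free-memory watermark demotes the same number of pages regardless of demand, while sizing the demotion batch by the hot pages waiting for promotion avoids needless fast-to-slow traffic.

```python
# Toy demotion policies for a fast memory tier, counted in pages.

WATERMARK = 256   # rigid policy: always keep 256 pages free (assumed)

def demote_rigid(free_pages, hot_waiting):
    """Demote until the fixed free-space watermark is met."""
    return max(0, WATERMARK - free_pages)

def demote_by_need(free_pages, hot_waiting):
    """Demote only enough to make room for pending hot pages."""
    return max(0, hot_waiting - free_pages)

print(demote_rigid(100, 10))    # 156 pages demoted, mostly needlessly
print(demote_by_need(100, 10))  # 0: existing free space covers demand
```

When promotion pressure is low, the need-driven policy demotes nothing; when pressure spikes, it demotes exactly the shortfall, which is the adaptive behavior the abstract describes.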
P2Cache: An Application-Directed Page Cache for Data-Intensive Applications
Dusol Lee (Seoul National University);Inhyuk Choi (Seoul National University);Chanyoung Lee (Seoul National University);Sungjin Lee (DGIST);Jihong Kim (Seoul National University);
Abstract: We present P2Cache, a kernel-level page cache framework that enables developers to create custom kernel-level page caches tailored to the I/O patterns of their applications. P2Cache extends the Linux kernel page cache with new probe points that allow application-programmable page cache policies to be implemented as eBPF programs. Our evaluation demonstrates that custom page caches implemented with P2Cache deliver up to 32% performance gains in data-intensive graph applications, with minimal development effort.
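A sketch of what an application-directed cache policy buys (hypothetical interface; the real hooks are eBPF programs attached at kernel probe points): the cache delegates victim selection to an application-supplied policy, which can, for example, evict one-touch scan pages before the working set.

```python
# Toy eviction policies: default LRU vs. an application-aware policy
# that prefers to evict pages belonging to a sequential scan.
# All names here are illustrative, not the P2Cache API.

def evict_lru(cache, last_access):
    """Default: evict the least-recently-used page."""
    return min(cache, key=lambda p: last_access[p])

def evict_scan_resistant(cache, last_access, scan_pages):
    """Graph-workload policy: drop one-touch scan pages first."""
    victims = [p for p in cache if p in scan_pages]
    return victims[0] if victims else evict_lru(cache, last_access)

cache = [1, 2, 3]
last_access = {1: 10, 2: 50, 3: 7}       # page 3 is oldest
print(evict_lru(cache, last_access))                  # 3
print(evict_scan_resistant(cache, last_access, {2}))  # 2 (scan page)
```

LRU would keep the recently touched scan page and evict part of the working set; the application-directed policy avoids exactly that, which is where the gains on graph workloads come from.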
DPUF: DPU-accelerated Near-storage Secure Filtering
Narangerelt Batsoyol (UC San Diego);Daniel Waddington (IBM Research);Swami Sundararaman (IBM Research);Steven Swanson (UC San Diego);
Abstract: Querying data stored in cloud object stores like Amazon S3 often leads to network bottlenecks, particularly when large datasets must be transferred over wide area networks (WANs) for processing. Encryption further complicates this challenge by requiring entire encrypted objects to be fetched from the object store before analysis. To address this, we push down data filtering and perform secure computing near storage using a Data Processing Unit (DPU) integrated into the cloud server. We present DPUF, a DPU-assisted near-storage secure data filtering system that accelerates filtering by executing the query near the data and returning only its results. By using the DPU as a secure enclave that is dedicated to and solely trusted by the client, DPUF provides a secure means of filtering encrypted data near the shared (i.e., untrusted) storage system. Furthermore, our approach leverages on-board DPU accelerators and compute resources to maximize performance. DPUF achieves up to a 19.7× speedup over traditional client-side filtering. It can also reduce networking costs by up to 16×.
A Method to Achieve Hugely Parallel Compute in NAND Flash Memory
Oliver Hambrey (Siglead Europe Limited);
Abstract: We present a method to achieve compute using an off-the-shelf non-volatile memory designed for storage. As such, the effort to design and synthesize logic around the memory array to achieve in-memory compute (the approach of many other in-memory compute methods) is not needed, reducing cost and development time. The memory used is a 3D, 3-bit-per-cell NAND flash memory (NFM), so the in-memory compute method can be referred to as in-NAND compute (iNC). Because of the hugely parallel nature of NFM (typically, on the order of 2^17 cells can be programmed or read in parallel), iNC enables hugely parallel computation (for instance, the addition of 2^17 integers in parallel), and due to its non-volatile nature, the hope is that the overall power budget for such a method will be regarded as low. Additionally, the method uses standard NFM command sets, making it attractive for implementation in SSD controllers. Although iNC cannot be regarded as error-free in the way CPU compute is (because of the error-prone nature of NFM), the logic error rate may be low enough for applications robust to some level of computational error, such as classification problems using neural networks.
4:50 pm – 6:00 pm | Acacia C & D
6:00 pm | Acacia D
SUNDAY, MARCH 2
7:30 am – 8:30 am | Casuarina Ballroom
9:00 am – 10:00 am | Acacia C
The ISA of Modern Storage Systems: Interface, Specialization and Approaches
Jian Huang (UIUC);
Speaker: Jian Huang
Abstract: Storage systems today have grown into a complicated ecosystem that spans the development and deployment of storage devices, storage software, and application-level data stores. To rapidly meet ever-increasing performance and efficiency requirements, the entire hardware and software stack needs to adapt instantly in a coordinated fashion. This drives us to rethink the design and implementation approaches to building future storage systems. In this talk, I will present our recent studies on the data access Interface, architecture Specialization, and development Approaches (ISA) of storage systems, and discuss how they will reshape the storage ecosystem in the era of AI.
Speaker bio: Dr. Jian Huang is an Associate Professor and Y. T. Lo Faculty Fellow in the ECE department and an affiliated Associate Professor in the Siebel School of Computing and Data Science at the University of Illinois at Urbana-Champaign. He received his Ph.D. in Computer Science at Georgia Tech in 2017. His research interests include computer systems and architecture, sustainable AI infrastructure, memory/storage systems, data systems, systems security, distributed systems, and especially their intersections. He has long worked on memory/storage systems, and most recently he is leading a team working on sustainable AI infrastructures. His research contributions have been published at top-tier computer architecture and systems conferences such as ISCA, MICRO, ASPLOS, OSDI, and SOSP. His work received a USENIX Best Paper Award, a MICRO Best Paper Runner-Up, and IEEE Micro Top Picks (and Honorable Mentions). He also received the inaugural SIGMICRO Early Career Award, the NSF CAREER Award, the NSF CRII Award, the Dean’s Award for Early Innovation, the NetApp Faculty Fellowship Award, and the Google Faculty Research Award. He is a co-founder of the Workshop on Hot Topics in System Infrastructure (HotInfra).
Chair: Dong Li
10:00 am – 10:30 am | Acacia C
10:30 am – 11:50 am | Acacia C
10:30 am – 11:50 am | Acacia B
Chair: Sihang Liu
Caphammer: Exploiting and Protecting Capacitor Vulnerabilities in Energy Harvesting Systems
Jaeseok Choi (University of Central Florida);Hyunwoo Joe (Electronics and Telecommunications Research Institute(ETRI));Changhee Jung (Purdue University);Jongouk Choi (U. of Central Florida);
Abstract: An energy harvesting system (EHS) offers battery-less computation by capturing ambient energy from the environment and storing it in a capacitor as an energy buffer. However, an EHS often faces frequent power outages due to unreliable energy sources, making capacitors a critical component for intermittent computing. To ensure crash consistency across power failures, an EHS typically relies on just-in-time (JIT) checkpointing to save volatile state into non-volatile memory (NVM) for recovery upon power restoration. Unfortunately, capacitors are inherently vulnerable to frequent charging/discharging cycles and exposure to over-voltages. This paper introduces Caphammer, a hardware attack that exploits these capacitor vulnerabilities through repeated power cycling or over-voltage stress, leading to degradation, denial of service (DoS), data corruption, and encryption failures. We found that commodity EHS devices such as Powercast Wireless Sensor Nodes and TI-MSP430 platforms are susceptible to Caphammer. To address the Caphammer attack, the paper proposes FanCap, a proactive solution that isolates degraded capacitors for self-healing while maintaining system functionality. With minimal runtime and energy overhead, FanCap effectively mitigates Caphammer, ensuring reliable operation and addressing capacitor vulnerabilities in battery-free IoT devices.
Defending Against EMI Attacks on Just-In-Time Checkpoint for Resilient Intermittent Systems
Jaeseok Choi (U. of Central Florida);Hyunwoo Joe (Electronics and Telecommunications Research Institute(ETRI));Changhee Jung (Purdue University);Jongouk Choi (U. of Central Florida);
Abstract: Intermittent systems have emerged as an alternative to battery-powered IoT devices. However, due to unstable ambient energy sources, these systems suffer from frequent power failures. To address this, they utilize non-volatile memory (NVM) and employ power failure recovery mechanisms that checkpoint and restore volatile data, i.e., registers, across power outages. The majority of prior works employ a just-in-time (JIT) checkpoint protocol for crash consistency that detects an impending power outage and checkpoints volatile registers by leveraging a voltage monitor. Unfortunately, the voltage monitor is susceptible to electromagnetic interference (EMI), which can be exploited as an attack surface. In particular, by generating malicious EMI signals, adversaries can manipulate the output of the voltage monitor, leading to denial of service (DoS) and data corruption. In this paper, we examine the vulnerability of voltage monitors by injecting EMI signals into commodity platforms such as TI-MSP430 and STM ARM Cortex-M microcontrollers. Experiments confirm that all nine tested microcontrollers are susceptible to EMI attacks, resulting in DoS and data corruption. Existing countermeasures, while effective, are costly and impractical for resource-constrained intermittent systems. To this end, we introduce GECKO, a lightweight, compiler-directed solution. By leveraging idempotent processing and checkpoint pruning, GECKO mitigates EMI attacks while ensuring correct recovery without significant performance overhead.
Nonvolatile SRAM for Intermittent AI Inference: Magneto-Electric FET-Based In-Memory Computing Design
Deniz Najafi (New Jersey Institute of Technology);Shaahin Angizi (New Jersey Institute of Technology);
Abstract: This paper presents ME-SRAM, a novel nonvolatile SRAM design powered by Magneto-Electric FET (MEFET) technology, tailored for energy-efficient in-memory computing in intermittent AI inference scenarios. By leveraging the nonvolatile properties of MEFETs, ME-SRAM ensures data retention during power disruptions, enabling seamless system recovery without restarting computations. The architecture integrates efficient bulk bitwise operations and reduces static power during idle periods, supporting normally-off computing. Performance evaluations, including a binarized AlexNet case study, highlight ME-SRAM's significant advantages in energy efficiency, operational reliability, and resilience to process variations compared to existing solutions. This work establishes MEFET-based designs as a cornerstone for advancing energy-efficient AI and IoT applications.
Low-Overhead Security Protection for at-Rest PMOs
Derrick Greenspan (University of Central Florida);Naveed Ul Mustafa (New Mexico State University);Andres Delgado (University of Central Florida);Connor Bramham (University of Central Florida);Christopher Prats (University of Central Florida);Samu Wallace (University of Central Florida);Mark Heinrich (University of Central Florida College of Engineering and Computer Science);Yan Solihin (U. of Central Florida);
Abstract: Persistent Memory (PM) is nearly as fast as traditional volatile memory while being denser and capable of retaining data indefinitely. However, the long-lasting nature of PM means that, without encryption, it is vulnerable to data disclosure attacks. Recent research has introduced Persistent Memory Objects (PMOs) as an abstraction for PM that is both crash consistent and capable of being secured against corruption and data disclosure attacks while at rest. Previously presented PMO designs protect against these attacks using whole-PMO encryption/decryption with integrity verification (WEDI). While this design works, it is slow and inefficient. We are the first to address this performance problem. First, we observe that a PMO can be broken into pages and that we can adopt demand paging for PMO encryption (per-page encryption). Second, we explore the design space of per-page PMO encryption and integrity verification, which we refer to as Low Overhead at-rest PMO Protection (LOaPP), and discuss the trade-offs of each design. Third, we introduce a crash handler to ensure that PMOs remain secure even in the face of crashes. Our new design, with per-page encryption alone, outperforms whole-PMO encryption without integrity verification (WED) by 1.4× and 2.6× for two sets of evaluated workloads. Adding per-page integrity verification on top of per-page encryption outperforms the original WEDI design by 2.19× and 2.62×.
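The core observation behind per-page protection can be sketched with a simple cost model (illustrative only; page size, object size, and function names are assumptions, not the LOaPP design): whole-PMO decryption pays for every byte at attach time, while demand paging pays only for the pages the application actually touches.

```python
# Cryptographic work (in bytes processed) for whole-object vs.
# per-page, demand-driven PMO decryption.

PAGE = 4096   # assumed page size

def bytes_decrypted_whole(pmo_size, touched_pages):
    """WEDI-style: the entire PMO is decrypted up front."""
    return pmo_size

def bytes_decrypted_per_page(pmo_size, touched_pages):
    """Demand paging: decrypt only pages faulted in by the app."""
    return len(touched_pages) * PAGE

size = 1 << 20             # a 1 MiB PMO
touched = {0, 5, 17}       # pages actually accessed before detach
print(bytes_decrypted_whole(size, touched))     # 1048576 bytes
print(bytes_decrypted_per_page(size, touched))  # 12288 bytes
```

For sparse access patterns the per-page cost scales with the touched working set rather than the object size, which is consistent with the speedups reported over whole-PMO protection.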
Chair: Jie Ren
Outer Code Designs for Augmented and Local-Global Polar Code Architectures
Ziyuan Zhu (UCSD);Paul Siegel (UCSD);
Abstract: In this paper, we introduce two novel methods to design outer polar codes for two previously proposed concatenated polar code architectures: augmented polar codes and local-global polar codes. These methods include a stopping set (SS) construction and a nonstationary density evolution (NDE) construction. Simulation results demonstrate the advantage of these methods over previously proposed constructions based on density evolution (DE) and LLR evolution.
PASS: A Persistence-Aware Spiral Storage
Wenyu Peng (San Diego State University);Tao Xie (San Diego State University);Paul Siegel (UCSD);
Abstract: The advent of byte-addressable persistent memory (PM) has led to a resurgence of interest in adapting existing dynamic hashing schemes to PM. Compared with its two well-known peers, extendible hashing and linear hashing, spiral storage has received little attention due to its limitations. After an in-depth analysis, however, we discover that it has good potential for PM. To show its strength, we develop a persistent spiral storage scheme called PASS (Persistence-Aware Spiral Storage), facilitated by a group of new and existing techniques. Further, we conduct a comprehensive evaluation of PASS on a server equipped with Intel Optane DC Persistent Memory Modules (DCPMM). Experimental results demonstrate that it outperforms two state-of-the-art schemes.
Generalizing Functional Error Correction for Language Models
Wenyu Peng (University of California, San Diego);Simeng Zheng (University of California, San Diego);Michael Baluja (University of California, San Diego);Tao Xie (San Diego State University);Anxiao Jiang (TAMU);Paul Siegel (UCSD);
Abstract: The goal of functional error correction is to preserve neural network performance when stored network weights are corrupted by noise. To achieve this goal, a selective protection (SP) scheme was proposed to optimally protect the functionally important bits in binary weight representations in a layer-dependent manner. Although it showed its effectiveness in image classification tasks on some relatively simple networks such as ResNet-18 and VGG-16, it becomes inadequate for emerging complex machine learning tasks generated from natural language processing and vision-language association domains. To solve this problem, we extend the SP scheme in three directions: task complexity, model complexity, and storage complexity. Extensions to complex natural language and vision-language tasks include text categorization and “zero-shot” textual classification of images. Extensions to more complex models with deeper block structures and attention mechanisms consist of Very Deep Convolutional Neural Network (VDCNN) and Contrastive Language-Image Pre-Training (CLIP) networks. Extensions to more complex storage configurations focus on distributed storage architectures to support model parallelism. Experimental results show that the optimized SP scheme preserves network performance in all of these settings. The results also provide insights into redundancy-performance trade-offs, generalizability of SP across datasets and tasks, and robustness of partitioned network architectures.
12:00 pm – 1:30 pm | Acacia D
1:30 pm – 2:50 pm | Acacia B
Chair: Oliver Hambrey
Spatio-Temporal Generator for Flash Memory Systems Using Conditional Generative Nets
Simeng Zheng (University of California, San Diego);Chih-Hui Ho (University of California, San Diego);Wenyu Peng (University of California, San Diego);Paul Siegel (University of California, San Diego);
Abstract: In this work, we propose Flash-Gen, a data-driven approach to generating flash memory read voltages in both space and time using conditional generative networks. This generative modeling method reconstructs read voltages from an individual memory cell based on the program levels of the cell and its surrounding cells, as well as the time stamp, in a time-efficient, resource-saving, and function-comprehensive manner. We propose a flash system optimization procedure, referred to as the Flash-Gen coding workflow, that leverages reconstructed read voltages for the development of error correction codes (ECCs) and constrained codes. Experimental results demonstrate that the model accurately captures the complex spatial and temporal features of the flash memory channel. The Flash-Gen coding workflow can effectively address a range of important tasks, including threshold determination, coding performance estimation, and pattern characterization.
STRAW: A Stress-Aware WL-Based Read Reclaim Technique for High-Density NAND Flash-Based SSDs
Myoungjun Chun (Seoul National University);Jaeyong Lee (Seoul National University);Inhyuk Choi (Seoul National University);Jisung Park (POSTECH);Myungsuk Kim (Kyungpook National University);Jihong Kim (Seoul National University);
Abstract: Although read disturbance has emerged as a major reliability concern, managing read disturbance in modern NAND flash memory has not been thoroughly investigated yet. From a device characterization study using real modern NAND flash memory, we observe that reading a page incurs heterogeneous reliability impacts on each WL, which makes the existing block-level read reclaim extremely inefficient. We propose a new WL-level read-reclaim technique, called STRAW, which keeps track of the accumulated read-disturbance effect on each WL and reclaims only heavily-disturbed WLs. By avoiding unnecessary read-reclaim operations, STRAW reduces read-reclaim-induced page writes by 83.6% with negligible storage overhead.
Improving Read Performance of Modern SSDs by Enabling Read-Retry in Flash Chips
Jaeyong Lee (Seoul National University);Myoungjun Chun (Seoul National University);Myungsuk Kim (Kyungpook National University);Jisung Park (POSTECH);Jihong Kim (Seoul National University);
Abstract: Modern high-performance SSDs have multiple flash channels operating in parallel to achieve their high I/O bandwidth. However, when the effective bandwidth of these flash channels declines, the SSD's overall bandwidth is substantially impacted. In contemporary SSDs featuring high-density 3D NAND flash memory, frequent invocations of a read-retry procedure pose a significant challenge to fully utilizing the maximum I/O bandwidth of a flash channel. In this paper, we propose a novel read-retry optimization scheme, Retry-in-Flash (RiF), which proactively minimizes the amount of time wasted in conventional read-retry procedures. Unlike existing read-retry solutions that focus on identifying an optimal read-reference voltage for a sensed page, the RiF scheme focuses on determining early on whether a read-retry will be required for the sensed data. To determine as early as possible whether a read-retry is needed, we propose a RiF-enabled flash chip with an on-die early-retry (ODEAR) engine. When the ODEAR engine determines that a sensed page requires a read-retry, a read-reference voltage is immediately adjusted, and the same page is re-read, discarding the previously sensed data. By performing the key steps of a read-retry procedure inside a RiF flash chip without transferring the sensed uncorrectable page to an off-chip controller, the RiF scheme prevents the read bandwidth of a flash channel from being wasted on failed read data. To evaluate the RiF scheme, we developed a prototype RiF-enabled flash chip and constructed a RiF-aware SSD simulator using RiF flash chips. Our evaluation results show that the proposed RiF scheme improves the effective SSD bandwidth by 72.1% on average over a state-of-the-art read-retry solution at 2K P/E cycles with negligible power and area overheads.
Giving 2D and 3D NAND Flash a Second Life With Page Isolation
Muhammed Ceylan Morgul (University of Virginia);Matchima Buddhanoy (Colorado State University);Biswajit Ray (Colorado State University);Mircea R. Stan (University of Virginia);
Abstract: Page Isolation, which can be implemented statically in the FTL, reduces the capacity loss caused by retiring aged blocks: it enables some pages of these blocks to be used for at least 6.25× their anticipated lifetime, improving sustainability.
2:50 pm – 3:10 pm | Acacia B
3:10 pm – 4:50 pm | Acacia B
Chair: Dong Li
A Persistent Double-Ended Queue based on Software Combining
Panagiota Fatourou (FORTH ICS and University of Crete CSD, Greece);Petros Papadogiannakis (FORTH ICS and University of Crete CSD, Greece);
Abstract: This work presents PerDeque, the first specialized persistent double-ended queue (deque) for settings with Non-Volatile Memory (NVM). Our implementation persists data in NVM to support recovery of the deque's state after system crashes. Experiments show that PerDeque does not pay any significant persistence overhead in most cases. Moreover, it exhibits good performance and is highly scalable.
LightWSP: Whole-System Persistence on the Cheap
Yuchen Zhou (Purdue University);Jianping Zeng (Samsung Semiconductor);Changhee Jung (Purdue University);
Abstract: Whole-system persistence (WSP) has recently attracted growing interest thanks to its transparency and performance benefits over partial-system persistence, in which users are not only burdened by complex persistent programming but also unable to use DRAM as an LLC. Nevertheless, existing WSP work either introduces high hardware cost or incurs non-trivial performance overhead. To this end, this paper presents LightWSP, a compiler/architecture co-design scheme that achieves WSP in a lightweight yet performant manner. The LightWSP compiler partitions a program into a series of recoverable regions (epochs) with their live-out registers checkpointed, while the LightWSP hardware persists the regions' stores (whose boundaries serve as power failure recovery points), enforcing crash consistency. LightWSP leverages the battery-backed write pending queue (WPQ) of a memory controller as a redo buffer: all stores are first buffered in the WPQ and then persisted together in non-volatile memory (NVM) at each region end. In this way, no matter when a power failure happens, NVM is never corrupted by the stores of the power-interrupted region, facilitating correct recovery. In particular, LightWSP supports multiple memory controllers on the cheap, without the costly speculation/misspeculation handling mechanisms used by prior work. Experimental results with 38 applications show that LightWSP incurs only 9.0% run-time overhead on average. This is on par with the state-of-the-art work, which significantly complicates the core microarchitecture with its intrusive design for memory controller speculation, yet the hardware cost of LightWSP is near zero (0.5 B per core).
Hybrid Power Failure Recovery for Intermittent Computing
Gan Fang (Purdue University);Jongouk Choi (U. of Central Florida);Changhee Jung (Purdue University);
Abstract: Energy harvesting systems rely on either rollback or roll-forward recovery to correctly resume a power-interrupted program. However, both recovery schemes have inherent drawbacks. To this end, this paper presents RollSwitch, a hybrid power failure recovery scheme that achieves low-cost yet high-performance intermittent computation for energy harvesting systems. Based on the underlying energy harvesting condition, RollSwitch dynamically switches between rollback and roll-forward recovery modes to maximize performance. In particular, RollSwitch leverages the level of available energy in the capacitor as a proxy for determining the appropriate recovery mode. For this purpose, RollSwitch devises a simple capacitor energy predictor whose outcome governs the recovery mode selection in the near future. Experimental results demonstrate that RollSwitch achieves 15.0% and 19.8% average performance gains over the state-of-the-art rollback and roll-forward recovery schemes, respectively.