- research-article, June 2024
Decoupled Vector Runahead for Prefetching Nested Memory-Access Chains
Decoupled vector runahead (DVR) exploits massive amounts of memory-level parallelism to improve the performance of applications that feature indirect memory accesses by dynamically inferring loop bounds at runtime, recognizing striding loads, and ...
- research-article, December 2023
Decoupled Vector Runahead
MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, Pages 17–31, https://doi.org/10.1145/3613424.3614255
We present Decoupled Vector Runahead (DVR), an in-core prefetching technique, executing separately to the main application thread, that exploits massive amounts of memory-level parallelism to improve the performance of applications featuring indirect ...
- research-article, July 2022
Vector Runahead for Indirect Memory Accesses
Vector runahead delivers extremely high memory-level parallelism even for the chains of dependent memory accesses with complex intermediate address computation, which conventional runahead techniques fundamentally cannot handle and, therefore, have ...
- research-article, June 2022
VMT: Virtualized Multi-Threading for Accelerating Graph Workloads on Commodity Processors
IEEE Transactions on Computers (ITCO), Volume 71, Issue 6, Pages 1386–1398, https://doi.org/10.1109/TC.2021.3086069
Modern-day graph workloads operate on huge graphs through pointer chasing which leads to high last-level cache (LLC) miss rates and limited memory-level parallelism (MLP). Simultaneous Multi-Threading (SMT) effectively hides the memory access latencies ...
- research-article, January 2022
The Forward Slice Core: A High-Performance, Yet Low-Complexity Microarchitecture
ACM Transactions on Architecture and Code Optimization (TACO), Volume 19, Issue 2, Article No. 17, Pages 1–25, https://doi.org/10.1145/3499424
Superscalar out-of-order cores deliver high performance at the cost of increased complexity and power budget. In-order cores, in contrast, are less complex and have a smaller power budget, but offer low performance. A processor architecture should ideally ...
- research-article, November 2021
Vector Runahead
ISCA '21: Proceedings of the 48th Annual International Symposium on Computer Architecture, Pages 195–208, https://doi.org/10.1109/ISCA52012.2021.00024
The memory wall places a significant limit on performance for many modern workloads. These applications feature complex chains of dependent, indirect memory accesses, which cannot be picked up by even the most advanced microarchitectural prefetchers. ...
- research-article, September 2020
The Forward Slice Core Microarchitecture
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 361–372, https://doi.org/10.1145/3410463.3414629
Superscalar out-of-order cores deliver high performance at the cost of increased complexity and power budget. In-order cores, in contrast, are less complex and have a smaller power budget, but offer low performance. A processor architecture should ...
- research-article, January 2019
Precise Runahead Execution
IEEE Computer Architecture Letters (ICAL), Volume 18, Issue 1, Pages 71–74, https://doi.org/10.1109/LCA.2019.2910518
Runahead execution improves processor performance by accurately prefetching long-latency memory accesses. When a long-latency load causes the instruction window to fill up and halt the pipeline, the processor enters runahead mode and keeps speculatively ...
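Several of the abstracts above target loops in which a striding load produces the index for a dependent, indirect load. As a rough, hypothetical illustration (not the mechanism of any listed paper), the Python sketch below shows such a stride-indirect chain, `a[b[i]]`, and how the addresses that future iterations will touch can be computed ahead of time, in one batch, from the striding index array. This is the kind of memory-level parallelism that runahead-style prefetchers try to expose. The function names and the `lookahead`, `lanes`, `base`, and `elem_size` parameters are assumptions chosen for illustration.

```python
from typing import List


def indirect_sum(a: List[int], b: List[int]) -> int:
    """Stride-indirect loop: b[i] is a striding load, a[b[i]] depends on its value."""
    total = 0
    for i in range(len(b)):
        # The index b[i] must be loaded before the address of a[b[i]] is known.
        total += a[b[i]]
    return total


def future_indirect_addresses(b: List[int], i: int, lookahead: int,
                              lanes: int, base: int, elem_size: int) -> List[int]:
    """Hypothetical helper: byte addresses of a[...] that iterations
    i+lookahead .. i+lookahead+lanes-1 will access, computed in one
    vector-like batch from the striding index array b.

    Assumes a[] starts at byte address `base` with `elem_size`-byte elements.
    """
    start = i + lookahead
    stop = min(start + lanes, len(b))  # do not run past the end of the loop
    return [base + b[j] * elem_size for j in range(start, stop)]


if __name__ == "__main__":
    a = [10, 20, 30, 40, 50]
    b = [4, 0, 3, 1, 2, 4, 0, 3]
    print(indirect_sum(a, b))
    print(future_indirect_addresses(b, i=0, lookahead=2, lanes=4,
                                    base=0x1000, elem_size=8))
```

The listed papers address this dependence in hardware; the sketch only shows why the dependent load's address is unknown until the index load completes, and how knowing the stride lets several future indices be fetched at once.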