research-article

DSPatch: Dual Spatial Pattern Prefetcher

Authors:

Sreenivas Subramoney

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

Pages 531 - 544

https://doi.org/10.1145/3352460.3358325

Published: 12 October 2019

Abstract

High main memory latency continues to limit performance of modern high-performance out-of-order cores. While DRAM latency has remained nearly the same over many generations, DRAM bandwidth has grown significantly due to higher frequencies, newer architectures (DDR4, LPDDR4, GDDR5) and 3D-stacked memory packaging (HBM). Current state-of-the-art prefetchers do not do well in extracting higher performance when higher DRAM bandwidth is available. Prefetchers need the ability to dynamically adapt to available bandwidth, boosting prefetch count and prefetch coverage when headroom exists and throttling down to achieve high accuracy when the bandwidth utilization is close to peak.

To this end, we present the Dual Spatial Pattern Prefetcher (DSPatch) that can be used as a standalone prefetcher or as a lightweight adjunct spatial prefetcher to the state-of-the-art delta-based Signature Pattern Prefetcher (SPP). DSPatch builds on a novel and intuitive use of modulated spatial bit-patterns. The key idea is to: (1) represent program accesses on a physical page as a bit-pattern anchored to the first "trigger" access, (2) learn two spatial access bit-patterns: one biased towards coverage and another biased towards accuracy, and (3) select one bit-pattern at run-time based on the DRAM bandwidth utilization to generate prefetches. Across a diverse set of workloads, using only 3.6KB of storage, DSPatch improves performance over an aggressive baseline with a PC-based stride prefetcher at the L1 cache and the SPP prefetcher at the L2 cache by 6% (9% in memory-intensive workloads and up to 26%). Moreover, the performance of DSPatch+SPP scales with increasing DRAM bandwidth, growing from 6% over SPP to 10% when DRAM bandwidth is doubled.
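
The three steps of the key idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's hardware design: the 64-line page, the rotation used for anchoring, the OR/AND update rules, the `DualPattern` class, and the 0.75 bandwidth threshold are all assumptions made for the example and are not specified in the abstract.

```python
PAGE_LINES = 64                      # assumption: 4 KB page, 64 B cache lines
MASK = (1 << PAGE_LINES) - 1


def rotate_left(bits, k):
    """Rotate a PAGE_LINES-wide bit-pattern left by k positions."""
    k %= PAGE_LINES
    return ((bits << k) | (bits >> (PAGE_LINES - k))) & MASK


def rotate_right(bits, k):
    """Rotate a PAGE_LINES-wide bit-pattern right by k positions."""
    return rotate_left(bits, PAGE_LINES - (k % PAGE_LINES))


def to_bits(lines):
    """Build a page bit-pattern from a list of accessed line offsets."""
    return sum(1 << line for line in set(lines))


class DualPattern:
    """Hypothetical holder for one coverage-biased and one accuracy-biased pattern."""

    def __init__(self):
        self.cov = 0        # grows via OR: remembers every line ever seen
        self.acc = None     # shrinks via AND: keeps only lines seen every time

    def train(self, page_pattern, trigger):
        # Step 1: anchor the pattern so bit 0 is the trigger access.
        anchored = rotate_right(page_pattern, trigger)
        # Step 2: update both patterns (assumed update rules).
        self.cov |= anchored
        self.acc = anchored if self.acc is None else self.acc & anchored

    def predict(self, trigger, bw_util, threshold=0.75):
        # Step 3: pick a pattern from the DRAM bandwidth utilization (0.0 to 1.0).
        chosen = self.acc if bw_util >= threshold else self.cov
        if not chosen:
            return []
        absolute = rotate_left(chosen, trigger)   # undo the anchoring
        return [i for i in range(PAGE_LINES)
                if (absolute >> i) & 1 and i != trigger]


p = DualPattern()
p.train(to_bits([4, 5, 6, 8]), trigger=4)
p.train(to_bits([10, 11, 14]), trigger=10)
print(p.predict(trigger=20, bw_util=0.2))   # [21, 22, 24]  (coverage-biased)
print(p.predict(trigger=20, bw_util=0.9))   # [21, 24]      (accuracy-biased)
```

With bandwidth headroom the sketch issues the larger, coverage-biased set of prefetches; near peak utilization it falls back to the smaller set of lines that appeared in every training pattern.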

Index Terms

DSPatch: Dual Spatial Pattern Prefetcher
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Processors and memory architectures

Information & Contributors

Information

Published In

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

October 2019

1104 pages

ISBN:9781450369381

DOI:10.1145/3352460

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 October 2019

Qualifiers

Research-article
Research
Refereed limited

Conference

MICRO '52

Sponsor:

SIGMICRO

MICRO '52: The 52nd Annual IEEE/ACM International Symposium on Microarchitecture

October 12 - 16, 2019

Columbus, OH, USA

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 42
Total Downloads: 956
Downloads (Last 12 months): 114
Downloads (Last 6 weeks): 18

Reflects downloads up to 12 Dec 2024

Citations

Cited By

Saglam B, Ho N, Falquez C, Portero A, Schätzle F, Suarez E, Pleiter D (2024) Data Prefetching on Processors with Heterogeneous Memory. Proceedings of the International Symposium on Memory Systems (45-60). Online publication date: 30-Sep-2024
https://dl.acm.org/doi/10.1145/3695794.3695800
Cui Y, Chen W, Cheng X, Yi J (2024) Hyperion: A Highly Effective Page and PC Based Delta Prefetcher. ACM Transactions on Architecture and Code Optimization 21:4 (1-27). Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.1145/3675398
He S, Wang Z, Tang X, Sun Q, Dong D (2024) Chimera: Leveraging Hybrid Offsets for Efficient Data Prefetching. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques (144-155). Online publication date: 14-Oct-2024
https://dl.acm.org/doi/10.1145/3656019.3689613
Martínez Palau F, Torrents M, Armejach A, Casas M (2024) Exploiting Vector Code Semantics for Efficient Data Cache Prefetching. Proceedings of the 38th ACM International Conference on Supercomputing (98-109). Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3650200.3656635
Liu Y, Chen M, De V (2024) Planaria: Pattern Directed Cross-page Composite Prefetcher. Proceedings of the 61st ACM/IEEE Design Automation Conference (1-6). Online publication date: 23-Jun-2024
https://dl.acm.org/doi/10.1145/3649329.3656499
Xue F, Han C, Li X, Wu J, Zhang T, Liu T, Hao Y, Du Z, Guo Q, Zhang F (2024) Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses. ACM Transactions on Architecture and Code Optimization. Online publication date: 22-Jan-2024
https://dl.acm.org/doi/10.1145/3641853
Lin Y, Lin W, Xu J, Chen Y, Jin Z, Qin J, He J, Cai S, Zhang Y, Wang Z, Chen W (2024) PARS: A Pattern-Aware Spatial Data Prefetcher Supporting Multiple Region Sizes. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:11 (3638-3649). Online publication date: Nov-2024
https://doi.org/10.1109/TCAD.2024.3442981
Cebrian J, Jahre M, Ros A (2024) Temporarily Unauthorized Stores: Write First, Ask for Permission Later. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO) (810-822). Online publication date: 2-Nov-2024
https://doi.org/10.1109/MICRO61859.2024.00065
Bera R, Ranganathan A, Rakshit J, Mahto S, Nori A, Gaur J, Olgun A, Kanellopoulos K, Sadrosadati M, Subramoney S, Mutlu O (2024) Constable: Improving Performance and Power Efficiency by Safely Eliminating Load Instruction Execution. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) (88-102). Online publication date: 29-Jun-2024
https://doi.org/10.1109/ISCA59077.2024.00017
Fu G, Xia T, Luo Z, Chen R, Zhao W, Ren P (2024) Differential-Matching Prefetcher for Indirect Memory Access. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (439-453). Online publication date: 2-Mar-2024
https://doi.org/10.1109/HPCA57654.2024.00040
