DOI: 10.1145/3352460.3358325
research-article

DSPatch: Dual Spatial Pattern Prefetcher

Published: 12 October 2019

Abstract

High main memory latency continues to limit the performance of modern high-performance out-of-order cores. While DRAM latency has remained nearly the same over many generations, DRAM bandwidth has grown significantly due to higher frequencies, newer architectures (DDR4, LPDDR4, GDDR5) and 3D-stacked memory packaging (HBM). Current state-of-the-art prefetchers struggle to extract additional performance when higher DRAM bandwidth is available. Prefetchers need the ability to dynamically adapt to available bandwidth, boosting prefetch count and prefetch coverage when headroom exists and throttling down to achieve high accuracy when the bandwidth utilization is close to peak.
To this end, we present the Dual Spatial Pattern Prefetcher (DSPatch) that can be used as a standalone prefetcher or as a lightweight adjunct spatial prefetcher to the state-of-the-art delta-based Signature Pattern Prefetcher (SPP). DSPatch builds on a novel and intuitive use of modulated spatial bit-patterns. The key idea is to: (1) represent program accesses on a physical page as a bit-pattern anchored to the first "trigger" access, (2) learn two spatial access bit-patterns: one biased towards coverage and another biased towards accuracy, and (3) select one bit-pattern at run-time based on the DRAM bandwidth utilization to generate prefetches. Across a diverse set of workloads, using only 3.6KB of storage, DSPatch improves performance over an aggressive baseline with a PC-based stride prefetcher at the L1 cache and the SPP prefetcher at the L2 cache by 6% (9% in memory-intensive workloads and up to 26%). Moreover, the performance of DSPatch+SPP scales with increasing DRAM bandwidth, growing from 6% over SPP to 10% when DRAM bandwidth is doubled.
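The three-step key idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the two bit-patterns are learned, so the OR-based coverage pattern, the AND-based accuracy pattern, the 64-line page, the 0.75 utilization threshold, and all names (`anchor`, `DualPattern`, `prefetch_offsets`) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List

LINES_PER_PAGE = 64         # assumption: 4 KB page with 64 B cache lines
HIGH_BW_UTILIZATION = 0.75  # assumption: hypothetical switch-over threshold


def anchor(offsets: Iterable[int], trigger: int) -> int:
    """Step 1: encode accessed line offsets as a bit-pattern whose bit 0
    is the first ("trigger") access; other bits are relative to it."""
    pattern = 0
    for off in offsets:
        pattern |= 1 << ((off - trigger) % LINES_PER_PAGE)
    return pattern


@dataclass
class DualPattern:
    """Step 2: keep two learned patterns (assumed OR / AND modulation)."""
    coverage: int = 0
    accuracy: int = 0
    trained: bool = False

    def learn(self, observed: int) -> None:
        if not self.trained:
            # First observation seeds both patterns.
            self.coverage = self.accuracy = observed
            self.trained = True
        else:
            self.coverage |= observed  # grows: biased towards coverage
            self.accuracy &= observed  # shrinks: biased towards accuracy

    def select(self, bw_utilization: float) -> int:
        """Step 3: pick a pattern from the DRAM bandwidth utilization."""
        if bw_utilization >= HIGH_BW_UTILIZATION:
            return self.accuracy
        return self.coverage


def prefetch_offsets(pattern: int, trigger: int) -> List[int]:
    """Map an anchored pattern back to line offsets within the page,
    skipping bit 0 (the trigger access itself)."""
    return [(trigger + i) % LINES_PER_PAGE
            for i in range(1, LINES_PER_PAGE) if (pattern >> i) & 1]
```

For example, after learning accesses at lines {4, 5, 6, 8} (trigger 4) and {10, 11, 13} (trigger 10), the coverage pattern holds relative offsets 1 to 4 while the accuracy pattern holds only offset 1, so a new trigger at line 20 yields prefetches for lines 21 to 24 at low utilization and only line 21 at high utilization.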

Published In

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
October 2019
1104 pages
ISBN: 9781450369381
DOI: 10.1145/3352460
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data prefetching
  2. memory latency
  3. microarchitecture

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MICRO '52
Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Article Metrics

  • Downloads (Last 12 months): 114
  • Downloads (Last 6 weeks): 18

Reflects downloads up to 12 Dec 2024

Cited By

  • (2024) Data Prefetching on Processors with Heterogeneous Memory. Proceedings of the International Symposium on Memory Systems, pp. 45-60. DOI: 10.1145/3695794.3695800. Online publication date: 30-Sep-2024
  • (2024) Hyperion: A Highly Effective Page and PC Based Delta Prefetcher. ACM Transactions on Architecture and Code Optimization 21:4, pp. 1-27. DOI: 10.1145/3675398. Online publication date: 1-Jul-2024
  • (2024) Chimera: Leveraging Hybrid Offsets for Efficient Data Prefetching. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pp. 144-155. DOI: 10.1145/3656019.3689613. Online publication date: 14-Oct-2024
  • (2024) Exploiting Vector Code Semantics for Efficient Data Cache Prefetching. Proceedings of the 38th ACM International Conference on Supercomputing, pp. 98-109. DOI: 10.1145/3650200.3656635. Online publication date: 30-May-2024
  • (2024) Planaria: Pattern Directed Cross-page Composite Prefetcher. Proceedings of the 61st ACM/IEEE Design Automation Conference, pp. 1-6. DOI: 10.1145/3649329.3656499. Online publication date: 23-Jun-2024
  • (2024) Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3641853. Online publication date: 22-Jan-2024
  • (2024) PARS: A Pattern-Aware Spatial Data Prefetcher Supporting Multiple Region Sizes. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:11, pp. 3638-3649. DOI: 10.1109/TCAD.2024.3442981. Online publication date: Nov-2024
  • (2024) Temporarily Unauthorized Stores: Write First, Ask for Permission Later. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 810-822. DOI: 10.1109/MICRO61859.2024.00065. Online publication date: 2-Nov-2024
  • (2024) Constable: Improving Performance and Power Efficiency by Safely Eliminating Load Instruction Execution. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 88-102. DOI: 10.1109/ISCA59077.2024.00017. Online publication date: 29-Jun-2024
  • (2024) Differential-Matching Prefetcher for Indirect Memory Access. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 439-453. DOI: 10.1109/HPCA57654.2024.00040. Online publication date: 2-Mar-2024
