Data prefetching by dependence graph precomputation

Authors:

Murali Annavaram,

Jignesh M. Patel,

Edward S. Davidson

ACM SIGARCH Computer Architecture News, Volume 29, Issue 2

Pages 52 - 61

https://doi.org/10.1145/384285.379251

Published: 01 May 2001

Abstract

Data cache misses reduce the performance of wide-issue processors by stalling the data supply to the processor. Prefetching data by predicting the miss address is one way to tolerate cache miss latencies. However, current applications with irregular access patterns make it difficult to predict the address accurately and early enough to mask large cache miss latencies. This paper explores an alternative to predicting prefetch addresses, namely precomputing them. The Dependence Graph Precomputation scheme (DGP) introduced in this paper is a novel approach for dynamically identifying and precomputing the instructions that determine the addresses accessed by those load/store instructions marked as being responsible for most data cache misses. DGP's dependence graph generator efficiently generates the required dependence graphs at run time. A separate precomputation engine executes these graphs to generate the data addresses of the marked load/store instructions early enough for accurate prefetching. Our results show that 94% of the prefetches issued by DGP are useful, reducing the D-cache miss stall time by 47%. Thus DGP takes us about halfway from an already highly tuned baseline system toward perfect D-cache performance. DGP improves the overall performance of a wide range of applications by 7% over tagged next-line prefetching, by 13% over a baseline processor with no prefetching, and is within 15% of the perfect D-cache performance.
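The idea of precomputing an address from a dependence graph can be illustrated with a small software sketch. This is not the hardware design from the paper: the `Instr` record, the register names, the function names, and the four-instruction window are all hypothetical, and the toy machine has only register-to-register operations (loads inside the graph are left out). The sketch walks backward from a marked load to collect the instructions that produce its address operands, then runs only those instructions on a private copy of the registers to obtain the address that could be prefetched.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """One instruction of a toy register machine (hypothetical format)."""
    dest: Optional[str]        # register written, or None for the marked load
    srcs: Tuple[str, ...]      # registers read
    op: Callable[..., int]     # result value, or the address for the marked load
    is_marked_load: bool = False


def build_dependence_graph(window: List[Instr], target_idx: int) -> List[int]:
    """Return, in program order, the indices of instructions in the window
    that the target's address operands depend on (a backward walk)."""
    needed = set(window[target_idx].srcs)
    chosen: List[int] = []
    for i in range(target_idx - 1, -1, -1):
        ins = window[i]
        if ins.dest is not None and ins.dest in needed:
            chosen.append(i)
            needed.discard(ins.dest)   # this producer satisfies the register
            needed.update(ins.srcs)    # but its own inputs are now needed
    chosen.reverse()
    return chosen


def precompute_address(window: List[Instr], target_idx: int,
                       regs: Dict[str, int]) -> int:
    """Execute only the dependence graph on a scratch copy of the registers
    and return the address the marked load would access."""
    scratch = dict(regs)  # architectural state is never modified
    for i in build_dependence_graph(window, target_idx):
        ins = window[i]
        scratch[ins.dest] = ins.op(*(scratch[r] for r in ins.srcs))
    target = window[target_idx]
    return target.op(*(scratch[r] for r in target.srcs))


# Hypothetical instruction window; the last entry is the marked load.
window = [
    Instr("r1", ("r0",), lambda r0: r0 + 8),       # r1 = r0 + 8
    Instr("r5", ("r4",), lambda r4: r4 * 3),       # unrelated to the load
    Instr("r2", ("r1",), lambda r1: r1 << 2),      # r2 = r1 * 4
    Instr(None, ("r2", "r3"), lambda r2, r3: r2 + r3, is_marked_load=True),
]
regs = {"r0": 2, "r3": 1000, "r4": 7}

graph = build_dependence_graph(window, 3)
address = precompute_address(window, 3, regs)
print(graph, address)
```

In this example the unrelated multiply at index 1 is skipped, so the scratch execution does less work than the full instruction stream, which is what would let a separate engine run ahead of the main pipeline.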



Information

Published In

ACM SIGARCH Computer Architecture News Volume 29, Issue 2

Special Issue: Proceedings of the 28th annual international symposium on Computer architecture (ISCA '01)

May 2001

262 pages

ISSN: 0163-5964

DOI: 10.1145/384285

Editor:
Per Stenström
Chalmers Univ. of Technology

ISCA '01: Proceedings of the 28th annual international symposium on Computer architecture
June 2001
289 pages
ISBN: 0769511627
DOI: 10.1145/379240
Chairman:
Per Stenström
Chalmers Univ. of Technology

Publisher

Association for Computing Machinery

New York, NY, United States
