[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/1508244.1508259acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article

RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations

Published: 07 March 2009 Publication History

Abstract

Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been assumed to be expensive when done in software and consequently, their usage for online optimizations has been limited. To address these problems and opportunities, we have developed a low-overhead software technique to obtain L2 MRCs online on current processors, exploiting features available in their performance monitoring units so that no changes to the application source code or binaries are required. Our technique, called RapidMRC, requires a single probing period of roughly 221 million processor cycles (147 ms), and subsequently 124 million cycles (83 ms) to process the data. We demonstrate its accuracy by comparing the obtained MRCs to the actual L2 MRCs of 30 applications taken from SPECcpu2006, SPECcpu2000, and SPECjbb2000. We show that RapidMRC can be applied to sizing cache partitions, helping to achieve performance improvements of up to 27%.

References

[1]
D. Albonesi. Selective cache ways: on-demand cache resource allocation. In MICRO, pages 248--259, 1999.
[2]
C. Antonopoulos, D. Nikolopoulos, and T. Papatheodorou. Scheduling algorithms with bus bandwidth considerations for SMPs. In ICPP, pages 547--554, 2003.
[3]
R. Azimi, L. Soares, M. Stumm, T. Walsh, and A. Demke Brown. PATH: page access tracking to improve memory management. In ISMM, pages 31--42, 2007.
[4]
R. Azimi, M. Stumm, and R. Wisniewski. Online performance analysis by statistical sampling of microprocessor performance counters. In ICS, pages 101--110, 2005.
[5]
R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures. In MICRO, pages 245--257, 2000.
[6]
E. Berg and E. Hagersten. StatCache: A probabilistic approach to efficient and accurate data locality analysis. In ISPASS, pages 20--27, 2004.
[7]
E. Berg and E. Hagersten. Fast data-locality profiling of native execution. In SIGMETRICS, pages 169--180, 2005.
[8]
E. Berg, H. Zeffer, and E. Hagersten. A statistical multiprocessor cache model. In ISPASS, pages 89--99, 2006.
[9]
D. Bruening, E. Duesterwald, and S. Amarasinghe. Design and implementation of a dynamic optimization framework for Windows. In FDDO, 2001.
[10]
B. Buck and J. Hollingsworth. An API for runtime code patching. J. of High Performance Computing Applications, 14(4):317--329, 2000.
[11]
D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In HPCA, pages 340--351, 2005.
[12]
S. Cho and L Jin. Managing distributed, shared L2 caches through OS-level page allocation. In MICRO, pages 455--468, 2006.
[13]
J. Edler and M. Hill. Dinero IV trace-driven uniprocessor cache simulator. URL http://www.cs.wisc.edu/~markhill/DineroIV.
[14]
A. Fedorova, M. Seltzer, C. Small, and D. Nussbaum. Performance of multithreaded chip multiprocessors and implications for operating system design. In USENIX ATC, pages 26--26, 2005.
[15]
F. Guo and Y. Solihin. An analytical model for cache replacement policy performance. In SIGMETRICS, pages 228--239, 2006.
[16]
R. Iyer. CQoS: a framework for enabling QoS in shared caches of CMP platforms. In ICS, pages 257--266, 2004.
[17]
R. Iyer, L. Zhao, F. Guo, R. Illikkal, D. Newell, Y. Solihin, L. Hsu, and S. Reinhardt. QoS policies and architecture for cache/memory in CMP platforms. In SIGMETRICS, pages 25--36, 2007.
[18]
J. Kim, J. Choi, J. Kim, S. Noh, S. Min, Y. Cho, and C. Kim. A low-overhead high-performance unified buffer management scheme that exploits sequential and looping references. In OSDI, pages 119--34, 2000.
[19]
S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In PACT, pages 111--122, 2004.
[20]
Y. Kim, M. Hill, and D. Wood. Implementing stack simulation for highlyassociative memories. In SIGMETRICS, pages 212--213, 1991.
[21]
J. Liedtke, H. Hartig, and M. Hohmuth. OS-controlled cache predictability for real-time systems. In RTAS, pages 213--227, 1997.
[22]
J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In HPCA, pages 367--378, 2008.
[23]
C. Liu, A. Sivasubramaniam, and M. Kandemir. Organizing the last line of defense before hitting the memory wall for CMPs. In HPCA, pages 176--185, 2004.
[24]
C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI, pages 190--200, 2005.
[25]
R. Mattson, J. Gecsei, D. Slutz, and I. Traiger. Evaluation techniques and storage hierarchies. IBM Systems J., 9(2):78--117, 1970.
[26]
K. Meng, R. Joseph, R. Dick, and L. Shang. Multi-optimization power management for chip multiprocessors. In PACT, pages 177--186, 2008.
[27]
M. Olszewski, K. Mierle, A. Czajkowski, and A. Demke Brown. JIT instrumentation: a novel approach to dynamically instrument operating systems. In EuroSys, pages 3--16, 2007.
[28]
R. Patterson, G. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. Informed prefetching and caching. In SOSP, pages 79--95, 1995.
[29]
M. Qureshi and Y. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO, pages 423--432, 2006.
[30]
N. Rafique, W. Lim, and M. Thottethodi. Architectural support for operating system-driven CMP cache management. In PACT, pages 2--12, 2006.
[31]
R. Rajkumar, C. Lee, J. Lehoczky, and D. Siewiorek. A resource allocation model for QoS management. In RTSS, pages 298--308, 1997.
[32]
A. Settle, J. Kihm, A. Janiszewski, and D. Connors. Architectural support for enhanced SMT job scheduling. In PACT, 2004.
[33]
J. Seward and N. Nethercote. Using Valgrind to detect undefined value errors with bit-precision. In USENIX ATC, pages 17--30, 2005.
[34]
X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation using time. In POPL, pages 55--61, 2007.
[35]
T. Sherwood, S. Sair, and B. Calder. Phase tracking and prediction. In ISCA, pages 336--349, 2003.
[36]
A. Snavely and D. Tullsen. Symbiotic jobscheduling for a simultaneous multithreading processor. In ASPLOS, pages 234--244, 2000.
[37]
L. Soares, D. Tam, and M. Stumm. Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer. In MICRO, 2008.
[38]
G. Soundararajan, J. Chen, M. Sharaf, and C. Amza. Dynamic partitioning of the cache hierarchy in shared data centers. In VLDB, pages 635--646, 2008.
[39]
S. Srikantaiah, M. Kandemir, and M. Irwin. Adaptive set pinning: Managing shared caches in chip multiprocessors. In ASPLOS, pages 135--144, 2008.
[40]
H. Stone, J. Turek, and J. Wolf. Optimal partitioning of cache memory. IEEE TOC, 41(9):1054--1068, 1992.
[41]
E. Suh, L Rudolph, and S. Devadas. Dynamic partitioning of shared cache memory. The J. of Supercomputing, 28(1):7--26, Apr. 2004.
[42]
D. Tam, R. Azimi, L. Soares, and M. Stumm. Managing shared L2 caches on multicore systems in software. In WIOSCA, pages 26--33, 2007.
[43]
D. Tam, R. Azimi, and M. Stumm. Thread clustering: Sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In EuroSys, pages 47--58, 2007.
[44]
D. Thiebaut, H. Stone, and J.Wolf. Improving disk cache hit-ratios through cache partitioning. IEEE TOC, 41(6):665--676, 1992.
[45]
T. Yang, E. Berger, S. Kaplan, and J. Moss. CRAMM: virtual memory support for garbage-collected applications. In OSDI, pages 103--116, 2006.
[46]
Q. Zhao, R. Rabbah, S. Amarasinghe, L. Rudolph, and W.-F. Wong. Ubiquitous memory introspection. In CGO, pages 299--311, 2007.
[47]
P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic tracking of page miss ratio curve for memory management. In ASPLOS, pages 177--188, 2004.
[48]
Y. Zhou, J. Philbin, and K. Li. The multi-queue replacement algorithm for second level buffer caches. In USENIX ATC, pages 91--104, 2001.

Cited By

View all
  • (2023)Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive GroupingACM Transactions on Architecture and Code Optimization10.1145/362952521:1(1-26)Online publication date: 21-Oct-2023
  • (2023)A Space-Efficient Fair Cache Scheme Based on Machine Learning for NVMe SSDsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.322141034:1(383-399)Online publication date: 1-Jan-2023
  • (2022)Adaptive Page Migration Policy With Huge Pages in Tiered Memory SystemsIEEE Transactions on Computers10.1109/TC.2020.303668671:1(53-68)Online publication date: 1-Jan-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ASPLOS XIV: Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
March 2009
358 pages
ISBN:9781605584065
DOI:10.1145/1508244
  • cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 44, Issue 3
    ASPLOS 2009
    March 2009
    346 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/1508284
    Issue’s Table of Contents
  • cover image ACM SIGARCH Computer Architecture News
    ACM SIGARCH Computer Architecture News  Volume 37, Issue 1
    ASPLOS 2009
    March 2009
    346 pages
    ISSN:0163-5964
    DOI:10.1145/2528521
    Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 March 2009

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. cache management
  2. cache partitioning
  3. chip multiprocessor
  4. dynamic optimization
  5. hardware performance counters
  6. miss rate curve
  7. multicore processor
  8. online optimization
  9. performance monitoring unit
  10. shared cache
  11. shared cache management

Qualifiers

  • Research-article

Conference

ASPLOS09

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)30
  • Downloads (Last 6 weeks)3
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive GroupingACM Transactions on Architecture and Code Optimization10.1145/362952521:1(1-26)Online publication date: 21-Oct-2023
  • (2023)A Space-Efficient Fair Cache Scheme Based on Machine Learning for NVMe SSDsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.322141034:1(383-399)Online publication date: 1-Jan-2023
  • (2022)Adaptive Page Migration Policy With Huge Pages in Tiered Memory SystemsIEEE Transactions on Computers10.1109/TC.2020.303668671:1(53-68)Online publication date: 1-Jan-2022
  • (2020)Accuracy-Aware Adaptive Traffic Monitoring for Software DataplanesIEEE/ACM Transactions on Networking10.1109/TNET.2020.297695228:3(986-1001)Online publication date: Jun-2020
  • (2019)DynaSprintProceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3352460.3358301(426-439)Online publication date: 12-Oct-2019
  • (2019)Directed Statistical Warming through Time TravelingProceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3352460.3358264(1037-1049)Online publication date: 12-Oct-2019
  • (2019)A Relational Theory of LocalityACM Transactions on Architecture and Code Optimization10.1145/334110916:3(1-26)Online publication date: 20-Aug-2019
  • (2019)EMBAProceedings of the 48th International Conference on Parallel Processing10.1145/3337821.3337863(1-12)Online publication date: 5-Aug-2019
  • (2019)Beating OPT with Statistical Clairvoyance and Variable Size CachingProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304067(243-256)Online publication date: 4-Apr-2019
  • (2018)ResQProceedings of the 15th USENIX Conference on Networked Systems Design and Implementation10.5555/3307441.3307466(283-297)Online publication date: 9-Apr-2018
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media