More Web Proxy on the site http://driver.im/

research-article

Combining Software Cache Partitioning and Loop Tiling for Effective Shared Cache Management

Authors:

Kelefouras Vasilios,

Keramidas Georgios,

Voros NikolaosAuthors Info & Claims

ACM Transactions on Embedded Computing Systems (TECS), Volume 17, Issue 3

Article No.: 72, Pages 1 - 25

https://doi.org/10.1145/3202663

Published: 22 May 2018 Publication History

Abstract

One of the biggest challenges in multicore platforms is shared cache management, especially for data-dominant applications. Two commonly used approaches for increasing shared cache utilization are cache partitioning and loop tiling. However, state-of-the-art compilers lack efficient cache partitioning and loop tiling methods for two reasons. First, cache partitioning and loop tiling are strongly coupled together, and thus addressing them separately is simply not effective. Second, cache partitioning and loop tiling must be tailored to the target shared cache architecture details and the memory characteristics of the corunning workloads.

To the best of our knowledge, this is the first time that a methodology provides (1) a theoretical foundation in the above-mentioned cache management mechanisms and (2) a unified framework to orchestrate these two mechanisms in tandem (not separately). Our approach manages to lower the number of main memory accesses by an order of magnitude keeping at the same time the number of arithmetic/addressing instructions to a minimal level. We motivate this work by showcasing that cache partitioning, loop tiling, data array layouts, shared cache architecture details (i.e., cache size and associativity), and the memory reuse patterns of the executing tasks must be addressed together as one problem, when a (near)-optimal solution is requested. To this end, we present a search space exploration analysis where our proposal is able to offer a vast deduction in the required search space.

References

[1]

F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O’Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams. 2006. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’06). IEEE, 295--305.

Digital Library

[2]

L. Almagor, Keith D. Cooper, A. Grosul, T. J. Harvey, S. W. Reeves, D. Subramanian, L. Torczon, and T. Waterman. 2004. Finding effective compilation sequences. ACM SIGPLAN Not. 39, 7, 231--239.

Digital Library

[3]

Utpal Banerjee. 1993. Linear Equations and Inequalities. Springer, Boston, MA, 49--94.

[4]

Bin Bao and Chen Ding. 2013. Defensive loop tiling for shared cache. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO’13). IEEE, 1--11.

Digital Library

[5]

M. M. Baskaran, N. Vydyanathan, U. K. R. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. 2009. Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors. ACM SIGPLAN Not. 44, 4, 219--228.

Digital Library

[6]

Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator. ACM SIGARCH Comput. Archit. News 39, 2, 1--7.

Digital Library

[7]

U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008a. A practical automatic polyhedral parallelizer and locality optimizer. ACM SIGPLAN Not. 43, 6, 101--113.

Digital Library

[8]

Uday Bondhugula, J. Ramanujam, and P. Sadayppan. 2008b. PLuTo: A practical and fully automatic polyhedral program optimization system. In Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (PLDI’08).

[9]

Bach D. Bui, Marco Caccamo, Lui Sha, and Joseph Martinez. 2008. Impact of cache partitioning on multi-tasking real time embedded systems. In Proceedings of the 2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications, 101--110.

Digital Library

[10]

Jichuan Chang and Gurindar S. Sohi. 2014. Cooperative cache partitioning for chip multiprocessors. In Proceedings of the 21st Annual International Conference on Supercomputing (ICS’07). ACM, 80--81.

Digital Library

[11]

Shi-Kuo Chang. 2003. Data Structures and Algorithms. Series on Software Engineering and Knowledge Engineering, Vol. 13. World Scientific.

[12]

Keith D. Cooper, Alexander Grosul, Timothy J. Harvey, Steven Reeves, Devika Subramanian, Linda Torczon, and Todd Waterman. 2005. ACME: Adaptive compilation made efficient. ACM SIGPLAN Not. 40, 7, 69--77.

Digital Library

[13]

Keith D. Cooper, Alexander Grosul, Timothy J. Harvey, Steve Reeves, Devika Subramanian, Linda Torczon, and Todd Waterman. 2006. Exploring the structure of the space of compilation sequences using randomized search algorithms. J. Supercomput. 36, 2, 135--151.

Digital Library

[14]

Xiaoning Ding, Kaibo Wang, and Xiaodong Zhang. 2011. ULCC: A user-level facility for optimizing shared cache performance on multicores. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP’11). ACM, 103--112. http://dblp.uni-trier.de/db/conf/ppopp/ppopp2011.html.

Digital Library

[15]

Haakon Dybdahl and Per Stenstrm. 2007. An adaptive shared/private NUCA cache partitioning scheme for chip multiprocessors. In Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture (HPCA’07). IEEE, 2--12. http://dblp.uni-trier.de/db/conf/hpca/hpca2007.html.

Digital Library

[16]

M. Haneda, P. M. W. Khnijnenburg, and H. A. G. Wijshoff. 2005. Automatic selection of compiler options using non-parametric inferential statistics. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT’05). IEEE, 123--132.

Digital Library

[17]

T. Henretty, A. Rountev, H. Lin, J. Ramanujam, Q. Lu, T. F. Ngai, Y. Chen, C. Alias, U. Bondhugula, P. Sadayappan, and S. Krishnamoorthy. 2009. Data layout transformation for enhancing data locality on NUCA chip multiprocessors. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 348--357.

Digital Library

[18]

Mahmut Kandemir, Sai Prashanth Muralidhara, Sri Hari Krishna Narayanan, Yuanrui Zhang, and Ozcan Ozturk. 2009. Optimizing shared cache behavior of chip multiprocessors. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42). ACM, 505--516.

Digital Library

[19]

Mahmut Kandemir, Taylan Yemliha, SaiPrashanth Muralidhara, Shekhar Srikantaiah, Mary Jane Irwin, and Yuanrui Zhnag. 2010. Cache topology aware computation mapping for multicores. ACM SIGPLAN Not. 45, 6, 74--85.

Digital Library

[20]

Dimitris Kaseridis, Jeffrey Stuecheli, and Lizy K. John. 2009. Bank-aware dynamic cache partitioning for multicore architectures. In Proceedings of the International Conference on Parallel Processing (ICPP’09). 18--25.

Digital Library

[21]

Vasilios Kelefouras, Angeliki Kritikakou, and Costas Goutis. 2015. A methodology for speeding up loop kernels by exploiting the software information and the memory architecture. Comput. Lang. Syst. Structures 41, 21--41. http://dblp.uni-trier.de/db/journals/cl/cl41.html.

Digital Library

[22]

DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, and Michelle Mills Strout. 2007. Multi-level tiling: M for the price of one. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC’07). ACM, Article 51, 12 pages.

Digital Library

[23]

Hyoseung Kim, Arvind Kandhalu, and Ragunathan Rajkumar. 2013. A coordinated approach for practical OS-level cache management in multi-core real-time systems. In Proceedings of the 25th Euromicro Conference on Real-Time Systems, (ECRTS’13). 80--89.

Digital Library

[24]

P. M. W. Knijnenburg, T. Kisuki, K. Gallivan, and M. F. P. O’Boyle. 2004. The effect of cache models on iterative compilation for combined tiling and unrolling. Concur. Comput. Pract. Exper. 16, 2--3, 247--270.

Digital Library

[25]

Prasad Kulkarni, Stephen Hines, Jason Hiser, David Whalley, Jack Davidson, and Douglas Jones. 2004. Fast searches for effective optimization phase sequences. ACM SIGPLAN Not. 39, 6, 171--182.

Digital Library

[26]

Prasad A. Kulkarni, David B. Whalley, and Gary S. Tyson. 2007. Evaluating heuristic optimization phase order search algorithms. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’07). IEEE, 157--169.

Digital Library

[27]

Prasad A. Kulkarni, David B. Whalley, Gary S. Tyson, and Jack W. Davidson. 2009. Practical exhaustive optimization phase order exploration and evaluation. ACM Trans. Archit. Code Optim. 6, 1, Article 1, 36 pages.

Digital Library

[28]

Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42). ACM, 469--480.

Digital Library

[29]

Jacob Lidman, Daniel J. Quinlan, Chunhua Liao, and Sally A. McKee. 2012. ROSE: FTTransform-A source-to-source translation framework for exascale fault-tolerance research. In Proceedings of the 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W’12). IEEE, 1--6.

[30]

Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan. 2008. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In Proceedings of the IEEE 14th International Symposium on High Performance Computer Architecture (HPCA’08). IEEE, 367--378. http://dblp.uni-trier.de/db/conf/hpca/hpca2008.html.

[31]

Jun Liu, Yuanrui Zhang, Wei Ding, and Mahmut T. Kandemir. 2011. On-chip cache hierarchy-aware tile scheduling for multicore machines. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’11). IEEE, 161--170. http://dblp.uni-trier.de/db/conf/cgo/cgo2011.html.

Digital Library

[32]

Antoine Monsifrot, François Bodin, and Rene Quiniou. 2002. A machine learning approach to automatic production of compiler heuristics. In Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, and Applications (AIMSA’02). 41--50. http://dl.acm.org/citation.cfm?id=646053.677574.

Digital Library

[33]

Miquel Moret, Francisco J. Cazorla, Alex Ramrez, and Mateo Valero. 2008. MLP-aware dynamic cache partitioning. In HiPEAC. Lecture Notes in Computer Science, Vol. 4917. Springer, 337--352. Retrieved from http://dblp.uni-trier.de/db/conf/hipeac/hipeac2008.html.

Digital Library

[34]

Dimitrios S. Nikolopoulos. 2003. Code and data transformations for improving shared cache performance on SMT processors. In Proceedings of the 5th International Symposium on High Performance Computing (ISHPC’03). 54--69.

[35]

Eunjung Park, Sameer Kulkarni, and John Cavazos. 2011. An evaluation of different modeling techniques for iterative compilation. In Proceedings of the 14th International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES’11). ACM, 65--74.

Digital Library

[36]

L.-N. Pouchet. 2012. PolyBench/C Benchmark Suite. Retrieved from http://web.cs.ucla.edu/ pouchet/software/polybench/.

[37]

Rakesh Reddy and Peter Petrov. 2010. Cache partitioning for energy-efficient and interference-free embedded multitasking. ACM Trans. Embed. Comput. Syst. 9, 3, Article 16, 35 pages.

Digital Library

[38]

Lakshminarayanan Renganarayanan, DaeGon Kim, Sanjay Rajopadhye, and Michelle Mills Strout. 2007. Parameterized tiled loops for free. ACM SIGPLAN Not. 42, 6, 405--414.

Digital Library

[39]

Mark Stephenson, Saman Amarasinghe, Martin Martin, and Una-May O’Reilly. 2003. Meta optimization: Improving compiler heuristics with machine learning. ACM SIGPLAN Not. 38, 5, 77--90.

Digital Library

[40]

Karthik T. Sundararajan, Vasileios Porpodas, Timothy M. Jones, Nigel P. Topham, and Bjorn Franke. 2012. Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs. In Proceedings of the IEEE 18th International Symposium on High-Performance Computer Architecture (HPCA’12). IEEE, 311--322. http://dblp.uni-trier.de/db/conf/hpca/hpca2012.html.

Digital Library

[41]

I-Jui Sung, Nasser Anssari, John A. Stratton, and Wen-Mei W. Hwu. 2012. Data layout transformation exploiting memory-level parallelism in structured grid many-core applications. Int. J. Parallel Program. 40, 1, 4--24.

[42]

David Tam, Reza Azimi, Livio Soares, and Michael Stumm. 2007. Managing shared L2 caches on multicore systems in software. In Proceedings of the Workshop on the Interaction Between Operating Systems and Computer Architecture.

[43]

Michele Tartara and Stefano Crespi Reghizzi. 2013. Continuous learning of compiler heuristics. ACM Trans. Archit. Code Optim. 9, 4, Article 46, 25 pages.

Digital Library

[44]

R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. 2001. Automated empirical optimization of software and the ATLAS project. Parallel Comput. 27, 1--2, 3--35.

[45]

Ying Ye, Richard West, Zhuoqun Cheng, and Ye Li. 2014. COLORIS: A dynamic cache partitioning system using page coloring. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14). ACM, 381--392.

Digital Library

[46]

Chenjie Yu and Peter Petrov. 2010. Off-chip memory bandwidth minimization through cache partitioning for multi-core platforms. In Proceedings of the 47th Design Automation Conference (DAC’10). ACM, 132--137.

Digital Library

[47]

Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards practical page coloring-based multicore cache management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys’09). ACM, 89--102.

Digital Library

[48]

Xing Zhou, Jean-Pierre Giacalone, Marıa Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical overlapped tiling. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, 207--218.

Digital Library

Cited By

Modi GBagchi AJindal NMandal APanda P(2023)CABARRE: Request Response Arbitration for Shared Cache ManagementACM Transactions on Embedded Computing Systems10.1145/360809622:5s(1-24)Online publication date: 31-Oct-2023
https://dl.acm.org/doi/10.1145/3608096
Kelefouras VDjemame KKeramidas GVoros N(2022)A Methodology for Efficient Tile Size Selection for Affine Loop KernelsInternational Journal of Parallel Programming10.1007/s10766-022-00734-550:3-4(405-432)Online publication date: 1-Aug-2022
https://dl.acm.org/doi/10.1007/s10766-022-00734-5
Forsberg BSolieri MBertogna MBenini LMarongiu A(2021)The Predictable Execution Model in PracticeACM Transactions on Embedded Computing Systems10.1145/346537020:5(1-25)Online publication date: 29-Jul-2021
https://dl.acm.org/doi/10.1145/3465370
Show More Cited By

Index Terms

Combining Software Cache Partitioning and Loop Tiling for Effective Shared Cache Management
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Retargetable compilers
2. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms
      1. Shared memory algorithms

Recommendations

Code-based cache partitioning for improving hardware cache performance
ICUIMC '12: Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication

Recently, improving hardware cache performance is getting more important, because the performance gap between processor and memory has caused "memory wall" problem. Most cache designs are based on the LRU replacement policy which is effective for high-...
Defensive loop tiling for shared cache
CGO '13: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Loop tiling is a compiler transformation that tailors an application's working set to fit in a cache hierarchy. On today's multicore processors, part of the hierarchy especially the last level cache (LLC) is shared. The available cache space in shared ...
Reducing Cache Conflicts by Multi-Level Cache Partitioning and Array Elements Mapping

This article presents an algorithm to reduce cache conflicts and improve cache localities. The proposed algorithm analyzes locality reference space for each reference pattern, partitions the multi-level cache into several parts with different sizes, and ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Embedded Computing Systems

ACM Transactions on Embedded Computing Systems Volume 17, Issue 3

May 2018

309 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/3185335

Editor:
Sandeep K. Shukla
Indian Institute of Technology, India

Issue’s Table of Contents

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 22 May 2018

Accepted: 01 March 2018

Revised: 01 September 2017

Received: 01 February 2017

Published in TECS Volume 17, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

European Commission under Horizon 2020 Research and Innovation Action
“WCET-Aware Parallelization of Model-Based Applications for Heterogeneous Parallel Systems (ARGO),”

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

4
Total Citations
View Citations
159
Total Downloads

Downloads (Last 12 months)12
Downloads (Last 6 weeks)1

Reflects downloads up to 13 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Modi GBagchi AJindal NMandal APanda P(2023)CABARRE: Request Response Arbitration for Shared Cache ManagementACM Transactions on Embedded Computing Systems10.1145/360809622:5s(1-24)Online publication date: 31-Oct-2023
https://dl.acm.org/doi/10.1145/3608096
Kelefouras VDjemame KKeramidas GVoros N(2022)A Methodology for Efficient Tile Size Selection for Affine Loop KernelsInternational Journal of Parallel Programming10.1007/s10766-022-00734-550:3-4(405-432)Online publication date: 1-Aug-2022
https://dl.acm.org/doi/10.1007/s10766-022-00734-5
Forsberg BSolieri MBertogna MBenini LMarongiu A(2021)The Predictable Execution Model in PracticeACM Transactions on Embedded Computing Systems10.1145/346537020:5(1-25)Online publication date: 29-Jul-2021
https://dl.acm.org/doi/10.1145/3465370
Kelefouras VDjemame K(2019)A methodology correlating code optimizations with data memory accesses, execution time and energy consumptionThe Journal of Supercomputing10.1007/s11227-019-02880-zOnline publication date: 13-May-2019
https://doi.org/10.1007/s11227-019-02880-z

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents