[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Combining Software Cache Partitioning and Loop Tiling for Effective Shared Cache Management

Published: 22 May 2018 Publication History

Abstract

One of the biggest challenges in multicore platforms is shared cache management, especially for data-dominant applications. Two commonly used approaches for increasing shared cache utilization are cache partitioning and loop tiling. However, state-of-the-art compilers lack efficient cache partitioning and loop tiling methods for two reasons. First, cache partitioning and loop tiling are strongly coupled together, and thus addressing them separately is simply not effective. Second, cache partitioning and loop tiling must be tailored to the target shared cache architecture details and the memory characteristics of the corunning workloads.
To the best of our knowledge, this is the first time that a methodology provides (1) a theoretical foundation in the above-mentioned cache management mechanisms and (2) a unified framework to orchestrate these two mechanisms in tandem (not separately). Our approach manages to lower the number of main memory accesses by an order of magnitude keeping at the same time the number of arithmetic/addressing instructions to a minimal level. We motivate this work by showcasing that cache partitioning, loop tiling, data array layouts, shared cache architecture details (i.e., cache size and associativity), and the memory reuse patterns of the executing tasks must be addressed together as one problem, when a (near)-optimal solution is requested. To this end, we present a search space exploration analysis where our proposal is able to offer a vast deduction in the required search space.

References

[1]
F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O’Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams. 2006. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’06). IEEE, 295--305.
[2]
L. Almagor, Keith D. Cooper, A. Grosul, T. J. Harvey, S. W. Reeves, D. Subramanian, L. Torczon, and T. Waterman. 2004. Finding effective compilation sequences. ACM SIGPLAN Not. 39, 7, 231--239.
[3]
Utpal Banerjee. 1993. Linear Equations and Inequalities. Springer, Boston, MA, 49--94.
[4]
Bin Bao and Chen Ding. 2013. Defensive loop tiling for shared cache. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO’13). IEEE, 1--11.
[5]
M. M. Baskaran, N. Vydyanathan, U. K. R. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. 2009. Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors. ACM SIGPLAN Not. 44, 4, 219--228.
[6]
Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 simulator. ACM SIGARCH Comput. Archit. News 39, 2, 1--7.
[7]
U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008a. A practical automatic polyhedral parallelizer and locality optimizer. ACM SIGPLAN Not. 43, 6, 101--113.
[8]
Uday Bondhugula, J. Ramanujam, and P. Sadayppan. 2008b. PLuTo: A practical and fully automatic polyhedral program optimization system. In Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (PLDI’08).
[9]
Bach D. Bui, Marco Caccamo, Lui Sha, and Joseph Martinez. 2008. Impact of cache partitioning on multi-tasking real time embedded systems. In Proceedings of the 2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications, 101--110.
[10]
Jichuan Chang and Gurindar S. Sohi. 2014. Cooperative cache partitioning for chip multiprocessors. In Proceedings of the 21st Annual International Conference on Supercomputing (ICS’07). ACM, 80--81.
[11]
Shi-Kuo Chang. 2003. Data Structures and Algorithms. Series on Software Engineering and Knowledge Engineering, Vol. 13. World Scientific.
[12]
Keith D. Cooper, Alexander Grosul, Timothy J. Harvey, Steven Reeves, Devika Subramanian, Linda Torczon, and Todd Waterman. 2005. ACME: Adaptive compilation made efficient. ACM SIGPLAN Not. 40, 7, 69--77.
[13]
Keith D. Cooper, Alexander Grosul, Timothy J. Harvey, Steve Reeves, Devika Subramanian, Linda Torczon, and Todd Waterman. 2006. Exploring the structure of the space of compilation sequences using randomized search algorithms. J. Supercomput. 36, 2, 135--151.
[14]
Xiaoning Ding, Kaibo Wang, and Xiaodong Zhang. 2011. ULCC: A user-level facility for optimizing shared cache performance on multicores. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP’11). ACM, 103--112. http://dblp.uni-trier.de/db/conf/ppopp/ppopp2011.html.
[15]
Haakon Dybdahl and Per Stenstrm. 2007. An adaptive shared/private NUCA cache partitioning scheme for chip multiprocessors. In Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture (HPCA’07). IEEE, 2--12. http://dblp.uni-trier.de/db/conf/hpca/hpca2007.html.
[16]
M. Haneda, P. M. W. Khnijnenburg, and H. A. G. Wijshoff. 2005. Automatic selection of compiler options using non-parametric inferential statistics. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT’05). IEEE, 123--132.
[17]
T. Henretty, A. Rountev, H. Lin, J. Ramanujam, Q. Lu, T. F. Ngai, Y. Chen, C. Alias, U. Bondhugula, P. Sadayappan, and S. Krishnamoorthy. 2009. Data layout transformation for enhancing data locality on NUCA chip multiprocessors. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 348--357.
[18]
Mahmut Kandemir, Sai Prashanth Muralidhara, Sri Hari Krishna Narayanan, Yuanrui Zhang, and Ozcan Ozturk. 2009. Optimizing shared cache behavior of chip multiprocessors. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42). ACM, 505--516.
[19]
Mahmut Kandemir, Taylan Yemliha, SaiPrashanth Muralidhara, Shekhar Srikantaiah, Mary Jane Irwin, and Yuanrui Zhnag. 2010. Cache topology aware computation mapping for multicores. ACM SIGPLAN Not. 45, 6, 74--85.
[20]
Dimitris Kaseridis, Jeffrey Stuecheli, and Lizy K. John. 2009. Bank-aware dynamic cache partitioning for multicore architectures. In Proceedings of the International Conference on Parallel Processing (ICPP’09). 18--25.
[21]
Vasilios Kelefouras, Angeliki Kritikakou, and Costas Goutis. 2015. A methodology for speeding up loop kernels by exploiting the software information and the memory architecture. Comput. Lang. Syst. Structures 41, 21--41. http://dblp.uni-trier.de/db/journals/cl/cl41.html.
[22]
DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, and Michelle Mills Strout. 2007. Multi-level tiling: M for the price of one. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC’07). ACM, Article 51, 12 pages.
[23]
Hyoseung Kim, Arvind Kandhalu, and Ragunathan Rajkumar. 2013. A coordinated approach for practical OS-level cache management in multi-core real-time systems. In Proceedings of the 25th Euromicro Conference on Real-Time Systems, (ECRTS’13). 80--89.
[24]
P. M. W. Knijnenburg, T. Kisuki, K. Gallivan, and M. F. P. O’Boyle. 2004. The effect of cache models on iterative compilation for combined tiling and unrolling. Concur. Comput. Pract. Exper. 16, 2--3, 247--270.
[25]
Prasad Kulkarni, Stephen Hines, Jason Hiser, David Whalley, Jack Davidson, and Douglas Jones. 2004. Fast searches for effective optimization phase sequences. ACM SIGPLAN Not. 39, 6, 171--182.
[26]
Prasad A. Kulkarni, David B. Whalley, and Gary S. Tyson. 2007. Evaluating heuristic optimization phase order search algorithms. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’07). IEEE, 157--169.
[27]
Prasad A. Kulkarni, David B. Whalley, Gary S. Tyson, and Jack W. Davidson. 2009. Practical exhaustive optimization phase order exploration and evaluation. ACM Trans. Archit. Code Optim. 6, 1, Article 1, 36 pages.
[28]
Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42). ACM, 469--480.
[29]
Jacob Lidman, Daniel J. Quinlan, Chunhua Liao, and Sally A. McKee. 2012. ROSE: FTTransform-A source-to-source translation framework for exascale fault-tolerance research. In Proceedings of the 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W’12). IEEE, 1--6.
[30]
Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan. 2008. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In Proceedings of the IEEE 14th International Symposium on High Performance Computer Architecture (HPCA’08). IEEE, 367--378. http://dblp.uni-trier.de/db/conf/hpca/hpca2008.html.
[31]
Jun Liu, Yuanrui Zhang, Wei Ding, and Mahmut T. Kandemir. 2011. On-chip cache hierarchy-aware tile scheduling for multicore machines. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’11). IEEE, 161--170. http://dblp.uni-trier.de/db/conf/cgo/cgo2011.html.
[32]
Antoine Monsifrot, François Bodin, and Rene Quiniou. 2002. A machine learning approach to automatic production of compiler heuristics. In Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, and Applications (AIMSA’02). 41--50. http://dl.acm.org/citation.cfm?id=646053.677574.
[33]
Miquel Moret, Francisco J. Cazorla, Alex Ramrez, and Mateo Valero. 2008. MLP-aware dynamic cache partitioning. In HiPEAC. Lecture Notes in Computer Science, Vol. 4917. Springer, 337--352. Retrieved from http://dblp.uni-trier.de/db/conf/hipeac/hipeac2008.html.
[34]
Dimitrios S. Nikolopoulos. 2003. Code and data transformations for improving shared cache performance on SMT processors. In Proceedings of the 5th International Symposium on High Performance Computing (ISHPC’03). 54--69.
[35]
Eunjung Park, Sameer Kulkarni, and John Cavazos. 2011. An evaluation of different modeling techniques for iterative compilation. In Proceedings of the 14th International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES’11). ACM, 65--74.
[36]
L.-N. Pouchet. 2012. PolyBench/C Benchmark Suite. Retrieved from http://web.cs.ucla.edu/ pouchet/software/polybench/.
[37]
Rakesh Reddy and Peter Petrov. 2010. Cache partitioning for energy-efficient and interference-free embedded multitasking. ACM Trans. Embed. Comput. Syst. 9, 3, Article 16, 35 pages.
[38]
Lakshminarayanan Renganarayanan, DaeGon Kim, Sanjay Rajopadhye, and Michelle Mills Strout. 2007. Parameterized tiled loops for free. ACM SIGPLAN Not. 42, 6, 405--414.
[39]
Mark Stephenson, Saman Amarasinghe, Martin Martin, and Una-May O’Reilly. 2003. Meta optimization: Improving compiler heuristics with machine learning. ACM SIGPLAN Not. 38, 5, 77--90.
[40]
Karthik T. Sundararajan, Vasileios Porpodas, Timothy M. Jones, Nigel P. Topham, and Bjorn Franke. 2012. Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs. In Proceedings of the IEEE 18th International Symposium on High-Performance Computer Architecture (HPCA’12). IEEE, 311--322. http://dblp.uni-trier.de/db/conf/hpca/hpca2012.html.
[41]
I-Jui Sung, Nasser Anssari, John A. Stratton, and Wen-Mei W. Hwu. 2012. Data layout transformation exploiting memory-level parallelism in structured grid many-core applications. Int. J. Parallel Program. 40, 1, 4--24.
[42]
David Tam, Reza Azimi, Livio Soares, and Michael Stumm. 2007. Managing shared L2 caches on multicore systems in software. In Proceedings of the Workshop on the Interaction Between Operating Systems and Computer Architecture.
[43]
Michele Tartara and Stefano Crespi Reghizzi. 2013. Continuous learning of compiler heuristics. ACM Trans. Archit. Code Optim. 9, 4, Article 46, 25 pages.
[44]
R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. 2001. Automated empirical optimization of software and the ATLAS project. Parallel Comput. 27, 1--2, 3--35.
[45]
Ying Ye, Richard West, Zhuoqun Cheng, and Ye Li. 2014. COLORIS: A dynamic cache partitioning system using page coloring. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14). ACM, 381--392.
[46]
Chenjie Yu and Peter Petrov. 2010. Off-chip memory bandwidth minimization through cache partitioning for multi-core platforms. In Proceedings of the 47th Design Automation Conference (DAC’10). ACM, 132--137.
[47]
Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. 2009. Towards practical page coloring-based multicore cache management. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys’09). ACM, 89--102.
[48]
Xing Zhou, Jean-Pierre Giacalone, Marıa Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical overlapped tiling. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, 207--218.

Cited By

View all
  • (2023)CABARRE: Request Response Arbitration for Shared Cache ManagementACM Transactions on Embedded Computing Systems10.1145/360809622:5s(1-24)Online publication date: 31-Oct-2023
  • (2022)A Methodology for Efficient Tile Size Selection for Affine Loop KernelsInternational Journal of Parallel Programming10.1007/s10766-022-00734-550:3-4(405-432)Online publication date: 1-Aug-2022
  • (2021)The Predictable Execution Model in PracticeACM Transactions on Embedded Computing Systems10.1145/346537020:5(1-25)Online publication date: 29-Jul-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Embedded Computing Systems
ACM Transactions on Embedded Computing Systems  Volume 17, Issue 3
May 2018
309 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3185335
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

Publication History

Published: 22 May 2018
Accepted: 01 March 2018
Revised: 01 September 2017
Received: 01 February 2017
Published in TECS Volume 17, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Cache partitioning
  2. data array layouts
  3. loop tiling
  4. memory management
  5. page coloring

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • European Commission under Horizon 2020 Research and Innovation Action
  • “WCET-Aware Parallelization of Model-Based Applications for Heterogeneous Parallel Systems (ARGO),”

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)12
  • Downloads (Last 6 weeks)1
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)CABARRE: Request Response Arbitration for Shared Cache ManagementACM Transactions on Embedded Computing Systems10.1145/360809622:5s(1-24)Online publication date: 31-Oct-2023
  • (2022)A Methodology for Efficient Tile Size Selection for Affine Loop KernelsInternational Journal of Parallel Programming10.1007/s10766-022-00734-550:3-4(405-432)Online publication date: 1-Aug-2022
  • (2021)The Predictable Execution Model in PracticeACM Transactions on Embedded Computing Systems10.1145/346537020:5(1-25)Online publication date: 29-Jul-2021
  • (2019)A methodology correlating code optimizations with data memory accesses, execution time and energy consumptionThe Journal of Supercomputing10.1007/s11227-019-02880-zOnline publication date: 13-May-2019

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media