[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3062341.3062385acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
research-article

Cache locality optimization for recursive programs

Published: 14 June 2017 Publication History

Abstract

We present an approach to optimize the cache locality for recursive programs by dynamically splicing---recursively interleaving---the execution of distinct function invocations. By utilizing data effect annotations, we identify concurrency and data reuse opportunities across function invocations and interleave them to reduce reuse distance. We present algorithms that efficiently track effects in recursive programs, detect interference and dependencies, and interleave execution of function invocations using user-level (non-kernel) lightweight threads. To enable multi-core execution, a program is parallelized using a nested fork/join programming model. Our cache optimization strategy is designed to work in the context of a random work stealing scheduler. We present an implementation using the MIT Cilk framework that demonstrates significant improvements in sequential and parallel performance, competitive with a state-of-the-art compile-time optimizer for loop programs and a domain-specific optimizer for stencil programs.

References

[1]
K. Agrawal, C. E. Leiserson, and J. Sukha. Executing task graphs using work-stealing. In 24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS, pages 1–12, 2010.
[2]
M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. Legion: expressing locality and independence with logical regions. In SC Conference on High Performance Computing Networking, Storage and Analysis, page 66, 2012.
[3]
R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPoPP), pages 207–216, 1995.
[4]
U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 101–113, 2008.
[5]
U. Bondhugula, V. Bandishti, and I. Pananilath. Diamond tiling: Tiling techniques to maximize parallelism for stencil computations. IEEE Transactions on Parallel and Distributed Systems, 28(5):1285–1298, May 2017.
[6]
Boost Context. Boost Context. http://www.boost. org/doc/libs/1_56_0/libs/context/doc/ html/index.html.
[7]
Z. Budimlic, M. G. Burke, V. Cavé, K. Knobe, G. Lowney, R. Newton, J. Palsberg, D. M. Peixotto, V. Sarkar, F. Schlimbach, and S. Tasirlar. Concurrent collections. Scientific Programming, 18(3-4):203–217, 2010.
[8]
R. M. Burstall and J. Darlington. A transformation system for developing recursive programs. Journal of the ACM, 24(1): 44–67, 1977.
[9]
E. Chan, E. S. Quintana-Ortí, G. Quintana-Ortí, and R. A. van de Geijn. Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA: Proceedings of the 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116–125, 2007.
[10]
R. Chandra, A. Gupta, and J. L. Hennessy. Data locality and load balancing in COOL. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPoPP), pages 249–259, 1993.
[11]
S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. C. Mowry, and C. Wilkerson. Scheduling threads for constructive cache sharing on CMPs. In SPAA: Proceedings of the 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 105–115, 2007.
[12]
M. E. Conway. Design of a separable transition-diagram compiler. Communications of the ACM, 6(7):396–408, 1963.
[13]
J. S. Danaher, I. A. Lee, and C. E. Leiserson. Programming with exceptions in JCilk. Science of Computer Programming, 63(2):147–171, 2006.
[14]
M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 212–223, 1998.
[15]
Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for async-finish task parallelism. In 23rd IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1–12, 2009.
[16]
Y. Guo, Y. Zhao, V. Cavé, and V. Sarkar. SLAW: a scalable locality-aware adaptive work-stealing scheduler for multicore systems. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 341–342, 2010.
[17]
S. Z. Guyer and C. Lin. An annotation language for optimizing software libraries. In Proceedings of the Second Conference on Domain-Specific Languages (DSL), pages 39–52, 1999.
[18]
S. Heumann, V. S. Adve, and S. Wang. The tasks with effects model for safe concurrency. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 239–250, 2013.
[19]
Y. Jo and M. Kulkarni. Enhancing locality for recursive traversals of recursive structures. In Proceedings of the 26th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pages 463–482, 2011.
[20]
R. L. B. Jr., S. Heumann, N. Honarmand, S. V. Adve, V. S. Adve, A. Welc, and T. Shpeisman. Safe nondeterminism in a deterministic-by-default parallel language. In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pages 535–548, 2011.
[21]
K. Kennedy, B. Broom, K. D. Cooper, J. Dongarra, R. J. Fowler, D. Gannon, S. L. Johnsson, J. M. Mellor-Crummey, and L. Torczon. Telescoping languages: A strategy for automatic generation of scientific problem-solving systems from annotated libraries. Journal of Parallel Distribued Computing, 61(12):1803–1826, 2001.
[22]
D. Lea. A Java fork/join framework. In Proceedings of the ACM 2000 Java Grande Conference, pages 36–43, 2000.
[23]
J. Lifflander, S. Krishnamoorthy, and L. V. Kalé. Optimizing data locality for fork/join programs using constrained work stealing. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 857–868, 2014.
[24]
B. D. Marsh, M. L. Scott, T. J. LeBlanc, and E. P. Markatos. First-class user-level theads. In Proceedings of the Thirteenth ACM Symposium on Operating System Principles (SOSP), pages 110–121, 1991.
[25]
V. Maslov. Delinearization: An efficient way to break multiloop dependence equations. In Proceedings of the ACM SIGPLAN’92 Conference on Programming Language Design and Implementation (PLDI), pages 152–161, 1992.
[26]
MIT Cilk 5.4.6. MIT Cilk 5.4.6. http://supertech. lcs.mit.edu/cilk.
[27]
V. K. Nandivada, J. Shirako, J. Zhao, and V. Sarkar. A transformation framework for optimizing task-parallel programs. ACM Transactions on Programming Languages and Systems, 35(1):3:1–3:48, 2013.
[28]
OpenMP Architecture Review Board. OpenMP Specification and Features. http://openmp.org/wp/, May 2008.
[29]
A. Pan and V. Pai. Runtime-driven shared last-level cache management for task-parallel programs. Technical Report 466, Department of Electrical and Computer Engineering, Purdue University, 2015.
[30]
J. Philbin, J. Edler, O. J. Anshus, C. C. Douglas, and K. Li. Thread scheduling for cache locality. In ASPLOS-VII Proceedings - Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 60–71, 1996.
[31]
L.-N. Pouchet. Polybench: The polyhedral benchmark suite, 2012.
[32]
D. J. Quinlan, M. Schordan, Q. Yi, and A. Sæbjørnsen. Classification and utilization of abstractions for optimization. In Leveraging Applications of Formal Methods, First International Symposium (ISoLA), pages 57–73, 2004.
[33]
J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. 2007.
[34]
A. D. Robison. Composable parallel patterns with Intel Cilk Plus. Computing in Science and Engineering, 15(2):66–71, 2013.
[35]
R. Rugina and M. C. Rinard. Automatic parallelization of divide and conquer algorithms. In Proceedings of the 1999 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 72–83, 1999.
[36]
R. Rugina and M. C. Rinard. Pointer analysis for structured parallel programs. ACM Transactions on Programming Languages and Systems, 25(1):70–116, 2003.
[37]
R. Rugina and M. C. Rinard. Symbolic bounds analysis of pointers, array indices, and accessed memory regions. ACM Transactions on Programming Languages and Systems, 27(2): 185–235, 2005.
[38]
S. Seo, A. Amer, P. Balaji, C. Bordage, G. Bosilca, A. Brooks, A. Castello, D. Genet, T. Herault, P. Jindal, L. Kale, S. Krishnamoorthy, J. Lifflander, H. Lu, E. Meneses, M. Snir, Y. Sun, and P. H. Beckman. Argobots: a lightweight threading/tasking framework. Technical Report ANL/MCS-P5515-0116, Argonne National Laboratory, 2016.
[39]
A. K. Sujeeth, T. Rompf, K. J. Brown, H. Lee, H. Chafi, V. Popic, M. Wu, A. Prokopec, V. Jovanovic, M. Odersky, and K. Olukotun. Composition and reuse with compiled domainspecific languages. In ECOOP - Object-Oriented Programming - 27th European Conference, pages 52–78, 2013.
[40]
Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C. Luk, and C. E. Leiserson. The pochoir stencil compiler. In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 117–128, 2011.
[41]
O. Tardieu, H. Wang, and H. Lin. A work-stealing scheduler for X10’s task parallelism with suspension. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 267–276, 2012.
[42]
TPL. The Task Parallel Library. http://msdn. microsoft.com/en-us/magazine/cc163340.
[43]
aspx, Oct. 2007.
[44]
S. Treichler, M. Bauer, and A. Aiken. Language support for dynamic, hierarchical data partitioning. In Proceedings of the ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA), pages 495–514, 2013.
[45]
T. L. Veldhuizen. Active libraries and universal languages. PhD thesis, Indiana University, 2004.
[46]
E. M. Westbrook, J. Zhao, Z. Budimlic, and V. Sarkar. Permission regions for race-free parallelism. In Runtime Verification - Second International Conference (RV), pages 94–109, 2011.
[47]
K. B. Wheeler, R. C. Murphy, and D. Thain. Qthreads: An API for programming with millions of lightweight threads. In 22nd IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1–8, 2008.
[48]
X10. The X10 Programming Language. www.research. ibm.com/x10/, Mar. 2006.

Cited By

View all
  • (2023)Architecture-Aware CurryingProceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques10.1109/PACT58117.2023.00029(250-264)Online publication date: 21-Oct-2023
  • (2023)Global Store Statement Aggregation2023 IEEE 14th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP)10.1109/PAAP60200.2023.10391749(1-6)Online publication date: 24-Nov-2023
  • (2023)Data Recomputation for Multithreaded Applications2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD)10.1109/ICCAD57390.2023.10323776(01-09)Online publication date: 28-Oct-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PLDI 2017: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2017
708 pages
ISBN:9781450349888
DOI:10.1145/3062341
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2017

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. locality optimization
  2. recursive programs

Qualifiers

  • Research-article

Conference

PLDI '17
Sponsor:

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)29
  • Downloads (Last 6 weeks)2
Reflects downloads up to 23 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2023)Architecture-Aware CurryingProceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques10.1109/PACT58117.2023.00029(250-264)Online publication date: 21-Oct-2023
  • (2023)Global Store Statement Aggregation2023 IEEE 14th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP)10.1109/PAAP60200.2023.10391749(1-6)Online publication date: 24-Nov-2023
  • (2023)Data Recomputation for Multithreaded Applications2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD)10.1109/ICCAD57390.2023.10323776(01-09)Online publication date: 28-Oct-2023
  • (2022)Arbitrarily Parallelizable Code: A Model of Computation Evaluated on a Message-Passing Many-Core SystemComputers10.3390/computers1111016411:11(164)Online publication date: 18-Nov-2022
  • (2022)TaskStream: accelerating task-parallel workloads by recovering program structureProceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3503222.3507706(1-13)Online publication date: 28-Feb-2022
  • (2021)PAVERACM Transactions on Architecture and Code Optimization10.1145/345116418:3(1-26)Online publication date: 8-Jun-2021
  • (2021)Scaling implicit parallelism via dynamic control replicationProceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3437801.3441587(105-118)Online publication date: 17-Feb-2021
  • (2020)Design and Implementation Techniques for an MPI-Oriented AMT Runtime2020 Workshop on Exascale MPI (ExaMPI)10.1109/ExaMPI52011.2020.00009(31-40)Online publication date: Nov-2020
  • (2019)D2PProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3295500.3356205(1-22)Online publication date: 17-Nov-2019
  • (2017)Locality-Aware Dynamic Task Graph Scheduling2017 46th International Conference on Parallel Processing (ICPP)10.1109/ICPP.2017.16(70-80)Online publication date: Aug-2017

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media