DOI: 10.1109/CGO.2013.6494992
Article

Runtime dependence computation and execution of loops on heterogeneous systems

Published: 23 February 2013

Abstract

GPUs have been used for the parallel execution of DOALL loops. However, loops with indirect array references can potentially cause cross-iteration dependences that are hard to detect using existing compilation techniques. Applications with such loops cannot easily use the GPU and hence do not benefit from its tremendous compute capabilities. In this paper, we present an algorithm to compute the cross-iteration dependences in such loops at runtime. The algorithm uses both the CPU and the GPU to compute the dependences. Specifically, it effectively uses the compute capabilities of the GPU to quickly collect the memory accesses performed by the iterations, by executing the slice functions generated for the indirect array accesses. Using the dependence information, the loop iterations are levelized such that each level contains only independent iterations, which can be executed in parallel. Another interesting aspect of the proposed solution is that it pipelines the dependence computation of the next level with the actual computation of the current level to effectively utilize the resources available on the GPU. We use an NVIDIA Tesla C2070 to evaluate our implementation on benchmarks from the Polybench suite and some synthetic benchmarks. Our experiments show that the proposed technique achieves an average speedup of 6.4x on loops with a reasonable number of cross-iteration dependences.
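
The levelization step described above is easy to picture with a small example. The sketch below is not the paper's implementation (which executes compiler-generated slice functions on the GPU to collect the accesses, tracks both read and write dependences, and pipelines the dependence computation of the next level with the execution of the current one); it is a minimal host-side illustration, assuming a hypothetical irregular loop of the form A[col[i]] += f(B[i]) in which only write-write conflicts through the runtime-known index array col create cross-iteration dependences.

# Hypothetical sketch of levelization: group loop iterations into "levels"
# (wavefronts) so that iterations within a level touch disjoint locations
# and can run in parallel, e.g. as one GPU kernel launch per level.
# Models the irregular loop: for i in range(n): A[col[i]] += f(B[i])

def levelize(writes, n):
    """Place iteration i one level after the last iteration that wrote
    the same location writes[i]; unconflicted iterations go to level 0."""
    last_level = {}              # location -> level of its latest writer
    levels = [0] * n
    for i in range(n):
        loc = writes[i]
        lvl = last_level.get(loc, -1) + 1   # run after the prior writer
        levels[i] = lvl
        last_level[loc] = lvl
    depth = max(levels) + 1 if n else 0
    wavefronts = [[] for _ in range(depth)]
    for i, lvl in enumerate(levels):
        wavefronts[lvl].append(i)           # each list is a DOALL set
    return wavefronts

# Iterations 0 and 2 both write location 5, so they land in different
# levels; all other iterations are mutually independent.
col = [5, 3, 5, 7, 1]
for level, iters in enumerate(levelize(col, len(col))):
    print(f"level {level}: iterations {iters}")
# level 0: iterations [0, 1, 3, 4]
# level 1: iterations [2]

In the paper's setting the access collection itself is offloaded to the GPU via the slice functions, and the CPU-side grouping above would overlap with the GPU's execution of the previous level; the sketch shows only the grouping.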


Published In

CGO '13: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
February 2013, 366 pages
ISBN: 9781467355247
Publisher: IEEE Computer Society, United States


