
research-article

Scaling large-data computations on multi-GPU accelerators

Authors:

Putt Sakdhnagool,

Rudolf Eigenmann

ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing

Pages 443 - 454

https://doi.org/10.1145/2464996.2465023

Published: 10 June 2013

Abstract

Modern supercomputers rely on accelerators to speed up highly parallel workloads. Intricate programming models, limited device memory sizes and overheads of data transfers between CPU and accelerator memories are among the open challenges that restrict the widespread use of accelerators. First, this paper proposes a mechanism and an implementation to automatically pipeline the CPU-GPU memory channel so as to overlap the GPU computation with the memory copies, alleviating the data transfer overhead. Second, in doing so, the paper presents a technique called Computation Splitting, COSP, that caters to arbitrary device memory sizes and automatically manages to run out-of-card OpenMP-like applications on GPUs. Third, a novel adaptive runtime tuning mechanism is proposed to automatically select the pipeline stage size so as to gain the best possible performance. The mechanism adapts to the underlying hardware in the starting phase of a program and chooses the pipeline stage size. The techniques are implemented in a system that is able to translate an input OpenMP program to multiple GPUs attached to the same host CPU. Experimentation on a set of nine benchmarks shows that, on average, the pipelining scheme improves the performance by 1.49x, while limiting the runtime tuning overhead to 3% of the execution time.
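
The abstract does not give implementation details, so the following is only a minimal sketch of the three ideas it names: splitting a computation into stages that fit in device memory, overlapping the copy of one stage with the computation on the previous one, and choosing the stage size by trial. It is a timing model in Python, not the authors' system. The `Device` cost parameters, the double-buffering assumption, and every function name are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical cost model; every field is an illustrative assumption."""
    mem_elems: int           # device memory capacity, in elements
    copy_latency: float      # fixed cost of one host-to-device copy
    copy_per_elem: float     # transfer cost per element
    kernel_latency: float    # fixed cost of one kernel launch
    compute_per_elem: float  # compute cost per element


def copy_cost(dev, n):
    return dev.copy_latency + dev.copy_per_elem * n


def compute_cost(dev, n):
    return dev.kernel_latency + dev.compute_per_elem * n


def split(n_elems, stage_size):
    """Split n_elems into chunks of at most stage_size elements."""
    full, rest = divmod(n_elems, stage_size)
    return [stage_size] * full + ([rest] if rest else [])


def serial_time(dev, chunks):
    """Copy, then compute, each chunk in turn with no overlap."""
    return sum(copy_cost(dev, c) + compute_cost(dev, c) for c in chunks)


def pipelined_time(dev, chunks):
    """Overlap the copy of chunk i+1 with the computation on chunk i.

    Assumes two device buffers, so the copy of chunk i has to wait
    until the computation on chunk i-2 has released its buffer.
    """
    comp_done = []   # finish time of the computation on each chunk
    copy_free = 0.0  # time at which the copy channel becomes idle
    for i, c in enumerate(chunks):
        buffer_free = comp_done[i - 2] if i >= 2 else 0.0
        copy_end = max(copy_free, buffer_free) + copy_cost(dev, c)
        copy_free = copy_end
        prev_comp = comp_done[-1] if comp_done else 0.0
        comp_done.append(max(prev_comp, copy_end) + compute_cost(dev, c))
    return comp_done[-1] if comp_done else 0.0


def tune_stage_size(dev, n_elems, candidates):
    """Return the candidate stage size with the lowest modeled time.

    Only sizes that leave room for two buffers on the device qualify.
    """
    feasible = [s for s in candidates if 0 < s <= dev.mem_elems // 2]
    if not feasible:
        raise ValueError("no candidate stage size fits in device memory")
    return min(feasible, key=lambda s: pipelined_time(dev, split(n_elems, s)))
```

In this model a small stage size pays the fixed copy and launch latencies many times, while a large one leaves less work to overlap, which is one plausible reason the stage size is worth tuning. A real tuner would time actual transfers and kernels during program start-up, as the abstract describes, rather than evaluate a fixed formula.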


Index Terms

Scaling large-data computations on multi-GPU accelerators
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing

June 2013

512 pages

ISBN:9781450321303

DOI:10.1145/2464996

General Chair: Allen D. Malony (University of Oregon, USA)

Program Chairs: Mario Nemirovsky (Barcelona Supercomputing Center, Spain), Sam Midkiff (Purdue University, USA)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS'13

Sponsor:

SIGARCH

ICS'13: International Conference on Supercomputing

June 10 - 14, 2013

Eugene, Oregon, USA

Acceptance Rates

ICS '13 Paper Acceptance Rate 43 of 202 submissions, 21%;

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics

Total Citations: 18

Total Downloads: 536

Downloads (Last 12 months): 20

Downloads (Last 6 weeks): 2

Reflects downloads up to 02 Feb 2025
