DOI: 10.1145/2464996.2465023
research-article

Scaling large-data computations on multi-GPU accelerators

Published: 10 June 2013

Abstract

Modern supercomputers rely on accelerators to speed up highly parallel workloads. Intricate programming models, limited device memory sizes and overheads of data transfers between CPU and accelerator memories are among the open challenges that restrict the widespread use of accelerators. First, this paper proposes a mechanism and an implementation to automatically pipeline the CPU-GPU memory channel so as to overlap the GPU computation with the memory copies, alleviating the data transfer overhead. Second, in doing so, the paper presents a technique called Computation Splitting, COSP, that caters to arbitrary device memory sizes and automatically manages to run out-of-card OpenMP-like applications on GPUs. Third, a novel adaptive runtime tuning mechanism is proposed to automatically select the pipeline stage size so as to gain the best possible performance. The mechanism adapts to the underlying hardware in the starting phase of a program and chooses the pipeline stage size. The techniques are implemented in a system that is able to translate an input OpenMP program to multiple GPUs attached to the same host CPU. Experimentation on a set of nine benchmarks shows that, on average, the pipelining scheme improves the performance by 1.49x, while limiting the runtime tuning overhead to 3% of the execution time.
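
The pipelining, computation splitting, and stage-size selection described in the abstract can be illustrated with a toy cost model. The Python sketch below is not the paper's implementation: the `Channel` parameters, the helper names (`split`, `pipelined_time`, `unpipelined_time`, `pick_stage_size`), and every number are hypothetical. It ignores copying results back to the host, and it picks a stage size by evaluating the model, whereas the abstract describes a runtime that adapts to the actual hardware in the starting phase of the program.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    """Hypothetical cost model of one CPU-GPU pair (illustrative only)."""
    copy_bandwidth: float     # bytes per second over the CPU-GPU channel
    copy_latency: float       # fixed seconds paid per host-to-device copy
    kernel_throughput: float  # bytes per second processed by the kernel
    launch_latency: float     # fixed seconds paid per kernel launch


def split(total_bytes, stage_bytes):
    """Cut the data into chunks of at most stage_bytes each, so that a
    data set larger than device memory can be processed piece by piece."""
    full, rest = divmod(total_bytes, stage_bytes)
    return [stage_bytes] * full + ([rest] if rest else [])


def copy_time(size, ch):
    return ch.copy_latency + size / ch.copy_bandwidth


def kernel_time(size, ch):
    return ch.launch_latency + size / ch.kernel_throughput


def unpipelined_time(chunks, ch):
    """Copy a chunk, then compute on it, with no overlap between the two."""
    return sum(copy_time(s, ch) + kernel_time(s, ch) for s in chunks)


def pipelined_time(chunks, ch):
    """Two-stage pipeline: the copy of the next chunk proceeds while the
    kernel runs on the current one. Copies are serialized on the channel
    and kernels are serialized on the device."""
    copy_done = 0.0
    kernel_done = 0.0
    for size in chunks:
        copy_done += copy_time(size, ch)
        # A kernel starts once its data has arrived and the device is free.
        start = max(copy_done, kernel_done)
        kernel_done = start + kernel_time(size, ch)
    return kernel_done


def pick_stage_size(total_bytes, candidates, ch):
    """Return the candidate stage size with the lowest modeled time."""
    return min(candidates,
               key=lambda s: pipelined_time(split(total_bytes, s), ch))


if __name__ == "__main__":
    ch = Channel(copy_bandwidth=100.0, copy_latency=0.25,
                 kernel_throughput=100.0, launch_latency=0.25)
    best = pick_stage_size(400, [25, 50, 100, 200, 400], ch)
    chunks = split(400, best)
    print("stage size:", best)
    print("pipelined:", pipelined_time(chunks, ch))
    print("unpipelined:", unpipelined_time(chunks, ch))
```

In this model, very small stages pay the fixed per-copy and per-launch latencies many times, while very large stages leave little room for overlap, which is one way to see why the stage size is worth tuning.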

    Published In

    ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing
    June 2013
    512 pages
    ISBN: 9781450321303
    DOI: 10.1145/2464996
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPU
    2. large-data
    3. openMP
    4. out-of-card computations
    5. pipelining
    6. tuning

    Qualifiers

    • Research-article

    Conference

    ICS'13: International Conference on Supercomputing
    June 10 - 14, 2013
    Eugene, Oregon, USA

    Acceptance Rates

    ICS '13 Paper Acceptance Rate 43 of 202 submissions, 21%;
    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Article Metrics

    • Downloads (last 12 months): 20
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 02 Feb 2025.

    Cited By

    • (2024) High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data. Mathematics, 12:13 (1930). DOI: 10.3390/math12131930. Online publication date: 21-Jun-2024
    • (2024) IRIS: A Performance-Portable Framework for Cross-Platform Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems, 35:10 (1796-1809). DOI: 10.1109/TPDS.2024.3429010. Online publication date: Oct-2024
    • (2024) A Runtime Manager Integrated Emulation Environment for Heterogeneous SoC Design with RISC-V Cores. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (23-30). DOI: 10.1109/IPDPSW63119.2024.00013. Online publication date: 27-May-2024
    • (2023) CEDR: A Compiler-integrated, Extensible DSSoC Runtime. ACM Transactions on Embedded Computing Systems, 22:2 (1-34). DOI: 10.1145/3529257. Online publication date: 24-Jan-2023
    • (2021) GPU Accelerated Path Tracing of Massive Scenes. ACM Transactions on Graphics, 40:2 (1-17). DOI: 10.1145/3447807. Online publication date: 27-Apr-2021
    • (2021) IRIS: A Portable Runtime System Exploiting Multiple Heterogeneous Programming Systems. 2021 IEEE High Performance Extreme Computing Conference (HPEC) (1-8). DOI: 10.1109/HPEC49654.2021.9622873. Online publication date: 20-Sep-2021
    • (2020) Optimizing non-coalesced memory access for irregular applications with GPU computing. Frontiers of Information Technology & Electronic Engineering, 21:9 (1285-1301). DOI: 10.1631/FITEE.1900262. Online publication date: 17-Sep-2020
    • (2020) XeFlow: Streamlining Inter-Processor Pipeline Execution for the Discrete CPU-GPU Platform. IEEE Transactions on Computers, 69:6 (819-831). DOI: 10.1109/TC.2020.2968302. Online publication date: 1-Jun-2020
    • (2019) Pagoda. ACM Transactions on Parallel Computing, 6:4 (1-23). DOI: 10.1145/3365657. Online publication date: 19-Nov-2019
    • (2019) CuLDA. Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing (195-205). DOI: 10.1145/3307681.3325407. Online publication date: 17-Jun-2019
