DOI: 10.1145/2908080.2908094

A distributed OpenCL framework using redundant computation and data replication

Published: 02 June 2016

Abstract

Applications written solely in OpenCL or CUDA cannot execute on a cluster as a whole. Most previous approaches that extend these programming models to clusters are based on a common idea: designating a centralized host node and coordinating the other nodes with the host for computation. However, the centralized host node is a serious performance bottleneck when the number of nodes is large. In this paper, we propose a scalable and distributed OpenCL framework called SnuCL-D for large-scale clusters. SnuCL-D's remote device virtualization provides an OpenCL application with an illusion that all compute devices in a cluster are confined in a single node. To reduce the amount of control-message and data communication between nodes, SnuCL-D replicates the OpenCL host program execution and data in each node. We also propose a new OpenCL host API function and a queueing optimization technique that significantly reduce the overhead incurred by the previous centralized approaches. To show the effectiveness of SnuCL-D, we evaluate SnuCL-D with a microbenchmark and eleven benchmark applications on a large-scale CPU cluster and a medium-scale GPU cluster.
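
The difference between a centralized host and replicated host execution can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not SnuCL-D's implementation: the names `Node`, `host_program`, `run_centralized`, and `run_replicated` are hypothetical, each node is assumed to own exactly one device (device i lives on node i), and only control messages are counted. Data transfers, which a real framework still has to handle, are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster node that owns one device (assumption: device id == rank)."""
    rank: int
    executed: list = field(default_factory=list)  # kernels this node ran


def host_program(num_devices):
    """Hypothetical host program: one kernel launch per virtual device."""
    return [("vec_add", dev) for dev in range(num_devices)]


def run_centralized(nodes):
    """Node 0 alone runs the host program and forwards remote commands."""
    messages = 0
    for kernel, dev in host_program(len(nodes)):
        if dev != 0:
            messages += 1  # one control message from the host node to the owner
        nodes[dev].executed.append(kernel)
    return messages


def run_replicated(nodes):
    """Every node runs the same host program and keeps only its own commands."""
    messages = 0  # never incremented: no command is forwarded in this model
    for node in nodes:
        for kernel, dev in host_program(len(nodes)):
            if dev == node.rank:
                node.executed.append(kernel)
            # Commands for devices on other nodes are skipped locally.
    return messages
```

Calling `run_centralized` on four fresh nodes counts three control messages, while `run_replicated` counts none. In both cases every node ends up having executed its one kernel. The replicated variant pays for this with redundant host-side work on every node, which is the trade-off the abstract describes.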

    Published In

    PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation
    June 2016
    726 pages
    ISBN: 9781450342612
    DOI: 10.1145/2908080
    General Chair: Chandra Krintz
    Program Chair: Emery Berger

    ACM SIGPLAN Notices, Volume 51, Issue 6
    PLDI '16
    June 2016
    726 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/2980983
    Editor: Andy Gill

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. OpenCL
    2. clusters
    3. data replication
    4. heterogeneous computing
    5. programming models
    6. redundant computation
    7. runtime systems

    Qualifiers

    • Research-article

    Acceptance Rates

    Overall Acceptance Rate 406 of 2,067 submissions, 20%

    Cited By

    • (2023) A multithreaded CUDA and OpenMP based power‐aware programming framework for multi‐node GPU systems. Concurrency and Computation: Practice and Experience 35:25. DOI: 10.1002/cpe.7897. Online publication date: 29-Aug-2023
    • (2022) An OpenMP Runtime for Transparent Work Sharing across Cache-Incoherent Heterogeneous Nodes. ACM Transactions on Computer Systems 39:1-4 (1-30). DOI: 10.1145/3505224. Online publication date: 5-Jul-2022
    • (2022) Lightning: Scaling the GPU Programming Model Beyond a Single GPU. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (492-503). DOI: 10.1109/IPDPS53621.2022.00054. Online publication date: May-2022
    • (2021) SnuRHAC. Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (107-120). DOI: 10.1145/3431379.3460647. Online publication date: 21-Jun-2021
    • (2020) Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation. Workshop Proceedings of the 49th International Conference on Parallel Processing (1-10). DOI: 10.1145/3409390.3409403. Online publication date: 17-Aug-2020
    • (2020) Extending Heterogeneous Applications to Remote Co-processors with rOpenCL. 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) (305-312). DOI: 10.1109/SBAC-PAD49847.2020.00049. Online publication date: Sep-2020
    • (2020) HaoCL: Harnessing Large-scale Heterogeneous Processors Made Easy. 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS) (1231-1234). DOI: 10.1109/ICDCS47774.2020.00120. Online publication date: Nov-2020
    • (2019) VComputeLib. Proceedings of the 17th International Conference on Advances in Mobile Computing & Multimedia (242-251). DOI: 10.1145/3365921.3365936. Online publication date: 2-Dec-2019
    • (2019) HDArray: Parallel Array Interface for Distributed Heterogeneous Devices. Languages and Compilers for Parallel Computing (176-184). DOI: 10.1007/978-3-030-34627-0_13. Online publication date: 13-Nov-2019
    • (2018) UHCL-Darknet. Proceedings of the 47th International Conference on Parallel Processing (1-10). DOI: 10.1145/3225058.3225107. Online publication date: 13-Aug-2018
