research-article

A distributed OpenCL framework using redundant computation and data replication

Authors:

Jaejin Lee

PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation

Pages 553 - 569

https://doi.org/10.1145/2908080.2908094

Published: 02 June 2016

Abstract

Applications written solely in OpenCL or CUDA cannot execute on a cluster as a whole. Most previous approaches that extend these programming models to clusters are based on a common idea: designating a centralized host node and coordinating the other nodes with the host for computation. However, the centralized host node is a serious performance bottleneck when the number of nodes is large. In this paper, we propose a scalable and distributed OpenCL framework called SnuCL-D for large-scale clusters. SnuCL-D's remote device virtualization provides an OpenCL application with an illusion that all compute devices in a cluster are confined in a single node. To reduce the amount of control-message and data communication between nodes, SnuCL-D replicates the OpenCL host program execution and data in each node. We also propose a new OpenCL host API function and a queueing optimization technique that significantly reduce the overhead incurred by the previous centralized approaches. To show the effectiveness of SnuCL-D, we evaluate SnuCL-D with a microbenchmark and eleven benchmark applications on a large-scale CPU cluster and a medium-scale GPU cluster.
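
The contrast between a centralized host and replicated host execution can be illustrated with a toy message count. The sketch below is not SnuCL-D code and uses no OpenCL API: `Command`, `make_commands`, `centralized_control_messages`, and `replicated_execution` are hypothetical names, and the model assumes one control message per remotely issued command while ignoring data transfers and synchronization entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """One host-side command (for example a kernel launch) aimed at a device on some node."""
    target_node: int
    name: str


def make_commands(num_nodes, kernels_per_node):
    """Build a toy workload: the same number of kernel launches for every node."""
    return [
        Command(node, f"kernel_{node}_{k}")
        for node in range(num_nodes)
        for k in range(kernels_per_node)
    ]


def centralized_control_messages(commands, host_node=0):
    """Centralized model (assumption): the single host node sends one control
    message for every command whose target device lives on another node."""
    return sum(1 for c in commands if c.target_node != host_node)


def replicated_execution(commands, num_nodes):
    """Replicated model (assumption): every node walks the same command list
    and runs only the commands aimed at its own devices, so issuing a command
    never crosses the network in this toy model."""
    executed = {node: [] for node in range(num_nodes)}
    for node in range(num_nodes):
        for c in commands:
            if c.target_node == node:
                executed[node].append(c.name)  # issued locally
    control_messages = 0  # by construction: no command is issued remotely
    return executed, control_messages


if __name__ == "__main__":
    cmds = make_commands(num_nodes=4, kernels_per_node=3)
    print("centralized control messages:", centralized_control_messages(cmds))
    executed, msgs = replicated_execution(cmds, num_nodes=4)
    print("replicated control messages:", msgs)
```

In this model the centralized count grows with the number of remote commands, whereas the replicated count stays at zero at the cost of every node redundantly walking the full command list. That is the trade-off the abstract names as redundant computation.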

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Information & Contributors

Information

Published In

PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2016

726 pages

ISBN:9781450342612

DOI:10.1145/2908080

General Chair: Chandra Krintz, University of California at Santa Barbara, USA
Program Chair: Emery Berger, University of Massachusetts at Amherst, USA

ACM SIGPLAN Notices, Volume 51, Issue 6
PLDI '16
June 2016
726 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2980983
Editor: Andy Gill, University of Kansas, Lawrence, KS

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PLDI '16

Sponsor: SIGPLAN

PLDI '16: ACM SIGPLAN Conference on Programming Language Design and Implementation

June 13 - 17, 2016

Santa Barbara, CA, USA

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Bibliometrics

Total Citations: 12
Total Downloads: 557
Downloads (Last 12 months): 18
Downloads (Last 6 weeks): 3

Reflects downloads up to 13 Dec 2024

Cited By

Czarnul P (2023). A multithreaded CUDA and OpenMP based power-aware programming framework for multi-node GPU systems. Concurrency and Computation: Practice and Experience, 35:25. DOI: 10.1002/cpe.7897. Online publication date: 29-Aug-2023
https://doi.org/10.1002/cpe.7897
Lyerly R, Bilbao C, Min C, Rossbach C, Ravindran B (2022). An OpenMP Runtime for Transparent Work Sharing across Cache-Incoherent Heterogeneous Nodes. ACM Transactions on Computer Systems, 39:1-4, pages 1-30. DOI: 10.1145/3505224. Online publication date: 5-Jul-2022
https://dl.acm.org/doi/10.1145/3505224
Heldens S, Hijma P, Van Werkhoven B, Maassen J, van Nieuwpoort R (2022). Lightning: Scaling the GPU Programming Model Beyond a Single GPU. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 492-503. DOI: 10.1109/IPDPS53621.2022.00054. Online publication date: May-2022
https://doi.org/10.1109/IPDPS53621.2022.00054
Jung J, Park D, Jo G, Park J, Lee J, Laure E, Markidis S, Verbanescu A, Lofstead G (2021). SnuRHAC. Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, pages 107-120. DOI: 10.1145/3431379.3460647. Online publication date: 21-Jun-2021
https://dl.acm.org/doi/10.1145/3431379.3460647
Matz A, Doerfert J, Fröning H (2020). Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation. Workshop Proceedings of the 49th International Conference on Parallel Processing, pages 1-10. DOI: 10.1145/3409390.3409403. Online publication date: 17-Aug-2020
https://dl.acm.org/doi/10.1145/3409390.3409403
Alves R, Rufino J (2020). Extending Heterogeneous Applications to Remote Co-processors with rOpenCL. 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 305-312. DOI: 10.1109/SBAC-PAD49847.2020.00049. Online publication date: Sep-2020
https://doi.org/10.1109/SBAC-PAD49847.2020.00049
Chen Y, Long X, He J, Chen Y, Tan H, Zhang Z, Winslett M, Chen D (2020). HaoCL: Harnessing Large-scale Heterogeneous Processors Made Easy. 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), pages 1231-1234. DOI: 10.1109/ICDCS47774.2020.00120. Online publication date: Nov-2020
https://doi.org/10.1109/ICDCS47774.2020.00120
Mammeri N, Juurlink B (2019). VComputeLib. Proceedings of the 17th International Conference on Advances in Mobile Computing & Multimedia, pages 242-251. DOI: 10.1145/3365921.3365936. Online publication date: 2-Dec-2019
https://dl.acm.org/doi/10.1145/3365921.3365936
Cho H, Kwon O, Midkiff S (2019). HDArray: Parallel Array Interface for Distributed Heterogeneous Devices. Languages and Compilers for Parallel Computing, pages 176-184. DOI: 10.1007/978-3-030-34627-0_13. Online publication date: 13-Nov-2019
https://doi.org/10.1007/978-3-030-34627-0_13
Liao L, Li K, Li K, Yang C, Tian Q (2018). UHCL-Darknet. Proceedings of the 47th International Conference on Parallel Processing, pages 1-10. DOI: 10.1145/3225058.3225107. Online publication date: 13-Aug-2018
https://dl.acm.org/doi/10.1145/3225058.3225107
