research-article

Memory access patterns: the missing piece of the multi-GPU puzzle

Authors:

Eri Rubin

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 19, Pages 1 - 12

https://doi.org/10.1145/2807591.2807611

Published: 15 November 2015

Abstract

With the increased popularity of multi-GPU nodes in modern HPC clusters, it is imperative to develop matching programming paradigms for their efficient utilization. In order to take advantage of the local GPUs and the low-latency high-throughput interconnects that link them, programmers need to meticulously adapt parallel applications with respect to load balancing, boundary conditions and device synchronization. This paper presents MAPS-Multi, an automatic multi-GPU partitioning framework that distributes the workload based on the underlying memory access patterns. The framework consists of host- and device-level APIs that allow programs to efficiently run on a variety of GPU and multi-GPU architectures. The framework implements several layers of code optimization, device abstraction, and automatic inference of inter-GPU memory exchanges. The paper demonstrates that the performance of MAPS-Multi achieves near-linear scaling on fundamental computational operations, as well as real-world applications in deep learning and multivariate analysis.
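To make the idea of partitioning by memory access pattern more concrete, the following is a minimal sketch and not the MAPS-Multi API. It assumes a 1D array processed with a windowed (stencil-like) access pattern of a fixed radius, an even split of the output across devices, and accesses that are simply cut off at the array edges. All names (`Segment`, `partition_1d`, `halo_sizes`, `radius`) are hypothetical. The point it illustrates is that once the access pattern is known, the input range each device needs, and therefore the boundary data it must receive from its neighbors, can be derived automatically.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    device: int
    out_begin: int  # first output index owned by this device
    out_end: int    # one past the last owned output index
    in_begin: int   # first input index this device must read
    in_end: int     # one past the last input index it must read


def partition_1d(n: int, num_devices: int, radius: int) -> List[Segment]:
    """Split n output elements across devices and derive, from the window
    radius, the input range each device needs (hypothetical helper)."""
    if n < 0 or num_devices <= 0 or radius < 0:
        raise ValueError("n and radius must be >= 0, num_devices must be > 0")
    base, extra = divmod(n, num_devices)
    segments = []
    begin = 0
    for dev in range(num_devices):
        # The first `extra` devices take one additional element.
        end = begin + base + (1 if dev < extra else 0)
        # Assumption: reads outside [0, n) are cut off at the array edges.
        in_begin = max(0, begin - radius)
        in_end = min(n, end + radius)
        segments.append(Segment(dev, begin, end, in_begin, in_end))
        begin = end
    return segments


def halo_sizes(seg: Segment) -> Tuple[int, int]:
    """Number of elements to fetch from the left and right neighbors."""
    return seg.out_begin - seg.in_begin, seg.in_end - seg.out_end
```

For example, `partition_1d(10, 3, 1)` assigns output ranges `[0, 4)`, `[4, 7)` and `[7, 10)` to three devices, and the middle device needs one extra element from each neighbor. A real framework would also have to handle other boundary conditions, multi-dimensional data, and the scheduling of the exchanges themselves.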

Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2015

985 pages

ISBN:9781450337236

DOI:10.1145/2807591

General Chair: Jackie Kern, University of Illinois at Urbana-Champaign, Urbana, Illinois

Program Chair: Jeffrey S. Vetter, Oak Ridge National Laboratory and Georgia Institute of Technology, Oak Ridge, Tennessee

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 15 - 20, 2015

Austin, Texas

Sponsors: SIGHPC, SIGARCH, IEEE-CS

Acceptance Rates

SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

Total Citations: 37
Total Downloads: 1,121
Downloads (last 12 months): 70
Downloads (last 6 weeks): 7

Reflects downloads up to 11 Dec 2024
