DOI: 10.1109/CLUSTER.2011.42
Article

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2

Published: 26 September 2011

Abstract

Data-parallel architectures such as General-Purpose Graphics Processing Units (GPGPUs) have seen a tremendous rise in their application to High-End Computing. However, data movement into and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Real scientific applications use multi-dimensional data, and data in higher dimensions may not be contiguous in memory. To improve programmer productivity and to enable communication libraries to optimize non-contiguous data communication, the MPI interface provides MPI datatypes. Currently, state-of-the-art MPI libraries do not provide native datatype support for data that resides in GPU memory. Managing non-contiguous GPU data is therefore a source of lost productivity and performance, because GPU application developers must move the data out of and back into the GPU by hand. In this paper, we present our design for enabling high-performance communication of non-contiguous datatypes between GPUs. We describe our approach of "offloading" datatype packing and unpacking onto the GPU device, and "pipelining" all data transfer stages between two GPUs. Our design is integrated into the popular MVAPICH2 MPI library for InfiniBand, iWARP, and RoCE clusters. We perform a detailed evaluation of our design on a GPU cluster with the latest NVIDIA Fermi GPUs. The evaluation reveals that the proposed design can achieve up to an 88% latency improvement for the vector datatype at 4 MB message size in micro-benchmarks. For the Stencil2D application from the SHOC benchmark suite, our design simplifies the data communication in its main loop, reducing the lines of code by 36%. Further, our method improves the performance of Stencil2D by up to 42% for the single-precision data set and 39% for the double-precision data set. To the best of our knowledge, this is the first such design, implementation, and evaluation of non-contiguous MPI data communication for GPU clusters.
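
To make the datatype interface the abstract refers to concrete, the following is a minimal sketch (not taken from the paper) of describing a non-contiguous column of a row-major matrix that lives in GPU memory with MPI_Type_vector, and communicating it directly on the device pointer. It assumes a CUDA-aware MPI library such as the MVAPICH2 design evaluated here; the matrix size, ranks, and tag are illustrative.

```cuda
/* Minimal sketch: one column of a row-major N x N matrix in GPU memory,
 * described as an MPI vector datatype (N blocks of 1 element, stride N).
 * With GPU datatype support, the send/recv can be issued directly on the
 * device pointer; with a stock MPI library, the developer would have to
 * pack the column and stage it through host memory by hand.
 * Assumed: CUDA toolkit, a CUDA-aware MPI (e.g. MVAPICH2); run with 2 ranks. */
#include <mpi.h>
#include <cuda_runtime.h>

#define N 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *d_matrix;                       /* device-resident N x N matrix */
    cudaMalloc((void **)&d_matrix, (size_t)N * N * sizeof(double));

    /* A column of a row-major matrix: N blocks of 1 element, each N
     * elements apart -- a classic non-contiguous layout. */
    MPI_Datatype column;
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)                          /* send first column */
        MPI_Send(d_matrix, 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)                     /* receive into last column */
        MPI_Recv(d_matrix + (N - 1), 1, column, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(d_matrix);
    MPI_Finalize();
    return 0;
}
```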
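
The two ideas the abstract names can also be sketched in code: pack the non-contiguous elements into a contiguous buffer with a GPU kernel instead of looping on the host, and move the packed buffer in chunks so the device-to-host copy of one chunk overlaps the network send of the previous one. The sketch below only illustrates that pattern; the names pack_vector and pipelined_send, the 2 MB chunk size, and the staging scheme are hypothetical, not the MVAPICH2 implementation.

```cuda
/* Hand-wavy sketch of GPU-side packing plus a pipelined send.
 * All names and parameters here are illustrative assumptions. */
#include <mpi.h>
#include <cuda_runtime.h>

/* GPU-side packing: gather a strided vector into a contiguous buffer. */
__global__ void pack_vector(const double *src, double *dst,
                            int count, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        dst[i] = src[(size_t)i * stride];
}

/* Pipelined send of a packed device buffer of `total` doubles: the
 * device-to-host copy of chunk c overlaps the network send of chunk c-1. */
static void pipelined_send(const double *d_packed, int total,
                           int peer, MPI_Comm comm, cudaStream_t stream)
{
    enum { CHUNK = 256 * 1024 };            /* 2 MB of doubles per stage */
    double *h_stage;                        /* pinned host staging buffer */
    cudaMallocHost((void **)&h_stage, (size_t)total * sizeof(double));

    int nchunks = (total + CHUNK - 1) / CHUNK;
    MPI_Request prev = MPI_REQUEST_NULL;

    for (int c = 0; c < nchunks; ++c) {
        int off = c * CHUNK;
        int len = (off + CHUNK <= total) ? CHUNK : total - off;

        /* Stage chunk c to the host while chunk c-1 is on the wire. */
        cudaMemcpyAsync(h_stage + off, d_packed + off,
                        (size_t)len * sizeof(double),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);

        MPI_Wait(&prev, MPI_STATUS_IGNORE); /* retire previous chunk */
        MPI_Isend(h_stage + off, len, MPI_DOUBLE, peer, c, comm, &prev);
    }
    MPI_Wait(&prev, MPI_STATUS_IGNORE);
    cudaFreeHost(h_stage);
}
```

The receive side would mirror this: chunks land in pinned host memory, are copied to the device asynchronously, and an unpack kernel scatters them back into the strided layout, keeping all transfer stages overlapped.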



Information

Published In

CLUSTER '11: Proceedings of the 2011 IEEE International Conference on Cluster Computing
September 2011
613 pages
ISBN: 978-0-7695-4516-5

Publisher

IEEE Computer Society

United States

Publication History

Published: 26 September 2011

Author Tags

  1. Cluster
  2. GPGPU
  3. MPI
  4. Non-contiguous

Qualifiers

  • Article


Cited By

  • (2023) Evaluating the Viability of LogGP for Modeling MPI Performance with Non-contiguous Datatypes on Modern Architectures. Proceedings of the 30th European MPI Users' Group Meeting, pp. 1-10. DOI: 10.1145/3615318.3615326. Online publication date: 11-Sep-2023.
  • (2021) TEMPI. Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, pp. 95-106. DOI: 10.1145/3431379.3460645. Online publication date: 21-Jun-2021.
  • (2019) Network-accelerated non-contiguous memory transfers. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.1145/3295500.3356189. Online publication date: 17-Nov-2019.
  • (2018) MPI Derived Datatypes. Proceedings of the 25th European MPI Users' Group Meeting, pp. 1-10. DOI: 10.1145/3236367.3236378. Online publication date: 23-Sep-2018.
  • (2016) GPU-Aware Non-contiguous Data Movement In Open MPI. Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pp. 231-242. DOI: 10.1145/2907294.2907317. Online publication date: 31-May-2016.
  • (2016) Designing high performance communication runtime for GPU managed memory. Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, pp. 82-91. DOI: 10.1145/2884045.2884050. Online publication date: 12-Mar-2016.
  • (2016) CUDA kernel based collective reduction operations on large-scale GPU clusters. Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pp. 726-735. DOI: 10.1109/CCGrid.2016.111. Online publication date: 16-May-2016.
  • (2015) GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks. Proceedings of the 22nd European MPI Users' Group Meeting, pp. 1-10. DOI: 10.1145/2802658.2802672. Online publication date: 21-Sep-2015.
  • (2015) High performance computing of fiber scattering simulation. Proceedings of the 8th Workshop on General Purpose Processing using GPUs, pp. 90-98. DOI: 10.1145/2716282.2716285. Online publication date: 7-Feb-2015.
  • (2014) Energy-efficient collective reduce and allreduce operations on distributed GPUs. Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pp. 483-492. DOI: 10.1109/CCGrid.2014.21. Online publication date: 26-May-2014.
