Article

Exploiting Hyper-Loop Parallelism in Vectorization to Improve Memory Performance on CUDA GPGPU

Authors:

Shixiong Xu,

David GreggAuthors Info & Claims

TRUSTCOM-BIGDATASE-ISPA '15: Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA - Volume 03

Pages 53 - 60

https://doi.org/10.1109/Trustcom.2015.612

Published: 20 August 2015 Publication History

Abstract

Memory performance is of great importance to achieve high performance on the Nvidia CUDA GPU. Previous work has proposed specific optimizations such as thread coarsening, caching data in shared memory, and global data layout transformation. We argue that vectorization based on hyper loop parallelism can be used as a unified technique to optimize the memory performance. In this paper, we put forward a compiler framework based on the Cetus source-to-source compiler to improve the memory performance on the CUDA GPU by efficiently exploiting hyper loop parallelism in vectorization. We introduce abstractions of SIMD vectors and SIMD operations that match the execution model and memory model of the CUDA GPU, along with three different execution mapping strategies for efficiently offloading vectorized code to CUDA GPUs. In addition, as we employ the vectorization in C-to-CUDA with automatic parallelization, our technique further refines the mapping granularity between coarse-grain loop parallelism and GPU threads. We evaluated our proposed technique on two platforms, an embedded GPU system--Jetson TK1--and a desktop GPU--GeForce GTX 645. The experimental results demonstrate that our vectorization technique based on hyper loop parallelism can yield performance speedups up to 2.5 compared to the direct coarse-grain loop parallelism mapping.

Exploiting Hyper-Loop Parallelism in Vectorization to Improve Memory Performance on CUDA GPGPU
1. Computer systems organization

Recommendations

Exploiting Hyper-Loop Parallelism in Vectorization to Improve Memory Performance on CUDA GPGPU
TRUSTCOM-BIGDATASE-ISPA '15: Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA - Volume 03

Memory performance is of great importance to achieve high performance on the Nvidia CUDA GPU. Previous work has proposed specific optimizations such as thread coarsening, caching data in shared memory, and global data layout transformation. We argue ...
Importance of explicit vectorization for CPU and GPU software performance

Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are ...
Impact of Vectorization Over 16-bit Data-Types on GPUs
PARMA-DITAM '18: Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms

Since the introduction of Single Instruction Multiple Thread (SIMT) GPU architectures, vectorization has seldom been recommended. However, for efficient use of 8-bit and 16-bit data types, vector types are necessary even on these GPUs. When only integer ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

TRUSTCOM-BIGDATASE-ISPA '15: Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA - Volume 03

August 2015

644 pages

ISBN:9781467379526

Publisher

IEEE Computer Society

United States

Publication History

Published: 20 August 2015

Author Tags

Qualifiers

Article

Recommendations