Abstract
General purpose graphics processing units (GPGPUs) can improve computing performance considerably for regular applications. However, many applications involve irregular memory access, and the benefits of graphics processing units (GPUs) are less substantial for such irregular applications. In recent years, several studies have proposed solutions for removing static irregular memory access, but eliminating dynamic irregular memory access in software remains a serious challenge. A pure software solution, requiring neither hardware extensions nor offline profiling, is proposed to eliminate dynamic irregular memory access, especially indirect memory access. Data reordering and index redirection are used to reduce the number of memory transactions, thereby improving the performance of GPU kernels. To improve the efficiency of data reordering, the reordering operation is offloaded to the GPU, which reduces the overhead of data transfer. By executing the data reordering and data processing kernels concurrently in separate compute unified device architecture (CUDA) streams, the overhead of data reordering can be further hidden. With these optimizations, the volume of memory transactions is reduced by 16.7%–50% compared with CUSPARSE-based benchmarks, and the performance of irregular kernels is improved by 9.64%–34.9% on an NVIDIA Tesla P4 GPU.
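The core idea in the abstract, data reordering plus index redirection to turn an irregular gather into coalesced reads, can be illustrated with a minimal CUDA sketch. This is only an illustration under assumptions not stated in the abstract: a simple gather kernel y[i] = x[idx[i]] whose index pattern is reused across many launches, so the one-time irregular reorder is amortized; the kernel names (gather, reorder, gather_coalesced), the identity-mapped redirected index, and the two-stream overlap are hypothetical, not the authors' implementation.

```cuda
// Minimal sketch (not the authors' code): GPU-side data reordering and
// index redirection for an irregular gather, plus stream-based overlap.
#include <cuda_runtime.h>

// Original irregular kernel: threads in a warp read scattered addresses of x.
__global__ void gather(const float *x, const int *idx, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[idx[i]];            // non-coalesced when idx is irregular
}

// GPU-side data reordering: lay x out in the order it will be consumed.
// Index redirection then maps element i to position i, so in this simple case
// the redirected index array degenerates to the identity and can be dropped.
__global__ void reorder(const float *x, const int *idx, float *x_reordered, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x_reordered[i] = x[idx[i]];  // irregular read paid once
}

// Processing kernel after reordering: fully coalesced reads.
__global__ void gather_coalesced(const float *x_reordered, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x_reordered[i];
}

// Overlap sketch: reorder the next batch in one stream while the current,
// already reordered batch is processed in another stream.
void process_batches(const float *x, const int *idx,
                     float *x_buf_cur, float *x_buf_next, float *y, int n) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    int threads = 256, blocks = (n + threads - 1) / threads;

    reorder<<<blocks, threads, 0, s0>>>(x, idx, x_buf_next, n);      // prepare batch k+1
    gather_coalesced<<<blocks, threads, 0, s1>>>(x_buf_cur, y, n);   // process batch k

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```

The reported gains come from replacing gather-style access with gather_coalesced-style access; whether the reordering cost pays off depends on how often the same index pattern is reused, which the abstract does not quantify.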
Additional information
Project supported by the National Key Research and Development Program of China (No. 2018YFB1003500)
Contributors
Ran ZHENG and Yuan-dong LIU designed the research. Yuan-dong LIU processed the data. Ran ZHENG and Yuan-dong LIU drafted the manuscript. Hai JIN helped organize the manuscript. Ran ZHENG and Hai JIN revised and finalized the paper.
Compliance with ethics guidelines
Ran ZHENG, Yuan-dong LIU, and Hai JIN declare that they have no conflict of interest.
Ran ZHENG received her MS and PhD degrees from Huazhong University of Science and Technology (HUST), China in 2002 and 2006, respectively. She is currently an Associate Professor of Computer Science and Engineering at HUST. Her research interests include distributed computing, cloud computing, high-performance computing, and their applications.
Hai JIN is a Cheung Kung Scholars Chair Professor of Computer Science and Engineering at Huazhong University of Science and Technology (HUST), China. He received his PhD degree in Computer Engineering from HUST in 1994. In 1996, he was awarded a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz in Germany. He worked at the University of Hong Kong between 1998 and 2000, and was a visiting scholar at the University of Southern California between 1999 and 2000. He was supported by the National Science Fund for Distinguished Young Scholars in 2001. He is the Chief Scientist of ChinaGrid, the largest grid computing project in China, and the Chief Scientist of the National 973 Basic Research Program projects on Virtualization Technology of Computing System and on Cloud Security. He is an editorial board member of Frontiers of Information Technology & Electronic Engineering. His research interests include computer architecture, virtualization technology, cluster and cloud computing, peer-to-peer computing, network storage, and network security.
About this article
Cite this article
Zheng, R., Liu, Yd. & Jin, H. Optimizing non-coalesced memory access for irregular applications with GPU computing. Front Inform Technol Electron Eng 21, 1285–1301 (2020). https://doi.org/10.1631/FITEE.1900262
DOI: https://doi.org/10.1631/FITEE.1900262
Key words
- General purpose graphics processing units
- Memory coalescing
- Non-coalesced memory access
- Data reordering