Research article · Open access

HAWS: Accelerating GPU Wavefront Execution through Selective Out-of-order Execution

Published: 18 April 2019

Abstract

Graphics Processing Units (GPUs) have become an attractive platform for accelerating challenging applications on a range of platforms, from High Performance Computing (HPC) to full-featured smartphones. They can overcome computational barriers in a wide range of data-parallel kernels. GPUs hide pipeline stalls and memory latency by utilizing efficient thread preemption. But given the demands placed on the memory hierarchy by the growing number of on-chip compute cores, it has become increasingly difficult to hide all of these stalls.
In this article, we propose a novel Hint-Assisted Wavefront Scheduler (HAWS) to bypass long-latency stalls. HAWS starts by enhancing a compiler infrastructure to identify opportunities to bypass memory stalls. HAWS includes a wavefront scheduler that can continue to execute instructions speculatively in the shadow of a memory stall, guided by compiler-generated hints. HAWS increases utilization of GPU resources by aggressively fetching and executing speculatively. Based on our simulation results on the AMD Southern Islands GPU architecture, at an estimated cost of 0.4% total chip area, HAWS can improve application performance by 14.6% on average for memory intensive applications.
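The hint-guided bypass described above can be illustrated with a toy scheduling model. This is a sketch only, not the paper's Southern Islands implementation: the `Instr` type, the `run` function, the single outstanding load, and the one-issue-per-cycle assumption are all simplifications introduced here. The idea it shows is the abstract's: when an in-order wavefront stalls on an instruction that depends on a pending load, a HAWS-style scheduler uses compiler hints to keep issuing later, independent instructions in the stall's shadow.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    op: str                        # 'load' or 'alu'
    depends_on_load: bool = False  # needs the pending load's result
    hint_safe: bool = False        # compiler hint: safe to issue past a stall

def run(stream, use_hints, load_latency=10):
    """Issue at most one instruction per cycle from a single wavefront.

    Baseline (use_hints=False): strictly in-order, so the wavefront idles
    whenever its oldest instruction waits on an outstanding load.
    HAWS-style (use_hints=True): while the oldest instruction is stalled,
    issue the first later instruction the compiler marked as safe.
    """
    waiting = list(stream)
    cycle = 0
    load_ready = 0                 # cycle when the outstanding load returns
    while waiting:
        cycle += 1
        head = waiting[0]
        if head.op == 'load':
            waiting.pop(0)
            load_ready = cycle + load_latency
        elif not head.depends_on_load or cycle >= load_ready:
            waiting.pop(0)         # no stall: issue in order
        elif use_hints:
            for i in range(1, len(waiting)):
                if waiting[i].hint_safe:
                    waiting.pop(i)  # execute in the shadow of the stall
                    break
        # otherwise the wavefront idles this cycle
    return cycle

stream = [Instr('load'),
          Instr('alu', depends_on_load=True),   # stalls until the load returns
          Instr('alu', hint_safe=True),
          Instr('alu', hint_safe=True)]
baseline = run(stream, use_hints=False)  # 13 cycles: shadow cycles wasted
haws = run(stream, use_hints=True)       # 11 cycles: safe work fills the shadow
```

Every hint-marked instruction issued during the load's shadow is a cycle the baseline scheduler would have spent idling, which is the source of the utilization gain the abstract reports.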




Published In

ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 2
June 2019, 317 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3325131

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 April 2019
Accepted: 01 February 2019
Revised: 01 February 2019
Received: 01 July 2018
Published in TACO Volume 16, Issue 2


Author Tags

  1. Computer architecture
  2. GPU architecture
  3. simulation
  4. wavefront scheduler

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2024) TURBULENCE: Complexity-Effective Out-of-Order Execution on GPU With Distance-Based ISA. IEEE Computer Architecture Letters 23(2), 175-178. DOI: 10.1109/LCA.2023.3289317
  • (2024) GhOST: A GPU Out-of-Order Scheduling Technique for Stall Reduction. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 1-16. DOI: 10.1109/ISCA59077.2024.00011
  • (2024) SIMIL: SIMple Issue Logic for GPUs. Microprocessors and Microsystems 111, 105105. DOI: 10.1016/j.micpro.2024.105105
  • (2023) Simple Out of Order Core for GPGPUs. Proceedings of the 15th Workshop on General Purpose Processing Using GPU, 21-26. DOI: 10.1145/3589236.3589244
  • (2022) Repurposing GPU Microarchitectures with Light-Weight Out-Of-Order Execution. IEEE Transactions on Parallel and Distributed Systems 33(2), 388-402. DOI: 10.1109/TPDS.2021.3093231
  • (2022) A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 863-874. DOI: 10.1109/IPDPS53621.2022.00089
  • (2021) MIPSGPU: Minimizing Pipeline Stalls for GPUs With Non-Blocking Execution. IEEE Transactions on Computers 70(11), 1804-1816. DOI: 10.1109/TC.2020.3026043
  • (2021) GNNMark: A Benchmark Suite to Characterize Graph Neural Network Training on GPUs. 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 13-23. DOI: 10.1109/ISPASS51385.2021.00013
  • (2019) LOOG: Improving GPU Efficiency With Light-Weight Out-Of-Order Execution. IEEE Computer Architecture Letters 18(2), 166-169. DOI: 10.1109/LCA.2019.2951161
  • (2019) A novel warp scheduling scheme considering long-latency operations for high-performance GPUs. The Journal of Supercomputing 76(4), 3043-3062. DOI: 10.1007/s11227-019-03091-2
