Research article · Open access

HAWS: Accelerating GPU Wavefront Execution through Selective Out-of-order Execution

Published: 18 April 2019

Abstract

Graphics Processing Units (GPUs) have become an attractive platform for accelerating challenging applications on a range of platforms, from High Performance Computing (HPC) to full-featured smartphones. They can overcome computational barriers in a wide range of data-parallel kernels. GPUs hide pipeline stalls and memory latency by utilizing efficient thread preemption. But given the demands placed on the memory hierarchy by the growing number of on-chip compute cores, it has become increasingly difficult to hide all of these stalls.
In this article, we propose a novel Hint-Assisted Wavefront Scheduler (HAWS) to bypass long-latency stalls. HAWS starts by enhancing a compiler infrastructure to identify opportunities to bypass memory stalls. HAWS includes a wavefront scheduler that can continue to execute instructions speculatively in the shadow of a memory stall, guided by compiler-generated hints. HAWS increases utilization of GPU resources by aggressively fetching and executing speculatively. Based on our simulation results on the AMD Southern Islands GPU architecture, at an estimated cost of 0.4% total chip area, HAWS can improve application performance by 14.6% on average for memory intensive applications.
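The hint-guided bypass described above can be illustrated with a toy scheduling model. This is a sketch only, not the paper's Southern Islands implementation: the `Instr` type, the `run` function, the single outstanding load, and the one-issue-per-cycle assumption are all simplifications introduced here. The idea it shows is the abstract's: when an in-order wavefront stalls on an instruction that depends on a pending load, a HAWS-style scheduler uses compiler hints to keep issuing later, independent instructions in the stall's shadow.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    op: str                        # 'load' or 'alu'
    depends_on_load: bool = False  # needs the pending load's result
    hint_safe: bool = False        # compiler hint: safe to issue past a stall

def run(stream, use_hints, load_latency=10):
    """Issue at most one instruction per cycle from a single wavefront.

    Baseline (use_hints=False): strictly in-order, so the wavefront idles
    whenever its oldest instruction waits on an outstanding load.
    HAWS-style (use_hints=True): while the oldest instruction is stalled,
    issue the first later instruction the compiler marked as safe.
    """
    waiting = list(stream)
    cycle = 0
    load_ready = 0                 # cycle when the outstanding load returns
    while waiting:
        cycle += 1
        head = waiting[0]
        if head.op == 'load':
            waiting.pop(0)
            load_ready = cycle + load_latency
        elif not head.depends_on_load or cycle >= load_ready:
            waiting.pop(0)         # no stall: issue in order
        elif use_hints:
            for i in range(1, len(waiting)):
                if waiting[i].hint_safe:
                    waiting.pop(i)  # execute in the shadow of the stall
                    break
        # otherwise the wavefront idles this cycle
    return cycle

stream = [Instr('load'),
          Instr('alu', depends_on_load=True),   # stalls until the load returns
          Instr('alu', hint_safe=True),
          Instr('alu', hint_safe=True)]
baseline = run(stream, use_hints=False)  # 13 cycles: shadow cycles wasted
haws = run(stream, use_hints=True)       # 11 cycles: safe work fills the shadow
```

Every hint-marked instruction issued during the load's shadow is a cycle the baseline scheduler would have spent idling, which is the source of the utilization gain the abstract reports.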




Published In

ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 2
June 2019, 317 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3325131

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 April 2019
Accepted: 01 February 2019
Revised: 01 February 2019
Received: 01 July 2018
Published in TACO Volume 16, Issue 2


Author Tags

  1. Computer architecture
  2. GPU architecture
  3. simulation
  4. wavefront scheduler

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2024) TURBULENCE: Complexity-Effective Out-of-Order Execution on GPU With Distance-Based ISA. IEEE Computer Architecture Letters 23(2), 175-178. DOI: 10.1109/LCA.2023.3289317
  • (2024) GhOST: A GPU Out-of-Order Scheduling Technique for Stall Reduction. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 1-16. DOI: 10.1109/ISCA59077.2024.00011
  • (2024) SIMIL: SIMple Issue Logic for GPUs. Microprocessors and Microsystems 111, 105105. DOI: 10.1016/j.micpro.2024.105105
  • (2023) Simple Out of Order Core for GPGPUs. Proceedings of the 15th Workshop on General Purpose Processing Using GPU, 21-26. DOI: 10.1145/3589236.3589244
  • (2022) Repurposing GPU Microarchitectures with Light-Weight Out-Of-Order Execution. IEEE Transactions on Parallel and Distributed Systems 33(2), 388-402. DOI: 10.1109/TPDS.2021.3093231
  • (2022) A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 863-874. DOI: 10.1109/IPDPS53621.2022.00089
  • (2021) MIPSGPU: Minimizing Pipeline Stalls for GPUs With Non-Blocking Execution. IEEE Transactions on Computers 70(11), 1804-1816. DOI: 10.1109/TC.2020.3026043
  • (2021) GNNMark: A Benchmark Suite to Characterize Graph Neural Network Training on GPUs. 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 13-23. DOI: 10.1109/ISPASS51385.2021.00013
  • (2019) LOOG: Improving GPU Efficiency With Light-Weight Out-Of-Order Execution. IEEE Computer Architecture Letters 18(2), 166-169. DOI: 10.1109/LCA.2019.2951161
  • (2019) A novel warp scheduling scheme considering long-latency operations for high-performance GPUs. The Journal of Supercomputing 76(4), 3043-3062. DOI: 10.1007/s11227-019-03091-2
