research-article

Open access

Spatiotemporal SIMT and Scalarization for Improving GPU Efficiency

Authors: Michael Andersch, Mauricio Alvarez-Mesa, Ben Juurlink

ACM Transactions on Architecture and Code Optimization (TACO), Volume 12, Issue 3

Article No.: 32, Pages 1 - 26

https://doi.org/10.1145/2811402

Published: 08 September 2015

Abstract

Temporal SIMT (TSIMT) has been suggested as an alternative to conventional (spatial) SIMT for improving GPU performance on branch-intensive code. Although TSIMT has been briefly mentioned before, it was not evaluated. We present a complete design and evaluation of TSIMT GPUs, along with the inclusion of scalarization and a combination of temporal and spatial SIMT, named Spatiotemporal SIMT (STSIMT). Simulations show that TSIMT alone results in a performance reduction, but a combination of scalarization and STSIMT yields a mean performance enhancement of 19.6% and improves the energy-delay product by 26.2% compared to SIMT.
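The contrast between spatial SIMT, temporal SIMT, and scalarization can be illustrated with a toy lane-occupancy count. The sketch below is not the authors' simulator or methodology. It assumes a warp of 32 threads, assumes that spatial SIMT occupies all lanes for every instruction regardless of how many threads are active, and assumes that temporal SIMT spends one lane-cycle per active thread (or a single lane-cycle for an instruction whose result is uniform across the warp when scalarization is enabled). The function names and the instruction trace are hypothetical.

```python
WARP_SIZE = 32  # threads per warp; an assumed, typical value


def spatial_lane_cycles(trace):
    """Spatial SIMT: every instruction occupies all lanes for one cycle,
    whether or not each lane's thread is active."""
    return WARP_SIZE * len(trace)


def temporal_lane_cycles(trace, scalarize=False):
    """Temporal SIMT: one lane steps through the warp's threads in time,
    so an instruction costs one lane-cycle per active thread. With
    scalarization, a warp-uniform instruction is executed only once."""
    total = 0
    for active_threads, is_uniform in trace:
        if scalarize and is_uniform:
            total += 1
        else:
            total += active_threads
    return total


# Hypothetical trace: (active threads, result uniform across the warp?)
trace = [
    (32, True),   # e.g. a value shared by all threads
    (32, False),  # fully converged per-thread work
    (8, False),   # one side of a divergent branch
    (24, False),  # the other side of the same branch
    (32, False),  # reconverged
]

print(spatial_lane_cycles(trace))                   # 160
print(temporal_lane_cycles(trace))                  # 128
print(temporal_lane_cycles(trace, scalarize=True))  # 97
```

This model counts lane occupancy only. It ignores instruction latency, memory behavior, and scheduling, so it cannot reproduce the abstract's finding that TSIMT alone reduces performance; it only shows where divergent and uniform instructions leave room for savings.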

Supplementary Material

TACO1203-32 (taco1203-32.pdf)

Slide deck associated with this paper



Cited By

A. Yazdanpanah, S. Sajadimanesh, and S. Safari. 2020. EREER. Microprocessors & Microsystems 77:C. Online publication date: 1-Sep-2020.
https://dl.acm.org/doi/10.1016/j.micpro.2020.103176

K. Chen and C. Chen. 2018. Enabling SIMT Execution Model on Homogeneous Multi-Core System. ACM Transactions on Architecture and Code Optimization 15:1 (1-26). Online publication date: 22-Mar-2018.
https://dl.acm.org/doi/10.1145/3177960

Index Terms

Spatiotemporal SIMT and Scalarization for Improving GPU Efficiency
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data

Information & Contributors

Information

Published In

ACM Transactions on Architecture and Code Optimization Volume 12, Issue 3

October 2015

168 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2818748

Editor:
Koen De Bosschere
Ghent University

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 September 2015

Accepted: 01 July 2015

Revised: 01 June 2015

Received: 01 April 2015

Published in TACO Volume 12, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

Seventh Framework Programme

Bibliometrics

Total Citations: 2
Total Downloads: 866
Downloads (Last 12 months): 182
Downloads (Last 6 weeks): 24

Reflects downloads up to 13 Dec 2024
