DOI: 10.1145/2485922.2485954
Research article

SIMD divergence optimization through intra-warp compaction

Published: 23 June 2013

Abstract

SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications.
Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of this problem. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. In this paper, we will outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today's GPUs or by 18% for future GPUs with a better provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.
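The cycle-compression idea in the abstract can be illustrated with a small cycle-count model. This is only a sketch under stated assumptions, not the paper's implementation: it assumes a SIMD instruction is issued over several cycles, one fixed-width group of lanes per cycle (the group width of 4 is an assumption), that BCC skips a cycle when every lane in an aligned group is turned off, and that SCC can ideally repack active lanes into as few groups as possible. The function names and the idealized SCC packing are hypothetical.

```python
from math import ceil


def baseline_cycles(mask, group=4):
    """Cycles with no compression: one cycle per lane group, active or not."""
    return ceil(len(mask) / group)


def bcc_cycles(mask, group=4):
    """Assumed BCC model: skip every aligned group whose lanes are all off."""
    groups = [mask[i:i + group] for i in range(0, len(mask), group)]
    return sum(1 for g in groups if any(g))


def scc_cycles(mask, group=4):
    """Assumed, idealized SCC model: active lanes are swizzled into dense groups.

    This is a best case; real hardware may restrict which lanes can be swapped.
    """
    return ceil(sum(mask) / group)


# 16-lane execution masks (1 = lane on, 0 = lane turned off by divergence).
sparse = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
striped = [1, 0] * 8

for name, mask in (("sparse", sparse), ("striped", striped)):
    print(name, baseline_cycles(mask), bcc_cycles(mask), scc_cycles(mask))
```

In this model the sparse mask benefits from both schemes, while the striped mask gains nothing from skipping aligned groups and only improves once lanes are repacked, which is the kind of case that would motivate a swizzled variant.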



Information

Published In

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
June 2013
686 pages
ISBN: 9781450320795
DOI: 10.1145/2485922
  • ACM SIGARCH Computer Architecture News, Volume 41, Issue 3
    ISCA '13
    June 2013
    666 pages
    ISSN: 0163-5964
    DOI: 10.1145/2508148
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. SIMD
  3. branch divergence

Qualifiers

  • Research-article

Conference

ISCA'13
Acceptance Rates

ISCA '13 Paper Acceptance Rate 56 of 288 submissions, 19%;
Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 47
  • Downloads (last 6 weeks): 9
Reflects downloads up to 13 Dec 2024

Cited By

  • (2023) Divergence Reduction in Monte Carlo Neutron Transport with On-GPU Asynchronous Scheduling. ACM Transactions on Modeling and Computer Simulation 34:1 (1-25). DOI: 10.1145/3626957. Online publication date: 19-Oct-2023
  • (2022) Compiler-Assisted Compaction/Restoration of SIMD Instructions. IEEE Transactions on Parallel and Distributed Systems 33:4 (779-791). DOI: 10.1109/TPDS.2021.3091015. Online publication date: 1-Apr-2022
  • (2021) Aurochs: An Architecture for Dataflow Threads. 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) (402-415). DOI: 10.1109/ISCA52012.2021.00039. Online publication date: Jun-2021
  • (2020) Improving Predication Efficiency through Compaction/Restoration of SIMD Instructions. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (717-728). DOI: 10.1109/HPCA47549.2020.00064. Online publication date: Feb-2020
  • (2019) A Lightweight Method for Handling Control Divergence in GPGPUs. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (120-127). DOI: 10.1145/3293320.3293331. Online publication date: 14-Jan-2019
  • (2019) ITAP. ACM Transactions on Architecture and Code Optimization 16:1 (1-26). DOI: 10.1145/3291606. Online publication date: 27-Feb-2019
  • (2018) General-Purpose Graphics Processor Architectures. Synthesis Lectures on Computer Architecture 13:2 (1-140). DOI: 10.2200/S00848ED1V01Y201804CAC044. Online publication date: 21-May-2018
  • (2018) Interactive Sports Analytics. ACM Transactions on Computer-Human Interaction 25:2 (1-32). DOI: 10.1145/3185596. Online publication date: 11-Apr-2018
  • (2018) Energy-Performance Considerations for Data Offloading to FPGA-Based Accelerators Over PCIe. ACM Transactions on Architecture and Code Optimization 15:1 (1-24). DOI: 10.1145/3180263. Online publication date: 22-Mar-2018
  • (2018) Citizen observatories. Interactions 25:2 (52-57). DOI: 10.1145/3178558. Online publication date: 23-Feb-2018

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media