research-article

SIMD divergence optimization through intra-warp compaction

Authors:

Aniruddha S. Vaidya,

Anahita Shayesteh,

Mani Azimi

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture

Pages 368 - 379

https://doi.org/10.1145/2485922.2485954

Published: 23 June 2013

Abstract

SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications.

Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of this problem. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. In this paper, we will outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today's GPUs or by 18% for future GPUs with a better provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.
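
To make the idea of cycle compression concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a 16-lane SIMD instruction issued over a 4-lane-wide ALU, so an uncompressed instruction takes four cycles; these widths, the function names, and the example mask are all illustrative assumptions that do not come from the abstract. The BCC function skips any aligned group of lanes that is entirely turned off. The SCC function is an idealized bound that assumes active lanes can be swizzled freely into dense groups, which may be more permissive than the hardware the paper describes.

```python
from math import ceil

def baseline_cycles(mask, alu_width=4):
    """Cycles without compression: one per aligned lane group, regardless of the mask."""
    return ceil(len(mask) / alu_width)

def bcc_cycles(mask, alu_width=4):
    """Basic cycle compression (sketch): skip aligned groups with no active lane."""
    groups = [mask[i:i + alu_width] for i in range(0, len(mask), alu_width)]
    return sum(1 for group in groups if any(group))

def scc_cycles_ideal(mask, alu_width=4):
    """Swizzled cycle compression (idealized assumption): active lanes pack freely."""
    active = sum(1 for lane in mask if lane)
    return ceil(active / alu_width)

# Hypothetical execution mask: lanes 0 and 9 are on, all other lanes are off.
mask = [1, 0, 0, 0,  0, 0, 0, 0,  0, 1, 0, 0,  0, 0, 0, 0]
print(baseline_cycles(mask), bcc_cycles(mask), scc_cycles_ideal(mask))  # prints: 4 2 1
```

In this example, BCC removes the two all-off groups, while the idealized swizzle packs both active lanes into a single cycle. A fully active mask gains nothing from either scheme, which is consistent with the abstract's focus on divergent workloads.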

Index Terms

SIMD divergence optimization through intra-warp compaction
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors

Information & Contributors

Information

Published In

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture

June 2013

686 pages

ISBN:9781450320795

DOI:10.1145/2485922

General Chair:
Avi Mendelson
Technion

ACM SIGARCH Computer Architecture News Volume 41, Issue 3
ISCA '13
June 2013
666 pages
ISSN:0163-5964
DOI:10.1145/2508148

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IEEE CS

In-Cooperation

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISCA'13

ISCA'13: The 40th Annual International Symposium on Computer Architecture

June 23 - 27, 2013

Tel-Aviv, Israel

Acceptance Rates

ISCA '13 Paper Acceptance Rate 56 of 288 submissions, 19%;

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 32
Total Downloads: 1,083
Downloads (Last 12 months): 47
Downloads (Last 6 weeks): 9

Reflects downloads up to 13 Dec 2024

Citations

Cited By

Cuneo B, Bailey M (2023). Divergence Reduction in Monte Carlo Neutron Transport with On-GPU Asynchronous Scheduling. ACM Transactions on Modeling and Computer Simulation 34:1 (1-25). Online publication date: 19-Oct-2023.
https://dl.acm.org/doi/10.1145/3626957
Cebrian J, Balem T, Barredo A, Casas M, Moreto M, Ros A, Jimborean A (2022). Compiler-Assisted Compaction/Restoration of SIMD Instructions. IEEE Transactions on Parallel and Distributed Systems 33:4 (779-791). Online publication date: 1-Apr-2022.
https://doi.org/10.1109/TPDS.2021.3091015
Vilim M, Rucker A, Olukotun K (2021). Aurochs: An Architecture for Dataflow Threads. 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) (402-415). Online publication date: Jun-2021.
https://doi.org/10.1109/ISCA52012.2021.00039
Barredo A, Cebrian J, Moreto M, Casas M, Valero M (2020). Improving Predication Efficiency through Compaction/Restoration of SIMD Instructions. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (717-728). Online publication date: Feb-2020.
https://doi.org/10.1109/HPCA47549.2020.00064
Yang Y, Zhang S, Shen L (2019). A Lightweight Method for Handling Control Divergence in GPGPUs. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (120-127). Online publication date: 14-Jan-2019.
https://dl.acm.org/doi/10.1145/3293320.3293331
Sadrosadati M, Ehsani S, Falahati H, Ausavarungnirun R, Tavakkol A, Abaee M, Orosa L, Wang Y, Sarbazi-Azad H, Mutlu O (2019). ITAP. ACM Transactions on Architecture and Code Optimization 16:1 (1-26). Online publication date: 27-Feb-2019.
https://dl.acm.org/doi/10.1145/3291606
Aamodt T, Fung W, Rogers T (2018). General-Purpose Graphics Processor Architectures. Synthesis Lectures on Computer Architecture 13:2 (1-140). Online publication date: 21-May-2018.
https://doi.org/10.2200/S00848ED1V01Y201804CAC044
Sha L, Lucey P, Yue Y, Wei X, Hobbs J, Rohlf C, Sridharan S (2018). Interactive Sports Analytics. ACM Transactions on Computer-Human Interaction 25:2 (1-32). Online publication date: 11-Apr-2018.
https://dl.acm.org/doi/10.1145/3185596
Mbakoyiannis D, Tomoutzoglou O, Kornaros G (2018). Energy-Performance Considerations for Data Offloading to FPGA-Based Accelerators Over PCIe. ACM Transactions on Architecture and Code Optimization 15:1 (1-24). Online publication date: 22-Mar-2018.
https://dl.acm.org/doi/10.1145/3180263
Seffah A (2018). Citizen observatories. Interactions 25:2 (52-57). Online publication date: 23-Feb-2018.
https://dl.acm.org/doi/10.1145/3178558