[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2145816.2145824acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
research-article

Efficient SIMD code generation for irregular kernels

Published: 25 February 2012 Publication History

Abstract

Array indirection causes several challenges for compilers to utilize single instruction, multiple data (SIMD) instructions. Disjoint memory references, arbitrarily misaligned memory references, and dependence cycles in loops are main challenges to handle for SIMD compilers. Due to those challenges, existing SIMD compilers have excluded loops with array indirection from their candidate loops for SIMD vectorization. However, addressing those challenges is inevitable, since many important compute-intensive applications extensively use array indirection to reduce memory and computation requirements. In this work, we propose a method to generate efficient SIMD code for loops containing indirected memory references. We extract both inter- and intra-iteration parallelism, taking data reorganization overhead into consideration. We also optimally place data reorganization code in order to amortize the reorganization overhead through the performance gain of SIMD vectorization. Experiments on four array indirection kernels, which are extracted from real-world scientific applications, show that our proposed method effectively generates SIMD code for irregular kernels with array indirection. Compared to the existing SIMD vectorization methods, our proposed method significantly improves the performance of irregular kernels by 91%, on average.

References

[1]
R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann, 2002.
[2]
R. Barik, J. Zhao, and V. Sarkar. Efficient selection of vector instructions using dynamic programming. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '43, pages 201--212, 2010.
[3]
H. Chang and W. Sung. Efficient vectorization of SIMD programs with non-aligned and irregular data access hardware. In Proceedings of the 2008 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES '08, pages 167--176, 2008.
[4]
R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. J. Parallel Distrib. Comput., 22: 462--478, Sep. 1994.
[5]
K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales. AltiVec extension to PowerPC accelerates media processing. IEEE Micro, 20: 85--95, Mar./Apr. 2000.
[6]
A. E. Eichenberger, P. Wu, and K. O'Brien. Vectorization for SIMD architectures with alignment constraints. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, PLDI '04, pages 82--93, 2004.
[7]
T. Grosser, H. Zheng, R. A, A. Simburger, A. Grosslinger, and L.-N. Pouchet. Polly - polyhedral optimization in llvm. In First International Workshop on Polyhedral Compilation Techniques (IMPACT'11), 2011.
[8]
M. Gschwind, H. P. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki. Synergistic processing in Cell's multicore architecture. IEEE Micro, 26: 10--24, Mar. 2006.
[9]
J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34: 1--17, Sep. 2006.
[10]
A. Krall and S. Lelait. Compilation techniques for multimedia processors. Int. J. Parallel Program., 28: 347--361, Aug. 2000.
[11]
S. Larsen and S. Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI '00, pages 145--156, 2000.
[12]
S. Larsen, R. Rabbah, and S. Amarasinghe. Exploiting vector parallelism in software pipelined loops. In Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, MICRO 38, pages 119--129, 2005.
[13]
C. Lattner. Macroscopic Data Structure Analysis and Optimization. PhD thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, May 2005. {online} http://llvm.cs.uiuc.edu.
[14]
C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, California, Mar 2004.
[15]
R. Leupers. Code selection for media processors with SIMD instructions. In Proceedings of the conference on Design, Automation and Test in Europe, DATE '00, pages 4--8, 2000.
[16]
D. Naishlos, M. Biberstein, S. Ben-David, and A. Zaks. Vectorizing for a SIMdD DSP architecture. In Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES '03, pages 2--11, 2003.
[17]
D. Nuzman, I. Rosen, and A. Zaks. Auto-vectorization of interleaved data for SIMD. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '06, pages 132--143, 2006.
[18]
I. Pryanishnikov, A. Krall, T. U. Wien, and N. Horspool. Pointer alignment analysis for processors with SIMD instructions. In Proceedings of the 5th Workshop on Media and Streaming Processors, pages 50--57, 2003.
[19]
G. Ren, P. Wu, and D. Padua. A preliminary study on the vectorization of multimedia applications for multimedia extensions. In Languages and Compilers for Parallel Computing, volume 2958 of Lecture Notes in Computer Science, pages 420--435. 2004.
[20]
G. Ren, P. Wu, and D. Padua. Optimizing data permutations for SIMD devices. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '06, pages 118--131, 2006.
[21]
I. Rosen, D. Nuzman, and A. Zaks. Loop-aware SLP in GCC. In Proceedings of GCC Developers' Summit, pages 131--142, 2007.
[22]
J. Shalf, S. Dosanjh, and J. Morrison. Exascale computing technology challenges. In Proc. International Meeting on High Performance Computing for Computational Science, volume 6449 of Lecture Notes in Computer Science, pages 1--25, 2011.
[23]
N. Sreraman and R. Govindarajan. A vectorizing compiler for multimedia extensions. Int. J. Parallel Program., 28: 363--400, Aug. 2000.
[24]
R. Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1 (2): 146--160, 1972.
[25]
D. Walls. How to use the restrict qualifier in C. Sun Microsystems, Sun Developer Network (SDN), March 2006. {online} http://developers.sun.com/.
[26]
P. Wu, A. E. Eichenberger, A. Wang, and P. Zhao. An integrated simdization framework using virtual vectors. In Proceedings of the 19th annual International Conference on Supercomputing, ICS '05, pages 169--178, 2005.

Cited By

View all
  • (2022)Combining Run-Time Checks and Compile-Time Analysis to Improve Control Flow Auto-VectorizationProceedings of the International Conference on Parallel Architectures and Compilation Techniques10.1145/3559009.3569663(439-450)Online publication date: 8-Oct-2022
  • (2020)Combining SIMD and Many/Multi-core Parallelism for Finite-state Machines with Enumerative SpeculationACM Transactions on Parallel Computing10.1145/33997147:3(1-26)Online publication date: 21-Jun-2020
  • (2019)Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary TranslationACM Transactions on Architecture and Code Optimization10.1145/330148816:1(1-24)Online publication date: 13-Feb-2019
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
February 2012
352 pages
ISBN:9781450311601
DOI:10.1145/2145816
  • cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 47, Issue 8
    PPOPP '12
    August 2012
    334 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2370036
    Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 February 2012

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. DFG-based vectorization
  2. SIMD processors
  3. irregular kernels

Qualifiers

  • Research-article

Conference

PPoPP '12
Sponsor:

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)30
  • Downloads (Last 6 weeks)1
Reflects downloads up to 19 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2022)Combining Run-Time Checks and Compile-Time Analysis to Improve Control Flow Auto-VectorizationProceedings of the International Conference on Parallel Architectures and Compilation Techniques10.1145/3559009.3569663(439-450)Online publication date: 8-Oct-2022
  • (2020)Combining SIMD and Many/Multi-core Parallelism for Finite-state Machines with Enumerative SpeculationACM Transactions on Parallel Computing10.1145/33997147:3(1-26)Online publication date: 21-Jun-2020
  • (2019)Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary TranslationACM Transactions on Architecture and Code Optimization10.1145/330148816:1(1-24)Online publication date: 13-Feb-2019
  • (2018)Control Flow Vectorization for ARM NEONProceedings of the 21st International Workshop on Software and Compilers for Embedded Systems10.1145/3207719.3207721(66-75)Online publication date: 28-May-2018
  • (2017)Insufficient Vectorization: A New Method to Exploit Superword Level ParallelismIEICE Transactions on Information and Systems10.1587/transinf.2016EDP7236E100.D:1(91-106)Online publication date: 2017
  • (2017)POSTERACM SIGPLAN Notices10.1145/3155284.301903752:8(439-440)Online publication date: 26-Jan-2017
  • (2017)S-CaffeACM SIGPLAN Notices10.1145/3155284.301876952:8(193-205)Online publication date: 26-Jan-2017
  • (2017)Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel ProgramsACM SIGPLAN Notices10.1145/3155284.301876352:8(117-130)Online publication date: 26-Jan-2017
  • (2017)Contention in Structured ConcurrencyACM SIGPLAN Notices10.1145/3155284.301876252:8(75-88)Online publication date: 26-Jan-2017
  • (2017)Combining SIMD and Many/Multi-core Parallelism for Finite State Machines with Enumerative SpeculationACM SIGPLAN Notices10.1145/3155284.301876052:8(179-191)Online publication date: 26-Jan-2017
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media