A decomposition for in-place matrix transposition

Authors: Bryan Catanzaro, Alexander Keller, Michael Garland

PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 193 - 206

https://doi.org/10.1145/2555243.2555253

Published: 06 February 2014

Abstract

We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors. Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize, and on matrices of dimension m by n require O(mn log mn) work when limited to less than O(mn) auxiliary space. Our decomposition allows the rows and columns to be operated on independently during in-place transposition, reducing work complexity to O(mn), given O(max(m, n)) auxiliary space. This decomposition leads to an efficient and naturally parallel algorithm: we have measured median throughput of 19.5 GB/s on an NVIDIA Tesla K20c processor. An implementation specialized for the skinny matrices that arise when converting Arrays of Structures to Structures of Arrays yields median throughput of 34.3 GB/s, and a maximum throughput of 51 GB/s.
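To make the idea of operating on rows and columns independently concrete, here is a minimal Python sketch of one possible three-pass factorization: a column rotation, a row shuffle, and a column shuffle. The function name `transpose_in_place` and the index formulas are assumptions worked out for this illustration. They are not quoted from the paper, whose own index functions may differ. The sketch uses a scratch buffer of at most max(m, n) elements and does O(mn) work.

```python
from math import gcd


def transpose_in_place(a, m, n):
    """Transpose a row-major m x n matrix stored in the flat list `a`.

    Illustrative sketch: three passes (column rotate, row shuffle, column
    shuffle), each touching one row or one column at a time. Afterwards
    `a` holds the n x m transpose in row-major order.
    """
    if m == 0 or n == 0:
        return
    c = gcd(m, n)
    b = n // c  # number of distinct values taken by (j * m) % n

    # Pass 1: rotate column j upward by j // b rows. Only needed when
    # gcd(m, n) > 1, where destination columns in a row would collide.
    if c > 1:
        for j in range(n):
            col = [a[((i + j // b) % m) * n + j] for i in range(m)]
            for i in range(m):
                a[i * n + j] = col[i]

    # Pass 2: within each row, scatter every element to its final column.
    for i in range(m):
        row = [None] * n
        for j in range(n):
            d = ((i + j // b) % m + j * m) % n
            row[d] = a[i * n + j]
        a[i * n:(i + 1) * n] = row

    # Pass 3: within each column, gather every element into its final row.
    for j in range(n):
        col = [None] * m
        for i in range(m):
            l = i * n + j            # final flat index of this slot
            i0, j0 = l % m, l // m   # original row and column of its element
            col[i] = a[((i0 - j0 // b) % m) * n + j]
        for i in range(m):
            a[i * n + j] = col[i]
```

Every iteration of the outer loop in each pass touches a single row or a single column, so the iterations within a pass are independent of one another. That independence is what would let a parallel implementation assign rows or columns to separate threads.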

Because of the simple structure of this algorithm, it is particularly suited for implementation using SIMD instructions to transpose the small arrays that arise when SIMD processors load from or store to Arrays of Structures. Using this algorithm to cooperatively perform accesses to Arrays of Structures, we measure 180 GB/s throughput on the K20c, which is up to 45 times faster than compiler-generated Array of Structures accesses.
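The following Python sketch simulates why a small transpose arises when SIMD lanes access an Array of Structures. The function names, the lane count, and the field count are hypothetical. The code is a sequential simulation rather than SIMD code, and it transposes out of place for clarity, whereas the abstract describes performing the transpose with SIMD instructions.

```python
def direct_aos_load(memory, lanes, fields):
    """Each lane reads its own structure: strided, non-contiguous accesses."""
    return [[memory[lane * fields + f] for f in range(fields)]
            for lane in range(lanes)]


def cooperative_aos_load(memory, lanes, fields):
    """All lanes read contiguous words together, then transpose the result."""
    # Step 1: `fields` contiguous loads. In load t, lane l reads word
    # t * lanes + l, so neighbouring lanes touch neighbouring addresses.
    regs = [[memory[t * lanes + lane] for lane in range(lanes)]
            for t in range(fields)]
    # Step 2: read in order, the loaded words form a lanes x fields
    # row-major matrix (one structure per row). Transposing it yields a
    # fields x lanes matrix in which column l holds structure l.
    flat = [word for row in regs for word in row]
    soa = [[flat[lane * fields + f] for lane in range(lanes)]
           for f in range(fields)]
    # Report the result per lane so it can be compared with the direct load.
    return [[soa[f][lane] for f in range(fields)] for lane in range(lanes)]
```

Both functions hand each lane the same structure. The cooperative version simply trades strided memory accesses for contiguous ones plus a small transpose of a lanes-by-fields array.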

In this paper, we explain the algorithm, prove its correctness and complexity, and explain how it can be instantiated efficiently for solving various transpose problems on both CPUs and GPUs.

Index Terms

A decomposition for in-place matrix transposition
1. Information systems
  1. Information storage systems
    1. Record storage systems
2. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices

Published In

PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

February 2014

412 pages

ISBN:9781450326568

DOI:10.1145/2555243

General Chair: José Moreira, IBM Research, USA
Program Chair: James Larus, EPFL, Switzerland

ACM SIGPLAN Notices Volume 49, Issue 8
PPoPP '14
August 2014
390 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2692916
Editors:
Mark W. Bailey, Hamilton College, Clinton, NY
Rajeev Balasubramonian, University of Utah
Al Davis, University of Utah
Sarita Adve, University of Illinois at Urbana-Champ

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tag

in-place transposition

Qualifiers

Research-article

Conference

PPoPP '14

Sponsor:

SIGPLAN

PPoPP '14: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 15 - 19, 2014

Orlando, Florida, USA

Acceptance Rates

PPoPP '14 Paper Acceptance Rate 28 of 184 submissions, 15%;

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Bibliometrics

Total Citations: 36
Total Downloads: 802
Downloads (Last 12 months): 93
Downloads (Last 6 weeks): 18

Reflects downloads up to 11 Dec 2024
