[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2555243.2555253acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
research-article

A decomposition for in-place matrix transposition

Published: 06 February 2014 Publication History

Abstract

We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors. Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize, and on matrices of dimension m by n require O(mn log mn) work when limited to less than O(mn) auxiliary space. Our decomposition allows the rows and columns to be operated on independently during in-place transposition, reducing work complexity to O(mn), given O(max(m, n)) auxiliary space. This decomposition leads to an efficient and naturally parallel algorithm: we have measured median throughput of 19.5 GB/s on an NVIDIA Tesla K20c processor. An implementation specialized for the skinny matrices that arise when converting Arrays of Structures to Structures of Arrays yields median throughput of 34.3 GB/s, and a maximum throughput of 51 GB/s.
Because of the simple structure of this algorithm, it is particularly suited for implementation using SIMD instructions to transpose the small arrays that arise when SIMD processors load from or store to Arrays of Structures. Using this algorithm to cooperatively perform accesses to Arrays of Structures, we measure 180 GB/s throughput on the K20c, which is up to 45 times faster than compiler-generated Array of Structures accesses.
In this paper, we explain the algorithm, prove its correctness and complexity, and explain how it can be instantiated efficiently for solving various transpose problems on both CPUs and GPUs.

References

[1]
F. Gustavson, L. Karlsson, and B. Kågström. Parallel and cache-efficient in-place matrix storage format conversion. ACM Transactions on Mathematical Software, 38 (3): 1--32, Apr. 2012. 10.1145/2168773.2168775.
[2]
Intel. Intel MKL, 2013. URL http://software.intel.com/en-us/intel-mkl.
[3]
D. E. Knuth. phThe Art of Computer Programming, volume 3. Addison-Wesley, 1973. ISBN 0--201-03803-X.
[4]
T. Leighton. Tight bounds on the complexity of parallel sorting. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, STOC '84, pages 71--80, New York, NY, USA, 1984. ACM. 10.1145/800057.808667.
[5]
J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, pages 40--53, Mar.\slash Apr. 2008. 10.1145/1365490.1365500.
[6]
I.-J. Sung. Data layout transformation through in-place transposition. PhD thesis, University of Illinois, Department of Electrical and Computer Engineering, May 2013. URL http://hdl.handle.net/2142/44300.
[7]
I.-J. Sung, G. D. Liu, and W.-M. W. Hwu. DL: A data layout transformation system for heterogeneous computing. In Innovative Parallel Computing (InPar), May 2012. 10.1109/InPar.2012.6339606.
[8]
I.-J. Sung, J. Gómez-Luna, J. M. González-Linares, N. Guil, and W.-M. W. Hwu. In-place transposition of rectangular matrices on accelerators. In Principles and Practices of Parallel Programming (PPoPP), PPoPP '14, 2014. 10.1145/2555243.2555266.
[9]
A. A. Tretyakov and E. E. Tyrtyshnikov. Optimal in-place transposition of rectangular matrices. Journal of Complexity, 25 (4): 377--384, Aug. 2009. 10.1016/j.jco.2009.02.008.
[10]
H. S. Warren. Hacker's Delight. Addison-Wesley Professional, 2002. ISBN 978-0--201--91465--8.
[11]
P. F. Windley. Transposing matrices in a digital computer. The Computer Journal, 2 (1): 47--48, Jan. 1959. 10.1093/comjnl/2.1.47.

Cited By

View all
  • (2023)A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX CodeProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580253(110-121)Online publication date: 17-Feb-2023
  • (2022)Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory SystemIEEE Access10.1109/ACCESS.2022.317410110(52565-52608)Online publication date: 2022
  • (2020)Exploring the Design Space of Static and Incremental Graph Connectivity Algorithms on GPUsProceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques10.1145/3410463.3414657(55-69)Online publication date: 30-Sep-2020
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
February 2014
412 pages
ISBN:9781450326568
DOI:10.1145/2555243
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 February 2014

Permissions

Request permissions for this article.

Check for updates

Author Tag

  1. in-place transposition

Qualifiers

  • Research-article

Conference

PPoPP '14
Sponsor:

Acceptance Rates

PPoPP '14 Paper Acceptance Rate 28 of 184 submissions, 15%;
Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)93
  • Downloads (Last 6 weeks)18
Reflects downloads up to 11 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX CodeProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580253(110-121)Online publication date: 17-Feb-2023
  • (2022)Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory SystemIEEE Access10.1109/ACCESS.2022.317410110(52565-52608)Online publication date: 2022
  • (2020)Exploring the Design Space of Static and Incremental Graph Connectivity Algorithms on GPUsProceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques10.1145/3410463.3414657(55-69)Online publication date: 30-Sep-2020
  • (2020)Engineering Worst-Case Inputs for Pairwise Merge Sort on GPUs2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS47924.2020.00119(1133-1142)Online publication date: May-2020
  • (2019)Swizzle InventorProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304059(65-78)Online publication date: 4-Apr-2019
  • (2018)Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUsProceedings of the 2018 International Conference on Supercomputing10.1145/3205289.3205298(86-95)Online publication date: 12-Jun-2018
  • (2018)Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUsProceedings of the 2018 International Conference on Supercomputing10.1145/3205289.3205296(96-106)Online publication date: 12-Jun-2018
  • (2018)Investigating Data Layout Transformations in Chapel2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW.2018.00145(915-924)Online publication date: May-2018
  • (2018)Efficient Processing of Large Data Structures on GPUsInternational Journal of Parallel Programming10.1007/s10766-017-0515-046:6(1063-1093)Online publication date: 1-Dec-2018
  • (2016)Architecture-Adaptive Code Variant TuningACM SIGARCH Computer Architecture News10.1145/2980024.287241144:2(325-338)Online publication date: 25-Mar-2016
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media