More Web Proxy on the site http://driver.im/

Article

Language support for Morton-order matrices

Authors:

Jeremy D. Frens,

Gregory A. AlexanderAuthors Info & Claims

PPoPP '01: Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming

Pages 24 - 33

https://doi.org/10.1145/379539.379559

Published: 18 June 2001 Publication History

Abstract

The uniform representation of 2-dimensional arrays serially in Morton order (or {\eee} order) supports both their iterative scan with cartesian indices and their divide-and-conquer manipulation as quaternary trees. This data structure is important because it relaxes serious problems of locality and latency, and the tree helps to schedule multi-processing. Results here show how it facilitates algorithms that avoid cache misses and page faults at all levels in hierarchical memory, independently of a specific runtime environment.

We have built a rudimentary C-to-C translator that implements matrices in Morton-order from source that presumes a row-major implementation. Early performance from LAPACK's reference implementation of \texttt{dgesv} (linear solver), and all its supporting routines (including \texttt{dgemm} matrix-multiplication) form a successful research demonstration. Its performance predicts improvements from new algebra in back-end optimizers.

We also present results from a more stylish \texttt{dgemm} algorithm that takes better advantage of this representation. With only routine back-end optimizations inserted by hand (unfolding the base case and passing arguments in registers), we achieve machine performance exceeding that of the manufacturer-crafted {\tt dgemm} running at 67% of peak flops. And the same code performs similarly on several machines.

Together, these results show how existing codes and future block-recursive algorithms can work well together on this matrix representation. Locality is key to future performance, and the new representation has a remarkable impact.

References

[1]

N. Ahmed and K. Pingali. Automatic generation of block-recursive codes. In A. Bode, T. Ludwig, W. Karl, and R. Wism. uller, editors, EURO-PAR 2000 Parallel Processing, volume 1900 of Lecture Notes in Computer Science, pages 368-378, Heidelberg, 2000. Springer. http://link.springer.de/link/service/series/0558/bibs/1900/19000368.htm]]

[2]

F. E. Allen, J. Cocke, and K. Kennedy. Reduction of operator strength. In S. W. Muchnick and N. D. Jones, editors, Program Flow Analysis: Theory and Applications, chapter 3.2, pages 79-101. Prentice-Hall, Englewood Cliffs, NJ, 1981.]]

[3]

J. Backus. The history of F ortran i, ii, and iii. In R. L. Wexelblat, editor, History of Programming Languages, pages 25-45. Academic Press, New York, 1981. Also preprinted in SIGPLAN Not., 13(8):166-180, Aug. 1978.]]

Digital Library

[4]

J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a protable, high-performance, ansi c coding methodology. InProc. 1977 Intl. Conf. on Supercomputing, pages 340-347, New York, July 1997. ACM Press. http://www.acm.org/pubs/citations/proceedings/supercomputing/263580/p340-bilmes/]]

Digital Library

[5]

R. M. Burstall and J. Darlington. A transformation system for developing recursive programs. J.ACM, 24(1):44-67, Jan. 1977. http://www.acm.org/pubs/citations/journals/jacm/1977-24-1/p44-burstall/]]

Digital Library

[6]

F. W. Burton, V. J. Kollias, and J. G. Kollias. Real-time raster-to-quadtree and quadtree-to-raster conversion algorithms with modest storage requirements. Angew. Informatik, 4:170-174, 1986.]]

[7]

S. Chatterjee, V. V. Jain, A. R. Lebeck, S. Mundhra, and M. Thottethodi. Nonlinear array layouts for hierarchical memory systems. In Proc. 1999 Intl. Conf. on Supercomputing, pages 444-453, New York, June 1999. ACM Press. http://www.acm.org/pubs/citations/proceedings/supercomputing/305138/p444-chatterjee/]]

Digital Library

[8]

S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottenthodi. Recursive array layouts and fast parallel matrix multiplication. In Proc. 11th ACM Symp. Parallel Algorithms and Architectures, pages 222-231, New York, June 1999. ACM Press. http://www.acm.org/pubs/citations/proceedings/spaa/305619/p222-chatterjee/]]

Digital Library

[9]

H. G. Cragon. A historical note on binary tree. SIGARCH Comput. Archit. News, 18(4):3, Dec. 1990.]]

[10]

D. Culler, R. Karp, D. Patterson, A. Sahay, K.E.Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: a practical model of parallel computation. Commun. ACM, 39(11):78-85, Nov. 1996. http://www.acm.org/pubs/citations/journals/cacm/1996-39-11/p78-culler/]]

Digital Library

[11]

C. Ding and K. Kennedy. The memory bandwidth bottleneck and its amelioration by a compiler. In 14th International Parallel and Distributed Processing Symposium (IPDPS'00), Los Alamitos, CA, 2000. IEEE Computer Society Press. http://www.computer.org/proceedings/ipdps/0574/05740181abs.htm]]

Digital Library

[12]

J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of F ortran basic linear algebra subprograms. ACM Trans. Math. Softw., 14(1):1-17, Mar. 1988. http://www.acm.org/pubs/citations/journals/toms/1988-14-1/p1-dongarra/]]

Digital Library

[13]

E. Elmroth and F. Gustavson. Applying recursion to serial and parallel QR factorization leads to better performance. IBM J. Res. Develop., 44(4):605-624, July 2000. http://www.research.ibm.com/journal/rd/444/elmroth.html]]

Digital Library

[14]

P. C. Fischer and R. L. Probert. Storage reorganization techniques for matrix computation in a paging environment. Commun. ACM, 22(7):405-415, July 1979. http://www.acm.org/pubs/citations/journals/cacm/1979-22-7/p405-fischer/]]

Digital Library

[15]

C. W. Fraser and D. R. Hanson. A Retargetable C Compiler : Design and Implementation. Benjamin/Cummings, Redwood City, CA, 1995.]]

Digital Library

[16]

J. D. Frens and D. S. Wise. Auto-blocking matrix multiplication, or tracking BLAS3 performance from source code. Proc. 6th ACM SIGPLAN Symp. on Principles and Practice of Parallel Program., SIGPLAN Not., 32(7):206-216, July 1997. http://www.acm.org/pubs/citations/proceedings/ppopp/263764/p206-frens/]]

Digital Library

[17]

M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. 40th Ann. Symp. Foundations of Computer Science, session 6B. IEEE Computer Society, Los Alamitos, CA, 1999. http://www.computer.org/proceedings/focs/0409/04090285abs.htm]]

Digital Library

[18]

I. Gargantini. An effective way to represent quadtrees. Commun. ACM, 25(12):905-910, Dec. 1982. http://www.acm.org/pubs/citations/journals/cacm/1982-25-12/p905-gargantini/]]

Digital Library

[19]

P. W. Grant, J. A. Sharp, M. F. Webster, and X. Zhang. Experiences of parallising finite-element problemsin a functional style. Softw. Prac. Exper., 25(9):947-974, Sept. 1995.]]

Digital Library

[20]

F. G. Gustavson. Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM J. Res. Develop., 41(6):737-755, Nov. 1997. http://www.research.ibm.com/journal/rd/416/gustavson.html]]

Digital Library

[21]

E.-J. Im and K. Yelick. Optimizing sparse matrix vector multimplication on SMPs. In 9th SIAM Conf. on Parallel Processing for Scientific Computing, volume 98 of Proc. in Applied Mathematics, Mar. 1999. http://www.siam.org/catalog/mcc07/pr98.htm]]

[22]

D. E. Knuth. Fundamental Algorithms, volume 1 of The Art of Computer Programming. Addison-Wesley, Reading, MA, third edition, 1997.]]

Digital Library

[23]

I. Kodukula, N. Ahmed, and K. Pingali. Data-centric multi-level blocking. Proc. ACM SIGPLAN '97 Conf. on Program. Language Design and Implementation, SIGPLAN Not., 32(7):346-357, May 1997. http://www.acm.org/pubs/citations/proceedings/pldi/258915/p346-kodukula/]]

Digital Library

[24]

A. R. Lebeck, X. Fan, H. Zeng, and C. Ellis. Power-aware page allocation. Proc. 9th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, SIGPLAN Not., 35(11):105-116, Nov. 2000. http://www.acm.org/pubs/citations/proceedings/asplos/356988/p105-lebeck/]]

Digital Library

[25]

A. C. McKellar and E. G. Coffman, Jr. Organizing matrices and matrix operations for paged-memory systems. Commun. ACM, 12(3):153-165, Mar. 1969.]]

Digital Library

[26]

G. M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, Mar. 1, 1966.]]

[27]

G. Newman. Organizing arrays for paged memory systems. Commun. ACM, 38(7):93-103 + 108-110, July 1995. http://www.acm.org/pubs/citations/journals/cacm/1995-38-7/p93-newman/]]

Digital Library

[28]

J. A. Orenstein and T. H. Merrett. A class of data structures for associative searching. In Proc. 3rd ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, pages 181-190, New York, 1984. ACM Press.]]

Digital Library

[29]

G. Peano. Sur une courbe, qui remplit toute une aire plaine. Math. Ann., 36:157-160, 1890.]]

[30]

H. Samet. The Design and Analysis of Spatial Data Structures, section 2.7. Addison-Wesley, Reading, MA, 1990.]]

Digital Library

[31]

G. Schrack. Finding neighbors of equal size in linear quadtrees and octrees in constant time. CVGIP: Image Underst., 55(3):221-230, May 1992.]]

Digital Library

[32]

K. D. Tocher. The application of automatic computers to sampling experiments. J. Roy. Statist. Soc. Ser. B, 16(1):39-61, 1954. See pp. 53-55.]]

[33]

M. S. Warren. and J. K. Salmon. A parallel hashed oct-tree N-body problem. In Proc. Supercomputing '93, pages 12-21, Los Alamitos, CA, 1993. IEEE Computer Society Press.]]

[34]

R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Proc. Supercomputing '98, Los Alamitos, CA, 1998. IEEE Computer Society. http://www.supercomp.org/sc98/TechPapers/sc98 FullAbstracts/Whaley814/INDEX.HTM]]

Digital Library

[35]

D. S. Wise. Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free. In A. Bode, T. Ludwig, W. Karl, and R. Wism. uller, editors, Euro-Par 2000 - Parallel Processing, volume 1900 of Lecture Notes in Computer Science, pages 774-883, Heidelberg, 2000. Springer. http://link.springer.de/link/service/series/0558/bibs/1900/19000774.htm]]

[36]

D. S. Wise and J. D. Frens. Morton-order matrices deserve compilers' support. Technical Report 533, Computer Science Dept., Indiana University, Nov. 1999. http://www.cs.indiana.edu/ftp/techreports/TR533.html]]

[37]

Q.Yi, V. Adve, and K. Kennedy. Transforming loops to recursion for multi-level memory hierarchies. Proc. ACM SIGPLAN '00 Conf. on Program. Language Design and Implementation, SIGPLAN Not., 35(5):169-181, May 2000. http://www.acm.org/pubs/citations/proceedings/pldi/301618a/p169-yi/]]

Digital Library

Cited By

Wang BFukazawa YKondo TSasaki T(2017)Tile/line access cache memory based on a multi‐level Z‐order tiling data layoutConcurrency and Computation: Practice and Experience10.1002/cpe.437530:9Online publication date: 17-Dec-2017
https://doi.org/10.1002/cpe.4375
Vuduc RDemmel JBilmes J(2016)Statistical Models for Empirical Search-Based Performance TuningThe International Journal of High Performance Computing Applications10.1177/109434200404129318:1(65-94)Online publication date: 26-Jul-2016
https://doi.org/10.1177/1094342004041293
Wang BFukazawa YKondo TSasaki T(2016)A cache memory with unit tile and line accessibility2016 International Conference on High Performance Computing & Simulation (HPCS)10.1109/HPCSim.2016.7568425(866-874)Online publication date: Jul-2016
https://doi.org/10.1109/HPCSim.2016.7568425
Show More Cited By

Index Terms

Recommendations

Language support for Morton-order matrices

The uniform representation of 2-dimensional arrays serially in Morton order (or {\eee} order) supports both their iterative scan with cartesian indices and their divide-and-conquer manipulation as quaternary trees. This data structure is important ...
Analyzing block locality in Morton-order and Morton-hybrid matrices
MEDEA '06: Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures

As the architectures of computers change, introducing more caches onto multicore chips, even more locality becomes necessary. With the bandwidth between caches and RAM now even more valuable, additional locality from new matrix representations will be ...
The Opie compiler from row-major source to Morton-ordered matrices
WMPI '04: Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture

The Opie Project aims to develop a compiler to transform C codes written for row-major matrix representation into equivalent codes for Morton-order matrix representation, and to apply its techniques to other languages. Accepting a possible reduction in ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

PPoPP '01: Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming

June 2001

142 pages

ISBN:1581133464

DOI:10.1145/379539

Chairmen:
Michael Heath
Univ. of Illinois, Illinois, IN
,
Andrew Lumsdaine
Indiana Univ.

ACM SIGPLAN Notices Volume 36, Issue 7
July 2001
143 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/568014
Issue’s Table of Contents

Copyright © 2001 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2001

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Conference

PPoPP01

Sponsor:

SIGPLAN

PPoPP01: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Utah, Snowbird, USA

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

29
Total Citations
View Citations
844
Total Downloads

Downloads (Last 12 months)12
Downloads (Last 6 weeks)0

Reflects downloads up to 22 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Wang BFukazawa YKondo TSasaki T(2017)Tile/line access cache memory based on a multi‐level Z‐order tiling data layoutConcurrency and Computation: Practice and Experience10.1002/cpe.437530:9Online publication date: 17-Dec-2017
https://doi.org/10.1002/cpe.4375
Vuduc RDemmel JBilmes J(2016)Statistical Models for Empirical Search-Based Performance TuningThe International Journal of High Performance Computing Applications10.1177/109434200404129318:1(65-94)Online publication date: 26-Jul-2016
https://doi.org/10.1177/1094342004041293
Wang BFukazawa YKondo TSasaki T(2016)A cache memory with unit tile and line accessibility2016 International Conference on High Performance Computing & Simulation (HPCS)10.1109/HPCSim.2016.7568425(866-874)Online publication date: Jul-2016
https://doi.org/10.1109/HPCSim.2016.7568425
Chen YLiu YMacii E(2013)Dual-addressing memory architecture for two-dimensional memory access patternsProceedings of the Conference on Design, Automation and Test in Europe10.5555/2485288.2485308(71-76)Online publication date: 18-Mar-2013
https://dl.acm.org/doi/10.5555/2485288.2485308
Quintana-Ortí GIgual FQuintana-Ortí Evan de Geijn R(2009)Solving dense linear systems on platforms with multiple hardware acceleratorsACM SIGPLAN Notices10.1145/1594835.150419644:4(121-130)Online publication date: 14-Feb-2009
https://dl.acm.org/doi/10.1145/1594835.1504196
Quintana-Ortí GQuintana-Ortí EGeijn RZee FChan E(2009)Programming matrix algorithms-by-blocks for thread-level parallelismACM Transactions on Mathematical Software10.1145/1527286.152728836:3(1-26)Online publication date: 23-Jul-2009
https://dl.acm.org/doi/10.1145/1527286.1527288
Quintana-Ortí GIgual FQuintana-Ortí Evan de Geijn RReed DSarkar V(2009)Solving dense linear systems on platforms with multiple hardware acceleratorsProceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming10.1145/1504176.1504196(121-130)Online publication date: 14-Feb-2009
https://dl.acm.org/doi/10.1145/1504176.1504196
Brun AMartin-Fernandez MAcar BMunoz-Moreno ECammoun LSigfridsson ASosa-Cabrera DSvensson BHerberthson MKnutsson H(2009)Similar Tensor Arrays – A Framework for Storage of Tensor Array DataTensors in Image Processing and Computer Vision10.1007/978-1-84882-299-3_19(407-428)Online publication date: 2009
https://doi.org/10.1007/978-1-84882-299-3_19
Lim WThottethodi M(2007)Evaluating ISA support and hardware support for recursive data layoutsProceedings of the 14th international conference on High performance computing10.5555/1782174.1782191(95-106)Online publication date: 18-Dec-2007
https://dl.acm.org/doi/10.5555/1782174.1782191
Lim WThottethodi M(2007)Evaluating ISA Support and Hardware Support for Recursive Data LayoutsHigh Performance Computing – HiPC 200710.1007/978-3-540-77220-0_13(95-106)Online publication date: 2007
https://doi.org/10.1007/978-3-540-77220-0_13
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents