[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/379539.379559acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
Article

Language support for Morton-order matrices

Published: 18 June 2001 Publication History

Abstract

The uniform representation of 2-dimensional arrays serially in Morton order (or {\eee} order) supports both their iterative scan with cartesian indices and their divide-and-conquer manipulation as quaternary trees. This data structure is important because it relaxes serious problems of locality and latency, and the tree helps to schedule multi-processing. Results here show how it facilitates algorithms that avoid cache misses and page faults at all levels in hierarchical memory, independently of a specific runtime environment.
We have built a rudimentary C-to-C translator that implements matrices in Morton-order from source that presumes a row-major implementation. Early performance from LAPACK's reference implementation of \texttt{dgesv} (linear solver), and all its supporting routines (including \texttt{dgemm} matrix-multiplication) form a successful research demonstration. Its performance predicts improvements from new algebra in back-end optimizers.
We also present results from a more stylish \texttt{dgemm} algorithm that takes better advantage of this representation. With only routine back-end optimizations inserted by hand (unfolding the base case and passing arguments in registers), we achieve machine performance exceeding that of the manufacturer-crafted {\tt dgemm} running at 67% of peak flops. And the same code performs similarly on several machines.
Together, these results show how existing codes and future block-recursive algorithms can work well together on this matrix representation. Locality is key to future performance, and the new representation has a remarkable impact.

References

[1]
N. Ahmed and K. Pingali. Automatic generation of block-recursive codes. In A. Bode, T. Ludwig, W. Karl, and R. Wism. uller, editors, EURO-PAR 2000 Parallel Processing, volume 1900 of Lecture Notes in Computer Science, pages 368-378, Heidelberg, 2000. Springer. http://link.springer.de/link/service/series/0558/bibs/1900/19000368.htm]]
[2]
F. E. Allen, J. Cocke, and K. Kennedy. Reduction of operator strength. In S. W. Muchnick and N. D. Jones, editors, Program Flow Analysis: Theory and Applications, chapter 3.2, pages 79-101. Prentice-Hall, Englewood Cliffs, NJ, 1981.]]
[3]
J. Backus. The history of F ortran i, ii, and iii. In R. L. Wexelblat, editor, History of Programming Languages, pages 25-45. Academic Press, New York, 1981. Also preprinted in SIGPLAN Not., 13(8):166-180, Aug. 1978.]]
[4]
J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a protable, high-performance, ansi c coding methodology. InProc. 1977 Intl. Conf. on Supercomputing, pages 340-347, New York, July 1997. ACM Press. http://www.acm.org/pubs/citations/proceedings/supercomputing/263580/p340-bilmes/]]
[5]
R. M. Burstall and J. Darlington. A transformation system for developing recursive programs. J.ACM, 24(1):44-67, Jan. 1977. http://www.acm.org/pubs/citations/journals/jacm/1977-24-1/p44-burstall/]]
[6]
F. W. Burton, V. J. Kollias, and J. G. Kollias. Real-time raster-to-quadtree and quadtree-to-raster conversion algorithms with modest storage requirements. Angew. Informatik, 4:170-174, 1986.]]
[7]
S. Chatterjee, V. V. Jain, A. R. Lebeck, S. Mundhra, and M. Thottethodi. Nonlinear array layouts for hierarchical memory systems. In Proc. 1999 Intl. Conf. on Supercomputing, pages 444-453, New York, June 1999. ACM Press. http://www.acm.org/pubs/citations/proceedings/supercomputing/305138/p444-chatterjee/]]
[8]
S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottenthodi. Recursive array layouts and fast parallel matrix multiplication. In Proc. 11th ACM Symp. Parallel Algorithms and Architectures, pages 222-231, New York, June 1999. ACM Press. http://www.acm.org/pubs/citations/proceedings/spaa/305619/p222-chatterjee/]]
[9]
H. G. Cragon. A historical note on binary tree. SIGARCH Comput. Archit. News, 18(4):3, Dec. 1990.]]
[10]
D. Culler, R. Karp, D. Patterson, A. Sahay, K.E.Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: a practical model of parallel computation. Commun. ACM, 39(11):78-85, Nov. 1996. http://www.acm.org/pubs/citations/journals/cacm/1996-39-11/p78-culler/]]
[11]
C. Ding and K. Kennedy. The memory bandwidth bottleneck and its amelioration by a compiler. In 14th International Parallel and Distributed Processing Symposium (IPDPS'00), Los Alamitos, CA, 2000. IEEE Computer Society Press. http://www.computer.org/proceedings/ipdps/0574/05740181abs.htm]]
[12]
J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of F ortran basic linear algebra subprograms. ACM Trans. Math. Softw., 14(1):1-17, Mar. 1988. http://www.acm.org/pubs/citations/journals/toms/1988-14-1/p1-dongarra/]]
[13]
E. Elmroth and F. Gustavson. Applying recursion to serial and parallel QR factorization leads to better performance. IBM J. Res. Develop., 44(4):605-624, July 2000. http://www.research.ibm.com/journal/rd/444/elmroth.html]]
[14]
P. C. Fischer and R. L. Probert. Storage reorganization techniques for matrix computation in a paging environment. Commun. ACM, 22(7):405-415, July 1979. http://www.acm.org/pubs/citations/journals/cacm/1979-22-7/p405-fischer/]]
[15]
C. W. Fraser and D. R. Hanson. A Retargetable C Compiler : Design and Implementation. Benjamin/Cummings, Redwood City, CA, 1995.]]
[16]
J. D. Frens and D. S. Wise. Auto-blocking matrix multiplication, or tracking BLAS3 performance from source code. Proc. 6th ACM SIGPLAN Symp. on Principles and Practice of Parallel Program., SIGPLAN Not., 32(7):206-216, July 1997. http://www.acm.org/pubs/citations/proceedings/ppopp/263764/p206-frens/]]
[17]
M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. 40th Ann. Symp. Foundations of Computer Science, session 6B. IEEE Computer Society, Los Alamitos, CA, 1999. http://www.computer.org/proceedings/focs/0409/04090285abs.htm]]
[18]
I. Gargantini. An effective way to represent quadtrees. Commun. ACM, 25(12):905-910, Dec. 1982. http://www.acm.org/pubs/citations/journals/cacm/1982-25-12/p905-gargantini/]]
[19]
P. W. Grant, J. A. Sharp, M. F. Webster, and X. Zhang. Experiences of parallising finite-element problemsin a functional style. Softw. Prac. Exper., 25(9):947-974, Sept. 1995.]]
[20]
F. G. Gustavson. Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM J. Res. Develop., 41(6):737-755, Nov. 1997. http://www.research.ibm.com/journal/rd/416/gustavson.html]]
[21]
E.-J. Im and K. Yelick. Optimizing sparse matrix vector multimplication on SMPs. In 9th SIAM Conf. on Parallel Processing for Scientific Computing, volume 98 of Proc. in Applied Mathematics, Mar. 1999. http://www.siam.org/catalog/mcc07/pr98.htm]]
[22]
D. E. Knuth. Fundamental Algorithms, volume 1 of The Art of Computer Programming. Addison-Wesley, Reading, MA, third edition, 1997.]]
[23]
I. Kodukula, N. Ahmed, and K. Pingali. Data-centric multi-level blocking. Proc. ACM SIGPLAN '97 Conf. on Program. Language Design and Implementation, SIGPLAN Not., 32(7):346-357, May 1997. http://www.acm.org/pubs/citations/proceedings/pldi/258915/p346-kodukula/]]
[24]
A. R. Lebeck, X. Fan, H. Zeng, and C. Ellis. Power-aware page allocation. Proc. 9th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, SIGPLAN Not., 35(11):105-116, Nov. 2000. http://www.acm.org/pubs/citations/proceedings/asplos/356988/p105-lebeck/]]
[25]
A. C. McKellar and E. G. Coffman, Jr. Organizing matrices and matrix operations for paged-memory systems. Commun. ACM, 12(3):153-165, Mar. 1969.]]
[26]
G. M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, Mar. 1, 1966.]]
[27]
G. Newman. Organizing arrays for paged memory systems. Commun. ACM, 38(7):93-103 + 108-110, July 1995. http://www.acm.org/pubs/citations/journals/cacm/1995-38-7/p93-newman/]]
[28]
J. A. Orenstein and T. H. Merrett. A class of data structures for associative searching. In Proc. 3rd ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, pages 181-190, New York, 1984. ACM Press.]]
[29]
G. Peano. Sur une courbe, qui remplit toute une aire plaine. Math. Ann., 36:157-160, 1890.]]
[30]
H. Samet. The Design and Analysis of Spatial Data Structures, section 2.7. Addison-Wesley, Reading, MA, 1990.]]
[31]
G. Schrack. Finding neighbors of equal size in linear quadtrees and octrees in constant time. CVGIP: Image Underst., 55(3):221-230, May 1992.]]
[32]
K. D. Tocher. The application of automatic computers to sampling experiments. J. Roy. Statist. Soc. Ser. B, 16(1):39-61, 1954. See pp. 53-55.]]
[33]
M. S. Warren. and J. K. Salmon. A parallel hashed oct-tree N-body problem. In Proc. Supercomputing '93, pages 12-21, Los Alamitos, CA, 1993. IEEE Computer Society Press.]]
[34]
R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Proc. Supercomputing '98, Los Alamitos, CA, 1998. IEEE Computer Society. http://www.supercomp.org/sc98/TechPapers/sc98 FullAbstracts/Whaley814/INDEX.HTM]]
[35]
D. S. Wise. Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free. In A. Bode, T. Ludwig, W. Karl, and R. Wism. uller, editors, Euro-Par 2000 - Parallel Processing, volume 1900 of Lecture Notes in Computer Science, pages 774-883, Heidelberg, 2000. Springer. http://link.springer.de/link/service/series/0558/bibs/1900/19000774.htm]]
[36]
D. S. Wise and J. D. Frens. Morton-order matrices deserve compilers' support. Technical Report 533, Computer Science Dept., Indiana University, Nov. 1999. http://www.cs.indiana.edu/ftp/techreports/TR533.html]]
[37]
Q.Yi, V. Adve, and K. Kennedy. Transforming loops to recursion for multi-level memory hierarchies. Proc. ACM SIGPLAN '00 Conf. on Program. Language Design and Implementation, SIGPLAN Not., 35(5):169-181, May 2000. http://www.acm.org/pubs/citations/proceedings/pldi/301618a/p169-yi/]]

Cited By

View all
  • (2017)Tile/line access cache memory based on a multi‐level Z‐order tiling data layoutConcurrency and Computation: Practice and Experience10.1002/cpe.437530:9Online publication date: 17-Dec-2017
  • (2016)Statistical Models for Empirical Search-Based Performance TuningThe International Journal of High Performance Computing Applications10.1177/109434200404129318:1(65-94)Online publication date: 26-Jul-2016
  • (2016)A cache memory with unit tile and line accessibility2016 International Conference on High Performance Computing & Simulation (HPCS)10.1109/HPCSim.2016.7568425(866-874)Online publication date: Jul-2016
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PPoPP '01: Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
June 2001
142 pages
ISBN:1581133464
DOI:10.1145/379539
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2001

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. paging
  2. quadtrees

Qualifiers

  • Article

Conference

PPoPP01
Sponsor:

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)12
  • Downloads (Last 6 weeks)0
Reflects downloads up to 22 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2017)Tile/line access cache memory based on a multi‐level Z‐order tiling data layoutConcurrency and Computation: Practice and Experience10.1002/cpe.437530:9Online publication date: 17-Dec-2017
  • (2016)Statistical Models for Empirical Search-Based Performance TuningThe International Journal of High Performance Computing Applications10.1177/109434200404129318:1(65-94)Online publication date: 26-Jul-2016
  • (2016)A cache memory with unit tile and line accessibility2016 International Conference on High Performance Computing & Simulation (HPCS)10.1109/HPCSim.2016.7568425(866-874)Online publication date: Jul-2016
  • (2013)Dual-addressing memory architecture for two-dimensional memory access patternsProceedings of the Conference on Design, Automation and Test in Europe10.5555/2485288.2485308(71-76)Online publication date: 18-Mar-2013
  • (2009)Solving dense linear systems on platforms with multiple hardware acceleratorsACM SIGPLAN Notices10.1145/1594835.150419644:4(121-130)Online publication date: 14-Feb-2009
  • (2009)Programming matrix algorithms-by-blocks for thread-level parallelismACM Transactions on Mathematical Software10.1145/1527286.152728836:3(1-26)Online publication date: 23-Jul-2009
  • (2009)Solving dense linear systems on platforms with multiple hardware acceleratorsProceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming10.1145/1504176.1504196(121-130)Online publication date: 14-Feb-2009
  • (2009)Similar Tensor Arrays – A Framework for Storage of Tensor Array DataTensors in Image Processing and Computer Vision10.1007/978-1-84882-299-3_19(407-428)Online publication date: 2009
  • (2007)Evaluating ISA support and hardware support for recursive data layoutsProceedings of the 14th international conference on High performance computing10.5555/1782174.1782191(95-106)Online publication date: 18-Dec-2007
  • (2007)Evaluating ISA Support and Hardware Support for Recursive Data LayoutsHigh Performance Computing – HiPC 200710.1007/978-3-540-77220-0_13(95-106)Online publication date: 2007
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media