
A superlinear speedup region for matrix multiplication

Published: 10 August 2014

Abstract

Modern processors are realized as multicore architectures with an increasing number of cores per processor. Multicore processors are often designed so that some level of the cache hierarchy is shared among cores: usually the last-level cache (e.g., L3) is shared among several or all cores, while each core possesses private lower-level caches (e.g., L1 and L2). Superlinear speedup is possible for a matrix multiplication algorithm executed on a shared memory multiprocessor because a superlinear region exists: a region of problem sizes in which the matrix storage requirements of the sequential execution incur more cache misses than those of the parallel execution. This paper shows theoretically and experimentally that such a region exists. We provide a theoretical proof of the existence of a superlinear speedup and determine the boundaries of the region where it can be achieved; the experiments confirm our theoretical results. These results will therefore have an impact on future software development and on the exploitation of parallel hardware based on a shared memory multiprocessor architecture. Copyright © 2013 John Wiley & Sons, Ltd.
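To make the mechanism concrete, the following is a minimal sketch (not the authors' benchmark code) of a parallel matrix multiplication in C with OpenMP. The size N is a hypothetical value: the idea is to choose it so that the three matrices overflow a single core's private caches in the sequential run, while each thread's partition fits its core's private caches in the parallel run. Comparing the sequential time T(1) with the parallel time T(p) gives the speedup S(p) = T(1)/T(p); in the superlinear region described above, S(p) > p.

/*
 * Minimal sketch, not the paper's benchmark: parallel matrix
 * multiplication with OpenMP. N is a hypothetical size to be tuned
 * to the target cache hierarchy so that the data set overflows one
 * core's private caches sequentially but each thread's partition
 * fits in private cache when run in parallel.
 */
#include <stdio.h>
#include <omp.h>

#define N 1024  /* hypothetical size; tune to the cache hierarchy */

static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    /* Initialize the operands with arbitrary data. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)(i + j);
            B[i][j] = (double)(i - j);
        }

    double t0 = omp_get_wtime();

    /* With schedule(static), each thread computes a contiguous block
     * of rows of C. Its share of A and C shrinks as threads are added,
     * so the cores' private caches together can hold a working set
     * that a single core's cache cannot; this is the mechanism behind
     * the superlinear region. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

    double t1 = omp_get_wtime();
    printf("threads=%d  time=%.3f s  C[0][0]=%.1f\n",
           omp_get_max_threads(), t1 - t0, C[0][0]);
    return 0;
}

Compiled with, for example, gcc -O2 -fopenmp mm.c and run once with OMP_NUM_THREADS=1 and once with OMP_NUM_THREADS=8, the two timings give T(1) and T(8). Whether S(8) exceeds 8 depends on the cache sizes and on N falling inside the region the paper characterizes.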



Published In

Concurrency and Computation: Practice & Experience, Volume 26, Issue 11 (August 2014), 100 pages
ISSN: 1532-0626
EISSN: 1532-0634

Publisher

John Wiley and Sons Ltd., United Kingdom

        Author Tags

        1. Amdahl's law
        2. Gustafson's law
        3. cache memory
        4. high performance computing
        5. shared memory multiprocessor

