
A superlinear speedup region for matrix multiplication

Published: 10 August 2014

Abstract

Modern processors are realized as multicore architectures with an increasing number of cores per processor. Multicore processors are often designed so that some level of the cache hierarchy is shared among cores: usually the last-level cache (e.g., L3) is shared among several or all cores, while each core possesses private lower-level caches (e.g., L1 and L2). Superlinear speedup is possible for a matrix multiplication algorithm executed on a shared memory multiprocessor because a superlinear region exists: a region of problem sizes in which the matrix storage requirements of the sequential execution incur more cache misses than those of the parallel execution. This paper shows theoretically and experimentally that such a region exists. We provide a theoretical proof of the existence of a superlinear speedup and determine the boundaries of the region where it can be achieved; the experiments confirm our theoretical results. These results will therefore have an impact on future software development and on the exploitation of parallel hardware based on a shared memory multiprocessor architecture. Copyright © 2013 John Wiley & Sons, Ltd.
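To make the mechanism concrete, the following is a minimal sketch (not the authors' benchmark code) of a parallel matrix multiplication in C with OpenMP. The size N is a hypothetical value: the idea is to choose it so that the three matrices overflow a single core's private caches in the sequential run, while each thread's partition fits its core's private caches in the parallel run. Comparing the sequential time T(1) with the parallel time T(p) gives the speedup S(p) = T(1)/T(p); in the superlinear region described above, S(p) > p.

/*
 * Minimal sketch, not the paper's benchmark: parallel matrix
 * multiplication with OpenMP. N is a hypothetical size to be tuned
 * to the target cache hierarchy so that the data set overflows one
 * core's private caches sequentially but each thread's partition
 * fits in private cache when run in parallel.
 */
#include <stdio.h>
#include <omp.h>

#define N 1024  /* hypothetical size; tune to the cache hierarchy */

static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    /* Initialize the operands with arbitrary data. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)(i + j);
            B[i][j] = (double)(i - j);
        }

    double t0 = omp_get_wtime();

    /* With schedule(static), each thread computes a contiguous block
     * of rows of C. Its share of A and C shrinks as threads are added,
     * so the cores' private caches together can hold a working set
     * that a single core's cache cannot; this is the mechanism behind
     * the superlinear region. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

    double t1 = omp_get_wtime();
    printf("threads=%d  time=%.3f s  C[0][0]=%.1f\n",
           omp_get_max_threads(), t1 - t0, C[0][0]);
    return 0;
}

Compiled with, for example, gcc -O2 -fopenmp mm.c and run once with OMP_NUM_THREADS=1 and once with OMP_NUM_THREADS=8, the two timings give T(1) and T(8). Whether S(8) exceeds 8 depends on the cache sizes and on N falling inside the region the paper characterizes.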



Published In

Concurrency and Computation: Practice & Experience, Volume 26, Issue 11 (August 2014), 100 pages
ISSN: 1532-0626
EISSN: 1532-0634

Publisher

John Wiley and Sons Ltd., United Kingdom

        Author Tags

        1. Amdahl's law
        2. Gustafson's law
        3. cache memory
        4. high performance computing
        5. shared memory multiprocessor

