Distributed shared memory (DSM) machines can be characterized by four parameters, based on a slightly modified version of the LogP model. The l (latency) and o (occupancy of the communication controller) parameters are the keys to performance in these machines, and are largely determined by major architectural decisions about the aggressiveness and customization of the node and network. For recent and upcoming machines, the g (gap) parameter, which measures node-to-network bandwidth, does not appear to be a bottleneck. Conventional wisdom holds that latency is the dominant factor in determining the performance of a DSM machine. We show, however, that controller occupancy, which causes contention even in highly optimized applications, plays a major role, especially at low latencies. When latency hiding is used, occupancy becomes more critical, even in machines with high-latency networks. Scaling the problem size is often used as a technique to overcome limitations in communication latency and bandwidth. We show that in many structured computations occupancy-induced contention is not alleviated by increasing problem size, and that there are important classes of applications for which the performance lost by using higher-latency networks or higher-occupancy controllers cannot be regained easily, if at all, by scaling the problem size.
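To make the interplay between latency and occupancy concrete, the following is a minimal sketch, not the simulation methodology used in the paper. It assumes a single home controller that serves requests first-come, first-served, with every request issued at time zero from a different node. The function names and the cycle counts in the usage example are hypothetical and chosen only for illustration.

```python
def remote_miss_times(num_requests, latency, occupancy, gap=0):
    """Completion time (in cycles) of each request sent to one home controller.

    Assumed toy model: request i is injected at time i * gap, crosses the
    network in `latency` cycles, waits for the controller (FIFO), holds it
    for `occupancy` cycles, and the reply crosses the network in `latency`
    cycles.
    """
    controller_free_at = 0
    done = []
    for i in range(num_requests):
        arrival = i * gap + latency
        start = max(arrival, controller_free_at)  # queue behind earlier requests
        controller_free_at = start + occupancy
        done.append(controller_free_at + latency)
    return done


def contention_share(num_requests, latency, occupancy, gap=0):
    """Fraction of the mean miss time spent queueing at the controller."""
    times = remote_miss_times(num_requests, latency, occupancy, gap)
    mean_time = sum(times) / len(times)
    uncontended = 2 * latency + occupancy  # one request with an idle controller
    return (mean_time - uncontended) / mean_time


if __name__ == "__main__":
    # Hypothetical parameters: 4 simultaneous requests, occupancy of 20 cycles.
    for latency in (100, 10):
        share = contention_share(4, latency, 20)
        print(f"latency={latency:3d}  contention share={share:.2f}")
```

In this toy model the queueing delay depends only on occupancy and on the number of colliding requests, so its share of the total miss time grows as latency shrinks, which is consistent with the abstract's claim that occupancy matters most at low latencies.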