Simulation plays an important role in analyzing the performance of multiprocessor memory systems, but detailed simulation can consume enormous amounts of CPU time. This dissertation investigates three techniques that can reduce the time required to simulate a multiprocessor memory system: trace-driven simulation, simplifying approximations in memory-system simulators, and parallel simulation.
Trace-driven simulation allows simulation of multiple target systems from a single interpretation of a workload, recorded in a trace. It is often assumed that the trace is system-independent, but this assumption is violated by parallel workloads that have race-conditions or utilize dynamic scheduling. This dissertation measures and analyzes the impact of system-dependent traces on the accuracy of trace-driven simulations of multiprocessors.
Two such simplifying approximations for simulations of multiprocessor memory systems were investigated. Precise simulation requires that events originating from different parts of the system be simulated in the correct order. The investigation revealed that relaxing this constraint can double or triple simulator performance with little loss of accuracy. Precise simulation also requires simulating references to private data, which have little effect on performance. By neglecting private references, simulation was accelerated by up to 80 percent with little loss of accuracy.
To further increase speed, an existing simulator was parallelized to run on the DASH multiprocessor. The challenge was to preserve accuracy without serializing the simulation. Two approaches were evaluated: (1) ordering synchronization operations only and (2) relaxing the order of all events by a small amount. Both approaches yielded parallel speedups, but performance was limited by load imbalance. Overall, performance was comparable to the performance of hardware-assisted parallel simulators such as the Wisconsin Wind Tunnel.
Cited By
- Petit S, Sahuquillo J and Pont A Characterizing parallel workloads to reduce multiple writer overhead in shared virtual memory systems Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing, (261-268)
- Gibson J, Kunz R, Ofelt D, Horowitz M, Hennessy J and Heinrich M (2000). FLASH vs. (Simulated) FLASH, ACM SIGOPS Operating Systems Review, 34:5, (49-58), Online publication date: 1-Dec-2000.
- Gibson J, Kunz R, Ofelt D, Horowitz M, Hennessy J and Heinrich M (2000). FLASH vs. (Simulated) FLASH, ACM SIGARCH Computer Architecture News, 28:5, (49-58), Online publication date: 1-Dec-2000.
- Gibson J, Kunz R, Ofelt D, Horowitz M, Hennessy J and Heinrich M FLASH vs. (Simulated) FLASH Proceedings of the ninth international conference on Architectural support for programming languages and operating systems, (49-58)
- Gibson J, Kunz R, Ofelt D, Horowitz M, Hennessy J and Heinrich M (2000). FLASH vs. (simulated) FLASH, ACM SIGPLAN Notices, 35:11, (49-58), Online publication date: 1-Nov-2000.
- Zhang Z, Cintra M and Torrellas J (1999). Excel-NUMA, IEEE Transactions on Computers, 48:2, (256-264), Online publication date: 1-Feb-1999.
- Koufaty D and Torrellas J Comparing data forwarding and prefetching for communication-induced misses in shared-memory MPs Proceedings of the 12th international conference on Supercomputing, (53-60)
- Olukotun K, Heinrich M and Ofelt D Digital system simulation Proceedings of the 35th annual Design Automation Conference, (658-663)
- Magnusson P Efficient instruction cache simulation and execution profiling with a threaded-code interpreter Proceedings of the 29th conference on Winter simulation, (1093-1100)
- Richard J and Singh J Parallel hierarchical computation of specular radiosity Proceedings of the IEEE symposium on Parallel rendering, (59-69)
- Choi J and Park K Hybrid Full Map Directory Scheme for Distributed Shared Memory Multiprocessors Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
- Torrie E, Martonosi M, Tseng C and Hall M (1996). Characterizing the Memory Behavior of Compiler-Parallelized Applications, IEEE Transactions on Parallel and Distributed Systems, 7:12, (1224-1237), Online publication date: 1-Dec-1996.
- Kozlov A and Singh J (2019). Parallel Implementations of Probabilistic Inference, Computer, 29:12, (33-40), Online publication date: 1-Dec-1996.
- Legedza U and Weihl W (1996). Reducing synchronization overhead in parallel simulation, ACM SIGSIM Simulation Digest, 26:1, (86-95), Online publication date: 1-Jul-1996.
- Legedza U and Weihl W Reducing synchronization overhead in parallel simulation Proceedings of the tenth workshop on Parallel and distributed simulation, (86-95)
- Martonosi M, Ofelt D and Heinrich M Integrating performance monitoring and communication in parallel computers Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, (138-147)
- Martonosi M, Ofelt D and Heinrich M (1996). Integrating performance monitoring and communication in parallel computers, ACM SIGMETRICS Performance Evaluation Review, 24:1, (138-147), Online publication date: 15-May-1996.
- Holt C, Singh J and Hennessy J Application and architectural bottlenecks in large scale distributed shared memory machines Proceedings of the 23rd annual international symposium on Computer architecture, (134-145)
- Holt C, Singh J and Hennessy J (1996). Application and architectural bottlenecks in large scale distributed shared memory machines, ACM SIGARCH Computer Architecture News, 24:2, (134-145), Online publication date: 1-May-1996.
- Park D and Saavedra R Trojan Proceedings of the 29th Annual Simulation Symposium (SS '96)
- Raynaud A, Zhang Z and Torrellas J Distance-Adaptive Update Protocols for Scalable Shared-Memory Multiprocessors Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
- Rajamony R and Cox A A Performance Debugger for Eliminating Excess Synchronization in Shared-Memory Parallel Programs Proceedings of the 4th International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems
- Martonosi M, Clark D and Mesarina M The SHRIMP performance monitor Proceedings of the SIGMETRICS symposium on Parallel and distributed tools, (61-69)
- Erlichson A, Nayfeh B, Singh J and Olukotun K The benefits of clustering in shared address space multiprocessors Proceedings of the 1995 ACM/IEEE conference on Supercomputing, (60-es)
- Zhang Z and Torrellas J Speeding up irregular applications in shared-memory multiprocessors Proceedings of the 22nd annual international symposium on Computer architecture, (188-199)
- Woo S, Ohara M, Torrie E, Singh J and Gupta A The SPLASH-2 programs Proceedings of the 22nd annual international symposium on Computer architecture, (24-36)
- Torrie E, Tseng C, Martonosi M and Hall M Evaluating the impact of advanced memory systems on compiler-parallelized codes Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques, (204-213)
- Zhang Z and Torrellas J (1995). Speeding up irregular applications in shared-memory multiprocessors, ACM SIGARCH Computer Architecture News, 23:2, (188-199), Online publication date: 1-May-1995.
- Woo S, Ohara M, Torrie E, Singh J and Gupta A (1995). The SPLASH-2 programs, ACM SIGARCH Computer Architecture News, 23:2, (24-36), Online publication date: 1-May-1995.
- Bedichek R Talisman Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems, (14-24)
- Bedichek R (1995). Talisman, ACM SIGMETRICS Performance Evaluation Review, 23:1, (14-24), Online publication date: 1-May-1995.
- Heinrich M, Kuskin J, Ofelt D, Heinlein J, Baxter J, Singh J, Simoni R, Gharachorloo K, Nakahira D, Horowitz M, Gupta A, Rosenblum M and Hennessy J (1994). The performance impact of flexibility in the Stanford FLASH multiprocessor, ACM SIGOPS Operating Systems Review, 28:5, (274-285), Online publication date: 1-Dec-1994.
- Woo S, Singh J and Hennessy J (1994). The performance advantages of integrating block data transfer in cache-coherent multiprocessors, ACM SIGOPS Operating Systems Review, 28:5, (219-229), Online publication date: 1-Dec-1994.
- Heinlein J, Gharachorloo K, Dresser S and Gupta A (1994). Integration of message passing and shared memory in the Stanford FLASH multiprocessor, ACM SIGOPS Operating Systems Review, 28:5, (38-50), Online publication date: 1-Dec-1994.
- Heinrich M, Kuskin J, Ofelt D, Heinlein J, Baxter J, Singh J, Simoni R, Gharachorloo K, Nakahira D, Horowitz M, Gupta A, Rosenblum M and Hennessy J The performance impact of flexibility in the Stanford FLASH multiprocessor Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, (274-285)
- Woo S, Singh J and Hennessy J The performance advantages of integrating block data transfer in cache-coherent multiprocessors Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, (219-229)
- Heinlein J, Gharachorloo K, Dresser S and Gupta A Integration of message passing and shared memory in the Stanford FLASH multiprocessor Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, (38-50)
- Heinrich M, Kuskin J, Ofelt D, Heinlein J, Baxter J, Singh J, Simoni R, Gharachorloo K, Nakahira D, Horowitz M, Gupta A, Rosenblum M and Hennessy J (1994). The performance impact of flexibility in the Stanford FLASH multiprocessor, ACM SIGPLAN Notices, 29:11, (274-285), Online publication date: 1-Nov-1994.
- Woo S, Singh J and Hennessy J (1994). The performance advantages of integrating block data transfer in cache-coherent multiprocessors, ACM SIGPLAN Notices, 29:11, (219-229), Online publication date: 1-Nov-1994.
- Heinlein J, Gharachorloo K, Dresser S and Gupta A (1994). Integration of message passing and shared memory in the Stanford FLASH multiprocessor, ACM SIGPLAN Notices, 29:11, (38-50), Online publication date: 1-Nov-1994.
- Kuskin J, Ofelt D, Heinrich M, Heinlein J, Simoni R, Gharachorloo K, Chapin J, Nakahira D, Baxter J, Horowitz M, Gupta A, Rosenblum M and Hennessy J The Stanford FLASH multiprocessor Proceedings of the 21st annual international symposium on Computer architecture, (302-313)
- Nayfeh B and Olukotun K Exploring the design space for a shared-cache multiprocessor Proceedings of the 21st annual international symposium on Computer architecture, (166-175)
- Kuskin J, Ofelt D, Heinrich M, Heinlein J, Simoni R, Gharachorloo K, Chapin J, Nakahira D, Baxter J, Horowitz M, Gupta A, Rosenblum M and Hennessy J (1994). The Stanford FLASH multiprocessor, ACM SIGARCH Computer Architecture News, 22:2, (302-313), Online publication date: 1-Apr-1994.
- Nayfeh B and Olukotun K (1994). Exploring the design space for a shared-cache multiprocessor, ACM SIGARCH Computer Architecture News, 22:2, (166-175), Online publication date: 1-Apr-1994.
- Kuskin J, Ofelt D, Heinrich M, Heinlein J, Simoni R, Gharachorloo K, Chapin J, Nakahira D, Baxter J, Horowitz M, Gupta A, Rosenblum M and Hennessy J (1994). The Stanford FLASH multiprocessor, ACM SIGARCH Computer Architecture News, 22:2, (302-313), Online publication date: 1-Apr-1994.
Index Terms
- Simulation of multiprocessors: accuracy and performance
Recommendations
Parallel simulation of message routing networks
PDP '95: Proceedings of the 3rd Euromicro Workshop on Parallel and Distributed ProcessingAn implementation of a conservative parallel simulator with deadlock avoidance is presented. Its performance when working with a realistic model of a message routing network is evaluated and contrasted against a sequential simulator. Different factors ...
Shared-files parallel simulation framework for dynamic multi-domains networks using OMNeT++ (WIP)
SummerSim '15: Proceedings of the Conference on Summer Computer SimulationParallel discrete event simulation (PDES) has been recognized as a challenging research field bridging between modeling and simulation and high-performance computing. It tackles the problem of executing discrete event simulations on parallel processors. ...
Conservative parallel simulation of ATM networks
A new Conservative algorithm for both parallel and sequential simulation of networks is described. The technique is motivated by the construction of a high performance simulator for ATM networks. It permits very fast execution of models of ATM systems, ...