Parallel machines with an extremely large number of processors are now in operation. For example, the IBM BlueGene/L machine with 128K processors is currently being deployed. It is going to be a significant challenge for application developers to write parallel programs in order to exploit the enormous compute power available and manually scale their applications on such machines. Solving these problems involves finding suitable parallel programming models for such machines and addressing issues like load imbalance. This thesis explores processor virtualization in Charm++ programming model and employing migratable objects for programming petaflops class machines supported by parallel emulation for algorithm validation, parallel simulation for performance prediction, and using new kinds of automatic load balancing strategies to substantially address many of these challenges for programming very large machines.
It is important to understand the performance of parallel applications on very large parallel machines. This thesis explores Parallel Discrete Event Simulation techniques to simulate parallel applications and predict their performance. We present a novel optimistic synchronization protocol which exploits the inherent determinacy in parallel applications to effectively reduce the synchronization overhead.
Load balance problem presents significant challenges to applications to achieve scalability on very large machines. We study load balancing techniques and develop a spectrum of load balancing strategies motivated by several real-world applications. We optimize our load balancing strategies in multiple dimensions of criteria such as communication-aware load balancing, sub-step load balancing, and computation phase-aware load balancing. We have successfully scaled NAMD (a classical molecular dynamics application) to 1TF of peak performance on 3000 processors of PSC Lemieux, using the load balancing framework presented in this thesis.
We further motivate the need for next generation load balancing strategies for petaflops class machines. We explore a novel design of a scalable hierarchical load balancing scheme, which incorporates an explicit memory cost control function to make it easy to adapt to extremely large machines with small memory footprint. This hierarchical load balancing scheme builds load data from instrumenting an application automatically at run-time on both computation and communication pattern. The load balancing strategy takes application communication pattern into account explicitly.
Cited By
- Jeannot E, Mercier G and Tessier F Topology and affinity aware hierarchical and distributed load-balancing in charm++ Proceedings of the First Workshop on Optimization of Communication in HPC, (63-72)
- Acun B, Miller P and Kale L Variation Among Processors Under Turbo Boost in HPC Systems Proceedings of the 2016 International Conference on Supercomputing, (1-12)
- Wang J, Abu-Ghazaleh N and Ponomarev D (2015). AIR, ACM Transactions on Modeling and Computer Simulation, 25:3, (1-25), Online publication date: 7-May-2015.
- Acun B, Gupta A, Jain N, Langer A, Menon H, Mikida E, Ni X, Robson M, Sun Y, Totoni E, Wesolowski L and Kale L Parallel programming with migratable objects Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (647-658)
- Lifflander J, Krishnamoorthy S and Kale L Optimizing data locality for fork/join programs using constrained work stealing Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (857-868)
- Zhang X, Ou J, Davis K and Jiang S Orthrus Proceedings of the 29th International Conference on Supercomputing - Volume 8488, (348-364)
- Gupta A, Sarood O, Kale L and Milojicic D Improving HPC application performance in cloud through dynamic load balancing Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, (402-409)
- Lifflander J, Krishnamoorthy S and Kale L Steal Tree Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, (507-518)
- Lifflander J, Krishnamoorthy S and Kale L (2013). Steal Tree, ACM SIGPLAN Notices, 48:6, (507-518), Online publication date: 23-Jun-2013.
- Rodrigues E, Navaux P, Panetta J and Mendes C (2013). Preserving the original MPI semantics in a virtualized processor environment, Science of Computer Programming, 78:4, (412-421), Online publication date: 1-Apr-2013.
- Sarood O, Meneses E and Kale L A 'cool' way of improving the reliability of HPC machines Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, (1-12)
- Hou B, Yao Y and Peng S Empirical Study on Entity Interaction Graph of Large-Scale Parallel Simulations Proceedings of the 2011 IEEE Workshop on Principles of Advanced and Distributed Simulation, (1-6)
- Sarood O and Kale L A 'cool' load balancer for parallel applications Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, (1-11)
- Rodrigues E, Navaux P, Panetta J and Mendes C A new technique for data privatization in user-level threads and its use in parallel applications Proceedings of the 2010 ACM Symposium on Applied Computing, (2149-2154)
- Bhatelé A, Kalé L and Kumar S Dynamic topology aware load balancing algorithms for molecular dynamics applications Proceedings of the 23rd international conference on Supercomputing, (110-116)
- Gioachin F and Kalé L Memory tagging in Charm++ Proceedings of the 6th workshop on Parallel and distributed systems: testing, analysis, and debugging, (1-7)
- Huang C, Zheng G, Kalé L and Kumar S Performance evaluation of adaptive MPI Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, (12-21)
- Chakravorty S, Mendes C, Kalé L, Jones T, Tauferner A, Inglett T and Moreira J (2006). HPC-Colony, ACM SIGOPS Operating Systems Review, 40:2, (43-49), Online publication date: 1-Apr-2006.
- Zheng G, Huang C and Kalé L (2006). Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++, ACM SIGOPS Operating Systems Review, 40:2, (90-99), Online publication date: 1-Apr-2006.
- Huang C, Lee C and Kalé L Support for adaptivity in ARMCI using migratable objects Proceedings of the 20th international conference on Parallel and distributed processing, (383-383)
- Agarwal T, Sharma A and Kalé L Topology-aware task mapping for reducing communication contention on large parallel machines Proceedings of the 20th international conference on Parallel and distributed processing, (145-145)
- Gioachin F, Sharma A, Chakravorty S, Mendes C, Kalé L and Quinn T Scalable cosmological simulations on parallel machines Proceedings of the 7th international conference on High performance computing for computational science, (476-489)
- Jiao X, Zheng G, Alexander P, Campbell M, Lawlor O, Norris J, Haselbacher A and Heath M (2006). A system integration framework for coupled multiphysics simulations, Engineering with Computers, 22:3-4, (293-309), Online publication date: 1-Dec-2006.
- Lawlor O, Chakravorty S, Wilmarth T, Choudhury N, Dooley I, Zheng G and Kalé L (2006). ParFUM, Engineering with Computers, 22:3-4, (215-235), Online publication date: 1-Dec-2006.
Index Terms
- Achieving high performance on extremely large parallel machines: performance prediction and load balancing
Recommendations
Performance prediction of large parallel applications using parallel simulations
Accurate simulation of large parallel applications can be facilitated with the use of direct execution and parallel discrete event simulation. This paper describes the use of COMPASS, a direct execution-driven, parallel simulator for performance ...