To make significant advances in the design of large-scale multiprocessing systems, designers need to understand the characteristics of the applications that will run on these systems. This thesis studies the parallelization and system implications of one important class of scientific applications: N-body applications that use hierarchical solution methods. We focus primarily on the Barnes-Hut and Fast Multipole methods, the two most prominent hierarchical methods for classical N-body problems. In addition, we study a radiosity application from computer graphics, which is an application of the hierarchical N-body approach to a very different problem domain.
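Both methods rest on the same hierarchical idea: a group of bodies that is sufficiently far away can be approximated by a single aggregate interaction rather than summed body by body. The following sketch (illustrative only, not code from the thesis; the Body/Node structures and the opening-angle parameter theta are assumptions) shows a Barnes-Hut style force evaluation over a prebuilt tree of cells:

```python
import math

class Body:
    """A point mass; `pos` is an (x, y) tuple."""
    def __init__(self, pos, mass):
        self.pos, self.mass = pos, mass

class Node:
    """A cell of a prebuilt quadtree: either a leaf holding one body, or an
    internal cell summarized by its total mass and center of mass."""
    def __init__(self, size, mass, com, children=None, body=None):
        self.size = size            # side length of the cell
        self.mass = mass            # total mass of the bodies inside the cell
        self.com = com              # (x, y) center of mass of the cell
        self.children = children or []
        self.body = body            # the single body, for leaf cells

def force_on(body, node, theta=0.5, G=1.0, eps=1e-9):
    """Force exerted on `body` by the subtree rooted at `node`."""
    dx = node.com[0] - body.pos[0]
    dy = node.com[1] - body.pos[1]
    dist = math.sqrt(dx * dx + dy * dy) + eps
    # Opening criterion: a leaf, or a cell that subtends a small enough angle,
    # is treated as a single pseudo-particle at its center of mass.
    if node.body is not None or node.size / dist < theta:
        if node.body is body:
            return (0.0, 0.0)       # a body exerts no force on itself
        f = G * body.mass * node.mass / (dist * dist)
        return (f * dx / dist, f * dy / dist)
    # Otherwise the cell is too close for the approximation: recurse.
    fx = fy = 0.0
    for child in node.children:
        cfx, cfy = force_on(body, child, theta, G, eps)
        fx, fy = fx + cfx, fy + cfy
    return (fx, fy)
```

Because far-field interactions are aggregated in this way, the work per timestep drops from the O(n^2) of the direct method to roughly O(n log n) for Barnes-Hut (and O(n) for the Fast Multipole Method), which is what makes these methods attractive at large scale.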
We first address the problem of obtaining effective parallel performance on these applications, which is challenging owing to their nonuniform, dynamically changing nature and their need for long-range communication. For the classical applications, we show that simple yet effective partitioning techniques can be developed by exploiting key insights into the problems and solution methods. Using a novel partitioning technique that we propose, we obtain 45-fold speedups on a 48-processor Stanford DASH machine (a cache-coherent, shared address space multiprocessor) and 118-fold speedups on a 128-processor simulated architecture, even for relatively small problems. The radiosity application also yields good parallel performance (a 27-fold speedup on a 32-processor DASH), but only by resorting to dynamic task stealing.
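One insight exploited in this setting is that a physical system evolves slowly between timesteps, so the work a body required in the previous timestep is a good predictor of the work it will require in the current one; such per-body cost estimates can then drive the partitioner. The sketch below is a simplified illustration of cost-weighted partitioning under assumed inputs, not the thesis's own algorithm: bodies, assumed to be ordered by a locality-preserving tree traversal, are split into contiguous zones of roughly equal total cost.

```python
def partition_by_cost(costs, nprocs):
    """Return, for each processor, the (start, end) slice of the body order.

    `costs` holds a per-body work estimate (e.g., the interaction count
    recorded during the previous timestep); bodies are assumed to be listed
    in a locality-preserving order so each slice is spatially contiguous.
    """
    total = sum(costs)
    target = total / nprocs
    slices, start, running, proc = [], 0, 0.0, 0
    for i, c in enumerate(costs):
        running += c
        # Close this processor's zone once it has accumulated its share,
        # leaving bodies for the remaining processors.
        if running >= target * (proc + 1) and proc < nprocs - 1:
            slices.append((start, i + 1))
            start, proc = i + 1, proc + 1
    slices.append((start, len(costs)))
    return slices

# Example: 8 bodies with nonuniform costs split across 3 processors.
print(partition_by_cost([5, 1, 1, 3, 8, 2, 2, 2], 3))   # [(0, 4), (4, 5), (5, 8)]
```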
We then study the implications of this class of applications for multiprocessor architecture. We show that both a shared address space and caching are important architectural properties for achieving high performance. Finally, we examine the architectural implications of scaling the applications to run on machines with more processors. We find that a scaling methodology that reflects the concerns of the application scientist leads to different conclusions than the more naive scaling methods typically found in the literature. In particular, both the communication-to-computation ratio that a machine must support and the size of a processor's working set (and hence the per-processor cache size required for effective performance) grow slowly as larger problems are run on larger machines.