Abstract
This paper presents two parallel formulations for the Barnes-Hut algorithm on the Cell architecture, which differ in tree distribution and construction phases of the algorithm. In the initial parallelization, the domains are dynamically partitioned and assigned to the synergistic processing elements (SPEs), and SPEs construct local trees of the sub-domains in parallel. The enhanced parallelization scheme provides better clustering of the particles by sequentially constructing the global tree of the entire work space in the power processing element (PPE) and by partitioning the tree into sub-trees that can fit in the Local Store. SPEs operate on the sub-tree data and construct local trees in parallel. Our experimental evaluation indicates that this application performs much faster on the Cell BE compared to the Intel Xeon based system. Specifically, our first and second methods on the Cell BE outperform Intel Xeon by a factor of 5.8 and 7.1 for 8192 particles, respectively.
Similar content being viewed by others
References
Kahle, J., Day, M., Hofstee, H., Johns, C., Maeurer, T., Shippy, D.: Introduction to the Cell multiprocessor. IBM J. Res. Dev. 49(4–5), 589–604 (2005)
Kongetira, P., Aingaran, K., Olukotun, K.: Niagara: a 32-way multithreaded SPARC processor. IEEE Microw. Mag. 25(2), 21–29 (2005)
Olukotun, K., Hammond, L.: The future of microprocessors. ACM Queue 3(7), 26–29 (2005)
Kumar, R., Tullsen, D., Jouppi, N.: Heterogeneous chip multiprocessors. IEEE Comput. Soc. 38(11), 32–38 (2005)
Taylor, M.B., Kim, J., Miller, J., Wentzlaff, D., Ghodrat, F., Greenwald, B., Hoffmann, H., Johnson, P., Lee, J.-W., Lee, W., Ma, A., Saraf, A., Seneski, M., Shnidman, N., Strumpen, V., Frank, M., Amarasinghe, S., Agarwal, A.: The RAW microprocessor: a computational fabric for software circuits and general purpose programs. IEEE MICRO 22(2), 25–35 (2002)
Sankaralingam, K., Nagarajan, R., Liu, H., Kim, C., Huh, J., Burger, D., Keckler, S.W., Moore, C.R.: Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In: Proceedings of the 30th Annual International Symposium on Computer Architecture, pp. 442–433 (2003)
Swanson, S., Michelson, K., Schwerin, A., Oskin, M.: Wavescalar. In: Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture, Washington, DC, pp. 291–302 (2003)
Mai, K., Paaske, T., Jayasena, N., Ho, R., Dally, W.J., Horowitz, M.: Smart memories: a modular reconfigurable architecture. In: Proceedings of the Annual International Symposium on Computer Architecture, New York, NY, USA, pp. 161–171 (2000)
Hofstee, H.P.: Power efficient processor architecture and the Cell processor. In: Proceedings of the International Symposium on High Performance Computer Architecture, pp. 258–262 (2005)
Pham, D., Asano, S., Bolliger, M., Day, M.N., Hofstee, H.P., Johns, C., Kahle, J., Kameyama, A., Keaty, J., Masubuchi, Y., Riley, M., Shippy, D., Stasiak, D., Suzuoki, M., Wang, M., Warnock, J., Weitzel, S., Wendel, D., Yamazaki, T., Yazawa, K.: The design and implementation of a first-generation Cell processor. In: International Solid State Circuits Conference, San Fransisco (2005)
Asanovic, K., Bodik, R., Catanzaro, B., Gebis, J., Husbands, P., Keutzer, K., Patterson, D., Plishker, W., Shalf, J., Williams, S., Yelik, K.: The landscape of parallel computing research: a view from Berkley. Technical Report, EECS Department, University of California at Berkley, UCB/EECS-2006-183 (2006)
Barnes, J., Hut, P.: A hierarchical O(nlog n) force calculation algorithm. Nature 324, 446–449 (1986)
Grama, A., Kumar, V., Sameh, A.: Scalable parallel formulations of the Barnes-Hut method for N-Body simulations. In: Proceedings of Supercomputing, pp. 439–448 (1994)
Greengard, L.: The Rapid Evolution of Potential Fields in Particle Systems. The MIT Press, Cambridge (1988)
Ramachandran, U., et al.: Architectural mechanisms for explicit communication in shared memory multiprocessors. In: Proceedings of Supercomputing (1995)
Warren, M., Salmon, J.: A parallel hashed oct tree n-body algorithm. In: Proceedings of Supercomputing Conference (1993)
Singh, J., Holt, C., Totsuka, T., Gupta, A., Hennessy, J.: Load balancing and data locality in hierarchical n-body methods. J. Parallel Distrib. Comput. (1994)
Bader, D.A., Agarwal, V., Madduri, K., Kang, S.: High performance combinatorial algorithm design on the Cell broadband engine processor. Parallel Comput. 33(10–11), 720–740 (2007)
Kistler, M., Perrone, M., Petrini, F.: Cell multiprocessor communication network: built for speed. Communication build for speed. IEEE MICRO 26(3), 10–23 (2006)
Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P., Yelick, K.: The potential of the cell processor for scientific computing. In: Proceedings of the 3rd Conference on Computing Frontiers, pp. 9–20 (2006)
Bader, D.A., Agarwal, V.: FFTC: fastest Fourier transform for the IBM Cell broadband. In: Engine Eleventh Annual High Performance Embedded Computing Workshop (HPEC), Lexington, MA (2007)
Chow, A.C., Fossum, G.C., Brokenshire, D.A.: A programming example: large FFT on the Cell broadband engine. In: Tec. Conf. Proc. of the Global Signal Processing Expo (GSPx) (2005)
Cico, L., Cooper, R., Greene, J.: Performance and programability of the IBM/Sony/Toshiba Cell broadband engine processor. White paper (2006)
Bader, D.A., Agarwal, V., Madduri, K.: On design and analysis of irregular algorithms on the Cell processor: a case study of list ranking. In: The IEEE International Parallel and Distributed Processing Symposium, Long Beach, CA (2007)
Villa, O., Scarpazza, D.P., Petrini, F., Peinador, J.F.: Challenges in mapping graph exploration algorithms on advanced multi-Core processors. In: Parallel and Distributed Processing Symposium, Long Beach, CA (2007)
Sachdeva, V., Kistler, M., Speight, E., Tzeng, T.H.K.: Exploring the viability of the Cell broadband engine for bioinformatics applications. In: Parallel and Distributed Processing Symposium, Long Beach, CA (2007)
Brokenshire, D.A.: Maximizing the power of the cell broadband engine processor: 25 tips to optimal application performance. In: IBM Developer Works (2006)
Demiroz, B., Topcuoglu, H., Kandemir, M., Tosun, O.: Parallelizing Barnes-Hut method on the Cell BE architecture. In: HIPEAC 2010, Third Workshop on Programmability Issues for Multicore Computers (MULTIPROG’10), Pisa, Italy (2010)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Demiroz, B., Topcuoglu, H.R., Kandemir, M. et al. Particle simulation on the Cell BE architecture. Cluster Comput 14, 419–432 (2011). https://doi.org/10.1007/s10586-011-0169-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10586-011-0169-4