Abstract
As parallel computing has developed, the scale of high-performance computing systems has grown dramatically, and collective communication has become a bottleneck. Collective communication with hardware support achieves relatively high performance, but its scalability remains a crucial problem because the number of participating nodes is not fixed. This paper proposes the Relax Blocking Parallel Collective Communication Mechanism (RBPCCM) to improve the performance of collective communication in parallel computation. The mechanism combines hardware and software and achieves scalable collective communication by distributing collective resource allocation numbers. RBPCCM also supports implementations at various endpoint scales and is not constrained by the interconnect network topology. A functional simulation model based on the Sunway TaihuLight system is built to verify the correctness and scalability of the proposed method. An RBPCCM prototype is implemented on the network interface, and an FPGA platform is constructed for performance testing. The tests show that RBPCCM improves delay performance by a factor of 2.4 to 37 compared with software-based point-to-point communication.
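For readers unfamiliar with the software baseline mentioned above, the following minimal sketch shows one common way a collective reduction can be assembled from point-to-point messages: a binomial tree. This is an illustrative assumption, not the baseline or the RBPCCM design used in the paper; the function name `binomial_tree_reduce` is hypothetical, and sends are simulated as in-memory additions. It shows why the number of sequential message rounds grows with the logarithm of the node count, which is the cost that hardware-supported collectives aim to reduce.

```python
def binomial_tree_reduce(values):
    """Sum one value per node with simulated point-to-point sends.

    Assumed scheme (not taken from the paper): in each round, the node at
    index receiver + step sends its partial sum to the node at index
    receiver, and the distance `step` doubles every round.
    Returns (total held by node 0, number of sequential rounds).
    """
    partial = list(values)  # partial[i] is the running sum held by node i
    n = len(partial)
    rounds = 0
    step = 1
    while step < n:
        # All sends within one round are independent and could run in parallel.
        for receiver in range(0, n, 2 * step):
            sender = receiver + step
            if sender < n:
                partial[receiver] += partial[sender]
        step *= 2
        rounds += 1
    return partial[0], rounds


if __name__ == "__main__":
    for n in (2, 8, 64, 1024):
        total, rounds = binomial_tree_reduce(range(n))
        print(f"nodes={n:5d} sum={total:7d} sequential rounds={rounds}")
```

Under this assumed scheme, each doubling of the node count adds one more sequential message round to the reduction.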
Acknowledgements
This research is supported by the National Science and Technology Major Project under Grant No. 2013ZX0102-8001-001-001.
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ren, Xj., Zhou, Z., Peng, Q., Xie, Xh. (2018). RBPCCM: Relax Blocking Parallel Collective Communication Mechanism Base on Hardware with Scalability. In: Xu, W., Xiao, L., Li, J., Zhang, C., Zhu, Z. (eds) Computer Engineering and Technology. NCCET 2017. Communications in Computer and Information Science, vol 600. Springer, Singapore. https://doi.org/10.1007/978-981-10-7844-6_7
DOI: https://doi.org/10.1007/978-981-10-7844-6_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7843-9
Online ISBN: 978-981-10-7844-6
eBook Packages: Computer Science, Computer Science (R0)