
Benefits of Adding Hardware Support for Broadcast and Reduce Operations in MPSoC Applications

Published: 03 September 2014

Abstract

MPI has been used as a parallel programming model for supercomputers and clusters and recently in MultiProcessor Systems-on-Chip (MPSoC). One component of MPI is collective communication and its performance is key for certain parallel applications to achieve good speedups. Previous work showed that, with synthetic communication-only benchmarks, communication improvements of up to 11.4-fold and 22-fold for broadcast and reduce operations, respectively, can be achieved by providing hardware support at the network level in a Network-on-Chip (NoC). However, these numbers do not provide a good estimation of the advantage for actual applications, as there are other factors that affect performance besides communications, such as computation. To this end, we extend our previous work by evaluating the impact of hardware support over a set of five parallel application kernels of varying computation-to-communication ratios. By introducing some useful computation to the performance evaluation, we obtain more representative results of the benefits of adding hardware support for broadcast and reduce operations. The experiments show that applications with lower computation-to-communication ratios benefit the most from hardware support as they highly depend on efficient collective communications to achieve better scalability. We also extend our work by doing more analysis on clock frequency, resource usage, power, and energy. The results show reasonable scalability for resource utilization and power in the network interfaces as the number of channels increases and that, even though more power is dissipated in the network interfaces due to the added hardware, the total energy used can still be less if the actual speedup is sufficient. The application kernels are executed in a 24-embedded-processor system distributed across four FPGAs.
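The two qualitative claims in the abstract can be illustrated with a small back-of-the-envelope model. This is only a sketch and not the evaluation method used in the article: it assumes an Amdahl-style split of run time into a computation share and a communication share, and it assumes energy is average power multiplied by run time. The function names, the communication fractions, and the power ratios used in the usage note are hypothetical; only the 11.4-fold and 22-fold communication improvements come from the abstract.

```python
def app_speedup(comm_fraction, comm_speedup):
    """Overall speedup when only the communication share of run time is accelerated.

    comm_fraction: share of baseline run time spent in collective communication (0..1).
    comm_speedup:  factor by which that share is accelerated (for example 11.4).
    Assumption: computation time is unchanged by the hardware support.
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    if comm_speedup <= 0:
        raise ValueError("comm_speedup must be positive")
    return 1.0 / ((1.0 - comm_fraction) + comm_fraction / comm_speedup)


def energy_ratio(power_ratio, speedup):
    """Energy with hardware support relative to the baseline.

    power_ratio: average power with support divided by baseline power (hypothetical input).
    speedup:     overall application speedup.
    Assumption: energy = average power * run time, so the ratio is power_ratio / speedup.
    """
    if power_ratio <= 0 or speedup <= 0:
        raise ValueError("power_ratio and speedup must be positive")
    return power_ratio / speedup
```

Under these assumptions, `app_speedup(0.5, 11.4)` is larger than `app_speedup(0.1, 11.4)`, which matches the observation that kernels with a lower computation-to-communication ratio gain more. Likewise, `energy_ratio(1.2, 1.5)` is below 1 while `energy_ratio(1.2, 1.1)` is above 1: a hypothetical 20% power increase pays off in energy only when the overall speedup exceeds 1.2.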


Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 7, Issue 3
Special Issue on 11th International Conference on Field-Programmable Technology (FPT'12) and Special Issue on the 7th International Workshop on Reconfigurable Communication-Centric Systems-on-Chip (ReCoSoC'12)
August 2014
199 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/2664590
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 September 2014
Accepted: 01 February 2014
Revised: 01 January 2014
Received: 01 June 2013
Published in TRETS Volume 7, Issue 3

Author Tags

  1. FPGA
  2. MPI
  3. multiprocessor
  4. network-on-chip
  5. parallel computing

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2017) Collective Communication on FPGA Clusters with Static Scheduling. ACM SIGARCH Computer Architecture News 44:4 (2-7). DOI: 10.1145/3039902.3039904. Online publication date: 11-Jan-2017
  • (2016) Application-Aware Collective Communication (Extended Abstract). 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) (197-197). DOI: 10.1109/FCCM.2016.55. Online publication date: May-2016
  • (2015) CORDIC-Based Enhanced Systolic Array Architecture for QR Decomposition. ACM Transactions on Reconfigurable Technology and Systems 9:2 (1-22). DOI: 10.1145/2827700. Online publication date: 14-Dec-2015
  • (2015) An Enhanced Adaptive Recoding Rotation CORDIC. ACM Transactions on Reconfigurable Technology and Systems 9:1 (1-25). DOI: 10.1145/2812813. Online publication date: 2-Nov-2015
  • (2015) A low-power and high-SFDR Direct Digital Frequency Synthesizer based on adaptive recoding CORDIC. 2015 IEEE 16th Annual Wireless and Microwave Technology Conference (WAMICON) (1-3). DOI: 10.1109/WAMICON.2015.7120358. Online publication date: Apr-2015
  • (2015) FPGA implementation of low-power and high-PSNR DCT/IDCT architecture based on adaptive recoding CORDIC. 2015 International Conference on Field Programmable Technology (FPT) (128-135). DOI: 10.1109/FPT.2015.7393139. Online publication date: Dec-2015
  • (2014) An efficient FPGA implementation of QR decomposition using a novel systolic array architecture based on enhanced vectoring CORDIC. 2014 International Conference on Field-Programmable Technology (FPT) (123-130). DOI: 10.1109/FPT.2014.7082764. Online publication date: Dec-2014
