Benefits of Adding Hardware Support for Broadcast and Reduce Operations in MPSoC Applications

Authors: Yuanxi Peng, Manuel Saldaña, Christopher A. Madill, Xiaofeng Zou, Paul Chow

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 7, Issue 3

Article No.: 17, Pages 1 - 23

https://doi.org/10.1145/2629470

Published: 03 September 2014

Abstract

MPI has been used as a parallel programming model for supercomputers and clusters and recently in MultiProcessor Systems-on-Chip (MPSoC). One component of MPI is collective communication and its performance is key for certain parallel applications to achieve good speedups. Previous work showed that, with synthetic communication-only benchmarks, communication improvements of up to 11.4-fold and 22-fold for broadcast and reduce operations, respectively, can be achieved by providing hardware support at the network level in a Network-on-Chip (NoC). However, these numbers do not provide a good estimation of the advantage for actual applications, as there are other factors that affect performance besides communications, such as computation. To this end, we extend our previous work by evaluating the impact of hardware support over a set of five parallel application kernels of varying computation-to-communication ratios. By introducing some useful computation to the performance evaluation, we obtain more representative results of the benefits of adding hardware support for broadcast and reduce operations. The experiments show that applications with lower computation-to-communication ratios benefit the most from hardware support as they highly depend on efficient collective communications to achieve better scalability. We also extend our work by doing more analysis on clock frequency, resource usage, power, and energy. The results show reasonable scalability for resource utilization and power in the network interfaces as the number of channels increases and that, even though more power is dissipated in the network interfaces due to the added hardware, the total energy used can still be less if the actual speedup is sufficient. The application kernels are executed in a 24-embedded-processor system distributed across four FPGAs.
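The two qualitative claims in the abstract (kernels with lower computation-to-communication ratios gain the most, and extra interface power can still yield lower total energy) can be illustrated with a back-of-the-envelope model. The sketch below is not the authors' evaluation method. It assumes that run time splits cleanly into computation and collective communication, that hardware support speeds up only the communication part by a fixed factor, and that energy is average power times run time. The function names and the sample ratios are hypothetical; the 11.4-fold figure is the broadcast improvement the abstract quotes for synthetic benchmarks.

```python
def app_speedup(ratio, comm_speedup):
    """Overall speedup when only communication is accelerated.

    ratio        -- baseline computation time / communication time (assumed model)
    comm_speedup -- factor by which collective communication gets faster
    """
    if ratio < 0 or comm_speedup <= 0:
        raise ValueError("ratio must be >= 0 and comm_speedup must be > 0")
    # Normalise baseline communication time to 1, so computation time is `ratio`.
    baseline_time = ratio + 1.0
    accelerated_time = ratio + 1.0 / comm_speedup
    return baseline_time / accelerated_time


def saves_energy(speedup, power_increase):
    """True if total energy drops, assuming energy = average power * run time.

    power_increase -- accelerated power / baseline power (e.g. 1.2 for +20%)
    """
    # E_hw / E_base = power_increase / speedup, so energy falls when the
    # speedup exceeds the relative power increase.
    return speedup > power_increase


if __name__ == "__main__":
    # Hypothetical ratios; 11.4 is the synthetic broadcast figure from the abstract.
    for ratio in (0.1, 1.0, 10.0):
        s = app_speedup(ratio, 11.4)
        print(f"ratio={ratio:5.1f}  speedup={s:.2f}  "
              f"saves energy at +20% power: {saves_energy(s, 1.2)}")
```

Under these assumptions the overall speedup approaches the raw communication speedup as the ratio tends to zero and approaches 1 as computation dominates, which matches the trend the abstract reports.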
