DOI: https://doi.org/10.1145/3605573.3605616
research-article
Open access

Impact of Cache Coherence on the Performance of Shared-Memory based MPI Primitives: A Case Study for Broadcast on Intel Xeon Scalable Processors

Published: 13 September 2023

Abstract

Recent processor advances have made feasible HPC nodes with high core counts, capable of hosting tens or even hundreds of processes. Consequently, designing MPI collective operations at the intra-node level has received significant attention in recent years. Deriving efficient algorithms for modern HPC nodes, with their complex internal topologies and memory hierarchies, is challenging. Moreover, the cache coherence protocol and its impact on performance further complicate algorithm design for MPI collectives. This latter concern is often only partially addressed.
In this work, we demonstrate a particularly challenging performance degradation scenario for shared-memory-based MPI broadcast on three generations of the Intel Xeon Scalable processor architecture. Based on analysis of hardware performance counters, we conclude that the observed degradation is attributable to the cache coherence protocol and the multi-socket configuration of the execution platforms examined. We present a number of novel approaches designed to mitigate this effect, and apply them in a cache-coherence-aware version of the MPI broadcast implementation. We reduce the overall latency of the broadcast operation by up to 1.5× for small messages and 1.25× for large messages.

Supplemental Material

PDF File
Appendix describing the computational artifacts that allow reproduction of the observations and experiments presented in the paper.


Cited By

  • (2024) Exploring the ARM Coherent Mesh Network Topology. In: Architecture of Computing Systems, 221–235. https://doi.org/10.1007/978-3-031-66146-4_15. Online publication date: 1 Aug 2024.

      Published In

      ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing
      August 2023
      858 pages
      ISBN:9798400708435
      DOI:10.1145/3605573
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. HPC
      2. MPI
      3. cache coherence protocol
      4. intra-node

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ICPP 2023
      ICPP 2023: 52nd International Conference on Parallel Processing
      August 7 - 10, 2023
      Salt Lake City, UT, USA

      Acceptance Rates

      Overall Acceptance Rate 91 of 313 submissions, 29%
