DOI: https://doi.org/10.1145/3605573.3605616
research-article
Open access

Impact of Cache Coherence on the Performance of Shared-Memory based MPI Primitives: A Case Study for Broadcast on Intel Xeon Scalable Processors

Published: 13 September 2023

Abstract

Recent processor advances have made feasible HPC nodes with high core counts, capable of hosting tens or even hundreds of processes. Consequently, designing MPI collective operations at the intra-node level has received significant attention in recent years. Deriving efficient algorithms for modern HPC nodes, with their complex internal topologies and memory hierarchies, is challenging. Moreover, the cache coherence protocol and its impact on performance further complicate algorithm design for MPI collectives. This latter concern is often only partially addressed.
In this work, we demonstrate a particularly challenging performance degradation scenario for shared-memory-based MPI broadcast on three generations of the Intel Xeon Scalable processor architecture. Based on analysis of hardware performance counters, we conclude that the observed degradation is attributable to the cache coherence protocol and the multi-socket configuration of the execution platforms examined. We present a number of novel approaches designed to mitigate this effect, and apply them in a cache-coherence-aware version of the MPI broadcast implementation. We reduce the overall latency of the broadcast operation by up to 1.5× for small messages and 1.25× for large messages.

Supplemental Material

PDF File
Appendix describing the computational artifacts that allow reproduction of the observations and experiments presented in the paper.


Cited By

  • (2024) Exploring the ARM Coherent Mesh Network Topology. In: Architecture of Computing Systems, 221–235. https://doi.org/10.1007/978-3-031-66146-4_15. Online publication date: 1 Aug 2024.

      Published In

      ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing
      August 2023
      858 pages
      ISBN:9798400708435
      DOI:10.1145/3605573
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. HPC
      2. MPI
      3. cache coherence protocol
      4. intra-node

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ICPP 2023
      ICPP 2023: 52nd International Conference on Parallel Processing
      August 7 - 10, 2023
      Salt Lake City, UT, USA

      Acceptance Rates

      Overall Acceptance Rate 91 of 313 submissions, 29%
