- research-article, October 2021
MPI collective communication through a single set of interfaces: A case for orthogonality
Abstract: We present and discuss a unified view of and interface for collective communication in the MPI (Message-Passing Interface) standard that in a natural way exploits MPI’s orthogonality of concepts. We observe that the currently separate ...
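The orthogonality theme can be illustrated outside the paper itself: in standard MPI, each regular collective is a special case of its vector variant. A minimal C sketch (standard MPI-3 calls only, not the paper's proposed interface) expressing a uniform MPI_Gather through MPI_Gatherv:

```c
#include <mpi.h>
#include <stdlib.h>

/* Express a uniform MPI_Gather as MPI_Gatherv: the regular collective is
 * a special case of the vector variant (constant counts, regular strides). */
int gather_via_gatherv(const int *sendbuf, int count, int *recvbuf,
                       int root, MPI_Comm comm)
{
    int size, rank;
    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);

    int *counts = NULL, *displs = NULL;
    if (rank == root) {               /* counts/displs significant at root only */
        counts = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) {
            counts[i] = count;        /* every rank contributes equally */
            displs[i] = i * count;    /* contiguous, regularly strided  */
        }
    }
    int err = MPI_Gatherv(sendbuf, count, MPI_INT,
                          recvbuf, counts, displs, MPI_INT, root, comm);
    free(counts);
    free(displs);
    return err;
}
```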
- review-article, July 2019
Efficient design for MPI asynchronous progress without dedicated resources
- Amit Ruhela,
- Hari Subramoni,
- Sourav Chakraborty,
- Mohammadreza Bayatpour,
- Pouya Kousha,
- Dhabaleswar K. (DK) Panda
Highlights:
- Presents a scalable asynchronous progress design that requires:
  - No additional software or hardware resources.
  - No interrupts from the network adapter.
  - No ...
The overlap of computation and communication is critical for good performance of many HPC applications. State-of-the-art designs for asynchronous progress require specially designed hardware resources (advanced switches or network ...
- research-article, September 2018
Efficient Asynchronous Communication Progress for MPI without Dedicated Resources
- Amit Ruhela,
- Hari Subramoni,
- Sourav Chakraborty,
- Mohammadreza Bayatpour,
- Pouya Kousha,
- Dhabaleswar K. Panda
EuroMPI '18: Proceedings of the 25th European MPI Users' Group Meeting, Article No.: 14, Pages 1–11, https://doi.org/10.1145/3236367.3236376
The overlap of computation and communication is critical for good performance of many HPC applications. State-of-the-art designs for asynchronous progress require specially designed hardware resources (advanced switches or network interface cards), ...
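For context on what the two entries above aim to replace: without asynchronous progress, an application typically has to drive the MPI progress engine itself, e.g. by interleaving MPI_Test calls with computation. A minimal C sketch of that manual-progress pattern (illustrative only, not the papers' design; compute_chunk and the chunking scheme are hypothetical placeholders):

```c
#include <mpi.h>

/* Hypothetical compute kernel standing in for real application work. */
void compute_chunk(double *data, int n);

/* Overlap a large nonblocking send with computation by manually driving
 * the MPI progress engine via MPI_Test. Asynchronous-progress designs
 * aim to make these intermediate Test calls unnecessary. */
void send_with_manual_progress(const double *msg, int count, int dest,
                               double *work, int nchunks, int chunk,
                               MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Isend(msg, count, MPI_DOUBLE, dest, 0, comm, &req);
    for (int i = 0; i < nchunks; i++) {
        compute_chunk(&work[i * chunk], chunk);
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE); /* drive progress */
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```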
- article, May 2016
A novel MPI reduction algorithm resilient to imbalances in process arrival times
The Journal of Supercomputing (JSCO), Volume 72, Issue 5, Pages 1973–2013, https://doi.org/10.1007/s11227-016-1707-x
Reduction algorithms are optimized only under the assumption that all processes commence the reduction simultaneously. Research on process arrival times has shown that this is rarely the case. Thus, all benchmarking methodologies that take into account ...
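The arrival-time effect the abstract raises is easy to reproduce: a micro-benchmark can stagger when ranks enter the reduction and time the call per rank. A hedged C sketch of such a measurement (the skew schedule and message size are arbitrary choices, not the paper's methodology):

```c
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Time MPI_Reduce under an imposed process-arrival-time imbalance:
 * each rank sleeps proportionally to its rank before entering the call,
 * so ranks arrive staggered rather than simultaneously. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in = (double)rank, out = 0.0;

    MPI_Barrier(MPI_COMM_WORLD);       /* common starting point      */
    usleep(1000 * rank);               /* arbitrary skew: 1 ms/rank  */

    double t0 = MPI_Wtime();
    MPI_Reduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    printf("rank %d: reduce took %.6f s\n", rank, t1 - t0);
    MPI_Finalize();
    return 0;
}
```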
- article, December 2014
Scalable PGAS collective operations in NUMA clusters
- Damián A. Mallón,
- Guillermo L. Taboada,
- Carlos Teijeiro,
- Jorge González-Domínguez,
- Andrés Gómez,
- Brian Wibecan
Cluster Computing (KLU-CLUS), Volume 17, Issue 4, Pages 1473–1495, https://doi.org/10.1007/s10586-014-0377-9
The increasing number of cores per processor is making manycore-based systems pervasive. This involves dealing with multiple levels of memory in non-uniform memory access (NUMA) systems and processor core hierarchies, accessible via complex ...
- research-article, September 2014
Exploring the effect of noise on the performance benefit of nonblocking allreduce
EuroMPI/ASIA '14: Proceedings of the 21st European MPI Users' Group Meeting, Pages 77–82, https://doi.org/10.1145/2642769.2642786
Relaxed synchronization offers the potential of maintaining application scalability by allowing many processes to make independent progress when some processes suffer delays. Yet, the benefits of this approach in important parallel workloads have not ...
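MPI-3's MPI_Iallreduce is the standard mechanism for letting ranks keep working while a reduction is in flight, which is the pattern whose noise sensitivity this paper studies. A minimal sketch (independent_work is a hypothetical placeholder for work that does not depend on the reduction result):

```c
#include <mpi.h>

/* Hypothetical placeholder for work independent of the reduction result. */
void independent_work(void);

/* Post a nonblocking allreduce, keep computing, then complete it.
 * Delayed ("noisy") ranks stall only the final MPI_Wait, not the
 * independent work performed by the others in the meantime. */
void noise_tolerant_sum(const double *local, double *global, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Iallreduce(local, global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);
    independent_work();               /* overlap window */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```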
- article, February 2013
KNEM: A generic and scalable kernel-assisted intra-node MPI communication framework
Journal of Parallel and Distributed Computing (JPDC), Volume 73, Issue 2, Pages 176–188, https://doi.org/10.1016/j.jpdc.2012.09.016
The multiplication of cores in today's architectures raises the importance of intra-node communication in modern clusters and its impact on overall parallel application performance. Although several proposals focused on this issue in the past, ...
- article, September 2012
Low-Latency Collectives for the Intel SCC
CLUSTER '12: Proceedings of the 2012 IEEE International Conference on Cluster Computing, Pages 346–354, https://doi.org/10.1109/CLUSTER.2012.58
Message passing has been adopted as the main programming paradigm for many-core processors with on-chip networks for inter-core communication. To this end, message-passing libraries such as MPI can be used, as they provide well-known interfaces to ...
- article, October 2007
A calculus for parallel computations over multidimensional dense arrays
Computer Languages, Systems and Structures (CLSS), Volume 33, Issue 3-4, Pages 82–110, https://doi.org/10.1016/j.cl.2006.07.005
We present a calculus to formalize and give costs to parallel computations over multidimensional dense arrays. The calculus extends a simple distribution calculus (proposed in some previous work) with computation and data collection. We consider an SPMD ...
- article, September 2007
Optimizing a conjugate gradient solver with non-blocking collective operations
Parallel Computing (PACO), Volume 33, Issue 9, Pages 624–633, https://doi.org/10.1016/j.parco.2007.06.006
This paper presents a case study that analyzes the suitability and usage of non-blocking collective operations in parallel applications. As with their point-to-point counterparts, non-blocking collective operations provide the ability to overlap ...
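This work predates MPI-3's standardized nonblocking collectives, so it relied on library-based implementations; in modern MPI the core pattern can be sketched with MPI_Iallreduce: start the global reduction of a local dot product, perform a vector update that does not depend on its result, then wait. A hedged C sketch (the loop structure and helper names are illustrative, not the paper's solver):

```c
#include <mpi.h>

/* Hypothetical helpers for the local vector pieces of a CG iteration. */
double local_dot(const double *x, const double *y, int n);
void   local_axpy(double a, const double *x, double *y, int n);

/* Overlap the global dot-product reduction with a vector update that
 * does not depend on its result, as in a pipelined CG iteration. */
void cg_step_overlap(const double *p, const double *q, double *x,
                     double alpha_prev, int n, MPI_Comm comm)
{
    double local = local_dot(p, q, n);
    double global;
    MPI_Request req;

    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);
    local_axpy(alpha_prev, p, x, n);  /* independent update overlaps */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    /* ... use 'global' to form the next step size ... */
}
```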