Temuçin et al., 2021 - Google Patents

Efficient multi-path NVLink/PCIe-aware UCX based collective communication for deep learning

Document ID: 14030283137655136110
Author: Temuçin Y; Sojoodi A; Alizadeh P; Afsahi A
Publication year: 2021
Publication venue: 2021 IEEE Symposium on High-Performance Interconnects (HOTI)

Snippet

High-performance communication for very large messages on modern multi-GPU nodes has become increasingly important for Deep Learning workloads. These computing nodes are equipped with state-of-the-art interconnects, such as Nvidia's NVLink and PCIe, to facilitate …

Classifications

- G—PHYSICS
  - G06—COMPUTING; CALCULATING; COUNTING
    - G06F—ELECTRICAL DIGITAL DATA PROCESSING
      - G06F9/00—Arrangements for programme control, e.g. control unit
        - G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
          - G06F9/46—Multiprogramming arrangements
            - G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
              - G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
                - G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
                  - G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
                  - G06F9/5044—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
              - G06F9/5061—Partitioning or combining of resources
              - G06F9/5083—Techniques for rebalancing the load in a distributed system
            - G06F9/54—Interprogramme communication; Intertask communication
              - G06F9/546—Message passing systems or structures, e.g. queues
            - G06F9/48—Programme initiating; Programme switching, e.g. by interrupt
              - G06F9/4806—Task transfer initiation or dispatching
                - G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
                  - G06F9/485—Task life-cycle, e.g. stopping, restarting, resuming execution
      - G06F15/00—Digital computers in general; Data processing equipment in general
        - G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a programme unit and a register, e.g. for a simultaneous processing of several programmes
          - G06F15/163—Interprocessor communication
        - G06F15/76—Architectures of general purpose stored programme computers
      - G06F2209/00—Indexing scheme relating to G06F9/00
        - G06F2209/50—Indexing scheme relating to G06F9/50
      - G06F11/00—Error detection; Error correction; Monitoring
        - G06F11/07—Error detection; Error correction; Monitoring responding to the occurrence of a fault, e.g. fault tolerance
      - G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
      - G06F1/00—Details of data-processing equipment not covered by groups G06F3/00 - G06F13/00, e.g. cooling, packaging or power supply specially adapted for computer application
      - G06F12/00—Accessing, addressing or allocating within memory systems or architectures
        - G06F12/02—Addressing or allocation; Relocation

Similar Documents

Publication	Publication Date	Title
Graham et al.	2016	Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction
Wang et al.	2020	Blink: Fast and generic collectives for distributed ml
Sapio et al.	2021	Scaling distributed machine learning with In-Network aggregation
Luo et al.	2018	Parameter hub: a rack-scale parameter server for distributed deep neural network training
Cheng et al.	2018	Using high-bandwidth networks efficiently for fast graph computation
US9864759B2 (en)	2018-01-09	System and method for providing scatter/gather data processing in a middleware environment
Chu et al.	2020	Nv-group: link-efficient reduction for distributed deep learning on modern dense gpu systems
Temuçin et al.	2021	Efficient multi-path NVLink/PCIe-aware UCX based collective communication for deep learning
US9575689B2 (en)	2017-02-21	Data storage system having segregated control plane and/or segregated data plane architecture
He et al.	2021	Easynet: 100 gbps network for hls
Biswas et al.	2018	Accelerating tensorflow with adaptive rdma-based grpc
Daglis et al.	2015	Manycore network interfaces for in-memory rack-scale computing
Haghi et al.	2020	FPGAs in the network and novel communicator support accelerate MPI collectives
Sun et al.	2012	A ugni-based asynchronous message-driven runtime system for cray supercomputers with gemini interconnect
He et al.	2021	Accl: Fpga-accelerated collectives over 100 gbps tcp-ip
Singh et al.	2011	MPI alltoall personalized exchange on GPGPU clusters: Design alternatives and benefit
Karamati et al.	2022	“Smarter” NICs for faster molecular dynamics: a case study
Temuçin et al.	2022	Accelerating deep learning using interconnect-aware ucx communication for mpi collectives
Jin et al.	2024	Distmind: Efficient resource disaggregation for deep learning workloads
Chen et al.	2021	Resource abstraction and data placement for distributed hybrid memory pool
Morozov et al.	2012	ALCF MPI benchmarks: Understanding machine-specific communication behavior
Subramoni et al.	2010	Design and evaluation of generalized collective communication primitives with overlap using connectx-2 offload engine
Sridhar et al.	2008	ScELA: Scalable and extensible launching architecture for clusters
Wang et al.	2016	Coupling GPU and MPTCP to improve Hadoop/MapReduce performance
Venkata et al.	2019	Accelerating OpenSHMEM collectives using in-network computing approach