Efficient multi-path NVLink/PCIe-aware UCX based collective communication for deep learning
Temuçin et al., 2021
- Document ID
- 14030283137655136110
- Author
- Temuçin Y
- Sojoodi A
- Alizadeh P
- Afsahi A
- Publication year
- 2021
- Publication venue
- 2021 IEEE Symposium on High-Performance Interconnects (HOTI)
Snippet
High-performance communication for very large messages on modern multi-GPU nodes has become increasingly important for Deep Learning workloads. These computing nodes are equipped with state-of-the-art interconnects, such as Nvidia's NVLink and PCIe, to facilitate …
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5044—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogramme communication; Intertask communication
- G06F9/546—Message passing systems or structures, e.g. queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Programme initiating; Programme switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/485—Task life-cycle, e.g. stopping, restarting, resuming execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a programme unit and a register, e.g. for a simultaneous processing of several programmes
- G06F15/163—Interprocessor communication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored programme computers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurrence of a fault, e.g. fault tolerance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F1/00—Details of data-processing equipment not covered by groups G06F3/00 - G06F13/00, e.g. cooling, packaging or power supply specially adapted for computer application
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
Similar Documents
Publication | Title |
---|---|
Graham et al. | Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction |
Wang et al. | Blink: Fast and generic collectives for distributed ml |
Sapio et al. | Scaling distributed machine learning with In-Network aggregation |
Luo et al. | Parameter hub: a rack-scale parameter server for distributed deep neural network training |
Cheng et al. | Using high-bandwidth networks efficiently for fast graph computation |
US9864759B2 (en) | System and method for providing scatter/gather data processing in a middleware environment |
Chu et al. | Nv-group: link-efficient reduction for distributed deep learning on modern dense gpu systems |
Temuçin et al. | Efficient multi-path NVLink/PCIe-aware UCX based collective communication for deep learning |
US9575689B2 (en) | Data storage system having segregated control plane and/or segregated data plane architecture |
He et al. | Easynet: 100 gbps network for hls |
Biswas et al. | Accelerating tensorflow with adaptive rdma-based grpc |
Daglis et al. | Manycore network interfaces for in-memory rack-scale computing |
Haghi et al. | FPGAs in the network and novel communicator support accelerate MPI collectives |
Sun et al. | A ugni-based asynchronous message-driven runtime system for cray supercomputers with gemini interconnect |
He et al. | Accl: Fpga-accelerated collectives over 100 gbps tcp-ip |
Singh et al. | MPI alltoall personalized exchange on GPGPU clusters: Design alternatives and benefit |
Karamati et al. | “Smarter” NICs for faster molecular dynamics: a case study |
Temuçin et al. | Accelerating deep learning using interconnect-aware ucx communication for mpi collectives |
Jin et al. | Distmind: Efficient resource disaggregation for deep learning workloads |
Chen et al. | Resource abstraction and data placement for distributed hybrid memory pool |
Morozov et al. | ALCF MPI benchmarks: Understanding machine-specific communication behavior |
Subramoni et al. | Design and evaluation of generalized collective communication primitives with overlap using connectx-2 offload engine |
Sridhar et al. | ScELA: Scalable and extensible launching architecture for clusters |
Wang et al. | Coupling GPU and MPTCP to improve Hadoop/MapReduce performance |
Venkata et al. | Accelerating OpenSHMEM collectives using in-network computing approach |