Stars
CUDA benchmarks for measuring GPU utilization and interference
A prototype of using ibis-substrait to compile against a Substrait extension
Distributed Communication-Optimal LU-factorization Algorithm
A cross-platform way to express data transformations, relational algebra, standardized record expressions, and plans.
RMG is an open-source code for electronic structure calculations and modeling of materials and molecules. It is based on density functional theory and uses a real-space basis and pseudopotentials.
Zero-copy MPI communication of JAX arrays, for turbo-charged HPC applications in Python ⚡
Flax is a neural network library for JAX that is designed for flexibility.
Making large AI models cheaper, faster, and more accessible
PyTorch3D is FAIR's library of reusable components for deep learning with 3D data
Model parallel transformers in JAX and Haiku
Fast and memory-efficient exact attention
Code for "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020
MLPerf HPC WG implementation of Mesh-TensorFlow, plus build scripts for TensorFlow with MPI
Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.
Distributed Communication-Optimal Shuffle and Transpose Algorithm
Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
Parallelformers: An Efficient Model Parallelization Toolkit for Deployment
Transformer-related optimizations, including BERT and GPT
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.