Stars
CUDA benchmarks for measuring GPU utilization and interference
A prototype that uses ibis-substrait to compile against a Substrait extension
Distributed Communication-Optimal LU-factorization Algorithm
A cross-platform way to express data transformations, relational algebra, and standardized record expressions and plans.
RMG is an Open Source code for electronic structure calculations and modeling of materials and molecules. It is based on density functional theory and uses a real space basis and pseudopotentials.
Google Research
Zero-copy MPI communication of JAX arrays, for turbo-charged HPC applications in Python ⚡ (a minimal usage sketch follows this list)
Long Range Arena for Benchmarking Efficient Transformers
Flax is a neural network library for JAX that is designed for flexibility.
Making large AI models cheaper, faster and more accessible
PyTorch3D is FAIR's library of reusable components for deep learning with 3D data
Model parallel transformers in JAX and Haiku
Fast and memory-efficient exact attention
Training and serving large-scale neural networks with auto parallelization.
Code for "Heterogenity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020
MLPerf HPC WG implementation of Mesh-TensorFlow, plus build scripts for TensorFlow with MPI
Matrix multiplication on GPUs for matrices stored in CPU memory. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.
Distributed Communication-Optimal Shuffle and Transpose Algorithm
Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
Parallelformers: An Efficient Model Parallelization Toolkit for Deployment
Transformer-related optimizations, including BERT and GPT
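As a concrete illustration of the zero-copy JAX/MPI entry above, here is a minimal sketch following the allreduce pattern from the mpi4jax README. The function name mean_across_ranks and the array shape are illustrative choices, not part of the library, and details such as the returned token may vary between mpi4jax versions.

    from mpi4py import MPI
    import jax
    import jax.numpy as jnp
    import mpi4jax

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    @jax.jit
    def mean_across_ranks(x):
        # Sum the array across all MPI ranks without copying it
        # out of device memory; mpi4jax returns the reduced array
        # plus a token used to order communication operations.
        total, _ = mpi4jax.allreduce(x, op=MPI.SUM, comm=comm)
        return total / comm.Get_size()

    local = jnp.ones((4, 4)) * rank
    print(mean_across_ranks(local))

Launched as, e.g., mpirun -n 4 python script.py, each rank contributes its local array and every rank receives the mean; because the allreduce is registered as a JAX primitive, it can sit inside a jit-compiled function.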