Stars
3D Gaussian Splatting, reimagined: Unleashing unmatched speed with C++ and CUDA from the ground up!
CLI tool for developing and profiling GPU kernels locally. Just write, test, and profile GPU code from your laptop.
FlashMLA: Efficient MLA decoding kernels
Fast and memory-efficient exact attention
Examples of CUDA implementations by Cutlass CuTe
Examples of programs built using Modal
Efficient Triton Kernels for LLM Training
Flash Attention in ~100 lines of CUDA (forward pass only)
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
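For orientation, a minimal sketch of what a WMMA-based HGEMM kernel looks like (this is an illustrative baseline, not code from the repo above): one warp computes a 16x16 tile of C = A * B, with all kernel parameters assumed to be multiples of 16.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp per block computes one 16x16 tile of C = A * B.
// A (M x K) and B (K x N) are half precision, row-major; C is float.
// Assumes M, N, K are multiples of 16.
__global__ void hgemm_wmma_sketch(const half *A, const half *B, float *C,
                                  int M, int N, int K) {
    int tileM = blockIdx.y;  // which 16-row tile of C
    int tileN = blockIdx.x;  // which 16-column tile of C

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    // Walk the K dimension 16 columns/rows at a time, accumulating into cFrag.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + tileM * 16 * K + k, K);
        wmma::load_matrix_sync(bFrag, B + k * N + tileN * 16, N);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    }
    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, cFrag, N,
                            wmma::mem_row_major);
}
```

Launched with a `dim3 grid(N / 16, M / 16)` and one warp (32 threads) per block; real HGEMM kernels layer shared-memory staging, double buffering, and MMA PTX on top of this core loop.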
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA.🎉
A simple but fast implementation of matrix multiplication in CUDA.
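The usual starting point such repos build from is the naive SGEMM kernel, where one thread computes one output element; a minimal sketch (illustrative, not the repo's actual code):

```cuda
// Naive SGEMM: each thread computes one element of C = A * B.
// A is M x K, B is K x N, C is M x N, all row-major float.
__global__ void sgemm_naive(int M, int N, int K,
                            const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];  // dot product of row and column
        C[row * N + col] = acc;
    }
}
```

Every optimization step after this (shared-memory tiling, register blocking, vectorized loads) trades the redundant global-memory traffic of this loop for on-chip reuse.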
A series of GPU optimization topics introducing CUDA kernel optimization in detail, covering several basic kernel optimizations, including: elementwise, reduce, s…
Step-by-step optimization of CUDA SGEMM
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.