Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Updated Feb 15, 2025 · CUDA
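To make the idea behind that description concrete, here is a minimal PyTorch sketch of INT8-quantized attention: Q and K are symmetrically quantized to INT8 with per-tensor scales, the QKᵀ logits are dequantized, and the rest of attention proceeds in floating point. This is an illustrative sketch only, not the repository's fused kernel; the function names are hypothetical, and a real implementation fuses quantization into one GPU kernel with finer-grained (per-block) scales.

```python
import torch

def int8_quantize(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: returns int8 values and a scale."""
    scale = (x.abs().amax() / 127.0).clamp(min=1e-8)
    q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def quantized_attention(q, k, v):
    """Attention with an INT8 QK^T matmul; the PV matmul stays in float."""
    d = q.shape[-1]
    q_i8, q_scale = int8_quantize(q)
    k_i8, k_scale = int8_quantize(k)
    # Simulate the INT8 matmul in float32 for portability; a real kernel
    # runs this on INT8 tensor cores with int32 accumulation.
    logits = q_i8.float() @ k_i8.float().transpose(-1, -2)
    logits = logits * (q_scale * k_scale) / d ** 0.5  # dequantize + 1/sqrt(d)
    attn = torch.softmax(logits, dim=-1)
    return attn @ v

# Usage: (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))
out = quantized_attention(q, k, v)
```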
Learnings and experimentation with GPU programming
Building Triton and CUDA kernels side by side to create a GEMM kernel that matches cuBLAS performance.
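As a sketch of what the Triton side of such a project looks like, here is a minimal tiled GEMM kernel modeled on the official Triton matmul tutorial. The block sizes are arbitrary placeholders; a genuinely cuBLAS-competitive kernel would add autotuned tile configurations, program-ID swizzling for L2 reuse, and half-precision inputs on tensor cores.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn,
                stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                BLOCK_K: tl.constexpr):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C = A @ B.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Masked loads guard the ragged edges of A and B.
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)  # tensor-core matmul on the tile
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc, mask=c_mask)

def gemm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    K2, N = b.shape
    assert K == K2
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    gemm_kernel[grid](a, b, c, M, N, K,
                      a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                      c.stride(0), c.stride(1),
                      BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c

# Usage (requires a CUDA device); fp32 tl.dot uses TF32, hence loose tolerances.
a = torch.randn(512, 512, device="cuda")
b = torch.randn(512, 512, device="cuda")
torch.testing.assert_close(gemm(a, b), a @ b, atol=1e-2, rtol=1e-2)
```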