Stars
3D Gaussian Splatting, reimagined: Unleashing unmatched speed with C++ and CUDA from the ground up!
CLI tool for developing and profiling GPU kernels locally. Just write, test, and profile GPU code from your laptop.
FlashMLA: Efficient MLA decoding kernels
Fast and memory-efficient exact attention
Examples of CUDA implementations by Cutlass CuTe
Examples of programs built using Modal
Efficient Triton Kernels for LLM Training
Flash Attention in ~100 lines of CUDA (forward pass only)
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
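For orientation, a minimal sketch of what a WMMA-based HGEMM kernel looks like (this is an illustrative baseline, not code from the repo above): one warp computes a 16x16 tile of C = A * B, with all kernel parameters assumed to be multiples of 16.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp per block computes one 16x16 tile of C = A * B.
// A (M x K) and B (K x N) are half precision, row-major; C is float.
// Assumes M, N, K are multiples of 16.
__global__ void hgemm_wmma_sketch(const half *A, const half *B, float *C,
                                  int M, int N, int K) {
    int tileM = blockIdx.y;  // which 16-row tile of C
    int tileN = blockIdx.x;  // which 16-column tile of C

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    // Walk the K dimension 16 columns/rows at a time, accumulating into cFrag.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + tileM * 16 * K + k, K);
        wmma::load_matrix_sync(bFrag, B + k * N + tileN * 16, N);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    }
    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, cFrag, N,
                            wmma::mem_row_major);
}
```

Launched with a `dim3 grid(N / 16, M / 16)` and one warp (32 threads) per block; real HGEMM kernels layer shared-memory staging, double buffering, and MMA PTX on top of this core loop.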
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA.🎉
A simple but fast implementation of matrix multiplication in CUDA.
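The usual starting point such repos build from is the naive SGEMM kernel, where one thread computes one output element; a minimal sketch (illustrative, not the repo's actual code):

```cuda
// Naive SGEMM: each thread computes one element of C = A * B.
// A is M x K, B is K x N, C is M x N, all row-major float.
__global__ void sgemm_naive(int M, int N, int K,
                            const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];  // dot product of row and column
        C[row * N + col] = acc;
    }
}
```

Every optimization step after this (shared-memory tiling, register blocking, vectorized loads) trades the redundant global-memory traffic of this loop for on-chip reuse.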
A series of GPU optimization topics introducing CUDA kernel optimization in detail, covering several basic kernel optimizations, including: elementwise, reduce, s…
Step-by-step optimization of CUDA SGEMM
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.