ssiu (steve) / Starred · GitHub

GPT2 in handwritten PTX

Cuda 5 1 Updated Jun 29, 2025

3D Gaussian Splatting, reimagined: Unleashing unmatched speed with C++ and CUDA from the ground up!

C++ 1,172 115 Updated Jul 5, 2025

CLI tool for developing and profiling GPU kernels locally. Just write, test, and profile GPU code from your laptop.

Python 32 2 Updated Jul 2, 2025

A Quirky Assortment of CuTe Kernels

Python 126 4 Updated Jul 4, 2025

FlashMLA: Efficient MLA decoding kernels

Cuda 11,641 874 Updated Apr 29, 2025

High-Performance SGEMM on CUDA devices

Cuda 97 5 Updated Jan 21, 2025

Fast and memory-efficient exact attention

Python 18,199 1,781 Updated Jul 5, 2025

Fastest kernels written from scratch

Cuda 287 39 Updated Apr 3, 2025
Cuda 3 1 Updated Apr 1, 2025

Examples of CUDA implementations by Cutlass CuTe

Makefile 201 28 Updated Jul 1, 2025

Examples of programs built using Modal

Python 893 225 Updated Jul 5, 2025

Efficient Triton Kernels for LLM Training

Python 5,300 364 Updated Jul 5, 2025

Go ahead and axolotl questions

Python 9,825 1,058 Updated Jul 5, 2025

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 857 84 Updated Dec 30, 2024

Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.

Jupyter Notebook 1,575 98 Updated Feb 16, 2024
Python 195 25 Updated May 5, 2025

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

Python 642 49 Updated May 5, 2025

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

Cuda 434 80 Updated Sep 8, 2024

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA.🎉

Cuda 5,326 565 Updated Jun 29, 2025

The simplest but fast implementation of matrix multiplication in CUDA.

Cuda 37 4 Updated Jul 26, 2024

CUDA Templates for Linear Algebra Subroutines

C++ 7,790 1,296 Updated Jul 3, 2025

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 1,079 162 Updated Jul 29, 2023

LLM training in simple, raw C/CUDA

Cuda 27,073 3,113 Updated Jun 26, 2025

Assembler for NVIDIA Maxwell architecture

Sass 1,011 167 Updated Jan 3, 2023

Step-by-step optimization of CUDA SGEMM

Cuda 348 46 Updated Mar 30, 2022

Fast CUDA matrix multiplication from scratch

Cuda 762 118 Updated Dec 28, 2023

Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.

Cuda 359 50 Updated Jan 2, 2025