We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstra…

C++ 181 11 Updated Jan 28, 2025

mirage-project / mirage

Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA

C++ 847 56 Updated May 27, 2025

HazyResearch / ThunderKittens

Tile primitives for speedy kernels

Cuda 2,390 144 Updated May 27, 2025

linkedin / Liger-Kernel

Efficient Triton Kernels for LLM Training

Python 5,102 335 Updated May 27, 2025

NVIDIA / cuEquivariance

cuEquivariance is a math library that is a collection of low-level primitives and tensor ops to accelerate widely used models, like DiffDock, MACE, Allegro and NEQUIP, based on equivariant neural n…

Python 218 12 Updated May 27, 2025

xlite-dev / LeetCUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA etc.🔥

Cuda 4,511 473 Updated May 28, 2025

ScalingIntelligence / KernelBench

KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems

Python 339 37 Updated May 10, 2025

MekkCyber / CutlassAcademy

A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS

178 7 Updated May 6, 2025

FlagOpen / FlagGems

FlagGems is an operator library for large language models implemented in the Triton Language.

Python 546 98 Updated May 28, 2025

tile-ai / tilelang

Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels

C++ 1,210 94 Updated May 28, 2025

ByteDance-Seed / Triton-distributed

Distributed Triton for Parallel Systems

Python 765 49 Updated May 26, 2025

Shu Li shuuul

GPU

NVIDIA / thrust

utkarsh530 / GPUODEBenchmarks

JuliaGPU / Adapt.jl

srush / GPU-Puzzles

facebookresearch / fvcore

f0uriest / interpax

FlagOpen / FlagAttention

ml-explore / mlx

amirzandieh / HyperAttention

BBuf / how-to-optim-algorithm-in-cuda

jinwen-yang / cuPDLP.jl

gpu-mode / profiling-cuda-in-torch

alxndrTL / mamba.py

tspeterkim / flash-attention-minimal

NVIDIA / cutlass

gpu-mode / lectures

unslothai / unsloth

pytorch-labs / attention-gym

philipturner / metal-flash-attention

TiledTensor / TiledCUDA