Starred repositories
A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache.
Flash Attention in ~100 lines of CUDA (forward pass only)
A LaTeX resume template designed for optimal information density and aesthetic appeal.
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
FP8 flash attention implemented on the Ada architecture using the cutlass repository.
Tile primitives for speedy kernels
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
Fast and memory-efficient exact attention
A simplified flash-attention implementation built with cutlass, intended as a teaching resource.
flash attention tutorial written in python, triton, cuda, cutlass
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
A high-throughput and memory-efficient inference and serving engine for LLMs
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
✔ (Completed) Comprehensive deep learning notes covering [Tudui PyTorch], [Mu Li's Dive into Deep Learning], and [Andrew Ng's Deep Learning].
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Next generation BLAS implementation for ROCm platform
An easy-to-understand TensorOp Matmul tutorial.
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.