Starred repositories
A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache.
Flash Attention in ~100 lines of CUDA (forward pass only)
A LaTeX resume template designed for optimal information density and aesthetic appeal.
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
FP8 flash attention implemented on the Ada architecture using the cutlass repository.
Tile primitives for speedy kernels
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
Fast and memory-efficient exact attention
A simplified flash-attention implementation built with cutlass, intended as a teaching resource.
flash attention tutorial written in python, triton, cuda, cutlass
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
A high-throughput and memory-efficient inference and serving engine for LLMs
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
✔ (Completed) Comprehensive deep learning notes covering [Tudui PyTorch], [Mu Li's Dive into Deep Learning], and [Andrew Ng's Deep Learning].
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Next generation BLAS implementation for ROCm platform
An easy-to-understand TensorOp Matmul tutorial.
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.