SwampedWys / Starred · GitHub
Starred repositories

A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.

C++ · 49 stars · 4 forks · Updated Jun 10, 2025
C++ · 80 stars · 13 forks · Updated May 16, 2025

Study CUDA

Cuda · 5 stars · 1 fork · Updated Apr 18, 2025

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda · 854 stars · 84 forks · Updated Dec 30, 2024
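The core idea behind the flash-attention forward pass listed above is tiled attention with an online softmax: scores are computed one K/V tile at a time while a running row-max and running denominator are maintained, so the full N×N score matrix is never materialized. A minimal NumPy sketch of that recurrence (the block size and shapes here are illustrative assumptions, not taken from the repository):

```python
import numpy as np

def flash_attention_forward(Q, K, V, block=16):
    """Compute softmax(Q K^T / sqrt(d)) V one K/V tile at a time,
    keeping a running max (m) and running softmax denominator (l)
    so only a N x block score tile is ever held at once."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(N, -np.inf)            # running row max
    l = np.zeros(N)                    # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale         # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])
        alpha = np.exp(m - m_new)      # rescale earlier partial results
        l = alpha * l + p.sum(axis=1)
        O = alpha[:, None] * O + p @ Vj
        m = m_new
    return O / l[:, None]
```

The GPU kernels the repository implements do the same recurrence per thread block in shared memory; the numerically relevant part is that rescaling by `alpha` keeps every partial result consistent with the best max seen so far.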

A LaTeX resume template designed for optimal information density and aesthetic appeal.

TeX · 510 stars · 57 forks · Updated Jun 26, 2024

Flash attention optimization log

Cuda · 14 stars · Updated Jun 4, 2025

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

Cuda · 429 stars · 80 forks · Updated Sep 8, 2024

FP8 flash attention implemented on the Ada architecture using the cutlass repository

Cuda · 71 stars · 6 forks · Updated Aug 12, 2024

Tile primitives for speedy kernels

Cuda · 2,481 stars · 158 forks · Updated Jun 22, 2025

A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS

190 stars · 9 forks · Updated May 6, 2025

Fast and memory-efficient exact attention

Python · 18,055 stars · 1,773 forks · Updated Jun 25, 2025

A stripped-down flash-attention implementation built with cutlass, intended for teaching

Cuda · 42 stars · 5 forks · Updated Aug 12, 2024

flash attention tutorial written in python, triton, cuda, cutlass

Cuda · 376 stars · 40 forks · Updated May 14, 2025
C++ · 124 stars · 37 forks · Updated Dec 6, 2024

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

7,841 stars · 279 forks · Updated May 15, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Python · 5,481 stars · 632 forks · Updated Jun 23, 2025
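The "fine-grained scaling" in the DeepGEMM description refers to giving each small block of a tensor its own scale factor, so the narrow FP8 dynamic range is used well, and applying those scales while accumulating the GEMM. A rough NumPy sketch of the scheme follows; NumPy has no FP8 dtype, so int8 stands in for the low-precision format, the 1×128 block shape for both operands is a simplifying assumption, and all names here are hypothetical:

```python
import numpy as np

BLOCK = 128
QMAX = 127.0  # int8 stand-in for the FP8 representable maximum

def quantize_blockwise(x):
    """Quantize each 1 x BLOCK slice of a row with its own scale."""
    n, k = x.shape
    xb = x.reshape(n, k // BLOCK, BLOCK)
    scales = np.maximum(np.abs(xb).max(axis=2, keepdims=True), 1e-12) / QMAX
    q = np.round(xb / scales).astype(np.int8)
    return q.reshape(n, k), scales.squeeze(2)

def gemm_dequant(qa, sa, qb, sb):
    """Accumulate one BLOCK-wide K-slab at a time, applying the
    per-block scales of both operands to each partial product."""
    n, k = qa.shape
    m = qb.shape[0]
    out = np.zeros((n, m))
    for b in range(k // BLOCK):
        sl = slice(b * BLOCK, (b + 1) * BLOCK)
        partial = qa[:, sl].astype(np.float32) @ qb[:, sl].astype(np.float32).T
        out += partial * sa[:, b][:, None] * sb[:, b][None, :]
    return out
```

Compared with a single per-tensor scale, the per-block scales keep a few large outliers in one block from crushing the precision of every other block, which is why FP8 GEMM kernels pay the extra bookkeeping cost.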

Efficient Top-K implementation on the GPU

Cuda · 179 stars · 21 forks · Updated Apr 9, 2019

A guide to hand-writing CUDA operators, with interview preparation

Cuda · 443 stars · 55 forks · Updated Jan 15, 2025

TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.

Cuda · 90 stars · 6 forks · Updated Jun 28, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python · 50,930 stars · 8,376 forks · Updated Jun 28, 2025

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Cuda · 1,756 stars · 457 forks · Updated Oct 9, 2023

✔ (Completed) The most comprehensive deep-learning notes [Tudui's PyTorch course] [Mu Li's Dive into Deep Learning] [Andrew Ng's Deep Learning]

Jupyter Notebook · 11,301 stars · 1,358 forks · Updated Jun 23, 2025

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

C++ · 804 stars · 68 forks · Updated Jun 3, 2025

Next generation BLAS implementation for ROCm platform

C++ · 382 stars · 187 forks · Updated Jun 27, 2025

An easy-to-understand TensorOp Matmul tutorial

C++ · 365 stars · 47 forks · Updated Sep 21, 2024

DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.

C · 257 stars · 27 forks · Updated May 30, 2025