shuuul's list / GPU · GitHub
Stars

GPU

34 repositories

[ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl

C++ 4,964 762 Updated Feb 8, 2024

Comparison of Julia's GPU kernel-based ODE solvers with other open-source GPU ODE solvers

Cuda 26 1 Updated Jan 4, 2024
Julia 106 27 Updated May 21, 2025

Solve puzzles. Learn CUDA.

Jupyter Notebook 11,036 851 Updated Sep 1, 2024

Collection of common code that's shared among different research projects in FAIR computer vision team.

Python 2,130 234 Updated Nov 26, 2024

Interpolation and function approximation with JAX

Python 190 19 Updated May 9, 2025

A collection of memory efficient attention operators implemented in the Triton language.

Python 270 18 Updated Jun 5, 2024

MLX: An array framework for Apple silicon

C++ 20,736 1,215 Updated May 23, 2025

Triton Implementation of HyperAttention Algorithm

Python 48 3 Updated Dec 11, 2023

How to optimize some algorithms in CUDA.

Cuda 2,222 194 Updated May 25, 2025

GPU-based first-order solver for linear programming.

Julia 72 14 Updated Feb 18, 2025

A simple and efficient Mamba implementation in pure PyTorch and MLX.

Python 1,236 105 Updated Dec 4, 2024

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 827 79 Updated Dec 30, 2024

CUDA Templates for Linear Algebra Subroutines

C++ 7,600 1,247 Updated May 23, 2025

Material for gpu-mode lectures

Jupyter Notebook 4,501 453 Updated Feb 9, 2025

Finetune Qwen3, Llama 4, TTS, DeepSeek-R1 & Gemma 3 LLMs 2x faster with 70% less memory! 🦥

Python 39,485 3,116 Updated May 28, 2025

Helpful tools and examples for working with flex-attention

Python 802 47 Updated May 23, 2025

FlashAttention (Metal Port)

Swift 487 27 Updated Sep 22, 2024

We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstra…

C++ 181 11 Updated Jan 28, 2025

Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA

C++ 847 56 Updated May 27, 2025

Tile primitives for speedy kernels

Cuda 2,390 144 Updated May 27, 2025

Efficient Triton Kernels for LLM Training

Python 5,102 335 Updated May 27, 2025

cuEquivariance is a math library that is a collective of low-level primitives and tensor ops to accelerate widely-used models, like DiffDock, MACE, Allegro and NEQUIP, based on equivariant neural n…

Python 218 12 Updated May 27, 2025

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA etc.🔥

Cuda 4,511 473 Updated May 28, 2025

KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems

Python 339 37 Updated May 10, 2025

A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS

178 7 Updated May 6, 2025

FlagGems is an operator library for large language models implemented in the Triton Language.

Python 546 98 Updated May 28, 2025

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

C++ 1,210 94 Updated May 28, 2025

Distributed Triton for Parallel Systems

Python 765 49 Updated May 26, 2025