Stars
🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton
Analyze computation-communication overlap in V3/R1.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
FlashMLA: Efficient MLA decoding kernels
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper.
Introduction to Machine Learning Systems
Data sets for performance analyses of Johann Sebastian Bach’s 'Goldberg Variations', BWV 988.
Repository of solutions to the exercises in the book Programming Massively Parallel Processors.
A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
An annotated implementation of the Transformer paper.
SGLang is a fast serving framework for large language models and vision language models.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA etc.🔥
Efficient Triton Kernels for LLM Training
The official GitHub page for the survey paper "A Survey on Mixture of Experts in Large Language Models".
A curated list for Efficient Large Language Models
Bringing BERT into modernity via both architecture changes and scaling
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs.
📚A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, Parallelism, MLA, etc.