Starred repositories
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
News and links to material related to GPU programming
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
Zero Bubble Pipeline Parallelism
This repository is the official implementation of "Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE"
OLMoE: Open Mixture-of-Experts Language Models
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑, featuring 200+ CUDA/Tensor Cores kernels, HGEMM, and FA-2 MMA. 🎉
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Flash-Muon: An Efficient Implementation of Muon Optimizer
Monitors recent arXiv papers at the intersection of LLMs and RL for control. PRs for good papers are welcome.
Automatically crawls arXiv papers daily, summarizes them using AI, and presents them via GitHub Pages.
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
PPO x Family DRL Tutorial Course (an introductory public course on decision intelligence: 8 lessons that walk through the algorithm theory, clarify the code logic, and put decision-AI applications into practice)
Official Repo for Open-Reasoner-Zero
XAttention: Block Sparse Attention with Antidiagonal Scoring
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
[ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate"
Ring attention implementation with flash attention
A PyTorch-native platform for training generative AI models
Minimalistic 4D-parallelism distributed training framework for educational purposes