Starred repositories
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
Manually tweaked, auto-generated raylib bindings for Zig. https://github.com/raysan5/raylib
Plugin for generating HTML reports for pytest results
Build and run containers leveraging NVIDIA GPUs
An easy-to-use, scalable, and high-performance RLHF framework built on Ray (PPO, GRPO, REINFORCE++, vLLM, dynamic sampling, and async agentic RL)
TritonParse is a tool designed to help developers analyze and debug Triton kernels by visualizing the compilation process and source code mappings.
AKShare is an elegant and simple financial data interface library for Python, built for human beings! (An open-source financial data interface library.)
CUDA Matrix Multiplication Optimization
FlashInfer: Kernel Library for LLM Serving
Step-by-step optimization of CUDA SGEMM
Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT]
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
Penn CIS 5650 (GPU Programming and Architecture) Final Project
Materials for the Learn PyTorch for Deep Learning: Zero to Mastery course.
Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data.
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
The book "Performance Analysis and Tuning on Modern CPU"
AI fundamentals: GPU architecture, CUDA programming, and large language model basics
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train Qwen3, Llama 4, DeepSeek-R1, Gemma 3, TTS 2x faster with 70% less VRAM.
Python tool for converting files and office documents to Markdown.