Stars
cjmcv / SageAttention
Forked from thu-ml/SageAttention
Quantized Attention achieves speedups of 2-5x over FlashAttention and 3-11x over xformers, without losing end-to-end metrics across language, image, and video models.
cjmcv / flash-attention
Forked from Dao-AILab/flash-attention
Fast and memory-efficient exact attention
FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
CUDA Matrix Multiplication Optimization
A programmer's guide to cooking at home (Simplified Chinese only).
Open-source DeepWiki: an AI-powered wiki generator for GitHub/GitLab/Bitbucket repositories. Join the Discord: https://discord.gg/gMwThUMeme
Ongoing research training transformer models at scale
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
My learning notes/codes for ML SYS.
A CPU tool for benchmarking peak floating-point performance
An NVIDIA-curated collection of educational resources on general-purpose GPU programming.
A high-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
cjmcv / vllm
Forked from vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
cjmcv / lighteval
Forked from huggingface/lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
cjmcv / cutlass
Forked from NVIDIA/cutlass
CUDA Templates for Linear Algebra Subroutines
A collection of benchmarks to measure basic GPU capabilities
FlashMLA: Efficient MLA decoding kernels
DeepEP: an efficient expert-parallel communication library
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Reading notes on open-source AI infrastructure code (sglang, llm, cutlass, hpc, etc.)
cjmcv / sglang
Forked from sgl-project/sglang
SGLang is a fast serving framework for large language models and vision language models.
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.