KV cache store for distributed LLM inference
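A KV cache store holds the per-layer attention keys and values generated during decoding so they can be reused (or shared across nodes) instead of recomputed. A minimal in-memory sketch of the idea, with illustrative names (the class, method names, and keying scheme below are assumptions, not the repository's actual API):

```python
import numpy as np

class KVCacheStore:
    """Toy KV cache keyed by (request_id, layer); a stand-in for a
    distributed store, kept entirely in local memory."""

    def __init__(self):
        self._store = {}

    def put(self, request_id, layer, keys, values):
        # Append the new token's key/value tensors for one decode step.
        entry = self._store.get((request_id, layer))
        if entry is not None:
            k_old, v_old = entry
            keys = np.concatenate([k_old, keys], axis=0)
            values = np.concatenate([v_old, values], axis=0)
        self._store[(request_id, layer)] = (keys, values)

    def get(self, request_id, layer):
        return self._store[(request_id, layer)]

store = KVCacheStore()
for step in range(3):                    # three decode steps
    k = np.ones((1, 4)) * step           # one token's key per step
    v = np.ones((1, 4)) * step
    store.put("req-0", layer=0, keys=k, values=v)

k, v = store.get("req-0", layer=0)
print(k.shape)   # cached keys grow with each generated token
```

Real systems add eviction, serialization, and network transport on top of this append/lookup core.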
A lightweight design for computation-communication overlap.
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference
A curated collection of noteworthy MLSys bloggers (algorithms/systems)
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA etc.🔥
DefTruth / CUDA-Learn-Notes
Forked from xlite-dev/LeetCUDA. 📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
yinfan98 / website
Forked from agefanscom/website. AGE animation official website URL release page.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Nvidia Instruction Set Specification Generator
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Distributed Triton for Parallel Systems
An Easy-to-use, Scalable and High-performance RLHF Framework based on Ray (PPO & GRPO & REINFORCE++ & LoRA & vLLM & RFT)
Large Language Model (LLM) Systems Paper List
RussellGroupCV is a resume template that follows the CV guidelines of the Russell Group universities in the UK.
Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
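W4A8 means weights are quantized to 4-bit integers and activations to 8-bit, with the matmul done in integer arithmetic and rescaled afterward. A minimal sketch of symmetric per-tensor W4A8 quantization (this illustrates the general scheme only; QQQ's actual kernels and scaling strategy differ):

```python
import numpy as np

def quantize_symmetric(x, n_bits):
    # Symmetric per-tensor quantization to signed n_bits integers.
    qmax = 2 ** (n_bits - 1) - 1            # 7 for int4, 127 for int8
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)).astype(np.float32)   # weights
A = rng.standard_normal((4, 8)).astype(np.float32)   # activations

qW, sW = quantize_symmetric(W, 4)   # 4-bit weights
qA, sA = quantize_symmetric(A, 8)   # 8-bit activations

# Integer matmul, then rescale by the product of the two scales.
out = (qA @ qW.T).astype(np.float32) * (sA * sW)
ref = A @ W.T
print(np.max(np.abs(out - ref)))    # quantization error vs. fp32
```

The int4 weight grid dominates the error here; production W4A8 schemes use per-channel or per-group scales to shrink it.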
Deep learning interview guide (covering mathematics, machine learning, deep learning, computer vision, natural language processing, SLAM, and more)
A debugging and profiling tool that can trace and visualize python code execution
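Tracing tools of this kind typically build on Python's standard-library trace hook, which fires a callback on every call and return. A minimal sketch (not the tool's own implementation):

```python
import sys

def tracer(frame, event, arg):
    # Report each function call and its return value as execution proceeds.
    if event == "call":
        print(f"call   {frame.f_code.co_name}")
    elif event == "return":
        print(f"return {frame.f_code.co_name} -> {arg}")
    return tracer   # returning the tracer keeps tracing nested frames

def add(a, b):
    return a + b

sys.settrace(tracer)     # install the hook
result = add(2, 3)
sys.settrace(None)       # remove it
print(result)            # 5
```

Real profilers record timestamps per event and render the resulting trace as a timeline rather than printing it.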
Yuhong Luo and Pan Li. Neighborhood-aware scalable temporal network representation learning. In Learning on Graphs, 2022.