Stars
Fast, Flexible and Portable Structured Generation
PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [arXiv '25]
quantum-compiler / Quarl
Quarl: A Learning-Based Quantum Circuit Optimizer (forked from quantum-compiler/quartz)
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
Multi-Faceted AI Agent and Workflow Autotuning. Automatically optimizes LangChain, LangGraph, DSPy programs for better quality, lower execution latency, and lower execution cost. Also has a simple …
Artifact for "Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving" [SOSP '24]
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
SGLang is a fast serving framework for large language models and vision language models.
Universal LLM Deployment Engine with ML Compilation
FlashInfer: Kernel Library for LLM Serving
ASPLOS'24: Optimal Kernel Orchestration for Tensor Programs with Korch
SpotServe: Serving Generative Large Language Models on Preemptible Instances
Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
Scalable and robust tree-based speculative decoding algorithm
A high-throughput and memory-efficient inference and serving engine for LLMs
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorR…
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
TOD: GPU-accelerated Outlier Detection via Tensor Operations
functorch is JAX-like composable function transforms for PyTorch.
Dorylus: Affordable, Scalable, and Accurate GNN Training
PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections