Zhihao Jia jiazhihao

510 followers · 1 following

Achievements

x3 x2 x3

Stars

mlc-ai / xgrammar

Fast, Flexible and Portable Structured Generation

C++ 1,036 73 Updated Jun 16, 2025

ruipeterpan / specreason

PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [arXiv '25]

Python 40 5 Updated May 16, 2025

quantum-compiler / Quarl

Forked from quantum-compiler/quartz

Quarl: A Learning-Based Quantum Circuit Optimizer

OpenQASM 3 Updated Jan 2, 2024

flexflow / flexflow-serve

FlexFlow Serve: Low-Latency, High-Performance LLM Serving

C++ 48 5 Updated May 8, 2025

GenseeAI / cognify

Multi-Faceted AI Agent and Workflow Autotuning. Automatically optimizes LangChain, LangGraph, DSPy programs for better quality, lower execution latency, and lower execution cost. Also has a simple …

Python 239 27 Updated May 16, 2025

dywsjtu / apparate

Artifact for "Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving" [SOSP '24]

Python 25 2 Updated Nov 21, 2024

DerrickYLJ / TidalDecode

[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Python 39 3 Updated Apr 18, 2025

sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.

Python 15,513 2,201 Updated Jun 27, 2025

mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation

Python 20,867 1,755 Updated Jun 25, 2025

flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving

Cuda 3,253 354 Updated Jun 27, 2025

humuyan / Korch

ASPLOS'24: Optimal Kernel Orchestration for Tensor Programs with Korch

Python 37 Updated Mar 27, 2025

Hsword / SpotServe

SpotServe: Serving Generative Large Language Models on Preemptible Instances

123 12 Updated Feb 22, 2024

YouAreSpecialToMe / QST

Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

Python 44 3 Updated Nov 5, 2024

interestingLSY / swiftLLM

A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).

Python 226 26 Updated Jun 10, 2025

mirage-project / mirage

Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA

C++ 1,410 82 Updated Jun 27, 2025

Infini-AI-Lab / Sequoia

scalable and robust tree-based speculative decoding algorithm

Python 348 37 Updated Jan 28, 2025

jiazhihao / attention_superoptimizer

An Attention Superoptimizer

C++ 22 Updated Jan 20, 2025

vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 50,873 8,367 Updated Jun 27, 2025

NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorR…

C++ 10,872 1,528 Updated Jun 27, 2025