jiazhihao (Zhihao Jia) / Starred · GitHub

Organizations

@mirage-project

Showing results

Fast, Flexible and Portable Structured Generation

C++ 1,036 73 Updated Jun 16, 2025

PoC for "SpecReason: Fast and Accura 10000 te Inference-Time Compute via Speculative Reasoning" [arXiv '25]

Python 40 5 Updated May 16, 2025

Quarl: A Learning-Based Quantum Circuit Optimizer

OpenQASM 3 Updated Jan 2, 2024

FlexFlow Serve: Low-Latency, High-Performance LLM Serving

C++ 48 5 Updated May 8, 2025

Multi-Faceted AI Agent and Workflow Autotuning. Automatically optimizes LangChain, LangGraph, DSPy programs for better quality, lower execution latency, and lower execution cost. Also has a simple …

Python 239 27 Updated May 16, 2025

Artifact for "Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving" [SOSP '24]

Python 25 2 Updated Nov 21, 2024

[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Python 39 3 Updated Apr 18, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 15,513 2,201 Updated Jun 27, 2025
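
As a point of reference for how a serving framework like this is typically driven, here is a minimal sketch using SGLang's frontend language. The endpoint URL, model path, prompt, and token budget are illustrative assumptions, not taken from this page; names follow the sglang Python package as documented, but may differ across versions.

    import sglang as sgl

    # Assumes an SGLang server is already running locally, e.g.:
    #   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

    @sgl.function
    def qa(s, question):
        # Build a chat-style prompt and generate a bounded answer.
        s += sgl.user(question)
        s += sgl.assistant(sgl.gen("answer", max_tokens=128))

    state = qa.run(question="What does speculative decoding speed up?")
    print(state["answer"])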

Universal LLM Deployment Engine with ML Compilation

Python 20,867 1,755 Updated Jun 25, 2025
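
A hedged sketch of calling MLC LLM from Python through its OpenAI-style engine API; the MLCEngine import path, the prebuilt model identifier, and the argument layout are assumptions based on MLC's documentation and may vary by release.

    from mlc_llm import MLCEngine  # assumption: high-level engine is exposed at this path

    # Illustrative prebuilt "-MLC" weights hosted on Hugging Face.
    model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
    engine = MLCEngine(model)

    response = engine.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize ML compilation in one sentence."}],
        stream=False,
    )
    print(response.choices[0].message.content)

    engine.terminate()  # release the runtime explicitly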

FlashInfer: Kernel Library for LLM Serving

Cuda 3,253 354 Updated Jun 27, 2025

ASPLOS'24: Optimal Kernel Orchestration for Tensor Programs with Korch

Python 37 Updated Mar 27, 2025

SpotServe: Serving Generative Large Language Models on Preemptible Instances

123 12 Updated Feb 22, 2024

Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

Python 44 3 Updated Nov 5, 2024

A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).

Python 226 26 Updated Jun 10, 2025

Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA

C++ 1,410 82 Updated Jun 27, 2025

A scalable and robust tree-based speculative decoding algorithm

Python 348 37 Updated Jan 28, 2025

An Attention Superoptimizer

C++ 22 Updated Jan 20, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 50,873 8,367 Updated Jun 27, 2025
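
For context, a minimal offline-inference sketch with vLLM's Python API; the model name, prompt, and sampling settings are placeholder choices, not anything from this list.

    from vllm import LLM, SamplingParams

    # Any Hugging Face-compatible model id works here; opt-125m is just a small placeholder.
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    outputs = llm.generate(["The key idea behind paged attention is"], params)
    for out in outputs:
        print(out.prompt, "->", out.outputs[0].text)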

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorR…

C++ 10,872 1,528 Updated Jun 27, 2025
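
Since the description highlights an easy-to-use Python API, here is a hedged sketch of TensorRT-LLM's high-level LLM interface; the top-level import and the model checkpoint are assumptions that hold for recent releases but may differ by version.

    from tensorrt_llm import LLM, SamplingParams  # assumption: high-level API exported at the package root

    # Illustrative checkpoint; TensorRT-LLM builds an optimized engine for it on first use.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    outputs = llm.generate(
        ["Explain in-flight batching in one sentence."],
        SamplingParams(max_tokens=48),
    )
    for out in outputs:
        print(out.outputs[0].text)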

Inference code for Llama models

Python 58,431 9,778 Updated Jan 26, 2025

Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training

C++ 1,805 239 Updated Jun 24, 2025

TOD: GPU-accelerated Outlier Detection via Tensor Operations

Python 180 24 Updated Mar 2, 2023

functorch provides JAX-like composable function transforms for PyTorch.

Jupyter Notebook 1,432 104 Updated Jun 27, 2025

The Quartz Quantum Compiler

OpenQASM 83 21 Updated May 25, 2025

Dorylus: Affordable, Scalable, and Accurate GNN Training

C++ 76 13 Updated May 31, 2021

PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections

C++ 121 10 Updated Jun 23, 2022

Python 3 1 Updated Oct 23, 2022

The Foundation for All Legate Libraries

C++ 218 63 Updated Jun 27, 2025