Releases: vllm-project/vllm
v0.9.1
Highlights
This release features 274 commits from 123 contributors (27 new contributors!)
- Progress in large scale serving
  - DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked DeepGEMM kernel (#19111), CUTLASS MoE kernel with PPLX (#18762)
  - Heterogeneous TP (#18833), NixlConnector: enable FlashInfer backend (#19090)
  - DP: API-server scaleout with many-to-many server-engine comms (#17546), support DP with Ray (#18779), allow AsyncLLMEngine.generate to target a specific DP rank (#19102), add data parallel rank to KVEventBatch (#18925)
  - Tooling: Simplify EP kernels installation (#19412)
- RLHF workflow: Support in-place model weights loading (#18745)
- Initial full support for Hybrid Memory Allocator (#17996), support cross-layer KV sharing (#18212)
- Add FlexAttention to vLLM V1 (#16078)
- Various production hardening related to full CUDA graph mode (#19171, #19106, #19321)
Model Support
- Support Magistral (#19193), LoRA support for InternVL (#18842), MiniCPM EAGLE support (#18943), NemotronH support (#18863, #19249)
- Enable data parallel for Llama4 vision encoder (#18368)
- Add DeepSeek-R1-0528 function call chat template (#18874)
Hardware Support & Performance Optimizations
- Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (#19205), Qwen3-235B-A22B (#19315)
- Blackwell: Add Cutlass MLA backend (#17625), Tunings for SM100 FP8 CUTLASS kernel (#18778), Use FlashInfer by default on Blackwell GPUs (#19118), Tune `scaled_fp8_quant` by increasing vectorization (#18844)
- FP4: Add compressed-tensors NVFP4 support (#18312), FP4 MoE kernel optimization (#19110)
- CPU: V1 support for the CPU backend (#16441)
- ROCm: Add AITER grouped topk for DeepSeekV2 (#18825)
- POWER: Add IBM POWER11 Support to CPU Extension Detection (#19082)
- TPU: Initial support of model parallelism with single worker using SPMD (#18011), Multi-LoRA Optimizations for the V1 TPU backend (#15655)
- Neuron: Add multi-LoRA support for Neuron (#18284), Add Multi-Modal model support for Neuron (#18921), Support quantization on Neuron (#18283)
- Platform: Make torch distributed process group extendable (#18763)
Engine features
- Add LoRA support to beam search (#18346)
- Add rerank support to run_batch endpoint (#16278)
- CLI: add run batch (#18804)
- Server: custom logging (#18403), allowed_token_ids in ChatCompletionRequest (#19143)
- `LLM` API: make `use_tqdm` accept a callable for custom progress bars (#19357); see the sketch below
- Perf: [Kernel] sampler CUDA kernel for applying repetition penalty (#18437)
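A minimal sketch of the `use_tqdm` callable mentioned above, assuming the callable is simply used in place of `tqdm` and receives the same arguments (the model name is a placeholder):

```python
from functools import partial

from tqdm import tqdm
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Assumed convention: the callable is invoked like tqdm itself, so a
# partial with custom options can stand in for the default progress bar.
custom_bar = partial(tqdm, desc="generating", unit="req", colour="green")

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=16),
    use_tqdm=custom_bar,
)
```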
API Deprecations
- Disallow pos-args other than `model` when initializing `LLM` (#18802); see the sketch below
- Remove `inputs` arg fallback in Engine classes (#18799)
- Remove fallbacks for Embeddings API (#18795)
- Remove mean pooling default for `Qwen2EmbeddingModel` (#18913)
- Require overriding `get_dummy_text` and `get_dummy_mm_data` (#18796)
- Remove metrics that were deprecated in 0.8 (#18837)
Documentation
- Add CLI doc (#18871)
- Update SECURITY.md with link to our security guide (#18961), Add security warning to bug report template (#19365)
What's Changed
- [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in #18282
- [Neuron] Support quantization on neuron by @aws-satyajith in #18283
- Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in #18566
- [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in #18771
- [Build] Fixes for CMake install by @ProExpertProg in #18570
- [Core] Improve Tensor serialisation by @lgeiger in #18774
- [rocm] Fix wrong attention log by @fxmarty-amd in #18764
- [Bugfix] Fix nomic max_model_len by @noooop in #18755
- [Bugfix]: correctly propagate errors message caught at the chat_templating step to the client by @gcalmettes in #18769
- [V1] fix torch profiling for V1 offline scenarios by @divakar-amd in #18445
- [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) by @RonaldBXu in #18781
- [Bugfix][FailingTest]Fix test_model_load_with_params.py by @rabi in #18758
- [Deprecation] Require overriding `get_dummy_text` and `get_dummy_mm_data` by @DarkLight1337 in #18796
- [Deprecation] Remove unused sync methods in `async_timeout` by @DarkLight1337 in #18792
- [Deprecation] Remove fallbacks for Embeddings API by @DarkLight1337 in #18795
- [CI] improve embed testing by @noooop in #18747
- Fix PiecewiseCompileInterpreter by @zou3519 in #17338
- [BugFix] FA2 MLA Accuracy Issue by @LucasWilkinson in #18807
- [Platform][Dist] Make torch distributed process group extendable by @MengqingCao in #18763
- Enable Pydantic mypy checks and convert configs to Pydantic dataclasses by @hmellor in #17599
- [Frontend] add run batch to CLI by @reidliu41 in #18804
- decrement server_load on listen for disconnect by @daniel-salib in #18784
- [Core] Add Lora Support to Beam Search by @alex-jw-brooks in #18346
- [Chore] update ty configuration by @aarnphm in #18839
- [Misc] fix olmoe model layer for TP > 1 by @lengrongfu in #18828
- [V1][Metrics] Remove metrics that were deprecated in 0.8 by @markmc in #18837
- [Chore][Spec Decode] Update check NoneType instead of assigning variables by @aarnphm in #18836
- [Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend by @Akshat-Tripathi in #15655
- Remove checks for `None` for fields which should never be `None` by @hmellor in #17985
- [Core] Enable CUDA graphs for DP + All2All kernels by @varun-sundar-rabindranath in #18724
- [Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix by @hongxiayang in #18100
- Prevent the cross-encoder logic from being applied to classification tasks by @maxdebayser in #18838
- Add ability to use CUDAGraphs with use_inductor=False by @zou3519 in #17345
- [Bugfix][TPU] fix moe custom kernel import by @yaochengji in #18853
- [Doc][Neuron] Update documentation for Neuron by @elaineyz in #18868
- Skip device and quant Pydantic validation to make plugin device work by @Yikun in #18843
- Fixes a dead link in nightly benchmark readme by @nerdalert in #18856
- [Neuron] Add multi-LoRA support for Neuron. by @aws-satyajith in #18284
- [LoRA] Add LoRA support for InternVL by @jeejeelee in #18842
- [Doc] Remove redundant spaces from compatibility_matrix.md by @windsonsea in #18891
- [doc] add CLI doc by @reidliu41 in #18871
- [Bugfix] Fix misleading information in the documentation by @jeejeelee in #18845
- [Misc] Replace TODO in serving transcription by @NickLucche in #18895
- [Bugfix] Ensure tensors are contiguous during serialisation by @lgeiger in #18860
- [BugFix] Update pydantic to fix error on python 3.10 by @ProExpertProg in #18852
- Fix an error in dummy weight loading for quantization models by @Chenyaaang in #18855
- [Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. by @Duyi-Wang in #18692
- [Doc] Fix codeblocks formatting in LoRA adapters documentation by @Zerohertz in #18907
- [Bugfix] Fix the failing gte embedding test by @Isotr0py in #18720
- [Attention][V1] Toggle for v1 attention backend by @gshtras in #18275
- [ROCm][V0][Attention] Revert to the previous FA triton kernel by @gshtras in #18226
- [Deprecation] Disallow pos-args other than `model` when initializing `LLM` by @DarkLight1337 in #18802
- [Misc] Remove duplicate init for self.vllm_config by @googs1025 in #18896
- [V1] Allocate kv_cache with stride order for V1 by @NickLucche in #18775
- [BugFix] Make DP work with connector-delayed new requests by @njhill in #18559
- [P/D] NixlConnector DP fixes by @wseaton ...
v0.9.1rc1
What's Changed
- [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in #18282
- [Neuron] Support quantization on neuron by @aws-satyajith in #18283
- Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in #18566
- [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in #18771
- [Build] Fixes for CMake install by @ProExpertProg in #18570
- [Core] Improve Tensor serialisation by @lgeiger in #18774
- [rocm] Fix wrong attention log by @fxmarty-amd in #18764
- [Bugfix] Fix nomic max_model_len by @noooop in #18755
- [Bugfix]: correctly propagate errors message caught at the chat_templating step to the client by @gcalmettes in #18769
- [V1] fix torch profiling for V1 offline scenarios by @divakar-amd in #18445
- [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) by @RonaldBXu in #18781
- [Bugfix][FailingTest]Fix test_model_load_with_params.py by @rabi in #18758
- [Deprecation] Require overriding `get_dummy_text` and `get_dummy_mm_data` by @DarkLight1337 in #18796
- [Deprecation] Remove unused sync methods in `async_timeout` by @DarkLight1337 in #18792
- [Deprecation] Remove fallbacks for Embeddings API by @DarkLight1337 in #18795
- [CI] improve embed testing by @noooop in #18747
- Fix PiecewiseCompileInterpreter by @zou3519 in #17338
- [BugFix] FA2 MLA Accuracy Issue by @LucasWilkinson in #18807
- [Platform][Dist] Make torch distributed process group extendable by @MengqingCao in #18763
- Enable Pydantic mypy checks and convert configs to Pydantic dataclasses by @hmellor in #17599
- [Frontend] add run batch to CLI by @reidliu41 in #18804
- decrement server_load on listen for disconnect by @daniel-salib in #18784
- [Core] Add Lora Support to Beam Search by @alex-jw-brooks in #18346
- [Chore] update ty configuration by @aarnphm in #18839
- [Misc] fix olmoe model layer for TP > 1 by @lengrongfu in #18828
- [V1][Metrics] Remove metrics that were deprecated in 0.8 by @markmc in #18837
- [Chore][Spec Decode] Update check NoneType instead of assigning variables by @aarnphm in #18836
- [Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend by @Akshat-Tripathi in #15655
- Remove checks for `None` for fields which should never be `None` by @hmellor in #17985
- [Core] Enable CUDA graphs for DP + All2All kernels by @varun-sundar-rabindranath in #18724
- [Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix by @hongxiayang in #18100
- Prevent the cross-encoder logic from being applied to classification tasks by @maxdebayser in #18838
- Add ability to use CUDAGraphs with use_inductor=False by @zou3519 in #17345
- [Bugfix][TPU] fix moe custom kernel import by @yaochengji in #18853
- [Doc][Neuron] Update documentation for Neuron by @elaineyz in #18868
- Skip device and quant Pydantic validation to make plugin device work by @Yikun in #18843
- Fixes a dead link in nightly benchmark readme by @nerdalert in #18856
- [Neuron] Add multi-LoRA support for Neuron. by @aws-satyajith in #18284
- [LoRA] Add LoRA support for InternVL by @jeejeelee in #18842
- [Doc] Remove redundant spaces from compatibility_matrix.md by @windsonsea in #18891
- [doc] add CLI doc by @reidliu41 in #18871
- [Bugfix] Fix misleading information in the documentation by @jeejeelee in #18845
- [Misc] Replace TODO in serving transcription by @NickLucche in #18895
- [Bugfix] Ensure tensors are contiguous during serialisation by @lgeiger in #18860
- [BugFix] Update pydantic to fix error on python 3.10 by @ProExpertProg in #18852
- Fix an error in dummy weight loading for quantization models by @Chenyaaang in #18855
- [Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. by @Duyi-Wang in #18692
- [Doc] Fix codeblocks formatting in LoRA adapters documentation by @Zerohertz in #18907
- [Bugfix] Fix the failing gte embedding test by @Isotr0py in #18720
- [Attention][V1] Toggle for v1 attention backend by @gshtras in #18275
- [ROCm][V0][Attention] Revert to the previous FA triton kernel by @gshtras in #18226
- [Deprecation] Disallow pos-args other than `model` when initializing `LLM` by @DarkLight1337 in #18802
- [Misc] Remove duplicate init for self.vllm_config by @googs1025 in #18896
- [V1] Allocate kv_cache with stride order for V1 by @NickLucche in #18775
- [BugFix] Make DP work with connector-delayed new requests by @njhill in #18559
- [P/D] NixlConnector DP fixes by @wseaton in #18903
- Use standalone_compile by default in torch >= 2.8.0 by @zou3519 in #18846
- [TPU] remove transpose ops in moe kernel by @yaochengji in #18923
- [Bugfix] Fix PP default fallback behavior for V1 by @mgoin in #18915
- [Misc] Update type annotation for rotary embedding `base` by @DarkLight1337 in #18914
- [TPU][CI/CD] Clean up docker for TPU tests. by @CAROLZXYZXY in #18926
- improve the robustness of parsing vlms config in AutoRound by @wenhuach21 in #18894
- [Bugfix] Consistent ascii handling in tool parsers by @chaunceyjiang in #18883
- [Model] Use AutoWeightsLoader for mamba2 by @jinyouzhi in #18918
- [docs] fix: fix markdown syntax by @eric-haibin-lin in #18927
- [ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend. by @vllmellm in #18938
- [Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy by @mgoin in #18861
- [Deprecation] Remove mean pooling default for `Qwen2EmbeddingModel` by @DarkLight1337 in #18913
- [Misc]Fix benchmarks/README.md for speculative decoding by @rabi in #18897
- [doc] add mkdocs doc by @reidliu41 in #18930
- [Model] Use in-place adds in SigLIP by @lgeiger in #18922
- [Bugfix][Failing Test] Fix test_vllm_port.py by @rabi in #18618
- [Misc]Fix typo by @Always-Naive in #18947
- [Bugfix][TPU] Fix tpu model runner testcase failure by @CAROLZXYZXY in #18810
- [CI/Build] remove regex from build dependencies by @dtrifiro in #18945
- [Feature] minicpm eagle support by @huangyuxiang03 in #18943
- [doc] show the count for fork and watch by @reidliu41 in #18950
- [Docs] Update SECURITY.md with link to our security guide by @russellb in #18961
- Improve "failed to get the hash of the compiled graph" error by @zou3519 in #18956
- [Perf] API-server scaleout with many-to-many server-engine comms by @njhill in #17546
- Benchmark script for fp8 vs bf16 gemm by @mgoin in #17126
- [VLM] Add PP support and fix GPTQ inference for Ovis models by @Isotr0py in #18958
- [Misc] add group_size is -1 in awq quantization by @lengrongfu in #18910
- Tool parser regex timeout handling by @wseaton in https://github.com/vl...
v0.9.0.1
This patch release contains an important bugfix for the DeepSeek family of models on NVIDIA Ampere and below (#18807).
Full Changelog: v0.9.0...v0.9.0.1
v0.9.0
Highlights
This release features 649 commits, from 215 contributors (82 new contributors!)
- vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change for environment dependencies.
  - The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute a CUDA 12.6 wheel as a GitHub artifact.
  - As a general rule of thumb, our CUDA version policy follows PyTorch's CUDA version policy.
- Enhanced NVIDIA Blackwell support. vLLM now ships with an initial set of optimized kernels on NVIDIA Blackwell for both attention and MLP.
  - You can use our docker image or install the FlashInfer nightly wheel (`pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl`), then set `VLLM_ATTENTION_BACKEND=FLASHINFER` for better performance; see the sketch after this list.
  - Upgraded support for the new FlashInfer main branch (#15777). Please check out #18153 for the full roadmap.
- Initial DP, EP, PD support for large scale inference
- EP:
- DP:
- Decouple engine process management and comms (#15977)
- PD:
- Migrate docs from Sphinx to MkDocs (#18145, #18610, #18614, #18616, #18622, #18626, #18627, #18635, #18637, #18657, #18663, #18666, #18713)
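For the Blackwell note above, a minimal sketch of opting into FlashInfer from Python instead of the shell; the model name is a placeholder and the FlashInfer wheel must already be installed:

```python
import os

# Equivalent to `export VLLM_ATTENTION_BACKEND=FLASHINFER`; set it before
# vLLM selects an attention backend.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
print(llm.generate(["Hello"])[0].outputs[0].text)
```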
Notable Changes
- Removal of CUDA 12.4 support due to PyTorch upgrade to 2.7.
- Change `top_k` to be disabled with `0` (still accepts `-1` for now) (#17773); see the sketch after this list
- The seed is now set to `0` by default for the V1 engine, meaning that different vLLM runs now yield the same outputs even if `temperature > 0`. This does not modify the random state in user code, since workers are run in separate processes unless `VLLM_USE_V1_MULTIPROCESSING=0`. (#17929, #18741)
Model Enhancements
- Support MiMo-7B (#17433), MiniMax-VL-01 (#16328), Ovis 1.6 (#17861), Ovis 2 (#15826), GraniteMoeHybrid 4.0 (#17497), FalconH1* (#18406), LlamaGuard4 (#17315)
  - Please install the development version of `transformers` (from source) to use Falcon-H1.
- Embedding models: nomic-embed-text-v2-moe (#17785), new class of gte models (#17986)
- Progress in Hybrid Memory Allocator (#17394, #17479, #17474, #17483, #17193, #17946, #17945, #17999, #18001, #18593)
- DeepSeek: perf enhancement by moving more calls into the CUDA-graph region (#17484, #17668), Function Call (#17784), MTP in V1 (#18435)
- Qwen2.5-1M: Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support (#11844)
- Qwen2.5-VL speed enhancement via rotary_emb optimization (#17973)
- InternVL models with Qwen2.5 backbone now support video inputs (#18499)
Performance, Production and Scaling
- Support full cuda graph in v1 (#16072)
- Pipeline Parallelism: MultiprocExecutor support (#14219), `torchrun` (#17827)
- Support sequence parallelism combined with pipeline parallelism (#18243)
- Async tensor parallelism using compilation pass (#17882)
- Perf: Use small max_num_batched_tokens for A100 (#17885)
- Fast Model Loading: Tensorizer support for V1 and LoRA (#17926)
- Multi-modality: Automatically cast multi-modal input dtype before transferring device (#18756)
Security
- Prevent side-channel attacks via cache salting (#17045)
- Fix image hash collision in certain edge cases (#17378)
- Add `VLLM_ALLOW_INSECURE_SERIALIZATION` env var (#17490)
- Migrate to REGEX library to prevent catastrophic backtracking (#18454, #18750)
Features
- CLI: `deprecated=True` (#17426)
- Frontend: progress bar for adding requests (#17525), `chat_template_kwargs` in `LLM.chat` (#17356, see the sketch after this list), `/classify` endpoint (#17032), truncation control for embedding models (#14776), `cached_tokens` in response usage (#18149)
- LoRA: default local directory LoRA resolver plugin (#16855)
- Metrics: KV event publishing (#16750), API for accessing in-memory Prometheus metrics (#17010)
- Quantization: `nvidia/DeepSeek-R1-FP4` (#16362), Quark MXFP4 format (#16943), AutoRound (#17850), torchao models with `AOPerModuleConfig` (#17826), CUDA graph support for V1 GGUF (#18646)
- Reasoning: deprecate `--enable-reasoning` (#17452)
- Spec Decode: EAGLE share input embedding (#17326), torch.compile & cudagraph for EAGLE (#17211), EAGLE3 (#17504), log accumulated metrics (#17913), Medusa (#17956)
- Structured Outputs: thinking compatibility (#16577), spec decoding (#14702), Qwen3 reasoning parser (#17466), `tool_choice: required` for xgrammar (#17845), Structural Tag with Guidance backend (#17333)
- Transformers backend: named parameters (#16868), interleaved sliding window attention (#18494)
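For the `chat_template_kwargs` feature above (#17356), a minimal sketch; the model name is a placeholder and `enable_thinking` is only an example key that some chat templates (e.g. Qwen3-style) understand:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # placeholder model

messages = [{"role": "user", "content": "Summarize what a KV cache is."}]

# Extra keys are forwarded to the tokenizer's chat template.
outputs = llm.chat(
    messages,
    SamplingParams(max_tokens=64),
    chat_template_kwargs={"enable_thinking": False},
)
print(outputs[0].outputs[0].text)
```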
Hardware
- NVIDIA: cutlass support for blackwell fp8 blockwise gemm (#14383)
- TPU: Multi-LoRA implementation(#14238), default max-num-batched-tokens (#17508), V1 backend by default (#17673), top-logprobs (#17072)
- Neuron: NeuronxDistributedInference support (#15970), Speculative Decoding, Dynamic on-device sampling (#16357), Mistral Model (#18222), Multi-LoRA (#18284)
- AMD: Enable FP8 KV cache on V1 (#17870), Tuned fused moe config for Qwen3 MoE on MI300X (#17535, #17530), AITER biased group topk (#17955), Block-Scaled GEMM (#14968), MLA (#17523), Radeon GPU use Custom Paged Attention (#17004), reduce the number of environment variables in command line (#17229)
- Extensibility: Make PiecewiseBackend pluggable and extendable (#18076)
Documentation
- Update quickstart and install for cu128 using `--torch-backend=auto` (#18505)
- NVIDIA TensorRT Model Optimizer (#17561)
- Usage of Qwen3 thinking (#18291)
Developer Facing
- Benchmark: Add single turn MTBench to Serving Bench (#17202)
- Usability: Decrease import time of `vllm.multimodal` (#18031)
- Code Format: code formatting using `ruff format` (#17656, #18068, #18400)
- Readability:
- Process:
  - Propose a deprecation policy for the project (#17063)
- Testing: expanding torch nightly tests (#18004)
What's Changed
- Support loading transformers models with named parameters by @wuisawesome in #16868
- Add tuned triton fused_moe configs for Qwen3Moe by @mgoin in #17328
- [Benchmark] Add single turn MTBench to Serving Bench by @ekagra-ranjan in #17202
- [Optim] Compute multimodal hash only once per item by @DarkLight1337 in #17314
- implement Structural Tag with Guidance backend by @mmoskal in #17333
- [V1][Spec Decode] Make Eagle model arch config driven by @ekagra-ranjan in #17323
- [model] make llama4 compatible with pure dense layers by @luccafong in #17315
- [Bugfix] Fix `numel()` downcast in fused_layernorm_dynamic_per_token_quant.cu by @r-barnes in #17316
- Ignore `'<string>'` filepath by @zou3519 in #17330
- [Bugfix] Add contiguous call inside rope kernel wrapper by @timzsu in #17091
- [Misc] Add a Jinja template to support Mistral3 function calling by @chaunceyjiang in #17195
- [Model] support MiniMax-VL-01 model by @qscqesze in #16328
- [Misc] Move config fields to MultiModalConfig by @DarkLight1337 in #17343
- [Misc]Use a platform independent interface to obtain the device attributes by @ponix-j in #17100
- [Fix] Documentation spacing in compilation config help text by @Zerohertz in #17342
- [Build][Bugfix] Restrict setuptools version to <80 by @gshtras in #17320
- [Model] Ignore rotary embed load for Cohere model by @ekagra-ranjan in #17319
- Update docs requirements by @hmellor in #17379
- [Doc] Fix QWen3MOE info by @jeejeelee in #17381
- [Bugfix] Clean up MiniMax-VL and fix processing by @DarkLight1337 in #17354
- `pre-commit autoupdate` by @hmellor in #17380
- [Frontend] Support `chat_template_kwargs` in `LLM.chat` by @DarkLight1337 in #17356
- Transformers backend tweaks by @hmellor in #17365
- Fix: Spelling of inference by @a2q1p in #17387
- Improve literal dataclass field conversion to argparse argument by @hmellor in #17391
- [V1] Remove num_input_tokens from attn_metadata by @heheda12345 in #17193
- [Bugfix] add qwen3 reasoning-parser fix content is None when disable … by @mofanke in #17369
- fix gemma3 results all zero by @mayuyuace in #17364
- [Misc][ROCm] Exclude `cutlass_mla_decode` for ROCm build by @tywuAMD in #17289
- Enabling multi-group kernel tests. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-p...
v0.8.5.post1
This post-release contains two bug fixes, for a memory leak and for model accuracy:
- Fix memory leak in `_cached_reqs_data` (#17567)
- Fix sliding window attention in V1 giving incorrect results (#17574)
Full Changelog: v0.8.5...v0.8.5.post1
v0.8.5
This release contains 310 commits from 143 contributors (55 new contributors!).
Highlights
This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structure tag feature for tool calling.
Model Support
- Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328).
- Add ModernBERT (#16648)
- Add Granite Speech Support (#16246)
- Add PLaMo2 (#14323)
- Add Kimi-VL model support (#16387)
- Add Qwen2.5-Omni model support (thinker only) (#15130)
- Snowflake Arctic Embed (Family) (#16649)
- Accuracy fixes for Llama4 Int4 (#16801), chat template for Llama 4 models (#16428), enhanced AMD support (#16674, #16847)
V1 Engine
- Add `structural_tag` support using xgrammar (#17085)
- Disaggregated serving:
- Clean up: Remove Sampler from Model Code (#17084)
- MLA: Simplification to batch P/D reordering (#16673)
- Move usage stats to worker and start logging TPU hardware (#16211)
- Support FlashInfer Attention (#16684)
- Faster incremental detokenization (#15137)
- EAGLE-3 Support (#16937)
Features
- Validate urls object for multimodal content parts (#16990)
- Prototype support sequence parallelism using compilation pass (#16155)
- Add sampling params to `v1/audio/transcriptions` endpoint (#16591)
- Enable vLLM to Dynamically Load LoRA from a Remote Server (#10546)
- Add `vllm bench [latency, throughput]` CLI commands (#16508)
Performance
- Attention:
- MoE:
- Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#6036)
- Optimize rotary_emb implementation to use Triton operator for improved performance (#16457)
Hardware
- TPU:
- AMD:
- AITER Fused MOE V1 Support (#16752)
- Integrate Paged Attention Kernel from AITER (#15001)
- Support AITER MLA (#15893)
- Upstream prefix prefill speed up for vLLM V1 (#13305)
- Adding fp8 and variable length sequence support to Triton FAv2 kernel (#12591)
- Add skinny gemms for unquantized linear on ROCm (#15830)
- Follow-ups for Skinny Gemms on ROCm. (#17011)
Documentation
- Add open-webui example (#16747)
- Document Matryoshka Representation Learning support (#16770)
- Add a security guide (#17230)
- Add example to run DeepSeek with Ray Serve LLM (#17134)
- Benchmarks for audio models (#16505)
Security and Dependency Updates
- Don't bind tcp zmq socket to all interfaces (#17197)
- Use safe serialization and fix zmq setup for mooncake pipe (#17192)
- Bump Transformers to 4.51.3 (#17116)
Build and testing
- Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#16721)
Breaking changes 🚨
`--enable-chunked-prefill`, `--multi-step-stream-outputs`, `--disable-chunked-mm-input` can no longer explicitly be set to `False`. Instead, add `no-` to the start of the argument (i.e. `--enable-chunked-prefill` and `--no-enable-chunked-prefill`) (#16533)
What's Changed
- Improve configs - `SchedulerConfig` by @hmellor in #16533
- [Misc] remove warning if triton>=3.2.0 by @DefTruth in #16553
- [Misc] refactor examples by @reidliu41 in #16563
- [Misc] Update usage with mooncake lib for kv transfer by @ShangmingCai in #16523
- [fix]: Dockerfile.ppc64le fixes for opencv-python and hf-xet by @Shafi-Hussain in #16048
- [Bugfix] Multi-modal caches not acting like LRU caches by @DarkLight1337 in #16593
- [TPU][V1] Fix exponential padding when `max-num-batched-tokens` is not a power of 2 by @NickLucche in #16596
- Fix triton install condition on CPU by @hmellor in #16600
- s390x: Fix PyArrow build and add CPU test script for Buildkite CI by @Nash-123 in #16036
- [Model][VLM] Add Kimi-VL model support by @courage17340 in #16387
- [Hardware][TPU] Add torchvision to tpu dependency file by @lsy323 in #16616
- [DOC][TPU] Add core idea about avoiding recompilation after warmup by @yaochengji in #16614
- config check sleep mode support oot platforms by @celestialli in #16562
- [Core][Bugfix] Fix Offline MM Beam Search by @alex-jw-brooks in #16390
- [Kernel] moe wna16 marlin kernel by @jinzhen-lin in #14447
- [BugFix]: Update minimum `pyzmq` version by @taneem-ibrahim in #16549
- [Bugfix] Fix tests/kernels/test_mamba_ssm_ssd.py by @tlrmchlsmth in #16623
- [Bugfix] Fix broken GritLM model and tests (missing pooling_metadata) by @pooyadavoodi in #16631
- Add `vllm bench [latency, throughput]` CLI commands by @mgoin in #16508
- Fix vLLM x torch.compile config caching by @zou3519 in #16491
- [Misc] refactor argument parsing in examples by @reidliu41 in #16635
- [CI/Build] Fix LoRA OOM by @jeejeelee in #16624
- Add "/server_info" endpoint in api_server to retrieve the vllm_config. by @Cangxihui in #16572
- [Kernel] Remove redundant Exp calculations by @DefTruth in #16123
- [Misc] Update `compressed-tensors` WNA16 to support zero-points by @dsikka in #14211
- [Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server by @angkywilliam in #10546
- [Model] Add PLaMo2 by @Alnusjaponica in #14323
- [Bugfix] fix gpu docker image mis benchmarks dir by @lengrongfu in #16628
- [Misc] Modify LRUCache touch by @jeejeelee in #16689
- Disable remote caching when calling compile_fx by @zou3519 in #16611
- [Feature] add model aware kv ops helper by @billishyahao in #16020
- [ROCM] Bind triton version to 3.2 in requirements-built.txt by @SageMoore in #16664
- [V1][Structured Output] Move xgrammar related utils to `backend_xgrammar.py` by @shen-shanshan in #16578
- [CI] Cleanup `additional_dependencies: [toml]` for pre-commit yapf hook by @yankay in #16405
- [Misc] refactor examples series by @reidliu41 in #16708
- [Doc] Improve OOM troubleshooting by @DarkLight1337 in #16704
- [Bugfix][Kernel] fix potential cuda graph broken for merge_attn_states kernel by @DefTruth in #16693
- [Model] support modernbert by @xsank in #16648
- [Hardware] Add processor inputs to platform validation by @joerunde in #16680
- Improve error for structured output backend selection by @hmellor in #16717
- [Misc] Remove redundant comment by @jianzs in #16703
- Help user create custom model for Transformers backend remote code models by @hmellor in #16719
- [V1][Performance] Implement custom serializaton for MultiModalKwargs [Rebased] by @p88h in #16432
- [V1][Spec Dec Bug Fix] Respect Spec Dec Method Specification by @luyuzhe111 in #16636
- Adding vllm buildkite job for IBM Power by @AaruniAggarwal in #16679
- [V1][Frontend] Improve Shutdown And Logs by @robertgshaw2-redhat in #11737
- [rocm][V0] fix selection logic for custom PA in V0 by @divakar-amd in #16426
- [Bugfix] Update Florence-2 tokenizer to make grounding tasks work by @Isotr0py in #16734
- [Bugfix] Revert max_prompt_len validation for decoder-only models. by @davidheineman in #16741
- [V1] Remove log noise when idle by @russellb in #16735
- [Ray] Improve documentation on batch inference by @richardliaw in #16609
- [misc] ignore marlin_moe_wna16 local gen codes by @DefTruth in #16760
- [Doc] Add more tips to avoid OOM by @DarkLight1337 in #16765
- [doc] add open-webui example by @reidliu41 in #16747...
v0.8.4
This release contains 180 commits from 84 contributors (25 new contributors!).
Highlights
This release includes important accuracy fixes for Llama4 models; if you are using them, we highly recommend you update.
Model
- Llama4 (#16113,#16509) bug fix and enhancements:
- qknorm should not be shared across heads (#16311)
- Enable attention temperature tuning by default for long context (>32k) (#16439)
- Index Error When Single Request Near Max Context (#16209)
- Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 (#16488)
- Update to transformers==4.51.1 (#16257)
- Added chat templates for LLaMa4 pythonic tool calling (#16463)
- Optimized topk for topk=1 (#16512)
- Add warning for Attention backends that do not support irope yet (#16212)
- Support Qwen3 and Qwen3MoE (#15289), smolvlm (#16017), jinaai/jina-embeddings-v3 (#16120), InternVL3 (#16495), GLM-4-0414 (#16338)
API
- Estimate max-model-len using available KV cache memory. The error message now hints at how to set `--max-model-len` (#16168)
- Add hf_token to EngineArgs (#16093)
- Enable regex support with xgrammar in V0 engine (#13228)
- Support matryoshka representation / support embedding API dimensions (#16331); see the sketch after this list
- Add bucket for `request_latency`, `time_to_first_token` and `time_per_output_token` (#15202)
- Support for TorchAO quantization (#14231)
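For the Matryoshka / embedding `dimensions` support above (#16331), a minimal sketch against a running `vllm serve` instance using the standard OpenAI client; the model name is a placeholder and must actually support Matryoshka embeddings:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# `dimensions` asks the server for a truncated (Matryoshka) embedding.
resp = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",  # placeholder model
    input="vLLM release notes",
    dimensions=256,
)
print(len(resp.data[0].embedding))  # expected: 256
```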
Hardware
- Intel-Gaudi: Multi-step scheduling implementation for HPU (#12779)
- TPU:
Performance
- DeepSeek MLA: a new merge_attn_states CUDA kernel, 3x speedup (#16173)
- MoE: Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (#16366)
- Add support to modelopt quantization of Mixtral model (#15961)
- Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) (#16537)
V1 Engine Core
- Enable multi-input by default (#15799)
- Scatter and gather placeholders in the model runner (#16076)
- Set structured output backend to `auto` by default (#15724)
- Zero-copy tensor/ndarray serialization/transmission (#13790)
- Eagle Model loading (#16035)
- KV cache slots for eagle heads (#16370)
- Add `supports_structured_output()` method to Platform (#16148)
Developer Facing
- Add sampling parameters to benchmark_serving. (#16022)
- AutoWeightsLoader refactoring (#16383, #16325, #16088, #16203, #16103)
- Unified configuration with engine args: `LoadConfig` (#16422), `ParallelConfig` (#16332)
What's Changed
- [Misc] Auto detect bitsandbytes pre-quantized models by @tristanleclercq in #16027
- [CI] Fix benchmark script level by @khluu in #16089
- fix: support clang17 for macos and fix the real libomp by @yihong0618 in #16086
- [doc] fix 404 by @reidliu41 in #16082
- Revert "doc: add info for macos clang errors (#16049)" by @yihong0618 in #16091
- Fix some capitalisations in generated examples doc titles by @hmellor in #16094
- [Misc] format output for encoder_decoder.py by @reidliu41 in #16095
- [Misc] Remove redundant code by @chaunceyjiang in #16098
- [Bugfix] fix use_atomic_add support of marlin kernel when using v1 engine by @jinzhen-lin in #15946
- [Model] use AutoWeightsLoader for phi, gemma, deepseek by @jonghyunchoe in #16088
- [Model] fix model testing for TeleChat2ForCausalLM and V0 llama4 by @luccafong in #16112
- [Benchmark] Add sampling parameters to benchmark_serving. by @hyeygit in #16022
- [Frontend] Fix typo in tool chat templates for llama3.2 and toolace by @bjj in #14501
- [CI][V1] Fix passing `tokenizer` as kwarg to `validate_guidance_grammar` by @ywang96 in #16117
- [Misc] refactor example eagle by @reidliu41 in #16100
- [Doc][Bugfix] Add missing EOF in k8s deploy doc by @psschwei in #16025
- [Misc] Improve model redirect to accept json dictionary by @Isotr0py in #16119
- [Model] use AutoWeightsLoader for stablelm,starcoder2,zamba2 by @lengrongfu in #16103
- [Bugfix] LoRA : Fix the order in which the kernels process LoRAs by @varun-sundar-rabindranath in #16040
- [Bugfix] add hf_token to EngineArgs by @paolovic in #16093
- [Misc] update requires-python in pyproject.toml by @reidliu41 in #16116
- [TPU] Update PyTorch/XLA by @yaochengji in #16130
- [V1][Minor] Optimize get_cached_block by @WoosukKwon in #16135
- Fix requires-python by @martinhoyer in #16132
- [Metrics] Add bucket for `request_latency`, `time_to_first_token` and `time_per_output_token` by @yankay in #15202
- [V1][Minor] Minor simplification for get_computed_blocks by @WoosukKwon in #16139
- [Misc] Update Mistral-3.1 example by @DarkLight1337 in #16147
- [Bugfix] Make dummy encoder prompt padding alternative and add missing warnings by @Isotr0py in #16129
- [CI] Set max transformers version for Ultravox model test by @ywang96 in #16149
- doc: fix some typos in doc by @yihong0618 in #16154
- [VLM] Florence-2 supports online serving by @Isotr0py in #16164
- [V1][Structured Output] Add `supports_structured_output()` method to Platform by @shen-shanshan in #16148
- [Model] Add Qwen3 and Qwen3MoE by @YamPengLi in #15289
- [Misc] improve example mlpspeculator and llm_engine_example by @reidliu41 in #16175
- [Doc]Update image to latest version by @WangErXiao in #16186
- Upstream Llama4 Support to Main by @houseroad in #16113
- [Bugfix] Re-enable support for `ChatGLMForConditionalGeneration` by @DarkLight1337 in #16187
- [V1] Revert the default `max_num_seqs` to V0 values for most hardware by @DarkLight1337 in #16158
- [Misc] Print encoder seq len to short warning only once by @gshtras in #16193
- [Misc] Human-readable `max-model-len` cli arg by @NickLucche in #16181
- [Misc] Move Llama 4 projector call into encoder execution by @ywang96 in #16201
- [Bugfix] Fix guidance backend for Qwen models by @benchislett in #16210
- [V1][BugFix] Exit properly if engine core fails during startup by @njhill in #16137
- [Misc] add description attribute in CLI by @reidliu41 in #15921
- [Bugfix][V0] XGrammar structured output supports Enum by @leon-seidel in #15878
- Torchao by @drisspg in #14231
- [ROCm][Bugfix][FP8] Make fp8 quant respect fused modules mapping by @mgoin in #16031
- [core] do not send error across process by @youkaichao in #16174
- [Misc] Update compressed-tensors to version 0.9.3 by @mlsw in #16196
- Update BASE_IMAGE to 2.22 release of Neuron by @aws-satyajith in #16218
- [V1] Scatter and gather placeholders in the model runner by @ywang96 in #16076
- [Bugfix] fix use-ep bug to enable ep by dp/tp size > 1 by @zxfan-cpu in #16161
- Add warning for Attention backends that do not support irope yet by @sarckk in #16212
- [Bugfix] Do not skip "empty" parts of chats that are parsable by @mgoin in #16219
- [Bugfix] Fix and reorganize broken GGUF tests and bump gguf version by @Isotr0py in #16194
- [torch.compile][TPU] Make @support_torch_compile work for XLA backend by @lsy323 in #15782
- [V1] Add `disable_chunked_mm_input` arg to disable partial mm input prefill by @mgoin in #15837
- [Misc] Merge the logs of pp layers partitions by @kebe7jun in #16225
- [Docs] Add Slides from Singapore Meetup by @simon-mo in #16213
- [Misc] format and refactor some examples by @reidliu41 in #16252
- [Misc] Add warning for multimodal data in LLM.beam_search by @alex-jw-brooks in #16241
- [Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe b...
v0.8.3
Highlights
This release features 260 commits from 109 contributors (38 new contributors!).
- We are excited to announce Day 0 Support for Llama 4 Scout and Maverick (#16104). Please see our blog for a detailed user guide.
- Please note that Llama4 is supported only in the V1 engine for now.
- V1 engine now supports native sliding window attention (#14097) with the hybrid memory allocator.
Cluster Scale Serving
- Single node data parallel with API server support (#13923)
- Multi-node offline DP+EP example (#15484)
- Expert parallelism enhancements
- Support XpYd disaggregated prefill with MooncakeStore (#12957)
Model Support
- Llama 4 (#16104), Aya Vision (#15441), MiniMaxText01(#13454), Skywork-R1V (#15397), jina-reranker-v2 (#15876)
- Add Reasoning Parser for Granite Models (#14202)
- Add Phi-4-mini function calling support (#14886)
V1 Engine
- Collective RPC (#15444)
- Faster top-k only implementation (#15478)
- BitsAndBytes support (#15611)
- Speculative Decoding: metrics (#15151), Eagle Proposer (#15729), n-gram interface update (#15750), EAGLE Architecture with Proper RMS Norms (#14990)
Features
API
- Support Enum for xgrammar based structured output in V1. (#15594, #15757)
- A new `tags` parameter for `wake_up` (#15500); see the sketch after this list
- V1 LoRA support CPU offload (#15843)
- Prefix caching support: FIPS enabled machines with MD5 hashing (#15299), SHA256 as alternative hashing algorithm (#15297)
- Addition of http service metrics (#15657)
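For the `wake_up` `tags` parameter above (#15500), a minimal sketch of selectively waking components after sleep; the tag names `"weights"` and `"kv_cache"` are assumptions about what the API accepts, and the model name is a placeholder:

```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)  # placeholder model

# Release memory, then restore it in two stages.
llm.sleep(level=2)

# Assumed tag names: bring the weights back first (e.g. to update them in
# an RLHF loop), then re-allocate the KV cache before serving again.
llm.wake_up(tags=["weights"])
# ... load / update weights here ...
llm.wake_up(tags=["kv_cache"])
```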
Performance
- LoRA Scheduler optimization bridging V1 and V0 performance (#15422).
Hardware
- AMD:
- CPU:
- CPU MLA (#14744)
- TPU
Doc, Build, Ecosystem
- V1 user guide update: fp8 kv cache support (#15585), multi-modality (#15460)
- Recommend developing with Python 3.12 in developer guide (#15811)
- Clean up: move dockerfiles into their own directory (#14549)
- Add minimum version for `huggingface_hub` to enable Xet downloads (#15873)
- TPU CI: Add basic perf regression test (#15414)
What's Changed
- Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin/awq_marlin_repack.cu +10 by @houseroad in #15160
- [Hardware][TPU][Bugfix] Fix v1 mp profiler by @lsy323 in #15409
- [Kernel][CPU] CPU MLA by @gau-nernst in #14744
- Dockerfile.ppc64le changes to move to UBI by @Shafi-Hussain in #15402
- [Misc] Clean up MiniCPM-V/O code by @DarkLight1337 in #15337
- [Misc] Remove redundant `num_embeds` by @DarkLight1337 in #15443
- [Doc] Update V1 user guide for multi-modality by @DarkLight1337 in #15460
- [Kernel] Fix conflicting macro names for gguf kernels by @SzymonOzog in #15456
- [bugfix] fix inductor cache on max_position_embeddings by @youkaichao in #15436
- [CI/Build] Add tests for the V1 tpu_model_runner. by @yarongmu-google in #14843
- [Bugfix] Support triton==3.3.0+git95326d9f for RTX 5090 (Unsloth + vLLM compatibility) by @oteroantoniogom in #15471
- [bugfix] add supports_v1 platform interface by @joerunde in #15417
- Add workaround for shared field_names in pydantic model class by @maxdebayser in #13925
- [TPU][V1] Fix Sampler recompilation by @NickLucche in #15309
- [V1][Minor] Use `SchedulerInterface` type for engine scheduler field by @njhill in #15499
- [V1] Support long_prefill_token_threshold in v1 scheduler by @houseroad in #15419
- [core] add bucket padding to tpu_model_runner by @Chenyaaang in #14995
- [Core] LoRA: V1 Scheduler optimization by @varun-sundar-rabindranath in #15422
- [CI/Build] LoRA: Delete long context tests by @varun-sundar-rabindranath in #15503
- Transformers backend already supports V1 by @hmellor in #15463
- [Model] Support multi-image for Molmo by @DarkLight1337 in #15438
- [Misc] Warn about v0 in benchmark_paged_attn.py by @tlrmchlsmth in #15495
- [BugFix] Fix nightly MLA failure (FA2 + MLA chunked prefill, i.e. V1, producing bad results) by @LucasWilkinson in #15492
- [misc] LoRA - Skip LoRA kernels when not required by @varun-sundar-rabindranath in #15152
- Fix raw_request extraction in load_aware_call decorator by @daniel-salib in #15382
- [Feature] Enhance EAGLE Architecture with Proper RMS Norms by @luyuzhe111 in #14990
- [FEAT][ROCm] Integrate Fused MoE Kernels from AITER by @vllmellm in #14967
- [Misc] Enhance warning information to user-defined chat template by @wwl2755 in #15408
- [Misc] improve example script output by @reidliu41 in #15528
- Separate base model from `TransformersModel` by @hmellor in #15467
- Apply torchfix by @cyyever in #15532
- Improve validation of TP in Transformers backend by @hmellor in #15540
- [Model] Add Reasoning Parser for Granite Models by @alex-jw-brooks in #14202
- multi-node offline DP+EP example by @youkaichao in #15484
- Fix weight loading for some models in Transformers backend by @hmellor in #15544
- [Refactor] Remove passthrough `backend` when generate grammar by @aarnphm in #15317
- [V1][Sampler] Faster top-k only implementation by @njhill in #15478
- Support SHA256 as hash function in prefix caching by @dr75 in #15297
- Applying some fixes for K8s agents in CI by @Alexei-V-Ivanov-AMD in #15493
- [V1] TPU - Revert to exponential padding by default by @alexm-redhat in #15565
- [V1] TPU CI - Fix test_compilation.py by @alexm-redhat in #15570
- Use Cache Hinting for fused_moe kernel by @wrmedford in #15511
- [TPU] support disabling xla compilation cache by @yaochengji in #15567
- Support FIPS enabled machines with MD5 hashing by @MattTheCuber in #15299
- [Kernel] CUTLASS grouped gemm fp8 MoE kernel by @ElizaWszola in #13972
- Add automatic tpu label to mergify.yml by @mgoin in #15560
- add platform check back by @Chenyaaang in #15578
- [misc] LoRA: Remove unused long context test data by @varun-sundar-rabindranath in #15558
- [Doc] Update V1 user guide for fp8 kv cache support by @wayzeng in #15585
- [moe][quant] add weight name case for offset by @MengqingCao in #15515
- [V1] Refactor num_computed_tokens logic by @comaniac in #15307
- Allow torchao quantization in SiglipMLP by @jerryzh168 in #15575
- [ROCm] Env variable to trigger custom PA by @gshtras in #15557
- [TPU] [V1] fix cases when max_num_reqs is set smaller than MIN_NUM_SEQS by @yaochengji in #15583
- [Misc] Restrict ray version dependency and update PP feature warning in V1 by @ruisearch42 in #15556
- [TPU] Avoid Triton Import by @robertgshaw2-redhat in #15589
- [Misc] Consolidate LRUCache implementations by @Avabowler in #15481
- [Quantization] Fp8 Channelwise Dynamic Per Token GroupedGEMM by @robertgshaw2-redhat in #15587
- [Misc] Clean up `scatter_patch_features` by @DarkLight1337 in #15559
- [Misc] Use model_redirect to redirect the model name to a local folder. by @noooop in https://github.com/vllm-proj...
v0.8.3rc1
What's Changed
- Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin/awq_marlin_repack.cu +10 by @houseroad in #15160
- [Hardware][TPU][Bugfix] Fix v1 mp profiler by @lsy323 in #15409
- [Kernel][CPU] CPU MLA by @gau-nernst in #14744
- Dockerfile.ppc64le changes to move to UBI by @Shafi-Hussain in #15402
- [Misc] Clean up MiniCPM-V/O code by @DarkLight1337 in #15337
- [Misc] Remove redundant `num_embeds` by @DarkLight1337 in #15443
- [Doc] Update V1 user guide for multi-modality by @DarkLight1337 in #15460
- [Kernel] Fix conflicting macro names for gguf kernels by @SzymonOzog in #15456
- [bugfix] fix inductor cache on max_position_embeddings by @youkaichao in #15436
- [CI/Build] Add tests for the V1 tpu_model_runner. by @yarongmu-google in #14843
- [Bugfix] Support triton==3.3.0+git95326d9f for RTX 5090 (Unsloth + vLLM compatibility) by @oteroantoniogom in #15471
- [bugfix] add supports_v1 platform interface by @joerunde in #15417
- Add workaround for shared field_names in pydantic model class by @maxdebayser in #13925
- [TPU][V1] Fix Sampler recompilation by @NickLucche in #15309
- [V1][Minor] Use `SchedulerInterface` type for engine scheduler field by @njhill in #15499
- [V1] Support long_prefill_token_threshold in v1 scheduler by @houseroad in #15419
- [core] add bucket padding to tpu_model_runner by @Chenyaaang in #14995
- [Core] LoRA: V1 Scheduler optimization by @varun-sundar-rabindranath in #15422
- [CI/Build] LoRA: Delete long context tests by @varun-sundar-rabindranath in #15503
- Transformers backend already supports V1 by @hmellor in #15463
- [Model] Support multi-image for Molmo by @DarkLight1337 in #15438
- [Misc] Warn about v0 in benchmark_paged_attn.py by @tlrmchlsmth in #15495
- [BugFix] Fix nightly MLA failure (FA2 + MLA chunked prefill, i.e. V1, producing bad results) by @LucasWilkinson in #15492
- [misc] LoRA - Skip LoRA kernels when not required by @varun-sundar-rabindranath in #15152
- Fix raw_request extraction in load_aware_call decorator by @daniel-salib in #15382
- [Feature] Enhance EAGLE Architecture with Proper RMS Norms by @luyuzhe111 in #14990
- [FEAT][ROCm] Integrate Fused MoE Kernels from AITER by @vllmellm in #14967
- [Misc] Enhance warning information to user-defined chat template by @wwl2755 in #15408
- [Misc] improve example script output by @reidliu41 in #15528
- Separate base model from `TransformersModel` by @hmellor in #15467
- Apply torchfix by @cyyever in #15532
- Improve validation of TP in Transformers backend by @hmellor in #15540
- [Model] Add Reasoning Parser for Granite Models by @alex-jw-brooks in #14202
- multi-node offline DP+EP example by @youkaichao in #15484
- Fix weight loading for some models in Transformers backend by @hmellor in #15544
- [Refactor] Remove passthrough `backend` when generate grammar by @aarnphm in #15317
- [V1][Sampler] Faster top-k only implementation by @njhill in #15478
- Support SHA256 as hash function in prefix caching by @dr75 in #15297
- Applying some fixes for K8s agents in CI by @Alexei-V-Ivanov-AMD in #15493
- [V1] TPU - Revert to exponential padding by default by @alexm-redhat in #15565
- [V1] TPU CI - Fix test_compilation.py by @alexm-redhat in #15570
- Use Cache Hinting for fused_moe kernel by @wrmedford in #15511
- [TPU] support disabling xla compilation cache by @yaochengji in #15567
- Support FIPS enabled machines with MD5 hashing by @MattTheCuber in #15299
- [Kernel] CUTLASS grouped gemm fp8 MoE kernel by @ElizaWszola in #13972
- Add automatic tpu label to mergify.yml by @mgoin in #15560
- add platform check back by @Chenyaaang in #15578
- [misc] LoRA: Remove unused long context test data by @varun-sundar-rabindranath in #15558
- [Doc] Update V1 user guide for fp8 kv cache support by @wayzeng in #15585
- [moe][quant] add weight name case for offset by @MengqingCao in #15515
- [V1] Refactor num_computed_tokens logic by @comaniac in #15307
- Allow torchao quantization in SiglipMLP by @jerryzh168 in #15575
- [ROCm] Env variable to trigger custom PA by @gshtras in #15557
- [TPU] [V1] fix cases when max_num_reqs is set smaller than MIN_NUM_SEQS by @yaochengji in #15583
- [Misc] Restrict ray version dependency and update PP feature warning in V1 by @ruisearch42 in #15556
- [TPU] Avoid Triton Import by @robertgshaw2-redhat in #15589
- [Misc] Consolidate LRUCache implementations by @Avabowler in #15481
- [Quantization] Fp8 Channelwise Dynamic Per Token GroupedGEMM by @robertgshaw2-redhat in #15587
- [Misc] Clean up `scatter_patch_features` by @DarkLight1337 in #15559
- [Misc] Use model_redirect to redirect the model name to a local folder. by @noooop in #14116
- Fix incorrect filenames in vllm_compile_cache.py by @zou3519 in #15494
- [Doc] update --system for transformers installation in docker doc by @reidliu41 in #15616
- [Model] MiniCPM-V/O supports V1 by @DarkLight1337 in #15487
- [Bugfix] Fix use_cascade_attention handling for Alibi-based models on vllm/v1 by @h-sugi in #15211
- [Doc] Link to onboarding tasks by @DarkLight1337 in #15629
- [Misc] Replace `is_encoder_decoder_inputs` with `split_enc_dec_inputs` by @DarkLight1337 in #15620
- [Feature] Add middleware to log API Server responses by @terrytangyuan in #15593
- [Misc] Avoid direct access of global `mm_registry` in `compute_encoder_budget` by @DarkLight1337 in #15621
- [Doc] Use absolute placement for Ask AI button by @hmellor in #15628
- [Bugfix][TPU][V1] Fix recompilation by @NickLucche in #15553
- Correct PowerPC to modern IBM Power by @clnperez in #15635
- [CI] Update rules for applying `tpu` label. by @russellb in #15634
- [V1] AsyncLLM data parallel by @njhill in #13923
- [TPU] Lazy Import by @robertgshaw2-redhat in #15656
- [Quantization][V1] BitsAndBytes support V1 by @jeejeelee in #15611
- [Bugfix] Fix failure to launch in Tensor Parallel TP mode on macOS. by @kebe7jun in #14948
- [Doc] Fix dead links in Job Board by @wwl2755 in #15637
- [CI][TPU] Temporarily Disable Quant Test on TPU by @robertgshaw2-redhat in #15649
- Revert "Use Cache Hinting for fused_moe kernel (#15511)" by @wrmedford in #15645
- [Misc]add coding benchmark for speculative decoding by @CXIAAAAA in #15303
- [Quantization][FP8] Adding support for fp8 gemm layer input in fp8 by @gshtras in #14578
- Refactor error handling for multiple exceptions in preprocessing by @JasonZhu1313 in #15650
- [Bugfix] Fix `mm_hashes` forgetting to be passed by @DarkLight1337 in #15668
- [V1] Remove legacy input registry by @DarkLight1337 in #15673
- [TPU][CI] Fix TPUModelRunner Test by @robertgshaw2-redhat in...
v0.8.2
This release contains an important bug fix for the V1 engine's memory usage. We highly recommend you upgrade!
Highlights
- Revert "Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" (#15377)
- Remove openvino support in favor of external plugin (#15339)
V1 Engine
- Fix V1 Engine crash while handling requests with duplicate request id (#15043)
- Support FP8 KV Cache (#14570, #15191)
- Add flag to disable cascade attention (#15243)
- Scheduler Refactoring: Add Scheduler Interface (#15250)
- Structured Output
- Spec Decode
- AMD
- Enable Triton(ROCm) Attention backend for Nvidia GPUs (#14071)
- TPU
Features
- Integrate `fastsafetensors` loader for loading model weights (#10647)
- Add guidance backend for structured output (#14589); see the sketch below
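For the guidance structured-output backend above (#14589), a minimal sketch using guided JSON decoding; the model name is a placeholder and the `guided_decoding_backend="guidance"` value is an assumption based on the feature name:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",   # placeholder model
    guided_decoding_backend="guidance",   # assumed backend name
)

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

params = SamplingParams(
    max_tokens=64,
    guided_decoding=GuidedDecodingParams(json=schema),
)

out = llm.generate(["Reply as JSON: what does vLLM do?"], params)
print(out[0].outputs[0].text)
```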
Others
- Add Kubernetes deployment guide with CPUs (#14865)
- Support reset prefix cache by specified device (#15003)
- Support tool calling and reasoning parser (#14511)
- Support --disable-uvicorn-access-log parameters (#14754)
- Support Tele-FLM Model (#15023)
- Add pipeline parallel support to `TransformersModel` (#12832)
- Enable CUDA graph support for llama 3.2 vision (#14917)
What's Changed
- [FEAT]Support reset prefix cache by specified device by @maobaolong in #15003
- [BugFix][V1] Update stats.py by @WrRan in #15139
- [V1][TPU] Change kv cache shape. by @vanbasten23 in #15145
- [FrontEnd][Perf] `merge_async_iterators` fast-path for single-prompt requests by @njhill in #15150
- [Docs] Annouce Ollama and Singapore Meetups by @simon-mo in #15161
- [V1] TPU - Tensor parallel MP support by @alexm-redhat in #15059
- [BugFix] Lazily import XgrammarBackend to avoid early cuda init by @njhill in #15171
- [Doc] Clarify run vllm only on one node in distributed inference by @ruisearch42 in #15148
- Fix broken tests by @jovsa in #14713
- [Bugfix] Fix embedding assignment for InternVL-based models by @DarkLight1337 in #15086
- fix "Total generated tokens:" is 0 if using --backend tgi and --endpo… by @sywangyi in #14673
- [V1][TPU] Support V1 Sampler for ragged attention by @NickLucche in #14227
- [Benchmark] Allow oversample request in benchmark dataset by @JenZhao in #15170
- [Core][V0] Add guidance backend for structured output by @russellb in #14589
- [Doc] Update Mistral Small 3.1/Pixtral example by @ywang96 in #15184
- [Misc] support --disable-uvicorn-access-log parameters by @chaunceyjiang in #14754
- [Attention] Flash Attention 3 - fp8 by @mickaelseznec in #14570
- [Doc] Update README.md by @DarkLight1337 in #15187
- Enable CUDA graph support for llama 3.2 vision by @mritterfigma in #14917
- typo: Update config.py by @WrRan in #15189
- [Frontend][Bugfix] support prefill decode disaggregation on deepseek by @billishyahao in #14824
- [release] Tag vllm-cpu with latest upon new version released by @khluu in #15193
- Fixing Imprecise Type Annotations by @WrRan in #15192
- [macOS] Ugrade pytorch to 2.6.0 by @linktohack in #15129
- [Bugfix] Multi-video inference on LLaVA-Onevision by @DarkLight1337 in #15082
- Add user forum to README by @hmellor in #15220
- Fix env vars for running Ray distributed backend on GKE by @richardsliu in #15166
- Replace `misc` issues with link to forum by @hmellor in #15226
- [ci] feat: make the test_torchrun_example run with tp=2, external_dp=2 by @vermouth1992 in #15172
- [Bugfix] fix V1 Engine crash while handling requests with duplicate request id by @JasonJ2021 in #15043
- [V1] Add flag to disable cascade attention by @WoosukKwon in #15243
- Enforce that TP > 1 is not supported for Mamba2 if Quantization is Enabled. by @fabianlim in #14617
- [V1] Scheduler Refactoring [1/N] - Add Scheduler Interface by @WoosukKwon in #15250
- [CI/Build] LoRA : make add_lora_test safer by @varun-sundar-rabindranath in #15181
- Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kernels/layernorm_utils.cuh +10 by @houseroad in #15159
- [Misc] Clean up the BitsAndBytes arguments by @jeejeelee in #15140
- [ROCM] Upgrade torch to 2.6 by @SageMoore in #15244
- [Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation by @Isotr0py in #15200
- Mention `extra_body` as a way top pass vLLM only parameters using the OpenAI client by @hmellor in #15240
- [V1][TPU] Speed up top-k on TPU by using torch.topk by @hyeygit in #15242
- [Bugfix] detect alibi and revert to FA2 by @tjohnson31415 in #15231
- [Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies by @cyang49 in #14857
- [Docs] Trim the latest news in README by @WoosukKwon in #15261
- [Misc] Better RayExecutor and multiprocessing compatibility by @comaniac in #14705
- Add an example for reproducibility by @WoosukKwon in #15262
- [Hardware][TPU] Add check for no additional graph compilation during runtime by @lsy323 in #14710
- [V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs by @Isotr0py in #14071
- [Doc] Update LWS docs by @Edwinhr716 in #15163
- [V1] Avoid redundant input processing in n>1 case by @njhill in #14985
- [Feature] specify model in config.yaml by @wayzeng in #14855
- [Bugfix] Add int8 torch dtype for KVCache by @shen-shanshan in #15260
- [Misc] Add attention mask pre-computation optimization back to Qwen2.5-VL by @Isotr0py in #15273
- [Bugfix] Fix incorrect resolving order for transformers fallback by @Isotr0py in #15279
- [V1] Fix wrong import path of get_flash_attn_version by @lhtin in #15280
- [Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend by @Isotr0py in #15282
- [Misc] Add cProfile helpers by @russellb in #15074
- [v1] Refactor KVCacheConfig by @heheda12345 in #14079
- [Bugfix][VLM] fix llava processor by @MengqingCao in #15285
- Revert "[Feature] specify model in config.yaml (#14855)" by @DarkLight1337 in #15293
- [TPU][V1] MHA Pallas backend by @NickLucche in #15288
- [Build/CI] Fix env var typo by @russellb in #15305
- [Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout by @ruisearch42 in #15301
- [Bugfix][V0] Multi-sequence logprobs streaming edge case by @andylolu2 in #15259
- [FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature by @tjtanaa in #14959
- [Doc] add load_format items in docs by @wwl2755 in #14804
- [Bugfix] Fix torch.compile raise FileNotFoundError by @jeejeelee in #15278
- [Bugfix] LoRA V0 - Fix case where `max_num_seqs` is between cudagraph capture sizes by @varun-sundar-rabindranath in #15308
- [Model] Support Tele-FLM Model by @atone in #15023
- [V1] Add `disable-any-whitespace` option support for xgrammar by @russellb in #15316
- [BugFix][Typing] Fix Imprecise Type Annotations by @WrRan in #15...