Releases · vllm-project/vllm · GitHub

Releases: vllm-project/vllm

v0.9.1

10 Jun 18:30
b6553be

Highlights

This release features 274 commits from 123 contributors (27 new contributors!)

  • Progress in large scale serving
    • DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked DeepGEMM kernel (#19111), CUTLASS MoE kernel with PPLX (#18762)
    • Heterogeneous TP (#18833), NixlConnector Enable FlashInfer backend (#19090)
    • DP: API-server scaleout with many-to-many server-engine comms (#17546), Support DP with Ray (#18779), allow AsyncLLMEngine.generate to target a specific DP rank (#19102), data parallel rank to KVEventBatch (#18925)
    • Tooling: Simplify EP kernels installation (#19412)
  • RLHF workflow: Support inplace model weights loading (#18745)
  • Initial full support for Hybrid Memory Allocator (#17996), support cross-layer KV sharing (#18212)
  • Add FlexAttention to vLLM V1 (#16078)
  • Various production hardening related to full CUDA graph mode (#19171, #19106, #19321)

Model Support

  • Support Magistral (#19193), LoRA support for InternVL (#18842), MiniCPM Eagle support (#18943), NemotronH support (#18863, #19249)
  • Enable data parallel for Llama4 vision encoder (#18368)
  • Add DeepSeek-R1-0528 function call chat template (#18874)

Hardware Support & Performance Optimizations

  • Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (#19205), Qwen3-235B-A22B (#19315)
  • Blackwell: Add Cutlass MLA backend (#17625), Tunings for SM100 FP8 CUTLASS kernel (#18778), Use FlashInfer by default on Blackwell GPUs (#19118), Tune scaled_fp8_quant by increasing vectorization (#18844)
  • FP4: Add compressed-tensors NVFP4 support (#18312), FP4 MoE kernel optimization (#19110)
  • CPU: V1 support for the CPU backend (#16441)
  • ROCm: Add AITER grouped topk for DeepSeekV2 (#18825)
  • POWER: Add IBM POWER11 Support to CPU Extension Detection (#19082)
  • TPU: Initial support of model parallelism with single worker using SPMD (#18011), Multi-LoRA Optimizations for the V1 TPU backend (#15655)
  • Neuron: Add multi-LoRA support for Neuron (#18284), Add Multi-Modal model support for Neuron (#18921), Support quantization on neuron (#18283)
  • Platform: Make torch distributed process group extendable (#18763)

Engine features

  • Add Lora Support to Beam Search (#18346)
  • Add rerank support to run_batch endpoint (#16278)
  • CLI: add run batch (#18804)
  • Server: custom logging (#18403), allowed_token_ids in ChatCompletionRequest (#19143)
  • LLM API: make use_tqdm accept a callable for custom progress bars (#19357); see the sketch after this list
  • Perf: CUDA kernel for applying repetition penalty in the sampler (#18437)
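
A minimal sketch of the use_tqdm change above, assuming use_tqdm accepts a tqdm-compatible factory; the partial-bound options and model name are illustrative:

```python
# Hedged sketch of a custom progress bar via use_tqdm (#19357).
# Assumption: use_tqdm accepts a callable that is invoked like tqdm itself.
from functools import partial

from tqdm import tqdm
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative model
prompts = ["Hello, my name is", "The future of AI is"]

# Instead of use_tqdm=True, pass a tqdm factory with custom settings.
custom_bar = partial(tqdm, desc="generating", unit="req", colour="green")
outputs = llm.generate(prompts, SamplingParams(max_tokens=16), use_tqdm=custom_bar)
```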

API Deprecations

  • Disallow positional arguments other than model when initializing LLM (#18802); see the sketch after this list
  • Remove inputs arg fallback in Engine classes (#18799)
  • Remove fallbacks for Embeddings API (#18795)
  • Remove mean pooling default for Qwen2EmbeddingModel (#18913)
  • Require overriding get_dummy_text and get_dummy_mm_data (#18796)
  • Remove metrics that were deprecated in 0.8 (#18837)
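
A minimal illustration of the LLM positional-argument deprecation above: only model may be passed positionally, everything else must be a keyword argument.

```python
from vllm import LLM

# Supported: `model` positional, all other options as keywords.
llm = LLM("facebook/opt-125m", tensor_parallel_size=1, dtype="auto")

# No longer allowed (#18802): any additional positional argument, e.g. passing
# the tokenizer positionally, now raises an error.
# llm = LLM("facebook/opt-125m", "facebook/opt-125m")
```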

Documentation

  • Add CLI doc (#18871)
  • Update SECURITY.md with link to our security guide (#18961), Add security warning to bug report template (#19365)

What's Changed


v0.9.1rc1

09 Jun 23:48
3a7cd62
Pre-release

What's Changed


v0.9.0.1

30 May 16:11

This patch release contains an important bugfix for the DeepSeek family of models on NVIDIA Ampere and earlier GPUs (#18807).

Full Changelog: v0.9.0...v0.9.0.1

v0.9.0

15 May 03:38
5873877

Highlights

This release features 649 commits from 215 contributors (82 new contributors!)

  • vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change for environment dependencies.
    • The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute a CUDA 12.6 wheel as a GitHub artifact.
    • As a general rule of thumb, our CUDA version policy follows PyTorch's CUDA version policy.
  • Enhanced NVIDIA Blackwell support. vLLM now ships with an initial set of optimized attention and MLP kernels for NVIDIA Blackwell.
    • You can use our Docker image, or install the FlashInfer nightly wheel (pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl) and set VLLM_ATTENTION_BACKEND=FLASHINFER for better performance; see the sketch after this list.
    • Upgraded support for the new FlashInfer main branch. (#15777)
    • Please check out #18153 for the full roadmap
  • Initial DP, EP, PD support for large scale inference
    • EP:
      • Permute and unpermute kernel for moe optimization (#14568)
      • Modularize fused experts and integrate PPLX kernels (#15956)
      • Refactor pplx init logic to make it modular (prepare for deepep) (#18200)
      • Add ep group and all2all interface (#18077)
    • DP:
      • Decouple engine process management and comms (#15977)
    • PD:
      • NIXL Integration (#17751)
      • Local attention optimization for NIXL (#18170)
      • Support multiple kv connectors (#17564)
  • Migrate docs from Sphinx to MkDocs (#18145, #18610, #18614, #18616, #18622, #18626, #18627, #18635, #18637, #18657, #18663, #18666, #18713)
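
A minimal sketch of the FlashInfer setup mentioned in the Blackwell bullet above; install the nightly wheel with the pip command shown there first, and note the model name is illustrative.

```python
import os

# Select the FlashInfer attention backend before vLLM is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
print(llm.generate(["The capital of France is"], SamplingParams(max_tokens=8)))
```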

Notable Changes

  • Removal of CUDA 12.4 support due to PyTorch upgrade to 2.7.
  • top_k is now disabled with 0 (-1 is still accepted for now) (#17773)
  • The seed is now set to 0 by default for the V1 engine, so different vLLM runs now yield the same outputs even when temperature > 0. Because workers run in separate processes, this does not modify the random state in user code unless VLLM_USE_V1_MULTIPROCESSING=0. (#17929, #18741) See the sketch after this list.
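
A small sketch of the two sampling-related changes above; the model name is illustrative.

```python
from vllm import LLM, SamplingParams

# top_k sampling is now disabled with 0; -1 is still accepted for now (#17773).
params = SamplingParams(temperature=0.7, top_k=0, max_tokens=16)

# The V1 engine defaults to seed=0, so repeated runs with temperature > 0 give
# identical outputs; pass an explicit seed to change that (#17929, #18741).
llm = LLM(model="facebook/opt-125m", seed=1234)
outputs = llm.generate(["Once upon a time"], params)
```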

Model Enhancements

  • Support MiMo-7B (#17433), MiniMax-VL-01 (#16328), Ovis 1.6 (#17861), Ovis 2 (#15826), GraniteMoeHybrid 4.0 (#17497), FalconH1* (#18406), LlamaGuard4 (#17315)
    • Please install the development version of transformers (from source) to use Falcon-H1.
  • Embedding models: nomic-embed-text-v2-moe (#17785), new class of gte models (#17986)
  • Progress in Hybrid Memory Allocator (#17394, #17479, #17474, #17483, #17193, #17946, #17945, #17999, #18001, #18593)
  • DeepSeek: perf enhancement by moving more calls into the cuda-graph region (#17484, #17668), Function Call (#17784), MTP in V1 (#18435)
  • Qwen2.5-1M: Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support (#11844)
  • Qwen2.5-VL speed enhancement via rotary_emb optimization (#17973)
  • InternVL models with Qwen2.5 backbone now support video inputs (#18499)

Performance, Production and Scaling

  • Support full cuda graph in v1 (#16072)
  • Pipeline Parallelism: MultiprocExecutor support (#14219), torchrun (#17827)
  • Support sequence parallelism combined with pipeline parallelism (#18243)
  • Async tensor parallelism using compilation pass (#17882)
  • Perf: Use small max_num_batched_tokens for A100 (#17885)
  • Fast Model Loading: Tensorizer support for V1 and LoRA (#17926)
  • Multi-modality: Automatically cast multi-modal input dtype before transferring device (#18756)

Security

  • Prevent side-channel attacks via cache salting (#17045); see the sketch after this list
  • Fix image hash collision in certain edge cases (#17378)
  • Add VLLM_ALLOW_INSECURE_SERIALIZATION env var (#17490)
  • Migrate to REGEX Library to prevent catastrophic backtracking (#18454, #18750)
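
A hedged sketch of cache salting (#17045) against the OpenAI-compatible server; the cache_salt request field is an assumption based on vLLM's prefix-caching documentation rather than something stated in these notes.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Prefix-cache hits are only shared between requests carrying the same salt,
# which prevents timing-based probing of other tenants' cached prefixes.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    messages=[{"role": "user", "content": "Summarize this document."}],
    extra_body={"cache_salt": "tenant-a-random-salt"},  # assumed field name
)
print(resp.choices[0].message.content)
```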

Features

  • CLI: deprecated=True (#17426)
  • Frontend: progress bar for adding requests (#17525), chat_template_kwargs in LLM.chat (#17356; see the sketch after this list), /classify endpoint (#17032), truncation control for embedding models (#14776), cached_tokens in response usage (#18149)
  • LoRA: default local directory LoRA resolver plugin (#16855)
  • Metrics: kv event publishing (#16750), API for accessing in-memory Prometheus metrics (#17010)
  • Quantization: nvidia/DeepSeek-R1-FP4 (#16362), Quark MXFP4 format (#16943), AutoRound (#17850), torchao models with AOPerModuleConfig (#17826), CUDA Graph support for GGUF in V1 (#18646)
  • Reasoning: deprecate --enable-reasoning (#17452)
  • Spec Decode: EAGLE share input embedding (#17326), torch.compile & cudagraph to EAGLE (#17211), EAGLE3 (#17504), log accumulated metrics (#17913), Medusa (#17956)
  • Structured Outputs: Thinking compatibility (#16577), Spec Decoding (#14702), Qwen3 reasoning parser (#17466), tool_choice: required for Xgrammar (#17845), Structural Tag with Guidance backend (#17333)
  • Transformers backend: named parameters (#16868), interleaved sliding window attention (#18494)
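
A minimal sketch of chat_template_kwargs in LLM.chat (#17356); enable_thinking is an illustrative kwarg understood by chat templates such as Qwen3's, and the model name is illustrative.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # illustrative model
out = llm.chat(
    [{"role": "user", "content": "Give a one-line summary of vLLM."}],
    SamplingParams(max_tokens=64),
    chat_template_kwargs={"enable_thinking": False},  # forwarded to the template
)
print(out[0].outputs[0].text)
```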

Hardware

  • NVIDIA: cutlass support for blackwell fp8 blockwise gemm (#14383)
  • TPU: Multi-LoRA implementation (#14238), default max-num-batched-tokens (#17508), V1 backend by default (#17673), top-logprobs (#17072)
  • Neuron: NeuronxDistributedInference support (#15970), Speculative Decoding, Dynamic on-device sampling (#16357), Mistral Model (#18222), Multi-LoRA (#18284)
  • AMD: Enable FP8 KV cache on V1 (#17870), Tuned fused moe config for Qwen3 MoE on MI300X (#17535, #17530), AITER biased group topk (#17955), Block-Scaled GEMM (#14968), MLA (#17523), Radeon GPU use Custom Paged Attention (#17004), reduce the number of environment variables in command line (#17229)
  • Extensibility: Make PiecewiseBackend pluggable and extendable (#18076)

Documentation

  • Update quickstart and install for cu128 using --torch-backend=auto (#18505)
  • NVIDIA TensorRT Model Optimizer (#17561)
  • Usage of Qwen3 thinking (#18291)

Developer Facing

What's Changed


v0.8.5.post1

02 May 18:03

This post release contains two bug fixes, for a memory leak and a model accuracy issue:

  • Fix Memory Leak in _cached_reqs_data (#17567)
  • Fix sliding window attention in V1 giving incorrect results (#17574)

Full Changelog: v0.8.5...v0.8.5.post1

v0.8.5

28 Apr 21:13

This release contains 310 commits from 143 contributors (55 new contributors!).

Highlights

This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structure tag feature for tool calling.

Model Support

  • Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328).
  • Add ModernBERT (#16648)
  • Add Granite Speech Support (#16246)
  • Add PLaMo2 (#14323)
  • Add Kimi-VL model support (#16387)
  • Add Qwen2.5-Omni model support (thinker only) (#15130)
  • Snowflake Arctic Embed (Family) (#16649)
  • Accuracy fixes for Llama4 Int4 (#16801), chat template for Llama 4 models (#16428), enhanced AMD support (#16674, #16847)

V1 Engine

  • Add structural_tag support using xgrammar (#17085)
  • Disaggregated serving:
    • KV Connector API V1 (#15960)
    • Adding LMCache KV connector for v1 (#16625)
  • Clean up: Remove Sampler from Model Code (#17084)
  • MLA: Simplification to batch P/D reordering (#16673)
  • Move usage stats to worker and start logging TPU hardware (#16211)
  • Support FlashInfer Attention (#16684)
  • Faster incremental detokenization (#15137)
  • EAGLE-3 Support (#16937)

Features

  • Validate urls object for multimodal content parts (#16990)
  • Prototype support sequence parallelism using compilation pass (#16155)
  • Add sampling params to v1/audio/transcriptions endpoint (#16591)
  • Enable vLLM to Dynamically Load LoRA from a Remote Server (#10546)
  • Add vllm bench [latency, throughput] CLI commands (#16508)

Performance

  • Attention:
    • FA3 decode perf improvement - single mma warp group support for head dim 128 (#16864)
    • Update to latest FA3 code (#13111)
    • Support Cutlass MLA for Blackwell GPUs (#16032)
  • MoE:
    • Add expert_map support to Cutlass FP8 MOE (#16861)
    • Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 (#16753)
  • Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#6036)
  • Optimize rotary_emb implementation to use Triton operator for improved performance (#16457)

Hardware

  • TPU:
    • Enable structured decoding on TPU V1 (#16499)
    • Capture multimodal encoder during model compilation (#15051)
    • Enable Top-P (#16843)
  • AMD:
    • AITER Fused MOE V1 Support (#16752)
    • Integrate Paged Attention Kernel from AITER (#15001)
    • Support AITER MLA (#15893)
    • Upstream prefix prefill speed up for vLLM V1 (#13305)
    • Adding fp8 and variable length sequence support to Triton FAv2 kernel (#12591)
    • Add skinny gemms for unquantized linear on ROCm (#15830)
    • Follow-ups for Skinny Gemms on ROCm (#17011)

Documentation

  • Add open-webui example (#16747)
  • Document Matryoshka Representation Learning support (#16770)
  • Add a security guide (#17230)
  • Add example to run DeepSeek with Ray Serve LLM (#17134)
  • Benchmarks for audio models (#16505)

Security and Dependency Updates

  • Don't bind tcp zmq socket to all interfaces (#17197)
  • Use safe serialization and fix zmq setup for mooncake pipe (#17192)
  • Bump Transformers to 4.51.3 (#17116)

Build and testing

  • Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#16721)

Breaking changes 🚨

  • --enable-chunked-prefill, --multi-step-stream-outputs, --disable-chunked-mm-input can no longer be explicitly set to False. Instead, prefix the argument with no- (i.e. use --no-enable-chunked-prefill instead of --enable-chunked-prefill=False) (#16533)

What's Changed


v0.8.4

14 Apr 06:14
dc1b4a6

This release contains 180 commits from 84 contributors (25 new contributors!).

Highlights

This release includes important accuracy fixes for Llama4 models; if you are using them, we highly recommend updating.

Model

  • Llama4 (#16113, #16509) bug fixes and enhancements:
    • QK norm should not be shared across heads (#16311)
    • Enable attention temperature tuning by default for long context (>32k) (#16439)
    • Fix index error when a single request is near the max context length (#16209)
    • Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 (#16488)
    • Update to transformers==4.51.1 (#16257)
    • Added chat templates for LLaMa4 pythonic tool calling (#16463)
    • Optimized topk for topk=1 (#16512)
    • Add warning for Attention backends that do not support irope yet (#16212)
  • Support Qwen3 and Qwen3MoE (#15289), smolvlm (#16017), jinaai/jina-embeddings-v3 (#16120), InternVL3 (#16495), GLM-4-0414 (#16338)

API

  • Estimate max-model-len using available KV cache memory. The error message now hints at how to set --max-model-len (#16168)
  • Add hf_token to EngineArgs (#16093)
  • Enable regex support with xgrammar in V0 engine (#13228)
  • Support matryoshka representations / embedding API dimensions (#16331); see the sketch after this list
  • Add bucket for request_latency, time_to_first_token and time_per_output_token (#15202)
  • Support for TorchAO quantization (#14231)
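
A hedged sketch of the Matryoshka dimensions support (#16331) via the OpenAI-compatible embeddings API; the model name is illustrative and must support Matryoshka representations.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Request a truncated (Matryoshka) embedding via the standard dimensions field.
resp = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",  # illustrative Matryoshka-capable model
    input=["vLLM release notes"],
    dimensions=256,
)
print(len(resp.data[0].embedding))  # 256
```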

Hardware

  • Intel-Gaudi: Multi-step scheduling implementation for HPU (#12779)
  • TPU:
    • Make @support_torch_compile work for XLA backend (#15782)
    • Use language_model interface for getting text backbone in MM (#16410)

Performance

  • DeepSeek MLA: a new merge_attn_states CUDA kernel, 3x speedup (#16173)
  • MoE: Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (#16366)
  • Add support to modelopt quantization of Mixtral model (#15961)
  • Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) (#16537)

V1 Engine Core

  • Enable multi-input by default (#15799)
  • Scatter and gather placeholders in the model runner (#16076)
  • Set structured output backend to auto by default (#15724)
  • Zero-copy tensor/ndarray serialization/transmission (#13790)
  • Eagle Model loading (#16035)
  • KV cache slots for eagle heads (#16370)
  • Add supports_structured_output() method to Platform (#16148)

Developer Facing

What's Changed


v0.8.3

06 Apr 04:11

Highlights

This release features 260 commits from 109 contributors (38 new contributors!).

  • We are excited to announce Day 0 Support for Llama 4 Scout and Maverick (#16104). Please see our blog for a detailed user guide.
    • Please note that Llama4 is currently supported only in the V1 engine.
  • V1 engine now supports native sliding window attention (#14097) with the hybrid memory allocator.

Cluster Scale Serving

  • Single node data parallel with API server support (#13923)
  • Multi-node offline DP+EP example (#15484)
  • Expert parallelism enhancements
    • CUTLASS grouped gemm fp8 MoE kernel (#13972)
    • Fused experts refactor (#15914)
    • Fp8 Channelwise Dynamic Per Token GroupedGEMM (#15587)
    • Adding support for fp8 gemm layer input in fp8 (#14578)
    • Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. (#13932)
  • Support XpYd disaggregated prefill with MooncakeStore (#12957)

Model Support

V1 Engine

  • Collective RPC (#15444)
  • Faster top-k only implementation (#15478)
  • BitsAndBytes support (#15611)
  • Speculative Decoding: metrics (#15151), Eagle Proposer (#15729), n-gram interface update (#15750), EAGLE Architecture with Proper RMS Norms (#14990)

Features

API

  • Support Enum for xgrammar based structured output in V1 (#15594, #15757)
  • A new tags parameter for wake_up (#15500); see the sketch after this list
  • V1 LoRA support CPU offload (#15843)
  • Prefix caching support: FIPS enabled machines with MD5 hashing (#15299), SHA256 as alternative hashing algorithm (#15297)
  • Addition of http service metrics (#15657)
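
A hedged sketch of the new tags parameter for wake_up (#15500); the enable_sleep_mode flag and the "weights"/"kv_cache" tag names are assumptions based on vLLM's sleep-mode API rather than these notes.

```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)  # assumed flag

llm.sleep(level=1)             # release GPU memory between rollout phases
llm.wake_up(tags=["weights"])  # restore only the weights, not the KV cache
# ... load updated weights here, then wake up the rest ...
llm.wake_up(tags=["kv_cache"])
```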

Performance

  • LoRA Scheduler optimization bridging V1 and V0 performance (#15422).

Hardware

  • AMD:
    • Add custom allreduce support for ROCM (#14125)
    • Quark quantization documentation (#15861)
    • AITER integration: int8 scaled gemm kernel (#15433), fused moe (#14967)
    • Paged attention for V1 (#15720)
  • CPU:
  • TPU
    • Improve Memory Usage Estimation (#15671)
    • Optimize the all-reduce performance (#15903)
    • Support sliding window and logit soft capping in the paged attention kernel. (#15732)
    • TPU-optimized top-p implementation (avoids scattering). (#15736)

Doc, Build, Ecosystem

  • V1 user guide update: fp8 kv cache support (#15585), multi-modality (#15460)
  • Recommend developing with Python 3.12 in developer guide (#15811)
  • Clean up: move dockerfiles into their own directory (#14549)
  • Add minimum version for huggingface_hub to enable Xet downloads (#15873)
  • TPU CI: Add basic perf regression test (#15414)

What's Changed


v0.8.3rc1

05 Apr 19:46
63375f0
Pre-release

What's Changed


v0.8.2

23 Mar 21:05
25f560a

This release contains an important bug fix for the V1 engine's memory usage. We highly recommend upgrading!

Highlights

  • Revert "Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" (#15377)
  • Remove openvino support in favor of external plugin (#15339)

V1 Engine

  • Fix V1 Engine crash while handling requests with duplicate request id (#15043)
  • Support FP8 KV Cache (#14570, #15191); see the sketch after this list
  • Add flag to disable cascade attention (#15243)
  • Scheduler Refactoring: Add Scheduler Interface (#15250)
  • Structured Output
    • Add disable-any-whitespace option support for xgrammar (#15316)
    • guidance backend for structured output + auto fallback mode (#14779)
  • Spec Decode
    • Enable spec decode for top-p & top-k sampling (#15063)
    • Use better defaults for N-gram (#15358)
    • Update target_logits in place for rejection sampling (#15427)
  • AMD
    • Enable Triton(ROCm) Attention backend for Nvidia GPUs (#14071)
  • TPU
    • Support V1 Sampler for ragged attention (#14227)
    • Tensor parallel MP support (#15059)
    • MHA Pallas backend (#15288)
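
A minimal sketch of the FP8 KV cache support (#14570, #15191) referenced above; kv_cache_dtype is the existing engine argument and the model name is illustrative.

```python
from vllm import LLM

# Store the KV cache in FP8 to roughly halve its memory footprint.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_dtype="fp8")
```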

Features

  • Integrate fastsafetensors loader for loading model weights (#10647)
  • Add guidance backend for structured output (#14589)

Others

  • Add Kubernetes deployment guide with CPUs (#14865)
  • Support reset prefix cache by specified device (#15003)
  • Support tool calling and reasoning parser (#14511)
  • Support --disable-uvicorn-access-log parameters (#14754)
  • Support Tele-FLM Model (#15023)
  • Add pipeline parallel support to TransformersModel (#12832)
  • Enable CUDA graph support for llama 3.2 vision (#14917)

What's Changed
