Release v0.9.1 · vllm-project/vllm · GitHub

v0.9.1

Latest
@github-actions released this 10 Jun 18:30 · 90 commits to main since this release · b6553be

Highlights

This release features 274 commits from 123 contributors (27 new contributors!)

  • Progress in large-scale serving
    • DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked DeepGEMM kernel (#19111), CUTLASS MoE kernel with PPLX (#18762)
    • Heterogeneous TP (#18833), NixlConnector: enable FlashInfer backend (#19090)
    • DP: API-server scale-out with many-to-many server-engine comms (#17546), support DP with Ray (#18779), allow AsyncLLMEngine.generate to target a specific DP rank (#19102; see the sketch after this list), add data parallel rank to KVEventBatch (#18925)
    • Tooling: Simplify EP kernels installation (#19412)
  • RLHF workflow: Support inplace model weights loading (#18745)
  • Initial full support for Hybrid Memory Allocator (#17996), support cross-layer KV sharing (#18212)
  • Add FlexAttention to vLLM V1 (#16078)
  • Various production-hardening fixes related to full CUDA graph mode (#19171, #19106, #19321)
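
A minimal sketch of the per-rank routing from #19102, assuming the new keyword on AsyncLLMEngine.generate is named data_parallel_rank and that a two-rank DP engine is available:

```python
# Hypothetical sketch: pin one request to a single data-parallel rank.
# Assumes #19102 exposes a `data_parallel_rank` keyword on generate().
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="facebook/opt-125m", data_parallel_size=2)
    )
    # Route this request to DP rank 0 instead of letting the engine's
    # own load balancer pick a rank.
    async for output in engine.generate(
        "Hello, my name is",
        SamplingParams(max_tokens=16),
        request_id="req-0",
        data_parallel_rank=0,  # assumed keyword from #19102
    ):
        last = output
    print(last.outputs[0].text)


asyncio.run(main())
```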

Model Support

  • Support Magistral (#19193), LoRA support for InternVL (#18842), MiniCPM EAGLE support (#18943), NemotronH support (#18863, #19249)
  • Enable data parallel for Llama4 vision encoder (#18368)
  • Add DeepSeek-R1-0528 function call chat template (#18874)

Hardware Support & Performance Optimizations

  • Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (#19205), Qwen3-235B-A22B (#19315)
  • Blackwell: Add Cutlass MLA backend (#17625), Tunings for SM100 FP8 CUTLASS kernel (#18778), Use FlashInfer by default on Blackwell GPUs (#19118), Tune scaled_fp8_quant by increasing vectorization (#18844)
  • FP4: Add compressed-tensors NVFP4 support (#18312), FP4 MoE kernel optimization (#19110)
  • CPU: V1 support for the CPU backend (#16441)
  • ROCm: Add AITER grouped topk for DeepSeekV2 (#18825)
  • POWER: Add IBM POWER11 Support to CPU Extension Detection (#19082)
  • TPU: Initial support of model parallelism with single worker using SPMD (#18011), Multi-LoRA Optimizations for the V1 TPU backend (#15655)
  • Neuron: add multi-LoRA support (#18284), multi-modal model support (#18921), and quantization support (#18283)
  • Platform: Make torch distributed process group extendable (#18763)

Engine features

  • Add LoRA support to beam search (#18346)
  • Add rerank support to run_batch endpoint (#16278)
  • CLI: add a run-batch command (#18804)
  • Server: custom logging (#18403), support allowed_token_ids in ChatCompletionRequest (#19143; see the first sketch after this list)
  • LLM API: make use_tqdm accept a callable for custom progress bars (#19357; see the second sketch after this list)
  • Perf: CUDA sampler kernel for applying repetition penalties (#18437)
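
A minimal sketch of the allowed_token_ids option from #19143, passed through the OpenAI client's extra_body against a locally served model; the token ids below are placeholders you would resolve with the model's tokenizer:

```python
# Hypothetical sketch: constrain chat-completion sampling to a fixed
# set of token ids via the `allowed_token_ids` request field (#19143).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Answer Yes or No: is 2 + 2 = 4?"}],
    max_tokens=1,
    extra_body={
        # Placeholder ids; map "Yes"/"No" to real token ids with the
        # model's tokenizer before using this.
        "allowed_token_ids": [10932, 2362],
    },
)
print(response.choices[0].message.content)
```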
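
And a minimal sketch of the callable use_tqdm from #19357; passing a functools.partial that pre-configures the bar is an assumption about how the hook is meant to be used:

```python
# Hypothetical sketch: customize the generation progress bar by passing
# a callable (here a pre-configured tqdm) instead of a bool (#19357).
from functools import partial

from tqdm import tqdm

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    SamplingParams(max_tokens=16),
    use_tqdm=partial(tqdm, desc="generating", unit="req"),
)
for out in outputs:
    print(out.outputs[0].text)
```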

API Deprecations

  • Disallow positional arguments other than model when initializing LLM (#18802; see the sketch after this list)
  • Remove inputs arg fallback in Engine classes (#18799)
  • Remove fallbacks for Embeddings API (#18795)
  • Remove mean pooling default for Qwen2EmbeddingModel (#18913)
  • Require overriding get_dummy_text and get_dummy_mm_data (#18796)
  • Remove metrics that were deprecated in 0.8 (#18837)
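
A minimal sketch of what #18802 changes for callers; the second positional argument shown (the tokenizer) is illustrative:

```python
# Hypothetical sketch: after #18802, only `model` may be positional when
# constructing LLM; every other option must be passed by keyword.
from vllm import LLM

# No longer allowed: a second positional argument now raises an error.
# llm = LLM("facebook/opt-125m", "facebook/opt-125m")

# Still fine: `model` positional, everything else by keyword.
llm = LLM(
    "facebook/opt-125m",
    tokenizer="facebook/opt-125m",
    tensor_parallel_size=1,
)
```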

Documentation

  • Add CLI doc (#18871)
  • Update SECURITY.md with link to our security guide (#18961), Add security warning to bug report template (#19365)

What's Changed

New Contributors

Full Changelog: v0.9.0...v0.9.1
