PagedAttention integration in MLX #2228
Open

Description

@EricLBuehler

Hey all!

I implemented some PagedAttention kernels for Metal and have been seeing some nice throughput improvements. The kernels are here. The gains are especially pronounced on Metal because the kernels have been optimized to avoid memory pressure.
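
To make the mechanics concrete, here is a minimal NumPy sketch of the core PagedAttention idea (illustrative only, not the actual Metal kernels): the KV cache lives in a shared pool of fixed-size blocks, and a per-sequence block table maps logical token positions to scattered physical blocks that attention gathers over at decode time. All sizes and names below are made up for the example.

```python
import numpy as np

BLOCK_SIZE = 16   # tokens per KV block (illustrative)
NUM_BLOCKS = 64   # physical blocks in the shared pool
HEAD_DIM = 8

# Physical KV pool: blocks are handed out on demand instead of
# reserving one contiguous cache region per sequence.
key_pool = np.random.randn(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM).astype(np.float32)
value_pool = np.random.randn(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM).astype(np.float32)

def paged_attention(query, block_table, context_len):
    """Single-head attention over a KV cache stored in scattered blocks.

    query: (HEAD_DIM,) vector for the current decode step.
    block_table: logical-to-physical block indices for this sequence.
    context_len: number of valid tokens in the sequence's cache.
    """
    # Gather this sequence's keys/values from the pool via its block table,
    # then trim the padding in the final, partially filled block.
    keys = key_pool[block_table].reshape(-1, HEAD_DIM)[:context_len]
    values = value_pool[block_table].reshape(-1, HEAD_DIM)[:context_len]

    scores = keys @ query / np.sqrt(HEAD_DIM)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

# A 40-token sequence whose cache occupies three non-contiguous blocks.
query = np.random.randn(HEAD_DIM).astype(np.float32)
out = paged_attention(query, block_table=np.array([7, 2, 51]), context_len=40)
print(out.shape)  # (8,)
```

A real kernel fuses this gather into the attention computation rather than materializing contiguous copies of the keys and values.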

During testing with mistralrs-server, the kernels enabled significant throughput gains over llama.cpp's llama-server with continuous batching. I ran these tests on an M3 Max with Qwen 3 30B A3B in 4-bit and GGUF Q4_K_M, and Llama 3.2 3B in 8-bit ISQ and GGUF Q8_0.

To land this in the MLX ecosystem, mlx_lm would need to be updated to handle block tables and other PagedAttention mechanisms.
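For illustration, here is a rough sketch of the host-side bookkeeping that integration would involve: a block allocator that grows a sequence's block table as it generates tokens and returns blocks to the pool when the sequence finishes. All class and method names here are hypothetical, not an actual mlx_lm API.

```python
class BlockAllocator:
    """Hypothetical sketch of PagedAttention bookkeeping for mlx_lm."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens written so far

    def append_token(self, seq_id) -> None:
        """Reserve cache space for one new token, allocating a block if needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:  # current blocks are full
            if not self.free_blocks:
                raise RuntimeError("KV pool exhausted; a sequence must be preempted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free_sequence(self, seq_id) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

The attention kernels would then consume each sequence's block table at every step, so the cache never needs to be contiguous and blocks freed by finished requests are immediately reusable by new ones.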

I'd be happy to make a PR to add these kernels if that makes sense!

[Image: throughput comparison chart]
