Hey all!
I implemented some PagedAttention kernels for Metal and have been seeing some nice throughput improvements. The kernels are here. The gains are especially pronounced on Metal because the kernels are optimized to avoid memory pressure.
During testing with mistralrs-server, the kernels enabled significant throughput gains over llama.cpp's llama-server with continuous batching. I ran these tests on an M3 Max with Qwen 3 30B A3B (4-bit vs. GGUF Q4_K_M) and Llama 3.2 3B (8-bit ISQ vs. GGUF Q8_0).
To land this in the MLX ecosystem, mlx_lm would need to be updated to handle block tables and other PagedAttention mechanisms.
I'd be happy to make a PR to add these kernels if that makes sense!
- Qwen 3 30B A3B (4-bit): 9.24 -> 16.34 T/s (+77%)
  - llama.cpp: GGUF Q4_K_M
  - mistral.rs: MLX 4-bit
- Llama 3.2 3B (8-bit): 10.08 -> 23.28 T/s (+131%)
  - llama.cpp: GGUF Q8_0
  - mistral.rs: 8-bit ISQ (AFQ)
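For anyone unfamiliar with the block-table mechanism mentioned above, here is a toy Python sketch of the core idea: the KV cache is split into fixed-size physical blocks, and each sequence keeps a block table mapping logical block indices to physical block ids, so cache memory is allocated on demand rather than reserved up front. All names here are hypothetical for illustration, not the actual kernel or mlx_lm API:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class PagedKVCache:
    """Toy allocator mapping each sequence's logical blocks to physical blocks."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of unused physical block ids
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens cached

    def append_token(self, seq_id: int) -> None:
        # Grab a fresh physical block only when the last one is full.
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id: int, pos: int) -> tuple[int, int]:
        # Translate a logical token position into (physical block id, offset);
        # this is the lookup an attention kernel performs via the block table.
        table = self.tables[seq_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

cache = PagedKVCache(num_blocks=8)
for _ in range(17):                 # 17 tokens -> spills into a second block
    cache.append_token(seq_id=0)
print(len(cache.tables[0]))        # number of physical blocks held by seq 0
print(cache.physical_slot(0, 16))  # (id of second block, offset 0)
```

The kernels themselves consume a block table like this to gather K/V from non-contiguous memory; mlx_lm would need to build and pass these tables per batch.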