Hey all!
I implemented some PagedAttention kernels for Metal and have been seeing some nice throughput improvements. The kernels are here. The gains are especially pronounced on Metal because the kernels are optimized to avoid memory pressure.
During testing with mistralrs-server, the kernels enabled significant throughput gains over llama.cpp's llama-server with continuous batching. I ran these tests on an M3 Max with Qwen 3 30B A3B (4-bit vs. GGUF Q4_K_M) and Llama 3.2 3B (8-bit ISQ vs. GGUF Q8_0).
To land this in the MLX ecosystem, mlx_lm would need to be updated to handle block tables and other PagedAttention mechanisms.
I'd be happy to make a PR to add these kernels if that makes sense!
- Qwen 3 30B A3B (4-bit): 9.24 -> 16.34 T/s (+77%)
  - llama.cpp: GGUF Q4_K_M
  - mistral.rs: MLX 4-bit
- Llama 3.2 3B (8-bit): 10.08 -> 23.28 T/s (+131%)
  - llama.cpp: GGUF Q8_0
  - mistral.rs: 8-bit ISQ (AFQ)
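For anyone unfamiliar with the block-table mechanism mentioned above, here is a toy Python sketch of the core idea: the KV cache is split into fixed-size physical blocks, and each sequence keeps a block table mapping logical block indices to physical block ids, so cache memory is allocated on demand rather than reserved up front. All names here are hypothetical for illustration, not the actual kernel or mlx_lm API:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class PagedKVCache:
    """Toy allocator mapping each sequence's logical blocks to physical blocks."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of unused physical block ids
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens cached

    def append_token(self, seq_id: int) -> None:
        # Grab a fresh physical block only when the last one is full.
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id: int, pos: int) -> tuple[int, int]:
        # Translate a logical token position into (physical block id, offset);
        # this is the lookup an attention kernel performs via the block table.
        table = self.tables[seq_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

cache = PagedKVCache(num_blocks=8)
for _ in range(17):                 # 17 tokens -> spills into a second block
    cache.append_token(seq_id=0)
print(len(cache.tables[0]))        # number of physical blocks held by seq 0
print(cache.physical_slot(0, 16))  # (id of second block, offset 0)
```

The kernels themselves consume a block table like this to gather K/V from non-contiguous memory; mlx_lm would need to build and pass these tables per batch.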