remove simd, improves perf on arm64 cpu by x2 by smpurkis · Pull Request #11 · samuel-vitorino/lm.rs · GitHub

remove simd, improves perf on arm64 cpu by x2 #11


Open
smpurkis wants to merge 3 commits into main

Conversation

smpurkis

On Arm64 CPUs, removing the explicit SIMD in matmul lets the compiler optimise it for arm64. In my experiments, performance goes from ~2 t/s to ~4.5 t/s.

I'm not sure whether this has any performance implications for running on x86, as I don't have one to test on. If it does make it slower, then I'm happy to put this version behind a target flag for arm64, e.g. https://docs.rs/core_arch/latest/core_arch/#static-cpu-feature-detection
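In case x86 does regress, a rough sketch of what the arch-gated dispatch could look like (a self-contained toy example; the matmul signature and names here are placeholders, not the actual lm.rs API, and the non-arm64 branch would keep the existing wide-based path rather than the scalar loop shown):

// Toy sketch: pick the matmul implementation at compile time by target
// architecture. Signature and names are placeholders, not lm.rs functions.
#[cfg(target_arch = "aarch64")]
fn matmul(y: &mut [f32], x: &[f32], w: &[f32], n: usize, d: usize) {
    // Plain scalar loops: on arm64, rustc/LLVM auto-vectorizes this for NEON,
    // which is what this PR relies on.
    for i in 0..d {
        y[i] = w[i * n..(i + 1) * n].iter().zip(x).map(|(&wv, &xv)| wv * xv).sum();
    }
}

#[cfg(not(target_arch = "aarch64"))]
fn matmul(y: &mut [f32], x: &[f32], w: &[f32], n: usize, d: usize) {
    // On other targets this is where the existing explicit-SIMD path would
    // stay; the same scalar loop is used here only to keep the sketch
    // self-contained.
    for i in 0..d {
        y[i] = w[i * n..(i + 1) * n].iter().zip(x).map(|(&wv, &xv)| wv * xv).sum();
    }
}

fn main() {
    // y = W * x, with W stored row-major as d rows of length n.
    let (n, d) = (4, 2);
    let w = vec![1.0f32; n * d];
    let x = vec![2.0f32; n];
    let mut y = vec![0.0f32; d];
    matmul(&mut y, &x, &w, n, d);
    println!("{:?}", y); // [8.0, 8.0]
}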

Before

❯ cargo build --release --bin chat && ./target/release/chat --model model/llama3.2-3b-it-q40.lmrs --tokenizer model/tokenizer.bin --show-metrics --temperature 0
   Compiling lmrs v0.1.0 (../lm.rs)
    Finished `release` profile [optimized] target(s) in 22.27s

    L      M     M  RRRR    ssss
    L      MM   MM  R   R  s
    L      M M M M  RRRR    sss
    L      M  M  M  R  R       s
    LLLL   M     M  R   R  sssss
    
LMRS version: 4
Model type: LLAMA

Using Q4_0 quantization.
Loading weights...
Done.

You: hello there
Assistant:
Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat for a bit?
Speed: 1.98 tok/s
You: 

After

❯ cargo build --release --bin chat && ./target/release/chat --model model/llama3.2-3b-it-q40.lmrs --tokenizer model/tokenizer.bin --show-metrics --temperature 0
   Compiling lmrs v0.1.0 (../lm.rs)
    Finished `release` profile [optimized] target(s) in 21.72s

    L      M     M  RRRR    ssss
    L      MM   MM  R   R  s
    L      M M M M  RRRR    sss
    L      M  M  M  R  R       s
    LLLL   M     M  R   R  sssss
    
LMRS version: 4
Model type: LLAMA

Using Q4_0 quantization.
Loading weights...
Done.

You: hello there
Assistant:
Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat for a bit?
Speed: 4.67 tok/s
You:

@samuel-vitorino
Owner

Good morning. Thank you for the feedback. Does this happen on the 8-bit quantization models too? On x86, SIMD improves performance significantly on Q8. Anyway, Q4 is still being improved, and I will test on my x86 machine today to see whether I get the same results.

@smpurkis
Author

Good question on Q8, I'll see if it makes a similar difference 👍

@smpurkis
Author
smpurkis commented Oct 16, 2024

On Q8 it gives a smaller improvement: 2.90 t/s -> 3.70 t/s.
Edit: Using the same Llama 3.2 3B model.

Edit 2: Doing the same for the f32 matmul slows it down: 4.2 t/s -> 2.8 t/s, using the Llama 3.2 1B model.

@samuel-vitorino
Owner
samuel-vitorino commented Oct 16, 2024

Without other people testing on arm64 systems, it's hard to say whether adding a target flag for the Q4 and Q8 matmuls is justified. The reason I used the wide crate was that it falls back to normal instructions even when SIMD isn't available (which shouldn't be your case), so I'm not sure what's causing the slowdown.
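For context, the kind of wide-based inner loop we're talking about looks roughly like this (a minimal sketch, not the exact lm.rs code; the dot_wide name and the 8-lane f32 width are illustrative assumptions):

// Minimal sketch of an explicit-SIMD dot product using the `wide` crate,
// assuming 8-lane f32 vectors; the name and exact structure are illustrative.
use wide::f32x8;

fn dot_wide(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 8;
    let mut acc = f32x8::splat(0.0);
    for i in 0..chunks {
        let va = f32x8::from(<[f32; 8]>::try_from(&a[i * 8..i * 8 + 8]).unwrap());
        let vb = f32x8::from(<[f32; 8]>::try_from(&b[i * 8..i * 8 + 8]).unwrap());
        acc += va * vb; // multiply and accumulate across 8 lanes per iteration
    }
    // Horizontal add of the 8 lanes, then a scalar loop for the remainder.
    let mut sum = acc.reduce_add();
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

fn main() {
    let a = vec![1.0f32; 19];
    let b = vec![2.0f32; 19];
    println!("{}", dot_wide(&a, &b)); // 38
}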

@samuel-vitorino
Owner
samuel-vitorino commented Oct 16, 2024

Just to be sure, are you using the RUSTFLAGS="-C target-cpu=native" flag when compiling?
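i.e. something along the lines of:

❯ RUSTFLAGS="-C target-cpu=native" cargo build --release --bin chat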

@smpurkis
Author

I wasn't; however, rerunning all of the above on main and on my branch doesn't change the tokens per second.
