remove simd, improves perf on arm64 cpu by x2 by smpurkis · Pull Request #11 · samuel-vitorino/lm.rs · GitHub

remove simd, improves perf on arm64 cpu by x2 #11


Open
smpurkis wants to merge 3 commits into main

Conversation

smpurkis

On Arm64 CPUs, removing the explicit SIMD in matmul lets the compiler optimise it for arm64. In my experiments, performance goes from ~2 t/s to ~4.5 t/s.

I'm not sure whether this has any performance implications for running on x86, as I don't have one to test on. If it does make it slower, then I'm happy to put this version behind a target flag for arm64, e.g. https://docs.rs/core_arch/latest/core_arch/#static-cpu-feature-detection
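In case x86 does regress, a rough sketch of what the arch-gated dispatch could look like (a self-contained toy example; the matmul signature and names here are placeholders, not the actual lm.rs API, and the non-arm64 branch would keep the existing wide-based path rather than the scalar loop shown):

// Toy sketch: pick the matmul implementation at compile time by target
// architecture. Signature and names are placeholders, not lm.rs functions.
#[cfg(target_arch = "aarch64")]
fn matmul(y: &mut [f32], x: &[f32], w: &[f32], n: usize, d: usize) {
    // Plain scalar loops: on arm64, rustc/LLVM auto-vectorizes this for NEON,
    // which is what this PR relies on.
    for i in 0..d {
        y[i] = w[i * n..(i + 1) * n].iter().zip(x).map(|(&wv, &xv)| wv * xv).sum();
    }
}

#[cfg(not(target_arch = "aarch64"))]
fn matmul(y: &mut [f32], x: &[f32], w: &[f32], n: usize, d: usize) {
    // On other targets this is where the existing explicit-SIMD path would
    // stay; the same scalar loop is used here only to keep the sketch
    // self-contained.
    for i in 0..d {
        y[i] = w[i * n..(i + 1) * n].iter().zip(x).map(|(&wv, &xv)| wv * xv).sum();
    }
}

fn main() {
    // y = W * x, with W stored row-major as d rows of length n.
    let (n, d) = (4, 2);
    let w = vec![1.0f32; n * d];
    let x = vec![2.0f32; n];
    let mut y = vec![0.0f32; d];
    matmul(&mut y, &x, &w, n, d);
    println!("{:?}", y); // [8.0, 8.0]
}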

Before

❯ cargo build --release --bin chat && ./target/release/chat --model model/llama3.2-3b-it-q40.lmrs --tokenizer model/tokenizer.bin --show-metrics --temperature 0
   Compiling lmrs v0.1.0 (../lm.rs)
    Finished `release` profile [optimized] target(s) in 22.27s

    L      M     M  RRRR    ssss
    L      MM   MM  R   R  s
    L      M M M M  RRRR    sss
    L      M  M  M  R  R       s
    LLLL   M     M  R   R  sssss
    
LMRS version: 4
Model type: LLAMA

Using Q4_0 quantization.
Loading weights...
Done.

You: hello there
Assistant:
Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat for a bit?
Speed: 1.98 tok/s
You: 

After

❯ cargo build --release --bin chat && ./target/release/chat --model model/llama3.2-3b-it-q40.lmrs --tokenizer model/tokenizer.bin --show-metrics --temperature 0
   Compiling lmrs v0.1.0 (../lm.rs)
    Finished `release` profile [optimized] target(s) in 21.72s

    L      M     M  RRRR    ssss
    L      MM   MM  R   R  s
    L      M M M M  RRRR    sss
    L      M  M  M  R  R       s
    LLLL   M     M  R   R  sssss
    
LMRS version: 4
Model type: LLAMA

Using Q4_0 quantization.
Loading weights...
Done.

You: hello there
Assistant:
Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat for a bit?
Speed: 4.67 tok/s
You:

@samuel-vitorino
Owner

Good morning. Thank you for the feedback. Does this happen on the 8-bit quantization models too? On x86, SIMD improves performance significantly on Q8. Anyway, Q4 is still being improved, and I will test on my x86 machine today to see whether I get the same results.

@smpurkis
Author

Good question on Q8, I'll see if it makes a similar difference 👍

@smpurkis
Author
smpurkis commented Oct 16, 2024

On Q8 it gives a smaller improvement: 2.90 t/s -> 3.70 t/s.
Edit: Using the same Llama 3.2 3B model.

Edit 2: Doing the same for the f32 matmul slows it down: 4.2 t/s -> 2.8 t/s, using the Llama 3.2 1B model.

@samuel-vitorino
Owner
samuel-vitorino commented Oct 16, 2024

Without other people testing on arm64 systems, it's hard to say whether adding a target flag for the Q4 and Q8 matmuls is justified. The reason I used the wide crate was that it falls back to normal instructions even when SIMD isn't available (which shouldn't be your case), so I'm not sure what's causing the slowdown.
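For context, the kind of wide-based inner loop we're talking about looks roughly like this (a minimal sketch, not the exact lm.rs code; the dot_wide name and the 8-lane f32 width are illustrative assumptions):

// Minimal sketch of an explicit-SIMD dot product using the `wide` crate,
// assuming 8-lane f32 vectors; the name and exact structure are illustrative.
use wide::f32x8;

fn dot_wide(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 8;
    let mut acc = f32x8::splat(0.0);
    for i in 0..chunks {
        let va = f32x8::from(<[f32; 8]>::try_from(&a[i * 8..i * 8 + 8]).unwrap());
        let vb = f32x8::from(<[f32; 8]>::try_from(&b[i * 8..i * 8 + 8]).unwrap());
        acc += va * vb; // multiply and accumulate across 8 lanes per iteration
    }
    // Horizontal add of the 8 lanes, then a scalar loop for the remainder.
    let mut sum = acc.reduce_add();
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

fn main() {
    let a = vec![1.0f32; 19];
    let b = vec![2.0f32; 19];
    println!("{}", dot_wide(&a, &b)); // 38
}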

@samuel-vitorino
Owner
samuel-vitorino commented Oct 16, 2024

Just to be sure, are you using the RUSTFLAGS="-C target-cpu=native" flag when compiling?
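i.e. something along the lines of:

❯ RUSTFLAGS="-C target-cpu=native" cargo build --release --bin chat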

@smpurkis
Author

I wasn't; however, rerunning all of the above on main and on my branch doesn't change the tokens per second.
