Comparing changes
base repository: jameswdelancey/llama3.c
base: master
head repository: ryao/llama3.c
compare: master
- 18 commits
- 4 files changed
- 1 contributor
Commits on Nov 29, 2024
- Align memory allocations via new calloc_aligned() helper (9522fda)
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Precompute expensive floating point operations (4495f4b)
  Some cosf, sinf, expf, sqrt and floating point division operations that only need to be computed once are repeated for each layer. We precalculate the values outside the layer loop for some minuscule savings. This will matter more when a prompt processing routine is added based on this.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
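  As a hedged illustration of the idea only (the buffer names, loop shape, and the 500000.0f Llama 3 RoPE base are assumptions, not the repository's code): values that depend on the position but not on the layer can be hoisted out of the per-layer loop.

      #include <math.h>

      /* Illustrative only: compute position-dependent trigonometry once per
       * forward pass instead of once per layer. */
      void forward_sketch(int n_layers, int head_size, int pos, float *fcr, float *fci) {
          for (int i = 0; i < head_size; i += 2) {
              float freq = 1.0f / powf(500000.0f, (float)i / (float)head_size);
              float val = pos * freq;
              fcr[i / 2] = cosf(val);   /* rotation cosines, shared by every layer */
              fci[i / 2] = sinf(val);   /* rotation sines, shared by every layer */
          }
          for (int l = 0; l < n_layers; l++) {
              /* ... per-layer work now just reads fcr[]/fci[] ... */
              (void)l;
          }
      }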
- Implement AVX2 optimized matmul() (ec54b3a)
  The matmul() function is really a simplified sgemv rather than the sgemm that you would expect. This version gives us about 30% faster token generation on a Ryzen 7 5800X and outperforms both Intel's and OpenBLAS' cblas_sgemv() implementations. It relies on the CPU's hardware prefetcher rather than manual prefetching, and the main innovation that enables its high performance is an extremely efficient horizontal addition: instead of using 5 AVX2 instructions for a single horizontal addition or switching to scalar arithmetic, it uses 11 AVX2 instructions to perform 8 horizontal additions in parallel. The end result is a single vector containing the results of 8 horizontal additions that we can write to main memory directly.
  The function assumes 32-byte memory alignment, which the earlier calloc_aligned() patch gives us. It has handling for matrix/vector dimensions that are not multiples of 8, but in practice all matrices and vectors passed to it by llama3.c have dimensions that are multiples of 8, so that handling is unnecessary and is written in plain C rather than intrinsics, since doing performance work on effectively dead code gives zero benefit. The code to handle non-multiples of 8 is included only for completeness and is untested.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Add initial OpenBLAS and MKL support (c900416)
  Passing -DUSE_BLAS_SGEMV will switch matmul() to use cblas_sgemv() and include either the mkl.h or cblas.h header depending on whether USE_MKL is specified. Passing -DUSE_MKL will include the mkl.h header. Passing -DUSE_OPENBLAS will include the cblas.h header. Linking must be done against either OpenBLAS or the MKL. Coincidentally, even if the cblas.h header is used, linking against the MKL still works, as they are ABI compatible.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Add cblas_sgemm_batch() shim for future use (3cc4fd6)
  The Intel MKL's cblas_sgemm_batch() is more performant than making multiple separate cblas_sgemm() calls, so I want to be able to use it, but doing so would break OpenBLAS compatibility unless an extremely recent version of OpenBLAS is used, so I am adding a compatibility shim.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Implement matrix_multiply() and batched_matrix_multiply() (436d4e8)
  These will be used later. We have both BLAS and non-BLAS versions. The non-BLAS versions are untested at the moment and are not expected to perform well.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Add prompt processing; non-chat only for now (58a0e7c)
  This also changes the timing code to properly measure the performance.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Use cblas_sgemm_batch() for attention calculations (aabe715)
  Note that I am not actually sure why the second cblas_sgemm_batch() works, as GEMM should not be able to calculate what is being calculated here, but the output is the same in testing and performance increases substantially, so I am going with it for now. I will research it more later.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Use batched_matrix_multiply() in token generation (148954e)
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- (ee001ec)
  This is really slow and is not intended to be used right now. Future work will hopefully make it faster so we no longer need the MKL.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Update README with fork information (e6f718c)
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Commits on Dec 2, 2024
- Replace part of attention loop in prompt processing using SGEMM (eba7c40)
  Rather than multiplying a vector by a matrix for each position/head combination, we can multiply a matrix by a matrix for each head. This is much faster because it is not memory bandwidth bound. This leaves some room on the table for future optimization, as we are technically multiplying a rectangular matrix by a matrix and about half of the calculations are unnecessary.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
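  A hedged sketch of the per-head matrix-matrix form described above, assuming Q_head and K_head are each contiguous as T x head_size row-major blocks (the repository's actual buffers may differ). The full T x T product includes the upper triangle that causal masking discards, which is the roughly half of the work the message calls unnecessary.

      #include <cblas.h>

      /* scores (T x T) = Q_head (T x head_size) * K_head^T (head_size x T),
       * computed once per head instead of one SGEMV per (position, head). */
      void head_scores_pp(float *scores, const float *q_head, const float *k_head,
                          int T, int head_size) {
          cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                      T, T, head_size,
                      1.0f, q_head, head_size,
                      k_head, head_size,
                      0.0f, scores, T);
      }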
Commits on Dec 3, 2024
- Turn last remaining SGEMV calculation in pp into SGEMM calculation (df0d63c)
  We also tighten memory usage slightly.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Commits on Dec 9, 2024
- Use cosine/sine addition identities in RoPE calculations (0acbe35)
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
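  One way to apply the identities (a hedged sketch; the repository's exact formulation is not shown in this comparison): for a fixed frequency, the RoPE angle grows by a constant increment per position, so the cos/sin pair at position p+1 follows from position p without calling cosf()/sinf() again.

      #include <math.h>

      /* cos((p+1)t) = cos(pt)cos(t) - sin(pt)sin(t)
       * sin((p+1)t) = sin(pt)cos(t) + cos(pt)sin(t)
       * so one cosf/sinf pair per frequency seeds the whole position sweep. */
      void rope_angles_sketch(float *cos_out, float *sin_out, int n_pos, float theta) {
          float c_step = cosf(theta), s_step = sinf(theta);
          float c = 1.0f, s = 0.0f;           /* position 0: cos(0), sin(0) */
          for (int p = 0; p < n_pos; p++) {
              cos_out[p] = c;
              sin_out[p] = s;
              float c_next = c * c_step - s * s_step;
              float s_next = s * c_step + c * s_step;
              c = c_next;
              s = s_next;
          }
      }

  Rounding error accumulates slowly as the recurrence advances, which is usually acceptable in fp32 but worth keeping in mind.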
Commits on Dec 19, 2024
- Use multiplication in place of powf() in RoPE calculations when possible (83b6335)
  We use basic mathematical identities: powf() is called twice, and subsequent powf() results are obtained by multiplication.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
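  The identity in play is presumably base^(-(i+2)/d) = base^(-i/d) * base^(-2/d), so one powf() computes the constant ratio and a running product replaces the rest (a hedged sketch; the loop structure is assumed, and the second powf() the message mentions may seed a value not shown here):

      #include <math.h>

      /* freq(i) = base^(-i/head_size) for i = 0, 2, 4, ...; each step multiplies
       * by the constant ratio base^(-2/head_size) instead of calling powf() again. */
      void rope_freqs_sketch(float *freq, int head_size, float base) {
          float ratio = powf(base, -2.0f / (float)head_size);  /* one powf() for the step */
          float f = 1.0f;                                      /* base^0 */
          for (int i = 0; i < head_size; i += 2) {
              freq[i / 2] = f;
              f *= ratio;
          }
      }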
- Add WIP CUDA version rung.c/rung.cu (dc27f2b)
  We use it this way:
    nvcc -ptx -arch=sm_86 rung.cu -o rung.ptx
    gcc -I/opt/cuda/targets/x86_64-linux/include -L/opt/cuda/lib64 -O3 -g -lcuda -lcublas -lcudart -lm -lmvec -o rung rung.c
    ./rung "llama3_8b_base.bin" -z "tokenizer.bin" -t 0 -i "Once upon a time"
  This is being pushed early to share with others in the community.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Commits on Dec 20, 2024
- rung.c: Remove cudaMemcpyAsync() from forward() (a1f366a)
  I read that CUDA calls are asynchronous but are executed serially within a stream, and that cudaMemcpy() is a synchronization point. This means that copying asynchronously in a separate stream could allow a subsequent iteration of the forward function to overwrite the array before it has been used, which is undefined behavior that would cause incorrect calculation results. Until I am sure about the safety, I am reverting this. The use of the asynchronous function in a separate CUDA stream did not make a noticeable difference in performance in the first place, so this change probably will not be noticeable in any benchmark.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
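  The concern above is the standard CUDA stream-ordering rule: work is serialized within a stream, but a copy issued asynchronously on a side stream is only safe if someone waits on it before the buffer is reused. A hedged host-side sketch of the safe pattern using only CUDA runtime calls (the buffer names are assumptions, not rung.c's):

      #include <cuda_runtime.h>

      /* Safe use of cudaMemcpyAsync() on a side stream: synchronize (or record
       * an event and wait on it) before the next forward() iteration may
       * overwrite d_logits; otherwise the copy races with the overwrite. */
      void copy_logits(float *h_logits, const float *d_logits, size_t n, cudaStream_t copy_stream) {
          cudaMemcpyAsync(h_logits, d_logits, n * sizeof(float),
                          cudaMemcpyDeviceToHost, copy_stream);
          cudaStreamSynchronize(copy_stream);  /* equivalent in effect to a plain cudaMemcpy() here */
      }

  Synchronizing immediately gives up the overlap, which is consistent with the observation above that the async path made no measurable difference in the first place.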
- rung.c: Improve and cleanup fp32_to_bf16_array() (89417a5)
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
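  The function itself is not shown in this comparison. A common fp32-to-bf16 conversion, with round-to-nearest-even applied to the 16 bits being dropped (a sketch of the general technique, not necessarily what rung.c does; NaN payloads are not special-cased):

      #include <stdint.h>
      #include <string.h>

      /* Convert fp32 values to bf16 (stored as uint16_t) by keeping the top 16
       * bits of the IEEE-754 encoding, rounding to nearest even. */
      void fp32_to_bf16_array_sketch(uint16_t *dst, const float *src, size_t n) {
          for (size_t i = 0; i < n; i++) {
              uint32_t bits;
              memcpy(&bits, &src[i], sizeof(bits));          /* type-pun safely */
              uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
              dst[i] = (uint16_t)((bits + rounding) >> 16);
          }
      }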