Comparing changes
base repository: jameswdelancey/llama3.c
base: master
head repository: ryao/llama3.c
compare: master
- 18 commits
- 4 files changed
- 1 contributor
Commits on Nov 29, 2024
- Align memory allocations via new calloc_aligned() helper (9522fda)
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Precompute expensive floating point operations (4495f4b)
  Some cosf, sinf, expf, sqrt and floating point division operations that only need to be computed once are repeated for each layer. We precalculate the values outside the layer loop for some minuscule savings. This will matter more when a prompt processing routine is added based on this.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
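  As a hedged illustration of the idea only (the buffer names, loop shape, and the 500000.0f Llama 3 RoPE base are assumptions, not the repository's code): values that depend on the position but not on the layer can be hoisted out of the per-layer loop.

      #include <math.h>

      /* Illustrative only: compute position-dependent trigonometry once per
       * forward pass instead of once per layer. */
      void forward_sketch(int n_layers, int head_size, int pos, float *fcr, float *fci) {
          for (int i = 0; i < head_size; i += 2) {
              float freq = 1.0f / powf(500000.0f, (float)i / (float)head_size);
              float val = pos * freq;
              fcr[i / 2] = cosf(val);   /* rotation cosines, shared by every layer */
              fci[i / 2] = sinf(val);   /* rotation sines, shared by every layer */
          }
          for (int l = 0; l < n_layers; l++) {
              /* ... per-layer work now just reads fcr[]/fci[] ... */
              (void)l;
          }
      }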
- Implement AVX2 optimized matmul() (ec54b3a)
  The matmul() function is really a simplified sgemv rather than the sgemm that you would expect. This version gives us about 30% faster token generation on a Ryzen 7 5800X and outperforms both Intel's and OpenBLAS' cblas_sgemv() implementations. It relies on the CPU's hardware prefetcher rather than manual prefetching, and the main innovation that enables its high performance is an extremely efficient horizontal addition: instead of using 5 AVX2 instructions for a single horizontal addition or switching to scalar arithmetic, it uses 11 AVX2 instructions to perform 8 horizontal additions in parallel. The end result is a single vector containing the results of 8 horizontal additions that we can write to main memory directly.
  The function assumes 32-byte memory alignment, which the earlier calloc_aligned() patch gives us. It has handling for matrix/vector dimensions that are not multiples of 8, but in practice all matrices and vectors passed to it by llama3.c have dimensions that are multiples of 8, so that handling is unnecessary and is written in plain C rather than intrinsics, since doing performance work on effectively dead code gives zero benefit. The code to handle non-multiples of 8 is included only for completeness and is untested.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Add initial OpenBLAS and MKL support (c900416)
  Passing -DUSE_BLAS_SGEMV will switch matmul() to use cblas_sgemv() and include either the mkl.h or cblas.h header depending on whether USE_MKL is specified. Passing -DUSE_MKL will include the mkl.h header. Passing -DUSE_OPENBLAS will include the cblas.h header. Linking must be done against either OpenBLAS or the MKL. Coincidentally, even if the cblas.h header is used, linking against the MKL still works, as they are ABI compatible.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Add cblas_sgemm_batch() shim for future use (3cc4fd6)
  The Intel MKL's cblas_sgemm_batch() is more performant than making multiple separate cblas_sgemm() calls, so I want to be able to use it, but doing so would break OpenBLAS compatibility unless an extremely recent version of OpenBLAS is used, so I am adding a compatibility shim.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Implement matrix_multiply() and batched_matrix_multiply() (436d4e8)
  These will be used later. We have both BLAS and non-BLAS versions. The non-BLAS versions are untested at the moment and are not expected to perform well.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Add prompt processing; non-chat only for now (58a0e7c)
  This also changes the timing code to properly measure the performance.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Use cblas_sgemm_batch() for attention calculations (aabe715)
  Note that I am not actually sure why the second cblas_sgemm_batch() works, as GEMM should not be able to calculate what is being calculated here, but the output is the same in testing and performance increases substantially, so I am going with it for now. I will research it more later.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Use batched_matrix_multiply() in token generation (148954e)
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- (ee001ec)
  This is really slow and is not intended to be used right now. Future work will hopefully make it faster so we no longer need the MKL.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
- Update README with fork information (e6f718c)
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Commits on Dec 2, 2024
- Replace part of attention loop in prompt processing using SGEMM (eba7c40)
  Rather than multiplying a vector by a matrix for each position/head combination, we can multiply a matrix by a matrix for each head. This is much faster because it is not memory bandwidth bound. This leaves some room on the table for future optimization, as we are technically multiplying a rectangular matrix by a matrix and about half of the calculations are unnecessary.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
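  A hedged sketch of the per-head matrix-matrix form described above, assuming Q_head and K_head are each contiguous as T x head_size row-major blocks (the repository's actual buffers may differ). The full T x T product includes the upper triangle that causal masking discards, which is the roughly half of the work the message calls unnecessary.

      #include <cblas.h>

      /* scores (T x T) = Q_head (T x head_size) * K_head^T (head_size x T),
       * computed once per head instead of one SGEMV per (position, head). */
      void head_scores_pp(float *scores, const float *q_head, const float *k_head,
                          int T, int head_size) {
          cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                      T, T, head_size,
                      1.0f, q_head, head_size,
                      k_head, head_size,
                      0.0f, scores, T);
      }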
Commits on Dec 3, 2024
- Turn last remaining SGEMV calculation in pp into SGEMM calculation (df0d63c)
  We also tighten memory usage slightly.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Commits on Dec 9, 2024
- Use cosine/sine addition identities in RoPE calculations (0acbe35)
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
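  One way to apply the identities (a hedged sketch; the repository's exact formulation is not shown in this comparison): for a fixed frequency, the RoPE angle grows by a constant increment per position, so the cos/sin pair at position p+1 follows from position p without calling cosf()/sinf() again.

      #include <math.h>

      /* cos((p+1)t) = cos(pt)cos(t) - sin(pt)sin(t)
       * sin((p+1)t) = sin(pt)cos(t) + cos(pt)sin(t)
       * so one cosf/sinf pair per frequency seeds the whole position sweep. */
      void rope_angles_sketch(float *cos_out, float *sin_out, int n_pos, float theta) {
          float c_step = cosf(theta), s_step = sinf(theta);
          float c = 1.0f, s = 0.0f;           /* position 0: cos(0), sin(0) */
          for (int p = 0; p < n_pos; p++) {
              cos_out[p] = c;
              sin_out[p] = s;
              float c_next = c * c_step - s * s_step;
              float s_next = s * c_step + c * s_step;
              c = c_next;
              s = s_next;
          }
      }

  Rounding error accumulates slowly as the recurrence advances, which is usually acceptable in fp32 but worth keeping in mind.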
Commits on Dec 19, 2024
- Use multiplication in place of powf() in RoPE calculations when possible (83b6335)
  We use basic mathematical identities: powf() is called twice, and subsequent powf() results are obtained by multiplication.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
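  The identity in play is presumably base^(-(i+2)/d) = base^(-i/d) * base^(-2/d), so one powf() computes the constant ratio and a running product replaces the rest (a hedged sketch; the loop structure is assumed, and the second powf() the message mentions may seed a value not shown here):

      #include <math.h>

      /* freq(i) = base^(-i/head_size) for i = 0, 2, 4, ...; each step multiplies
       * by the constant ratio base^(-2/head_size) instead of calling powf() again. */
      void rope_freqs_sketch(float *freq, int head_size, float base) {
          float ratio = powf(base, -2.0f / (float)head_size);  /* one powf() for the step */
          float f = 1.0f;                                      /* base^0 */
          for (int i = 0; i < head_size; i += 2) {
              freq[i / 2] = f;
              f *= ratio;
          }
      }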
- Add WIP CUDA version rung.c/rung.cu (dc27f2b)
  We use it this way:
    nvcc -ptx -arch=sm_86 rung.cu -o rung.ptx
    gcc -I/opt/cuda/targets/x86_64-linux/include -L/opt/cuda/lib64 -O3 -g -lcuda -lcublas -lcudart -lm -lmvec -o rung rung.c
    ./rung "llama3_8b_base.bin" -z "tokenizer.bin" -t 0 -i "Once upon a time"
  This is being pushed early to share with others in the community.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Commits on Dec 20, 2024
- rung.c: Remove cudaMemcpyAsync() from forward() (a1f366a)
  I read that CUDA calls are asynchronous but are executed serially within a stream, and that cudaMemcpy() is a synchronization point. This means that copying asynchronously in a separate stream could allow a subsequent iteration of the forward function to overwrite the array before it has been used, which is undefined behavior that would cause incorrect calculation results. Until I am sure about the safety, I am reverting this. The use of the asynchronous function in a separate CUDA stream did not make a noticeable difference in performance in the first place, so this change probably will not be noticeable in any benchmark.
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
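  The concern above is the standard CUDA stream-ordering rule: work is serialized within a stream, but a copy issued asynchronously on a side stream is only safe if someone waits on it before the buffer is reused. A hedged host-side sketch of the safe pattern using only CUDA runtime calls (the buffer names are assumptions, not rung.c's):

      #include <cuda_runtime.h>

      /* Safe use of cudaMemcpyAsync() on a side stream: synchronize (or record
       * an event and wait on it) before the next forward() iteration may
       * overwrite d_logits; otherwise the copy races with the overwrite. */
      void copy_logits(float *h_logits, const float *d_logits, size_t n, cudaStream_t copy_stream) {
          cudaMemcpyAsync(h_logits, d_logits, n * sizeof(float),
                          cudaMemcpyDeviceToHost, copy_stream);
          cudaStreamSynchronize(copy_stream);  /* equivalent in effect to a plain cudaMemcpy() here */
      }

  Synchronizing immediately gives up the overlap, which is consistent with the observation above that the async path made no measurable difference in the first place.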
- rung.c: Improve and cleanup fp32_to_bf16_array() (89417a5)
  Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
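  The function itself is not shown in this comparison. A common fp32-to-bf16 conversion, with round-to-nearest-even applied to the 16 bits being dropped (a sketch of the general technique, not necessarily what rung.c does; NaN payloads are not special-cased):

      #include <stdint.h>
      #include <string.h>

      /* Convert fp32 values to bf16 (stored as uint16_t) by keeping the top 16
       * bits of the IEEE-754 encoding, rounding to nearest even. */
      void fp32_to_bf16_array_sketch(uint16_t *dst, const float *src, size_t n) {
          for (size_t i = 0; i < n; i++) {
              uint32_t bits;
              memcpy(&bits, &src[i], sizeof(bits));          /* type-pun safely */
              uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
              dst[i] = (uint16_t)((bits + rounding) >> 16);
          }
      }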