Comparing jameswdelancey:master...ryao:master · jameswdelancey/llama3.c · GitHub

Comparing changes

base repository: jameswdelancey/llama3.c
base: master
head repository: ryao/llama3.c
compare: master
  • 18 commits
  • 4 files changed
  • 1 contributor

Commits on Nov 29, 2024

  1. Align memory allocations via new calloc_aligned() helper

    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Nov 29, 2024 · 9522fda
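    As an illustration, a minimal calloc_aligned() can be built on C11 aligned_alloc(); the signature and alignment handling below are assumptions for the sketch, not necessarily what the commit does.

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical sketch, not the commit's code: zeroed, aligned allocation. */
    static void *calloc_aligned(size_t nmemb, size_t size, size_t alignment)
    {
        size_t bytes = nmemb * size; /* caller is assumed not to overflow */
        /* aligned_alloc() requires the size to be a multiple of the alignment. */
        size_t padded = (bytes + alignment - 1) / alignment * alignment;
        void *p = aligned_alloc(alignment, padded);
        if (p != NULL)
            memset(p, 0, padded); /* calloc semantics: zero-fill */
        return p;
    }

    The 32-byte alignment matters for the later AVX2 matmul() patch, whose aligned loads and stores require it.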
  2. Precompute expensive floating point operations

    Some cosf, sinf, expf, sqrt and floating-point division operations that
    only need to be computed once are repeated for each layer. We
    precalculate those values outside the layer loop for a minuscule saving.
    This will matter more once a prompt processing routine is added on top of
    this.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Nov 29, 2024 · 4495f4b
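    A minimal sketch of the idea, with illustrative names rather than the actual llama3.c variables:

    #include <math.h>

    /* Hypothetical sketch: layer-invariant values are hoisted out of the layer loop. */
    static void forward_sketch(int n_layers, int head_size, float rope_angle)
    {
        float inv_sqrt_head = 1.0f / sqrtf((float)head_size); /* attention scale */
        float fcr = cosf(rope_angle);                          /* RoPE cosine */
        float fci = sinf(rope_angle);                          /* RoPE sine */
        for (int l = 0; l < n_layers; l++) {
            /* ... per-layer work reuses inv_sqrt_head, fcr and fci ... */
            (void)inv_sqrt_head; (void)fcr; (void)fci;
        }
    }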
  3. Implement AVX2 optimized matmul()

    The matmul() function is really a simplified sgemv rather than the sgemm
    that you would expect. This version gives us about 30% faster token
    generation on a Ryzen 7 5800X and outperforms both Intel MKL's and
    OpenBLAS' cblas_sgemv() functions. It relies on the CPU's hardware
    prefetcher rather than manual prefetching, and the main innovation that
    enables its high performance is an extremely efficient horizontal
    addition.
    
    Instead of using 5 AVX2 instructions per horizontal addition or falling
    back to scalar arithmetic, it uses 11 AVX2 instructions to perform 8
    horizontal additions in parallel. The end result is a single vector
    containing the results of 8 horizontal additions, which we can then write
    to main memory directly.
    
    The function assumes 32-byte memory alignment, which the earlier
    calloc_aligned() patch gives us. It has handling for matrix/vector
    dimensions that are not multiples of 8, but in practice all matrices and
    vectors passed to it by llama3.c have dimensions that are multiples of 8,
    so that handling is unnecessary and is written in plain C rather than
    intrinsics, since doing performance work on effectively dead code gives
    zero benefit. The non-multiple-of-8 path is included only for
    completeness and is untested.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Nov 29, 2024 · ec54b3a
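    The sketch below shows one way such a kernel can be structured: 8 output rows are accumulated in 8 AVX2 registers, then combined with a parallel horizontal reduction. The reduction here uses 9 intrinsics (6 hadds, 2 lane permutes, 1 add) rather than the commit's exact 11-instruction sequence, and it additionally assumes FMA support; treat it as illustrative, not as the commit's code.

    #include <stddef.h>
    #include <immintrin.h>

    /* Hypothetical AVX2 sgemv sketch: xout = W @ x, W is (d, n) row-major, d and n
     * multiples of 8, all pointers 32-byte aligned (see calloc_aligned()). */
    static void matmul_avx2_sketch(float *xout, const float *x, const float *w, int n, int d)
    {
        for (int i = 0; i < d; i += 8) {
            __m256 acc[8];
            for (int r = 0; r < 8; r++)
                acc[r] = _mm256_setzero_ps();
            for (int j = 0; j < n; j += 8) {
                __m256 xv = _mm256_load_ps(x + j);
                for (int r = 0; r < 8; r++)
                    acc[r] = _mm256_fmadd_ps(_mm256_load_ps(w + (size_t)(i + r) * n + j), xv, acc[r]);
            }
            /* 8 horizontal additions in parallel: one sum per accumulator. */
            __m256 t0 = _mm256_hadd_ps(acc[0], acc[1]);
            __m256 t1 = _mm256_hadd_ps(acc[2], acc[3]);
            __m256 t2 = _mm256_hadd_ps(acc[4], acc[5]);
            __m256 t3 = _mm256_hadd_ps(acc[6], acc[7]);
            __m256 s0 = _mm256_hadd_ps(t0, t1);
            __m256 s1 = _mm256_hadd_ps(t2, t3);
            __m256 lo = _mm256_permute2f128_ps(s0, s1, 0x20); /* low lanes of s0 and s1 */
            __m256 hi = _mm256_permute2f128_ps(s0, s1, 0x31); /* high lanes of s0 and s1 */
            _mm256_store_ps(xout + i, _mm256_add_ps(lo, hi)); /* 8 results written directly */
        }
    }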
  4. Add initial OpenBLAS and MKL support

    Passing -DUSE_BLAS_SGEMV will switch matmul() to use cblas_sgemv() and
    include either the mkl.h or cblas.h header, depending on whether USE_MKL
    is specified.
    
    Passing -DUSE_MKL will include the mkl.h header.
    
    Passing -DUSE_OPENBLAS will include the cblas.h header.
    
    Linking must be done against either OpenBLAS or the MKL. Coincidentally,
    even if the cblas.h header is used, linking against the MKL still works,
    as the two are ABI compatible.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Nov 29, 2024 · c900416
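    A sketch of how that selection can be wired up; the macro names come from the commit message, while the surrounding code is illustrative only.

    #ifdef USE_BLAS_SGEMV
    #ifdef USE_MKL
    #include <mkl.h>
    #else
    #include <cblas.h>
    #endif

    /* matmul(): xout = W @ x, with W of shape (d, n), row-major. */
    static void matmul(float *xout, const float *x, const float *w, int n, int d)
    {
        cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n, 1.0f, w, n, x, 1, 0.0f, xout, 1);
    }
    #endif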
  5. Add cblas_sgemm_batch() shim for future use

    The Intel MKL's cblas_sgemm_batch() is more performant than making
    multiple cblas_sgemm() calls, so I want to be able to use it. That would
    break OpenBLAS compatibility unless an extremely recent version of
    OpenBLAS is used, so I am adding a compatibility shim.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Nov 29, 2024 · 3cc4fd6
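    Such a shim can simply loop over the groups and call cblas_sgemm() once per matrix. The sketch below follows the MKL parameter layout but simplifies the integer types and assumes row-major data; it is illustrative, not the commit's code.

    #include <cblas.h>

    /* Hypothetical fallback for BLAS libraries that lack cblas_sgemm_batch(). */
    static void sgemm_batch_shim(const CBLAS_TRANSPOSE *transa, const CBLAS_TRANSPOSE *transb,
                                 const int *m, const int *n, const int *k,
                                 const float *alpha, const float **a, const int *lda,
                                 const float **b, const int *ldb,
                                 const float *beta, float **c, const int *ldc,
                                 int group_count, const int *group_size)
    {
        int idx = 0; /* flat index into the per-matrix a/b/c arrays */
        for (int g = 0; g < group_count; g++)
            for (int i = 0; i < group_size[g]; i++, idx++)
                cblas_sgemm(CblasRowMajor, transa[g], transb[g], m[g], n[g], k[g],
                            alpha[g], a[idx], lda[g], b[idx], ldb[g],
                            beta[g], c[idx], ldc[g]);
    }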
  6. Implement matrix_multiply() and batched_matrix_multiply()

    These will be used later. We have both BLAS and non-BLAS versions. The
    non-BLAS versions are untested at the moment and are not expected to
    perform well.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Nov 29, 2024 · 436d4e8
  7. Add prompt processing; non-chat only for now

    This also changes the timing code to properly measure the performance.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Nov 29, 2024 · 58a0e7c
  8. Use cblas_sgemm_batch() for attention calculations

    Note that I am not actually sure why the second cblas_sgemm_batch() call
    works, as gemm should not be able to compute what is being computed here,
    but the output is identical in testing and performance increases
    substantially, so I am going with it for now. I will research it further
    later.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Nov 29, 2024 · aabe715
  9. Use batched_matrix_multiply() in token generation

    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Nov 29, 2024 · 148954e
  10. Add generic cblas_sgemm()

    This is really slow and is not intended to be used right now. Future
    work will hopefully make it faster so we no longer need the MKL.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Nov 29, 2024 · ee001ec
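    For reference, the core of a naive generic sgemm looks like the sketch below (row-major, no transposition handled); a full cblas_sgemm() replacement would also have to honor the layout and transpose flags.

    /* Naive reference GEMM: C = alpha * A @ B + beta * C. Illustrative only;
     * the commit's generic implementation may differ. */
    static void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, int lda,
                            const float *B, int ldb,
                            float beta, float *C, int ldc)
    {
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) {
                float sum = 0.0f;
                for (int p = 0; p < K; p++)
                    sum += A[i * lda + p] * B[p * ldb + j];
                C[i * ldc + j] = alpha * sum + beta * C[i * ldc + j];
            }
        }
    }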
  11. Update README with fork information

    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Nov 29, 2024 · e6f718c

Commits on Dec 2, 2024

  1. Replace part of attention loop in prompt processing using SGEMM

    Rather than multiplying a vector by a matrix for each position/head
    combination, we can multiply a matrix by a matrix for each head. This is
    much faster because it is not memory-bandwidth bound.
    
    This leaves some room on the table for future optimization, as we are
    technically multiplying a full rectangular matrix by a matrix; about half
    of the calculations are unnecessary.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Dec 2, 2024 · eba7c40
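    The shape of the change, as a hedged sketch (the buffer layout here is an assumption, not llama3.c's actual key cache layout): one cblas_sgemm() per head computes the scores for all positions at once, instead of one cblas_sgemv() per position/head pair.

    #include <cblas.h>

    /* scores = scale * Q @ K^T for one head; Q and K are (n_pos, head_dim), row-major. */
    static void attention_scores_sketch(float *scores, const float *q, const float *k,
                                        int n_pos, int head_dim, float scale)
    {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    n_pos, n_pos, head_dim,
                    scale, q, head_dim, k, head_dim,
                    0.0f, scores, n_pos);
    }

    Roughly half of that n_pos x n_pos product lies above the causal mask, which is presumably the unnecessary work mentioned above.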

Commits on Dec 3, 2024

  1. Turn last remaining SGEMV calculation in pp into SGEMM calculation

    We also tighten memory usage slightly.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Dec 3, 2024 · df0d63c

Commits on Dec 9, 2024

  1. Use cosine/sine addition identities in RoPE calculations

    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Dec 9, 2024 · 0acbe35
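    A minimal sketch of the identity being exploited (illustrative names, not the commit's code): for a fixed frequency, the angle at position p + 1 follows from the angle at position p via the addition identities, so cosf()/sinf() are called once per frequency instead of once per position.

    #include <math.h>

    /* Fill per-position cos/sin tables for one RoPE frequency using
     * cos(a + d) = cos(a)cos(d) - sin(a)sin(d)
     * sin(a + d) = sin(a)cos(d) + cos(a)sin(d) */
    static void rope_tables_for_freq(float *cos_tab, float *sin_tab, int n_pos, float freq)
    {
        float cos_d = cosf(freq), sin_d = sinf(freq); /* one position step */
        float c = 1.0f, s = 0.0f;                     /* angle 0 at position 0 */
        for (int p = 0; p < n_pos; p++) {
            cos_tab[p] = c;
            sin_tab[p] = s;
            float c_next = c * cos_d - s * sin_d;
            float s_next = s * cos_d + c * sin_d;
            c = c_next;
            s = s_next;
        }
    }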

Commits on Dec 19, 2024

  1. Use multiplication in place of powf() in RoPE calculations when possible

    We use basic mathematical identities to call powf() only twice and then
    obtain subsequent powf() values by multiplication.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Dec 19, 2024 · 83b6335
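    A sketch of the identity, with assumed details (the actual frequency formula in llama3.c may be parameterized differently): since freq_i = powf(theta, -2.0f * i / head_dim), consecutive frequencies differ by the constant ratio powf(theta, -2.0f / head_dim), so one or two powf() calls suffice and the rest become multiplications.

    #include <math.h>

    /* Hypothetical sketch: RoPE frequencies via a running product instead of powf() per pair. */
    static void rope_freqs_sketch(float *freqs, int half_dim, float theta, int head_dim)
    {
        float ratio = powf(theta, -2.0f / (float)head_dim); /* constant step between pairs */
        float f = 1.0f;                                     /* theta^0 for the first pair */
        for (int i = 0; i < half_dim; i++) {
            freqs[i] = f;
            f *= ratio; /* replaces powf(theta, -2.0f * (i + 1) / head_dim) */
        }
    }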
  2. Add WIP CUDA version rung.c/rung.cu

    We use it this way:
    
    nvcc -ptx -arch=sm_86 rung.cu -o rung.ptx
    
    gcc -I/opt/cuda/targets/x86_64-linux/include -L/opt/cuda/lib64 -O3 -g -lcuda -lcublas -lcudart -lm -lmvec -o rung rung.c
    
    ./rung "llama3_8b_base.bin" -z "tokenizer.bin" -t 0 -i "Once upon a time"
    
    This is being pushed early to share with others in the community.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Dec 19, 2024 · dc27f2b

Commits on Dec 20, 2024

  1. rung.c: Remove cudaMemcpyAsync() from forward()

    I read that CUDA calls are asynchronous but are executed serially within
    a stream, and that cudaMemcpy() is a synchronization point. This means
    that copying asynchronously in a separate stream could allow a subsequent
    iteration of the forward function to overwrite the array before it has
    been used, which is undefined behavior that would cause incorrect
    calculation results. Until I am sure about the safety, I am reverting
    this. The asynchronous copy in a separate CUDA stream did not make a
    noticeable difference in performance in the first place, so the change
    probably will not be visible in any benchmark.
    
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Dec 20, 2024 · a1f366a
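    A minimal illustration of the pattern being kept (buffer names are hypothetical): the plain cudaMemcpy() does not return until the source buffer may be reused, so the next forward() iteration cannot race with the copy.

    #include <cuda_runtime.h>

    /* Hypothetical sketch: upload activations with an implicitly synchronizing copy
     * rather than cudaMemcpyAsync() on a separate stream. */
    static void upload_activations(float *d_x, const float *h_x, size_t bytes)
    {
        /* After cudaMemcpy() returns, h_x can safely be overwritten by the caller. */
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    }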
  2. rung.c: Improve and clean up fp32_to_bf16_array()

    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    ryao committed Dec 20, 2024 · 89417a5