This statement is the "Can it run Doom" equivalent for AI.
This seems like an exercise in futility, but it nonetheless benchmarks the progress of quantization techniques, as well as the portability of AI models to edge devices.
- Update the package lists and install git:
sudo apt update && sudo apt install git
- Clone the llama.cpp repository:
git clone https://github.com/ggerganov/llama.cpp
- Install the required Python packages:
python3 -m pip install torch numpy sentencepiece
- If the package installation doesn't go through (an "externally managed environment" error), remove the marker file first:
rm /usr/lib/python3.11/EXTERNALLY-MANAGED
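If you'd rather not delete that marker (it exists to keep pip from touching the system Python), a virtual environment gets you the same result; the environment path below is just an example:
python3 -m venv ~/llama-venv && source ~/llama-venv/bin/activate
python3 -m pip install torch numpy sentencepiece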
- Install essential build tools:
sudo apt install g++ build-essential
- Navigate to the cloned directory and build:
cd llama.cpp && make
Note: This only works on Linux, not macOS. Edit the MODEL_SIZE line to choose the model size you want (e.g., 7B, 13B); otherwise, it will download all of them.
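For reference, that edit is a one-line change in the weight-download script; the filename and exact layout below are assumptions, so adjust them to whatever your copy of the script looks like:
# in the download script (e.g. download.sh; filename assumed), before running it:
MODEL_SIZE="7B"   # fetch only the 7B weights instead of every size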
- Convert the 7B model to GGML FP16 format:
python3 convert-pth-to-ggml.py models/7B/ 1
Note: Ensure the tokenizer.model and consolidated.00.pth file(s) are present.
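About the trailing 1 in the convert command above: as far as I recall, it selects the output float type, with 0 meaning float32 and 1 meaning float16, so check the script's usage line if your copy differs:
# assumed usage: convert-pth-to-ggml.py <model dir> <ftype>, where 0 = float32 and 1 = float16
python3 convert-pth-to-ggml.py models/7B/ 1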
- Quantize the model to 4 bits using the q4_0 method. You can also try using the q3_0 method for better performance on the Raspberry Pi.
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
Read more about quantization here.
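To get a rough sense of what 4-bit quantization buys you, compare the two files afterwards; the sizes in the comments are approximate figures for the 7B model, not exact values:
ls -lh ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin
# ggml-model-f16.bin   roughly 13 GB  (16-bit weights)
# ggml-model-q4_0.bin  roughly 4 GB   (4-bit weights plus per-block scales)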
- Run the inference in interactive mode, or converse with Bob using the bundled examples:
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
Note: Bob is cool. Ask him nicely, and he'll tell you whether you're using the full 7B model or the quantized one.
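If you want to drive the Bob persona directly from ./main rather than through the script, the llama.cpp README documents an interactive invocation along these lines (flags occasionally change between versions, so check ./main --help):
./main -m ./models/7B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt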
- Test the chat interface:
./examples/chat.sh
- Transfer the generated files in the /7B folder (excluding the pre-transformed files) to the Raspberry Pi. Place these files in the same directory structure, llama.cpp/models/7B/, and then simply run:
./examples/chat.sh
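If you're copying over the network, plain scp does the job; the username, hostname, and destination path below are placeholders, so substitute your own:
scp ./models/7B/ggml-model-q4_0.bin pi@raspberrypi.local:~/llama.cpp/models/7B/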
Happy tinkering!
Final Verdict - It's not very good, but this is nonetheless the right type of benchmark to run. Next steps: I'll be buying 128 GB+ of RAM to play with bigger models for the time being.