Description: This repo looks at different techniques that allow LLMs (large language models, i.e. language models with over 1 billion parameters) to be scaled to run and train on consumer grade hardware.
- In this case, consumer grade hardware refers to desktops with 8 to 32 GB of system memory (RAM), 16 GB on average, and 8 to 16 GB of GPU memory (VRAM). The machines used for this repo:
  - My M2 Macbook Pro
    - OS: MacOS Sonoma
    - Apple M2 Silicon
    - 16 GB RAM
    - 8 GB VRAM
  - My Dell XPS Desktop
    - OS: Windows 11
    - Intel i7 CPU
    - 16 GB RAM
    - Nvidia RTX 2060 SUPER (8 GB VRAM)
  - My Darkstar GPU server
    - OS: Ubuntu 22.04
    - Dual Intel Xeon CPUs
    - 64 GB RAM
    - 3x Nvidia Tesla P100 (16 GB VRAM each)
- Personal environment notes (software setup):
  - Conda's conda-forge library does not currently have the correct version of `auto-gptq` or `peft` available.
    - In addition, the conda-forge version of `bitsandbytes` may not be correct or up to date (was having an error where, upon loading the module, `bitsandbytes` could not find CUDA).
    - To counter this, use python's `virtualenv` package (documentation here):
      - install: `pip install virtualenv`
      - create a new virtual environment: `python3 -m venv env-name`
      - activate the virtual environment: `source env-name/bin/activate`
      - deactivate the virtual environment: `deactivate`
      - install packages into the virtual environment (while the environment is active): `pip install package-name`
      - it is best to deactivate any/all existing conda environments before doing anything in python's `virtualenv`
  - Install `pytorch` (see the website for the most up to date commands; a quick accelerator sanity check is sketched at the end of these notes):
    - Linux (CUDA 11.8):
      - `conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia`
      - `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`
    - MacOS (MPS acceleration on MacOS 12.3+):
      - `conda install pytorch::pytorch torchvision torchaudio -c pytorch`
      - `pip install torch torchvision torchaudio`
    - Windows (CUDA 11.8):
      - `conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia`
      - `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`
  - To alleviate confusion between conda and `virtualenv`:
    - Use `requirements.txt` to initialize the virtual environment.
    - Use `environment.yaml` to initialize the conda environment.
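
  A quick sanity check (a minimal sketch, assuming the `pytorch` install above succeeded) to confirm which accelerator the build can actually see (CUDA on the Linux/Windows machines, MPS on the M2 Macbook Pro):

  ```python
  # Verify which accelerator the installed PyTorch build detects.
  import torch

  print("PyTorch version:", torch.__version__)
  print("CUDA available: ", torch.cuda.is_available())          # Nvidia GPUs (Linux/Windows)
  print("MPS available:  ", torch.backends.mps.is_available())  # Apple Silicon (MacOS 12.3+)
  ```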
- Techniques to be explored in this repo:
  - QLoRA (with the `bitsandbytes` library in addition to huggingface's `transformers` & `peft`)
  - GPTQ (with the `auto-gptq` and/or `optimum` libraries in addition to huggingface's `transformers` & `peft`)
  - GGML/GGUF
- QLoRA
  - `bitsandbytes` requires a CUDA GPU to do quantization.
  - `bitsandbytes` benefits:
    - Can perform "zero-shot quantization" (does not require input data to calibrate the quantized model)
    - Can quantize any model out of the box as long as it contains `torch.nn.Linear` modules
    - Quantization is performed on model load; there is no need to run any post-processing or preparation step
    - Zero performance degradation when loading adapters, and adapters can also be merged into a dequantized model
  - `bitsandbytes` shortcomings:
    - 4-bit quantization is not serializable (meaning 4-bit quantized models cannot be saved).
- GPTQ
  - `auto-gptq` requires a CUDA or ROCm (AMD) GPU to do quantization.
  - `auto-gptq` benefits:
    - fast for text generation
    - n-bit support
    - easily serializable
  - `auto-gptq` shortcomings:
    - requires a calibration dataset
    - works for language models only (at the time of writing this, 10/10/2023)
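
  A minimal sketch of the `auto-gptq` workflow (API as documented around the time of writing; the model id, calibration sentences, and output directory are placeholders). Note the calibration examples, which GPTQ requires and `bitsandbytes` does not:

  ```python
  from transformers import AutoTokenizer
  from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

  model_id = "facebook/opt-1.3b"   # placeholder model
  out_dir = "opt-1.3b-4bit-gptq"   # placeholder output directory

  tokenizer = AutoTokenizer.from_pretrained(model_id)

  # GPTQ needs a calibration dataset; a real run would use a few hundred samples.
  examples = [
      tokenizer("GPTQ measures quantization error on a calibration dataset."),
      tokenizer("Consumer grade hardware has a limited amount of VRAM."),
  ]

  quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
  model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

  model.quantize(examples)                             # runs on a CUDA/ROCm GPU
  model.save_quantized(out_dir, use_safetensors=True)  # GPTQ models are serializable
  ```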
- GGML/GGUF
  - Models quantized with GGML/GGUF do NOT currently support any form of finetuning (and are therefore limited to just inference).
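
  For completeness, a minimal inference-only sketch with the `llama-cpp-python` bindings (the GGUF file path is a placeholder for an already-quantized model downloaded from the community):

  ```python
  from llama_cpp import Llama

  # Load an already-quantized GGUF file; this model can be run but not finetuned.
  llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048)

  output = llm("Q: What does quantization do? A:", max_tokens=64)
  print(output["choices"][0]["text"])
  ```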
- AWQ
  - `autoawq` requires a CUDA GPU to do quantization.
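
  A minimal `autoawq` sketch (API as shown in the AutoAWQ project's examples at the time of writing; the model path and output directory are placeholders):

  ```python
  from awq import AutoAWQForCausalLM
  from transformers import AutoTokenizer

  model_path = "facebook/opt-1.3b"   # placeholder model
  quant_path = "opt-1.3b-awq"        # placeholder output directory
  quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

  model = AutoAWQForCausalLM.from_pretrained(model_path)
  tokenizer = AutoTokenizer.from_pretrained(model_path)

  model.quantize(tokenizer, quant_config=quant_config)  # AWQ calibrates against a small default dataset
  model.save_quantized(quant_path)
  tokenizer.save_pretrained(quant_path)
  ```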
- Can finetune quantized models with the `peft` library from huggingface (for GPTQ and QLoRA quantization).
  - PEFT stands for "parameter efficient finetuning".
  - Full training of a quantized model is not possible; only parameter-efficient finetuning (e.g. LoRA adapters on top of the frozen, quantized weights) is.
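
  A minimal QLoRA-style finetuning setup sketch (assuming `peft` >= 0.4, a CUDA GPU, and placeholder model/hyperparameter choices): the quantized base weights stay frozen and only the small LoRA adapter weights are trained.

  ```python
  import torch
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

  model_id = "facebook/opt-1.3b"  # placeholder model

  # Load the base model 4-bit quantized; its weights stay frozen.
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      quantization_config=BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_compute_dtype=torch.float16,
      ),
      device_map="auto",
  )
  model = prepare_model_for_kbit_training(model)

  # Attach small trainable LoRA adapters; only these get updated during finetuning.
  lora_config = LoraConfig(
      r=16,
      lora_alpha=32,
      lora_dropout=0.05,
      bias="none",
      task_type="CAUSAL_LM",
      target_modules=["q_proj", "v_proj"],  # module names differ per architecture
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()  # only a tiny fraction of parameters is trainable
  ```

  From here the wrapped model can be passed to a normal huggingface `Trainer` loop.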
- Overall notes on my experience with quantization and finetuning
  - The reliance on (primarily CUDA) GPUs is a detriment to the democratization of AI models. Current quantization methods (QLoRA and GPTQ) do a lot to allow LLMs to be finetuned and run on consumer desktops. However, not everyone has the budget for a GPU with the VRAM needed to perform the quantization itself (which looks to be 12-16+ GB of VRAM for something like a 7B parameter model). This is only exacerbated by the reliance on NVIDIA cards. I understand `auto-gptq` allows ROCm (AMD) cards to work as well, but there is a blatant emphasis on CUDA support. Apple silicon is not viable for any quantization package at this time (10/21/2023), as MPS support is not included in any of the major packages (`bitsandbytes` and `auto-gptq`), which cuts out another group of potential users (especially as Apple puts more effort into creating powerful hardware like their mac studio or mac station).
  - Speaking of neglected users, Windows users have to rely on other packages and workarounds to get `bitsandbytes` working in their environments. While the community has banded together to come up with the necessary workarounds, it should not be up to them to create guides that just "duct tape" around the core problem.
  - Quantizing Falcon 7B (with `auto-gptq`) took more resources than my Dell XPS Desktop was able to give (it would usually OOM on the GPU). On Colab instances (free tier for Google and Kaggle) there were other issues concerning not enough memory (regular RAM) or setting up the environment with the necessary modules (Kaggle was having a weird time importing/using `auto-gptq` despite the import command being run several times).
    - Note on Colab and Kaggle free tier instances:
      - Colab free tier has 12.7 GB RAM and 16 GB VRAM (Nvidia T4).
      - Kaggle free tier has 29 GB RAM and 2x 16 GB VRAM (Nvidia T4).
  - Why learn how to quantize a model when others are already doing it for you?
    - It's good to know how things work.
    - You have more control over your models (other people's models/copies can be faulty, corrupted, tampered with, or censored).
  - What should you do if you cannot quantize a model (i.e. package error messages, insufficient hardware, etc.)?
    - Wait. Waiting for package updates and tutorials gives you the benefit of having someone else go through the trouble of debugging or handling your specific use case.
    - File a bug. In the case of errors from running your quantization + finetuning, there are probably a lot of other people who have encountered the same bug on their system. Keeping up to date with, or subscribed to, those posts may give you either a workaround or a note that your issue is going to be resolved in the next update.
    - Use a model that's already been quantized. While I mentioned above why you may want to perform your own quantization, there are still many people in the community who have gone through the trouble of quantizing the desired model for you (such as TheBloke). You will still have to judge how reputable/trustworthy those files are (even if they're using `safetensors`), but there are generally a lot of good actors in the community hoping to enable others to continue their research or work with different large scale models (like LLMs or Stable Diffusion). A minimal loading sketch appears after this list.
    - Find another way/model. There are now many versions of models that come out, each version with the same architecture but a different number of parameters (usually as a result of using a different number of layers and hyperparameters). Using smaller or distilled models at the cost of accuracy is still a viable option for your work. While this isn't quantization, it can be of tremendous benefit to you. You can also look to see if there is another way of doing things such as quantization. This repo explores QLoRA, GPTQ, and GGML/GGUF, but is primarily focused on the first two since the last one doesn't allow for finetuning after model quantization. While newer and more efficient methods will be developed, these three give you many options for how to quantize large models.
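
  As referenced above, a minimal sketch of loading a community pre-quantized GPTQ checkpoint through `transformers` (requires `optimum` and `auto-gptq` to be installed; the repo id is a placeholder for whichever quantized upload you have vetted):

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  repo_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder: substitute a quantized repo you trust

  tokenizer = AutoTokenizer.from_pretrained(repo_id)
  # The GPTQ quantization config ships with the checkpoint, so the already-quantized
  # weights are loaded directly; no GPU-heavy quantization step is run locally.
  model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

  prompt = "Quantization lets large models run on consumer hardware because"
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))
  ```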
- Notes regarding LLMs + RAG with `langchain.js`
  - Currently, `langchain.js` doesn't have a recipe/instructions on how it supports `transformers.js` or ONNXruntime for JS, and primarily uses (usually paid) API calls to LLM hosting services (e.g. OpenAI, Google Vertex, Huggingface Inference). The python version of langchain allows for custom models (Huggingface Hub) and integration with local Huggingface pipelines. This presents a serious problem for the goal of locally hosted LLMs with RAG in JS applications (desktop, mobile, web extensions).
    - Possible Solution 1: Deconstruct the langchain library (JS & python) to figure out how it works. Create a custom framework that strips out 90% of the compatibility with the other LLM API endpoints and works solely with Huggingface Inference, OpenAI, and `transformers.js`, with `transformers.js` being the only one that works locally.
      - Downsides:
        - A LOT of work to do (reverse engineering & coding) - very much a "from scratch" approach.
        - Limits functionality with the rest of the `langchain.js` library or outright removes a lot of features.
    - Possible Solution 1.5: Implement a local server that uses `transformers.js` and/or ONNXruntime. This server is pinged by `langchain.js` whenever it needs to make a call to the LLM. This still requires a bit of reverse engineering & tinkering but limits the scope to just the LLM module/component.
      - Downsides:
        - Limits downstream applications to just desktop apps (maybe mobile). Implementing web extensions would be a definite "no go".
        - Still requires a lot of work for the reverse engineering. May have to hack in a way for `langchain.js` to support calls to the local server.
    - Possible Solution 2: Convert all final models to GGUF/GGML and use Ollama (or LlamaCPP) to host a local endpoint. Ollama and LlamaCPP both have support in `langchain.js`.
      - Downsides:
        - Not a fan of GGUF/GGML because of its limits on retraining models.
        - Complicates the workflow for model training and finetuning. Not sure if the above line disallows finetuning overall, since that would use some quantization to do the finetuning AND THEN the model would have to be converted to GGUF/GGML.
    - Possible Solution 3: Wait until langchain supports `transformers.js` and ONNXruntime. There is no GitHub issue/ticket tracking this, and the timeline for that is unknown given how small `transformers.js` is.
      - Downsides:
        - Indefinite wait.
- YouTube video from Code_Your_Own_AI: New Tutorial on LLM Quantization w/ QLoRA, GPTQ and Llamacpp, LLama 2
- Huggingface Transformers documentation on quantization
- Huggingface Transformers overview of quantization on GitHub
- Huggingface Optimum documentation on quantization
- Huggingface Accelerate documentation on estimating model footprint on memory