a tiny vision language model
Build a high-quality, low-hallucination vision language model small enough to run on an edge device without a GPU.
Initial prototype built using SigLIP, Phi-1.5, and the LLaVA training dataset. The model is intended for research purposes only and is subject to the Phi and LLaVA license restrictions.
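For reference, the sketch below shows how a LLaVA-style model of this kind is typically wired: patch embeddings from the SigLIP vision encoder are projected into Phi-1.5's token embedding space and prepended to the text prompt. This is an illustration under those assumptions, not this repository's actual implementation; the checkpoint names and the single linear projection are placeholders.

```python
# Minimal LLaVA-style sketch: SigLIP encoder -> linear projection ->
# Phi-1.5. Checkpoint names and the projection design are assumptions,
# not this repository's actual training code.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, SiglipVisionModel

class TinyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Vision encoder: produces one embedding per image patch.
        self.vision = SiglipVisionModel.from_pretrained(
            "google/siglip-base-patch16-224")
        # Language model: Phi-1.5, a 1.3B-parameter causal LM.
        self.lm = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5")
        # Projection maps patch embeddings into the LM's embedding space.
        self.proj = nn.Linear(
            self.vision.config.hidden_size, self.lm.config.hidden_size)

    def forward(self, pixel_values, input_ids):
        patches = self.vision(pixel_values).last_hidden_state
        image_tokens = self.proj(patches)                 # (B, P, D_lm)
        text_tokens = self.lm.get_input_embeddings()(input_ids)
        # Prepend the projected image tokens to the text prompt.
        inputs_embeds = torch.cat([image_tokens, text_tokens], dim=1)
        return self.lm(inputs_embeds=inputs_embeds)
```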
Usage
Clone this repository and install the dependencies:
pip install -r requirements.txt
Use the `sample.py` script to run the model on CPU:
python sample.py --image [IMAGE_PATH] [--interactive]
When the `--interactive` flag is not set, the script will predict three questions and try to answer them.
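The sketch below illustrates what that flow could look like. The `load_model`, `generate_questions`, and `answer` helpers are hypothetical stand-ins for the script's internals, shown only to clarify the two modes of operation.

```python
# Hypothetical sketch of sample.py's control flow. load_model,
# model.generate_questions, and model.answer are assumed helpers,
# not the script's actual API.
import argparse
from PIL import Image

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--image", required=True)
    parser.add_argument("--interactive", action="store_true")
    args = parser.parse_args()

    image = Image.open(args.image)
    model = load_model()  # assumed helper that loads weights on CPU

    if args.interactive:
        # Interactive mode: answer questions typed by the user.
        while True:
            question = input("> ")
            print(model.answer(image, question))
    else:
        # Default mode: propose three questions, then answer each.
        for question in model.generate_questions(image, n=3):
            print(f"Q: {question}")
            print(f"A: {model.answer(image, question)}")

if __name__ == "__main__":
    main()
```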
Limitations
- The model may generate inaccurate statements.
- It may struggle to adhere to intricate or nuanced instructions.
- It is primarily designed to understand English. Informal English, slang, and non-English languages may not work well.
- The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
- The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.