This application runs inside a Docker container.
Prerequisites:
- vLLM image (can be built from here; see the example build command below)
- HuggingFace token for the model you want to run (only needed for gated models)
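If you need to build the image yourself, a CPU-only image can be built from the vLLM source tree roughly as follows. The Dockerfile path, image tag, and shm-size value here are assumptions and may differ between vLLM releases, so check the vLLM docs for your version:

git clone https://github.com/vllm-project/vllm.git && cd vllm
docker build -f Dockerfile.cpu -t <vllm-image> --shm-size=4g .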
- Export your HF token:
export HF_TOKEN=<your HF token>
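If you have previously logged in with the Hugging Face CLI, the token is usually cached on disk and can be exported from there instead (this assumes the default huggingface_hub token path; adjust if yours differs):

export HF_TOKEN=$(cat ~/.cache/huggingface/token)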
- Run your container in detached mode, with the following variables (a filled-in example follows):
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN=$HF_TOKEN --cpus <number of CPUs> -m <memory>GB --name cpu-test <vllm-image>
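For example, with illustrative resource limits (16 CPUs, 32 GB of memory, and the vllm-cpu-env image tag are placeholder assumptions, not recommendations; size them to your host and use whatever tag you built):

docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN=$HF_TOKEN --cpus 16 -m 32GB --name cpu-test vllm-cpu-env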
- Copy the client application script into the container:
docker cp streamlit-chat-app.py cpu-test:vllm/examples/
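You can confirm the script landed where the next step expects it (the vllm/examples path is relative to the container's working directory, as in the copy command above):

docker exec cpu-test ls vllm/examples/streamlit-chat-app.py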
- Start the vLLM server with your desired model and launch the chat client:
docker exec cpu-test bash -c "/opt/conda/bin/python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --dtype half --max-model-len 4000 & pip install streamlit && streamlit run vllm/examples/streamlit-chat-app.py"
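Equivalently, the three steps bundled into that one command can be run as separate exec calls, which makes it easier to inspect the server before starting the client. This is just a sketch of the same steps (using python -m to invoke pip and streamlit from the same interpreter), not an additional requirement:

docker exec -d cpu-test /opt/conda/bin/python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --dtype half --max-model-len 4000
docker exec cpu-test /opt/conda/bin/python3 -m pip install streamlit
docker exec cpu-test /opt/conda/bin/python3 -m streamlit run vllm/examples/streamlit-chat-app.py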
- Access your chatbot in your browser at:
http://<ip of host vm where container is running>:8501
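Before opening the UI, you can also sanity-check that the vLLM server itself is up. Assuming the server is listening on its default OpenAI-compatible port 8000 (the container uses host networking), listing the served models should return the model name you passed above:

curl http://<ip of host vm where container is running>:8000/v1/models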