OpenVINO Integration with Ollama


Ollama-ov

Getting started with large language models and using the GenAI backend.

Download Ollama-OV

We provide two ways to download the prebuilt Ollama executable, Google Drive and Baidu Drive:

Google Drive

Windows

Download exe + Download OpenVINO GenAI

Linux (Ubuntu 22.04)

Download + Download OpenVINO GenAI

Baidu Drive

Windows

Download exe + Download OpenVINO GenAI

Linux (Ubuntu 22.04)

Download + Download OpenVINO GenAI

Docker

Linux

We also provide a Dockerfile to help developers quickly build a Docker image: Dockerfile

docker build -t ollama_openvino:v1 -f Dockerfile_genai .

then

docker run --rm -it ollama_openvino:v1
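
If you want to reach the Ollama server inside the container from the host, a typical invocation also publishes Ollama's default port 11434 and mounts a host directory containing your OpenVINO IR models. This is only an illustrative sketch; the /models mount point is an assumption, not something the Dockerfile defines:

docker run --rm -it \
    -p 11434:11434 \
    -v $HOME/ov_models:/models \
    ollama_openvino:v1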

Model library

Native Ollama only supports models in the GGUF format, while Ollama-OV invokes OpenVINO GenAI, which requires models in the OpenVINO format. Therefore, we have enabled support for OpenVINO model files in Ollama. For public LLMs, you can download OpenVINO IR models from HuggingFace or ModelScope:

| Model | Parameters | Size | Compression | Download | Device |
|---|---|---|---|---|---|
| Qwen3-8B-int4-asym-ov | 8B | 4.9GB | INT4_ASYM_128 | ModelScope | CPU, GPU, NPU(base) |
| Qwen3-4B-int4-ov | 4B | 2.7GB | INT4_ASYM_128 | ModelScope | CPU, GPU, NPU(base) |
| Qwen3-1.7B-int4-ov | 1.7B | 1.2GB | INT4_ASYM_128 | ModelScope | CPU, GPU, NPU(base) |
| Qwen3-0.6B-int4-ov | 0.6B | 0.5GB | INT4_ASYM_128 | ModelScope | CPU, GPU, NPU(base) |
| DeepSeek-R1-Distill-Qwen-1.5B-int4-ov | 1.5B | 1.4GB | INT4_ASYM_32 | ModelScope | CPU, GPU, NPU(base) |
| DeepSeek-R1-Distill-Qwen-1.5B-int4-ov-npu | 1.5B | 1.1GB | INT4_SYM_CW | ModelScope | NPU(best) |
| DeepSeek-R1-Distill-Qwen-7B-int4-ov | 7B | 4.3GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base) |
| DeepSeek-R1-Distill-Qwen-7B-int4-ov-npu | 7B | 4.1GB | INT4_SYM_CW | ModelScope | NPU(best) |
| DeepSeek-R1-Distill-Qwen-14B-int4-ov | 14B | 8.0GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base) |
| DeepSeek-R1-Distill-llama-8B-int4-ov | 8B | 4.5GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base) |
| DeepSeek-R1-Distill-llama-8B-int4-ov-npu | 8B | 4.2GB | INT4_SYM_CW | ModelScope | NPU(best) |
| llama-3.2-1b-instruct-int4-ov | 1B | 0.8GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base) |
| llama-3.2-3b-instruct-int4-ov | 3B | 1.9GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base) |
| llama-3.2-3b-instruct-int4-ov-npu | 3B | 1.8GB | INT4_SYM_CW | ModelScope | NPU(best) |
| Phi-3.5-mini-instruct-int4-ov | 3.8B | 2.1GB | INT4_ASYM | HF, ModelScope | CPU, GPU |
| Phi-3-mini-128k-instruct-int4-ov | 3.8B | 2.5GB | INT4_ASYM | HF, ModelScope | CPU, GPU |
| Phi-3-mini-4k-instruct-int4-ov | 3.8B | 2.2GB | INT4_ASYM | HF, ModelScope | CPU, GPU |
| Phi-3-medium-4k-instruct-int4-ov | 14B | 7.4GB | INT4_ASYM | HF, ModelScope | CPU, GPU |
| Qwen2.5-0.5B-Instruct-openvino-ovms-int4 | 0.5B | 0.3GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base) |
| Qwen2.5-1.5B-Instruct-openvino-ovms-int4 | 1.5B | 0.9GB | INT4_SYM_128 | ModelScope | CPU, GPU, NPU(base) |
| Qwen2.5-3B-Instruct-gptq-ov | 3B | 2.7GB | INT4_GPTQ | ModelScope | CPU, GPU |
| Qwen2.5-7B-Instruct-int4-ov | 7B | 4.3GB | INT4_ASYM | ModelScope | CPU, GPU |
| minicpm-1b-sft-int4-ov | 1B | 0.7GB | INT4_SYM | ModelScope | CPU, GPU, NPU(base) |
| gemma-2-9b-it-int4-ov | 9B | 5.3GB | INT4_ASYM | HF, ModelScope | CPU, GPU |
| gemma-3-1b-it-int4-ov | 1B | 0.7GB | INT4_SYM_128 | ModelScope | CPU, GPU |
| TinyLlama-1.1B-Chat-v1.0-int4-ov | 1.1B | 0.6GB | INT4_ASYM | HF, ModelScope | CPU, GPU |
  • INT4_SYM_128: INT4 symmetric compression with NNCF, group size 128, all linear layers compressed. Similar to Q4_0 compression.
  • INT4_SYM_CW: INT4 symmetric compression with NNCF, channel-wise compression for best NPU performance.
  • INT4_ASYM: INT4 asymmetric compression with NNCF; better accuracy than symmetric compression, but NPU does not support asymmetric compression.
  • INT4_GPTQ: INT4 GPTQ compression with NNCF, aligned with HuggingFace.

The links above are only examples covering part of the models; for other LLMs, you can check the OpenVINO GenAI model support list. If you have a customized LLM, please follow the model conversion steps of GenAI (a rough sketch is given below).
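
For illustration only (the GenAI documentation is authoritative): a HuggingFace model can typically be exported to OpenVINO IR with INT4 weight compression using optimum-intel. The model id and compression options below are placeholders, and the exact flags depend on your optimum-intel version:

pip install "optimum[openvino]"
optimum-cli export openvino \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --weight-format int4 --group-size 128 \
    llama-3.2-1b-instruct-int4-ov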

Ollama Model File

We added two new parameters to the Modelfile in addition to the original parameters:

| Parameter | Description | Value Type | Example Usage |
|---|---|---|---|
| stop_id | Sets the stop ids to use | int | stop_id 151643 |
| max_new_token | The maximum number of tokens generated by GenAI (default: 2048) | int | max_new_token 4096 |
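
For reference, a minimal Modelfile using these parameters might look like the sketch below. The values are illustrative, and the exact spelling of the OpenVINO-specific fields (ModelType and InferDevice, described in the Example section) should be checked against the example Modelfile shipped in this repository:

# Illustrative sketch only; see the example Modelfile in this repo for the authoritative syntax.
FROM ./DeepSeek-R1-Distill-Qwen-7B-int4-ov.tar.gz
ModelType "OpenVINO"
InferDevice GPU
PARAMETER stop_id 151643
PARAMETER max_new_token 4096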

Quick start

Start Ollama

  1. First, set the GODEBUG=cgocheck=0 environment variable:

    Linux

    export GODEBUG=cgocheck=0

    Windows

    set GODEBUG=cgocheck=0
  2. Next, run ollama serve to start Ollama (you must use the Ollama binary built from ollama_ov to start the server):

    ollama serve
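
    Once the server is up, you can check that it responds from a second shell, for example (assuming the default host and port, and that this fork keeps upstream Ollama's /api/version endpoint):

    curl http://localhost:11434/api/version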

Import from OpenVINO IR

How to create an Ollama model based on OpenVINO IR

Example

Let's take deepseek-ai/DeepSeek-R1-Distill-Qwen-7B as an example.

  1. Download the OpenVINO model

    1. Download from ModelScope: DeepSeek-R1-Distill-Qwen-7B-int4-ov

      pip install modelscope
      modelscope download --model zhaohb/DeepSeek-R1-Distill-Qwen-7B-int4-ov --local_dir ./DeepSeek-R1-Distill-Qwen-7B-int4-ov
    2. If the OpenVINO model is also available on HF, we can download it from there instead. Here we take TinyLlama-1.1B-Chat-v1.0-int4-ov as an example of how to download a model from HF.

      If your network access to HuggingFace is unstable, you can try to use a mirror endpoint to pull the model.

      set HF_ENDPOINT=https://hf-mirror.com
      pip install -U huggingface_hub
      huggingface-cli download --resume-download OpenVINO/TinyLlama-1.1B-Chat-v1.0-int4-ov  --local-dir  TinyLlama-1.1B-Chat-v1.0-int4-ov --local-dir-use-symlinks False
      
  2. Package OpenVINO IR into a tar.gz file

    tar -zcvf DeepSeek-R1-Distill-Qwen-7B-int4-ov.tar.gz DeepSeek-R1-Distill-Qwen-7B-int4-ov
  3. Create a file named Modelfile, with a FROM instruction pointing to the local filepath of the model you want to import. For convenience, we have put the Modelfile for the DeepSeek-R1-Distill-Qwen-7B-int4-ov model under the example dir: Modelfile for DeepSeek, which can be used directly.

    Note:

    1. The ModelType "OpenVINO" parameter is mandatory and must be explicitly set.
    2. The InferDevice parameter is optional. If not specified, the system will prioritize using the GPU by default. If no GPU is available, it will automatically fall back to using the CPU. If InferDevice is explicitly set, the system will strictly use the specified device. If the specified device is unavailable, the system will follow the same fallback strategy as when InferDevice is not set (i.e., GPU first, then CPU).
    3. For more information on working with a Modelfile, see the Modelfile documentation.
  4. Unzip OpenVINO GenAI package and set environment

    cd openvino_genai_windows_2025.2.0.0.dev20250320_x86_64
    setupvars.bat
  5. Create the model in Ollama

    ollama create DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1 -f Modelfile

    You will see output similar to the following:

       gathering model components 
       copying file sha256:77acf6474e4cafb67e57fe899264e9ca39a215ad7bb8f5e6b877dcfa0fabf919 100% 
       using existing layer sha256:77acf6474e4cafb67e57fe899264e9ca39a215ad7bb8f5e6b877dcfa0fabf919 
       creating new layer sha256:9b345e4ef9f87ebc77c918a4a0cee4a83e8ea78049c0ee2dc1ddd2a337cf7179 
       creating new layer sha256:ea49523d744c40bc900b4462c43132d1c8a8de5216fa8436cc0e8b3e89dddbe3 
       creating new layer sha256:b6bf5bcca7a15f0a06e22dcf5f41c6c0925caaab85ec837067ea98b843bf917a 
       writing manifest 
       success 
  6. Run the model

    ollama run DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1
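
    With the model created and ollama serve running, you can also query it over Ollama's standard HTTP API instead of the interactive CLI. This assumes the fork keeps upstream Ollama's /api/generate endpoint on the default port 11434 (Linux shell quoting shown):

    curl http://localhost:11434/api/generate -d '{
      "model": "DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1",
      "prompt": "Why is the sky blue?",
      "stream": false
    }'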

CLI Reference

Show model information

ollama show DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1 

List models on your computer

ollama list

List which models are currently loaded

ollama ps

Stop a model which is currently running

ollama stop DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1 

Building from source

Install prerequisites:

  • Go
  • C/C++ Compiler e.g. Clang on macOS, TDM-GCC (Windows amd64) or llvm-mingw (Windows arm64), GCC/Clang on Linux.

Then build and run Ollama from the root directory of the repository:

Windows

  1. Clone the repo

    git lfs install
    git lfs clone https://github.com/zhaohb/ollama_ov.git
  2. Enable CGO:

    go env -w CGO_ENABLED=1
  3. Initialize the GenAI environment

    Download the GenAI runtime from GenAI, then extract it to the directory openvino_genai_windows_2025.2.0.0.dev20250320_x86_64.

    cd openvino_genai_windows_2025.2.0.0.dev20250320_x86_64
    setupvars.bat
  4. Set cgo environment variables

    set CGO_LDFLAGS=-L%OpenVINO_DIR%\..\lib\intel64\Release
    set CGO_CFLAGS=-I%OpenVINO_DIR%\..\include 
  5. Build Ollama

    go build -o ollama.exe
  6. If you don't want to recompile Ollama, you can use the precompiled executable directly: just initialize the GenAI environment as in step 3 and then run ollama.exe.

    But if you encounter the following error when executing ollama.exe, it is recommended that you recompile from source:

    This version of ollama.exe is not compatible with the version of Windows you're running. Check your computer's system information and then contact the software publisher.

Linux

  1. Clone the repo

    git lfs install
    git lfs clone https://github.com/zhaohb/ollama_ov.git
  2. Enable CGO:

    go env -w CGO_ENABLED=1
  3. Initialize the GenAI environment

    Download the GenAI runtime from GenAI, then extract it to the directory openvino_genai_ubuntu22_2025.2.0.0.dev20250320_x86_64.

    cd openvino_genai_ubuntu22_2025.2.0.0.dev20250320_x86_64 
    source setupvars.sh
  4. Set cgo environment variables

    export CGO_LDFLAGS=-L$OpenVINO_DIR/../lib/intel64/
    export CGO_CFLAGS=-I$OpenVINO_DIR/../include
  5. Build Ollama

    go build -o ollama
  6. If you don't want to recompile Ollama, you can use the precompiled executable directly: just initialize the GenAI environment as in step 3 and then run ollama.

    If you encounter problems during use, it is recommended to rebuild from source.

Running local builds

  1. First, set the GODEBUG=cgocheck=0 environment variable:

    Linux

    export GODEBUG=cgocheck=0

    Windows

    set GODEBUG=cgocheck=0
  2. Next, start the server:

    ollama serve
  3. Finally, in a separate shell, run a model:

    ollama run DeepSeek-R1-Distill-Qwen-7B-int4-ov:v1 

Community Integrations

Terminal

Web & Desktop

Future Development Plan

Here are some features and improvements planned for future releases:

  1. Multimodal models: Support for multimodal models that can process both text and image data.
