Vision-Lang & Inference (including LoRA) #1174
Conversation
configs/recipes/vision/llama3_2_vision/inference/11b_rvllm_infer.yaml
```diff
@@ -40,6 +40,7 @@ def __init__(
     enable_prefix_caching: bool = True,
     gpu_memory_utilization: float = 1.0,
     enforce_eager: bool = True,
+    max_num_seqs: int = 2,
```
Changing the default value for `max_num_seqs` may affect other models. Can we define this param as `None` instead:

```python
max_num_seqs: Optional[int] = None
```

and then do something like this in the function:

```python
if max_num_seqs is not None:
    vllm_kwargs["max_num_seqs"] = max_num_seqs
```

similarly to `max_lora_rank`?
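A minimal sketch of this pattern (the class and attribute names here are hypothetical stand-ins, not the actual oumi engine code):

```python
from typing import Any, Optional


class VLLMEngineSketch:
    """Hypothetical stand-in for the vLLM inference engine wrapper."""

    def __init__(
        self,
        enable_prefix_caching: bool = True,
        gpu_memory_utilization: float = 1.0,
        enforce_eager: bool = True,
        max_num_seqs: Optional[int] = None,
    ) -> None:
        vllm_kwargs: dict[str, Any] = {
            "enable_prefix_caching": enable_prefix_caching,
            "gpu_memory_utilization": gpu_memory_utilization,
            "enforce_eager": enforce_eager,
        }
        # Forward max_num_seqs only when explicitly set, so other models
        # keep vLLM's built-in default (mirrors the max_lora_rank handling).
        if max_num_seqs is not None:
            vllm_kwargs["max_num_seqs"] = max_num_seqs
        self.vllm_kwargs = vllm_kwargs
```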
Then override `max_num_seqs` in the Llama vLLM inference config (example: `model_kwargs={...}`).
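For illustration, a hypothetical per-model override; the actual config schema and key names in the repo may differ:

```python
# Hypothetical override for the Llama 3.2 Vision vLLM inference config;
# the key name and where this dict lives are assumed, not taken from this PR.
model_kwargs = {
    "max_num_seqs": 2,  # Cap concurrent sequences for the 11B vision model.
}
```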
Issue discussed in https://linear.app/oumi/issue/OPE-923.

Adds the minimal required changes in inference configs for Vision-Llama-based models trained with LoRA. Specifically, for meta-llama/Llama-3.2-11B-Vision:

- The native engine fully works. The LoRA-finetuned model responds: "2", vs. the original Llama: "There are two sinks in this bathroom."
- vLLM appears to not yet support MllamaForConditionalGeneration (see).

Note: SGLang LoRA inference is not addressed in this PR.

Towards OPE-681.
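As a usage sketch, inference with the new config might be invoked like this (the `-c/--config` flag is assumed, not taken from this PR; check `oumi infer --help`):

```bash
# Hypothetical invocation; flag names are assumed for illustration.
oumi infer -c configs/recipes/vision/llama3_2_vision/inference/11b_rvllm_infer.yaml
```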