jina does not pass the right GPU into clipseg #135
Hey @mchaker, what is the backend you are using? What does
Hey @mchaker, are you sure your CUDA version supports MIG access? https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#cuda-baremetal In this documentation you can see the driver versions that support this feature, plus the syntax to be used.
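(As a side note, a minimal sketch, assuming nvidia-smi is on the PATH, of how to list the MIG UUIDs that CUDA_VISIBLE_DEVICES accepts, by parsing the `nvidia-smi -L` output:)

    # Sketch: print the MIG device UUIDs reported by `nvidia-smi -L`.
    # These UUID strings are what CUDA_VISIBLE_DEVICES expects for MIG slices.
    import subprocess

    out = subprocess.run(
        ['nvidia-smi', '-L'], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        # MIG entries look like: "  MIG 3g.20gb Device 0: (UUID: MIG-xxxx-...)"
        if 'MIG' in line and 'UUID:' in line:
            print(line.split('UUID:')[1].strip().rstrip(')'))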
Can you try changing your YAML to:

    - name: clipseg
      env:
        CUDA_VISIBLE_DEVICES: "MIG-GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
        XLA_PYTHON_CLIENT_ALLOCATOR: platform
      replicas: 1
      timeout_ready: -1
      uses: executors/clipseg/config.yml

or

    - name: clipseg
      env:
        CUDA_VISIBLE_DEVICES: "MIG-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
        XLA_PYTHON_CLIENT_ALLOCATOR: platform
      replicas: 1
      timeout_ready: -1
      uses: executors/clipseg/config.yml

?
My NVIDIA driver version is 515, so it supports MIG. I'll try the MIG prefix and report back.
This is weird, do you have the source code of
What Jina does is simply set the env vars for each Executor process, so whether or not this is respected by the Executor is an Executor or upstream problem.
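(To separate the two, a minimal sanity check you could drop into the Flow, assuming jina and torch are installed; the executor name and endpoint are purely illustrative:)

    # Sketch: report what the Executor process actually sees, to tell a
    # Jina env-var problem apart from a torch/CUDA one.
    import os

    import torch
    from jina import Executor, requests


    class DeviceCheck(Executor):
        @requests
        def check(self, **kwargs):
            print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))
            print('torch.cuda.is_available() =', torch.cuda.is_available())
            print('torch.cuda.device_count() =', torch.cuda.device_count())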
I see - will check the
Hey @mchaker, any news about it?
@JoanFM yes - However Jina crashes with:
Hey @mchaker, this problem is in the Executor and how it loads onto the GPU. Where are you getting it from? Maybe we can open an issue on that repo and fix it there?
I see - let me check with the developer and see where they are getting the executor from. Maybe it is custom. |
I believe the issue may come from how the model was stored or something like this. In this case Jina has made sure that your
I see -- I'll follow up with the executor authors and dig into the executor source. Thanks for your help! |
@JoanFM actually it looks like the executor is from Jina: |
The device for the model is simply mapped with:

    model.load_state_dict(
        torch.load(
            f'{cache_path}/{WEIGHT_FOLDER_NAME}/rd64-uni.pth',
            map_location=torch.device('cuda'),
        ),
        strict=False,
    )

In this case it appears that torch is unable to map the location. @mchaker, before these lines in
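(Not the project's code, just a hedged sketch of a more defensive variant: resolve the device explicitly instead of passing the bare 'cuda' string, so a device that CUDA cannot see falls back or fails loudly. cache_path, WEIGHT_FOLDER_NAME and model are assumed to come from the executor code quoted above:)

    # Sketch only: pick the device up front and map the checkpoint onto it,
    # falling back to CPU when no CUDA device is visible to this process.
    import torch

    device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

    state = torch.load(
        f'{cache_path}/{WEIGHT_FOLDER_NAME}/rd64-uni.pth',  # path from the snippet above
        map_location=device,
    )
    model.load_state_dict(state, strict=False)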
Hey @AmericanPresidentJimmyCarter, do you know what the problem might be, why it cannot be loaded with that
@JoanFM No, I will try to get you debug output from the env. This appears to be a strange one.
I am transferring the issue to DALLE-FLOW because the issue is specific to the Executor in this project.
@AmericanPresidentJimmyCarter what do you need from the env? |
Hey @mchaker, @AmericanPresidentJimmyCarter, any progress on this?
I still do not know why it happens -- it's only this one specific executor that has the problem. We can update to the latest jina and see if it persists.
I updated jina using
Describe the bug
Does not work:
Works:
Describe how you solve it
I use the numeric GPU ID (sad)
Environment
Screenshots
N/A