jina does not pass the right GPU in to clipseg #135

Open · mchaker opened this issue Nov 5, 2022 · 21 comments

@mchaker commented Nov 5, 2022

Describe the bug

Does not work:

- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml

Works:

- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "6"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml
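
For reference, a minimal sketch (assuming the UUID from nvidia-smi -L above; adjust for your machine) to check whether the CUDA runtime honors the UUID form at all:

import os

# Must be set before torch initializes CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = 'GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be'

import torch

# Prints 1 if the runtime accepts the UUID form, 0 if it rejects it.
print(torch.cuda.device_count())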

Describe how you solve it

I use the numeric GPU ID (sad)


Environment

- jina 3.8.3
- docarray 0.16.2
- jcloud 0.0.35
- jina-hubble-sdk 0.18.0
- jina-proto 0.1.13
- protobuf 3.20.1
- proto-backend cpp
- grpcio 1.47.0
- pyyaml 6.0
- python 3.8.10
- platform Linux
- platform-release 5.15.0-52-generic
- platform-version #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022
- architecture x86_64
- processor x86_64
- uid 2485377892357
- session-id fcbedcc8-5d43-11ed-9251-0242ac110005
- uptime 2022-11-05T19:56:49.977485
- ci-vendor (unset)
* JINA_DEFAULT_HOST (unset)
* JINA_DEFAULT_TIMEOUT_CTRL (unset)
* JINA_DEPLOYMENT_NAME (unset)
* JINA_DISABLE_UVLOOP (unset)
* JINA_EARLY_STOP (unset)
* JINA_FULL_CLI (unset)
* JINA_GATEWAY_IMAGE (unset)
* JINA_GRPC_RECV_BYTES (unset)
* JINA_GRPC_SEND_BYTES (unset)
* JINA_HUB_NO_IMAGE_REBUILD (unset)
* JINA_LOG_CONFIG (unset)
* JINA_LOG_LEVEL (unset)
* JINA_LOG_NO_COLOR (unset)
* JINA_MP_START_METHOD (unset)
* JINA_OPTOUT_TELEMETRY (unset)
* JINA_RANDOM_PORT_MAX (unset)
* JINA_RANDOM_PORT_MIN (unset)

Screenshots

N/A

@JoanFM (Member) commented Nov 7, 2022

Hey @mchaker,

Which backend are you using, and what does clipseg do? It seems the DL backend does not understand the UUID.

@JoanFM (Member) commented Nov 7, 2022

Hey @mchaker,

Are you sure your CUDA version supports MIG access?

https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#cuda-baremetal

That documentation lists the driver versions that support this feature, plus the syntax to use.

@JoanFM (Member) commented Nov 7, 2022

Can you try changing your YAML to:

- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "MIG-GPU-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml

or

- name: clipseg
  env:
    CUDA_VISIBLE_DEVICES: "MIG-87ddc7ee-c3eb-1181-1857-368f4c2bb8be"
    XLA_PYTHON_CLIENT_ALLOCATOR: platform
  replicas: 1
  timeout_ready: -1
  uses: executors/clipseg/config.yml

?

@mchaker (Author) commented Nov 7, 2022

My NVIDIA driver version is 515, so it supports MIG.
However, I do not use MIG on my cards. I just use the main card UUID from nvidia-smi -L.

I'll try the MIG prefix and report back.

clipseg is an executor set up for Jina. I use the UUID GPU specification method with other executors, and Jina passes the right GPU to them. For some reason it does not pass the right GPU to the clipseg executor. :(

@JoanFM (Member) commented Nov 7, 2022

This is weird. Do you have the source code of clipseg? Can you check what value the Executor sees when you do:

os.environ['CUDA_VISIBLE_DEVICES']

All Jina does is set the env vars for each Executor process, so whether or not they are respected is an Executor or upstream problem.
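
A minimal sketch of that check, placed at the top of the Executor's __init__ (assuming torch is the DL backend):

import os
import torch

# Print what the process actually received, and what torch can see.
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('cuda available:', torch.cuda.is_available())
print('device count:', torch.cuda.device_count())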

@mchaker (Author) commented Nov 7, 2022

I see - will check the os.environ value and report back.

@JoanFM (Member) commented Nov 11, 2022

Hey @mchaker, any news about it?

@mchaker (Author) commented Nov 11, 2022

@JoanFM yes - CUDA_VISIBLE_DEVICES is GPU-87d2c7e5-c3eb-1181-1857-368f4c2bbbbb in the container (proper GPU ID)

However Jina crashes with:

CRITICAL clipseg/rep-0@61 can not load the executor from executors/clipseg/config.yml                   [11/11/22 14:54:57]
ERROR  clipseg/rep-0@61 RuntimeError('Attempting to deserialize object on CUDA device 0 but                  [11/11/22 14:54:57]
       torch.cuda.device_count() is 0. Please use torch.load with map_location to map your storages to an
       existing device.') during <class 'jina.serve.runtimes.worker.WorkerRuntime'> initialization
        add "--quiet-error" to suppress the exception details
       Traceback (most recent call last):
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/orchestrate/pods/__init__.py", line
       74, in run
           runtime = runtime_cls(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/worker/__init__.py",
       line 36, in __init__
           super().__init__(args, **kwargs)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/asyncio.py", line 80,
       in __init__
           self._loop.run_until_complete(self.async_setup())
         File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
           return future.result()
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/worker/__init__.py",
       line 101, in async_setup
           self._data_request_handler = DataRequestHandler(
         File
       "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/request_handlers/data_reques…
       line 49, in __init__
           self._load_executor(
         File
       "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/runtimes/request_handlers/data_reques…
       line 139, in _load_executor
           self._executor: BaseExecutor = BaseExecutor.load_config(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 760, in
       load_config
           obj = JAML.load(tag_yml, substitute=False, runtime_args=runtime_args)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 174, in load
           r = yaml.load(stream, Loader=get_jina_loader_with_runtime(runtime_args))
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/__init__.py", line 81, in load
           return loader.get_single_data()
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 51, in
       get_single_data
           return self.construct_document(node)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 55, in
       construct_document
           data = self.construct_object(node)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/yaml/constructor.py", line 100, in
       construct_object
           data = constructor(self, node)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/__init__.py", line 582, in
       _from_yaml
           return get_parser(cls, version=data.get('version', None)).parse(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/jaml/parsers/executor/legacy.py",
       line 45, in parse
           obj = cls(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/executors/decorators.py", line
       63, in arg_wrapper
           f = func(self, *args, **kwargs)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/jina/serve/helper.py", line 71, in
       arg_wrapper
           f = func(self, *args, **kwargs)
         File "/dalle/dalle-flow/executors/clipseg/executor.py", line 71, in __init__
           torch.load(
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 789, in load
           return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1131, in
       _load
           result = unpickler.load()
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1101, in
       persistent_load
           load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1083, in
       load_tensor
           wrap_storage=restore_location(storage, location),
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 1055, in
       restore_location
           return default_restore_location(storage, str(map_location))
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 215, in
       default_restore_location
           result = fn(storage, location)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 182, in
       _cuda_deserialize
           device = validate_cuda_device(location)
         File "/dalle/dalle-flow/env/lib/python3.10/site-packages/torch/serialization.py", line 173, in
       validate_cuda_device
           raise RuntimeError('Attempting to deserialize object on CUDA device '
       RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0.
       Please use torch.load with map_location to map your storages to an existing device.
DEBUG  clipseg/rep-0@61 process terminated
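
For what it's worth, the symptom is reproducible outside Jina (a sketch with a deliberately bogus device string):

import os

# An unrecognized CUDA_VISIBLE_DEVICES value hides all GPUs from the
# CUDA runtime, so torch sees zero devices.
os.environ['CUDA_VISIBLE_DEVICES'] = 'not-a-real-device'

import torch

print(torch.cuda.device_count())  # 0
# torch.load(..., map_location='cuda') now raises the same RuntimeError as above.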

@JoanFM (Member) commented Nov 11, 2022

Hey @mchaker,

This problem is in the Executor and how it loads the model onto the GPU. Where are you getting it from? Maybe we can open an issue on that repo and fix it there?

@mchaker (Author) commented Nov 11, 2022

I see - let me check with the developer and see where they are getting the executor from. Maybe it is custom.

@JoanFM (Member) commented Nov 11, 2022

I believe the issue may come from how the model was stored, or something like that. In this case Jina has made sure that your CUDA_VISIBLE_DEVICES env var is passed correctly to the Executor.

@mchaker (Author) commented Nov 11, 2022

I see -- I'll follow up with the executor authors and dig into the executor source. Thanks for your help!

@mchaker (Author) commented Nov 11, 2022

@JoanFM actually it looks like the executor is from Jina:
https://github.com/jina-ai/dalle-flow/blob/main/executors/clipseg/executor.py

@AmericanPresidentJimmyCarter (Contributor) commented Nov 11, 2022

The device for the model is simply mapped with:

        model.load_state_dict(
            torch.load(
                f'{cache_path}/{WEIGHT_FOLDER_NAME}/rd64-uni.pth',
                map_location=torch.device('cuda'),
            ),
            strict=False,
        )

In this case it appears that torch is unable to map the location. @mchaker, before these lines in executors/clipseg/executor.py you can add print(os.environ.get('CUDA_VISIBLE_DEVICES')) to see what the environment actually is.
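
A defensive variant of the load (a sketch, not the repo's current code; names follow the snippet above) that falls back to CPU when torch cannot see a CUDA device:

import torch

# Choose map_location from what torch can actually see, instead of
# hard-coding 'cuda'.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.load_state_dict(
    torch.load(
        f'{cache_path}/{WEIGHT_FOLDER_NAME}/rd64-uni.pth',
        map_location=device,
    ),
    strict=False,
)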

@JoanFM (Member) commented Nov 11, 2022

Hey @AmericanPresidentJimmyCarter, do you know why it cannot be loaded with that CUDA_VISIBLE_DEVICES setting?

@AmericanPresidentJimmyCarter (Contributor)

@JoanFM No, I will try to get you debug output from the env. This appears to be a strange one.

@JoanFM transferred this issue from jina-ai/serve Nov 11, 2022
@JoanFM (Member) commented Nov 11, 2022

I transferred the issue to DALLE-FLOW because it is specific to the Executor in this project.

@mchaker (Author) commented Nov 18, 2022

@AmericanPresidentJimmyCarter what do you need from the env?

@JoanFM (Member) commented Nov 30, 2022

Hey @mchaker, @AmericanPresidentJimmyCarter, any progress on this?

@AmericanPresidentJimmyCarter (Contributor)

I still do not know why it happens -- it's only this one specific executor that has the problem. We can update to the latest jina and see if it persists.

@mchaker (Author) commented Nov 30, 2022

I updated jina using pip install -U jina and the error still happens:

RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0.
