[Core] InternalKVPut retries incorrectly when encountering a transient error

What happened + What you expected to happen

While running Ray in our cluster, we hit a bug in which a duplicate submission_id exception was raised incorrectly. We were using the Ray job submission SDK to run our workloads when a transient network failure hit the InternalKVPut RPC while its response was being returned. The job manager then raised a duplicate submission_id exception, even though that ID had not been used by any other job.

  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray-3.0.1.dev0+dbg-py3.12-linux-aarch64.egg/ray/dashboard/modules/job/job_agent.py", line 46, in submit_job
    submission_id = await self.get_job_manager().submit_job(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray-3.0.1.dev0+dbg-py3.12-linux-aarch64.egg/ray/dashboard/modules/job/job_manager.py", line 505, in submit_job
    raise ValueError(
ValueError: Job with submission_id raysubmit_BwKc7baRSvxcpFYe already exists. Please use a different submission_id.

The concrete call site of this RPC is recorded in the attached call_site.log.

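The root cause may be that InternalKVPut, which is not idempotent in this code path, is retried after a transient error without checking whether the original put already succeeded. If the first attempt writes the key but its response is lost, the retry finds the key already present, and the job manager reports that as a duplicate submission_id.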
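
The suspected sequence can be illustrated with a small, self-contained simulation. This is only a sketch of the failure mode, not Ray's actual code: `FakeKVServer`, `TransientError`, `put_with_blind_retry`, and this simplified `submit_job` are all hypothetical names, and the put is assumed to return whether the key was newly added.

```python
class TransientError(Exception):
    """Stands in for a transient RPC failure such as a dropped response."""


class FakeKVServer:
    """In-memory stand-in for a KV store (hypothetical, not Ray's GCS)."""

    def __init__(self):
        self.store = {}
        self.drop_next_response = False

    def put(self, key, value, overwrite=False):
        """Return True if the key was newly added, False if it already existed."""
        added = key not in self.store
        if added or overwrite:
            self.store[key] = value
        if self.drop_next_response:
            # The write has already been applied, but the reply never arrives.
            self.drop_next_response = False
            raise TransientError("response lost")
        return added


def put_with_blind_retry(server, key, value, attempts=3):
    """Retry on transient errors without checking whether the write landed."""
    for _ in range(attempts):
        try:
            return server.put(key, value, overwrite=False)
        except TransientError:
            continue
    raise TransientError("retries exhausted")


def submit_job(server, submission_id):
    """Mimic the duplicate check: a put that adds nothing counts as a duplicate."""
    if not put_with_blind_retry(server, submission_id, b"PENDING"):
        raise ValueError(f"Job with submission_id {submission_id} already exists.")
    return submission_id


server = FakeKVServer()
server.drop_next_response = True  # inject one lost response
try:
    submit_job(server, "raysubmit_example")
    outcome = "submitted"
except ValueError as exc:
    outcome = f"rejected: {exc}"
print(outcome)
```

In this sketch the first attempt stores the key and then fails, so the retry sees an existing key and the brand-new submission is rejected as a duplicate. Possible mitigations, again as assumptions rather than a confirmed fix, include making the put idempotent (for example, treating an existing key with an identical value as success) or not retrying non-idempotent puts blindly.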

Versions / Dependencies

Ray 3.0.0.dev, KubeRay 1.3.0

Reproduction script

Start a RayCluster using KubeRay, then run any script with the Ray job submission SDK.

The transient network failure can be reproduced with a gRPC interceptor.

Issue Severity

Medium: It is a significant difficulty but I can work around it.
