[Core] `InternalKVPut` retries incorrectly when encountering transient error · Issue #53946 · ray-project/ray · GitHub
[Core] InternalKVPut retries incorrectly when encountering transient error #53946
Open
@qts0312

Description

What happened + What you expected to happen

While running Ray in our cluster, we hit a bug that incorrectly raises a duplicate submission_id exception. We use the Ray job submission SDK to run our workloads. During one submission, a transient network failure hit the InternalKVPut RPC while its reply was being returned. The job manager then raised a duplicate submission_id exception, even though no other job used that ID.

File "/home/ray/anaconda3/lib/python3.12/site-packages/ray-3.0.1.dev0+dbg-py3.12-linux-aarch64.egg/ray/dashboard/modules/job/job_agent.py", line 46, in submit_job
    submission_id = await self.get_job_manager().submit_job(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray-3.0.1.dev0+dbg-py3.12-linux-aarch64.egg/ray/dashboard/modules/job/job_manager.py", line 505, in submit_job
    raise ValueError(
ValueError: Job with submission_id raysubmit_BwKc7baRSvxcpFYe already exists. Please use a different submission_id.

The concrete call site of this RPC is in call_site.log.

The root cause may be that InternalKVPut, which is not idempotent in this code path, is retried after a transient error without checking whether the original put already succeeded.
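
The suspected failure mode can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not Ray's actual code: `FakeKV`, `LostReplyError`, `put_with_blind_retry`, `put_with_readback`, and `submit_job` are all hypothetical names, and the sketch assumes the put uses put-if-absent semantics and reports whether the key was newly added. The read-back variant is one possible mitigation, and it only works if the value written is unique to the caller.

```python
class LostReplyError(Exception):
    """Stands in for a transient network error that occurs after the
    server has already applied the write (hypothetical)."""


class FakeKV:
    """In-memory stand-in for the KV store. Not Ray's GCS."""

    def __init__(self):
        self.data = {}
        self.drop_next_reply = False  # fault-injection switch

    def put(self, key, value, overwrite=False):
        """Return True if the key was newly added (assumed semantics)."""
        added = key not in self.data
        if added or overwrite:
            self.data[key] = value
        if self.drop_next_reply:
            # The write is applied, but the caller never sees the reply.
            self.drop_next_reply = False
            raise LostReplyError("reply lost after write was applied")
        return added

    def get(self, key):
        return self.data.get(key)


def put_with_blind_retry(kv, key, value, attempts=3):
    """Retry without checking whether an earlier attempt took effect."""
    for _ in range(attempts):
        try:
            return kv.put(key, value, overwrite=False)
        except LostReplyError:
            continue
    raise LostReplyError("retries exhausted")


def put_with_readback(kv, key, value, attempts=3):
    """After a failed attempt, check whether our own write already landed."""
    for _ in range(attempts):
        try:
            return kv.put(key, value, overwrite=False)
        except LostReplyError:
            if kv.get(key) == value:
                return True  # the first attempt succeeded; only the reply was lost
    raise LostReplyError("retries exhausted")


def submit_job(kv, submission_id, put):
    """Treat 'key not newly added' as a duplicate submission."""
    if not put(kv, submission_id, "PENDING"):
        raise ValueError(
            f"Job with submission_id {submission_id} already exists."
        )
    return submission_id


if __name__ == "__main__":
    kv = FakeKV()
    kv.drop_next_reply = True
    try:
        submit_job(kv, "raysubmit_example", put_with_blind_retry)
    except ValueError as exc:
        print("blind retry:", exc)

    kv = FakeKV()
    kv.drop_next_reply = True
    print("readback retry:", submit_job(kv, "raysubmit_example", put_with_readback))
```

With the blind retry, the second attempt finds the key written by the first attempt and reports it as not newly added, so `submit_job` raises the duplicate error for an ID that was never reused. With the read-back variant, the same lost reply is recognized as a successful write and the submission ID is returned.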

Versions / Dependencies

Ray 3.0.0.dev, KubeRay 1.3.0

Reproduction script

Start a RayCluster using KubeRay, then run any script with the Ray job submission SDK.

The transient network failure can be reproduced with a gRPC interceptor.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels: bug, core, stability, triage