Description
What happened + What you expected to happen
When running Ray in our cluster, we observed a bug in which a duplicate submission_id exception was raised incorrectly. While we were using the Ray job submission SDK to run our workloads, the InternalKVPut RPC hit a transient network failure while its reply was being returned. A duplicate submission_id exception was then raised, even though that ID did not collide with any other job's ID.
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray-3.0.1.dev0+dbg-py3.12-linux-aarch64.egg/ray/dashboard/modules/job/job_agent.py", line 46, in submit_job
submission_id = await self.get_job_manager().submit_job(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray-3.0.1.dev0+dbg-py3.12-linux-aarch64.egg/ray/dashboard/modules/job/job_manager.py", line 505, in submit_job
raise ValueError(
ValueError: Job with submission_id raysubmit_BwKc7baRSvxcpFYe already exists. Please use a different submission_id.
The concrete call site of this RPC is in the attached log: call_site.log
The root cause may be that InternalKVPut, which is not idempotent here, is retried incorrectly after a transient error, without checking whether the put operation has already been applied.
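The suspected failure mode can be modeled without Ray at all. The sketch below is a pure-Python simulation, not Ray's actual code: `FakeKV`, `DropFirstReply`, `TransientError`, both retry helpers, and the simplified `submit_job` are hypothetical names introduced only for illustration. It assumes the put uses `overwrite=False` and reports whether a new key was added, and that a retry resends the same value. A blind retry after a lost reply sees its own earlier write and reports a duplicate, while a retry that checks the stored value does not.

```python
class TransientError(Exception):
    """Stands in for a transient RPC failure such as a dropped reply."""


class FakeKV:
    """In-memory stand-in for the internal KV store (hypothetical)."""

    def __init__(self):
        self.data = {}

    def put(self, key, value, overwrite=False):
        # Returns True only when the key was newly added.
        if key in self.data and not overwrite:
            return False
        added = key not in self.data
        self.data[key] = value
        return added


class DropFirstReply:
    """Applies the put on the 'server', then loses the first reply."""

    def __init__(self, kv):
        self.kv = kv
        self.dropped = False

    def put(self, key, value, overwrite=False):
        result = self.kv.put(key, value, overwrite)  # the write is applied
        if not self.dropped:
            self.dropped = True
            raise TransientError("reply lost after the write was applied")
        return result

    def get(self, key):
        return self.kv.data.get(key)


def put_with_blind_retry(client, key, value, attempts=3):
    # Retries without checking whether the earlier attempt took effect.
    for _ in range(attempts):
        try:
            return client.put(key, value, overwrite=False)
        except TransientError:
            continue
    raise TransientError("out of retries")


def put_with_verified_retry(client, key, value, attempts=3):
    # After a failed attempt, "key exists" may be our own earlier write,
    # so compare the stored value before reporting a duplicate.
    retried = False
    for _ in range(attempts):
        try:
            added = client.put(key, value, overwrite=False)
        except TransientError:
            retried = True
            continue
        if not added and retried and client.get(key) == value:
            return True
        return added
    raise TransientError("out of retries")


def submit_job(put_fn, client, submission_id):
    # Simplified model of the duplicate check seen in the traceback.
    if not put_fn(client, submission_id, b"PENDING"):
        raise ValueError(
            f"Job with submission_id {submission_id} already exists."
        )
    return submission_id
```

With `put_with_blind_retry`, `submit_job` raises the spurious `ValueError` on a fresh store; with `put_with_verified_retry`, it succeeds, while a key that already holds a different value is still rejected. Comparing stored values is only one possible approach and cannot distinguish a concurrent writer that stored an identical value; a per-request token would be a stricter alternative.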
Versions / Dependencies
Ray 3.0.0.dev, KubeRay 1.3.0
Reproduction script
Start a RayCluster using KubeRay, then run any script with the Ray job submission SDK.
The transient network failure can be reproduced with a gRPC interceptor.
Issue Severity
Medium: It is a significant difficulty but I can work around it.