[Test][Autoscaler] deflaky unexpected dead actors in tests by setting max_restarts=-1 by rueian · Pull Request #3700 · ray-project/kuberay

[Test][Autoscaler] deflaky unexpected dead actors in tests by setting max_restarts=-1 #3700


Merged


@rueian (Contributor) commented May 27, 2025

Why are these changes needed?

The autoscaler e2e test sometimes fails like this:
[image]

The failure happens because some actors die unexpectedly. For example, at this point there should be 2 active actors and 1 pending actor, but 1 actor is already dead:
[image]
[image]
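
A minimal sketch of the change named in the title, assuming the test actors are created from a Python script. `Holder`, `create_actor`, and `ACTOR_OPTIONS` are hypothetical names for illustration, not the ones used in the KubeRay test scripts, and the detached, named actor is an assumption as well. In Ray, `max_restarts=-1` lets an actor be restarted indefinitely, so an actor whose worker process is killed should come back instead of staying dead.

```python
# Hypothetical actor options for the e2e test. max_restarts=-1 asks Ray to
# restart the actor indefinitely if its worker process dies (for example,
# after a SIGTERM). num_cpus=1 matches the 1-CPU test-group workers.
ACTOR_OPTIONS = {"num_cpus": 1, "max_restarts": -1}


class Holder:
    """Placeholder actor that only holds its CPU reservation."""

    def ping(self):
        return "ok"


def create_actor(name):
    """Create a detached, named actor that Ray restarts on failure."""
    # Imported lazily so the options above can be inspected without Ray.
    import ray

    ray.init(ignore_reinit_error=True)
    actor_cls = ray.remote(Holder)
    # Assumption: the test uses detached, named actors.
    return actor_cls.options(
        name=name, lifetime="detached", **ACTOR_OPTIONS
    ).remote()
```

With these options, the expected steady state of 2 active actors and 1 pending actor should not be disturbed by a single worker process being terminated.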

actor logs:

*** SIGTERM received at time=1748378840 on cpu 2 ***
PC: @     0xffff7fce1c28  (unknown)  syscall
    @     0xffff7ef0b718        464  absl::lts_20230802::AbslFailureSignalHandler()
    @     0xffff7fee67a0  1760940192  (unknown)
    @     0xffff7e453f50        112  std::future<>::get()
    @     0xffff7e69d7c8         80  ray::gcs::InternalKVAccessor::Get()
    @     0xffff7e38203c       1248  __pyx_pw_3ray_7_raylet_14InnerGcsClient_3internal_kv_get()
    @     0xaaaad59cd6c8        336  method_vectorcall
    @     0xaaaad583cfb4        288  _PyEval_EvalFrameDefault
    @     0xaaaad58e6360        128  _PyEval_EvalCode
    @     0xaaaad584d18c        240  _PyFunction_Vectorcall
    @     0xaaaad583c100        224  _PyEval_EvalFrameDefault
    @     0xaaaad58e6360        128  _PyEval_EvalCode
    @     0xaaaad58e6724        240  _PyEval_EvalCodeWithName
    @     0xaaaad58e6778         80  PyEval_EvalCodeEx
    @     0xaaaad58e67b8         48  PyEval_EvalCode
    @     0xaaaad5921e60         16  run_eval_code_obj
    @     0xaaaad59220f4         64  run_mod
    @     0xaaaad59253a0         64  pyrun_file
    @     0xaaaad592559c         96  PyRun_SimpleFileExFlags
    @     0xaaaad58406b8        256  Py_RunMain
    @     0xaaaad5840d0c        240  Py_BytesMain
    @     0xffff7fc273fc         32  (unknown)
    @     0xffff7fc274cc        272  __libc_start_main
    @     0xaaaad583f4e0         96  (unknown)
[2025-05-27 13:47:20,865 E 211 211] logging.cc:496: *** SIGTERM received at time=1748378840 on cpu 2 ***
[2025-05-27 13:47:20,865 E 211 211] logging.cc:496: PC: @     0xffff7fce1c28  (unknown)  syscall
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xffff7ef0b740        464  absl::lts_20230802::AbslFailureSignalHandler()
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xffff7fee67a0  1760940192  (unknown)
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xffff7e453f50        112  std::future<>::get()
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xffff7e69d7c8         80  ray::gcs::InternalKVAccessor::Get()
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xffff7e38203c       1248  __pyx_pw_3ray_7_raylet_14InnerGcsClient_3internal_kv_get()
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad59cd6c8        336  method_vectorcall
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad583cfb4        288  _PyEval_EvalFrameDefault
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad58e6360        128  _PyEval_EvalCode
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad584d18c        240  _PyFunction_Vectorcall
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad583c100        224  _PyEval_EvalFrameDefault
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad58e6360        128  _PyEval_EvalCode
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad58e6724        240  _PyEval_EvalCodeWithName
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad58e6778         80  PyEval_EvalCodeEx
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad58e67b8         48  PyEval_EvalCode
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad5921e60         16  run_eval_code_obj
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad59220f4         64  run_mod
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad59253a0         64  pyrun_file
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad592559c         96  PyRun_SimpleFileExFlags
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad58406b8        256  Py_RunMain
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad5840d0c        240  Py_BytesMain
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xffff7fc273fc         32  (unknown)
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xffff7fc274cc        272  __libc_start_main
[2025-05-27 13:47:20,867 E 211 211] logging.cc:496:     @     0xaaaad583f4e0         96  (unknown)
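
To line the SIGTERM up with the autoscaler logs, the epoch value in `time=1748378840` can be converted to wall-clock time. This is a small sketch; the UTC-7 offset is an assumption inferred from the `13:47:20` prefix on the same log lines, and `epoch_to_log_time` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Taken from "*** SIGTERM received at time=1748378840 on cpu 2 ***".
SIGTERM_EPOCH = 1748378840

# Assumption: log prefixes are written in UTC-7, inferred from the
# "2025-05-27 13:47:20" prefix on the same SIGTERM lines.
LOG_TZ = timezone(timedelta(hours=-7))


def epoch_to_log_time(epoch, tz=LOG_TZ):
    """Format an epoch timestamp the way the log prefixes are written."""
    return datetime.fromtimestamp(epoch, tz=tz).strftime("%Y-%m-%d %H:%M:%S")


print(epoch_to_log_time(SIGTERM_EPOCH))                # 2025-05-27 13:47:20
print(epoch_to_log_time(SIGTERM_EPOCH, timezone.utc))  # 2025-05-27 20:47:20
```

If the autoscaler logs use the same clock, the SIGTERM falls between the autoscaler's 13:47:20 pod listing and its 13:47:25 decision to add a test-group node.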

autoscaler logs:

2025-05-27 13:46:54,510	INFO run_autoscaler.py:47 -- The Ray head is ready. Starting the autoscaler.
2025-05-27 13:46:54,578 - INFO - Refreshing K8s API client token and certs.
2025-05-27 13:46:54,578	INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-05-27 13:46:54,607 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:46:54,607	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:46:54,612 - INFO - session_name: session_2025-05-27_13-46-51_877838_1
2025-05-27 13:46:54,612	INFO monitor.py:84 -- session_name: session_2025-05-27_13-46-51_877838_1
2025-05-27 13:46:54,613 - INFO - Starting autoscaler metrics server on port 44217
2025-05-27 13:46:54,613	INFO monitor.py:108 -- Starting autoscaler metrics server on port 44217
2025-05-27 13:46:54,616 - INFO - Using Autoscaling Config: 
auth: {}
available_node_types:
  headgroup:
    max_workers: 0
    min_workers: 0
    node_config: {}
    resources:
      CPU: 0
      memory: 4000000000
  test-group:
    max_workers: 2
    min_workers: 1
    node_config: {}
    resources:
      CPU: 1
      memory: 4000000000
cluster_name: ray-cluster
cluster_synced_files: []
file_mounts: {}
file_mounts_sync_continuously: false
head_node_type: headgroup
head_setup_commands: []
head_start_ray_commands: []
idle_timeout_minutes: 1.0
initialization_commands: []
max_workers: 2
provider:
  disable_launch_config_check: true
  disable_node_updaters: true
  foreground_node_launch: true
  namespace: test-ns-wfmfs
  type: kuberay
  worker_liveness_check: false
setup_commands: []
upscaling_speed: 1000
worker_setup_commands: []
worker_start_ray_commands: []

2025-05-27 13:46:54,616	INFO autoscaler.py:63 -- Using Autoscaling Config: 
auth: {}
available_node_types:
  headgroup:
    max_workers: 0
    min_workers: 0
    node_config: {}
    resources:
      CPU: 0
      memory: 4000000000
  test-group:
    max_workers: 2
    min_workers: 1
    node_config: {}
    resources:
      CPU: 1
      memory: 4000000000
cluster_name: ray-cluster
cluster_synced_files: []
file_mounts: {}
file_mounts_sync_continuously: false
head_node_type: headgroup
head_setup_commands: []
head_start_ray_commands: []
idle_timeout_minutes: 1.0
initialization_commands: []
max_workers: 2
provider:
  disable_launch_config_check: true
  disable_node_updaters: true
  foreground_node_launch: true
  namespace: test-ns-wfmfs
  type: kuberay
  worker_liveness_check: false
setup_commands: []
upscaling_speed: 1000
worker_setup_commands: []
worker_start_ray_commands: []

2025-05-27 13:46:54,674 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:46:54,674	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:46:54,675 - INFO - Refreshing K8s API client token and certs.
2025-05-27 13:46:54,675	INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-05-27 13:46:54,688 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14656.
2025-05-27 13:46:54,688	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14656.
2025-05-27 13:46:54,694 - INFO - Fetched pod data at resource version 14656.
2025-05-27 13:46:54,694	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14656.
2025-05-27 13:46:54,695 - INFO - New instance ALLOCATED (id=7731d462-4c5d-4482-8f8c-bacbff73c306, type=headgroup, cloud_instance_id=, ray_id=): allocated unmanaged cloud instance :ray-cluster-head (HEAD) from cloud provider
2025-05-27 13:46:54,695	INFO instance_manager.py:246 -- New instance ALLOCATED (id=7731d462-4c5d-4482-8f8c-bacbff73c306, type=headgroup, cloud_instance_id=, ray_id=): allocated unmanaged cloud instance :ray-cluster-head (HEAD) from cloud provider
2025-05-27 13:46:54,695 - INFO - New instance ALLOCATED (id=0c26ad6e-9289-41db-b106-ee0dda6c4a72, type=test-group, cloud_instance_id=, ray_id=): allocated unmanaged cloud instance :ray-cluster-test-group-worker-ln4k6 (WORKER) from cloud provider
2025-05-27 13:46:54,695	INFO instance_manager.py:246 -- New instance ALLOCATED (id=0c26ad6e-9289-41db-b106-ee0dda6c4a72, type=test-group, cloud_instance_id=, ray_id=): allocated unmanaged cloud instance :ray-cluster-test-group-worker-ln4k6 (WORKER) from cloud provider
2025-05-27 13:46:59,761 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:46:59,761	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:46:59,772 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14656.
2025-05-27 13:46:59,772	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14656.
2025-05-27 13:46:59,778 - INFO - Fetched pod data at resource version 14685.
2025-05-27 13:46:59,778	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14685.
2025-05-27 13:47:04,864 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:04,864	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:04,879 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14656.
2025-05-27 13:47:04,879	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14656.
2025-05-27 13:47:04,884 - INFO - Fetched pod data at resource version 14685.
2025-05-27 13:47:04,884	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14685.
2025-05-27 13:47:04,885 - INFO - Update instance ALLOCATED->RAY_RUNNING (id=0c26ad6e-9289-41db-b106-ee0dda6c4a72, type=test-group, cloud_instance_id=ray-cluster-test-group-worker-ln4k6, ray_id=): ray node fe89d6c8e5bdd9c38c79f1faaf33a091774aa2cd7fd768a5f7fe7fda is IDLE
2025-05-27 13:47:04,885	INFO instance_manager.py:262 -- Update instance ALLOCATED->RAY_RUNNING (id=0c26ad6e-9289-41db-b106-ee0dda6c4a72, type=test-group, cloud_instance_id=ray-cluster-test-group-worker-ln4k6, ray_id=): ray node fe89d6c8e5bdd9c38c79f1faaf33a091774aa2cd7fd768a5f7fe7fda is IDLE
2025-05-27 13:47:10,072 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:10,072	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:10,085 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14656.
2025-05-27 13:47:10,085	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14656.
2025-05-27 13:47:10,090 - INFO - Fetched pod data at resource version 14704.
2025-05-27 13:47:10,090	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14704.
2025-05-27 13:47:15,174 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:15,174	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:15,187 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14656.
2025-05-27 13:47:15,187	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14656.
2025-05-27 13:47:15,192 - INFO - Fetched pod data at resource version 14704.
2025-05-27 13:47:15,192	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14704.
2025-05-27 13:47:15,192 - INFO - Update instance ALLOCATED->RAY_RUNNING (id=7731d462-4c5d-4482-8f8c-bacbff73c306, type=headgroup, cloud_instance_id=ray-cluster-head, ray_id=): ray node 96da6388a046fe19682d42d0bb5ffe11aa74073f92e1c87f3ddecf5a is IDLE
2025-05-27 13:47:15,192	INFO instance_manager.py:262 -- Update instance ALLOCATED->RAY_RUNNING (id=7731d462-4c5d-4482-8f8c-bacbff73c306, type=headgroup, cloud_instance_id=ray-cluster-head, ray_id=): ray node 96da6388a046fe19682d42d0bb5ffe11aa74073f92e1c87f3ddecf5a is IDLE
2025-05-27 13:47:20,260 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:20,260	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:20,274 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:20,274	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:20,279 - INFO - Fetched pod data at resource version 14730.
2025-05-27 13:47:20,279	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14730.
2025-05-27 13:47:25,313 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:25,313	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:25,325 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:25,325	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:25,330 - INFO - Fetched pod data at resource version 14730.
2025-05-27 13:47:25,330	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14730.
2025-05-27 13:47:25,336 - INFO - Adding 1 node(s) of type test-group.
2025-05-27 13:47:25,336	INFO event_logger.py:56 -- Adding 1 node(s) of type test-group.
2025-05-27 13:47:25,336 - INFO - New instance QUEUED (id=ca0724be-e210-4053-80b3-69021f5ef44f, type=test-group, cloud_instance_id=, ray_id=): queuing new instance of test-group from scheduler
2025-05-27 13:47:25,336	INFO instance_manager.py:246 -- New instance QUEUED (id=ca0724be-e210-4053-80b3-69021f5ef44f, type=test-group, cloud_instance_id=, ray_id=): queuing new instance of test-group from scheduler
2025-05-27 13:47:25,336 - INFO - Update instance QUEUED->REQUESTED (id=ca0724be-e210-4053-80b3-69021f5ef44f, type=test-group, cloud_instance_id=, ray_id=): requested to launch test-group with request id a928a6a5-92f3-4f41-8012-c01bb1986958
2025-05-27 13:47:25,336	INFO instance_manager.py:262 -- Update instance QUEUED->REQUESTED (id=ca0724be-e210-4053-80b3-69021f5ef44f, type=test-group, cloud_instance_id=, ray_id=): requested to launch test-group with request id a928a6a5-92f3-4f41-8012-c01bb1986958
2025-05-27 13:47:25,347 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:25,347	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:25,352 - INFO - Fetched pod data at resource version 14730.
2025-05-27 13:47:25,352	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14730.
2025-05-27 13:47:25,352 - INFO - Submitting a scale request: KubeRayProvider.ScaleRequest(desired_num_workers=defaultdict(<class 'int'>, {'test-group': 2}), workers_to_delete=defaultdict(<class 'list'>, {}), worker_groups_without_pending_deletes=set(), worker_groups_with_pending_deletes=set())
2025-05-27 13:47:25,352	INFO cloud_provider.py:331 -- Submitting a scale request: KubeRayProvider.ScaleRequest(desired_num_workers=defaultdict(<class 'int'>, {'test-group': 2}), workers_to_delete=defaultdict(<class 'list'>, {}), worker_groups_without_pending_deletes=set(), worker_groups_with_pending_deletes=set())
2025-05-27 13:47:30,385 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:30,385	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:30,396 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:30,396	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:30,401 - INFO - Fetched pod data at resource version 14763.
2025-05-27 13:47:30,401	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14763.
2025-05-27 13:47:30,401 - INFO - Update instance REQUESTED->ALLOCATED (id=ca0724be-e210-4053-80b3-69021f5ef44f, type=test-group, cloud_instance_id=, ray_id=): allocated unassigned cloud instance ray-cluster-test-group-worker-54jst
2025-05-27 13:47:30,401	INFO instance_manager.py:262 -- Update instance REQUESTED->ALLOCATED (id=ca0724be-e210-4053-80b3-69021f5ef44f, type=test-group, cloud_instance_id=, ray_id=): allocated unassigned cloud instance ray-cluster-test-group-worker-54jst
2025-05-27 13:47:35,444 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:35,444	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:35,457 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:35,457	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:35,463 - INFO - Fetched pod data at resource version 14776.
2025-05-27 13:47:35,463	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14776.
2025-05-27 13:47:40,491 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:40,491	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:40,503 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:40,503	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:40,509 - INFO - Fetched pod data at resource version 14786.
2025-05-27 13:47:40,509	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14786.
2025-05-27 13:47:40,510 - INFO - Update instance ALLOCATED->RAY_RUNNING (id=ca0724be-e210-4053-80b3-69021f5ef44f, type=test-group, cloud_instance_id=ray-cluster-test-group-worker-54jst, ray_id=): ray node f09af57429dce3ed682b9e7cdcb497448e64b7c53c433e5188e66d47 is RUNNING
2025-05-27 13:47:40,510	INFO instance_manager.py:262 -- Update instance ALLOCATED->RAY_RUNNING (id=ca0724be-e210-4053-80b3-69021f5ef44f, type=test-group, cloud_instance_id=ray-cluster-test-group-worker-54jst, ray_id=): ray node f09af57429dce3ed682b9e7cdcb497448e64b7c53c433e5188e66d47 is RUNNING
2025-05-27 13:47:45,539 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:45,539	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:45,552 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:45,552	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:45,558 - INFO - Fetched pod data at resource version 14802.
2025-05-27 13:47:45,558	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14802.
2025-05-27 13:47:50,588 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:50,588	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:50,600 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:50,600	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:50,606 - INFO - Fetched pod data at resource version 14804.
2025-05-27 13:47:50,606	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14804.
2025-05-27 13:47:55,619 - INFO - Refreshing K8s API client token and certs.
2025-05-27 13:47:55,619	INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-05-27 13:47:55,650 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:55,650	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:47:55,650 - INFO - Refreshing K8s API client token and certs.
2025-05-27 13:47:55,650	INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-05-27 13:47:55,663 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:55,663	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:47:55,669 - INFO - Fetched pod data at resource version 14804.
2025-05-27 13:47:55,669	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14804.
2025-05-27 13:48:00,703 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:00,703	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:00,717 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:00,717	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:00,722 - INFO - Fetched pod data at resource version 14823.
2025-05-27 13:48:00,722	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14823.
2025-05-27 13:48:05,748 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:05,748	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:05,761 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:05,761	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:05,766 - INFO - Fetched pod data at resource version 14834.
2025-05-27 13:48:05,766	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14834.
2025-05-27 13:48:10,787 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:10,787	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:10,799 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:10,799	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:10,804 - INFO - Fetched pod data at resource version 14844.
2025-05-27 13:48:10,804	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14844.
2025-05-27 13:48:15,836 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:15,836	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:15,848 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:15,848	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:15,854 - INFO - Fetched pod data at resource version 14853.
2025-05-27 13:48:15,854	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14853.
2025-05-27 13:48:20,895 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:20,895	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:20,908 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:20,908	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:20,914 - INFO - Fetched pod data at resource version 14862.
2025-05-27 13:48:20,914	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14862.
2025-05-27 13:48:25,964 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:25,964	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:25,977 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:25,977	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:25,983 - INFO - Fetched pod data at resource version 14873.
2025-05-27 13:48:25,983	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14873.
2025-05-27 13:48:31,020 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:31,020	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:31,035 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:31,035	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:31,040 - INFO - Fetched pod data at resource version 14883.
2025-05-27 13:48:31,040	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14883.
2025-05-27 13:48:36,085 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:36,085	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:36,098 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:36,098	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:36,104 - INFO - Fetched pod data at resource version 14892.
2025-05-27 13:48:36,104	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14892.
2025-05-27 13:48:41,131 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:41,131	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:41,149 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:41,149	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:41,155 - INFO - Fetched pod data at resource version 14903.
2025-05-27 13:48:41,155	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14903.
2025-05-27 13:48:46,188 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:46,188	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:46,201 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:46,201	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:46,206 - INFO - Fetched pod data at resource version 14913.
2025-05-27 13:48:46,206	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14913.
2025-05-27 13:48:51,237 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:51,237	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:51,249 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:51,249	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:51,254 - INFO - Fetched pod data at resource version 14922.
2025-05-27 13:48:51,254	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14922.
2025-05-27 13:48:56,263 - INFO - Refreshing K8s API client token and certs.
2025-05-27 13:48:56,263	INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-05-27 13:48:56,292 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:56,292	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:48:56,292 - INFO - Refreshing K8s API client token and certs.
2025-05-27 13:48:56,292	INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-05-27 13:48:56,307 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:56,307	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:48:56,313 - INFO - Fetched pod data at resource version 14931.
2025-05-27 13:48:56,313	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14931.
2025-05-27 13:49:01,347 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:01,347	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:01,361 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:01,361	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:01,367 - INFO - Fetched pod data at resource version 14931.
2025-05-27 13:49:01,367	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14931.
2025-05-27 13:49:06,407 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:06,407	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:06,421 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:06,421	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:06,426 - INFO - Fetched pod data at resource version 14942.
2025-05-27 13:49:06,426	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14942.
2025-05-27 13:49:11,465 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:11,465	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:11,477 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:11,477	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:11,483 - INFO - Fetched pod data at resource version 14952.
2025-05-27 13:49:11,483	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14952.
2025-05-27 13:49:16,529 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:16,529	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:16,543 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:16,543	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:16,549 - INFO - Fetched pod data at resource version 14962.
2025-05-27 13:49:16,549	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14962.
2025-05-27 13:49:21,583 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:21,583	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:21,596 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:21,596	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:21,601 - INFO - Fetched pod data at resource version 14972.
2025-05-27 13:49:21,601	INFO cloud_provider.py:481 -- Fetched pod data at resource version 14972.
2025-05-27 13:49:26,626 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:26,626	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:26,639 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:26,639	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:26,645 - INFO - Fetched pod data at resource version 15010.
2025-05-27 13:49:26,645	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15010.
2025-05-27 13:49:31,672 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:31,672	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:31,686 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:31,686	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:31,691 - INFO - Fetched pod data at resource version 15025.
2025-05-27 13:49:31,691	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15025.
2025-05-27 13:49:36,714 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:36,714	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:36,727 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:36,727	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:36,733 - INFO - Fetched pod data at resource version 15049.
2025-05-27 13:49:36,733	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15049.
2025-05-27 13:49:41,753 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:41,753	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:41,764 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:41,764	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:41,769 - INFO - Fetched pod data at resource version 15049.
2025-05-27 13:49:41,769	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15049.
2025-05-27 13:49:46,797 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:46,797	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:46,810 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:46,810	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:46,816 - INFO - Fetched pod data at resource version 15072.
2025-05-27 13:49:46,816	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15072.
2025-05-27 13:49:51,851 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:51,851	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:51,863 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:51,863	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:51,868 - INFO - Fetched pod data at resource version 15084.
2025-05-27 13:49:51,868	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15084.
2025-05-27 13:49:56,877 - INFO - Refreshing K8s API client token and certs.
2025-05-27 13:49:56,877	INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-05-27 13:49:56,910 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:56,910	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:49:56,910 - INFO - Refreshing K8s API client token and certs.
2025-05-27 13:49:56,910	INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-05-27 13:49:56,923 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:56,923	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:49:56,929 - INFO - Fetched pod data at resource version 15084.
2025-05-27 13:49:56,929	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15084.
2025-05-27 13:50:01,960 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:01,960	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:01,972 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:01,972	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:01,978 - INFO - Fetched pod data at resource version 15113.
2025-05-27 13:50:01,978	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15113.
2025-05-27 13:50:07,006 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:07,006	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:07,018 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:07,018	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:07,023 - INFO - Fetched pod data at resource version 15113.
2025-05-27 13:50:07,023	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15113.
2025-05-27 13:50:12,047 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:12,047	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:12,058 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:12,058	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:12,063 - INFO - Fetched pod data at resource version 15133.
2025-05-27 13:50:12,063	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15133.
2025-05-27 13:50:17,101 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:17,101	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:17,115 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:17,115	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:17,120 - INFO - Fetched pod data at resource version 15133.
2025-05-27 13:50:17,120	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15133.
2025-05-27 13:50:22,158 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:22,158	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:22,172 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:22,172	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:22,177 - INFO - Fetched pod data at resource version 15169.
2025-05-27 13:50:22,177	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15169.
2025-05-27 13:50:27,208 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:27,208	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:27,220 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:27,220	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:27,226 - INFO - Fetched pod data at resource version 15229.
2025-05-27 13:50:27,226	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15229.
2025-05-27 13:50:32,266 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:32,266	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:32,279 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:32,279	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:32,285 - INFO - Fetched pod data at resource version 15229.
2025-05-27 13:50:32,285	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15229.
2025-05-27 13:50:37,314 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:37,314	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:37,328 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:37,328	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:37,334 - INFO - Fetched pod data at resource version 15248.
2025-05-27 13:50:37,334	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15248.
2025-05-27 13:50:42,374 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:42,374	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:42,387 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:42,387	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:42,393 - INFO - Fetched pod data at resource version 15259.
2025-05-27 13:50:42,393	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15259.
2025-05-27 13:50:47,427 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:47,427	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:47,440 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:47,440	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:47,446 - INFO - Fetched pod data at resource version 15268.
2025-05-27 13:50:47,446	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15268.
2025-05-27 13:50:52,486 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:52,486	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:52,499 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:52,499	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:52,504 - INFO - Fetched pod data at resource version 15279.
2025-05-27 13:50:52,504	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15279.
2025-05-27 13:50:57,517 - INFO - Refreshing K8s API client token and certs.
2025-05-27 13:50:57,517	INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-05-27 13:50:57,551 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:57,551	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:50:57,551 - INFO - Refreshing K8s API client token and certs.
2025-05-27 13:50:57,551	INFO node_provider.py:277 -- Refreshing K8s API client token and certs.
2025-05-27 13:50:57,563 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:57,563	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:50:57,569 - INFO - Fetched pod data at resource version 15288.
2025-05-27 13:50:57,569	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15288.
2025-05-27 13:51:02,615 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:02,615	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:02,627 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:02,627	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:02,633 - INFO - Fetched pod data at resource version 15305.
2025-05-27 13:51:02,633	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15305.
2025-05-27 13:51:07,671 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:07,671	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:07,684 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:07,684	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:07,689 - INFO - Fetched pod data at resource version 15314.
2025-05-27 13:51:07,689	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15314.
2025-05-27 13:51:12,722 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:12,722	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:12,735 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:12,735	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:12,740 - INFO - Fetched pod data at resource version 15314.
2025-05-27 13:51:12,740	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15314.
2025-05-27 13:51:17,767 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:17,767	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:17,781 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:17,781	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:17,787 - INFO - Fetched pod data at resource version 15323.
2025-05-27 13:51:17,787	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15323.
2025-05-27 13:51:22,825 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:22,825	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:22,838 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:22,838	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:22,843 - INFO - Fetched pod data at resource version 15333.
2025-05-27 13:51:22,843	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15333.
2025-05-27 13:51:27,871 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:27,871	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:27,883 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:27,883	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:27,888 - INFO - Fetched pod data at resource version 15343.
2025-05-27 13:51:27,888	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15343.
2025-05-27 13:51:32,932 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:32,932	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:32,945 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:32,945	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:32,951 - INFO - Fetched pod data at resource version 15352.
2025-05-27 13:51:32,951	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15352.
2025-05-27 13:51:37,988 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:37,988	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:38,000 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:38,000	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:38,005 - INFO - Fetched pod data at resource version 15362.
2025-05-27 13:51:38,005	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15362.
2025-05-27 13:51:43,039 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:43,039	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:43,052 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:43,052	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:43,058 - INFO - Fetched pod data at resource version 15373.
2025-05-27 13:51:43,058	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15373.
2025-05-27 13:51:48,099 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:48,099	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:48,113 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:48,113	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:48,119 - INFO - Fetched pod data at resource version 15382.
2025-05-27 13:51:48,119	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15382.
2025-05-27 13:51:53,160 - INFO - Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:53,160	INFO config.py:183 -- Calculating hashes for file mounts and ray commands.
2025-05-27 13:51:53,173 - INFO - Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:53,173	INFO cloud_provider.py:463 -- Listing pods for RayCluster ray-cluster in namespace test-ns-wfmfs at pods resource version >= 14730.
2025-05-27 13:51:53,179 - INFO - Fetched pod data at resource version 15392.
2025-05-27 13:51:53,179	INFO cloud_provider.py:481 -- Fetched pod data at resource version 15392.

This PR tries to restart actors by setting max_restarts=-1.
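
For context, a restart policy like this is usually set when the actor is created. Below is a minimal sketch in Python, assuming the test creates a named detached actor through Ray's `.options(...)` call. The helper names `actor_options` and `create_detached_actor`, the placeholder `Actor` class, the `num_cpus` value, and the namespace are illustrative assumptions, not taken from this PR's diff:

```python
from typing import Any, Dict


def actor_options(name: str, num_cpus: float = 1) -> Dict[str, Any]:
    """Build options for a detached actor that Ray restarts without limit."""
    return {
        "name": name,
        "lifetime": "detached",  # the actor outlives the driver that created it
        "num_cpus": num_cpus,    # illustrative resource request (assumption)
        "max_restarts": -1,      # -1: restart indefinitely after an unexpected death
    }


def create_detached_actor(name: str):
    """Create the actor on a running cluster (requires the `ray` package)."""
    import ray  # imported lazily so actor_options works without Ray installed

    @ray.remote
    class Actor:  # hypothetical placeholder actor
        def ping(self) -> str:
            return "pong"

    ray.init(namespace="default_namespace")  # hypothetical namespace
    return Actor.options(**actor_options(name)).remote()
```

With `max_restarts=-1`, an actor whose process is killed unexpectedly should be rescheduled instead of staying dead, which is the state the test was tripping over.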

Related issue number

Closes #3701
Related to ray-project/ray#40864

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

… max_restarts=-1

Signed-off-by: Rueian <rueiancsie@gmail.com>
@rueian rueian marked this pull request as ready for review May 27, 2025 22:03
@rueian
Contributor Author
rueian commented May 27, 2025

Another example:
image

Note that actor0 and actor1 were placed on the same node. Since each test node can only run one actor due to limited resources, I believe that’s why actor0 was terminated.
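
The diagnosis above amounts to checking whether any node hosts more actors than it has room for. A small sketch of that check in Python follows. The `overcommitted_nodes` helper, the node names, and the one-actor-per-node capacity are assumptions for illustration, not data read from the screenshots:

```python
from collections import defaultdict
from typing import Dict, List


def overcommitted_nodes(
    placement: Dict[str, str], capacity: int = 1
) -> Dict[str, List[str]]:
    """Return the nodes that host more actors than `capacity`, with their actors."""
    by_node: Dict[str, List[str]] = defaultdict(list)
    for actor, node in placement.items():
        by_node[node].append(actor)
    return {
        node: sorted(actors)
        for node, actors in by_node.items()
        if len(actors) > capacity
    }


# Hypothetical placement: actor0 and actor1 landed on the same node.
placement = {"actor0": "node-a", "actor1": "node-a", "actor2": "node-b"}
print(overcommitted_nodes(placement))
```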

@kevin85421
Member

@rueian, we’re in favor of #3707, right?

@rueian
Contributor Author
rueian commented May 31, 2025

@rueian, we’re in favor of #3707, right?

Yes. Actually, #3700 and #3707 both work, and I think we can merge both of them.

@rueian rueian reopened this Jun 4, 2025
@rueian rueian force-pushed the e2eautoscaler-deflaky-dead-actor branch 2 times, most recently from 53aec73 to 4024279 Compare June 4, 2025 20:26
@rueian rueian force-pushed the e2eautoscaler-deflaky-dead-actor branch from 4024279 to 8c8ec66 Compare June 4, 2025 21:43
@rueian rueian force-pushed the e2eautoscaler-deflaky-dead-actor branch from 8c8ec66 to 6322758 Compare June 5, 2025 07:03
@kevin85421 kevin85421 merged commit c75997a into ray-project:master Jun 6, 2025
25 checks passed
pawelpaszki pushed a commit to opendatahub-io/kuberay that referenced this pull request Jun 10, 2025
edoakes pushed a commit to ray-project/ray that referenced this pull request Jun 13, 2025
…ity (#53782)

## Why are these changes needed?

While working on #53562, we
[decided](#53562 (comment))
to refactor the `NodeManager` first so that a
`WorkerPoolInterface` implementation can be injected into it from `main.cc`. This PR
does the refactoring. That is:

1. Updated the `WorkerPoolInterface` to cover all methods of
`WorkerPool`. Previously the interface covered only a subset.
2. Updated all the existing mock implementations of
`WorkerPoolInterface` to cover the newly added methods.
3. Replaced `WorkerPool worker_pool_` with `WorkerPoolInterface
&worker_pool_` in the `NodeManager` so that we can swap it out for
testing, which is required by
#53562.
4. Modified the `NodeManager` constructor to accept a
`WorkerPoolInterface &worker_pool_`.
5. In addition to injecting the new `WorkerPoolInterface &worker_pool_`,
we also need to inject all of its dependencies. As a result, all of the
following are now constructed and owned in `main.cc`:

```cpp
  std::shared_ptr<plasma::PlasmaClient> plasma_client;
  std::shared_ptr<ray::raylet::NodeManager> node_manager;
  std::shared_ptr<ray::rpc::ClientCallManager> client_call_manager;
  std::shared_ptr<ray::rpc::CoreWorkerClientPool> worker_rpc_pool;
  std::shared_ptr<ray::raylet::WorkerPoolInterface> worker_pool;
  std::shared_ptr<ray::raylet::LocalObjectManager> local_object_manager;
  std::shared_ptr<ray::ClusterResourceScheduler> cluster_resource_scheduler;
  std::shared_ptr<ray::raylet::LocalTaskManager> local_task_manager;
  std::shared_ptr<ray::raylet::ClusterTaskManagerInterface> cluster_task_manager;
  std::shared_ptr<ray::pubsub::SubscriberInterface> core_worker_subscriber;
  std::shared_ptr<ray::IObjectDirectory> object_directory;
  std::shared_ptr<ray::ObjectManagerInterface> object_manager;
  std::shared_ptr<ray::raylet::DependencyManager> dependency_manager;
  absl::flat_hash_map<WorkerID, std::shared_ptr<ray::raylet::WorkerInterface>> leased_workers;
```

This PR does not introduce any behavioral changes.
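
The injection pattern described in the steps above can be sketched independently of the C++ code base. The following Python illustration borrows the names `WorkerPoolInterface`, `WorkerPool`, and `NodeManager` from the description. The methods `pop_worker` and `lease_worker`, the `MockWorkerPool` class, and the `main` function are hypothetical stand-ins, not the real raylet API:

```python
from typing import List, Optional, Protocol


class WorkerPoolInterface(Protocol):
    """The full surface the node manager is allowed to depend on."""

    def pop_worker(self) -> Optional[str]: ...


class WorkerPool:
    """Production implementation (hypothetical, heavily simplified)."""

    def __init__(self, workers: List[str]) -> None:
        self._workers = list(workers)

    def pop_worker(self) -> Optional[str]:
        return self._workers.pop() if self._workers else None


class MockWorkerPool:
    """Test double that records how often it was called."""

    def __init__(self) -> None:
        self.calls = 0

    def pop_worker(self) -> Optional[str]:
        self.calls += 1
        return "mock-worker"


class NodeManager:
    """Receives its worker pool instead of constructing one itself."""

    def __init__(self, worker_pool: WorkerPoolInterface) -> None:
        self._worker_pool = worker_pool

    def lease_worker(self) -> Optional[str]:
        return self._worker_pool.pop_worker()


def main() -> NodeManager:
    # The entry point owns the concrete dependency and passes it in.
    pool = WorkerPool(["worker-1"])
    return NodeManager(pool)
```

Because the entry point owns the pool, a test can hand `NodeManager` a `MockWorkerPool` without touching the production wiring.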

## Related issue number

Related to 
#53562
#40864
ray-project/kuberay#3701 and
ray-project/kuberay#3700


## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Rueian <rueiancsie@gmail.com>
elliot-barn pushed a commit to ray-project/ray that referenced this pull request Jun 18, 2025
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
rebel-scottlee pushed a commit to rebellions-sw/ray that referenced this pull request Jun 21, 2025
Signed-off-by: Scott Lee <scott.lee@rebellions.ai>
edoakes pushed a commit to ray-project/ray that referenced this pull request Jun 24, 2025
By replacing the inaccurate `worker->IsDetachedActor()` with
`worker->GetAssignedTask().GetTaskSpecification().IsDetachedActor()`.

In the previous PR #14184, the
`worker.MarkDetachedActor()` that happened on assigning a task to a
worker was
[deleted](https://github.com/ray-project/ray/pull/14184/files#diff-d2f22b8f1bf5f9be47dacae8b467a72ee94629df12ffcc18b13447192ff3dbcfL1982).
<img width="496" alt="image"
src="https://github.com/user-attachments/assets/9510a564-909a-44cd-aa19-2d85fccaadd7"
/>
As a result, a leased worker that hosts a detached actor can be killed by
[HandleUnexpectedWorkerFailure](https://github.com/ray-project/ray/blob/f5c59745d00982835feb145d14d1f9e0d4b0db6c/src/ray/raylet/node_manager.cc#L1059),
as mentioned in #40864. This can
be triggered even by a normal driver exit. Reproduction
scripts can be found in [the
comment](#40864 (comment)).

I think `Worker::IsDetachedActor` and
`Worker::MarkDetachedActor` are actually redundant and should be removed,
because we can tell whether a worker hosts a detached actor from
its assigned task.

The info first becomes available after `worker->SetAssignedTask(task)` (L962)
during `LocalTaskManager::Dispatch`, and the worker is then inserted into
the `leased_workers` map (L972).

https://github.com/ray-project/ray/blob/118c37058ae2904a79da9be160633a6a8d3ee3b6/src/ray/raylet/local_task_manager.cc#L962-L972

Therefore, we can safely read this info through
`worker->GetAssignedTask().GetTaskSpecification().IsDetachedActor()`
while looping over `leased_workers_` in the `NodeManager`. That way,
we no longer need to worry about a code path forgetting to call
`worker.MarkDetachedActor()`.
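
To make the reasoning concrete, here is a minimal sketch that derives the detached-actor status from the assigned task instead of a separately maintained flag. All types are simplified, hypothetical stand-ins for the raylet classes, and `Dispatch` and `WorkersToKill` are illustrative names, not the actual `LocalTaskManager::Dispatch` or `HandleUnexpectedWorkerFailure` logic, which does considerably more.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins for the raylet types.
class TaskSpecification {
 public:
  explicit TaskSpecification(bool is_detached_actor = false)
      : is_detached_actor_(is_detached_actor) {}
  bool IsDetachedActor() const { return is_detached_actor_; }

 private:
  bool is_detached_actor_;
};

class Task {
 public:
  Task() = default;
  explicit Task(TaskSpecification spec) : spec_(spec) {}
  const TaskSpecification &GetTaskSpecification() const { return spec_; }

 private:
  TaskSpecification spec_;
};

// Note: the worker keeps no separate "detached" flag that could go stale.
class Worker {
 public:
  void SetAssignedTask(const Task &task) { assigned_task_ = task; }
  const Task &GetAssignedTask() const { return assigned_task_; }

 private:
  Task assigned_task_;
};

using LeasedWorkers = std::unordered_map<std::string, std::shared_ptr<Worker>>;

// Mirrors the ordering described above: the task is assigned first, and only
// then does the worker become visible in the leased-workers map.
void Dispatch(LeasedWorkers &leased_workers,
              const std::string &worker_id,
              const Task &task) {
  auto worker = std::make_shared<Worker>();
  worker->SetAssignedTask(task);
  leased_workers.emplace(worker_id, worker);
}

// Returns the sorted IDs of leased workers that are eligible to be killed,
// skipping every worker whose assigned task is a detached actor.
std::vector<std::string> WorkersToKill(const LeasedWorkers &leased_workers) {
  std::vector<std::string> ids;
  for (const auto &entry : leased_workers) {
    assert(entry.second != nullptr);
    const Task &task = entry.second->GetAssignedTask();
    if (!task.GetTaskSpecification().IsDetachedActor()) {
      ids.push_back(entry.first);
    }
  }
  std::sort(ids.begin(), ids.end());
  return ids;
}
```

Because every worker in the map already has its task assigned, the check cannot observe a worker whose status has not been set yet, which is the safety argument made above.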

Closes #40864
Related to ray-project/kuberay#3701 and
ray-project/kuberay#3700

---------

Signed-off-by: Rueian <rueiancsie@gmail.com>
Development

Successfully merging this pull request may close these issues.

[CI] Deflaky Autoscaler V2 e2e tests