[fix][operator] RayJob.Status.RayJobStatusInfo.EndTime nil deref error by davidxia · Pull Request #3742 · ray-project/kuberay · GitHub

[fix][operator] RayJob.Status.RayJobStatusInfo.EndTime nil deref error #3742

Merged
merged 1 commit into from
Jun 4, 2025

@davidxia davidxia commented Jun 4, 2025

If users run the latest ray-operator code from the master branch without updating their RayJob CRD, the operator panics with a nil pointer dereference in rayjob_controller.go. rayJob.Status.RayJobStatusInfo.EndTime is nil when a RayJob's Job fails while the RayJob CRD from v1.3.2 is installed.

We check if EndTime is nil and return false from checkTransitionGracePeriodAndUpdateStatusIfNeeded() if so.

Operator logs before the fix show the panic:
{"level":"info","ts":"2025-06-04T20:45:21.874Z","logger":"controllers.RayJob","msg":"RayJob","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"c00b3547-8723-4655-b41a-bd08125760db","JobStatus":"FAILED","JobDeploymentStatus":"Running","SubmissionMode":"K8sJobMode"}
{"level":"info","ts":"2025-06-04T20:45:21.874Z","logger":"controllers.RayJob","msg":"AAAA rayJob.Status.RayJobStatusInfo","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"c00b3547-8723-4655-b41a-bd08125760db","RayJobStatusInfo":{}}
{"level":"info","ts":"2025-06-04T20:45:21.874Z","logger":"controllers.RayJob","msg":"BBBB rayJob.Status.RayJobStatusInfo.EndTime","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"c00b3547-8723-4655-b41a-bd08125760db","EndTime":"<nil>"}
{"level":"info","ts":"2025-06-04T20:45:21.874Z","logger":"controllers.RayJob","msg":"CCCC rayJobDeploymentGracePeriodTime","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"c00b3547-8723-4655-b41a-bd08125760db","rayJobDeploymentGracePeriodTime":300}
{"level":"error","ts":"2025-06-04T20:45:21.874Z","logger":"controllers.RayJob","msg":"Observed a panic","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"c00b3547-8723-4655-b41a-bd08125760db","panic":"runtime error: invalid memory address or nil pointer dereference","panicGoValue":"\"invalid memory address or nil pointer dereference\"","stacktrace":"goroutine 289 [running]:\nk8s.io/apimachinery/pkg/util/runtime.logPanic({0x20d8878, 0xc000ddf080}, {0x1b36580, 0x305b750})\n\t/go/pkg/mod/k8s.io/apimachinery@v0.33.1/pkg/util/runtime/runtime.go:132 +0xbc\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile.func1()\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:108 +0x112\npanic({0x1b36580?, 0x305b750?})\n\t/usr/local/go/src/runtime/panic.go:792 +0x132\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.checkTransitionGracePeriodAndUpdateStatusIfNeeded({0x20d8878?, 0xc000ddf080?}, 0xc001237408)\n\t/workspace/ray-operator/controllers/ray/rayjob_controller.go:941 +0x3da\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).Reconcile(0xc0008efb40, {0x20d8878, 0xc000ddf080}, {{{0xc001478ac6?, 0x9?}, {0xc001478af6?, 0xa?}}})\n\t/workspace/ray-operator/controllers/ray/rayjob_controller.go:237 +0x7d3\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile(0xc000dde900?, {0x20d8878?, 0xc000ddf080?}, {{{0xc001478ac6?, 0x0?}, {0xc001478af6?, 0x0?}}})\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:119 +0xbf\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler(0x20efa40, {0x20d88b0, 0xc000027ae0}, {{{0xc001478ac6, 0x9}, {0xc001478af6, 0xa}}}, 0x0)\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:334 
+0x3ad\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem(0x20efa40, {0x20d88b0, 0xc000027ae0})\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:294 +0x21b\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2()\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:255 +0x85\ncreated by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2 in goroutine 92\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:251 +0x6b5\n","stacktrace":"k8s.io/apimachinery/pkg/util/runtime.logPanic\n\t/go/pkg/mod/k8s.io/apimachinery@v0.33.1/pkg/util/runtime/runtime.go:142\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:108\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:792\nruntime.panicmem\n\t/usr/local/go/src/runtime/panic.go:262\nruntime.sigpanic\n\t/usr/local/go/src/runtime/signal_unix.go:925\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.checkTransitionGracePeriodAndUpdateStatusIfNeeded\n\t/workspace/ray-operator/controllers/ray/rayjob_controller.go:941\ngithub.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).Reconcile\n\t/workspace/ray-operator/controllers/ray/rayjob_controller.go:237\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:334\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controll
er-runtime@v0.20.4/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:255"}
{"level":"error","ts":"2025-06-04T20:45:21.874Z","logger":"controllers.RayJob","msg":"Reconciler error","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"c00b3547-8723-4655-b41a-bd08125760db","error":"panic: runtime error: invalid memory address or nil pointer dereference [recovered]","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:347\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:294\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.20.4/pkg/internal/controller/controller.go:255"}
Operator logs after the fix show no panic:
{"level":"info","ts":"2025-06-04T20:56:53.852Z","logger":"controllers.RayJob","msg":"Disregard changes in RayClusterSpec of RayJob","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"06d28647-6d8e-408d-83b7-23aa623323dd"}
{"level":"info","ts":"2025-06-04T20:56:53.856Z","logger":"controllers.RayJob","msg":"updateRayJobStatus","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"06d28647-6d8e-408d-83b7-23aa623323dd","oldRayJobStatus":{"rayJobInfo":{},"jobId":"v6e-16-job-9l8t8","rayClusterName":"v6e-16-job-6p4qq","dashboardURL":"v6e-16-job-6p4qq-head-svc.hyperkube.svc.cluster.local:8265","jobStatus":"FAILED","jobDeploymentStatus":"Running","message":"runtime_env setup failed: Failed to set up runtime environment.\nCould not create the actor because its associated runtime env failed to be created.\nTraceback (most recent call last):\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py\", line 384, in _create_runtime_env_with_retry\n    runtime_env_context = await asyncio.wait_for(\n  File \"/home/ray/anaconda3/lib/python3.9/asyncio/tasks.py\", line 479, in wait_for\n    return fut.result()\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py\", line 350, in _setup_runtime_env\n    await create_for_plugin_if_needed(\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/plugin.py\", line 254, in create_for_plugin_if_needed\n    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/pip.py\", line 309, in create\n    pip_dir_bytes = await task\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/pip.py\", line 289, in _create_for_hash\n    await PipProcessor(\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/pip.py\", line 191, in _run\n    await self._install_pip_packages(\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/pip.py\", line 167, in _install_pip_packages\n    await check_output_cmd(pip_install_cmd, logger=logger, cwd=cwd, 
env=pip_env)\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/utils.py\", line 105, in check_output_cmd\n    raise SubprocessCalledProcessError(\nray._private.runtime_env.utils.SubprocessCalledProcessError: Run cmd[9] failed with the following details.\nCommand '['/tmp/ray/session_2025-06-04_13-54-34_880243_1/runtime_resources/pip/b8ebe62e38d40ecc4d909509d3f858290eb2a8c3/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2025-06-04_13-54-34_880243_1/runtime_resources/pip/b8ebe62e38d40ecc4d909509d3f858290eb2a8c3/ray_runtime_env_internal_pip_requirements.txt']' returned non-zero exit status 1.\nLast 50 lines of stdout:\n    Looking in links: https://storage.googleapis.com/jax-releases/libtpu_releases.html\n    ERROR: Ignored the following yanked versions: 0.2.23, 0.3.18, 0.4.0, 0.4.15\n    ERROR: Ignored the following versions that require a different python version: 0.4.31 Requires-Python >=3.10; 0.4.32 Requires-Python >=3.10; 0.4.33 Requires-Python >=3.10; 0.4.34 Requires-Python >=3.10; 0.4.35 Requires-Python >=3.10; 0.4.36 Requires-Python >=3.10; 0.4.37 Requires-Python >=3.10; 0.4.38 Requires-Python >=3.10; 0.5.0 Requires-Python >=3.10; 0.5.1 Requires-Python >=3.10; 0.5.2 Requires-Python >=3.10; 0.5.3 Requires-Python >=3.10; 0.6.0 Requires-Python >=3.10; 0.6.1 Requires-Python >=3.10\n    ERROR: Could not find a version that satisfies the requirement jax==0.4.33 (from versions: 0.0, 0.1, 0.1.1, 0.1.2, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.1.10, 0.1.11, 0.1.12, 0.1.13, 0.1.14, 0.1.15, 0.1.16, 0.1.18, 0.1.19, 0.1.20, 0.1.21, 0.1.22, 0.1.23, 0.1.24, 0.1.25, 0.1.26, 0.1.27, 0.1.28, 0.1.29, 0.1.30, 0.1.31, 0.1.32, 0.1.33, 0.1.34, 0.1.35, 0.1.36, 0.1.37, 0.1.38, 0.1.39, 0.1.40, 0.1.41, 0.1.42, 0.1.43, 0.1.44, 0.1.45, 0.1.46, 0.1.47, 0.1.48, 0.1.49, 0.1.50, 0.1.51, 0.1.52, 0.1.53, 0.1.54, 0.1.55, 0.1.56, 0.1.57, 0.1.58, 0.1.59, 0.1.60, 0.1.61, 0.1.62, 
0.1.63, 0.1.64, 0.1.65, 0.1.66, 0.1.67, 0.1.68, 0.1.69, 0.1.70, 0.1.71, 0.1.72, 0.1.73, 0.1.74, 0.1.75, 0.1.76, 0.1.77, 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.2.10, 0.2.11, 0.2.12, 0.2.13, 0.2.14, 0.2.15, 0.2.16, 0.2.17, 0.2.18, 0.2.19, 0.2.20, 0.2.21, 0.2.22, 0.2.24, 0.2.25, 0.2.26, 0.2.27, 0.2.28, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.8, 0.3.9, 0.3.10, 0.3.11, 0.3.12, 0.3.13, 0.3.14, 0.3.15, 0.3.16, 0.3.17, 0.3.19, 0.3.20, 0.3.21, 0.3.22, 0.3.23, 0.3.24, 0.3.25, 0.4.1, 0.4.2, 0.4.3, 0.4.4, 0.4.5, 0.4.6, 0.4.7, 0.4.8, 0.4.9, 0.4.10, 0.4.11, 0.4.12, 0.4.13, 0.4.14, 0.4.16, 0.4.17, 0.4.18, 0.4.19, 0.4.20, 0.4.21, 0.4.22, 0.4.23, 0.4.24, 0.4.25, 0.4.26, 0.4.27, 0.4.28, 0.4.29, 0.4.30)\n    ERROR: No matching distribution found for jax==0.4.33\n","startTime":"2025-06-04T20:53:35Z","succeeded":0,"failed":0,"rayClusterStatus":{"state":"ready","desiredCPU":"104","desiredMemory":"840G","desiredGPU":"0","desiredTPU":"16","lastUpdateTime":"2025-06-04T20:56:25Z","stateTransitionTimes":{"ready":"2025-06-04T20:56:25Z"},"endpoints":{"client":"10001","dashboard":"8265","gcs-server":"6379","grpc":"8888","metrics":"8080"},"head":{"podIP":"10.160.193.2","serviceIP":"10.160.202.222","podName":"v6e-16-job-6p4qq-head","serviceName":"v6e-16-job-6p4qq-head-svc"},"conditions":[{"type":"HeadPodReady","status":"True","lastTransitionTime":"2025-06-04T20:54:42Z","reason":"HeadPodRunningAndReady","message":""},{"type":"RayClusterProvisioned","status":"True","lastTransitionTime":"2025-06-04T20:56:25Z","reason":"AllPodRunningAndReadyFirstTime","message":"All Ray Pods are ready for the first 
time"},{"type":"RayClusterSuspended","status":"False","lastTransitionTime":"2025-06-04T20:53:47Z","reason":"RayClusterSuspended","message":""},{"type":"RayClusterSuspending","status":"False","lastTransitionTime":"2025-06-04T20:53:47Z","reason":"RayClusterSuspending","message":""}],"readyWorkerReplicas":4,"availableWorkerReplicas":4,"desiredWorkerReplicas":4,"minWorkerReplicas":4,"maxWorkerReplicas":4,"observedGeneration":1}},"newRayJobStatus":{"rayJobInfo":{"startTime":"2025-06-04T20:56:32Z","endTime":"2025-06-04T20:56:50Z"},"jobId":"v6e-16-job-9l8t8","rayClusterName":"v6e-16-job-6p4qq","dashboardURL":"v6e-16-job-6p4qq-head-svc.hyperkube.svc.cluster.local:8265","jobStatus":"FAILED","jobDeploymentStatus":"Failed","reason":"AppFailed","message":"runtime_env setup failed: Failed to set up runtime environment.\nCould not create the actor because its associated runtime env failed to be created.\nTraceback (most recent call last):\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py\", line 384, in _create_runtime_env_with_retry\n    runtime_env_context = await asyncio.wait_for(\n  File \"/home/ray/anaconda3/lib/python3.9/asyncio/tasks.py\", line 479, in wait_for\n    return fut.result()\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py\", line 350, in _setup_runtime_env\n    await create_for_plugin_if_needed(\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/plugin.py\", line 254, in create_for_plugin_if_needed\n    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/pip.py\", line 309, in create\n    pip_dir_bytes = await task\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/pip.py\", line 289, in _create_for_hash\n    await PipProcessor(\n  File 
\"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/pip.py\", line 191, in _run\n    await self._install_pip_packages(\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/pip.py\", line 167, in _install_pip_packages\n    await check_output_cmd(pip_install_cmd, logger=logger, cwd=cwd, env=pip_env)\n  File \"/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/runtime_env/utils.py\", line 105, in check_output_cmd\n    raise SubprocessCalledProcessError(\nray._private.runtime_env.utils.SubprocessCalledProcessError: Run cmd[9] failed with the following details.\nCommand '['/tmp/ray/session_2025-06-04_13-54-34_880243_1/runtime_resources/pip/b8ebe62e38d40ecc4d909509d3f858290eb2a8c3/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2025-06-04_13-54-34_880243_1/runtime_resources/pip/b8ebe62e38d40ecc4d909509d3f858290eb2a8c3/ray_runtime_env_internal_pip_requirements.txt']' returned non-zero exit status 1.\nLast 50 lines of stdout:\n    Looking in links: https://storage.googleapis.com/jax-releases/libtpu_releases.html\n    ERROR: Ignored the following yanked versions: 0.2.23, 0.3.18, 0.4.0, 0.4.15\n    ERROR: Ignored the following versions that require a different python version: 0.4.31 Requires-Python >=3.10; 0.4.32 Requires-Python >=3.10; 0.4.33 Requires-Python >=3.10; 0.4.34 Requires-Python >=3.10; 0.4.35 Requires-Python >=3.10; 0.4.36 Requires-Python >=3.10; 0.4.37 Requires-Python >=3.10; 0.4.38 Requires-Python >=3.10; 0.5.0 Requires-Python >=3.10; 0.5.1 Requires-Python >=3.10; 0.5.2 Requires-Python >=3.10; 0.5.3 Requires-Python >=3.10; 0.6.0 Requires-Python >=3.10; 0.6.1 Requires-Python >=3.10\n    ERROR: Could not find a version that satisfies the requirement jax==0.4.33 (from versions: 0.0, 0.1, 0.1.1, 0.1.2, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.1.10, 0.1.11, 0.1.12, 0.1.13, 0.1.14, 0.1.15, 0.1.16, 0.1.18, 0.1.19, 0.1.20, 
0.1.21, 0.1.22, 0.1.23, 0.1.24, 0.1.25, 0.1.26, 0.1.27, 0.1.28, 0.1.29, 0.1.30, 0.1.31, 0.1.32, 0.1.33, 0.1.34, 0.1.35, 0.1.36, 0.1.37, 0.1.38, 0.1.39, 0.1.40, 0.1.41, 0.1.42, 0.1.43, 0.1.44, 0.1.45, 0.1.46, 0.1.47, 0.1.48, 0.1.49, 0.1.50, 0.1.51, 0.1.52, 0.1.53, 0.1.54, 0.1.55, 0.1.56, 0.1.57, 0.1.58, 0.1.59, 0.1.60, 0.1.61, 0.1.62, 0.1.63, 0.1.64, 0.1.65, 0.1.66, 0.1.67, 0.1.68, 0.1.69, 0.1.70, 0.1.71, 0.1.72, 0.1.73, 0.1.74, 0.1.75, 0.1.76, 0.1.77, 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.2.10, 0.2.11, 0.2.12, 0.2.13, 0.2.14, 0.2.15, 0.2.16, 0.2.17, 0.2.18, 0.2.19, 0.2.20, 0.2.21, 0.2.22, 0.2.24, 0.2.25, 0.2.26, 0.2.27, 0.2.28, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.8, 0.3.9, 0.3.10, 0.3.11, 0.3.12, 0.3.13, 0.3.14, 0.3.15, 0.3.16, 0.3.17, 0.3.19, 0.3.20, 0.3.21, 0.3.22, 0.3.23, 0.3.24, 0.3.25, 0.4.1, 0.4.2, 0.4.3, 0.4.4, 0.4.5, 0.4.6, 0.4.7, 0.4.8, 0.4.9, 0.4.10, 0.4.11, 0.4.12, 0.4.13, 0.4.14, 0.4.16, 0.4.17, 0.4.18, 0.4.19, 0.4.20, 0.4.21, 0.4.22, 0.4.23, 0.4.24, 0.4.25, 0.4.26, 0.4.27, 0.4.28, 0.4.29, 0.4.30)\n    ERROR: No matching distribution found for jax==0.4.33\n","startTime":"2025-06-04T20:53:35Z","succeeded":0,"failed":1,"rayClusterStatus":{"state":"ready","desiredCPU":"104","desiredMemory":"840G","desiredGPU":"0","desiredTPU":"16","lastUpdateTime":"2025-06-04T20:56:25Z","stateTransitionTimes":{"ready":"2025-06-04T20:56:25Z"},"endpoints":{"client":"10001","dashboard":"8265","gcs-server":"6379","grpc":"8888","metrics":"8080"},"head":{"podIP":"10.160.193.2","serviceIP":"10.160.202.222","podName":"v6e-16-job-6p4qq-head","serviceName":"v6e-16-job-6p4qq-head-svc"},"conditions":[{"type":"HeadPodReady","status":"True","lastTransitionTime":"2025-06-04T20:54:42Z","reason":"HeadPodRunningAndReady","message":""},{"type":"RayClusterProvisioned","status":"True","lastTransitionTime":"2025-06-04T20:56:25Z","reason":"AllPodRunningAndReadyFirstTime","message":"All Ray Pods are ready for the first 
time"},{"type":"RayClusterSuspended","status":"False","lastTransitionTime":"2025-06-04T20:53:47Z","reason":"RayClusterSuspended","message":""},{"type":"RayClusterSuspending","status":"False","lastTransitionTime":"2025-06-04T20:53:47Z","reason":"RayClusterSuspending","message":""}],"readyWorkerReplicas":4,"availableWorkerReplicas":4,"desiredWorkerReplicas":4,"minWorkerReplicas":4,"maxWorkerReplicas":4,"observedGeneration":1}}}
{"level":"info","ts":"2025-06-04T20:56:53.856Z","logger":"controllers.RayJob","msg":"updateRayJobStatus","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"06d28647-6d8e-408d-83b7-23aa623323dd","old JobStatus":"FAILED","new JobStatus":"FAILED","old JobDeploymentStatus":"Running","new JobDeploymentStatus":"Failed"}
{"level":"info","ts":"2025-06-04T20:56:53.880Z","logger":"KubeAPIWarningLogger","msg":"unknown field \"spec.rayClusterSpec.headGroupSpec.template.metadata.creationTimestamp\""}
{"level":"info","ts":"2025-06-04T20:56:53.880Z","logger":"KubeAPIWarningLogger","msg":"unknown field \"spec.rayClusterSpec.workerGroupSpecs[0].template.metadata.creationTimestamp\""}
{"level":"info","ts":"2025-06-04T20:56:53.880Z","logger":"KubeAPIWarningLogger","msg":"unknown field \"status.rayJobInfo\""}
{"level":"info","ts":"2025-06-04T20:56:53.883Z","logger":"controllers.RayJob","msg":"RayJob","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"6ff5b239-3106-40d1-b0ef-ac503d6db01f","JobStatus":"FAILED","JobDeploymentStatus":"Failed","SubmissionMode":"K8sJobMode"}
{"level":"info","ts":"2025-06-04T20:56:53.884Z","logger":"controllers.RayJob","msg":"Failed","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"6ff5b239-3106-40d1-b0ef-ac503d6db01f","ShutdownAfterJobFinishes":false,"ClusterSelector":null,"ttlSecondsAfterFinished":0,"Status.endTime":"2025-06-04 20:56:53 +0000 UTC","Now":"2025-06-04T20:56:53.884Z","ShutdownTime":"2025-06-04T20:56:53.000Z"}
{"level":"info","ts":"2025-06-04T20:56:54.263Z","logger":"controllers.RayJob","msg":"RayJob","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"73140b4a-dd93-493a-aedc-bccc4ab14bf0","JobStatus":"FAILED","JobDeploymentStatus":"Failed","SubmissionMode":"K8sJobMode"}
{"level":"info","ts":"2025-06-04T20:56:54.263Z","logger":"controllers.RayJob","msg":"Failed","RayJob":{"name":"v6e-16-job","namespace":"hyperkube"},"reconcileID":"73140b4a-dd93-493a-aedc-bccc4ab14bf0","ShutdownAfterJobFinishes":false,"ClusterSelector":null,"ttlSecondsAfterFinished":0,"Status.endTime":"2025-06-04 20:56:53 +0000 UTC","Now":"2025-06-04T20:56:54.263Z","ShutdownTime":"2025-06-04T20:56:53.000Z"}
{"level":"info","ts":"2025-06-04T20:57:09.589Z","logger":"controllers.RayCluster","msg":"Reconciling Ingress","RayCluster":{"name":"ikennao-shared-test-a","namespace":"hyperkube"},"reconcileID":"5bc269eb-5f4a-47c4-8be1-f2da468d97f9"}
{"level":"info","ts":"2025-06-04T20:57:09.589Z","logger":"controllers.RayCluster","msg":"reconcileHeadService","RayCluster":{"name":"ikennao-shared-test-a","namespace":"hyperkube"},"reconcileID":"5bc269eb-5f4a-47c4-8be1-f2da468d97f9","1 head service found":"ikennao-shared-test-a-head-svc"}
{"level":"info","ts":"2025-06-04T20:57:09.590Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ikennao-shared-test-a","namespace":"hyperkube"},"reconcileID":"5bc269eb-5f4a-47c4-8be1-f2da468d97f9","Found 1 head Pod":"ikennao-shared-test-a-head","Pod status":"Running","Pod status reason":"","Pod restart policy":"OnFailure","Ray container terminated status":"nil"}
{"level":"info","ts":"2025-06-04T20:57:09.590Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ikennao-shared-test-a","namespace":"hyperkube"},"reconcileID":"5bc269eb-5f4a-47c4-8be1-f2da468d97f9","head Pod":"ikennao-shared-test-a-head","shouldDelete":false,"reason":"KubeRay does not need to delete the head Pod ikennao-shared-test-a-head. The Pod status is Running, and the Ray container terminated status is nil."}
{"level":"info","ts":"2025-06-04T20:57:09.590Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ikennao-shared-test-a","namespace":"hyperkube"},"reconcileID":"5bc269eb-5f4a-47c4-8be1-f2da468d97f9","desired workerReplicas (always adhering to minReplicas/maxReplica)":0,"worker group":"gpuWorker","maxReplicas":0,"minReplicas":0,"replicas":0}
{"level":"info","ts":"2025-06-04T20:57:09.590Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ikennao-shared-test-a","namespace":"hyperkube"},"reconcileID":"5bc269eb-5f4a-47c4-8be1-f2da468d97f9","removing the pods in the scaleStrategy of":"gpuWorker"}
{"level":"info","ts":"2025-06-04T20:57:09.590Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ikennao-shared-test-a","namespace":"hyperkube"},"reconcileID":"5bc269eb-5f4a-47c4-8be1-f2da468d97f9","workerReplicas":0,"NumOfHosts":1,"runningPods":0,"diff":0}
{"level":"info","ts":"2025-06-04T20:57:09.590Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"ikennao-shared-test-a","namespace":"hyperkube"},"reconcileID":"5bc269eb-5f4a-47c4-8be1-f2da468d97f9","all workers already exist for group":"gpuWorker"}
{"level":"info","ts":"2025-06-04T20:57:09.590Z","logger":"controllers.RayCluster","msg":"Environment variable is not set, using default value of seconds","RayCluster":{"name":"ikennao-shared-test-a","namespace":"hyperkube"},"reconcileID":"5bc269eb-5f4a-47c4-8be1-f2da468d97f9","environmentVariable":"RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV","defaultValue":300}
{"level":"info","ts":"2025-06-04T20:57:09.590Z","logger":"controllers.RayCluster","msg":"Unconditional requeue after","RayCluster":{"name":"ikennao-shared-test-a","namespace":"hyperkube"},"reconcileID":"5bc269eb-5f4a-47c4-8be1-f2da468d97f9","seconds":300}
{"level":"info","ts":"2025-06-04T20:57:09.590Z","logger":"controllers.RayCluster","msg":"Reconciling Ingress","RayCluster":{"name":"spacious-spanish-pop","namespace":"ray-playground"},"reconcileID":"77870c1d-7a3f-4bb0-8bd3-7b2c501d2e67"}
{"level":"info","ts":"2025-06-04T20:57:09.590Z","logger":"controllers.RayCluster","msg":"reconcileHeadService","RayCluster":{"name":"spacious-spanish-pop","namespace":"ray-playground"},"reconcileID":"77870c1d-7a3f-4bb0-8bd3-7b2c501d2e67","1 head service found":"spacious-spanish-pop-head-svc"}
{"level":"info","ts":"2025-06-04T20:57:09.590Z","logger":"controllers.RayCluster","msg":"reconcilePods","RayCluster":{"name":"spacious-spanish-pop","namespace":"ray-playground"},"reconcileID":"77870c1d-7a3f-4bb0-8bd3-7b2c501d2e67","Found 1 head Pod":"spacious-spanish-pop-head-llsld","Pod status":"Running","Pod status reason":"","Pod restart policy":"OnFailure","Ray container terminated status":"nil"}
## Checks
  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@davidxia davidxia marked this pull request as ready for review June 4, 2025 21:00
@kevin85421 kevin85421 requested a review from Copilot June 4, 2025 21:11
@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes a nil pointer dereference when EndTime is nil in the status transition grace period check for RayJobs.

  • Add a nil check for EndTime before calling Add(...).
  • Return early from checkTransitionGracePeriodAndUpdateStatusIfNeeded() if EndTime is nil.
Comments suppressed due to low confidence (1)

ray-operator/controllers/ray/rayjob_controller.go:938

  • Add a unit test for checkTransitionGracePeriodAndUpdateStatusIfNeeded that covers the case when EndTime is nil to prevent regressions.
if rayJob.Status.RayJobStatusInfo.EndTime == nil || time.Now().Before(rayJob.Status.RayJobStatusInfo.EndTime.Add(time.Duration(rayJobDeploymentGracePeriodTime)*time.Second)) {

If users run the latest ray-operator code from master branch without updating
their RayJob CRD, the operator panics from a nil pointer dereference error in
`rayjob_controller.go`. `rayJob.Status.RayJobStatusInfo.EndTime` is nil when a
RayJob's Job fails with RayJob CRD from v1.3.2.

We check if `EndTime` is nil and return false from
`checkTransitionGracePeriodAndUpdateStatusIfNeeded()` if so.

Signed-off-by: David Xia <david@davidxia.com>
@kevin85421 kevin85421 merged commit 5ab8d7e into ray-project:master Jun 4, 2025
25 checks passed
pawelpaszki pushed a commit to opendatahub-io/kuberay that referenced this pull request Jun 10, 2025
ray-project#3742)
