Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob by Diasker · Pull Request #2492 · kubeflow/trainer · GitHub

Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob #2492


Merged: 19 commits from Diasker:fix-issue-2407 merged into kubeflow:master on Mar 13, 2025

Conversation

@Diasker (Contributor) commented Mar 10, 2025

Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob

Problem Description

When nproc_per_node is set to "auto" in a PyTorch TrainJob, PyTorch's default behavior is to set it to the number of available CPUs on the node (via os.cpu_count()). This can cause problems when:

  1. The container has a CPU limit that is lower than the node's actual CPU count
  2. No GPU is requested, so PyTorch tries to use all available CPUs
  3. The training job exceeds its memory limit due to too many processes, leading to OOM errors or deadlocks

This issue was discussed in PR #2387 and formally documented in Issue #2407.

Solution

This PR modifies the Torch plugin to cap nproc_per_node based on CPU resources when it's set to "auto" and no GPU is requested:

  1. First, check if GPU resources are requested (by examining both Limits and Requests in ResourcesPerNode)
  2. If no GPU is requested, set nproc_per_node to the CPU limit (or request if no limit is set)
  3. Default to 1 if no CPU resources are specified
  4. Ensure at least 1 process, even if CPU limit is less than 1

This ensures that PyTorch distributed training only uses the CPU resources allocated to the container, rather than attempting to use all available CPUs on the node.
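
A simplified sketch of that flow is shown below (not the exact merged code; the hasGPU helper and surrounding wiring are illustrative, and error handling is omitted):

// Sketch only: cap nproc_per_node when it is "auto" and no GPU is requested.
if numProcPerNode.String() == "auto" &&
	trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.ResourcesPerNode != nil {
	res := trainJob.Spec.Trainer.ResourcesPerNode
	if !hasGPU(res.Limits) && !hasGPU(res.Requests) {
		// Prefer the CPU limit, fall back to the request, default to 1.
		numProcPerNode = intstr.FromInt32(1)
		if cpu, ok := res.Limits[corev1.ResourceCPU]; ok {
			numProcPerNode = intstr.FromInt32(int32(cpu.Value())) // Value() rounds fractional CPUs up
		} else if cpu, ok := res.Requests[corev1.ResourceCPU]; ok {
			numProcPerNode = intstr.FromInt32(int32(cpu.Value()))
		}
	}
}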

Code Changes

Modified the following files:

  1. pkg/runtime/framework/plugins/torch/torch.go: Added logic to cap nproc_per_node based on CPU resources
  2. pkg/runtime/framework/plugins/torch/torch_nproc_per_node_test.go: Added test cases to validate the fix

Test Cases

Added comprehensive test cases to verify the behavior:

  • When nproc_per_node=auto with a CPU limit, it should be capped to that limit
  • When nproc_per_node=auto with no CPU resources specified, it should default to 1
  • When nproc_per_node=auto with a low CPU limit, it should be capped to that limit even if the actual CPU count is higher
  • When nproc_per_node=auto with only CPU requests (no limits), it should use the request value
  • When nproc_per_node=auto with GPU resources, it should remain "auto"
  • When nproc_per_node is explicitly set, that value should be preserved

Related Links

…rch TrainJob

Signed-off-by: Diasker <kennardniko@foxmail.com>
cpuLimit := int32(1) // Default to 1 if no CPU limit is specified

// First check limits, then requests
if cpuQuantity, ok := trainJob.Spec.Trainer.ResourcesPerNode.Limits[corev1.ResourceCPU]; ok {
Contributor:

It may be more robust to rely on the k8s/apimachinery/pkg/api/resource package and use https://pkg.go.dev/k8s.io/apimachinery/pkg/api/resource#MustParse to handle the millicore case.

}

// Ensure at least 1 process
if cpuLimit < 1 {
Contributor:

Maybe the RoundUp function from the resource package can be used here.

Contributor Author:

Thank you for your valuable suggestions! I've implemented the changes you recommended:

  1. Now using resource.MustParse to handle CPU resource values more robustly
  2. Implemented RoundUp method to properly handle millicore cases as you suggested

The updated code now correctly handles various CPU resource formats:

  • Integer values (e.g., "4")
  • Decimal values (e.g., "2.5")
  • Millicore values (e.g., "2500m")

I've also added comprehensive test cases to verify these scenarios, including specific tests for millicore and fractional CPU limits. All tests are passing, confirming that the implementation correctly rounds up CPU values when determining the number of processes.

This implementation ensures that when nproc_per_node is set to "auto" and no GPU is requested, we'll properly cap the number of processes based on the CPU resources, even when those resources are specified in millicore format.
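
For illustration, a small standalone sketch of that rounding behavior using the resource package (values are illustrative):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Millicore and fractional CPU values are rounded up to whole processes.
	for _, s := range []string{"4", "2.5", "2500m"} {
		q := resource.MustParse(s)
		q.RoundUp(0)                            // round up to scale 0, i.e. whole cores
		fmt.Printf("%s -> %d\n", s, q.Value()) // 4 -> 4, 2.5 -> 3, 2500m -> 3
	}
}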

Thanks again for helping improve the robustness of this code!

Signed-off-by: Diasker <kennardniko@foxmail.com>
cpuValue := cpuQuantity.Value()

numProcPerNode = intstr.FromInt(int(cpuValue))
fmt.Printf("CPU-only device detected with nproc_per_node=auto, capping to CPU limit: %d\n", cpuValue)
Contributor:

We may want to use the logger from the context, e.g. log := ctrl.LoggerFrom(ctx) but the context should be added first to EnforceMLPolicy similarly to other plugin APIs like ComponentBuilderPlugin.Build or TerminalConditionPlugin.TerminalCondition.

Contributor Author @Diasker commented Mar 11, 2025:

Thank you for your suggestion! I noticed that using ctrl.LoggerFrom(ctx) would require a significant change to the EnforceMLPolicyPlugin interface, as it currently doesn't accept a context parameter.

This change would affect:

  1. The interface definition itself
  2. All plugins implementing this interface (Torch, MPI, PlainML)
  3. The framework code that calls these plugins
  4. All related test code

Considering that the scope of this change might exceed the goals of the current PR, I'd like to ask:

  1. How important is this log message? Could we temporarily keep fmt.Printf or use a logging library that doesn't require context (like klog)? Or even delete this log message?
  2. Or do you think we should proceed with this interface change in the current PR?

Please let me know your thoughts. Thank you!

Contributor:

Yes, I agree. I'd be inclined to remove this log message for now.

@andreyvelich @tenzen-y if you have any suggestions.

Member:

Actually, we can add logger without interface changes in the following:

type Torch struct {
	logger logr.Logger
}

func New(ctx context.Context, client client.Client, indexer client.FieldIndexer) (framework.Plugin, error) {
	return &Torch{
		logger: ctrl.LoggerFrom(ctx).WithValues("pluginName", Name),
	}, nil
}

I was planning to add logger like this when I designed this interface.

Contributor:

I think the main difference between the two approaches is that the context from the method signature is the child context, which holds a reference to the reconciled object, whereas the context from the plugin constructor is the parent context.

Contributor Author:

Thank you for your excellent suggestion about adding a logger to the Torch plugin without changing the interface. I've implemented your approach as follows:

  1. Added a logger field to the Torch struct:
type Torch struct {
    logger logr.Logger
}
  2. Initialized the logger in the New function with proper null checks:
func New(ctx context.Context, client client.Client, indexer client.FieldIndexer) (framework.Plugin, error) {
    // Use logr.Discard() as the default value
    logger := logr.Discard()

    // Try to get logger from context
    if ctx != nil {
        ctxLogger := log.FromContext(ctx)
        // Check if logger is valid (by checking its sink)
        if ctxLogger.GetSink() != nil {
            logger = ctxLogger
        }
    }

    return &Torch{
        logger: logger.WithValues("pluginName", Name),
    }, nil
}
  3. Replaced the fmt.Printf with structured logging in the EnforceMLPolicy method:
t.logger.Info("CPU-only device detected with nproc_per_node=auto, capping to CPU limit", "cpuLimit", cpuValue)

All tests are now passing, and this approach avoids the need for interface changes while still providing structured logging. Thank you for sharing this elegant solution - it's exactly what we needed!

Contributor Author @Diasker commented Mar 11, 2025:

I think the main difference between the two approaches is that the context from the method signature is the child context, which holds a reference to the reconciled object, whereas the context from the plugin constructor is the parent context.

@astefanutti Thank you for pointing out the difference between the two approaches. I initially implemented the solution using the plugin initialization context as suggested by @tenzen-y, without seeing your latest comment. Sorry about that.
I understand your point about the context differences: using the context from the method signature would provide more specific information about the reconciled object, while the plugin initialization context is more general. So, should we keep or remove the log message? Honestly, I think this log message adds limited value.

Contributor:

@Diasker thanks, I agree. Let's remove the message to stick to the scope of this PR as you suggested, and reconsider this when the need for more important messages in plugins like this arises.

Contributor Author @Diasker commented Mar 11, 2025:

@astefanutti Thanks for your guidance. I just removed the log message.

Member:

I agree, let's remove logger for now since it is out of scope of this PR.

@Diasker Diasker requested a review from astefanutti March 11, 2025 10:11
Signed-off-by: Diasker <kennardniko@foxmail.com>
@Diasker Diasker requested a review from tenzen-y March 11, 2025 12:19
@Diasker (Contributor Author) commented Mar 11, 2025:

@astefanutti

I noticed a test failure in pkg/runtime/core/trainingruntime_test.go in the test case succeeded to build JobSet with Torch values from the Runtime and envs within TestTrainingRuntimeNewObjects.

The current implementation logic for nproc_per_node is:

  • When set to "auto" with no GPU resources requested: use CPU resource limits to set the process count
  • When GPU resources are requested: keep "auto" value

The existing test case expects nproc_per_node to remain "auto", but since it only defines CPU resources (without GPU), the test is failing. Would you like me to change this test file?
Maybe we can:

  1. Update the test case to include GPU resources to maintain the "auto" value, or
  2. Update the test's expected value to "1" to match the CPU-based calculation?

I've checked other test files and confirmed that this is the only affected test case. The dedicated test file torch_nproc_per_node_test.go already covers all the CPU resource handling scenarios correctly.

@astefanutti (Contributor):

@Diasker right I think that makes sense to update the test as you're suggesting.

Both options would work, but I'd be inclined to go with 2): change the expected value to "1" to match the CPU-based calculation.

Signed-off-by: Diasker <kennardniko@foxmail.com>
@Diasker (Contributor Author) commented Mar 11, 2025:

@astefanutti Alright, change finished, thanks!

@andreyvelich (Member):

/ok-to-test

@andreyvelich (Member) left a comment:

Thank you for working on this @Diasker!


@@ -72,6 +75,52 @@ func (t *Torch) EnforceMLPolicy(info *runtime.Info, trainJob *trainer.TrainJob)
numProcPerNode = ptr.Deref(trainJob.Spec.Trainer.NumProcPerNode, intstr.FromString("auto"))
}

// Cap nproc_per_node based on CPU resources when set to "auto" and no GPU is requested
if numProcPerNode.String() == "auto" && trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.ResourcesPerNode != nil {
Member:

Should we also validate that numProcPerNode == "cpu" ?

Comment on lines 81 to 98
hasGPU := false
for resourceName := range trainJob.Spec.Trainer.ResourcesPerNode.Limits {
if strings.Contains(strings.ToLower(string(resourceName)), "gpu") {
hasGPU = true
break
}
}
if !hasGPU {
for resourceName := range trainJob.Spec.Trainer.ResourcesPerNode.Requests {
if strings.Contains(strings.ToLower(string(resourceName)), "gpu") {
hasGPU = true
break
}
}
}

// If no GPU is requested, use CPU limit or default to 1
if !hasGPU {
Member:

@tenzen-y @astefanutti Is there any helper function in k8s.io/utils we can use to check whether resource (e.g. cpu, gpu) exists in container.resources ?

Member:

AFAIK, there is no such util.

utiltesting "github.com/kubeflow/trainer/pkg/util/testing"
)

func TestTorch_EnforceMLPolicy_NumProcPerNode_Issue2407(t *testing.T) {
Member:

I think these test cases should be part of TestTorch, similar to what we've done for MPI here: #2481

Signed-off-by: Diasker <kennardniko@foxmail.com>
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Mar 12, 2025
@Diasker (Contributor Author) commented Mar 12, 2025:

@andreyvelich Thank you for your feedback! I've done these:

  1. Regarding the SDK comment: I've removed the comment.

  2. For CPU support in torch.go: I've implemented the logic to handle numProcPerNode="cpu" which will always use CPU resources regardless of GPU presence. This ensures users can explicitly choose to use CPU cores for distributed training even when GPUs are available.

  3. Following your suggestion and PR Implement MPI plugin UTs #2481, I've created a comprehensive torch_test.go file instead of a separate test file. The test cases cover various scenarios:

    • numProcPerNode=auto with CPU limits but no GPU
    • numProcPerNode=auto with GPU resources
    • numProcPerNode=cpu with only CPU resources
    • numProcPerNode=cpu with both CPU and GPU resources
    • Handling of fractional CPU values (rounding up)
    • Multi-node multi-GPU training configurations

Is the current CPU handling logic in torch.go robust enough? I've ensured it checks both limits and requests, handles fractional values by rounding up, and guarantees at least one process.
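
For reference, here is a rough sketch of how the "auto" / "cpu" dispatch looks (variable names mirror the snippets quoted later in this review; the exact fallback values in the merged code may differ):

// Sketch only: decide whether CPU resources drive the process count.
var (
	shouldUseCPU           func(corev1.ResourceList) bool
	fallbackNumProcPerNode intstr.IntOrString
)

switch numProcPerNode.String() {
case "auto":
	// Fall back to CPU counting only when no GPU-like resource is requested.
	shouldUseCPU = func(resources corev1.ResourceList) bool {
		for resName := range resources {
			if strings.Contains(strings.ToLower(resName.String()), "gpu") {
				return false
			}
		}
		return true
	}
	fallbackNumProcPerNode = intstr.FromString("auto")
case "cpu":
	// Explicit "cpu": always size processes from the CPU resources.
	shouldUseCPU = func(corev1.ResourceList) bool { return true }
	fallbackNumProcPerNode = intstr.FromInt32(defaultCPU)
}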

Also, regarding the SDK's get_num_proc_per_node function - should we update it to match the Go implementation, particularly to support the "cpu" option? The current SDK function prioritizes GPU resources and doesn't have special handling for the "cpu" value.

Regarding the helper function to check for resource existence: I couldn't find a specific utility function in k8s.io/utils for checking whether a resource exists in container.resources. If such a helper exists, I'd be happy to simplify the code by using it.

As a coding newcomer, I really appreciate your patience and guidance throughout this process. Your feedback has been invaluable in helping me improve the implementation.

@Diasker Diasker requested a review from andreyvelich March 12, 2025 05:43
@andreyvelich (Member):

Also, regarding the SDK's get_num_proc_per_node function - should we update it to match the Go implementation, particularly to support the "cpu" option? The current SDK function prioritizes GPU resources and doesn't have special handling for the "cpu" value.

The goal of our issue was to move the logic that assigns numProcPerNode from the SDK to the controller; that is why we should completely remove the get_num_proc_per_node util function and rely on the Torch plugin to generate the correct value for this parameter.

Is the current CPU handling logic in torch.go robust enough? I've ensured it checks both limits and requests, handles fractional values by rounding up, and guarantees at least one process.

Thanks for improving it! I will take a look soon.

Scheduler: &runtime.Scheduler{TotalRequests: map[string]runtime.TotalResourceRequest{}},
},
},
// Issue #2407 test case - nproc_per_node=auto with CPU limit
Member:

Suggested change
// Issue #2407 test case - nproc_per_node=auto with CPU limit
// nproc_per_node=auto with CPU limit

It would be better not to add such a specific prefix.
Could you address those in other places, as well?

Comment on lines 487 to 503
// If need to validate numProcPerNode
if tc.wantNumProcPerNode != "" && tc.info != nil {
// Find PET_NPROC_PER_NODE environment variable
var numProcPerNodeValue string
for _, env := range tc.info.Trainer.Env {
if env.Name != nil && *env.Name == constants.TorchEnvNumProcPerNode {
if env.Value != nil {
numProcPerNodeValue = *env.Value
}
break
}
}

if diff := cmp.Diff(tc.wantNumProcPerNode, numProcPerNodeValue); diff != "" {
t.Errorf("Torch.EnforceMLPolicy() numProcPerNode mismatch (-want +got):\n%s", diff)
}
}
Member:

We should always compare whole of expected and got runtime.Info.
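
For example, a minimal sketch of that whole-object comparison (tc.wantInfo is an assumed field name here):

// Sketch only: diff the entire expected vs. actual runtime.Info instead of
// extracting a single environment variable.
if diff := cmp.Diff(tc.wantInfo, tc.info); diff != "" {
	t.Errorf("Torch.EnforceMLPolicy() runtime.Info mismatch (-want +got):\n%s", diff)
}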

Signed-off-by: Diasker <kennardniko@foxmail.com>
@@ -111,8 +111,6 @@ def get_resources_per_node(
return resources


# TODO (andreyvelich): Move this part to the Kubeflow Trainer torch plugins.
# Ref issue: https://github.com/kubeflow/trainer/issues/2407
def get_num_proc_per_node(resources_per_node: dict) -> str:
Contributor:

I think this can be removed altogether.

Contributor Author:

Right, I'll remove the function later.

@tenzen-y (Member):

@Diasker It looks like there are conflicts

@tenzen-y (Member):

Part of #2468

}
return fallbackNumProcPerNode, false
}
return intstr.FromInt32(int32(defaultCPU)), false
Contributor:

is the conversion to int32 really needed?

Contributor Author @Diasker commented Mar 13, 2025:

@astefanutti Yes, the conversion is necessary because FromInt32 requires an int32 parameter.
Or we can define defaultCPU as int32 from the beginning: var defaultCPU int32 = 1

Contributor:

Isn't it already declared as var defaultCPU int32 = 1?

Contributor Author:

@astefanutti OMG!!! You're absolutely right! It is already declared as var defaultCPU int32 = 1. I apologize for the confusion - I must have overlooked this when reviewing the code. Thank you for pointing this out.

@Diasker (Contributor Author) commented Mar 13, 2025:

@Diasker It looks like there are conflicts

@tenzen-y Yes, I see the conflicts. This is similar to the issue we discussed earlier with the trainingruntime_test.go file.

I noticed a test failure in pkg/runtime/core/trainingruntime_test.go in the test case succeeded to build JobSet with Torch values from the Runtime and envs within TestTrainingRuntimeNewObjects.

The current implementation logic for nproc_per_node is:

  • When set to "auto" with no GPU resources requested: use CPU resource limits to set the process count
  • When GPU resources are requested: keep "auto" value

The existing test case expects nproc_per_node to remain "auto", but since it only defines CPU resources (without GPU), the test is failing. Would you like me to change this test file? Maybe we can:

  1. Update the test case to include GPU resources to maintain the "auto" value, or
  2. Update the test's expected value to "1" to match the CPU-based calculation?

I'd like to use the same approach for trainjob_controller_test.go as we did for trainingruntime_test.go, but I'm encountering a merge conflict.

@Diasker Diasker requested review from tenzen-y and astefanutti March 13, 2025 08:00
Signed-off-by: Diasker <kennardniko@foxmail.com>
@tenzen-y (Member) commented Mar 13, 2025:

I'd like to use the same approach for trainjob_controller_test.go as we did for trainingruntime_test.go

@Diasker yes, you can take the same approach. Is there any blocker for git rebase?

Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>
@Diasker (Contributor Author) commented Mar 13, 2025:

@tenzen-y No blockers for git rebase. I was just unfamiliar with how to resolve this specific conflict, but I've figured it out now. Thanks!

@@ -101,31 +101,6 @@ def get_resources_per_node(
return resources


# TODO (andreyvelich): Move this part to the Kubeflow Trainer torch plugins.
# Ref issue: https://github.com/kubeflow/trainer/issues/2407
def get_num_proc_per_node(resources_per_node: dict) -> str:
Member @andreyvelich commented Mar 13, 2025:

Since you removed this func, you also need to remove it from here:

if trainer and trainer.resources_per_node:
    trainer_crd.num_proc_per_node = (
        models.IoK8sApimachineryPkgUtilIntstrIntOrString(
            utils.get_num_proc_per_node(trainer.resources_per_node)
        )
    )

Contributor Author:

@andreyvelich Alright, thanks!

Diasker added 4 commits March 13, 2025 19:25
Signed-off-by: Diasker <kennardniko@foxmail.com>
…nd remove redundant code of sdk

Signed-off-by: Diasker <kennardniko@foxmail.com>
Signed-off-by: Diasker <kennardniko@foxmail.com>
Signed-off-by: Diasker <kennardniko@foxmail.com>
@tenzen-y (Member) left a comment:

@Diasker Thank you for this great contribution!
I left just a nit comment. Could you open another PR to address it?

/lgtm
/approve

}
fallbackNumProcPerNode = intstr.FromString("auto")
case "cpu":
shouldUseCPU = func(resources corev1.ResourceList) bool {
Member:

Suggested change
shouldUseCPU = func(resources corev1.ResourceList) bool {
shouldUseCPU = func(corev1.ResourceList) bool {

nits: This is not used anywhere.

Contributor Author:

Sure, I'll open another PR to address it later. Thanks for your Approval!


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y


@google-oss-prow google-oss-prow bot merged commit 9f290d4 into kubeflow:master Mar 13, 2025
14 checks passed
Comment on lines +86 to +87
if strings.Contains(strings.ToLower(resName.String()), "gpu") {
return false
Member:

This might not work for other devices (e.g. tpu, npu).
@tenzen-y @Diasker @astefanutti Should we exclude all of them ?

Member:

I think it's challenging to cover every corner case, since arbitrary names can be specified for special devices in the resources fields.

So I think it would be better to add device names based on user feedback; I'd like to establish a common set of actually used special device names from that feedback.

Additionally, there are other problems such as LimitRange, etc.

Contributor Author:

I agree with @tenzen-y's approach. Trying to cover all possible device types proactively would be challenging since resource names can be arbitrary. It makes more sense to start with GPU support (which is the most common case) and then extend to other accelerators like TPU/NPU based on actual user feedback and real-world usage patterns. If users report issues with specific devices, we can add support for those devices in future iterations.

Contributor:

+1

Member:

sounds good.

@astefanutti (Contributor):

@Diasker thanks for the fantastic work!

@Diasker (Contributor Author) commented Mar 14, 2025:

@Diasker thanks for the fantastic work!
@astefanutti My pleasure. Thanks for your guidance and patience.

@Diasker Diasker deleted the fix-issue-2407 branch March 14, 2025 14:04
mahdikhashan pushed a commit to mahdikhashan/trainer that referenced this pull request Mar 16, 2025
…rch TrainJob (kubeflow#2492)

* Fix kubeflow#2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob

Signed-off-by: Diasker <kennardniko@foxmail.com>

* Refactor the fix using resource.MustParse and RoundUp

Signed-off-by: Diasker <kennardniko@foxmail.com>

* remove unecessary log message

Signed-off-by: Diasker <kennardniko@foxmail.com>

* update test case to expect nproc_per_node=1 without GPU

Signed-off-by: Diasker <kennardniko@foxmail.com>

* Implement numProcPerNode=cpu option and refactor tests.

Signed-off-by: Diasker <kennardniko@foxmail.com>

* refactor: improve torch test cases

Signed-off-by: Diasker <kennardniko@foxmail.com>

* remove the redundant logic

Signed-off-by: Diasker <kennardniko@foxmail.com>

* Update pkg/runtime/framework/plugins/torch/torch.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>

* Update pkg/runtime/framework/plugins/torch/torch_test.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>

* simply code and remove redundant comments

Signed-off-by: Diasker <kennardniko@foxmail.com>

* Update pkg/runtime/framework/plugins/torch/torch.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>

* Update pkg/runtime/framework/plugins/torch/torch_test.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>

* Update pkg/runtime/framework/plugins/torch/torch.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>

* remove get_num_proc_per_node for sdk and simply torch.go

Signed-off-by: Diasker <kennardniko@foxmail.com>

* Update expected nproc_per_node value in trainjob_controller_test.go

Signed-off-by: Diasker <kennardniko@foxmail.com>

* Update expected nproc_per_node value in trainjob_controller_test.go and remove redundant code of sdk

Signed-off-by: Diasker <kennardniko@foxmail.com>

* update trainjob_controller_test.go

Signed-off-by: Diasker <kennardniko@foxmail.com>

* update torch.go

Signed-off-by: Diasker <kennardniko@foxmail.com>

---------

Signed-off-by: Diasker <kennardniko@foxmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Successfully merging this pull request may close these issues.

Cap nproc_per_node based on the CPU resources of the node for PyTorch TrainJob