Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob by Diasker · Pull Request #2492 · kubeflow/trainer · GitHub

Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob #2492


Merged: 19 commits from Diasker:fix-issue-2407 merged into kubeflow:master on Mar 13, 2025

Conversation

@Diasker (Contributor) commented Mar 10, 2025

Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob

Problem Description

When nproc_per_node is set to "auto" in a PyTorch TrainJob, PyTorch's default behavior is to set it to the number of available CPUs on the node (via os.cpu_count()). This can cause problems when:

  1. The container has a CPU limit that is lower than the node's actual CPU count
  2. No GPU is requested, so PyTorch tries to use all available CPUs
  3. The training job exceeds its memory limit due to too many processes, leading to OOM errors or deadlocks

This issue was discussed in PR #2387 and formally documented in Issue #2407.

Solution

This PR modifies the Torch plugin to cap nproc_per_node based on CPU resources when it's set to "auto" and no GPU is requested:

  1. First, check if GPU resources are requested (by examining both Limits and Requests in ResourcesPerNode)
  2. If no GPU is requested, set nproc_per_node to the CPU limit (or request if no limit is set)
  3. Default to 1 if no CPU resources are specified
  4. Ensure at least 1 process, even if CPU limit is less than 1

This ensures that PyTorch distributed training only uses the CPU resources allocated to the container, rather than attempting to use all available CPUs on the node.
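
A simplified sketch of that flow is shown below (not the exact merged code; the hasGPU helper and surrounding wiring are illustrative, and error handling is omitted):

// Sketch only: cap nproc_per_node when it is "auto" and no GPU is requested.
if numProcPerNode.String() == "auto" &&
	trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.ResourcesPerNode != nil {
	res := trainJob.Spec.Trainer.ResourcesPerNode
	if !hasGPU(res.Limits) && !hasGPU(res.Requests) {
		// Prefer the CPU limit, fall back to the request, default to 1.
		numProcPerNode = intstr.FromInt32(1)
		if cpu, ok := res.Limits[corev1.ResourceCPU]; ok {
			numProcPerNode = intstr.FromInt32(int32(cpu.Value())) // Value() rounds fractional CPUs up
		} else if cpu, ok := res.Requests[corev1.ResourceCPU]; ok {
			numProcPerNode = intstr.FromInt32(int32(cpu.Value()))
		}
	}
}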

Code Changes

Modified the following files:

  1. pkg/runtime/framework/plugins/torch/torch.go: Added logic to cap nproc_per_node based on CPU resources
  2. pkg/runtime/framework/plugins/torch/torch_nproc_per_node_test.go: Added test cases to validate the fix

Test Cases

Added comprehensive test cases to verify the behavior:

  • When nproc_per_node=auto with a CPU limit, it should be capped to that limit
  • When nproc_per_node=auto with no CPU resources specified, it should default to 1
  • When nproc_per_node=auto with a low CPU limit, it should be capped to that limit even if the actual CPU count is higher
  • When nproc_per_node=auto with only CPU requests (no limits), it should use the request value
  • When nproc_per_node=auto with GPU resources, it should remain "auto"
  • When nproc_per_node is explicitly set, that value should be preserved

Related Links

…rch TrainJob

Signed-off-by: Diasker <kennardniko@foxmail.com>
cpuLimit := int32(1) // Default to 1 if no CPU limit is specified

// First check limits, then requests
if cpuQuantity, ok := trainJob.Spec.Trainer.ResourcesPerNode.Limits[corev1.ResourceCPU]; ok {
Contributor:

It may be more robust to rely on the k8s/apimachinery/pkg/api/resource package and use https://pkg.go.dev/k8s.io/apimachinery/pkg/api/resource#MustParse to handle the millicore case.

}

// Ensure at least 1 process
if cpuLimit < 1 {
Contributor:

Maybe the RoundUp function from the resource package can be used here.

Contributor Author:

Thank you for your valuable suggestions! I've implemented the changes you recommended:

  1. Now using resource.MustParse to handle CPU resource values more robustly
  2. Implemented RoundUp method to properly handle millicore cases as you suggested

The updated code now correctly handles various CPU resource formats:

  • Integer values (e.g., "4")
  • Decimal values (e.g., "2.5")
  • Millicore values (e.g., "2500m")

I've also added comprehensive test cases to verify these scenarios, including specific tests for millicore and fractional CPU limits. All tests are passing, confirming that the implementation correctly rounds up CPU values when determining the number of processes.

This implementation ensures that when nproc_per_node is set to "auto" and no GPU is requested, we'll properly cap the number of processes based on the CPU resources, even when those resources are specified in millicore format.
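
For illustration, a small standalone sketch of that rounding behavior using the resource package (values are illustrative):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Millicore and fractional CPU values are rounded up to whole processes.
	for _, s := range []string{"4", "2.5", "2500m"} {
		q := resource.MustParse(s)
		q.RoundUp(0)                            // round up to scale 0, i.e. whole cores
		fmt.Printf("%s -> %d\n", s, q.Value()) // 4 -> 4, 2.5 -> 3, 2500m -> 3
	}
}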

Thanks again for helping improve the robustness of this code!

Signed-off-by: Diasker <kennardniko@foxmail.com>
cpuValue := cpuQuantity.Value()

numProcPerNode = intstr.FromInt(int(cpuValue))
fmt.Printf("CPU-only device detected with nproc_per_node=auto, capping to CPU limit: %d\n", cpuValue)
Contributor:

We may want to use the logger from the context, e.g. log := ctrl.LoggerFrom(ctx) but the context should be added first to EnforceMLPolicy similarly to other plugin APIs like ComponentBuilderPlugin.Build or TerminalConditionPlugin.TerminalCondition.

Contributor Author @Diasker commented Mar 11, 2025:

Thank you for your suggestion! I noticed that using ctrl.LoggerFrom(ctx) would require a significant change to the EnforceMLPolicyPlugin interface, as it currently doesn't accept a context parameter.

This change would affect:

  1. The interface definition itself
  2. All plugins implementing this interface (Torch, MPI, PlainML)
  3. The framework code that calls these plugins
  4. All related test code

Considering that the scope of this change might exceed the goals of the current PR, I'd like to ask:

  1. How important is this log message? Could we temporarily keep fmt.Printf or use a logging library that doesn't require context (like klog)? Or even delete this log message?
  2. Or do you think we should proceed with this interface change in the current PR?

Please let me know your thoughts. Thank you!

Contributor:

Yes, I agree. I'd be inclined to remove this log message for now.

@andreyvelich @tenzen-y if you have any suggestions.

Member:

Actually, we can add logger without interface changes in the following:

type Torch struct {
	logger logr.Logger
}

func New(ctx context.Context, client client.Client, indexer client.FieldIndexer) (framework.Plugin, error) {
	return &Torch{
		logger: ctrl.LoggerFrom(ctx).WithValues("pluginName", Name),
	}, nil
}

I was planning to add logger like this when I designed this interface.

Contributor:

I think the main difference between the two approaches is that the context from the method signature is the child context, which holds a reference to the reconciled object, whereas the context from the plugin constructor is the parent context.

Contributor Author:

Thank you for your excellent suggestion about adding a logger to the Torch plugin without changing the interface. I've implemented your approach as follows:

  1. Added a logger field to the Torch struct:
type Torch struct {
    logger logr.Logger
}
  2. Initialized the logger in the New function with proper null checks:
func New(ctx context.Context, client client.Client, indexer client.FieldIndexer) (framework.Plugin, error) {
    // Use logr.Discard() as the default value
    logger := logr.Discard()

    // Try to get logger from context
    if ctx != nil {
        ctxLogger := log.FromContext(ctx)
        // Check if logger is valid (by checking its sink)
        if ctxLogger.GetSink() != nil {
            logger = ctxLogger
        }
    }

    return &Torch{
        logger: logger.WithValues("pluginName", Name),
    }, nil
}
  3. Replaced the fmt.Printf with structured logging in the EnforceMLPolicy method:
t.logger.Info("CPU-only device detected with nproc_per_node=auto, capping to CPU limit", "cpuLimit", cpuValue)

All tests are now passing, and this approach avoids the need for interface changes while still providing structured logging. Thank you for sharing this elegant solution - it's exactly what we needed!

Contributor Author @Diasker commented Mar 11, 2025:

I think the main difference between the two approaches is that the context from the method signature is the child context, which holds a reference to the reconciled object, whereas the context from the plugin constructor is the parent context.

@astefanutti Thank you for pointing out the difference between the two approaches. I initially implemented the solution using the plugin initialization context as suggested by @tenzen-y, without seeing your latest comment. Sorry about that.
I understand your point about the context differences: using the context from the method signature would provide more specific information about the reconciled object, while the plugin initialization context is more general. So, should we keep or remove the log message? Honestly, I think this log message adds limited value.

Contributor:

@Diasker thanks, I agree. Let's remove the message to stick to the scope of this PR as you suggested, and reconsider this when the need for more important messages in plugins like this arises.

Contributor Author @Diasker commented Mar 11, 2025:

@astefanutti Thanks for your guidance. I just removed the log message.

Member:

I agree, let's remove logger for now since it is out of scope of this PR.

@Diasker Diasker requested a review from astefanutti March 11, 2025 10:11
Signed-off-by: Diasker <kennardniko@foxmail.com>
@Diasker Diasker requested a review from tenzen-y March 11, 2025 12:19
@Diasker (Contributor Author) commented Mar 11, 2025:

@astefanutti

I noticed a test failure in pkg/runtime/core/trainingruntime_test.go in the test case succeeded to build JobSet with Torch values from the Runtime and envs within TestTrainingRuntimeNewObjects.

The current implementation logic for nproc_per_node is:

  • When set to "auto" with no GPU resources requested: use CPU resource limits to set the process count
  • When GPU resources are requested: keep "auto" value

The existing test case expects nproc_per_node to remain "auto", but since it only defines CPU resources (without GPU), the test is failing. Would you like me to change this test file?
Maybe we can:

  1. Update the test case to include GPU resources to maintain the "auto" value, or
  2. Update the test's expected value to "1" to match the CPU-based calculation?

I've checked other test files and confirmed that this is the only affected test case. The dedicated test file torch_nproc_per_node_test.go already covers all the CPU resource handling scenarios correctly.

@astefanutti (Contributor):

@Diasker right I think that makes sense to update the test as you're suggesting.

Both options would work, but I'd be inclined to go with 2): change the expected value to "1" to match the CPU-based calculation.

Signed-off-by: Diasker <kennardniko@foxmail.com>
@Diasker (Contributor Author) commented Mar 11, 2025:

@astefanutti Alright, change finished, thanks!

@andreyvelich (Member):

/ok-to-test

@andreyvelich (Member) left a comment:

Thank you for working on this @Diasker!


@@ -72,6 +75,52 @@ func (t *Torch) EnforceMLPolicy(info *runtime.Info, trainJob *trainer.TrainJob)
numProcPerNode = ptr.Deref(trainJob.Spec.Trainer.NumProcPerNode, intstr.FromString("auto"))
}

// Cap nproc_per_node based on CPU resources when set to "auto" and no GPU is requested
if numProcPerNode.String() == "auto" && trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.ResourcesPerNode != nil {
Member:

Should we also validate that numProcPerNode == "cpu" ?

Comment on lines 81 to 98
hasGPU := false
for resourceName := range trainJob.Spec.Trainer.ResourcesPerNode.Limits {
if strings.Contains(strings.ToLower(string(resourceName)), "gpu") {
hasGPU = true
break
}
}
if !hasGPU {
for resourceName := range trainJob.Spec.Trainer.ResourcesPerNode.Requests {
if strings.Contains(strings.ToLower(string(resourceName)), "gpu") {
hasGPU = true
break
}
}
}

// If no GPU is requested, use CPU limit or default to 1
if !hasGPU {
Member:

@tenzen-y @astefanutti Is there any helper function in k8s.io/utils we can use to check whether resource (e.g. cpu, gpu) exists in container.resources ?

Member:

AFAIK, there is no such util.

utiltesting "github.com/kubeflow/trainer/pkg/util/testing"
)

func TestTorch_EnforceMLPolicy_NumProcPerNode_Issue2407(t *testing.T) {
Member:

I think these test cases should be part of TestTorch, similar to what we've done for MPI here: #2481

Signed-off-by: Diasker <kennardniko@foxmail.com>
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Mar 12, 2025
@Diasker (Contributor Author) commented Mar 12, 2025:

@andreyvelich Thank you for your feedback! I've done these:

  1. Regarding the SDK comment: I've removed the comment.

  2. For CPU support in torch.go: I've implemented the logic to handle numProcPerNode="cpu" which will always use CPU resources regardless of GPU presence. This ensures users can explicitly choose to use CPU cores for distributed training even when GPUs are available.

  3. Following your suggestion and PR Implement MPI plugin UTs #2481, I've created a comprehensive torch_test.go file instead of a separate test file. The test cases cover various scenarios:

    • numProcPerNode=auto with CPU limits but no GPU
    • numProcPerNode=auto with GPU resources
    • numProcPerNode=cpu with only CPU resources
    • numProcPerNode=cpu with both CPU and GPU resources
    • Handling of fractional CPU values (rounding up)
    • Multi-node multi-GPU training configurations

Is the current CPU handling logic in torch.go robust enough? I've ensured it checks both limits and requests, handles fractional values by rounding up, and guarantees at least one process.
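
For reference, here is a rough sketch of how the "auto" / "cpu" dispatch looks (variable names mirror the snippets quoted later in this review; the exact fallback values in the merged code may differ):

// Sketch only: decide whether CPU resources drive the process count.
var (
	shouldUseCPU           func(corev1.ResourceList) bool
	fallbackNumProcPerNode intstr.IntOrString
)

switch numProcPerNode.String() {
case "auto":
	// Fall back to CPU counting only when no GPU-like resource is requested.
	shouldUseCPU = func(resources corev1.ResourceList) bool {
		for resName := range resources {
			if strings.Contains(strings.ToLower(resName.String()), "gpu") {
				return false
			}
		}
		return true
	}
	fallbackNumProcPerNode = intstr.FromString("auto")
case "cpu":
	// Explicit "cpu": always size processes from the CPU resources.
	shouldUseCPU = func(corev1.ResourceList) bool { return true }
	fallbackNumProcPerNode = intstr.FromInt32(defaultCPU)
}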

Also, regarding the SDK's get_num_proc_per_node function - should we update it to match the Go implementation, particularly to support the "cpu" option? The current SDK function prioritizes GPU resources and doesn't have special handling for the "cpu" value.

Regarding the helper function to check for resource existence: I couldn't find a specific utility function in k8s.io/utils for checking whether a resource exists in container.resources. If such a helper exists, I'd be happy to simplify the code by using it.

As a coding newcomer, I really appreciate your patience and guidance throughout this process. Your feedback has been invaluable in helping me improve the implementation.

@Diasker Diasker requested a review from andreyvelich March 12, 2025 05:43
@andreyvelich (Member):

Also, regarding the SDK's get_num_proc_per_node function - should we update it to match the Go implementation, particularly to support the "cpu" option? The current SDK function prioritizes GPU resources and doesn't have special handling for the "cpu" value.

The goal of our issue was to move the logic that assigns numProcPerNode from the SDK to the controller; that is why we should completely remove the get_num_proc_per_node util function and rely on the Torch plugin to generate the correct value for this parameter.

Is the current CPU handling logic in torch.go robust enough? I've ensured it checks both limits and requests, handles fractional values by rounding up, and guarantees at least one process.

Thanks for improving it! I will take a look soon.

Scheduler: &runtime.Scheduler{TotalRequests: map[string]runtime.TotalResourceRequest{}},
},
},
// Issue #2407 test case - nproc_per_node=auto with CPU limit
Member:

Suggested change
// Issue #2407 test case - nproc_per_node=auto with CPU limit
// nproc_per_node=auto with CPU limit

It would be better not to add such a specific prefix.
Could you address those in other places, as well?

Comment on lines 487 to 503
// If need to validate numProcPerNode
if tc.wantNumProcPerNode != "" && tc.info != nil {
// Find PET_NPROC_PER_NODE environment variable
var numProcPerNodeValue string
for _, env := range tc.info.Trainer.Env {
if env.Name != nil && *env.Name == constants.TorchEnvNumProcPerNode {
if env.Value != nil {
numProcPerNodeValue = *env.Value
}
break
}
}

if diff := cmp.Diff(tc.wantNumProcPerNode, numProcPerNodeValue); diff != "" {
t.Errorf("Torch.EnforceMLPolicy() numProcPerNode mismatch (-want +got):\n%s", diff)
}
}
Member:

We should always compare whole of expected and got runtime.Info.
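
For example, a minimal sketch of that whole-object comparison (tc.wantInfo is an assumed field name here):

// Sketch only: diff the entire expected vs. actual runtime.Info instead of
// extracting a single environment variable.
if diff := cmp.Diff(tc.wantInfo, tc.info); diff != "" {
	t.Errorf("Torch.EnforceMLPolicy() runtime.Info mismatch (-want +got):\n%s", diff)
}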

Signed-off-by: Diasker <kennardniko@foxmail.com>
@@ -111,8 +111,6 @@ def get_resources_per_node(
return resources


# TODO (andreyvelich): Move this part to the Kubeflow Trainer torch plugins.
# Ref issue: https://github.com/kubeflow/trainer/issues/2407
def get_num_proc_per_node(resources_per_node: dict) -> str:
Contributor:

I think this can be removed altogether.

Contributor Author:

Right, I'll remove the function later.

@tenzen-y (Member):

@Diasker It looks like there are conflicts

@tenzen-y (Member):

Part of #2468

}
return fallbackNumProcPerNode, false
}
return intstr.FromInt32(int32(defaultCPU)), false
Contributor:

is the conversion to int32 really needed?

Contributor Author @Diasker commented Mar 13, 2025:

@astefanutti Yes, the conversion is necessary because FromInt32 requires an int32 parameter.
Or we can define defaultCPU as int32 from the beginning: var defaultCPU int32 = 1

Contributor:

Isn't it already declared as var defaultCPU int32 = 1?

Contributor Author:

@astefanutti OMG!!! You're absolutely right! It is already declared as var defaultCPU int32 = 1. I apologize for the confusion - I must have overlooked this when reviewing the code. Thank you for pointing this out.

@Diasker (Contributor Author) commented Mar 13, 2025:

@Diasker It looks like there are conflicts

@tenzen-y Yes, I see the conflicts. This is similar to the issue we discussed earlier with the trainingruntime_test.go file.

I noticed a test failure in pkg/runtime/core/trainingruntime_test.go in the test case succeeded to build JobSet with Torch values from the Runtime and envs within TestTrainingRuntimeNewObjects.

The current implementation logic for nproc_per_node is:

  • When set to "auto" with no GPU resources requested: use CPU resource limits to set the process count
  • When GPU resources are requested: keep "auto" value

The existing test case expects nproc_per_node to remain "auto", but since it only defines CPU resources (without GPU), the test is failing. Would you like me to change this test file? Maybe we can:

  1. Update the test case to include GPU resources to maintain the "auto" value, or
  2. Update the test's expected value to "1" to match the CPU-based calculation?

I'd like to use the same approach for trainjob_controller_test.go as we did for trainingruntime_test.go, but I'm encountering a merge conflict.

@Diasker Diasker requested review from tenzen-y and astefanutti March 13, 2025 08:00
Signed-off-by: Diasker <kennardniko@foxmail.com>
@tenzen-y (Member) commented Mar 13, 2025:

I'd like to use the same approach for trainjob_controller_test.go as we did for trainingruntime_test.go

@Diasker yes, you can take the same approach. Is there any blocker for git rebase?

Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>
@Diasker (Contributor Author) commented Mar 13, 2025:

@tenzen-y No blockers for git rebase. I was just unfamiliar with how to resolve this specific conflict, but I've figured it out now. Thanks!

@@ -101,31 +101,6 @@ def get_resources_per_node(
return resources


# TODO (andreyvelich): Move this part to the Kubeflow Trainer torch plugins.
# Ref issue: https://github.com/kubeflow/trainer/issues/2407
def get_num_proc_per_node(resources_per_node: dict) -> str:
Member @andreyvelich commented Mar 13, 2025:

Since you removed this func, you also need to remove it from here:

if trainer and trainer.resources_per_node:
    trainer_crd.num_proc_per_node = (
        models.IoK8sApimachineryPkgUtilIntstrIntOrString(
            utils.get_num_proc_per_node(trainer.resources_per_node)
        )
    )

Contributor Author:

@andreyvelich Alright, thanks!

Diasker added 4 commits March 13, 2025 19:25
Signed-off-by: Diasker <kennardniko@foxmail.com>
…nd remove redundant code of sdk

Signed-off-by: Diasker <kennardniko@foxmail.com>
Signed-off-by: Diasker <kennardniko@foxmail.com>
Signed-off-by: Diasker <kennardniko@foxmail.com>
@tenzen-y (Member) left a comment:

@Diasker Thank you for this great contribution!
I left just a nit comment. Could you open another PR to address it?

/lgtm
/approve

}
fallbackNumProcPerNode = intstr.FromString("auto")
case "cpu":
shouldUseCPU = func(resources corev1.ResourceList) bool {
Member:

Suggested change
shouldUseCPU = func(resources corev1.ResourceList) bool {
shouldUseCPU = func(corev1.ResourceList) bool {

nits: This is not used anywhere.

Contributor Author:

Sure, I'll open another PR to address it later. Thanks for your Approval!


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y


@google-oss-prow google-oss-prow bot merged commit 9f290d4 into kubeflow:master Mar 13, 2025
14 checks passed
Comment on lines +86 to +87
if strings.Contains(strings.ToLower(resName.String()), "gpu") {
return false
Member:

This might not work for other devices (e.g. tpu, npu).
@tenzen-y @Diasker @astefanutti Should we exclude all of them ?

Member:

I think it's challenging to cover every corner case, since arbitrary names can be specified for special devices in the resources fields.

So I think it would be better to add device names based on user feedback; I'd like to establish a common set of actually used special device names from that feedback.

Additionally, there are other problems such as LimitRange, etc.

Contributor Author:

I agree with @tenzen-y's approach. Trying to cover all possible device types proactively would be challenging since resource names can be arbitrary. It makes more sense to start with GPU support (which is the most common case) and then extend to other accelerators like TPU/NPU based on actual user feedback and real-world usage patterns. If users report issues with specific devices, we can add support for those devices in future iterations.

Contributor:

+1

Member:

sounds good.

@astefanutti (Contributor):

@Diasker thanks for the fantastic work!

@Diasker (Contributor Author) commented Mar 14, 2025:

@Diasker thanks for the fantastic work!
@astefanutti My pleasure. Thanks for your guidance and patience.

@Diasker Diasker deleted the fix-issue-2407 branch March 14, 2025 14:04
mahdikhashan pushed a commit to mahdikhashan/trainer that referenced this pull request Mar 16, 2025
…rch TrainJob (kubeflow#2492)

* Fix kubeflow#2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob

Signed-off-by: Diasker <kennardniko@foxmail.com>

* Refactor the fix using resource.MustParse and RoundUp

Signed-off-by: Diasker <kennardniko@foxmail.com>

* remove unecessary log message

Signed-off-by: Diasker <kennardniko@foxmail.com>

* update test case to expect nproc_per_node=1 without GPU

Signed-off-by: Diasker <kennardniko@foxmail.com>

* Implement numProcPerNode=cpu option and refactor tests.

Signed-off-by: Diasker <kennardniko@foxmail.com>

* refactor: improve torch test cases

Signed-off-by: Diasker <kennardniko@foxmail.com>

* remove the redundant logic

Signed-off-by: Diasker <kennardniko@foxmail.com>

* Update pkg/runtime/framework/plugins/torch/torch.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>

* Update pkg/runtime/framework/plugins/torch/torch_test.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>

* simply code and remove redundant comments

Signed-off-by: Diasker <kennardniko@foxmail.com>

* Update pkg/runtime/framework/plugins/torch/torch.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>

* Update pkg/runtime/framework/plugins/torch/torch_test.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>

* Update pkg/runtime/framework/plugins/torch/torch.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>

* remove get_num_proc_per_node for sdk and simply torch.go

Signed-off-by: Diasker <kennardniko@foxmail.com>

* Update expected nproc_per_node value in trainjob_controller_test.go

Signed-off-by: Diasker <kennardniko@foxmail.com>

* Update expected nproc_per_node value in trainjob_controller_test.go and remove redundant code of sdk

Signed-off-by: Diasker <kennardniko@foxmail.com>

* update trainjob_controller_test.go

Signed-off-by: Diasker <kennardniko@foxmail.com>

* update torch.go

Signed-off-by: Diasker <kennardniko@foxmail.com>

---------

Signed-off-by: Diasker <kennardniko@foxmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Successfully merging this pull request may close these issues.

Cap nproc_per_node based on the CPU resources of the node for PyTorch TrainJob