Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob #2492
Conversation
…rch TrainJob Signed-off-by: Diasker <kennardniko@foxmail.com>
cpuLimit := int32(1) // Default to 1 if no CPU limit is specified

// First check limits, then requests
if cpuQuantity, ok := trainJob.Spec.Trainer.ResourcesPerNode.Limits[corev1.ResourceCPU]; ok {
It may be more robust to rely on the k8s/apimachinery/pkg/api/resource package and use https://pkg.go.dev/k8s.io/apimachinery/pkg/api/resource#MustParse to handle the millicore case.
}

// Ensure at least 1 process
if cpuLimit < 1 {
Maybe the RoundUp function from the resource package can be used here.
Thank you for your valuable suggestions! I've implemented the changes you recommended:
- Now using resource.MustParse to handle CPU resource values more robustly
- Implemented the RoundUp method to properly handle millicore cases as you suggested
The updated code now correctly handles various CPU resource formats:
- Integer values (e.g., "4")
- Decimal values (e.g., "2.5")
- Millicore values (e.g., "2500m")
I've also added comprehensive test cases to verify these scenarios, including specific tests for millicore and fractional CPU limits. All tests are passing, confirming that the implementation correctly rounds up CPU values when determining the number of processes.
This implementation ensures that when nproc_per_node is set to "auto" and no GPU is requested, we'll properly cap the number of processes based on the CPU resources, even when those resources are specified in millicore format.
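For illustration, here is a minimal standalone sketch (not the merged code) of how parsing and rounding of these formats could behave, assuming only k8s.io/apimachinery/pkg/api/resource:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Integer, decimal, and millicore CPU quantities all round up to whole cores.
	for _, s := range []string{"4", "2.5", "2500m"} {
		q := resource.MustParse(s)
		// Round up to scale 0 (whole units) so a fractional CPU counts as a full process.
		q.RoundUp(resource.Scale(0))
		fmt.Printf("%s -> %d\n", s, q.Value()) // prints 4, 3, 3
	}
}
```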
Thanks again for helping improve the robustness of this code!
Signed-off-by: Diasker <kennardniko@foxmail.com>
cpuValue := cpuQuantity.Value()

numProcPerNode = intstr.FromInt(int(cpuValue))
fmt.Printf("CPU-only device detected with nproc_per_node=auto, capping to CPU limit: %d\n", cpuValue)
We may want to use the logger from the context, e.g. log := ctrl.LoggerFrom(ctx), but the context should be added first to EnforceMLPolicy, similarly to other plugin APIs like ComponentBuilderPlugin.Build or TerminalConditionPlugin.TerminalCondition.
Thank you for your suggestion! I noticed that using ctrl.LoggerFrom(ctx) would require a significant change to the EnforceMLPolicyPlugin interface, as it currently doesn't accept a context parameter.
This change would affect:
- The interface definition itself
- All plugins implementing this interface (Torch, MPI, PlainML)
- The framework code that calls these plugins
- All related test code
Considering that the scope of this change might exceed the goals of the current PR, I'd like to ask:
- How important is this log message? Could we temporarily keep fmt.Printf or use a logging library that doesn't require context (like klog)? Or even delete this log message?
- Or do you think we should proceed with this interface change in the current PR?
Please let me know your thoughts. Thank you!
Yes, I agree. I'd be inclined to remove this log message for now.
@andreyvelich @tenzen-y if you have any suggestions.
Actually, we can add a logger without interface changes as follows:

type Torch struct {
	logger logr.Logger
}

func New(ctx context.Context, client client.Client, indexer client.FieldIndexer) (framework.Plugin, error) {
	return &Torch{
		logger: ctrl.LoggerFrom(ctx).WithValues("pluginName", Name),
	}, nil
}

I was planning to add a logger like this when I designed this interface.
I think the main difference between the two approaches is that with the context from the method signature it's the child context, which holds a reference to the reconciled object, while with the context from the plugin constructor it's the parent context.
Thank you for your excellent suggestion about adding a logger to the Torch plugin without changing the interface. I've implemented your approach as follows:
- Added a logger field to the Torch struct:
type Torch struct {
	logger logr.Logger
}
- Initialized the logger in the New function with proper null checks:
func New(ctx context.Context, client client.Client, indexer client.FieldIndexer) (framework.Plugin, error) {
	// Use logr.Discard() as the default value
	logger := logr.Discard()
	// Try to get logger from context
	if ctx != nil {
		ctxLogger := log.FromContext(ctx)
		// Check if logger is valid (by checking its sink)
		if ctxLogger.GetSink() != nil {
			logger = ctxLogger
		}
	}
	return &Torch{
		logger: logger.WithValues("pluginName", Name),
	}, nil
}
- Replaced the fmt.Printf with structured logging in the EnforceMLPolicy method:
t.logger.Info("CPU-only device detected with nproc_per_node=auto, capping to CPU limit", "cpuLimit", cpuValue)
All tests are now passing, and this approach avoids the need for interface changes while still providing structured logging. Thank you for sharing this elegant solution - it's exactly what we needed!
I think the main difference between the two approaches is that with the context from the method signature it's the child context, which holds a reference to the reconciled object, while with the context from the plugin constructor it's the parent context.
@astefanutti Thank you for pointing out the difference between the two approaches. I initially implemented the solution using the plugin initialization context as suggested by @tenzen-y, without seeing your latest comment. Sorry about that.
I understand your point about the context differences - using the context from the method signature would provide more specific information about the reconciled object, while the plugin initialization context is more general. So, do we need to keep or remove the log message? Actually, I think the necessity of this log message is limited.
@Diasker thanks, I agree. Let's remove the message to stick to the scope of this PR as you suggested, and reconsider this when the need for more important messages in plugins like this arises.
@astefanutti Thanks for your guidance. I just removed the log message.
I agree, let's remove logger for now since it is out of scope of this PR.
Signed-off-by: Diasker <kennardniko@foxmail.com>
I noticed a test failure in … The current implementation logic for …
The existing test case expects …
I've checked other test files and confirmed that this is the only affected test case. The dedicated test file …
@Diasker right, I think it makes sense to update the test as you're suggesting. Both options would work, but I'd be inclined to go with 2) change the expected value to "1" to match the CPU-based calculation.
Signed-off-by: Diasker <kennardniko@foxmail.com>
@astefanutti Alright, the change is finished, thanks!
/ok-to-test
Thank you for working on this @Diasker!
@@ -72,6 +75,52 @@ func (t *Torch) EnforceMLPolicy(info *runtime.Info, trainJob *trainer.TrainJob)
	numProcPerNode = ptr.Deref(trainJob.Spec.Trainer.NumProcPerNode, intstr.FromString("auto"))
}
You also need to remove it from the SDK: https://github.com/kubeflow/trainer/blob/master/sdk/kubeflow/trainer/utils/utils.py#L119
@@ -72,6 +75,52 @@ func (t *Torch) EnforceMLPolicy(info *runtime.Info, trainJob *trainer.TrainJob)
	numProcPerNode = ptr.Deref(trainJob.Spec.Trainer.NumProcPerNode, intstr.FromString("auto"))
}

// Cap nproc_per_node based on CPU resources when set to "auto" and no GPU is requested
if numProcPerNode.String() == "auto" && trainJob.Spec.Trainer != nil && trainJob.Spec.Trainer.ResourcesPerNode != nil {
Should we also validate that numProcPerNode == "cpu" ?
hasGPU := false
for resourceName := range trainJob.Spec.Trainer.ResourcesPerNode.Limits {
	if strings.Contains(strings.ToLower(string(resourceName)), "gpu") {
		hasGPU = true
		break
	}
}
if !hasGPU {
	for resourceName := range trainJob.Spec.Trainer.ResourcesPerNode.Requests {
		if strings.Contains(strings.ToLower(string(resourceName)), "gpu") {
			hasGPU = true
			break
		}
	}
}

// If no GPU is requested, use CPU limit or default to 1
if !hasGPU {
@tenzen-y @astefanutti Is there any helper function in k8s.io/utils we can use to check whether a resource (e.g. cpu, gpu) exists in container.resources?
AFAIK, there is no such util.
	utiltesting "github.com/kubeflow/trainer/pkg/util/testing"
)

func TestTorch_EnforceMLPolicy_NumProcPerNode_Issue2407(t *testing.T) {
I think these test cases should be part of TestTorch, similar to what we've done for MPI here: #2481
Signed-off-by: Diasker <kennardniko@foxmail.com>
@andreyvelich Thank you for your feedback! I've done these:

Is the current CPU handling logic in torch.go robust enough? I've ensured it checks both limits and requests, handles fractional values by rounding up, and guarantees at least one process. Also, regarding the SDK's …

Regarding the helper function to check for resource existence: I couldn't find a specific utility function in k8s.io/utils for checking whether a resource exists in container.resources. If such a helper exists, I'd be happy to simplify the code by using it.

As a coding newcomer, I really appreciate your patience and guidance throughout this process. Your feedback has been invaluable in helping me improve the implementation.
The goal of our issue was to move the logic to assign numProcPerNode from the SDK to the Controller, which is why we should completely remove the …

Thanks for improving it! I will take a look soon.
Scheduler: &runtime.Scheduler{TotalRequests: map[string]runtime.TotalResourceRequest{}},
},
},
// Issue #2407 test case - nproc_per_node=auto with CPU limit
Suggested change:
- // Issue #2407 test case - nproc_per_node=auto with CPU limit
+ // nproc_per_node=auto with CPU limit
It would be better not to add such a specific prefix.
Could you address those in other places, as well?
// Validate numProcPerNode if an expected value is set
if tc.wantNumProcPerNode != "" && tc.info != nil {
	// Find the PET_NPROC_PER_NODE environment variable
	var numProcPerNodeValue string
	for _, env := range tc.info.Trainer.Env {
		if env.Name != nil && *env.Name == constants.TorchEnvNumProcPerNode {
			if env.Value != nil {
				numProcPerNodeValue = *env.Value
			}
			break
		}
	}

	if diff := cmp.Diff(tc.wantNumProcPerNode, numProcPerNodeValue); diff != "" {
		t.Errorf("Torch.EnforceMLPolicy() numProcPerNode mismatch (-want +got):\n%s", diff)
	}
}
We should always compare the whole of the expected and got runtime.Info.
Signed-off-by: Diasker <kennardniko@foxmail.com>
sdk/kubeflow/trainer/utils/utils.py (Outdated)
@@ -111,8 +111,6 @@ def get_resources_per_node(
    return resources


# TODO (andreyvelich): Move this part to the Kubeflow Trainer torch plugins.
# Ref issue: https://github.com/kubeflow/trainer/issues/2407
def get_num_proc_per_node(resources_per_node: dict) -> str:
I think this can be removed altogether.
Right, I'll remove the function later.
@Diasker It looks like there are conflicts.
Part of #2468
	}
	return fallbackNumProcPerNode, false
}
return intstr.FromInt32(int32(defaultCPU)), false
is the conversion to int32 really needed?
@astefanutti Yes, the conversion is necessary because FromInt32 requires an int32 parameter. Or we can define defaultCPU as int32 from the beginning: var defaultCPU int32 = 1
Isn't it already declared as var defaultCPU int32 = 1?
@astefanutti OMG!!! You're absolutely right! It is already declared as var defaultCPU int32 = 1. I apologize for the confusion - I must have overlooked this when reviewing the code. Thank you for pointing this out.
@tenzen-y Yes, I see the conflicts. This is similar to the issue we discussed earlier with the trainingruntime_test.go file.
I'd like to use the same approach for trainjob_controller_test.go as we did for trainingruntime_test.go, but I'm encountering a merge conflict.
Signed-off-by: Diasker <kennardniko@foxmail.com>
@Diasker yes, you can take the same approach. Is there any blocker for git rebase?
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>
@tenzen-y No blockers for git rebase. I was just unfamiliar with how to resolve this specific conflict, but I've figured it out now. Thanks!
@@ -101,31 +101,6 @@ def get_resources_per_node(
    return resources


# TODO (andreyvelich): Move this part to the Kubeflow Trainer torch plugins.
# Ref issue: https://github.com/kubeflow/trainer/issues/2407
def get_num_proc_per_node(resources_per_node: dict) -> str:
Since you removed this func, you also need to remove it from here:
trainer/sdk/kubeflow/trainer/api/trainer_client.py, lines 225 to 230 in b3eabd3:

if trainer and trainer.resources_per_node:
    trainer_crd.num_proc_per_node = (
        models.IoK8sApimachineryPkgUtilIntstrIntOrString(
            utils.get_num_proc_per_node(trainer.resources_per_node)
        )
    )
@andreyvelich Alright, thanks!
Signed-off-by: Diasker <kennardniko@foxmail.com>
…nd remove redundant code of sdk Signed-off-by: Diasker <kennardniko@foxmail.com>
Signed-off-by: Diasker <kennardniko@foxmail.com>
Signed-off-by: Diasker <kennardniko@foxmail.com>
@Diasker Thank you for this great contribution!
I left just a nit comment. Could you open another PR to address it?
/lgtm
/approve
	}
	fallbackNumProcPerNode = intstr.FromString("auto")
case "cpu":
	shouldUseCPU = func(resources corev1.ResourceList) bool {
Suggested change:
- shouldUseCPU = func(resources corev1.ResourceList) bool {
+ shouldUseCPU = func(corev1.ResourceList) bool {

Nit: the resources parameter is not used anywhere.
Sure, I'll open another PR to address it later. Thanks for your Approval!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: tenzen-y.
if strings.Contains(strings.ToLower(resName.String()), "gpu") {
	return false
This might not work for other devices (e.g. tpu, npu).
@tenzen-y @Diasker @astefanutti Should we exclude all of them?
I think it's challenging to cover all corner cases, since arbitrary names can be specified for special devices in the resources fields.
So I think it would be better to add device names based on user feedback. I mean, I want to find a common understanding of which special device names are actually used, based on feedback.
Additionally, there are LimitRange problems, etc.
I agree with @tenzen-y's approach. Trying to cover all possible device types proactively would be challenging since resource names can be arbitrary. It makes more sense to start with GPU support (which is the most common case) and then extend to other accelerators like TPU/NPU based on actual user feedback and real-world usage patterns. If users report issues with specific devices, we can add support for those devices in future iterations.
+1
sounds good.
@Diasker thanks for the fantastic work!
…rch TrainJob (kubeflow#2492)

* Fix kubeflow#2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob
* Refactor the fix using resource.MustParse and RoundUp
* remove unecessary log message
* update test case to expect nproc_per_node=1 without GPU
* Implement numProcPerNode=cpu option and refactor tests.
* refactor: improve torch test cases
* remove the redundant logic
* Update pkg/runtime/framework/plugins/torch/torch.go
* Update pkg/runtime/framework/plugins/torch/torch_test.go
* simply code and remove redundant comments
* Update pkg/runtime/framework/plugins/torch/torch.go
* Update pkg/runtime/framework/plugins/torch/torch_test.go
* Update pkg/runtime/framework/plugins/torch/torch.go
* remove get_num_proc_per_node for sdk and simply torch.go
* Update expected nproc_per_node value in trainjob_controller_test.go
* Update expected nproc_per_node value in trainjob_controller_test.go and remove redundant code of sdk
* update trainjob_controller_test.go
* update torch.go

Signed-off-by: Diasker <kennardniko@foxmail.com>
Signed-off-by: Diasker <83821036+Diasker@users.noreply.github.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob
Problem Description
When nproc_per_node is set to "auto" in a PyTorch TrainJob, PyTorch's default behavior is to set it to the number of available CPUs on the node (via os.cpu_count()). This can cause problems when the container's CPU limit is lower than the node's CPU count, since PyTorch will launch more worker processes than the CPUs actually allocated to the container.

This issue was discussed in PR #2387 and formally documented in Issue #2407.
Solution
This PR modifies the Torch plugin to cap nproc_per_node based on CPU resources when it's set to "auto" and no GPU is requested:
- nproc_per_node is capped to the CPU limit (or the request if no limit is set)

This ensures that PyTorch distributed training only uses the CPU resources allocated to the container, rather than attempting to use all available CPUs on the node.
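As a rough sketch of the capping logic described above, pieced together from the diff fragments quoted in the review threads; the helper name capNumProcPerNode and its standalone signature are illustrative assumptions, not the exact code that was merged:

```go
package torch

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// capNumProcPerNode returns the value to use when nproc_per_node is "auto",
// given the per-node resources of a TrainJob (illustrative helper only).
func capNumProcPerNode(res corev1.ResourceRequirements) intstr.IntOrString {
	// If any limit or request looks like a GPU, keep "auto" so torchrun can
	// derive the process count from the visible devices.
	for _, list := range []corev1.ResourceList{res.Limits, res.Requests} {
		for name := range list {
			if strings.Contains(strings.ToLower(string(name)), "gpu") {
				return intstr.FromString("auto")
			}
		}
	}
	// Otherwise cap to the CPU limit, falling back to the request, then to 1.
	cpu := resource.MustParse("1")
	if q, ok := res.Limits[corev1.ResourceCPU]; ok {
		cpu = q
	} else if q, ok := res.Requests[corev1.ResourceCPU]; ok {
		cpu = q
	}
	// Round fractional and millicore values (e.g. "2500m") up to whole CPUs.
	cpu.RoundUp(resource.Scale(0))
	if cpu.Value() < 1 {
		return intstr.FromInt32(1)
	}
	return intstr.FromInt32(int32(cpu.Value()))
}
```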
Code Changes
Modified the following files:
- pkg/runtime/framework/plugins/torch/torch.go: Added logic to cap nproc_per_node based on CPU resources
- pkg/runtime/framework/plugins/torch/torch_nproc_per_node_test.go: Added test cases to validate the fix

Test Cases
Added comprehensive test cases to verify the behavior:
- nproc_per_node=auto with a CPU limit: it should be capped to that limit
- nproc_per_node=auto with no CPU resources specified: it should default to 1
- nproc_per_node=auto with a low CPU limit: it should be capped to that limit even if the actual CPU count is higher
- nproc_per_node=auto with only CPU requests (no limits): it should use the request value
- nproc_per_node=auto with GPU resources: it should remain "auto"
- nproc_per_node explicitly set: that value should be preserved
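For context, here is a table-driven sketch of a few of these expectations, written against the illustrative capNumProcPerNode helper from the Solution section above rather than the actual TestTorch table in the repository:

```go
package torch

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func TestCapNumProcPerNode(t *testing.T) {
	cases := map[string]struct {
		res  corev1.ResourceRequirements
		want string
	}{
		"cpu limit caps the process count": {
			res: corev1.ResourceRequirements{
				Limits: corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("4")},
			},
			want: "4",
		},
		"millicore limit rounds up": {
			res: corev1.ResourceRequirements{
				Limits: corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("2500m")},
			},
			want: "3",
		},
		"cpu request used when no limit is set": {
			res: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("2")},
			},
			want: "2",
		},
		"gpu present keeps auto": {
			res: corev1.ResourceRequirements{
				Limits: corev1.ResourceList{"nvidia.com/gpu": resource.MustParse("2")},
			},
			want: "auto",
		},
		"no resources defaults to 1": {
			want: "1",
		},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			if got := capNumProcPerNode(tc.res).String(); got != tc.want {
				t.Errorf("capNumProcPerNode() = %q, want %q", got, tc.want)
			}
		})
	}
}
```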
Related Links
- Issue #2407: Cap nproc_per_node based on the CPU resources of the node for PyTorch TrainJob