fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files by Electronic-Waste · Pull Request #2669 · kubeflow/trainer

fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files #2669

Open

Electronic-Waste wants to merge 4 commits into master

Conversation

Electronic-Waste (Member)

What this PR does / why we need it:

This PR fixes some errors that occurred in the LLM fine-tuning workflow, which is run by applying a TrainJob YAML file like the following:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: torchtune-llama3-2-1b
  namespace: kubeflow
spec:
  runtimeRef:
    name: torchtune-llama3.2-1b
  trainer:
    resourcesPerNode:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
    numProcPerNode: 1
  initializer:
    model:
      env:
        - name: ACCESS_TOKEN
          value: <MY_HF_TOKEN>
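
For orientation, the runtimeRef above points at a ClusterTrainingRuntime (the "CTR" in the PR title). As a purely illustrative sketch, a matching runtime object would start roughly like this; everything beyond the header is omitted because it is not shown on this page:

# Illustrative only: the runtime referenced by the TrainJob's runtimeRef.
# Spec fields are intentionally left out here.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torchtune-llama3.2-1b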

Results

  • Model Initializer
kubectl logs torchtune-llama3-2-1b-model-initializer-0-0-hrcws -n kubeflow
2025-06-14T05:58:37Z INFO     [__main__.py:16] Starting pre-trained model initialization
2025-06-14T05:58:37Z INFO     [huggingface.py:26] Downloading model: meta-llama/Llama-3.2-1B-Instruct
2025-06-14T05:58:37Z INFO     [huggingface.py:27] ----------------------------------------
Fetching 8 files: 100%|██████████| 8/8 [00:43<00:00,  5.41s/it]
2025-06-14T05:59:21Z INFO     [huggingface.py:43] Model has been downloaded
  • Dataset Initializer
kubectl logs torchtune-llama3-2-1b-dataset-initializer-0-0-97k42 -n kubeflow
2025-06-14T05:58:18Z INFO     [__main__.py:16] Starting dataset initialization
2025-06-14T05:58:18Z INFO     [huggingface.py:28] Downloading dataset: tatsu-lab/alpaca
2025-06-14T05:58:18Z INFO     [huggingface.py:29] ----------------------------------------
Fetching 3 files: 100%|██████████| 3/3 [00:02<00:00,  1.47it/s]
2025-06-14T05:58:21Z INFO     [huggingface.py:40] Dataset has been downloaded
  • TorchTune Trainer
INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /workspace/model
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: /workspace/output/model
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.instruct_dataset
  data_dir: /workspace/dataset/data
  packed: false
  source: parquet
device: cuda
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 8
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /workspace/output/model/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  fused: true
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /workspace/output/model
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: /workspace/output/model/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /workspace/model/original/tokenizer.model

DEBUG:torchtune.utils._logging:Setting manual seed to local seed 3412313428. Local seed is seed + rank = 3412313428 + 0
Writing logs to /workspace/output/model/logs/log_1749908355.txt
INFO:torchtune.utils._logging:Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
INFO:torchtune.utils._logging:Instantiating model and loading checkpoint took 19.08 secs
INFO:torchtune.utils._logging:Memory stats after model init:
        GPU peak memory allocation: 2.33 GiB
        GPU peak memory reserved: 2.34 GiB
        GPU peak memory active: 2.33 GiB
INFO:torchtune.utils._logging:Optimizer is initialized.
INFO:torchtune.utils._logging:Loss is initialized.
Generating train split: 52002 examples [00:00, 190018.20 examples/s]
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|99|Loss: 1.545789361000061:   6%|▌         | 100/1625 [01:16<19:08,  1.33it/s]

/cc @kubeflow/wg-training-leads @astefanutti
/milestone v2.0

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Electronic-Waste <2690692950@qq.com>
@google-oss-prow bot requested review from astefanutti and a team on June 14, 2025 at 13:50
@google-oss-prow bot added this to the v2.0 milestone on Jun 14, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from electronic-waste. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coveralls commented Jun 14, 2025

Pull Request Test Coverage Report for Build 15682869133

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 29.19%

Totals Coverage Status
Change from base Build 15579727901: 0.0%
Covered Lines: 897
Relevant Lines: 3073

💛 - Coveralls

@Electronic-Waste (Member, Author) left a comment:

/assign @kubeflow/wg-training-leads @astefanutti

@@ -72,12 +72,14 @@ spec:
command:
- tune
- run
- --rdzv_endpoint=localhost:29500
Electronic-Waste (Member, Author):

I made this change because distributed training needs to run in distributed mode, as we discussed in #2587 (comment).
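
For readers less familiar with torchtune, here is a minimal, purely illustrative sketch of what a distributed invocation looks like when written as a runtime command list. The recipe and config names are standard torchtune ones and every flag value is a placeholder; this is not the actual manifest diff from this PR.

# Illustrative sketch only (not the PR's diff). `tune run` forwards torchrun-style
# flags such as --nnodes, --nproc_per_node and --rdzv_endpoint to the launcher and
# then runs the named recipe; all values below are placeholders.
command:
  - tune
  - run
  - --nnodes=1
  - --nproc_per_node=1
  - --rdzv_endpoint=localhost:29500
  - full_finetune_distributed
  - --config
  - llama3_2/1B_full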

Member:

Don't we automatically inject the rdzv endpoint in the torch plugin?

trainJob.Name, constants.Node, trainJob.Name, constants.ContainerTrainerPort,

@Electronic-Waste (Member, Author) commented on Jun 15, 2025:

Don't we automatically inject the rdzv endpoint in the torch plugin?

Currently, mutation is only enabled when the TrainJob is applied with the SDK.

If users apply a TrainJob YAML file, they tend to expect full control over the overridden items.

Do we also want to enable it when applying with a YAML file? @andreyvelich

Signed-off-by: Electronic-Waste <2690692950@qq.com>
@andreyvelich (Member) left a comment:

Thank you @Electronic-Waste!
/lgtm
/assign @tenzen-y @astefanutti

@@ -72,21 +72,21 @@ spec:
command:
- tune
- run
- --rdzv_endpoint=localhost:29500
Contributor:

Should the logic in the torch plugin that always adds --rdzv_endpoint for torchtune be amended to only add it if it's not already present in the runtime or in the overrides?

@Electronic-Waste (Member, Author) commented on Jun 16, 2025:

In my current implementation, there are two ways to apply a TrainJob:

  1. Using a YAML file: we do not mutate the command/args. Users specify all commands and args (falling back to the cmd/args set in the Runtime if they do not specify any).
  2. Using the SDK: we mutate the command/args for users.

I think that, when applying with the SDK, we always need to mutate --rdzv_endpoint, since users might enable multi-node training. In that case we will use the headless service, whose URL is generated from the TrainJob name, and we can't know the TrainJob name in the CTR before a TrainJob instance is created.
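
To make that concrete, here is a rough, hypothetical sketch of the kind of rendezvous flag the SDK mutation could end up injecting for the example TrainJob above. The host pattern (rank-0 trainer pod reached via the TrainJob's headless service) follows the explanation in this comment, but the exact naming format and the port number are assumptions for illustration, not values read from the plugin code.

# Hypothetical illustration: the injected flag targets the rank-0 trainer pod through
# the TrainJob's headless service. The pod/service naming pattern and the port number
# below are assumptions, not the plugin's actual constants.
args:
  - --rdzv_endpoint=torchtune-llama3-2-1b-node-0-0.torchtune-llama3-2-1b:29400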

@Electronic-Waste (Member, Author) commented on Jun 16, 2025:

But I think we should unify these two ways of applying a TrainJob in the future, because:

  1. Robustness: if users forget to add args like --rdzv_endpoint when overriding args in a TrainJob YAML file, we can generate a proper value for them, which provides some degree of fault tolerance.
  2. Modularized implementation: it will make the code easier to understand and maintain.

WDYT @andreyvelich @astefanutti

@@ -25,7 +25,7 @@ spec:
- name: STORAGE_URI
value: hf://tatsu-lab/alpaca
volumeMounts:
- mountPath: /workspace/dataset
- mountPath: /workspace
Electronic-Waste (Member, Author):

However, if we change the mount path to /workspace, it will conflict with the initializer image, whose working directory is /workspace:

FROM python:3.11-alpine
WORKDIR /workspace
# Copy the required Python modules.
COPY cmd/initializers/dataset/requirements.txt .
COPY pkg/initializers pkg/initializers
# Install the needed packages.
RUN pip install -r requirements.txt
ENTRYPOINT ["python", "-m", "pkg.initializers.dataset"]

The mount hides the image's /workspace/pkg dir, so the initializer reports an error:

/usr/local/bin/python: Error while finding module specification for 'pkg.initializers.dataset' (ModuleNotFoundError: No module named 'pkg')
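
For context, here is a minimal sketch of the kind of mount that triggers this error. The paths come from the Dockerfile above, while the container, volume, and claim names are placeholders rather than the actual manifest contents.

# Illustrative sketch: mounting the shared volume at /workspace shadows everything the
# image put there, including /workspace/pkg, so `python -m pkg.initializers.dataset`
# can no longer resolve the `pkg` module. Names below are placeholders.
containers:
  - name: dataset-initializer
    volumeMounts:
      - name: workspace
        mountPath: /workspace        # hides the image's /workspace/pkg
volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: torchtune-llama3-2-1b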

Electronic-Waste (Member, Author):

I'm thinking of changing the workdir to / and mounting the PVC at /workspace to avoid the conflict.

WDYT? @andreyvelich @astefanutti
