fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files by Electronic-Waste · Pull Request #2669 · kubeflow/trainer

fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files #2669

Open

Electronic-Waste wants to merge 4 commits into master

Conversation

Electronic-Waste (Member)

What this PR does / why we need it:

This PR fixes some errors that occurred in the LLM fine-tuning workflow, which is run by applying a TrainJob YAML file like the following:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: torchtune-llama3-2-1b
  namespace: kubeflow
spec:
  runtimeRef:
    name: torchtune-llama3.2-1b
  trainer:
    resourcesPerNode:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
    numProcPerNode: 1
  initializer:
    model:
      env:
        - name: ACCESS_TOKEN
          value: <MY_HF_TOKEN>
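
For orientation, the runtimeRef above points at a ClusterTrainingRuntime (the "CTR" in the PR title). As a purely illustrative sketch, a matching runtime object would start roughly like this; everything beyond the header is omitted because it is not shown on this page:

# Illustrative only: the runtime referenced by the TrainJob's runtimeRef.
# Spec fields are intentionally left out here.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torchtune-llama3.2-1b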

Results

  • Model Initializer
kubectl logs torchtune-llama3-2-1b-model-initializer-0-0-hrcws -n kubeflow
2025-06-14T05:58:37Z INFO     [__main__.py:16] Starting pre-trained model initialization
2025-06-14T05:58:37Z INFO     [huggingface.py:26] Downloading model: meta-llama/Llama-3.2-1B-Instruct
2025-06-14T05:58:37Z INFO     [huggingface.py:27] ----------------------------------------
Fetching 8 files: 100%|██████████| 8/8 [00:43<00:00,  5.41s/it]
2025-06-14T05:59:21Z INFO     [huggingface.py:43] Model has been downloaded
  • Dataset Initializer
kubectl logs torchtune-llama3-2-1b-dataset-initializer-0-0-97k42 -n kubeflow
2025-06-14T05:58:18Z INFO     [__main__.py:16] Starting dataset initialization
2025-06-14T05:58:18Z INFO     [huggingface.py:28] Downloading dataset: tatsu-lab/alpaca
2025-06-14T05:58:18Z INFO     [huggingface.py:29] ----------------------------------------
Fetching 3 files: 100%|██████████| 3/3 [00:02<00:00,  1.47it/s]
2025-06-14T05:58:21Z INFO     [huggingface.py:40] Dataset has been downloaded
  • TorchTune Trainer
INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 4
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /workspace/model
  checkpoint_files:
  - model.safetensors
  model_type: LLAMA3_2
  output_dir: /workspace/output/model
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
dataset:
  _component_: torchtune.datasets.instruct_dataset
  data_dir: /workspace/dataset/data
  packed: false
  source: parquet
device: cuda
dtype: bf16
enable_activation_checkpointing: false
enable_activation_offloading: false
epochs: 1
gradient_accumulation_steps: 8
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /workspace/output/model/logs
model:
  _component_: torchtune.models.llama3_2.llama3_2_1b
optimizer:
  _component_: torch.optim.AdamW
  fused: true
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /workspace/output/model
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: /workspace/output/model/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /workspace/model/original/tokenizer.model

DEBUG:torchtune.utils._logging:Setting manual seed to local seed 3412313428. Local seed is seed + rank = 3412313428 + 0
Writing logs to /workspace/output/model/logs/log_1749908355.txt
INFO:torchtune.utils._logging:Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
INFO:torchtune.utils._logging:Instantiating model and loading checkpoint took 19.08 secs
INFO:torchtune.utils._logging:Memory stats after model init:
        GPU peak memory allocation: 2.33 GiB
        GPU peak memory reserved: 2.34 GiB
        GPU peak memory active: 2.33 GiB
INFO:torchtune.utils._logging:Optimizer is initialized.
INFO:torchtune.utils._logging:Loss is initialized.
Generating train split: 52002 examples [00:00, 190018.20 examples/s]
INFO:torchtune.utils._logging:No learning rate scheduler configured. Using constant learning rate.
WARNING:torchtune.utils._logging: Profiling disabled.
INFO:torchtune.utils._logging: Profiler config after instantiation: {'enabled': False}
1|99|Loss: 1.545789361000061:   6%|▌         | 100/1625 [01:16<19:08,  1.33it/s]

/cc @kubeflow/wg-training-leads @astefanutti
/milestone v2.0

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Electronic-Waste <2690692950@qq.com>
@google-oss-prow bot requested review from astefanutti and a team on June 14, 2025 at 13:50
@google-oss-prow bot added this to the v2.0 milestone on Jun 14, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from electronic-waste. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coveralls commented Jun 14, 2025

Pull Request Test Coverage Report for Build 15682869133

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 29.19%

Totals Coverage Status
Change from base Build 15579727901: 0.0%
Covered Lines: 897
Relevant Lines: 3073

💛 - Coveralls

@Electronic-Waste (Member, Author) left a comment:

/assign @kubeflow/wg-training-leads @astefanutti

@@ -72,12 +72,14 @@ spec:
command:
- tune
- run
- --rdzv_endpoint=localhost:29500
Electronic-Waste (Member, Author):

I made this change because distributed training needs to run in distributed mode, as we discussed in #2587 (comment).
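
For readers less familiar with torchtune, here is a minimal, purely illustrative sketch of what a distributed invocation looks like when written as a runtime command list. The recipe and config names are standard torchtune ones and every flag value is a placeholder; this is not the actual manifest diff from this PR.

# Illustrative sketch only (not the PR's diff). `tune run` forwards torchrun-style
# flags such as --nnodes, --nproc_per_node and --rdzv_endpoint to the launcher and
# then runs the named recipe; all values below are placeholders.
command:
  - tune
  - run
  - --nnodes=1
  - --nproc_per_node=1
  - --rdzv_endpoint=localhost:29500
  - full_finetune_distributed
  - --config
  - llama3_2/1B_full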

Member:

Don't we automatically inject the rdzv endpoint in the torch plugin?

trainJob.Name, constants.Node, trainJob.Name, constants.ContainerTrainerPort,

@Electronic-Waste (Member, Author) commented on Jun 15, 2025:

Don't we automatically inject the rdzv endpoint in the torch plugin?

Currently, mutation is only enabled when the TrainJob is applied with the SDK.

If users apply a TrainJob YAML file, they tend to expect full control over the overridden items.

Do we also want to enable it when applying with a YAML file? @andreyvelich

Signed-off-by: Electronic-Waste <2690692950@qq.com>
@andreyvelich (Member) left a comment:

Thank you @Electronic-Waste!
/lgtm
/assign @tenzen-y @astefanutti

@@ -72,21 +72,21 @@ spec:
command:
- tune
- run
- --rdzv_endpoint=localhost:29500
Contributor:

Should the logic in the torch plugin that always adds --rdzv_endpoint for torchtune be amended to only add it if it's not already present in the runtime or in the overrides?

@Electronic-Waste (Member, Author) commented on Jun 16, 2025:

In my current implementation, there are two ways to apply a TrainJob:

  1. Using a YAML file: we do not mutate the command/args. Users specify all commands and args (falling back to the cmd/args set in the Runtime if they do not specify any).
  2. Using the SDK: we mutate the command/args for users.

I think that, when applying with the SDK, we always need to mutate --rdzv_endpoint, since users might enable multi-node training. In that case we will use the headless service, whose URL is generated from the TrainJob name, and we can't know the TrainJob name in the CTR before a TrainJob instance is created.
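
To make that concrete, here is a rough, hypothetical sketch of the kind of rendezvous flag the SDK mutation could end up injecting for the example TrainJob above. The host pattern (rank-0 trainer pod reached via the TrainJob's headless service) follows the explanation in this comment, but the exact naming format and the port number are assumptions for illustration, not values read from the plugin code.

# Hypothetical illustration: the injected flag targets the rank-0 trainer pod through
# the TrainJob's headless service. The pod/service naming pattern and the port number
# below are assumptions, not the plugin's actual constants.
args:
  - --rdzv_endpoint=torchtune-llama3-2-1b-node-0-0.torchtune-llama3-2-1b:29400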

@Electronic-Waste (Member, Author) commented on Jun 16, 2025:

But I think we should unify these two ways of applying a TrainJob in the future, because:

  1. Robustness: if users forget to add args like --rdzv_endpoint when overriding args in a TrainJob YAML file, we can generate a proper value for them, which provides some degree of fault tolerance.
  2. Modularized implementation: it will make the code easier to understand and maintain.

WDYT @andreyvelich @astefanutti

@@ -25,7 +25,7 @@ spec:
- name: STORAGE_URI
value: hf://tatsu-lab/alpaca
volumeMounts:
- mountPath: /workspace/dataset
- mountPath: /workspace
Electronic-Waste (Member, Author):

However, if we change the mount path to /workspace, it will conflict with the initializer image, whose working directory is /workspace:

FROM python:3.11-alpine
WORKDIR /workspace
# Copy the required Python modules.
COPY cmd/initializers/dataset/requirements.txt .
COPY pkg/initializers pkg/initializers
# Install the needed packages.
RUN pip install -r requirements.txt
ENTRYPOINT ["python", "-m", "pkg.initializers.dataset"]

The mount hides the image's /workspace/pkg dir, so the initializer reports an error:

/usr/local/bin/python: Error while finding module specification for 'pkg.initializers.dataset' (ModuleNotFoundError: No module named 'pkg')
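
For context, here is a minimal sketch of the kind of mount that triggers this error. The paths come from the Dockerfile above, while the container, volume, and claim names are placeholders rather than the actual manifest contents.

# Illustrative sketch: mounting the shared volume at /workspace shadows everything the
# image put there, including /workspace/pkg, so `python -m pkg.initializers.dataset`
# can no longer resolve the `pkg` module. Names below are placeholders.
containers:
  - name: dataset-initializer
    volumeMounts:
      - name: workspace
        mountPath: /workspace        # hides the image's /workspace/pkg
volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: torchtune-llama3-2-1b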

Electronic-Waste (Member, Author):

I'm thinking of changing the workdir to / and mounting the PVC at /workspace to avoid the conflict.

WDYT? @andreyvelich @astefanutti
