@Electronic-Waste

This is the Kubeflow Trainer v2.0.0-rc.0 pre-release.

Breaking Changes

KEP-2170: Change API Group Name to trainer.kubeflow.org (#2413 by @Electronic-Waste)
Move generated Python models into kubeflow_trainer_api package (#2632 by @kramaranya)
Upgrade kubernetes Go module version to 1.32 (#2450 by @tenzen-y)
Remove kubeflow-trainer prefix from jobset resource names (#2596 by @ChenYi015)
Remove the Training Operator V1 Source Code (#2389 by @andreyvelich)
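Because the API group is now trainer.kubeflow.org, manifests written against an earlier group need their apiVersion updated. The following is a minimal sketch of a check for stale manifests. The helper name `uses_outdated_group` and the sample apiVersion strings are hypothetical and only illustrate the idea; confirm the exact group and version against the CRDs installed in your cluster.

```python
# Group name taken from the release note for #2413.
NEW_API_GROUP = "trainer.kubeflow.org"
# Kinds mentioned elsewhere in these release notes.
TRAINER_KINDS = {"TrainJob", "TrainingRuntime", "ClusterTrainingRuntime"}


def uses_outdated_group(manifest: dict) -> bool:
    """Return True if a Trainer object is declared under a group other than the new one.

    Hypothetical helper: it only inspects `kind` and `apiVersion` and makes
    no claim about which API version the cluster actually serves.
    """
    if manifest.get("kind") not in TRAINER_KINDS:
        return False  # not a Trainer resource, nothing to check
    group = manifest.get("apiVersion", "").rpartition("/")[0]
    return group != NEW_API_GROUP


# Sample manifests; the version segments are placeholders, not confirmed values.
old_style = {"apiVersion": "kubeflow.org/v2alpha1", "kind": "TrainJob"}
new_style = {"apiVersion": "trainer.kubeflow.org/v1alpha1", "kind": "TrainJob"}

for doc in (old_style, new_style):
    print(doc["apiVersion"], "needs update:", uses_outdated_group(doc))
```

A check like this could be run over a directory of YAML files (after parsing them) before upgrading, so that stale objects are found early rather than rejected at apply time.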

New Features

LLM Trainer V2

KEP-2401: Support loading local LLMs (#2644 by @Electronic-Waste)
KEP-2401: Support mutating dataset preprocessing config in SDK (#2638 by @Electronic-Waste)
KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family (#2590 by @Electronic-Waste)
KEP-2401: Complement torch plugin to support torchtune config mutation (#2587 by @Electronic-Waste)
KEP-2401: Create torchtune trainer image (#2516 by @Electronic-Waste)
KEP-2401: Refactor current train() API (#2513 by @Electronic-Waste)
KEP-2401: Kubeflow LLM Trainer V2 (#2410 by @Electronic-Waste)

Runtime Framework

feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
KEP-2170: Adding CEL validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)
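The resource-in-use finalizer entries above rely on the general Kubernetes rule that an object with a non-empty finalizer list is not removed until every finalizer has been cleared. The sketch below models that rule in plain Python. It assumes the finalizer is kept while at least one TrainJob references the runtime; that behavior is inferred from the finalizer's name and is not taken from the controller code, and both function names are hypothetical.

```python
from typing import List

# Finalizer name as written in the release notes.
FINALIZER = "trainer.kubeflow.org/resource-in-use"


def reconcile_finalizers(finalizers: List[str], referencing_train_jobs: int) -> List[str]:
    """Return the finalizer list a runtime should carry.

    Assumption: the finalizer stays while at least one TrainJob references
    the runtime and is dropped once nothing references it.
    """
    without = [f for f in finalizers if f != FINALIZER]
    if referencing_train_jobs > 0:
        return without + [FINALIZER]
    return without


def can_be_removed(finalizers: List[str], deletion_requested: bool) -> bool:
    """An object is removed only after deletion is requested and no finalizers remain."""
    return deletion_requested and not finalizers


in_use = reconcile_finalizers([], referencing_train_jobs=2)
idle = reconcile_finalizers(in_use, referencing_train_jobs=0)
print(can_be_removed(in_use, deletion_requested=True))
print(can_be_removed(idle, deletion_requested=True))
```

Under these assumptions, deleting a runtime that is still referenced leaves it in a terminating state until the last referencing TrainJob is gone.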

MPI Plugin

[feature]:add validations for MPIRuntime with RunLauncherAsNode (#2551 by @Harshal292004)
Implement CustomValidation UT for MPI plugin (#2555 by @tenzen-y)
Implement MPI Plugin for OpenMPI (#2493 by @tenzen-y)
Implement MPI plugin UTs (#2481 by @tenzen-y)
Implement MPIImplementation Enum CRD validation (#2482 by @tenzen-y)
Implement MPI numProcPerNode defaulter (#2483 by @tenzen-y)
Add MPIMLPolicySource CRD defaulters (#2474 by @tenzen-y)
Make MPIMLPolicySource optional fields as a pointer (#2472 by @tenzen-y)
KEP-2170: Implement MPI Plugin for Kubeflow Trainer (#2394 by @andreyvelich)

JobSet

Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557 by @tenzen-y)
KEP-2170: Deploy JobSet in kubeflow-system namespace (#2388 by @andreyvelich)
Bump JobSet to v0.8.0 (#2463 by @andreyvelich)
Upgrade jobset SDK version to v0.7.3 (#2445 by @Electronic-Waste)

New Examples

Add question-answer example for v2 trainer (#2580 by @solanyn)
KEP-2170: Add PyTorch DDP MNIST training example (#2387 by @astefanutti)

SDK Updates

Remove SDK (#2657 by @eoinfennessy)
feat(sdk): Get namespace from the provided context (#2593 by @andreyvelich)
feat(sdk): Support MPI-based TrainJobs (#2545 by @andreyvelich)
feat(sdk): Migrate to OpenAPI V3 (#2490 by @andreyvelich)
feat(sdk): Generate external Kubernetes and JobSet models (#2466 by @andreyvelich)

Bug Fixes

Revert "fix(sdk): Fix type annotation for train method's trainer parameter" (#2651 by @Electronic-Waste)
fix(sdk): Fix bad arg passed to get_args_using_torchtune_config (#2647 by @eoinfennessy)
fix(sdk): Fix type annotation for train method's trainer parameter (#2646 by @eoinfennessy)
fix(controller): Fix RBAC permissions for TrainJob controller (#2626 by @andreyvelich)
Fix close-pr message in Stale GitHub Action (#2622 by @kramaranya)
fix: remove redundant K8s version matrix from integration tests (#2617 by @tr33k)
fix(doc): tidy up KEP-2401. (#2594 by @Electronic-Waste)
Fix MPI Test runnable errors (#2570 by @tenzen-y)
Fix issue with fetching clustertrainingruntime for validations (#2564 by @akshaychitneni)
fix(sdk): Add missing import types. (#2566 by @Electronic-Waste)
fix(sdk): Using correct entrypoint for mpirun (#2552 by @andreyvelich)
fix(sdk): add missing import type Initializer. (#2541 by @Electronic-Waste)
fix(ci): update test-go coverage ci config and replace trainer badge with new address. (#2534 by @IRONICBo)
fix(doc): Update train() API in KEP-2401 (#2536 by @Electronic-Waste)
fix(test): Update images for DockerHub publish (#2535 by @andreyvelich)
[hotfix] fix checkout on workflow (#2531 by @mahdikhashan)
[hotfix] fix docker cred (#2530 by @mahdikhashan)
fix: remove unused parameter name in default case of shouldUseCPU function (#2521 by @Diasker)
Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob (#2492 by @Diasker)
fix type in model initializer entrypoint (#2489 by @szaher)
fix(runtime): fix error label name. (#2487 by @Electronic-Waste)
fix(sdk): resolve errors in deserialization (#2457 by @Electronic-Waste)
Fix missing external types in apply configurations (#2429 by @astefanutti)
Fix API Group for Torch Runtime (#2424 by @andreyvelich)
Fix Kustomize patchesStrategicMerge deprecation warning (#2405 by @astefanutti)
ControlPlane: Fix flaky integration testings due to missing the latest version of object (#2414 by @tenzen-y)

Misc

Tag Docker images with GitHub release tags (#2662 by @kramaranya)
feat(controller): Implement PodSpecOverride API (#2614 by @andreyvelich)
Nominate @Electronic-Waste as approver and @astefanutti as reviewer (#2659 by @andreyvelich)
chore(build): Support Podman to run OpenAPI generator (#2656 by @astefanutti)
chore(docs): Add OpenSSF Best Practices Badge (#2611 by @andreyvelich)
[chore] update stale action version to latest (#2642 by @mahdikhashan)
Remove TrainJobCreated condition (#2621 by @astefanutti)
ci: refactor build-push-images workflow (#2607 by @milinddethe15)
Update Go to v1.24 (#2615) (#2620 by @vzamboulingame)
test(runtime): add UT for IndexTrainJobTrainingRuntime (#2603 by @Harshal292004)
ci: add k8s v1.32 for tests env ([#2613](#26...

@abhijeet-dhumal

This is the Training Operator v1.9.2 release.

New Features

Add provision to provide labels and annotations for the pytorchjob an… (#2612 by @abhijeet-dhumal)

Bug Fixes

Fix llm hp optimization error (#2576 by @helenxie-bit)
[bug] pull image from ghcr (#2584 by @mahdikhashan)

@saileshd1402

This is the Training Operator v1.9.1 release.

Breaking Changes

Update Manifest Images to GHCR (#2544 by @saileshd1402)
Push images to GHCR for release-1.9 (#2491 by @saileshd1402)

New Features

Add volume and volume mounts arguments to TrainingClient.create_job API (#2449 by @astefanutti)
Add configurable QPS and burst settings for kube API client (#2411 by @ronk21runai)

Bug Fixes

fix(ci): Change publish dir from training to trainer (#2546 by @Electronic-Waste)
fix: fix typos in script comments. (#2465 by @IRONICBo)
fix: adds jaxjobs to the kubeflow-training-roles.yaml ClusterRole (#2417 by @DnPlas)
[release-1.9] Rename paddlepaddle_defaults.go file name (#2400 by @ChristianZaccaria)

@astefanutti

This is the Training Operator v1.9.0 release.

This release introduces a new JAXJob, enabling seamless distributed training with JAX.

Additionally, it adds the managedBy API to streamline the orchestration of training Jobs in multi-cluster environment using MultiKueue.

Breaking Changes

Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
Update the name of PVC in train API (#2187 by @helenxie-bit)
Remove support for MXJob (#2150 by @tariq-hasan)
Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)

New Features

Distributed JAX

Add JAX controller (#2194 by @sandipanpanda)
Add JAX API (#2163 by @sandipanpanda)
JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)
JAX example for MNIST SPMD and add CI testing (#2390 by @saileshd1402)

New Examples

FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)

Control Plane Updates

Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
[Feature] Support managed by external controller (#2203 by @mszadkow)
Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
ARM64 supported in PyTorch examples (#2116 by @danielsuh05)

SDK Updates

[SDK] Adding env vars (#2285 by @tarekabouzeid)
[SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
[SDK] move env var to constants.py (#2268 by @varshaprasad96)
[SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
[SDK] Read namespace from the current context (#2255 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
[SDK] Explain Python version support cycle (#2144 by @andreyvelich)

Kubeflow Trainer V2

KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
Always update TrainJob status on errors (#2352 by @astefanutti)
Fix TrainJob status comparison and update (#2353 by @astefanutti)
Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304 by @tenzen-y)
KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
[v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)

Bug Fixes

[release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @andreyvelich)
Pin accelerate package version in trainer (#2340 by @gavrissh)
[fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
[SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
[SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
[Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
fix volcano podgroup update issue (#2079 by @ckyuto)
[SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)

Misc

[release-1.9] Add release branch to the image push trigger (#2377 by @andreyvelich)
Add e2e test for train API (#2199 by @helenxie-bit)
buildx link was broken ([#2356](https://github.com/kubeflow/training-operator/pul...

@astefanutti

This is the Training Operator v1.9.0-rc.0 pre-release.

Breaking Changes

Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
Update the name of PVC in train API (#2187 by @helenxie-bit)
Remove support for MXJob (#2150 by @tariq-hasan)
Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)

New Features

Distributed JAX

Add JAX controller (#2194 by @sandipanpanda)
Add JAX API (#2163 by @sandipanpanda)
JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)

New Examples

FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)

Control Plane Updates

Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
[Feature] Support managed by external controller (#2203 by @mszadkow)
Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
ARM64 supported in PyTorch examples (#2116 by @danielsuh05)

SDK Updates

[SDK] Adding env vars (#2285 by @tarekabouzeid)
[SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
[SDK] move env var to constants.py (#2268 by @varshaprasad96)
[SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
[SDK] Read namespace from the current context (#2255 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
[SDK] Explain Python version support cycle (#2144 by @andreyvelich)

Kubeflow Training V2

KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
Always update TrainJob status on errors (#2352 by @astefanutti)
Fix TrainJob status comparison and update (#2353 by @astefanutti)
Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304 by @tenzen-y)
KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
[v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)

Bug Fixes

[release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @andreyvelich)
Pin accelerate package version in trainer (#2340 by @gavrissh)
[fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
[SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
[SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
[Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
fix volcano podgroup update issue (#2079 by @ckyuto)
[SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)

Misc

[release-1.9] Add release branch to the image push trigger (#2377 by @andreyvelich)
Add e2e test for train API (#2199 by @helenxie-bit)
buildx link was broken (#2356 by @Veer0x1)
Upgrade helm/kind-action to v1.11.0 (#2357 by @astefanutti)
Upgrade Go version to v1.23 (#2302 by @tenzen-y)
Ensure code generation dependencies are downloaded (#2339 by @astefanutti)
Added test for create-pytorchjob.ipynb python notebook ([#2274](https://github.com/kubeflow/training-operator...

@mszadkow

This is the Training Operator v1.8.1 release.

Bug Fixes

[Bug] Finish CleanupJob early if the job is suspended (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)

New Contributors

@mszadkow made their first contribution in #2243
@helenxie-bit made their first contribution in #2180

@deepanker13

This is the Training Operator v1.8.0 release.

This release introduces a new Python API for LLMs Fine-Tuning that simplifies the ability to fine-tune foundational models using distributed PyTorch nodes.

Install the Kubeflow Training SDK as follows to try it:

pip install -U "kubeflow-training[huggingface]"
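After running the command above, one way to confirm that the SDK is installed, and which version was resolved, is to query the package metadata from Python. This is a small sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name is the one used in the pip command.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "kubeflow-training") -> Optional[str]:
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the SDK is installed, otherwise None.
print(installed_version())
```

Returning None instead of raising keeps the check usable in setup scripts that want to decide whether to install or upgrade.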

LLMs Fine-Tuning API

Train/Fine-tune API Proposal for LLMs (#1945 by @deepanker13)
[SDK] Train API for LLM Fine-Tuning (#1962 by @deepanker13)
Modify LLM Trainer to support BERT and Tiny LLaMA (#2031 by @andreyvelich)
Support arm64 for Hugging Face trainer (#2028 by @tariq-hasan)
Add Fine-Tune BERT LLM Example (#2021 by @andreyvelich)
Train api dataset download changes (#1959 by @deepanker13)
Train api init container creation (#1958 by @deepanker13)
[SDK] Add docstring for Train API (#2075 by @andreyvelich)

Breaking Changes

[SDK] Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
Support K8s v1.29 and Drop K8s v1.26 (#2039 by @tenzen-y)
Support K8s v1.28 and Drop K8s v1.25 (#2038 by @tenzen-y)
Deprecation Notice for MXJob (#2058 by @tenzen-y)
⚠️ Breaking Changes: Rename monitoring-port flag to webhook-server-port (#1925 by @afritzler)

New Features

Control Plane Updates

Upgrade scheduler-plugins to v0.28.9 (#2065 by @tenzen-y)
Implement webhook validations for the PaddleJob (#2057 by @tenzen-y)
Implement webhook validations for the XGBoostJob (#2052 by @tenzen-y)
Implement webhook validation for the TFJob (#2051 by @tenzen-y)
Implement webhook validations for the PyTorchJob (#2035 by @tenzen-y)
Upgrade PyTorchJob examples to PyTorch v2 (#2024 by @champon1020)
Upgrade Go version to v1.22 (#2046 by @tenzen-y)

SDK Improvements

[SDK] Add resources per worker for Create Job API (#1990 by @andreyvelich)
[SDK] Fix Worker and Master templates for PyTorchJob (#1988 by @andreyvelich)
[SDK] Get Kubernetes Events for Job (#1975 by @andreyvelich)
SDK: Upgrade the minimum required Kubernetes version to v1.27.2 (#2066 by @tenzen-y)
[SDK] Add information about TrainingClient logging (#1973 by @andreyvelich)
Training operator SDK unit test (#1938 by @deepanker13)
[SDK] Consolidate Naming for CRUD APIs (#1907 by @andreyvelich)

Bug Fixes

[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2147 by @andreyvelich)
[SDK] Changed package name to flake8 to fix pip install (#2140 by @tenzen-y)
[SDK] Fix Incorrect Events in get_job_logs API (#2138 by @tenzen-y)
Fix volcano podgroup update issue (#2079 by @ckyuto)
Fix import for HuggingFace Dataset Provider (#2085 by @andreyvelich)
Updated examples for train API (#2077 by @shruti2522)
Fail job for non-retryable exit codes (#2071 by @kellyaa)
E2E: Replace outdated images with latest ones (#2083 by @tenzen-y)
fix wrong filepath in the simple example command (#2062 by @qzoscar)
fix(example): add installation of python-etcd in Pytorch example (#2064 by @champon1020)
fix: Upgrade controller-gen to v0.14.0 (#2026 by @champon1020)
Fix build workflow config for pytorch-torchrun-example (#2020 by @PeterWrighten)
Fix Distributed Data Samplers in PyTorch Examples (#2012 by @andreyvelich)
Fix URL in python SDK setup.py (#2011 by @garymm)
Fix for Github CI to publish HF trainer image (#1987 by @johnugeorge)
train api jupyternotebook fix (#1984 by @deepanker13)
fix: volcano podgroup should has a non-empty queue name (#1977 by @lowang-bh)
Fix Master Label for PyTorchJob (#1974 by @andreyvelich)
IsMasterRole fix in pytorchjob controller (#1969 by @deepanker13)
[fix] replace ${go env GOPATH} with $(go env GOPATH) (#1952 by @double12gzh)
Fixing issues with providing existing service account (#1918 by @rpemsel)

Misc

Refine the integration tests for the immutable PyTorchJob (#2130 by @tenzen-y)
Update training operator image to latest (#2089 by @johnugeorge)
Update sdk to v1.8.0rc0 (#2087 by @johnugeorge)
Test: Simplify and Identify pod-controller envtest (#2084 by @tenzen-y)
Remove deadcode related to PodDisruptionBudget (#2073 by @tenzen-y)
docs: updating docs for local development (#2074 by @franciscojavierarceo)
PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode (#2067 by @tenzen-y)
Updated developer docs to include Kind (#2061 by @franciscojavierarceo)
adding fine tune example with s3 as the dataset store (#2006 by @deepanker13)
CI: Use a mode=min in the builder cache (#2053 by @tenzen-y)
Fix: upgrade version of crd-ref-docs, which caused panic with go v1.22 (#2043 by @jdcfd)
Remove Dockerfile.ppc64le of pytorch example (#2042 by @champon1020)
publish torchrun example via Dockerfile (#2018 by @PeterWrighten)
Updated examples/pytorch to disable istio sidecar injection (#2004 by @jdcfd)
[docs] development guide update (#1995 by @shashank-iitbhu)
Add Kubeflow Website links to README (#1983 by @andreyvelich)
publish trainer hugging face image (#1985 by @deepanker13)
Adding Training image needed for train api (#1963 by @deepanker13)
Add test to create PyTorchJob from func (#1979 by @andreyvelich)
Corrected Some Spelling And Grammatical Errors (#1980 by @daniel-hutao)
torchrun example with cpu version pytorch (#1965 by @kuizhiqing)
utils changes needed to add train api (#1954 by @deepanker13)
Adding parallel support for coveralls (#1956 by @johnugeorge)
chore: pkg import only once (#1950 by @testwill)
fix nproc env in elas...

New features

Train/Fine-tune API Proposal for LLMs #1945 (deepanker13)
Adding Training image needed for train api #1963 (deepanker13)
[SDK] Train API #1962 (deepanker13)
Train api dataset download changes #1959 (deepanker13)
Train api init container creation #1958 (deepanker13)
Publish trainer hugging face image #1985 (deepanker13)
Support arm64 for Hugging Face trainer #2028 (tariq-hasan)
Modify LLM Trainer to support BERT and Tiny LLaMA #2031 (andreyvelich)
Implement webhook validations for the PyTorchJob #2035 (tenzen-y)
Implement webhook validations for the XGBoostJob #2052 (tenzen-y)
Implement webhook validation for the TFJob #2051 (tenzen-y)
Implement webhook warnings for the MXJob #2058 (tenzen-y)
Implement webhook validations for the PaddleJob #2057 (tenzen-y)
Fail job for non-retryable exit codes #2071 (kellyaa)
Adding fine tune example with s3 as the dataset store #2006 (deepanker13)

Bug fixes

fix nproc env in elastic mode for pytorchjob #1948 (kuizhiqing)
IsMasterRole fix in pytorchjob controller #1969 (deepanker13)
fix: volcano podgroup should has a non-empty queue name #1977 (lowang-bh)
Fix Master Label for PyTorchJob #1974 (andreyvelich)
[SDK] Fix Worker and Master templates for PyTorchJob #1988 (andreyvelich)
Fix import for HuggingFace Dataset Provider #2085 (andreyvelich)
Upgrade controller-gen to v0.14.0 #2026 (champon1020)
Fix Distributed Data Samplers in PyTorch Examples #2012 (andreyvelich)
Fix URL in python SDK setup.py #2011 (garymm)

Misc

Adding parallel support for coveralls #1956 (johnugeorge)
torchrun example with cpu version pytorch #1965 (kuizhiqing)
[SDK] Get Kubernetes Events for Job #1975 (andreyvelich)
Fix Master Label for PyTorchJob #1974 (andreyvelich)
[SDK] Add information about TrainingClient logging #1973 (andreyvelich)
PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode #2067 (tenzen-y)
SDK: Upgrade the minimum required Kubernetes version to v1.27.2 #2066 (tenzen-y)
Test: Simplify and Identify pod-controller envtest #2084 (tenzen-y)
E2E: Replace outdated images with latest ones #2083 (tenzen-y)
Upgrade scheduler-plugins to v0.28.9 #2065 (tenzen-y)

Breaking Changes

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Upgrade the kubernetes dependencies to v1.27 #1834 (tenzen-y)

New features

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Merge kubeflow/common to training-operator #1813 (johnugeorge)
Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
Implement suspend semantics #1859 (tenzen-y)
Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)

Bug fixes

Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)

Misc

Removing reconciler code #1879 (johnugeorge)
Make Condition and ReplicaStatus optional #1862 (tenzen-y)
Use the same reasons for Condition and Event #1854 (tenzen-y)
Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
Clean up /pkg/common/util/v1 #1845 (tenzen-y)
Refactoring tests in common/controller.v1 #1843 (tenzen-y)
remove duplicate code of add task spec annotation #1839 (lowang-bh)
fetch volcano log when e2e failed #1837 (lowang-bh)
Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
Replace dummy client with fake client #1818 (tenzen-y)
Add default Intel MPI env variables to MPIJob #1804 (tkatila)
Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
make timeout configurable from e2e tests #1787 (nagar-ajay)

Releases: kubeflow/trainer

v2.0.0-rc.0
v1.9.2
v1.9.1 release
v1.9.0 release
v1.9.0-rc.0 release
v1.8.1 release
v1.8.0 release
v1.8.0-rc.0 release
v1.7.0 release
v1.7.0-rc.0 release