Releases: kubeflow/trainer
v2.0.0-rc.0
This is the Kubeflow Trainer v2.0.0-rc.0 pre-release.
Breaking Changes
- KEP-2170: Change API Group Name to trainer.kubeflow.org (#2413 by @Electronic-Waste)
- Move generated Python models into kubeflow_trainer_api package (#2632 by @kramaranya)
- Upgrade kubernetes Go module version to 1.32 (#2450 by @tenzen-y)
- Remove kubeflow-trainer prefix from jobset resource names (#2596 by @ChenYi015)
- Remove the Training Operator V1 Source Code (#2389 by @andreyvelich)
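For manifests written against an earlier V2 alpha build, the API group rename means the apiVersion field has to be updated. The following is a minimal sketch, not the project's migration tooling. It assumes the earlier group/version was kubeflow.org/v2alpha1 and the new one is trainer.kubeflow.org/v1alpha1 (these notes name only the new group, so check the CRDs installed in your cluster), and that TrainJob references its runtime through spec.runtimeRef.apiGroup. The helper name migrate_api_version is hypothetical.

```python
import copy

# Assumed values: the release notes state the new group name only.
OLD_API_VERSION = "kubeflow.org/v2alpha1"
NEW_API_VERSION = "trainer.kubeflow.org/v1alpha1"

# Kinds that belong to the V2 API and are affected by the rename.
V2_KINDS = {"TrainJob", "TrainingRuntime", "ClusterTrainingRuntime"}


def migrate_api_version(manifest, old=OLD_API_VERSION, new=NEW_API_VERSION):
    """Return a copy of a manifest dict with the V2 API group rewritten.

    Manifests of other kinds, or with a different apiVersion, are
    returned unchanged (as a copy).
    """
    migrated = copy.deepcopy(manifest)
    if migrated.get("kind") not in V2_KINDS or migrated.get("apiVersion") != old:
        return migrated

    migrated["apiVersion"] = new

    # Assumed field: spec.runtimeRef.apiGroup holds the group without a version.
    old_group, new_group = old.split("/")[0], new.split("/")[0]
    runtime_ref = migrated.get("spec", {}).get("runtimeRef")
    if isinstance(runtime_ref, dict) and runtime_ref.get("apiGroup") == old_group:
        runtime_ref["apiGroup"] = new_group
    return migrated
```

The function works on already-parsed manifests (for example, the result of loading a YAML file into a dict), so it leaves V1 resources such as PyTorchJob untouched.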
New Features
LLM Trainer V2
- KEP-2401: Support loading local LLMs (#2644 by @Electronic-Waste)
- KEP-2401: Support mutating dataset preprocessing config in SDK (#2638 by @Electronic-Waste)
- KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family (#2590 by @Electronic-Waste)
- KEP-2401: Complement torch plugin to support torchtune config mutation (#2587 by @Electronic-Waste)
- KEP-2401: Create torchtune trainer image (#2516 by @Electronic-Waste)
- KEP-2401: Refactor current train() API (#2513 by @Electronic-Waste)
- KEP-2401: Kubeflow LLM Trainer V2 (#2410 by @Electronic-Waste)
Runtime Framework
- feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
- feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
- feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
- Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
- Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
- KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
- Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
- Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)
MPI Plugin
- [feature]:add validations for MPIRuntime with RunLauncherAsNode (#2551 by @Harshal292004)
- Implement CustomValidation UT for MPI plugin (#2555 by @tenzen-y)
- Implement MPI Plugin for OpenMPI (#2493 by @tenzen-y)
- Implement MPI plugin UTs (#2481 by @tenzen-y)
- Implement MPIImplementation Enum CRD validation (#2482 by @tenzen-y)
- Implement MPI numProcPerNode defaulter (#2483 by @tenzen-y)
- Add MPIMLPolicySource CRD defaulters (#2474 by @tenzen-y)
- Make MPIMLPolicySource optional fields as a pointer (#2472 by @tenzen-y)
- KEP-2170: Implement MPI Plugin for Kubeflow Trainer (#2394 by @andreyvelich)
JobSet
- Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557 by @tenzen-y)
- KEP-2170: Deploy JobSet in kubeflow-system namespace (#2388 by @andreyvelich)
- Bump JobSet to v0.8.0 (#2463 by @andreyvelich)
- Upgrade jobset SDK version to v0.7.3 (#2445 by @Electronic-Waste)
New Examples
- Add question-answer example for v2 trainer (#2580 by @solanyn)
- KEP-2170: Add PyTorch DDP MNIST training example (#2387 by @astefanutti)
SDK Updates
- Remove SDK (#2657 by @eoinfennessy)
- feat(sdk): Get namespace from the provided context (#2593 by @andreyvelich)
- feat(sdk): Support MPI-based TrainJobs (#2545 by @andreyvelich)
- feat(sdk): Migrate to OpenAPI V3 (#2490 by @andreyvelich)
- feat(sdk): Generate external Kubernetes and JobSet models (#2466 by @andreyvelich)
Bug Fixes
- Revert "fix(sdk): Fix type annotation for train method's trainer parameter" (#2651 by @Electronic-Waste)
- fix(sdk): Fix bad arg passed to get_args_using_torchtune_config (#2647 by @eoinfennessy)
- fix(sdk): Fix type annotation for train method's trainer parameter (#2646 by @eoinfennessy)
- fix(controller): Fix RBAC permissions for TrainJob controller (#2626 by @andreyvelich)
- Fix close-pr message in Stale GitHub Action (#2622 by @kramaranya)
- fix: remove redundant K8s version matrix from integration tests (#2617 by @tr33k)
- fix(doc): tidy up KEP-2401. (#2594 by @Electronic-Waste)
- Fix MPI Test runnable errors (#2570 by @tenzen-y)
- Fix issue with fetching clustertrainingruntime for validations (#2564 by @akshaychitneni)
- fix(sdk): Add missing import types. (#2566 by @Electronic-Waste)
- fix(sdk): Using correct entrypoint for mpirun (#2552 by @andreyvelich)
- fix(sdk): add missing import type Initializer. (#2541 by @Electronic-Waste)
- fix(ci): update test-go coverage ci config and replace trainer badge with new address. (#2534 by @IRONICBo)
- fix(doc): Update train() API in KEP-2401 (#2536 by @Electronic-Waste)
- fix(test): Update images for DockerHub publish (#2535 by @andreyvelich)
- [hotfix] fix checkout on workflow (#2531 by @mahdikhashan)
- [hotfix] fix docker cred (#2530 by @mahdikhashan)
- fix: remove unused parameter name in default case of shouldUseCPU function (#2521 by @Diasker)
- Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob (#2492 by @Diasker)
- fix type in model initializer entrypoint (#2489 by @szaher)
- fix(runtime): fix error label name. (#2487 by @Electronic-Waste)
- fix(sdk): resolve errors in deserialization (#2457 by @Electronic-Waste)
- Fix missing external types in apply configurations (#2429 by @astefanutti)
- Fix API Group for Torch Runtime (#2424 by @andreyvelich)
- Fix Kustomize patchesStrategicMerge deprecation warning (#2405 by @astefanutti)
- ControlPlane: Fix flaky integration testings due to missing the latest version of object (#2414 by @tenzen-y)
Misc
- Tag Docker images with GitHub release tags (#2662 by @kramaranya)
- feat(controller): Implement PodSpecOverride API (#2614 by @andreyvelich)
- Nominate @Electronic-Waste as approver and @astefanutti as reviewer (#2659 by @andreyvelich)
- chore(build): Support Podman to run OpenAPI generator (#2656 by @astefanutti)
- chore(docs): Add OpenSSF Best Practices Badge (#2611 by @andreyvelich)
- [chore] update stale action version to latest (#2642 by @mahdikhashan)
- Remove TrainJobCreated condition (#2621 by @astefanutti)
- ci: refactor build-push-images workflow (#2607 by @milinddethe15)
- Update Go to v1.24 (#2615) (#2620 by @vzamboulingame)
- test(runtime): add UT for IndexTrainJobTrainingRuntime (#2603 by @Harshal292004)
- ci: add k8s v1.32 for tests env ([#2613](#26...
v1.9.2
This is the Training Operator v1.9.2 release.
New Features
- Add provision to provide labels and annotations for the pytorchjob an… (#2612 by @abhijeet-dhumal)
Bug Fixes
- Fix llm hp optimization error (#2576 by @helenxie-bit)
- [bug] pull image from ghcr (#2584 by @mahdikhashan)
v1.9.1 release
This is the Training Operator v1.9.1 release.
Breaking Changes
- Update Manifest Images to GHCR (#2544 by @saileshd1402)
- Push images to GHCR for release-1.9 (#2491 by @saileshd1402)
New Features
- Add volume and volume mounts arguments to TrainingClient.create_job API (#2449 by @astefanutti)
- Add configurable QPS and burst settings for kube API client (#2411 by @ronk21runai)
Bug Fixes
- fix(ci): Change publish dir from training to trainer (#2546 by @Electronic-Waste)
- fix: fix typos in script comments. (#2465 by @IRONICBo)
- fix: adds jaxjobs to the kubeflow-training-roles.yaml ClusterRole (#2417 by @DnPlas)
- [release-1.9] Rename paddlepaddle_defaults.go file name (#2400 by @ChristianZaccaria)
v1.9.0 release
This is the Training Operator v1.9.0 release.
This release introduces a new JAXJob, enabling seamless distributed training with JAX.
Additionally, it adds the managedBy API to streamline the orchestration of training Jobs in multi-cluster environments using MultiKueue.
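As a rough illustration of where the managedBy field sits in a job, the sketch below builds a PyTorchJob manifest as a plain Python dict. It assumes the field lives at spec.runPolicy.managedBy and that the MultiKueue controller is identified as kueue.x-k8s.io/multikueue. Neither detail is stated in these notes, so verify both against the CRD and the Kueue version you run. The helper names and the image are hypothetical.

```python
# Assumed controller name for MultiKueue; verify against your Kueue version.
MULTIKUEUE_CONTROLLER = "kueue.x-k8s.io/multikueue"


def replica_spec(image, replicas):
    """Build one replica spec with a single container named 'pytorch'."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {"containers": [{"name": "pytorch", "image": image}]}
        },
    }


def pytorchjob_manifest(name, image, workers, managed_by=None):
    """Return a PyTorchJob manifest as a plain dict.

    When managed_by is given, it is written to spec.runPolicy.managedBy
    (assumed location of the field).
    """
    spec = {
        "pytorchReplicaSpecs": {
            "Master": replica_spec(image, 1),
            "Worker": replica_spec(image, workers),
        }
    }
    if managed_by is not None:
        spec["runPolicy"] = {"managedBy": managed_by}
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": spec,
    }


# Hypothetical job name and image, for illustration only.
job = pytorchjob_manifest(
    "mnist-ddp", "example.com/mnist:latest", workers=2,
    managed_by=MULTIKUEUE_CONTROLLER,
)
```

Leaving managed_by unset omits runPolicy entirely, so the job is reconciled by the local controller as before.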
Breaking Changes
- Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
- Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
- Update the name of PVC in train API (#2187 by @helenxie-bit)
- Remove support for MXJob (#2150 by @tariq-hasan)
- Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
New Features
Distributed JAX
- Add JAX controller (#2194 by @sandipanpanda)
- Add JAX API (#2163 by @sandipanpanda)
- JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)
- JAX example for MNIST SPMD and add CI testing (#2390 by @saileshd1402)
New Examples
- FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
- Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)
Control Plane Updates
- Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
- [Feature] Support managed by external controller (#2203 by @mszadkow)
- Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
- Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
- Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
- ARM64 supported in PyTorch examples (#2116 by @danielsuh05)
SDK Updates
- [SDK] Adding env vars (#2285 by @tarekabouzeid)
- [SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
- [SDK] move env var to constants.py (#2268 by @varshaprasad96)
- [SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
- [SDK] Read namespace from the current context (#2255 by @andreyvelich)
- [SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
- [SDK] Explain Python version support cycle (#2144 by @andreyvelich)
Kubeflow Trainer V2
- KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
- KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
- Always update TrainJob status on errors (#2352 by @astefanutti)
- Fix TrainJob status comparison and update (#2353 by @astefanutti)
- Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
- KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
- KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
- KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
- KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
- KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
- KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
- KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
- KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
- KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
- KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304 by @tenzen-y)
- KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
- KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
- KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
- KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
- KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
- KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
- [v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
- KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
- KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
- KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
- KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
- KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
- KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
- KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
- KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
- KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
- KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)
Bug Fixes
- [release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @andreyvelich)
- Pin accelerate package version in trainer (#2340 by @gavrissh)
- [fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
- [SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
- [SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
- [Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
- [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
- Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
- [SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
- fix volcano podgroup update issue (#2079 by @ckyuto)
- [SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)
Misc
- [release-1.9] Add release branch to the image push trigger (#2377 by @andreyvelich)
- Add e2e test for train API (#2199 by @helenxie-bit)
- buildx link was broken ([#2356](https://github.com/kubeflow/training-operator/pul...
v1.9.0-rc.0 release
This is the Training Operator v1.9.0-rc.0 pre-release.
Breaking Changes
- Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
- Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
- Update the name of PVC in train API (#2187 by @helenxie-bit)
- Remove support for MXJob (#2150 by @tariq-hasan)
- Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
New Features
Distributed JAX
- Add JAX controller (#2194 by @sandipanpanda)
- Add JAX API (#2163 by @sandipanpanda)
- JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)
New Examples
- FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
- Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)
Control Plane Updates
- Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
- [Feature] Support managed by external controller (#2203 by @mszadkow)
- Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
- Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
- Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
- ARM64 supported in PyTorch examples (#2116 by @danielsuh05)
SDK Updates
- [SDK] Adding env vars (#2285 by @tarekabouzeid)
- [SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
- [SDK] move env var to constants.py (#2268 by @varshaprasad96)
- [SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
- [SDK] Read namespace from the current context (#2255 by @andreyvelich)
- [SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
- [SDK] Explain Python version support cycle (#2144 by @andreyvelich)
Kubeflow Training V2
- KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
- KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
- Always update TrainJob status on errors (#2352 by @astefanutti)
- Fix TrainJob status comparison and update (#2353 by @astefanutti)
- Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
- KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
- KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
- KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
- KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
- KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
- KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
- KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
- KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
- KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
- KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304 by @tenzen-y)
- KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
- KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
- KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
- KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
- KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
- KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
- [v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
- KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
- KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
- KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
- KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
- KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
- KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
- KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
- KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
- KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
- KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)
Bug Fixes
- [release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @andreyvelich)
- Pin accelerate package version in trainer (#2340 by @gavrissh)
- [fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
- [SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
- [SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
- [Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
- [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
- Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
- [SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
- fix volcano podgroup update issue (#2079 by @ckyuto)
- [SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)
Misc
- [release-1.9] Add release branch to the image push trigger (#2377 by @andreyvelich)
- Add e2e test for train API (#2199 by @helenxie-bit)
- buildx link was broken (#2356 by @Veer0x1)
- Upgrade helm/kind-action to v1.11.0 (#2357 by @astefanutti)
- Upgrade Go version to v1.23 (#2302 by @tenzen-y)
- Ensure code generation dependencies are downloaded (#2339 by @astefanutti)
- Added test for create-pytorchjob.ipynb python notebook ([#2274](https://github.com/kubeflow/training-operator...
v1.8.1 release
This is the Training Operator v1.8.1 release.
Bug Fixes
- [Bug] Finish CleanupJob early if the job is suspended (#2243 by @mszadkow)
- [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
- Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
New Contributors
- @mszadkow made their first contribution in #2243
- @helenxie-bit made their first contribution in #2180
v1.8.0 release
This is the Training Operator v1.8.0 release.
This release introduces a new Python API for LLMs Fine-Tuning that simplifies the ability to fine-tune foundational models using distributed PyTorch nodes.
Install the Kubeflow Training SDK as follows to try it:
pip install -U "kubeflow-training[huggingface]"
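After installing, it can be useful to confirm which SDK version the environment actually picked up. The snippet below is a small sketch using only the Python standard library; it assumes the distribution name is kubeflow-training, as in the pip command above, and the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("kubeflow-training")
    if version is None:
        print("kubeflow-training is not installed in this environment")
    else:
        print(f"kubeflow-training {version} is installed")
```

Querying the package metadata avoids importing the SDK itself, so the check works even when optional dependencies such as the huggingface extra are missing.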
LLMs Fine-Tuning API
- Train/Fine-tune API Proposal for LLMs (#1945 by @deepanker13)
- [SDK] Train API for LLM Fine-Tuning (#1962 by @deepanker13)
- Modify LLM Trainer to support BERT and Tiny LLaMA (#2031 by @andreyvelich)
- Support arm64 for Hugging Face trainer (#2028 by @tariq-hasan)
- Add Fine-Tune BERT LLM Example (#2021 by @andreyvelich)
- Train api dataset download changes (#1959 by @deepanker13)
- Train api init container creation (#1958 by @deepanker13)
- [SDK] Add docstring for Train API (#2075 by @andreyvelich)
Breaking Changes
- [SDK] Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
- Support K8s v1.29 and Drop K8s v1.26 (#2039 by @tenzen-y)
- Support K8s v1.28 and Drop K8s v1.25 (#2038 by @tenzen-y)
- Deprecation Notice for MXJob (#2058 by @tenzen-y)
⚠️ Breaking Changes: Rename monitoring-port flag to webhook-server-port (#1925 by @afritzler)
New Features
Control Plane Updates
- Upgrade scheduler-plugins to v0.28.9 (#2065 by @tenzen-y)
- Implement webhook validations for the PaddleJob (#2057 by @tenzen-y)
- Implement webhook validations for the XGBoostJob (#2052 by @tenzen-y)
- Implement webhook validation for the TFJob (#2051 by @tenzen-y)
- Implement webhook validations for the PyTorchJob (#2035 by @tenzen-y)
- Upgrade PyTorchJob examples to PyTorch v2 (#2024 by @champon1020)
- Upgrade Go version to v1.22 (#2046 by @tenzen-y)
SDK Improvements
- [SDK] Add resources per worker for Create Job API (#1990 by @andreyvelich)
- [SDK] Fix Worker and Master templates for PyTorchJob (#1988 by @andreyvelich)
- [SDK] Get Kubernetes Events for Job (#1975 by @andreyvelich)
- SDK: Upgrade the minimum required Kubernetes version to v1.27.2 (#2066 by @tenzen-y)
- [SDK] Add information about TrainingClient logging (#1973 by @andreyvelich)
- Training operator SDK unit test (#1938 by @deepanker13)
- [SDK] Consolidate Naming for CRUD APIs (#1907 by @andreyvelich)
Bug Fixes
- [SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
- [SDK] Sync Transformers version for train API (#2147 by @andreyvelich)
- [SDK] Changed package name to flake8 to fix pip install (#2140 by @tenzen-y)
- [SDK] Fix Incorrect Events in get_job_logs API (#2138 by @tenzen-y)
- Fix volcano podgroup update issue (#2079 by @ckyuto)
- Fix import for HuggingFace Dataset Provider (#2085 by @andreyvelich)
- Updated examples for train API (#2077 by @shruti2522)
- Fail job for non-retryable exit codes (#2071 by @kellyaa)
- E2E: Replace outdated images with latest ones (#2083 by @tenzen-y)
- fix wrong filepath in the simple example command (#2062 by @qzoscar)
- fix(example): add installation of python-etcd in Pytorch example (#2064 by @champon1020)
- fix: Upgrade controller-gen to v0.14.0 (#2026 by @champon1020)
- Fix build workflow config for pytorch-torchrun-example (#2020 by @PeterWrighten)
- Fix Distributed Data Samplers in PyTorch Examples (#2012 by @andreyvelich)
- Fix URL in python SDK setup.py (#2011 by @garymm)
- Fix for Github CI to publish HF trainer image (#1987 by @johnugeorge)
- train api jupyternotebook fix (#1984 by @deepanker13)
- fix: volcano podgroup should has a non-empty queue name (#1977 by @lowang-bh)
- Fix Master Label for PyTorchJob (#1974 by @andreyvelich)
- IsMasterRole fix in pytorchjob controller (#1969 by @deepanker13)
- [fix] replace ${go env GOPATH} with $(go env GOPATH) (#1952 by @double12gzh)
- Fixing issues with providing existing service account (#1918 by @rpemsel)
Misc
- Refine the integration tests for the immutable PyTorchJob (#2130 by @tenzen-y)
- Update training operator image to latest (#2089 by @johnugeorge)
- Update sdk to v1.8.0rc0 (#2087 by @johnugeorge)
- Test: Simplify and Identify pod-controller envtest (#2084 by @tenzen-y)
- Remove deadcode related to PodDisruptionBudget (#2073 by @tenzen-y)
- docs: updating docs for local development (#2074 by @franciscojavierarceo)
- PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode (#2067 by @tenzen-y)
- Updated developer docs to include Kind (#2061 by @franciscojavierarceo)
- adding fine tune example with s3 as the dataset store (#2006 by @deepanker13)
- CI: Use a mode=min in the builder cache (#2053 by @tenzen-y)
- Fix: upgrade version of crd-ref-docs, which caused panic with go v1.22 (#2043 by @jdcfd)
- Remove Dockerfile.ppc64le of pytorch example (#2042 by @champon1020)
- publish torchrun example via Dockerfile (#2018 by @PeterWrighten)
- Updated examples/pytorch to disable istio sidecar injection (#2004 by @jdcfd)
- [docs] development guide update (#1995 by @shashank-iitbhu)
- Add Kubeflow Website links to README (#1983 by @andreyvelich)
- publish trainer hugging face image (#1985 by @deepanker13)
- Adding Training image needed for train api (#1963 by @deepanker13)
- Add test to create PyTorchJob from func (#1979 by @andreyvelich)
- Corrected Some Spelling And Grammatical Errors (#1980 by @daniel-hutao)
- torchrun example with cpu version pytorch (#1965 by @kuizhiqing)
- utils changes needed to add train api (#1954 by @deepanker13)
- Adding parallel support for coveralls (#1956 by @johnugeorge)
- chore: pkg import only once (#1950 by @testwill)
- fix nproc env in elas...
v1.8.0-rc.0 release
New features
- Train/Fine-tune API Proposal for LLMs #1945 (deepanker13)
- Adding Training image needed for train api #1963 (deepanker13)
- [SDK] Train API #1962 (deepanker13)
- Train api dataset download changes #1959 (deepanker13)
- Train api init container creation #1958 (deepanker13)
- Publish trainer hugging face image #1985 (deepanker13)
- Support arm64 for Hugging Face trainer #2028 (tariq-hasan)
- Modify LLM Trainer to support BERT and Tiny LLaMA #2031 (andreyvelich)
- Implement webhook validations for the PyTorchJob #2035 (tenzen-y)
- Implement webhook validations for the XGBoostJob #2052 (tenzen-y)
- Implement webhook validation for the TFJob #2051 (tenzen-y)
- Implement webhook warnings for the MXJob #2058 (tenzen-y)
- Implement webhook validations for the PaddleJob #2057 (tenzen-y)
- Fail job for non-retryable exit codes #2071 (kellyaa)
- Adding fine tune example with s3 as the dataset store #2006 (deepanker13)
Bug fixes
- fix nproc env in elastic mode for pytorchjob #1948 (kuizhiqing)
- IsMasterRole fix in pytorchjob controller #1969 (deepanker13)
- fix: volcano podgroup should has a non-empty queue name #1977 (lowang-bh)
- Fix Master Label for PyTorchJob #1974 (andreyvelich)
- [SDK] Fix Worker and Master templates for PyTorchJob #1988 (andreyvelich)
- Fix import for HuggingFace Dataset Provider #2085 (andreyvelich)
- Upgrade controller-gen to v0.14.0 #2026 (champon1020)
- Fix Distributed Data Samplers in PyTorch Examples #2012 (andreyvelich)
- Fix URL in python SDK setup.py #2011 (garymm)
Misc
- Adding parallel support for coveralls #1956 (johnugeorge)
- torchrun example with cpu version pytorch #1965 (kuizhiqing)
- [SDK] Get Kubernetes Events for Job #1975 (andreyvelich)
- Fix Master Label for PyTorchJob #1974 (andreyvelich)
- [SDK] Add information about TrainingClient logging #1973 (andreyvelich)
- PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode #2067 (tenzen-y)
- SDK: Upgrade the minimum required Kubernetes version to v1.27.2 #2066 (tenzen-y)
- Test: Simplify and Identify pod-controller envtest #2084 (tenzen-y)
- E2E: Replace outdated images with latest ones #2083 (tenzen-y)
- Upgrade scheduler-plugins to v0.28.9 #2065 (tenzen-y)
v1.7.0 release
Breaking Changes
- Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
- Upgrade the kubernetes dependencies to v1.27 #1834 (tenzen-y)
New features
- Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
- Merge kubeflow/common to training-operator #1813 (johnugeorge)
- Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
- Implement suspend semantics #1859 (tenzen-y)
- Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
- Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)
Bug fixes
- Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
- Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
- Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)
Misc
- Removing reconciler code #1879 (johnugeorge)
- Make Condition and ReplicaStatus optional #1862 (tenzen-y)
- Use the same reasons for Condition and Event #1854 (tenzen-y)
- Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
- Clean up /pkg/common/util/v1 #1845 (tenzen-y)
- Refactoring tests in common/controller.v1 #1843 (tenzen-y)
- remove duplicate code of add task spec annotation #1839 (lowang-bh)
- fetch volcano log when e2e failed #1837 (lowang-bh)
- Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
- Replace dummy client with fake client #1818 (tenzen-y)
- Add default Intel MPI env variables to MPIJob #1804 (tkatila)
- Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
- xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
- make timeout configurable from e2e tests #1787 (nagar-ajay)
v1.7.0-rc.0 release
Breaking Changes
- Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
- Upgrade the kubernetes dependencies to v1.27 #1834 (tenzen-y)
New features
- Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
- Merge kubeflow/common to training-operator #1813 (johnugeorge)
- Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
- Implement suspend semantics #1859 (tenzen-y)
- Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
- Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)
Bug fixes
- Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
- Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
- Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)
Misc
- Removing reconciler code #1879 (johnugeorge)
- Make Condition and ReplicaStatus optional #1862 (tenzen-y)
- Use the same reasons for Condition and Event #1854 (tenzen-y)
- Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
- Clean up /pkg/common/util/v1 #1845 (tenzen-y)
- Refactoring tests in common/controller.v1 #1843 (tenzen-y)
- remove duplicate code of add task spec annotation #1839 (lowang-bh)
- fetch volcano log when e2e failed #1837 (lowang-bh)
- Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
- Replace dummy client with fake client #1818 (tenzen-y)
- Add default Intel MPI env variables to MPIJob #1804 (tkatila)
- Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
- xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
- make timeout configurable from e2e tests #1787 (nagar-ajay)