Pulse · kubeflow/trainer · GitHub

May 25, 2025 – June 25, 2025

Overview

23 Active pull requests

21 Active issues

1 Release published by 1 person

v2.0.0-rc.0
published Jun 12, 2025

15 Pull requests merged by 6 people

chore(ci): Add more workaround no space left on device
#2677 merged Jun 20, 2025
[release-2.0] Add Changelog for Trainer v2.0.0-rc.0
#2667 merged Jun 11, 2025
Add Changelog for Trainer v2.0.0-rc.0
#2666 merged Jun 11, 2025
Remove SDK
#2657 merged Jun 9, 2025
Tag Docker images with GitHub release tags
#2662 merged Jun 7, 2025
chore(docs): Cherry-pick changelog for Training Operator v1.9.0
#2661 merged Jun 6, 2025
KEP-2401: Support loading local LLMs
#2644 merged Jun 6, 2025
feat(controller): Implement PodSpecOverride API
#2614 merged Jun 6, 2025
Nominate @Electronic-Waste as approver and @astefanutti as reviewer
#2659 merged Jun 6, 2025
chore(build): Support Podman to run OpenAPI generator
#2656 merged Jun 5, 2025
chore(docs): Add OpenSSF Best Practices Badge
#2611 merged Jun 4, 2025
KEP-2401: Support mutating dataset preprocessing config in SDK
#2638 merged Jun 4, 2025
Revert "fix(sdk): Fix type annotation for train method's trainer parameter"
#2651 merged May 27, 2025
fix(sdk): Fix bad arg passed to get_args_using_torchtune_config
#2647 merged May 27, 2025
fix(sdk): Fix type annotation for train method's trainer parameter
#2646 merged May 26, 2025

8 Pull requests opened by 6 people

Apply resources appropriately to both launcher and node containers
#2653 opened May 30, 2025
docs: Add `LocalTrainerClient` example notebook
#2658 opened Jun 5, 2025
KEP-2628: Support KAI Scheduler in Kubeflow Trainer
#2663 opened Jun 8, 2025
fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files
#2669 opened Jun 14, 2025
feat(example): Add alpaca-trianjob-yaml.ipynb.
#2670 opened Jun 15, 2025
KEP-2437: Support Volcano Scheduler in Kubeflow Trainer V2
#2672 opened Jun 16, 2025
fix(plugins): Fix some errors in torchtune mutation process.
#2675 opened Jun 17, 2025
Fix - Add certificate and issuer resources to manifests and helm chart
#2678 opened Jun 21, 2025

12 Issues closed by 4 people

Distributed training with mutliple pods, with multi-gpu in each pod
#2456 closed Jun 18, 2025
Is it possible to pass annotation and label to jobset?
#2660 closed Jun 10, 2025
Consider container image rename of `kubeflow/storage-initializer`
#2183 closed Jun 10, 2025
KEP-2401: Support loading local LLMs
#2641 closed Jun 6, 2025
KEP-2170: Support the PodSpecOverrides API in TrainJob
#2218 closed Jun 6, 2025
Unit test for trainer_client.py in the v2 SDK
#2652 closed Jun 6, 2025
GPU benchmark image does not exist
#1672 closed Jun 5, 2025
KEP-2401: Support mutating dataset preprocessing config in SDK
#2506 closed Jun 4, 2025
Permission denied when reading TrainJob function script when run as non-root user
#2372 closed Jun 4, 2025
Support richer volcano scheduling
#2182 closed May 29, 2025
Automate Python SDK release process in GitHub Actions
#1540 closed May 28, 2025
Docs: reference architecture for fault tolerance capabilities
#2157 closed May 25, 2025

9 Issues opened by 8 people

Add schedulingGates to PodSpecOverrides
#2680 opened Jun 23, 2025
Mutable PodSpecOverrides for suspended TrainJob
#2679 opened Jun 23, 2025
KEP-2401: Add Notebook examples for LLM Trainer V2
#2676 opened Jun 17, 2025
[GSoC] Project 7: GPU Testing for LLM Blueprints
#2674 opened Jun 16, 2025
[GSoC] Project 8: JAX and TensorFlow Runtimes
#2673 opened Jun 16, 2025
[GSoC] Project 10: Support Volcano Scheduler in Kubeflow Trainer
#2671 opened Jun 16, 2025
Custom metrics
#2668 opened Jun 13, 2025
Skip Full CI for Non-Code/Docs-Only PRs
#2664 opened Jun 8, 2025
KEP-2655: Kubeflow Data Cache for distributed training on Kubernetes
#2655 opened Jun 4, 2025

46 Unresolved conversations

Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.

Add provision to provide local-queue for the training job in SDKv1 an…
#2636 commented on Jun 23, 2025 • 24 new comments
[proposal] GSoC Project 8: JAX Runtime for V2
#2643 commented on Jun 24, 2025 • 11 new comments
KEP-2170: Add manifest overlays for standalone installation
#2527 commented on Jun 14, 2025 • 6 new comments
KEP-2170: Add the manifests overlay for Kubeflow Training V2
#2382 commented on Jun 14, 2025 • 2 new comments
Fix training client error logs
#2586 commented on May 29, 2025 • 1 new comment
Introduce torch.compile to all PyTorch examples
#2027 commented on Jun 13, 2025 • 0 new comments
Use Debian images for Python components in the Training Operator V2
#2311 commented on Jun 14, 2025 • 0 new comments
KEP-2401: Kubeflow LLM Trainer V2
#2401 commented on Jun 17, 2025 • 0 new comments
Operator guide to manage TrainingRuntime and ClusterTrainingRuntime
#2542 commented on Jun 17, 2025 • 0 new comments
User guide for PyTorch Training
#2543 commented on Jun 17, 2025 • 0 new comments
Support TrainJob ResourcePerNode in CoScheduling plugin
#2525 commented on Jun 18, 2025 • 0 new comments
Flaky Test: TestDatasetIntegration.test_dataset_download[HuggingFace - Public dataset-huggingface-test_case0]
#2460 commented on Jun 18, 2025 • 0 new comments
Get and Use TrainingRuntime ApplyConfiguration throughout KF PipelineFramework
#2515 commented on Jun 19, 2025 • 0 new comments
Export Models to Kubeflow Model Registry
#2438 commented on Jun 20, 2025 • 0 new comments
KEP-2401: Determine the tag for torchtune trainer & Add support for multiple accelerators
#2518 commented on Jun 22, 2025 • 0 new comments
KEP-2170: Support hundreds and thousands worker nodes for a single training Job
#2318 commented on Jun 22, 2025 • 0 new comments
KEP-2170: Add Kubeflow Trainer Pipeline Framework Concept page to Documentation
#2458 commented on Jun 23, 2025 • 0 new comments
Add migration guide from Training Operator to Kubeflow Trainer V2
#2412 commented on Jun 24, 2025 • 0 new comments
Update Dockerfile with python debian image in cmd/initializer_v2/dataset/Dockerfile
#2312 commented on Jun 12, 2025 • 0 new comments
Config API for Kubeflow Trainer controller manager
#2428 commented on Jun 10, 2025 • 0 new comments
Use cncf-hosted gha runners
#2538 commented on Jun 16, 2025 • 0 new comments
Fix Prometheus metrics counter
#2553 commented on Jun 22, 2025 • 0 new comments
feat(scheduler):add support for kai scheduler
#2649 commented on Jun 19, 2025 • 0 new comments
Add the Config API for Kubeflow Trainer controller manager
#2420 commented on May 27, 2025 • 0 new comments
Support for ResourcesPerNode in DeepSpeed Training Job Containers
#2650 commented on May 27, 2025 • 0 new comments
KEP-2170: Add AMD ROCm Torch Distributed Training Runtime
#2335 commented on May 27, 2025 • 0 new comments
release the trainer python models
#2645 commented on May 28, 2025 • 0 new comments
Support TensorFlow Runtime
#2443 commented on May 29, 2025 • 0 new comments
"zero-trust" security / networking for training jobs
#2341 commented on May 29, 2025 • 0 new comments
Add a workflow for publishing Helm charts
#2488 commented on Jun 2, 2025 • 0 new comments
Add unit tests that cover the `pkg/apply` package
#2452 commented on Jun 4, 2025 • 0 new comments
Create Slurm runtime for model training using V2 APIs
#2249 commented on Jun 4, 2025 • 0 new comments
Create Trainer UI
#2648 commented on Jun 5, 2025 • 0 new comments
KEP-2170: Kubeflow Trainer V2 API
#2170 commented on Jun 6, 2025 • 0 new comments
Improve Kubeflow Trainer release process
#2155 commented on Jun 6, 2025 • 0 new comments
Training Operator - panic: runtime error: index out of range
#1842 commented on Jun 6, 2025 • 0 new comments
[Feedback] (the dataset download link gets 403 error) docs/components/training/user-guides/pytorch.md |
#2499 commented on Jun 9, 2025 • 0 new comments
Support KAI Scheduler in Kubeflow Trainer
#2628 commented on Jun 9, 2025 • 0 new comments
Leverage GitHub action arm64 runner
#2422 commented on Jun 9, 2025 • 0 new comments
Enable GPU Testing for LLM Blueprints
#2432 commented on Jun 9, 2025 • 0 new comments
KEP-2401: Create LLM Training Runtimes for Llama 3.1 model family
#2509 commented on Jun 10, 2025 • 0 new comments
Managing Pod Lifecycle in Distributed Training with TFJob
#2454 commented on Jun 11, 2025 • 0 new comments
Strategies for Deleting Successful Pods without Affecting Task Execution in TFJob
#2453 commented on Jun 11, 2025 • 0 new comments
Create model exporter for checkpointing and training output
#2245 commented on Jun 12, 2025 • 0 new comments
Support XGBoost/LightGBM runtime and examples
#2598 commented on Jun 12, 2025 • 0 new comments
Support Volcano Scheduler in Kubeflow Trainer
#2437 commented on Jun 12, 2025 • 0 new comments
