Insights: kubeflow/trainer
Overview
1 Release published by 1 person
-
v2.0.0-rc.0
published
Jun 12, 2025
15 Pull requests merged by 6 people
-
chore(ci): Add more workaround no space left on device
#2677 merged
Jun 20, 2025 -
[release-2.0] Add Changelog for Trainer v2.0.0-rc.0
#2667 merged
Jun 11, 2025 -
Add Changelog for Trainer v2.0.0-rc.0
#2666 merged
Jun 11, 2025 -
Remove SDK
#2657 merged
Jun 9, 2025 -
Tag Docker images with GitHub release tags
#2662 merged
Jun 7, 2025 -
chore(docs): Cherry-pick changelog for Training Operator v1.9.0
#2661 merged
Jun 6, 2025 -
KEP-2401: Support loading local LLMs
#2644 merged
Jun 6, 2025 -
feat(controller): Implement PodSpecOverride API
#2614 merged
Jun 6, 2025 -
Nominate @Electronic-Waste as approver and @astefanutti as reviewer
#2659 merged
Jun 6, 2025 -
chore(build): Support Podman to run OpenAPI generator
#2656 merged
Jun 5, 2025 -
chore(docs): Add OpenSSF Best Practices Badge
#2611 merged
Jun 4, 2025 -
KEP-2401: Support mutating dataset preprocessing config in SDK
#2638 merged
Jun 4, 2025 -
Revert "fix(sdk): Fix type annotation for `train` method's `trainer` parameter"
#2651 merged
May 27, 2025 -
fix(sdk): Fix bad arg passed to `get_args_using_torchtune_config`
#2647 merged
May 27, 2025 -
fix(sdk): Fix type annotation for `train` method's `trainer` parameter
#2646 merged
May 26, 2025
8 Pull requests opened by 6 people
-
Apply resources appropriately to both launcher and node containers
#2653 opened
May 30, 2025 -
docs: Add `LocalTrainerClient` example notebook
#2658 opened
Jun 5, 2025 -
KEP-2628: Support KAI Scheduler in Kubeflow Trainer
#2663 opened
Jun 8, 2025 -
fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files
#2669 opened
Jun 14, 2025 -
feat(example): Add alpaca-trianjob-yaml.ipynb.
#2670 opened
Jun 15, 2025 -
KEP-2437: Support Volcano Scheduler in Kubeflow Trainer V2
#2672 opened
Jun 16, 2025 -
fix(plugins): Fix some errors in torchtune mutation process.
#2675 opened
Jun 17, 2025 -
Fix - Add certificate and issuer resources to manifests and helm chart
#2678 opened
Jun 21, 2025
12 Issues closed by 4 people
-
Distributed training with multiple pods, with multi-gpu in each pod
#2456 closed
Jun 18, 2025 -
Is it possible to pass annotation and label to jobset?
#2660 closed
Jun 10, 2025 -
Consider container image rename of `kubeflow/storage-initializer`
#2183 closed
Jun 10, 2025 -
KEP-2401: Support loading local LLMs
#2641 closed
Jun 6, 2025 -
KEP-2170: Support the PodSpecOverrides API in TrainJob
#2218 closed
Jun 6, 2025 -
Unit test for trainer_client.py in the v2 SDK
#2652 closed
Jun 6, 2025 -
GPU benchmark image does not exist
#1672 closed
Jun 5, 2025 -
KEP-2401: Support mutating dataset preprocessing config in SDK
#2506 closed
Jun 4, 2025 -
Permission denied when reading TrainJob function script when run as non-root user
#2372 closed
Jun 4, 2025 -
Support richer volcano scheduling
#2182 closed
May 29, 2025 -
Automate Python SDK release process in GitHub Actions
#1540 closed
May 28, 2025 -
Docs: reference architecture for fault tolerance capabilities
#2157 closed
May 25, 2025
9 Issues opened by 8 people
-
Add schedulingGates to PodSpecOverrides
#2680 opened
Jun 23, 2025 -
Mutable PodSpecOverrides for suspended TrainJob
#2679 opened
Jun 23, 2025 -
KEP-2401: Add Notebook examples for LLM Trainer V2
#2676 opened
Jun 17, 2025 -
[GSoC] Project 7: GPU Testing for LLM Blueprints
#2674 opened
Jun 16, 2025 -
[GSoC] Project 8: JAX and TensorFlow Runtimes
#2673 opened
Jun 16, 2025 -
[GSoC] Project 10: Support Volcano Scheduler in Kubeflow Trainer
#2671 opened
Jun 16, 2025 -
Custom metrics
#2668 opened
Jun 13, 2025 -
Skip Full CI for Non-Code/Docs-Only PRs
#2664 opened
Jun 8, 2025 -
KEP-2655: Kubeflow Data Cache for distributed training on Kubernetes
#2655 opened
Jun 4, 2025
46 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Add provision to provide local-queue for the training job in SDKv1 an…
#2636 commented on
Jun 23, 2025 • 24 new comments -
[proposal] GSoC Project 8: JAX Runtime for V2
#2643 commented on
Jun 24, 2025 • 11 new comments -
KEP-2170: Add manifest overlays for standalone installation
#2527 commented on
Jun 14, 2025 • 6 new comments -
KEP-2170: Add the manifests overlay for Kubeflow Training V2
#2382 commented on
Jun 14, 2025 • 2 new comments -
Fix training client error logs
#2586 commented on
May 29, 2025 • 1 new comment -
Introduce torch.compile to all PyTorch examples
#2027 commented on
Jun 13, 2025 • 0 new comments -
Use Debian images for Python components in the Training Operator V2
#2311 commented on
Jun 14, 2025 • 0 new comments -
KEP-2401: Kubeflow LLM Trainer V2
#2401 commented on
Jun 17, 2025 • 0 new comments -
Operator guide to manage TrainingRuntime and ClusterTrainingRuntime
#2542 commented on
Jun 17, 2025 • 0 new comments -
User guide for PyTorch Training
#2543 commented on
Jun 17, 2025 • 0 new comments -
Support TrainJob ResourcePerNode in CoScheduling plugin
#2525 commented on
Jun 18, 2025 • 0 new comments -
Flaky Test: TestDatasetIntegration.test_dataset_download[HuggingFace - Public dataset-huggingface-test_case0]
#2460 commented on
Jun 18, 2025 • 0 new comments -
Get and Use TrainingRuntime ApplyConfiguration throughout KF PipelineFramework
#2515 commented on
Jun 19, 2025 • 0 new comments -
Export Models to Kubeflow Model Registry
#2438 commented on
Jun 20, 2025 • 0 new comments -
KEP-2401: Determine the tag for torchtune trainer & Add support for multiple accelerators
#2518 commented on
Jun 22, 2025 • 0 new comments -
KEP-2170: Support hundreds and thousands worker nodes for a single training Job
#2318 commented on
Jun 22, 2025 • 0 new comments -
KEP-2170: Add Kubeflow Trainer Pipeline Framework Concept page to Documentation
#2458 commented on
Jun 23, 2025 • 0 new comments -
Add migration guide from Training Operator to Kubeflow Trainer V2
#2412 commented on
Jun 24, 2025 • 0 new comments -
Update Dockerfile with python debian image in cmd/initializer_v2/dataset/Dockerfile
#2312 commented on
Jun 12, 2025 • 0 new comments -
Config API for Kubeflow Trainer controller manager
#2428 commented on
Jun 10, 2025 • 0 new comments -
Use cncf-hosted gha runners
#2538 commented on
Jun 16, 2025 • 0 new comments -
Fix Prometheus metrics counter
#2553 commented on
Jun 22, 2025 • 0 new comments -
feat(scheduler):add support for kai scheduler
#2649 commented on
Jun 19, 2025 • 0 new comments -
Add the Config API for Kubeflow Trainer controller manager
#2420 commented on
May 27, 2025 • 0 new comments -
Support for ResourcesPerNode in DeepSpeed Training Job Containers
#2650 commented on
May 27, 2025 • 0 new comments -
KEP-2170: Add AMD ROCm Torch Distributed Training Runtime
#2335 commented on
May 27, 2025 • 0 new comments -
release the trainer python models
#2645 commented on
May 28, 2025 • 0 new comments -
Support TensorFlow Runtime
#2443 commented on
May 29, 2025 • 0 new comments -
"zero-trust" security / networking for training jobs
#2341 commented on
May 29, 2025 • 0 new comments -
Add a workflow for publishing Helm charts
#2488 commented on
Jun 2, 2025 • 0 new comments -
Add unit tests that cover the `pkg/apply` package
#2452 commented on
Jun 4, 2025 • 0 new comments -
Create Slurm runtime for model training using V2 APIs
#2249 commented on
Jun 4, 2025 • 0 new comments -
Create Trainer UI
#2648 commented on
Jun 5, 2025 • 0 new comments -
KEP-2170: Kubeflow Trainer V2 API
#2170 commented on
Jun 6, 2025 • 0 new comments -
Improve Kubeflow Trainer release process
#2155 commented on
Jun 6, 2025 • 0 new comments -
Training Operator - panic: runtime error: index out of range
#1842 commented on
Jun 6, 2025 • 0 new comments -
[Feedback] (the dataset download link gets 403 error) docs/components/training/user-guides/pytorch.md
#2499 commented on
Jun 9, 2025 • 0 new comments -
Support KAI Scheduler in Kubeflow Trainer
#2628 commented on
Jun 9, 2025 • 0 new comments -
Leverage GitHub action arm64 runner
#2422 commented on
Jun 9, 2025 • 0 new comments -
Enable GPU Testing for LLM Blueprints
#2432 commented on
Jun 9, 2025 • 0 new comments -
KEP-2401: Create LLM Training Runtimes for Llama 3.1 model family
#2509 commented on
Jun 10, 2025 • 0 new comments -
Managing Pod Lifecycle in Distributed Training with TFJob
#2454 commented on
Jun 11, 2025 • 0 new comments -
Strategies for Deleting Successful Pods without Affecting Task Execution in TFJob
#2453 commented on
Jun 11, 2025 • 0 new comments -
Create model exporter for checkpointing and training output
#2245 commented on
Jun 12, 2025 • 0 new comments -
Support XGBoost/LightGBM runtime and examples
#2598 commented on
Jun 12, 2025 • 0 new comments -
Support Volcano Scheduler in Kubeflow Trainer
#2437 commented on
Jun 12, 2025 • 0 new comments