[Feature] Allow early training termination at a specific step using Trainer.max_steps without modifying the LR schedule, i.e. the schedule from a full convergence run #749
github-merge-queue bot pushed a commit that referenced this issue Apr 11, 2025
… subpackages and update benchmarks (#803)

### Description

Current implementations (Evo2 and ESM2) use different approaches to stop training at specific steps while maintaining the full learning rate schedule or other characteristics. Trying to unify it

- Evo2: Uses checkpoint mechanism to stop training after K steps
- ESM2: Implements a different solution in train_esm2.py

Evo2: https://github.com/NVIDIA/bionemo-framework/blob/a643d8c073262a542c7b94cc39648173d[…]87c1d8e/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py
ESM2: https://github.com/NVIDIA/bionemo-framework/blob/a643d8c073262a542c7b94cc39648173d[…]ub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py

Addressing issue: #749

### Type of changes
<!-- Mark the relevant option with an [x] -->

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

* If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
* If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Usage
<!--- How does a user interact with the changed code -->

```python
TODO: Add code snippet
```

### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully
cspades pushed a commit that referenced this issue May 4, 2025
… subpackages and update benchmarks (#803)
Signed-off-by: Cory Ye <cye@nvidia.com>
farhadrgh pushed a commit that referenced this issue May 5, 2025
… subpackages and update benchmarks (#803)
Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
Problem & Motivation
In Evo2, using the --max-steps argument to stop training at a specific step also modifies the learning rate schedule. This makes it difficult to test partial convergence training that stops at a given step without altering the intended LR schedule.
File: sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py
Remove the SignalAfterGivenStepCallback from the training script
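To make the coupling concrete, here is a minimal, framework-free sketch. It assumes a linear-warmup plus cosine-decay schedule whose decay horizon is tied to the total step count; the function `cosine_lr` and all numeric values are hypothetical and are not taken from the Evo2 training script. It only shows that shortening the horizon changes the learning rate seen at the same step.

```python
import math


def cosine_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup_steps=10):
    """Hypothetical schedule: linear warmup, then cosine decay over total_steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.0 at the horizon.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


full_run_steps = 1000   # horizon of the full convergence run (assumed value)
early_stop_step = 100   # step at which we want to stop early (assumed value)

# LR at the stop step when the schedule still spans the full run.
lr_full = cosine_lr(early_stop_step, total_steps=full_run_steps)

# LR at the same step when the horizon is shrunk to the stop step,
# which is what happens if max_steps also drives the scheduler.
lr_shortened = cosine_lr(early_stop_step, total_steps=early_stop_step)

print(f"full-horizon LR at step {early_stop_step}: {lr_full:.3e}")
print(f"short-horizon LR at step {early_stop_step}: {lr_shortened:.3e}")
```

With the shortened horizon the schedule has already decayed to its minimum by the stop step, so the partial run no longer follows the trajectory of the full run.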
BioNeMo Framework Version
7428f5f
Category
Model/Training
Proposed Solution
Introduce a new optional argument, e.g. lr_scheduler_steps, which, when passed, sets the number of steps for the learning rate scheduler instead of max_steps.
Expected Benefits
max_steps can then control the length of training, while lr_scheduler_steps defines the LR schedule.
Code Example
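A minimal sketch of how the proposed argument could be resolved, using only the standard library. The `--max-steps` flag is the one named in this issue; the `--lr-scheduler-steps` spelling, the helper `resolve_scheduler_steps`, and the fallback behaviour are assumptions for illustration, not the actual implementation in train.py.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of decoupled step arguments")
    parser.add_argument("--max-steps", type=int, required=True,
                        help="Step at which training stops.")
    # Hypothetical flag name for the argument proposed in this issue.
    parser.add_argument("--lr-scheduler-steps", type=int, default=None,
                        help="Steps the LR schedule spans; defaults to --max-steps.")
    return parser


def resolve_scheduler_steps(max_steps: int, lr_scheduler_steps: Optional[int]) -> int:
    """Return the horizon for the LR scheduler.

    Falls back to max_steps when the optional argument is not given,
    so existing invocations keep their current behaviour.
    """
    return max_steps if lr_scheduler_steps is None else lr_scheduler_steps


# Partial-convergence run: stop at step 100, keep the 1000-step schedule.
args = build_parser().parse_args(["--max-steps", "100", "--lr-scheduler-steps", "1000"])
scheduler_steps = resolve_scheduler_steps(args.max_steps, args.lr_scheduler_steps)
print(f"train for {args.max_steps} steps, schedule spans {scheduler_steps} steps")
```

Defaulting to `max_steps` keeps the change backward compatible: only runs that pass the new argument see a schedule horizon different from the training length.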