[Feature] Allow early training termination at specific step using Trainer.max_steps without modifying LR schedule ie from full convergence run #749

dorotat-nv · 2025-03-12T14:35:23Z

Problem & Motivation

In Evo2, using the --max-steps argument to stop training at a specific step also modifies the learning rate schedule. This makes it difficult to test partial convergence training that stops at a given step without altering the intended LR schedule.
File: sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py

Once this is supported, remove the SignalAfterGivenStepCallback from the training script.

BioNeMo Framework Version

7428f5f

Proposed Solution

Introduce a new optional argument, i.e. lr_scheduler_steps, which, when passed, sets the number of steps for the learning rate scheduler in place of max_steps.
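A minimal sketch of how the proposed argument could be wired in, assuming an argparse-based CLI. The flag spelling `--lr-scheduler-steps` and the helper `resolve_scheduler_steps` are hypothetical and are not taken from train.py; only the fallback rule (use max_steps when lr_scheduler_steps is not passed) follows from the proposal above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the two step counts as separate options."""
    parser = argparse.ArgumentParser(description="Sketch: decouple training length from LR schedule length")
    parser.add_argument("--max-steps", type=int, required=True,
                        help="Step at which training stops.")
    parser.add_argument("--lr-scheduler-steps", type=int, default=None,
                        help="Length of the LR schedule. Defaults to --max-steps when omitted.")
    return parser


def resolve_scheduler_steps(args: argparse.Namespace) -> int:
    """Return the step count the LR scheduler should be built with."""
    # Assumption: an omitted value keeps the current behavior (schedule length == max_steps).
    if args.lr_scheduler_steps is None:
        return args.max_steps
    return args.lr_scheduler_steps


if __name__ == "__main__":
    # Partial-convergence run: stop at step 1000, keep the schedule of a 100000-step run.
    args = build_parser().parse_args(["--max-steps", "1000", "--lr-scheduler-steps", "100000"])
    print(args.max_steps, resolve_scheduler_steps(args))
```

Keeping the default as `None` rather than a number means existing invocations that pass only the maximum step count would behave as they do today.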

Expected Benefits

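max_steps controls the length of the training run, while lr_scheduler_steps defines the learning rate schedule. A partial convergence run can therefore stop early and still follow the LR schedule of the full convergence run.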

Code Example
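The following is an illustrative sketch only, not the Evo2 scheduler. It uses a generic linear-warmup plus cosine-decay schedule (an assumption, with made-up hyperparameters) to show the effect being requested: when the schedule length is tied to max_steps, a short run decays the learning rate to its minimum early, and when lr_scheduler_steps is passed separately, the short run follows the start of the full run's schedule exactly.

```python
import math


def warmup_cosine_lr(step: int, total_steps: int, warmup_steps: int = 100,
                     max_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    """Generic warmup + cosine decay; hyperparameters are placeholders."""
    if step < warmup_steps:
        # Linear warmup from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def lr_trace(max_steps: int, lr_scheduler_steps=None) -> list:
    """Learning rate at every step of a run that stops after max_steps."""
    # Hypothetical rule from the proposal: fall back to max_steps when not passed.
    total = max_steps if lr_scheduler_steps is None else lr_scheduler_steps
    return [warmup_cosine_lr(s, total) for s in range(max_steps)]


full = lr_trace(max_steps=10_000)                                 # full convergence run
coupled = lr_trace(max_steps=1_000)                               # schedule shrinks with max_steps
decoupled = lr_trace(max_steps=1_000, lr_scheduler_steps=10_000)  # proposed behavior

if __name__ == "__main__":
    print("same as full run prefix:", decoupled == full[:1_000])
    print("coupled final LR lower than full run at same step:", coupled[-1] < full[999])
```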


… subpackages and update benchmarks (#803)

### Description

Current implementations (Evo2 and ESM2) use different approaches to stop training at specific steps while maintaining the full learning rate schedule or other characteristics. Trying to unify it.

- Evo2: Uses checkpoint mechanism to stop training after K steps
- ESM2: Implements a different solution in train_esm2.py

Evo2: https://github.com/NVIDIA/bionemo-framework/blob/a643d8c073262a542c7b94cc39648173d[…]87c1d8e/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py

ESM2: https://github.com/NVIDIA/bionemo-framework/blob/a643d8c073262a542c7b94cc39648173d[…]ub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py

Addressing issue: #749

### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

* If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
* If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Usage

```python
TODO: Add code snippet
```

### Pre-submit Checklist

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully


dorotat-nv added the Evo2 label Mar 12, 2025

dorotat-nv self-assigned this Apr 8, 2025

dorotat-nv mentioned this issue Apr 8, 2025

unify the implementation of early training termination across BioNeMo subpackages and update benchmarks #803

Merged

9 tasks
