[Feature] Allow early training termination at specific step using Trainer.max_steps without modifying LR schedule ie from full convergence run · Issue #749 · NVIDIA/bionemo-framework · GitHub

Open
dorotat-nv opened this issue Mar 12, 2025 · 0 comments
dorotat-nv (Collaborator) commented Mar 12, 2025

Problem & Motivation

In Evo2, using the --max-steps argument to stop training at a specific step also changes the learning rate schedule, because the scheduler takes its number of steps from max_steps. This makes it difficult to test partial-convergence training, i.e. a run that stops at a given step while following the LR schedule of the full convergence run.
File: sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py

Once this is supported, remove the SignalAfterGivenStepCallback from the training script.

BioNeMo Framework Version

7428f5f

Category

Model/Training

Proposed Solution

Introduce a new optional argument, e.g. lr_scheduler_steps, which, when passed, sets the number of steps of the LR scheduler in place of max_steps.
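To illustrate the intended behaviour, here is a minimal, framework-free sketch. It assumes a plain cosine decay schedule; the function names `cosine_lr` and `lr_trace`, and the peak and minimum learning rates, are hypothetical and are not taken from the Evo2 training script. Only the fallback rule (use `lr_scheduler_steps` when given, otherwise `max_steps`) follows the proposal above.

```python
import math


def cosine_lr(step, schedule_steps, peak_lr=1e-3, min_lr=1e-5):
    """Cosine decay from peak_lr to min_lr over schedule_steps (illustrative values)."""
    # Clamp progress so steps past the end of the schedule stay at min_lr.
    progress = min(step / schedule_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def lr_trace(max_steps, lr_scheduler_steps=None):
    """Learning rates seen by a run that stops after max_steps steps.

    lr_scheduler_steps is the hypothetical optional argument from the proposal:
    when omitted, the schedule length falls back to max_steps.
    """
    schedule_steps = max_steps if lr_scheduler_steps is None else lr_scheduler_steps
    return [cosine_lr(step, schedule_steps) for step in range(max_steps)]


# Full convergence run: schedule length and training length coincide.
full_run = lr_trace(max_steps=1000)

# Early stop with the schedule tied to max_steps: the LR decays fully within 100 steps.
coupled = lr_trace(max_steps=100)

# Early stop with the schedule decoupled: same LRs as the first 100 steps of the full run.
decoupled = lr_trace(max_steps=100, lr_scheduler_steps=1000)
```

In this sketch `decoupled` matches the first 100 entries of `full_run`, whereas `coupled` has already decayed close to the minimum learning rate by its last step, which is the discrepancy the issue describes.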

Expected Benefits

max_steps can then be used to control the length of training, while lr_scheduler_steps defines the LR schedule.
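On the command line, the split between the two settings might look like the sketch below. The `--max-steps` flag is the one mentioned in this issue; the `--lr-scheduler-steps` flag, the parser, and the `resolve_steps` helper are assumptions made for illustration and do not describe the actual interface of train.py.

```python
import argparse


def build_parser():
    """Toy parser; only --max-steps is known to exist in the real script."""
    parser = argparse.ArgumentParser(description="Illustrative training CLI")
    parser.add_argument("--max-steps", type=int, required=True,
                        help="Step at which training stops.")
    # Hypothetical flag mirroring the proposed lr_scheduler_steps argument.
    parser.add_argument("--lr-scheduler-steps", type=int, default=None,
                        help="Length of the LR schedule; defaults to --max-steps.")
    return parser


def resolve_steps(argv):
    """Return (training_steps, scheduler_steps) for the given CLI arguments."""
    args = build_parser().parse_args(argv)
    scheduler_steps = (
        args.max_steps if args.lr_scheduler_steps is None else args.lr_scheduler_steps
    )
    return args.max_steps, scheduler_steps


# Full convergence run: both values come from --max-steps.
full_run = resolve_steps(["--max-steps", "1000"])

# Partial run: stop at step 100 but keep the 1000-step schedule.
early_stop = resolve_steps(["--max-steps", "100", "--lr-scheduler-steps", "1000"])
```

Keeping the new flag optional with a fallback to `--max-steps` means existing invocations would behave as before.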

dorotat-nv changed the title from "Add support for optionally setting lr_max_steps in the learning rate scheduler, enabling training to stop at a specified step using Trainer.max without requiring modifications to the full LR schedule." to "[Feature] Add support for optionally setting lr_max_steps in the learning rate scheduler, enabling training to stop at a specified step using Trainer.max without requiring modifications to the full LR schedule." on Mar 12, 2025
dorotat-nv self-assigned this Apr 8, 2025
dorotat-nv changed the title from "[Feature] Add support for optionally setting lr_max_steps in the learning rate scheduler, enabling training to stop at a specified step using Trainer.max without requiring modifications to the full LR schedule." to "[Feature] Allow early training termination at specific step using Trainer.max_steps without modifying LR schedule ie from full convergence run" on Apr 11, 2025
github-merge-queue bot pushed a commit that referenced this issue Apr 11, 2025
… subpackages and update benchmarks (#803)

### Description
Current implementations (Evo2 and ESM2) use different approaches to stop
training at a specific step while maintaining the full learning rate
schedule or other characteristics. This change attempts to unify them.

- Evo2: uses a checkpoint mechanism to stop training after K steps
- ESM2: implements a different solution in train_esm2.py

Evo2:
https://github.com/NVIDIA/bionemo-framework/blob/a643d8c073262a542c7b94cc39648173d[…]87c1d8e/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py

ESM2
https://github.com/NVIDIA/bionemo-framework/blob/a643d8c073262a542c7b94cc39648173d[…]ub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py

Addressing issue: #749

### Type of changes
<!-- Mark the relevant option with an [x] -->

- [ ]  Bug fix (non-breaking change which fixes an issue)
- [ ]  New feature (non-breaking change which adds functionality)
- [x]  Refactor
- [ ]  Documentation update
- [ ]  Other (please describe):

### CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

#### Authorizing CI Runs

We use
[copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation)
to manage authorization of CI
runs on NVIDIA's compute resources.

* If a pull request is opened by a trusted user and contains only
trusted changes, the pull request's code will
automatically be copied to a pull-request/ prefixed branch in the source
repository (e.g. pull-request/123)
* If a pull request is opened by an untrusted user or contains untrusted
changes, an NVIDIA org member must leave an
`/ok to test` comment on the pull request to trigger CI. This will need
to be done for each new commit.

### Usage
<!--- How does a user interact with the changed code -->
```python
TODO: Add code snippet
```

### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->

 - [ ] I have tested these changes locally
 - [ ] I have updated the documentation accordingly
 - [ ] I have added/updated tests as needed
 - [ ] All existing tests pass successfully
cspades pushed a commit that referenced this issue May 4, 2025
… subpackages and update benchmarks (#803)

Signed-off-by: Cory Ye <cye@nvidia.com>
farhadrgh pushed a commit that referenced this issue May 5, 2025
… subpackages and update benchmarks (#803)

Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>