[BUG] Evo2 pretraining diverges for TE > 1.13 · Issue #794 · NVIDIA/bionemo-framework

[BUG] Evo2 pretraining diverges for TE > 1.13 #794


Open
dorotat-nv opened this issue Mar 31, 2025 · 0 comments
Labels
bug Something isn't working Evo2

Comments

dorotat-nv (Collaborator) commented Mar 31, 2025

BioNeMo Framework Version

4ea7cbc

Bug Description

Training of Evo2 1b 8k diverges when the TransformerEngine (TE) version is greater than 1.13. See the PR with a hotfix (hardcoding the TE version) and the description in #791.
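
A runtime guard along the following lines can fail fast when an image ships a newer TE than the last version known to train stably. This is a minimal sketch, assuming TE is installed as the distribution "transformer-engine"; the name and the check itself are illustrative, not part of the hotfix PR.

```python
# Minimal sketch: abort early if the installed TransformerEngine is newer than
# the last known-good 1.13. The distribution name "transformer-engine" is an
# assumption; adjust it to however TE is installed in the image.
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version

KNOWN_GOOD = Version("1.13")

try:
    te_version = Version(version("transformer-engine"))
except PackageNotFoundError:
    raise SystemExit("TransformerEngine not found in this environment")

if te_version.release[:2] > KNOWN_GOOD.release[:2]:
    raise SystemExit(
        f"TransformerEngine {te_version} > {KNOWN_GOOD}: Evo2 1b 8k pretraining "
        "has been observed to diverge with this version (see #791)."
    )
```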

Steps to Reproduce

  1. Build a Docker image with TE > 1.13, i.e. the commit referenced in this issue
  2. Run the training command below for 6K steps (a sketch for spotting the divergence in the TensorBoard logs follows the command)

train_evo2 -d /workspace/bionemo2/sub-packages/bionemo-evo2/examples/configs/full_pretrain_shortphase_config.yaml --dataset-dir /data/evo2 --grad-acc-batches 1 --fp8 --fp8-wgrad --activation-checkpoint-recompute-num-layers 5 --enable-preemption --ckpt-async-save --use-megatron-comm-overlap-llama3-8k --overlap-grad-reduce --clip-grad=250 --eod-pad-in-loss-mask --seq-length=8192 --seed 3735928559 --lr=0.00015 --wd=0.1 --min-lr=1.5e-05 --warmup-steps=5000 --tensor-parallel-size=1 --context-parallel-size=1 --pipeline-model-parallel-size=1 --workers 8 --num-nodes=4 --devices=8 --micro-batch-size=8 --model-size=1b --max-steps=490000 --early-stop-on-step 6900 --limit-val-batches=20 --log-every-n-steps=50 --val-check-interval=500 --create-tflops-callback --create-tensorboard-logger --result-dir=./results --disable-checkpointing
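
To confirm the divergence (as opposed to the crash alone), the TensorBoard scalars written under --result-dir can be scanned for a loss blow-up. This is a hedged sketch: the scalar tag "reduced_train_loss", the log directory, and the threshold are assumptions that may need adjusting to the actual run layout.

```python
# Minimal sketch: scan TensorBoard scalars for NaN or a loss blow-up to
# confirm divergence. Tag name, log directory, and threshold are assumptions.
import math

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("./results")  # point at the directory containing the event files
acc.Reload()

for event in acc.Scalars("reduced_train_loss"):
    if math.isnan(event.value) or event.value > 50.0:
        print(f"suspicious loss at step {event.step}: {event.value}")
```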

Error Messages and Logs

The training crashes with the following traceback:

29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/core/module.py", line 1306, in optimizer_step
29: [rank29]:     optimizer.step(closure=optimizer_closure)
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/core/optimizer.py", line 153, in step
29: [rank29]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
29: [rank29]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 685, in optimizer_step
29: [rank29]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
29: [rank29]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
29: [rank29]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
29: [rank29]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/strategy.py", line 238, in optimizer_step
29: [rank29]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
29: [rank29]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
29: [rank29]:     return optimizer.step(closure=closure, **kwargs)
29: [rank29]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/torch/optim/lr_scheduler.py", line 140, in wrapper
29: [rank29]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
29: [rank29]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/nemo/core/optim/mcore_optim.py", line 129, in step
29: [rank29]:     loss = closure()
29: [rank29]:            ^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/plugins/precision/precision.py", line 108, in _wrap_closure
29: [rank29]:     closure_result = closure()
29: [rank29]:                      ^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
29: [rank29]:     self._result = self.closure(*args, **kwargs)
29: [rank29]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
29: [rank29]:     return func(*args, **kwargs)
29: [rank29]:            ^^^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
29: [rank29]:     step_output = self._step_fn()
29: [rank29]:                   ^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 317, in _training_step
29: [rank29]:     training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
29: [rank29]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 319, in _call_strategy_hook
29: [rank29]:     output = fn(*args, **kwargs)
29: [rank29]:              ^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 619, in training_step
29: [rank29]:     out = self.model.training_step(dataloader_iter, *args, **kwargs)
29: [rank29]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 384, in training_step
29: [rank29]:     return self._step(
29: [rank29]:            ^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 496, in _step
29: [rank29]:     return self.forward(
29: [rank29]:            ^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 346, in forward
29: [rank29]:     microbatch_outputs = step()
29: [rank29]:                          ^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 1251, in __call__
29: [rank29]:     return self.forward_backward_func(
29: [rank29]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/megatron/core/pipeline_parallel/schedules.py", line 488, in forward_backward_no_pipelining
29: [rank29]:     backward_step(input_tensor, output_tensor, output_tensor_grad, model_type, config)
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/megatron/core/pipeline_parallel/schedules.py", line 368, in backward_step
29: [rank29]:     torch.autograd.backward(output_tensor[0], grad_tensors=output_tensor_grad[0])
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/__init__.py", line 347, in backward
29: [rank29]:     _engine_run_backward(
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
29: [rank29]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
29: [rank29]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/megatron/core/distributed/distributed_data_parallel.py", line 386, in hook
29: [rank29]:     self.param_to_bucket_group[param].register_grad_ready(param)
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/megatron/core/distributed/param_and_grad_buffer.py", line 434, in register_grad_ready
29: [rank29]:     self.start_grad_sync()
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/megatron/core/distributed/param_and_grad_buffer.py", line 292, in start_grad_sync
29: [rank29]:     self.check_grads(
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/megatron/core/distributed/param_and_grad_buffer.py", line 172, in check_grads
29: [rank29]:     rerun_state_machine.validate_result(
29: [rank29]:   File "/usr/local/lib/python3.12/dist-packages/megatron/core/rerun_state_machine.py", line 505, in validate_result
29: [rank29]:     raise RuntimeError(full_message)
29: [rank29]: RuntimeError: Rank 29, node ....., device 5, iteration -1: Unexpected result inf (message='found Inf in local grad norm for bucket #0 in backward pass before data-parallel communication collective')
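
For localizing the failure, a per-parameter scan after the backward pass can show which local gradient goes non-finite before Megatron's bucket-level grad-norm check raises. This is a debugging sketch, not Megatron-LM's check_grads; the function name and the assumption that `model` is the unwrapped torch.nn.Module are illustrative.

```python
# Minimal debugging sketch (not Megatron-LM's check_grads): after a backward
# pass, scan local parameter gradients for Inf/NaN to see which parameter
# triggers the "found Inf in local grad norm" error above.
import torch


def report_nonfinite_grads(model: torch.nn.Module) -> None:
    """Print every parameter whose local gradient contains Inf or NaN."""
    for name, param in model.named_parameters():
        grad = param.grad
        if grad is None:
            continue
        if not torch.isfinite(grad).all():
            # The norm itself is inf/nan here, which is what Megatron's
            # bucket-level grad-norm check detects.
            print(f"non-finite grad in {name}: local norm = {grad.float().norm().item()}")
```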

Docker Image

No response

System Information

Environment Details:

  • OS: [e.g., Ubuntu 20.04]
  • CPU: [e.g., Intel i9-12900K]
  • RAM: [e.g., 64GB]

GPU Details:

  • GPU Model: [e.g., NVIDIA RTX 4090]
  • GPU Memory: [e.g., 24GB]
  • CUDA Version: [e.g., 12.1]
  • CUDA Driver: [e.g., 525.85.05]
  • cuDNN Version: [e.g., 8.9.0]

Additional Context

No response

@dorotat-nv dorotat-nv added bug Something isn't working Evo2 labels Mar 31, 2025
@dorotat-nv dorotat-nv changed the title [BUG] Evo2 pretraining diverges with TE > 1.13 [BUG] Evo2 pretraining diverges for TE > 1.13 Mar 31, 2025