Steps to Reproduce
Build a Docker image with TE > 1.13, i.e., the commit from this issue.
BioNeMo Framework Version
4ea7cbc
Bug Description
Training of Evo2 1b 8k diverges when the TransformerEngine (TE) version is greater than 1.13. See the PR with a hotfix (hardcoding the TE version).
See description in #791
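Until the root cause is found, a guard along these lines could flag an affected environment before a long run starts. This is only a minimal sketch, not the hotfix from the PR: the distribution name `transformer_engine`, the helper names, and the choice to compare only major.minor (so 1.13.x counts as unaffected) are all assumptions made for illustration.

```python
import re
from importlib import metadata
from typing import Optional

# From this issue: divergence is reported for TE versions greater than 1.13.
# Assumption: only major.minor is compared, so any 1.13.x is treated as OK.
MAX_KNOWN_GOOD_TE = (1, 13)


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.14.0+abc'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def te_version_is_suspect(version: str) -> bool:
    """Return True if the version is newer than the last known-good TE."""
    return parse_major_minor(version) > MAX_KNOWN_GOOD_TE


def installed_te_version(dist_name: str = "transformer_engine") -> Optional[str]:
    """Look up the installed version; dist_name is an assumed package name."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_te_version()
    if found is None:
        print("TransformerEngine not found under the assumed distribution name.")
    elif te_version_is_suspect(found):
        print(f"Warning: TE {found} is newer than 1.13; training may diverge.")
    else:
        print(f"TE {found} is within the known-good range.")
```

Running the script inside the training container before `train_evo2` would show whether the image picked up a newer TE than intended.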
Steps to Reproduce
train_evo2 -d /workspace/bionemo2/sub-packages/bionemo-evo2/examples/configs/full_pretrain_shortphase_config.yaml \
  --dataset-dir /data/evo2 \
  --grad-acc-batches 1 \
  --fp8 --fp8-wgrad \
  --activation-checkpoint-recompute-num-layers 5 \
  --enable-preemption --ckpt-async-save \
  --use-megatron-comm-overlap-llama3-8k --overlap-grad-reduce \
  --clip-grad=250 --eod-pad-in-loss-mask \
  --seq-length=8192 --seed 3735928559 \
  --lr=0.00015 --wd=0.1 --min-lr=1.5e-05 --warmup-steps=5000 \
  --tensor-parallel-size=1 --context-parallel-size=1 --pipeline-model-parallel-size=1 \
  --workers 8 --num-nodes=4 --devices=8 --micro-batch-size=8 \
  --model-size=1b --max-steps=490000 --early-stop-on-step 6900 \
  --limit-val-batches=20 --log-every-n-steps=50 --val-check-interval=500 \
  --create-tflops-callback --create-tensorboard-logger \
  --result-dir=./results --disable-checkpointing
Error Messages and Logs
The training crashes
Docker Image
No response
System Information
Environment Details:
GPU Details:
Additional Context
No response