I'm running Task 1 zero-shot evaluation on an NVIDIA A100 (80 GB VRAM) using the default batch_size=1024 in gena_lm.py and encountering CUDA OOM errors. Could you please clarify what batch sizes you used for this task? If 1024 was used, how did you avoid running into memory issues on similar hardware?
Thanks for your help!
We did run our evaluations on similar hardware using the parameters we specified in the code. However, we noticed that GENA-LM's tokenizer has a tendency to produce large variations in sequence lengths, resulting in significant fluctuations in memory usage from batch to batch.
I'd expect that decreasing batch size would mitigate this, but let us know if issues persist.
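In case it helps anyone hitting the same issue: one way to keep memory roughly constant despite the variable sequence lengths is to group sequences by a token budget instead of a fixed `batch_size`. This is a minimal sketch, not the logic in `gena_lm.py`; the checkpoint name and the budget value are placeholders.

```python
# Sketch: token-budget batching for sequences with highly variable lengths.
# The checkpoint name and max_tokens_per_batch are illustrative assumptions,
# not the values used in gena_lm.py.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "AIRI-Institute/gena-lm-bert-base", trust_remote_code=True
)

def token_budget_batches(sequences, max_tokens_per_batch=200_000):
    """Group sequences so each padded batch stays under a rough token budget.

    Sorting by tokenized length first keeps padding (and therefore peak memory)
    much more uniform than a fixed batch_size when lengths vary a lot.
    """
    lengths = [len(tokenizer(seq)["input_ids"]) for seq in sequences]
    order = sorted(range(len(sequences)), key=lambda i: lengths[i])

    batch, batch_max_len = [], 0
    for i in order:
        new_max = max(batch_max_len, lengths[i])
        # Padded size of the batch if this sequence were added.
        if batch and (len(batch) + 1) * new_max > max_tokens_per_batch:
            yield [sequences[j] for j in batch]
            batch, batch_max_len = [], 0
            new_max = lengths[i]
        batch.append(i)
        batch_max_len = new_max
    if batch:
        yield [sequences[j] for j in batch]
```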
Reducing the batch size does indeed prevent the OOM errors, but the evaluation becomes very slow: the estimated time for this specific task is about 16 days on a single A100.
Did you observe a similar runtime in your experiments, or did you apply any additional optimizations to speed things up?
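For reference, two generic PyTorch-level optimizations that often help on an A100 are bf16 autocast and `torch.inference_mode()`. Neither is specific to this repository or confirmed as the authors' setup; the checkpoint name, `max_length`, and model head below are placeholder assumptions.

```python
# Sketch only: generic inference-side speedups on Ampere GPUs (not the authors' setup).
# bf16 autocast roughly halves activation memory and speeds up matmuls; inference_mode
# skips autograd bookkeeping entirely.
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda"
name = "AIRI-Institute/gena-lm-bert-base"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, trust_remote_code=True).to(device).eval()

@torch.inference_mode()
def forward_batch(seqs):
    enc = tokenizer(
        seqs, padding=True, truncation=True, max_length=512, return_tensors="pt"
    ).to(device)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        return model(**enc).last_hidden_state
```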