[BUG] DataLoader low GPU utilization and extremely slow compared to manual batching #154318
Comments
@jobs-git Curious, why do we have such a large batch size? Also, it looks like in the 'direct access' setup we do not shuffle, whereas we do in the DataLoader.
Agreed it's not fair to compare shuffling vs. non-shuffling, but I can repro the DataLoader slowness even with
So we can accelerate model training. Setting shuffle to False in DataLoader resolves to the same outcome; it saves 1 or 2 s, but we still arrive at the same conclusion.
@jobs-git The batch size is quite large, but if that works for your pipeline, that's great! I tried playing around with your script. Using
The dataloader might be doing some extra work here that is unnecessary for this simple use case. I can try to see if there is a way to make the dataloader not do that.
Tried this; it is a decent improvement, but still significantly slower in my test: about 600% slower, which is huge for functionality that is not always needed. GPU utilization is also very low, at 10-20%. With bfloat16, the gap widens to 50 times, or 5000%! Increasing
I like DataLoader as it simplifies batching, but the bloat, slowness, and low GPU utilization are real issues that I hope can be resolved soon.
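For reference, one known way to cut DataLoader's per-sample overhead (individual `__getitem__` calls plus a Python-level collate per sample) is to hand it pre-batched indices via `BatchSampler` with `batch_size=None`, so an in-memory dataset is indexed once per batch with a single vectorized gather. A minimal sketch, assuming a `TensorDataset`-style dataset (names and sizes are illustrative, not from the original script):

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler, TensorDataset

X = torch.randn(10_000, 32)
y = torch.randint(0, 2, (10_000,))
dataset = TensorDataset(X, y)

# BatchSampler yields whole lists of indices. batch_size=None disables the
# default per-sample fetching and collation, so TensorDataset.__getitem__
# receives the full index list and performs one fancy-indexing gather.
sampler = BatchSampler(SequentialSampler(dataset), batch_size=1024, drop_last=False)
loader = DataLoader(dataset, sampler=sampler, batch_size=None)

for batch_X, batch_y in loader:
    pass  # batch_X: (1024, 32) for all but possibly the last batch
```

This keeps the DataLoader API (workers, pinning, etc.) while sidestepping most of the per-sample Python overhead.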
🐛 Describe the bug
DataLoader retrieves data about 7-22x slower (up to 50x with bfloat16) compared to direct access, even when direct access retrieves the data from the CPU.
Here is a reproducible sample:
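The original script did not survive the page extraction; below is a minimal sketch of the comparison being described (manual tensor slicing vs. iterating a default DataLoader over the same in-memory data; all names and sizes are illustrative assumptions):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

N, D, BATCH = 100_000, 64, 4096
X = torch.randn(N, D)
y = torch.randint(0, 2, (N,))

# Direct access: slice the tensors manually per batch.
t0 = time.perf_counter()
for i in range(0, N, BATCH):
    batch_X, batch_y = X[i:i + BATCH], y[i:i + BATCH]
direct_s = time.perf_counter() - t0

# DataLoader: default per-sample fetch plus collate.
loader = DataLoader(TensorDataset(X, y), batch_size=BATCH, shuffle=True)
t0 = time.perf_counter()
for batch_X, batch_y in loader:
    pass
loader_s = time.perf_counter() - t0

print(f"direct: {direct_s:.4f}s  dataloader: {loader_s:.4f}s")
```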
results are as follows:
Other ref: https://stackoverflow.com/questions/76838721/iterating-over-pytorch-dataloader-slower-than-direct-dataset-access
with: shuffle=False, num_workers=8, prefetch_factor=8, pin_memory=True and batch_X.to(device, non_blocking=True)
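Expressed as code, the configuration above looks roughly like this (a sketch; the dataset, sizes, and device handling are assumptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X = torch.randn(100_000, 64)
y = torch.randint(0, 2, (100_000,))

loader = DataLoader(
    TensorDataset(X, y),
    batch_size=4096,
    shuffle=False,
    num_workers=8,
    prefetch_factor=8,  # batches prefetched per worker; requires num_workers > 0
    pin_memory=True,    # page-locked host memory enables async host-to-device copies
)

for batch_X, batch_y in loader:
    # non_blocking only overlaps the copy with compute when the source is pinned
    batch_X = batch_X.to(device, non_blocking=True)
```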
results are as follows:
I had to reduce the batch size, since DataLoader fails to complete due to an
unexpected bus error
Versions
2.7
cc @msaroufim @jerryzh168 @andrewkho @divyanshk @ssnl @VitalyFedyunin @dzhulgakov