fix: The final batch of an epoch is skipped when batch size is 1 #3653
Conversation
Unit Test Results: 6 files (+1), 6 suites (+1), 53m 19s ⏱️ (+24m 19s). For more details on these failures, see this check. Results for commit 46e1b19. Comparison against base commit ef2c14a.
ludwig/trainers/trainer.py (Outdated)

```diff
@@ -848,7 +848,9 @@ def train(
             should_shuffle=self.should_shuffle,
             random_seed=self.random_seed,
             distributed=self.distributed,
-            ignore_last=True,
+            ignore_last=(
+                self.model.type() != MODEL_LLM
```
This seems like an incidental thing (this would apply equally to ECD with batch size 1, and conversely, not apply if the LLM has batch size > 1). I could see you setting this based on whether `self.batch_size > 1`, but isn't this obviated by the above change to ignore the skipping behavior when the batch size is > 1?
Instead of checking for batch size in the batcher, I'm thinking that it might be clearer to do that check in the trainer and set `ignore_last` explicitly:

```python
# `ignore_last` skips the last batch of an epoch if the last batch only
# has one example in it. If the batch size is exactly 1, then we set ignore_last=False.
ignore_last = self.batch_size > 1

...

with training_set.initialize_batcher(
    batch_size=self.batch_size,
    should_shuffle=self.should_shuffle,
    random_seed=self.random_seed,
    distributed=self.distributed,
    ignore_last=ignore_last,
    ...
```
This way, `ignore_last` continues to straightforwardly ignore the last batch if it only contains one example, and the exceptional case when `batch_size == 1`, which applies to both ECD and LLM model types, is managed outside of the batcher.

I can see an argument that the responsibility of handling the exceptional `batch_size == 1` case should belong to the batcher, in which case I think we can keep the trainer code the same as before, with `ignore_last=True` always for both model types. WDYT?
I went ahead and made the latter change to always use `ignore_last=True` while keeping the `batch_size` check in the batcher. @tgaddair does this LGTY?

Mentioning #2778 as the original change that introduced this issue.
Can we add a test?
Done.
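For reference, a minimal sketch of the property such a regression test could check. The `iter_epoch_batches` helper below is only an illustrative stand-in for the batching logic under discussion, not Ludwig's `RandomAccessBatcher`; an actual test would exercise the real batcher or a short training run with batch size 1.

```python
import pytest


def iter_epoch_batches(num_samples: int, batch_size: int, ignore_last: bool):
    """Stand-in batcher: yield index batches for one epoch.

    A trailing batch with exactly one example is skipped only when
    `ignore_last` is set AND the batch size is greater than 1.
    """
    index = 0
    while index < num_samples:
        remaining = num_samples - index
        if ignore_last and batch_size > 1 and remaining == 1:
            break  # drop the single-example trailing batch
        end = min(index + batch_size, num_samples)
        yield list(range(index, end))
        index = end


@pytest.mark.parametrize("num_samples", [1, 5, 17])
def test_batch_size_one_consumes_every_sample(num_samples):
    # Regression check for this fix: with batch_size == 1, no batch is dropped.
    batches = list(iter_epoch_batches(num_samples, batch_size=1, ignore_last=True))
    assert sum(len(b) for b in batches) == num_samples


def test_trailing_singleton_still_skipped_for_larger_batches():
    # 7 samples with batch_size 3 leave a trailing batch of exactly 1 example,
    # which ignore_last should continue to drop.
    batches = list(iter_epoch_batches(7, batch_size=3, ignore_last=True))
    assert sum(len(b) for b in batches) == 6
```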
When training with batch size 1 using the `RandomAccessBatcher`, the final batch of an epoch is always dropped because of this condition. Skipping batches leads to downstream inconsistencies with evaluation and metrics logging, and the trainer also runs an additional partial epoch to account for the missing training steps. Notably, this issue heavily impacts LLM fine-tuning, which is almost always run with batch size 1.
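To make the failure mode concrete, here is a hedged sketch of the shape of that check, based on the intent described in the discussion above (drop a trailing batch that contains exactly one example). The function name and structure are illustrative, not the verbatim `RandomAccessBatcher` source.

```python
def last_batch_before_fix(num_samples: int, index: int, batch_size: int, ignore_last: bool) -> bool:
    """Illustrative stand-in (not Ludwig source) for the pre-fix end-of-epoch check.

    `ignore_last` is meant to drop a trailing batch that contains exactly one
    example. With batch_size == 1, the trailing batch always contains exactly
    one example, so the check fires on every epoch and the last sample is dropped.
    """
    remaining = num_samples - index
    return index >= num_samples or (ignore_last and remaining == 1)


# With 5 samples and batch_size == 1, the batcher reports "last batch" while
# sample 4 has not been yielded yet, so it is silently skipped:
assert last_batch_before_fix(num_samples=5, index=4, batch_size=1, ignore_last=True)
```

Gating that check on `batch_size > 1` inside the batcher, as this PR does, keeps the intended behavior for larger batch sizes while never firing when the batch size is 1.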