8000 fix: The final batch of an epoch is skipped when batch size is 1 by jeffkinnison · Pull Request #3653 · ludwig-ai/ludwig · GitHub

fix: The final batch of an epoch is skipped when batch size is 1 #3653


Merged
justinxzhao merged 5 commits into master from skipping-last-batch on Sep 26, 2023

Conversation

jeffkinnison
Contributor

When training with batch size 1 using the RandomAccessBatcher, the final batch of an epoch is always dropped because of this condition. Skipping batches leads to downstream inconsistencies with evaluation and metrics logging, and the trainer also runs an additional partial epoch to account for the missing training steps. Notably, this issue heavily impacts LLM fine-tuning, which is almost always run with batch size 1.
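To make the failure mode concrete, here is a minimal, hedged sketch (a toy generator, not Ludwig's actual batcher code; all names below are illustrative) of how a "drop the final batch if it only has one example" rule behaves when the batch size itself is 1:

```python
# Illustrative toy batcher, NOT Ludwig's actual RandomAccessBatcher: it applies
# the same "skip the final batch if it holds a single example" rule described above.
def batches(num_samples: int, batch_size: int, ignore_last: bool = True):
    steps = -(-num_samples // batch_size)  # ceiling division
    for step in range(steps):
        start = step * batch_size
        batch = list(range(start, min(start + batch_size, num_samples)))
        # The problematic condition: drop a size-1 final batch.
        if ignore_last and step == steps - 1 and len(batch) == 1:
            continue
        yield batch

print(sum(1 for _ in batches(10, 3)))  # 3 -> the size-1 remainder batch is dropped (intended behavior)
print(sum(1 for _ in batches(10, 1)))  # 9 -> one example is silently lost every epoch (the bug)
```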

@github-actions
github-actions bot commented Sep 22, 2023

Unit Test Results

6 files +1    6 suites +1    53m 19s ⏱️ +24m 19s
2 807 tests +2 776    2 793 ✔️ +2 767    12 💤 +7    2 ❌ +2
2 847 runs +2 784    2 824 ✔️ +2 775    21 💤 +7    2 ❌ +2

For more details on these failures, see this check.

Results for commit 46e1b19. ± Comparison against base commit ef2c14a.

♻️ This comment has been updated with latest results.

@@ -848,7 +848,9 @@ def train(
                 should_shuffle=self.should_shuffle,
                 random_seed=self.random_seed,
                 distributed=self.distributed,
-                ignore_last=True,
+                ignore_last=(
+                    self.model.type() != MODEL_LLM
Collaborator

This seems like an incidental thing (this would apply equally to ECD with batch size 1, and conversely, not apply if the LLM has batch size > 1). I could see you setting this based on whether self.batch_size > 1, but isn't this obviated by the above change to ignore the skipping behavior when the batch size is > 1?

Contributor

Instead of checking for batch size in the batcher, I'm thinking that it might be clearer to do that check in the trainer and set ignore_last explicitly:

# `ignore_last` skips the last batch of an epoch if the last batch only 
# has one example in it. If the batch size is exactly 1, then we set ignore_last=False.
ignore_last = self.batch_size > 1

...

with training_set.initialize_batcher(
    batch_size=self.batch_size,
    should_shuffle=self.should_shuffle,
    random_seed=self.random_seed,
    distributed=self.distributed,
    ignore_last=ignore_last,
    ...

This way, ignore_last continues to straightforwardly ignore the last batch if it only contains one example, and the exceptional case when batch_size == 1, which applies to both ECD and LLM model types, is managed outside of the batcher.

I can see an argument that the responsibility of handling the exceptional batch_size==1 case should belong to the batcher, in which case I think we can keep the trainer code the same as before, with ignore_last=True always for both model types. WDYT?
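For the second option, a hedged sketch of what the batcher-side condition might look like (should_skip_last_batch is a hypothetical helper for illustration, not a function in Ludwig): the trainer keeps passing ignore_last=True, and the batcher only drops a size-1 final batch when the configured batch size is greater than 1.

```python
# Hypothetical helper, for illustration only (not Ludwig's actual batcher code):
# drop a size-1 final batch only when the configured batch size exceeds 1, so
# batch_size=1 epochs still visit every example.
def should_skip_last_batch(ignore_last: bool, batch_size: int, last_batch_size: int) -> bool:
    return ignore_last and batch_size > 1 and last_batch_size == 1
```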

Contributor

I went ahead and made the latter change to always use ignore_last=True while keeping the batch_size check in the batcher. @tgaddair does this LGTY?

@justinxzhao
Contributor
justinxzhao commented Sep 25, 2023

Mentioning #2778 as the original change that introduced this issue.

Collaborator
@tgaddair tgaddair left a comment

Can we add a test?

@justinxzhao
Copy link
Contributor

> Can we add a test?

Done.
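For reference, the property such a test pins down, written against the hypothetical should_skip_last_batch sketch from the earlier comment (the test actually added in this PR exercises Ludwig's real trainer and batcher, not this helper):

```python
# Illustrative assertions only, not the PR's actual test code.
assert not should_skip_last_batch(ignore_last=True, batch_size=1, last_batch_size=1)  # batch_size 1: keep the last batch
assert should_skip_last_batch(ignore_last=True, batch_size=4, last_batch_size=1)      # batch_size > 1: still drop a size-1 remainder
assert not should_skip_last_batch(ignore_last=True, batch_size=4, last_batch_size=3)  # larger remainders are always kept
```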

@justinxzhao justinxzhao merged commit 4af5331 into master Sep 26, 2023
@justinxzhao justinxzhao deleted the skipping-last-batch branch September 26, 2023 22:01