fix: The final batch of an epoch is skipped when batch size is 1 #3653
Conversation
Unit Test Results: 6 files (+1), 6 suites (+1), 53m 19s ⏱️ (+24m 19s). For more details on these failures, see this check. Results for commit 46e1b19. Comparison against base commit ef2c14a.
ludwig/trainers/trainer.py (Outdated)

```diff
@@ -848,7 +848,9 @@ def train(
             should_shuffle=self.should_shuffle,
             random_seed=self.random_seed,
             distributed=self.distributed,
-            ignore_last=True,
+            ignore_last=(
+                self.model.type() != MODEL_LLM
```
This seems like an incidental thing (this would apply equally to ECD with batch size 1, and conversely, not apply if the LLM has batch size > 1). I could see you setting this based on whether `self.batch_size > 1`, but isn't this obviated by the above change to ignore the skipping behavior when the batch size is > 1?
Instead of checking for batch size in the batcher, I'm thinking that it might be clearer to do that check in the trainer and set `ignore_last` explicitly:

```python
# `ignore_last` skips the last batch of an epoch if the last batch only
# has one example in it. If the batch size is exactly 1, then we set ignore_last=False.
ignore_last = self.batch_size > 1

...

with training_set.initialize_batcher(
    batch_size=self.batch_size,
    should_shuffle=self.should_shuffle,
    random_seed=self.random_seed,
    distributed=self.distributed,
    ignore_last=ignore_last,
    ...
```
This way, `ignore_last` continues to straightforwardly ignore the last batch if it only contains one example, and the exceptional case when `batch_size == 1`, which applies to both ECD and LLM model types, is managed outside of the batcher.

I can see an argument that the responsibility of handling the exceptional `batch_size == 1` case should belong to the batcher, in which case I think we can keep the trainer code the same as before, with `ignore_last=True` always for both model types. WDYT?
I went ahead and made the latter change to always use `ignore_last=True` while keeping the `batch_size` check in the batcher. @tgaddair does this LGTY?

Mentioning #2778 as the original change that introduced this issue.
Can we add a test?
Done.
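For reference, a minimal sketch of the property such a regression test could check. The `iter_epoch_batches` helper below is only an illustrative stand-in for the batching logic under discussion, not Ludwig's `RandomAccessBatcher`; an actual test would exercise the real batcher or a short training run with batch size 1.

```python
import pytest


def iter_epoch_batches(num_samples: int, batch_size: int, ignore_last: bool):
    """Stand-in batcher: yield index batches for one epoch.

    A trailing batch with exactly one example is skipped only when
    `ignore_last` is set AND the batch size is greater than 1.
    """
    index = 0
    while index < num_samples:
        remaining = num_samples - index
        if ignore_last and batch_size > 1 and remaining == 1:
            break  # drop the single-example trailing batch
        end = min(index + batch_size, num_samples)
        yield list(range(index, end))
        index = end


@pytest.mark.parametrize("num_samples", [1, 5, 17])
def test_batch_size_one_consumes_every_sample(num_samples):
    # Regression check for this fix: with batch_size == 1, no batch is dropped.
    batches = list(iter_epoch_batches(num_samples, batch_size=1, ignore_last=True))
    assert sum(len(b) for b in batches) == num_samples


def test_trailing_singleton_still_skipped_for_larger_batches():
    # 7 samples with batch_size 3 leave a trailing batch of exactly 1 example,
    # which ignore_last should continue to drop.
    batches = list(iter_epoch_batches(7, batch_size=3, ignore_last=True))
    assert sum(len(b) for b in batches) == 6
```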
When training with batch size 1 using the `RandomAccessBatcher`, the final batch of an epoch is always dropped because of this condition. Skipping batches leads to downstream inconsistencies with evaluation and metrics logging, and the trainer also runs an additional partial epoch to account for the missing training steps. Notably, this issue heavily impacts LLM fine-tuning, which is almost always run with batch size 1.
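To make the failure mode concrete, here is a hedged sketch of the shape of that check, based on the intent described in the discussion above (drop a trailing batch that contains exactly one example). The function name and structure are illustrative, not the verbatim `RandomAccessBatcher` source.

```python
def last_batch_before_fix(num_samples: int, index: int, batch_size: int, ignore_last: bool) -> bool:
    """Illustrative stand-in (not Ludwig source) for the pre-fix end-of-epoch check.

    `ignore_last` is meant to drop a trailing batch that contains exactly one
    example. With batch_size == 1, the trailing batch always contains exactly
    one example, so the check fires on every epoch and the last sample is dropped.
    """
    remaining = num_samples - index
    return index >= num_samples or (ignore_last and remaining == 1)


# With 5 samples and batch_size == 1, the batcher reports "last batch" while
# sample 4 has not been yielded yet, so it is silently skipped:
assert last_batch_before_fix(num_samples=5, index=4, batch_size=1, ignore_last=True)
```

Gating that check on `batch_size > 1` inside the batcher, as this PR does, keeps the intended behavior for larger batch sizes while never firing when the batch size is 1.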