Revert: Simplify how we set pad token and pad token ID for huggingfac… by arnavgarg1 · Pull Request #3897 · ludwig-ai/ludwig

Revert: Simplify how we set pad token and pad token ID for huggingfac… #3897


Merged · 1 commit merged into master from revert_padding_tokens on Jan 18, 2024

Conversation

@arnavgarg1 (Contributor) commented on Jan 18, 2024

This revert addresses a critical issue introduced in the previous pull request (#3735), where the simplification of pad token configuration inadvertently led to the PAD token being mapped to the same token ID as the UNK (unknown) token. This mapping anomaly resulted in quality degradation during fine-tuning.

The problem surfaced during fine-tuning: instead of learning to predict an EOS (end-of-sequence) token to signal the end of a sequence, the model learned to predict an UNK token at the end of sequences. This hindered the model's ability to recognize when to halt during generation, degrading the overall performance and quality of the fine-tuned model.

This reversion aims to restore the previous pad token setup and rectify the unintended mapping issue, ensuring that the model correctly learns to predict EOS tokens for proper sequence termination during fine-tuning.
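
As background on why this matters at inference time, here is a minimal sketch (not part of the original PR; the prompt is arbitrary): generation only stops early when the model emits the EOS token ID, so a model fine-tuned to emit UNK at sequence ends never triggers that check and keeps generating until max_new_tokens.

# Hypothetical sketch (not from the PR): generate() halts early only on eos_token_id.
# A model that learned to emit <unk> (ID 0) instead of </s> (ID 2) at sequence ends
# never satisfies the stopping check and runs to max_new_tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Translate to French: cat", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))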

Demonstration of the bug that was introduced, using Llama-2

Current:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
>>> tokenizer
LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-7b-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
>>> tokenizer.pad_token = "[PAD]"
>>> tokenizer
LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-7b-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '[PAD]'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
>>> tokenizer.pad_token_id
0
>>> str2idx = tokenizer.vocab
>>> idx2str = {v: k for k, v in str2idx.items()}
>>> idx2str[0]
'<unk>'
>>> tokenizer.unk_token_id
0

The issue here is that the new PAD token gets mapped to the same token ID as the UNK token, so with right-padding the trailing tokens passed into the model's forward pass are UNK tokens.
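
To make the impact on a batch concrete, here is a minimal sketch (not part of the original PR; the sample sentences are arbitrary) of what right-padding looks like under the buggy setup: every padded position carries ID 0, which is also the UNK ID.

# Hypothetical illustration (not from the PR) of the buggy setup during batching.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = "[PAD]"  # "[PAD]" is not in the vocab, so it resolves to ID 0, same as <unk>

batch = tokenizer(["a longer example sentence", "short"], padding=True)
print(tokenizer.pad_token_id)                    # 0 -- identical to tokenizer.unk_token_id
print(batch["input_ids"][1])                     # the shorter sequence is right-padded with 0s
print(tokenizer.decode(batch["input_ids"][1]))   # the padding decodes as '<unk>'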

This is what used to happen before (and what this revert goes back to):

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
>>> tokenizer.pad_token = tokenizer.eos_token
>>> tokenizer.pad_token_id = tokenizer.eos_token_id
>>> tokenizer.pad_token
'</s>'
>>> tokenizer.pad_token_id
2
>>> tokenizer.eos_token
'</s>'
>>> tokenizer.eos_token_id
2
>>> tokenizer.unk_token_id
0
>>> tokenizer.unk_token
'<unk>'
>>> tokenizer
LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-7b-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
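
For comparison, here is a minimal sketch (again not part of the PR; same arbitrary sentences as above) of the restored setup applied to the same batching step: with the pad token aliased to the EOS token, padded positions carry ID 2 (</s>) instead of ID 0 (<unk>).

# Hypothetical illustration (not from the PR) of the restored pad-token setup this PR reverts to.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token        # '</s>'
    tokenizer.pad_token_id = tokenizer.eos_token_id  # 2

batch = tokenizer(["a longer example sentence", "short"], padding=True)
print(batch["input_ids"][1])  # the shorter sequence is right-padded with 2s (</s>), not 0s (<unk>)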

@alexsherstinsky (Collaborator) left a comment


LGTM


Unit Test Results

  6 files  ±0    6 suites  ±0   14m 19s ⏱️ +9s
12 tests ±0    9 ✔️ passed ±0    3 💤 skipped ±0    0 failed ±0
60 runs  ±0  42 ✔️ passed ±0  18 💤 skipped ±0    0 failed ±0

Results for commit 224bebf. ± Comparison against base commit 3203cc1.

@arnavgarg1 changed the title from "[WIP] Revert: Simplify how we set pad token and pad token ID for huggingfac…" to "Revert: Simplify how we set pad token and pad token ID for huggingfac…" on Jan 18, 2024
@arnavgarg1 merged commit 9eb113b into master on Jan 18, 2024
@arnavgarg1 deleted the revert_padding_tokens branch on January 18, 2024 22:52