Ensure unicode support, strip carriage returns from vocab by ArtanisTheOne · Pull Request #215 · eole-nlp/eole · GitHub

Ensure unicode support, strip carriage returns from vocab #215

Merged
merged 6 commits into from
Mar 12, 2025

Conversation

ArtanisTheOne
Contributor

No description provided.

@ArtanisTheOne
Contributor Author

#214

@ArtanisTheOne ArtanisTheOne changed the title from "Ensure unicode support, strip recall chars from vocab" to "Ensure unicode support, strip carriage returns from vocab" Mar 11, 2025
@ArtanisTheOne
Contributor Author
ArtanisTheOne commented Mar 11, 2025

I've fixed eole's build_vocab so that it no longer writes carriage returns to the saved file. However, I think it is worth keeping the \r\n handling in the vocab loader as insurance for any file a user may want to use that was not generated by the build_vocab step.
Splitting lines on r"\r?\n" will not interfere with \r characters included in the vocabulary tokens themselves. I've also restored the sentencepiece version specification that was there previously.
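
A minimal sketch of the splitting behavior described above, not the actual loader code from this PR. The function name `split_vocab_lines` and the sample data are hypothetical; the only thing taken from the comment is the `r"\r?\n"` pattern.

```python
import re


def split_vocab_lines(text: str) -> list:
    """Split on LF or CRLF and drop empty lines.

    A '\r' that is not immediately followed by '\n' is left inside the line.
    """
    return [line for line in re.split(r"\r?\n", text) if line]


# Hypothetical "{tok}\t{count}" file with Windows line endings,
# where one token contains an embedded carriage return.
sample = "hello\t10\r\nwor\rld\t5\r\nfoo\t3\n"
lines = split_vocab_lines(sample)
```

Here the line-ending `\r` is consumed by the separator, while the `\r` inside `wor\rld` survives because it is not directly followed by `\n`.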

@vince62s
Contributor

It won't work, for instance, with Llama vocabs where "\r" is a token on its own line.

@ArtanisTheOne
Contributor Author
ArtanisTheOne commented Mar 11, 2025

It will, as long as that token has an associated count. The main format, e.g. "{tok}\t{count}\n", which is what build_vocab and the spm_to_vocab script produce, would work.

@vince62s
Contributor

The conversion script convert_HF.py takes the HF tokenizer and generates a vocab file without counts, and again, some vocabs have "\r" alone or with extra characters in front of it. It is what it is: Eole is not just an NMT toolkit anymore.
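
To illustrate the concern, here is a small sketch comparing a CRLF-aware regex split with a plain split on "\n". The sample vocab and the variable names are hypothetical; it assumes a count-less file with one token per line, as described for the HF conversion.

```python
import re

# Hypothetical count-less vocab, one token per line, where "\r" is itself a token.
text = "<s>\n\r\n</s>\n"

# CRLF-aware split: the "\r" token is swallowed as part of the separator.
regex_lines = re.split(r"\r?\n", text)

# Plain split on "\n": the "\r" token is kept as its own line.
plain_lines = text.split("\n")
```

With the regex split, the second entry becomes an empty string, so the "\r" token is lost; the plain split preserves it.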

@ArtanisTheOne
Contributor Author

Looking at it again, a simpler way is to just run .strip() on the counter found when parsing the line, which prevents the isdigit() check on the counter from failing because of a trailing \r. This won't change HF model behavior. I've adjusted the PR accordingly.
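
A rough sketch of the idea, assuming lines are split on "\n" only and the count, when present, follows the last tab. The helper `parse_vocab_line` is hypothetical and is not the code merged in this PR.

```python
from typing import Optional, Tuple


def parse_vocab_line(line: str) -> Tuple[str, Optional[int]]:
    """Parse '{tok}\t{count}' or a bare '{tok}' line (hypothetical helper)."""
    if "\t" in line:
        token, _, count = line.rpartition("\t")
        # Strip only the counter, so a trailing '\r' from a CRLF file
        # does not make the isdigit() check fail.
        count = count.strip()
        if count.isdigit():
            return token, int(count)
    # No usable count: keep the line untouched, so a bare "\r" token survives.
    return line, None


with_count = parse_vocab_line("hello\t10\r")
bare_cr = parse_vocab_line("\r")
```

Because only the counter is stripped, a count-less line such as a lone "\r" is returned unchanged, while a CRLF-terminated "{tok}\t{count}" line still parses.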

@vince62s vince62s merged commit f6576a2 into eole-nlp:main Mar 12, 2025
2 checks passed