Ensure unicode support, strip carriage returns from vocab #215

ArtanisTheOne · 2025-03-09T22:40:09Z

No description provided.

ArtanisTheOne · 2025-03-09T23:03:18Z

eole/inputters/inputter.py

setup.py

ArtanisTheOne · 2025-03-11T17:06:40Z

I've fixed eole's build_vocab so that it no longer writes carriage returns to the saved file. However, I think it is still worth having the vocab loader handle the \r\n sequence, as insurance for any vocab file a user may want to use that was not generated by the build_vocab step.
Splitting lines on r"\r?\n" will not interfere with \r characters inside the vocabulary tokens themselves. I've also restored the sentencepiece version specification that was there previously.
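To illustrate the claim about r"\r?\n", here is a minimal sketch (not the code from this PR) comparing that split with str.splitlines(), which also breaks on a lone \r. The helper name split_vocab_lines and the sample data are assumptions made up for the example.

```python
import re


def split_vocab_lines(text):
    # Hypothetical helper: break only on LF, optionally preceded by CR,
    # and drop the empty piece left after the final newline.
    return [line for line in re.split(r"\r?\n", text) if line]


# Sample vocab text with Windows line endings and a token containing "\r".
raw = "hello\t10\r\nfoo\rbar\t3\r\nworld\t7\n"

lines = split_vocab_lines(raw)
naive = raw.splitlines()  # also splits on the "\r" inside "foo\rbar"
```

With this input, lines keeps "foo\rbar\t3" as one entry, while naive cuts it into "foo" and "bar\t3".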

vince62s · 2025-03-11T17:15:39Z

It won't work, for instance, with llama vocabs where "\r" is a token on its own line.

ArtanisTheOne · 2025-03-11T17:17:53Z

It will, just not when that token has no associated count. The main format, i.e. "{tok}\t{count}\n", which is what build_vocab and the spm_to_vocab script produce, would work.

vince62s · 2025-03-11T17:52:34Z

The conversion script convert_HF.py takes the HF tokenizer and generates a vocab file without counts, and again, some vocabs have "\r" alone or with extra characters in front of it. It is what it is; Eole is not just an NMT toolkit anymore.
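The disagreement above can be reproduced with a small sketch, assuming a vocab file is read as one string and split on r"\r?\n". The function tokens_from and both sample strings are hypothetical; they are not taken from eole or convert_HF.py.

```python
import re


def tokens_from(text):
    # Hypothetical loader: split on LF (optionally preceded by CR),
    # drop the empty piece after the final newline, keep the first field.
    pieces = re.split(r"\r?\n", text)
    if pieces and pieces[-1] == "":
        pieces.pop()
    return [piece.split("\t")[0] for piece in pieces]


# "{tok}\t{count}\n" format: the "\r" token is followed by a tab, so it survives.
with_counts = "a\t5\n\r\t3\nb\t2\n"

# Count-less format: the "\r" token sits directly before "\n" and is swallowed.
without_counts = "a\n\r\nb\n"

kept = tokens_from(with_counts)
lost = tokens_from(without_counts)
```

In the count-less case the line that held "\r" comes back as an empty string, which is the failure described for vocabs without counts.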

ArtanisTheOne · 2025-03-11T18:41:12Z

Looking at it again, a simpler way is to just run .strip() on the counter found when parsing the line, which prevents the counter check from failing on isdigit(). This won't change HF model behavior. I've adjusted the PR accordingly.
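A rough sketch of the strip-the-counter idea, assuming each line arrives with its trailing "\n" already removed and that a count, when present, is the last tab-separated field. The function parse_vocab_line is hypothetical and may differ from the actual eole parser.

```python
def parse_vocab_line(line):
    """Return (token, count), or (token, None) when the line has no count."""
    token, sep, counter = line.rpartition("\t")
    if sep:
        # Strip only the counter, so a trailing "\r" from a CRLF file
        # no longer makes isdigit() fail.
        counter = counter.strip()
        if counter.isdigit():
            return token, int(counter)
    # No tab or no numeric counter: the whole line is the token, untouched.
    return line, None


crlf_line = parse_vocab_line("hello\t10\r")  # count written with CRLF ending
bare_cr = parse_vocab_line("\r")             # count-less "\r" token
cr_with_count = parse_vocab_line("\r\t5")    # "\r" token with a count
```

Because only the counter field is stripped, count-less lines such as a bare "\r" token pass through unchanged.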

Address windows issues

d7168d1

vince62s reviewed Mar 10, 2025

eole/inputters/inputter.py (outdated)

vince62s reviewed Mar 10, 2025

setup.py (outdated)

ArtanisTheOne added 2 commits March 10, 2025 09:28

Revert sentencepiece version change, refine removal of carriage return to only be in newlines

e5699f8

Save built vocab in byte mode to avoid recall chars

c19ee8a

ArtanisTheOne changed the title ~~Ensure unicode support, strip recall chars from vocab~~ Ensure unicode support, strip carriage returns from vocab Mar 11, 2025

ArtanisTheOne added 3 commits March 11, 2025 14:02

Merge branch 'eole-nlp:main' into main

79cc125

Open all vocab files in unicode mode universally

1ff722c

Adjust from trying to split lines with recall -> stripping from last part of line split

26f272d

vince62s merged commit f6576a2 into eole-nlp:main Mar 12, 2025
2 checks passed
