Add unigram bytefallback by ArthurZucker · Pull Request #1217 · huggingface/tokenizers · GitHub

Add unigram bytefallback #1217


Merged: 52 commits, Jun 26, 2023


ArthurZucker (Collaborator) commented Apr 12, 2023

Adds support for byte fallback with the Unigram model.

chris-ha458 (Contributor) commented May 1, 2023

Something like this could build the initial vocabulary for byte_fallback.
It could also be useful for #1183 (comment).

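// Number of distinct byte values (0x00..=0xFF), one fallback token each.
const UNICODE_CAPACITY: usize = 256;

/// Builds the 256 byte-fallback tokens "<0x00>" through "<0xFF>", in byte order.
pub fn create_encoded_bytes() -> Vec<String> {
    (0..UNICODE_CAPACITY)
        .map(|i| format!("<0x{:02X}>", i))
        .collect()
}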
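
To illustrate how such a vocabulary might be used, here is a minimal sketch of the fallback step itself: a piece that is missing from the vocabulary is split into its UTF-8 bytes, and each byte is replaced by the token at the matching index. The names `BYTE_COUNT`, `byte_tokens`, and `byte_fallback` are hypothetical and are not taken from the tokenizers codebase. The sketch assumes the 256 byte tokens are stored in byte order, as the snippet above produces them.

```rust
// Hypothetical constant: one fallback token per possible byte value.
const BYTE_COUNT: usize = 256;

/// Builds "<0x00>" ..= "<0xFF>" so that the token for byte `b` sits at index `b`.
fn byte_tokens() -> Vec<String> {
    (0..BYTE_COUNT).map(|i| format!("<0x{:02X}>", i)).collect()
}

/// Hypothetical fallback step: maps every UTF-8 byte of `piece` to its byte token.
/// `byte_vocab` is assumed to hold exactly the 256 tokens from `byte_tokens()`.
fn byte_fallback(piece: &str, byte_vocab: &[String]) -> Vec<String> {
    piece
        .bytes()
        .map(|b| byte_vocab[b as usize].clone())
        .collect()
}
```

For example, "é" (U+00E9) is encoded in UTF-8 as the bytes 0xC3 0xA9, so `byte_fallback("é", &byte_tokens())` yields the two tokens `<0xC3>` and `<0xA9>`.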