Infixes Update Not Applying Properly to Tokenizer
Description
I tried updating the infix patterns in spaCy, but the changes are not applied by the tokenizer. Specifically, I'm trying to change how apostrophes (') and other symbols are handled. However, even after assigning a newly compiled regex, the tokenizer output does not reflect the change.
Steps to Reproduce
Here are the two approaches I tried:
1️⃣ Removing apostrophe-related rules from infixes and recompiling:
# Keep only the default infix patterns that do not contain an apostrophe
default_infixes = [pattern for pattern in nlp.Defaults.infixes if "'" not in pattern]
infix_re = spacy.util.compile_infix_regex(default_infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
Issue: Even after modifying the infix rules, contractions like "can't" are still split incorrectly.
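For anyone trying to reproduce this, here is a minimal self-contained sketch of the first approach. It assumes a blank English pipeline created with spacy.blank("en") (the report does not say how nlp was built), and the sample sentence is made up. The Tokenizer.explain call at the end is not part of my original attempt; it is a debugging aid that reports which rule produced each piece, which may help narrow down whether the split comes from the infix rules at all.

```python
import spacy
from spacy.util import compile_infix_regex

# Assumption: a blank English pipeline; the original report does not show how `nlp` was created.
nlp = spacy.blank("en")
text = "I can't open the well-known file"  # hypothetical sample sentence

# Tokens with the default rules
before = [t.text for t in nlp(text)]

# Approach 1: drop every default infix pattern that contains an apostrophe
filtered = [p for p in nlp.Defaults.infixes if "'" not in p]
nlp.tokenizer.infix_finditer = compile_infix_regex(filtered).finditer

# Tokens after swapping in the recompiled infix regex
after = [t.text for t in nlp(text)]

print(before)
print(after)

# Debugging aid: show which tokenizer rule produced each piece of "can't"
for rule, piece in nlp.tokenizer.explain("can't"):
    print(rule, repr(piece))
```

Comparing the two printed lists shows whether the recompiled regex changed anything for this sentence.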
2️⃣ Manually adding new infix rules (including hyphens, plus signs, and dollar signs):
# Append a bare apostrophe pattern to the default infix rules
infixes = nlp.Defaults.infixes + [r"'"]
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
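A self-contained sketch of the second approach, under the same assumption of a blank English pipeline. It first checks that the compiled regex matches an apostrophe at all, then tokenizes a contraction next to a made-up non-contraction string ("foo'bar" is purely illustrative) so the two cases can be compared.

```python
import spacy
from spacy.util import compile_infix_regex

# Assumption: a blank English pipeline; the original report does not show how `nlp` was created.
nlp = spacy.blank("en")

# Approach 2: append a bare apostrophe pattern to the default infix rules
infixes = list(nlp.Defaults.infixes) + [r"'"]
infix_regex = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

# Sanity check: does the compiled pattern find anything in a string with an apostrophe?
print(infix_regex.search("foo'bar") is not None)

# Compare a contraction with a hypothetical non-contraction string
for sample in ["can't", "foo'bar"]:
    print(sample, [t.text for t in nlp(sample)])
```

If the two samples come out differently, that would suggest the new regex is being picked up and something other than the infix rules decides how the contraction is split.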
Expected Behavior
- The tokenizer should correctly apply the new infix rules.
Actual Behavior
- Changes to nlp.tokenizer.infix_finditer do not seem to take effect.
Question
Am I missing something in how infix rules should be updated? Is there a correct way to override infix splitting?
Thanks for your help!