TSI tables content as utf-8 and unicode instead of bytes #1060
Conversation
@moyogo I remember we found that in some old VTT 4.x versions, text was encoded with some single-byte encoding (Windows ANSI cp1252, or MacRoman, I don't quite remember).
This came in when we added the XML import/export in VTT 6.01. Prior to that, VTT used a single-byte encoding; after that, it used UTF-8, but could still read the byte encoding as well. So fonts created in >6.01 can't be read by <6.01, but any fonts can be read by >6.01. It all comes down to how many people are still using VTT 4.x (we don't have telemetry on that).
We can add the following to decompile(), which would convert cp1252 VTT tables to UTF-8 when round-tripping through fontTools:

```python
try:
    text = tounicode(text, encoding='utf-8')
except UnicodeDecodeError:
    text = tounicode(text, encoding='cp1252')
```

Alternatively, we could add an attribute …
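As a self-contained sketch of that fallback (the `decode_tsi_text` helper name is mine, not a fontTools API; it assumes the table content arrives as raw bytes):

```python
def decode_tsi_text(data: bytes) -> str:
    """Decode TSI table text, preferring UTF-8 (VTT >= 6.01) and
    falling back to cp1252 for sources written by older VTT versions."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")

# UTF-8 input passes through unchanged:
print(decode_tsi_text("± comment".encode("utf-8")))  # ± comment
# cp1252 input (0xb1 is '±' in cp1252, but invalid as standalone UTF-8)
# triggers the fallback:
print(decode_tsi_text(b"\xb1 comment"))              # ± comment
```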
@moyogo I don't think that would work reliably 100% of the time. Some sequences of bytes can be decoded with either utf-8 or cp1252. E.g.:

```python
>>> b'\xc2\xa9'.decode('utf-8')
'©'
>>> b'\xc2\xa9'.decode('cp1252')
'Â©'
```

Unless you rule out the possibility that one may actually wish to write 'Â©' instead of '©'... Even adding an … I think it'd be better to revert this, and let clients (e.g. vttLib) do the decoding, as they more likely would know what VTT version was used to generate the TSI tables.
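The ambiguity is easy to verify: the same two bytes decode successfully under both codecs, just to different strings, so a try/except fallback cannot tell the two encodings apart here:

```python
data = b"\xc2\xa9"

utf8 = data.decode("utf-8")     # one character: '©' (U+00A9)
cp1252 = data.decode("cp1252")  # two characters: 'Â©' (U+00C2 U+00A9)

# Both decodes succeed without error, yet yield different text:
assert utf8 != cp1252
```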
I have a bunch of fonts with TSI* tables that fail to load with 'utf-8' decode errors, so should this change be reverted, or is it too late now?
I think it's a bit too late to revert. Do your fonts work if the encoding is "cp1252"? At least they won't crash, since virtually every byte sequence can be decoded that way, though they may contain gibberish. If we think that crashing is worse than old VTT sources occasionally being misinterpreted as utf-8 'mojibake', then the try/except fallback above would work. There are also 3rd party libs like …
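One caveat on "every byte sequence can be decoded": Python's strict cp1252 codec actually leaves five bytes unmapped (0x81, 0x8D, 0x8F, 0x90, 0x9D), so a cp1252 fallback can still raise on pathological input; only latin-1 maps all 256 byte values. A quick check:

```python
unmapped = []
for b in range(256):
    try:
        bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        unmapped.append(b)

print([hex(b) for b in unmapped])  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']

# latin-1, by contrast, never fails:
assert len(bytes(range(256)).decode("latin-1")) == 256
```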
How does VTT 6 know if it's reading one of those old sources with single-byte encodings or UTF-8 text? Is the VTT version stored somewhere?
I tried a try/except like you propose and the files loaded fine, but I know nothing about VTT sources, so I can't tell whether it broke anything.
Normally, non-ascii characters would only appear in the inline comments, because the instructions' names are plain ascii (which both utf-8 and cp1252 decode the same way). The compile-ability should be preserved.
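This is straightforward to sanity-check: utf-8 and cp1252 are both supersets of ascii, so any pure-ascii instruction stream decodes identically under either (the sample instruction names below are illustrative, not taken from an actual TSI table):

```python
# A pure-ascii snippet in the style of VTT/TrueType assembly:
source = b"SVTCA[X]\nMDAP[R], 1\n/* plain ascii comment */\n"

# All three decodes agree byte-for-byte on ascii input:
assert source.decode("utf-8") == source.decode("cp1252") == source.decode("ascii")
```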
I noticed that instructions from the VTT autohinter contain comments with the plus-minus symbol (±).

Like @anthrotype said: this can only happen in comments. VTT instructions are all plain ascii.