TSI tables content as utf-8 and unicode instead of bytes by moyogo · Pull Request #1060 · fonttools/fonttools


Merged
2 commits merged into fonttools:master on Sep 21, 2017

Conversation

@moyogo
Collaborator
commented Sep 21, 2017

No description provided.

@brawer brawer merged commit 7f352b0 into fonttools:master Sep 21, 2017
@moyogo moyogo deleted the TSI1 branch September 21, 2017 20:44
@anthrotype
Member

@moyogo I remember we found that in some old VTT 4.x versions text was encoded with some single-byte encoding (Windows ANSI cp1252, or MacRoman, I don't quite remember).
Would enforcing UTF-8 here break those?
Not that I care. I'm fine with only supporting the latest VTT 6.x series.

@robmck-ms
Contributor

This came in when we added the XML import/export in VTT 6.01. Prior to that, VTT used a single-byte encoding; after that, it used UTF-8, but could still read the single-byte encoding as well. So, fonts created in >6.01 can't be read by <6.01, but any fonts can be read by >6.01. So, it all comes down to how many people are using VTT 4.x. (We don't have telemetry on that.)

@moyogo
Collaborator Author
commented Sep 29, 2017

We can add the following to decompile(), which would convert CP1252 VTT tables to UTF-8 when round-tripping through fontTools.

from fontTools.misc.py23 import tounicode

# Try UTF-8 first (VTT 6.01 and later); fall back to cp1252 for older sources.
try:
    text = tounicode(text, encoding='utf-8')
except UnicodeDecodeError:
    text = tounicode(text, encoding='cp1252')

Alternatively, we could add an encoding attribute so that round-tripping does not lose the original encoding.
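For illustration, a rough sketch of that idea (the decode_tsi_text helper below is hypothetical, not part of fontTools): decode as UTF-8 first, fall back to cp1252, and remember which encoding was used so it could later be written out as an XML attribute.

from fontTools.misc.py23 import tounicode

def decode_tsi_text(data):
    # Try UTF-8 first (VTT 6.01 and later); fall back to cp1252 for older
    # sources, and report which encoding was used so it can be round-tripped.
    try:
        return tounicode(data, encoding='utf-8'), 'utf-8'
    except UnicodeDecodeError:
        return tounicode(data, encoding='cp1252'), 'cp1252'

text, detected = decode_tsi_text(b'changes cvt by \xb11/64 pixel')
# detected == 'cp1252', and text contains '±'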

@anthrotype
Member

@moyogo I don't think that would work reliably 100% of the time.

Some sequences of bytes could be decoded with either utf-8 or cp1252. E.g.

>>> b'\xc2\xa9'.decode('utf-8')
'©'
>>> b'\xc2\xa9'.decode('cp1252')
'Â©'

Unless you rule out the possibility that one may actually wish to write 'Â©' instead of '©'...

Even adding an encoding attribute in the XML dump would not solve the problem of detecting which encoding was used for the binary TSI table text.
If the VTT version were stored inside these tables somewhere, we could pick one encoding or the other accordingly.

I think it'd be better to revert this, and let clients (e.g. vttLib) do the decoding, as they are more likely to know which VTT version was used to generate the TSI tables.
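To make that concrete, here is a rough sketch of what client-side decoding could look like if the table kept raw bytes; the glyphPrograms attribute, the produced_by_old_vtt flag, and the font file name are assumptions for illustration only.

from fontTools.ttLib import TTFont

def decode_tsi_programs(font, produced_by_old_vtt=False):
    # The client knows which VTT version produced the sources,
    # so it can pick the right encoding itself.
    encoding = 'cp1252' if produced_by_old_vtt else 'utf-8'
    tsi1 = font['TSI1']
    return {
        name: text.decode(encoding) if isinstance(text, bytes) else text
        for name, text in tsi1.glyphPrograms.items()
    }

font = TTFont('MyHintedFont.ttf')  # hypothetical input font
programs = decode_tsi_programs(font, produced_by_old_vtt=True)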

@khaledhosny
Collaborator

I have a bunch of fonts with TSI* tables that fail to load with 'utf-8' decode errors, so should this change be reverted or is it too late now?

@anthrotype
Member

I think it's a bit too late to revert. Do your fonts work if the encoding is "cp1252"? At least they won't crash, since nearly every possible byte sequence can be decoded that way, though the result may contain gibberish.

If we think that crashing is worse than old VTT sources occasionally being misinterpreted as utf-8 (the 'mojibake' case above), then this would work:

try:
    text = tounicode(text, encoding='utf-8')
except UnicodeDecodeError:
    text = tounicode(text, encoding='cp1252')

There are also third-party libraries like charset-normalizer (used by requests) that help auto-detect the 'best' encoding, but that would probably be overkill for this.
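For completeness, a sketch of what that detection could look like, assuming charset-normalizer's from_bytes API; again, probably more machinery than this table needs.

from charset_normalizer import from_bytes

def guess_decode(data, fallback='cp1252'):
    # Let the detector pick the most plausible encoding; if it gives up,
    # fall back to cp1252 with replacement characters so we never crash.
    best = from_bytes(data).best()
    if best is not None:
        return str(best)
    return data.decode(fallback, errors='replace')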

@anthrotype
Member

> This came in when we added the XML import/export in VTT 6.01. Prior to that, VTT used a single-byte encoding; after that, it used UTF-8, but could still read the single-byte encoding as well. So, fonts created in >6.01 can't be read by <6.01, but any fonts can be read by >6.01. So, it all comes down to how many people are using VTT 4.x.

How does VTT 6 know whether it's reading one of those old single-byte-encoded sources or UTF-8 text? Is the VTT version stored somewhere?

@khaledhosny
Collaborator

I tried a try/except like you propose and the files loaded fine, but I don't know enough about VTT sources to tell whether it broke anything.

@anthrotype
Member

Normally non-ASCII characters would only appear in the inline comments, because the instruction names are plain ASCII (which both utf-8 and cp1252 decode the same way). Compilability should be preserved.
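As a quick sanity check of that point (the instruction text below is just an arbitrary ASCII example):

data = b'SVTCA[Y]\nMIAP[R], 1, 2\n'
# Pure ASCII bytes decode to the same text under both encodings.
assert data.decode('utf-8') == data.decode('cp1252') == data.decode('ascii')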

@BoldMonday

I noticed that instructions from the VTT autohinter contain comments with the plus-minus symbol (±):
fn changes <cvt> by <amount> (in ±1/64 pixel) at ppem sizes

Like @anthrotype said, this can only happen in comments; VTT instructions are all plain ASCII.
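That plus-minus sign is exactly the kind of character that distinguishes the two encodings, which is why the UTF-8-then-cp1252 fallback handles these autohinter comments; a small check:

assert '±'.encode('utf-8') == b'\xc2\xb1'
assert '±'.encode('cp1252') == b'\xb1'
try:
    b'\xb1'.decode('utf-8')  # the old single-byte form is not valid UTF-8
except UnicodeDecodeError:
    pass  # so old cp1252 sources fall through to the cp1252 branch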
