TSI tables content as utf-8 and unicode instead of bytes by moyogo · Pull Request #1060 · fonttools/fonttools


Merged
2 commits merged into fonttools:master on Sep 21, 2017

Conversation

@moyogo
Collaborator
commented Sep 21, 2017

No description provided.

@brawer brawer merged commit 7f352b0 into fonttools:master Sep 21, 2017
@moyogo moyogo deleted the TSI1 branch September 21, 2017 20:44
@anthrotype
Member

@moyogo I remember we found that in some old VTT 4.x versions text was encoded with some single-byte encoding (Windows ANSI cp1252, or MacRoman, I don't quite remember).
Would enforcing UTF-8 here break those?
Not that I care. I'm fine with only supporting the latest VTT 6.x series.

@robmck-ms
Contributor

This came in when we added the XML import/export in VTT 6.01. Prior to that, VTT used a single-byte encoding; after that, it used UTF-8, but could still read the single-byte encoding as well. So, fonts created in >6.01 can't be read by <6.01, but any fonts can be read by >6.01. So, it all comes down to how many people are using VTT 4.x. (We don't have telemetry on that.)

@moyogo
Collaborator Author
commented Sep 29, 2017

We can add the following to decompile(), which would convert CP1252 VTT tables to UTF-8 when round-tripping through fontTools.

from fontTools.misc.py23 import tounicode

# Try UTF-8 first (VTT 6.01 and later); fall back to cp1252 for older sources.
try:
    text = tounicode(text, encoding='utf-8')
except UnicodeDecodeError:
    text = tounicode(text, encoding='cp1252')

Alternatively, we could add an encoding attribute so that round-tripping does not lose the original encoding.
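For illustration, a rough sketch of that idea (the decode_tsi_text helper below is hypothetical, not part of fontTools): decode as UTF-8 first, fall back to cp1252, and remember which encoding was used so it could later be written out as an XML attribute.

from fontTools.misc.py23 import tounicode

def decode_tsi_text(data):
    # Try UTF-8 first (VTT 6.01 and later); fall back to cp1252 for older
    # sources, and report which encoding was used so it can be round-tripped.
    try:
        return tounicode(data, encoding='utf-8'), 'utf-8'
    except UnicodeDecodeError:
        return tounicode(data, encoding='cp1252'), 'cp1252'

text, detected = decode_tsi_text(b'changes cvt by \xb11/64 pixel')
# detected == 'cp1252', and text contains '±'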

@anthrotype
Member

@moyogo I don't think that would work reliably 100% of the time.

Some sequences of bytes could be decoded with either utf-8 or cp1252. E.g.

>>> b'\xc2\xa9'.decode('utf-8')
'©'
>>> b'\xc2\xa9'.decode('cp1252')
'Â©'

Unless you rule out the possibility that one may actually wish to write 'Â©' instead of '©'...

Even adding an encoding attribute in the XML dump would not solve the problem of detecting which encoding was used for the binary TSI table text.
If the VTT version were stored inside these tables somewhere, we could pick one encoding or the other accordingly.

I think it'd be better to revert this, and let clients (e.g. vttLib) do the decoding, as they are more likely to know which VTT version was used to generate the TSI tables.
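To make that concrete, here is a rough sketch of what client-side decoding could look like if the table kept raw bytes; the glyphPrograms attribute, the produced_by_old_vtt flag, and the font file name are assumptions for illustration only.

from fontTools.ttLib import TTFont

def decode_tsi_programs(font, produced_by_old_vtt=False):
    # The client knows which VTT version produced the sources,
    # so it can pick the right encoding itself.
    encoding = 'cp1252' if produced_by_old_vtt else 'utf-8'
    tsi1 = font['TSI1']
    return {
        name: text.decode(encoding) if isinstance(text, bytes) else text
        for name, text in tsi1.glyphPrograms.items()
    }

font = TTFont('MyHintedFont.ttf')  # hypothetical input font
programs = decode_tsi_programs(font, produced_by_old_vtt=True)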

@khaledhosny
Collaborator

I have a bunch of fonts with TSI* tables that fail to load with 'utf-8' decode errors, so should this change be reverted or is it too late now?

@anthrotype
Member

I think it's a bit too late to revert. Do your fonts work if the encoding is "cp1252"? At least they won't crash, since nearly every possible byte sequence can be decoded that way, though the result may contain gibberish.

If we think that crashing is worse than old VTT sources occasionally being misinterpreted as utf-8 (the 'mojibake' case above), then this would work:

try:
    text = tounicode(text, encoding='utf-8')
except UnicodeDecodeError:
    text = tounicode(text, encoding='cp1252')

There are also third-party libraries like charset-normalizer (used by requests) that help auto-detect the 'best' encoding, but that would probably be overkill for this.
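For completeness, a sketch of what that detection could look like, assuming charset-normalizer's from_bytes API; again, probably more machinery than this table needs.

from charset_normalizer import from_bytes

def guess_decode(data, fallback='cp1252'):
    # Let the detector pick the most plausible encoding; if it gives up,
    # fall back to cp1252 with replacement characters so we never crash.
    best = from_bytes(data).best()
    if best is not None:
        return str(best)
    return data.decode(fallback, errors='replace')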

@anthrotype
Member

> This came in when we added the XML import/export in VTT 6.01. Prior to that, VTT used a single-byte encoding; after that, it used UTF-8, but could still read the single-byte encoding as well. So, fonts created in >6.01 can't be read by <6.01, but any fonts can be read by >6.01. So, it all comes down to how many people are using VTT 4.x.

How does VTT 6 know whether it's reading one of those old single-byte-encoded sources or UTF-8 text? Is the VTT version stored somewhere?

@khaledhosny
Collaborator

I tried a try/except like you propose and the files loaded fine, but I don't know enough about VTT sources to tell whether it broke anything.

@anthrotype
Member

Normally non-ASCII characters would only appear in the inline comments, because the instruction names are plain ASCII (which both utf-8 and cp1252 decode the same way). Compilability should be preserved.
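As a quick sanity check of that point (the instruction text below is just an arbitrary ASCII example):

data = b'SVTCA[Y]\nMIAP[R], 1, 2\n'
# Pure ASCII bytes decode to the same text under both encodings.
assert data.decode('utf-8') == data.decode('cp1252') == data.decode('ascii')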

@BoldMonday

I noticed that instructions from the VTT autohinter contain comments with the plus-minus symbol (±):
fn changes <cvt> by <amount> (in ±1/64 pixel) at ppem sizes

Like @anthrotype said, this can only happen in comments; VTT instructions are all plain ASCII.
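That plus-minus sign is exactly the kind of character that distinguishes the two encodings, which is why the UTF-8-then-cp1252 fallback handles these autohinter comments; a small check:

assert '±'.encode('utf-8') == b'\xc2\xb1'
assert '±'.encode('cp1252') == b'\xb1'
try:
    b'\xb1'.decode('utf-8')  # the old single-byte form is not valid UTF-8
except UnicodeDecodeError:
    pass  # so old cp1252 sources fall through to the cp1252 branch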
