Improve BCP-47 matching in name table · Issue #930 · fonttools/fonttools · GitHub

Improve BCP-47 matching in name table #930

Closed

behdad opened this issue Apr 19, 2017 · 19 comments
@behdad
Member
behdad commented Apr 19, 2017

Currently the name table thinks "fa-IR" cannot be encoded in Windows names, because the table only has a "fa" entry. How should this be resolved? Keep dropping the last item in the langcode while there's no match?
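For illustration, a minimal sketch of that truncation fallback; the mapping here is only an illustrative stand-in for the real Windows language-ID table in fontTools:

```python
# Illustrative subset standing in for fontTools' hard-coded Windows language IDs.
_WINDOWS_LANGUAGES = {"en": 0x0409, "fa": 0x0429}

def windows_langid(bcp47_tag):
    """Return a Windows LCID for a BCP-47 tag, dropping trailing subtags until a match."""
    subtags = bcp47_tag.split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in _WINDOWS_LANGUAGES:
            return _WINDOWS_LANGUAGES[candidate]
        subtags.pop()  # e.g. "fa-IR" -> "fa", then retry
    return None

assert windows_langid("fa-IR") == 0x0429
```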

@brawer
Collaborator
brawer commented Apr 20, 2017

Unicode defines an algorithm to minimize BCP47 language identifiers, with supporting data in CLDR. For example, this will transform all of fa-IR, fa-Arab and fa-Arab-IR to fa. The algorithm preserves some subtags; for example, fa-Arab-AF gets mapped to fa-AF because Iran (and not Afghanistan) is assumed to be the default region for the Farsi language.

ICU has implemented this algorithm, so we could call ICU. Alternatively, we could write a pure Python implementation somewhere in fonttools.misc. I think we should do the latter, because the algorithm isn’t all that complicated, and ICU would add a rather hefty dependency to fonttools.

I wouldn’t recommend blindly stripping language tags until we find a match in the old-style enums. The very point of BCP47 is to go beyond limited enums, and the reason for adding ltag to TrueType (respectively name format 1 to OpenType) was to support arbitrary BCP47 tags.

@behdad
Member Author
behdad commented Apr 20, 2017

I agree a fontTools.bcp47 would be nice.

@brawer
Collaborator
brawer commented May 1, 2017

Here are two independent Python implementations. They both seem a bit heavy to me, but I’m putting the links here for reference.

@anthrotype
Member
anthrotype commented Sep 16, 2017

I see that varLib's fvar builder is not using that addMultilingualNames method yet, and only adds English names (via the addNames method). For the time being, couldn't we just use the hard-coded mappings we already have, or must we wait for full support for BCP-47 tag normalization?

@brawer
Collaborator
brawer commented Sep 16, 2017

Agree, I think the current function is OK for now. It can always be improved later.

@anthrotype
Member

The addMultilingualNames method adds Macintosh (platform 1) nameIDs only when it can't find a match for a given BCP-47 tag among Windows (platform 3) language ids.

The addNames method currently used by default adds both Mac and Windows (English) nameIDs.

If we un-comment that line and use the former instead of the latter, that means we would no longer add both sets of nameIDs for fvar axes names, but only one set (most often, the Windows ones).

Would that be ok? @robmck-ms reported an issue with OSX requiring platformID=1 names for variable fonts: #683

Is it only the postScriptNameID which OSX requires as single-byte encoding, or also axisNameID and subfamilyNameID?
#756
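For reference, a rough usage sketch of the two name-table methods being discussed; the font path and strings are placeholders, and the exact keyword arguments may differ between fontTools versions:

```python
from fontTools.ttLib import TTFont

font = TTFont("MyVariable.ttf")  # placeholder path
name = font["name"]

# addName records a single (English) string for both the Mac and Windows
# platforms and returns the assigned name ID.
weight_id = name.addName("Weight")

# addMultilingualName takes a dict keyed by BCP-47 tags; passing the font
# lets it create an 'ltag' table when a Mac record needs an arbitrary tag.
axis_id = name.addMultilingualName({"en": "Weight", "fa": "وزن"}, ttFont=font)
```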

@brawer
Collaborator
brawer commented Sep 16, 2017

Uhm, why not try and see what works?

@justvanrossum
Collaborator

The addMultilingualNames method adds Macintosh (platform 1) nameIDs only when it can't find a match for a given BCP-47 tag among Windows (platform 3) language ids.

In the context of #683 I think addMultilingualNames should just always add platform 1 versions. Or make it optional via a flag in the method signature. I don't understand what sense it makes to only add them if a windows version can't be made.

As it stands, addName by default adds both platform 1 and platform 3 versions, but addMultilingualNames only adds platform 3 (in the usual case), which appears inconsistent.

@anthrotype
Member

I agree, let's make it add both by default

@justvanrossum
Collaborator

I can make a PR.

@anthrotype
Member

thanks!

anthrotype added a commit to anthrotype/fonttools that referenced this issue Jan 13, 2019
…elname'

now that addMultilingualName method also adds mac names by default, we can use it in varLib instead of addName.
The language identifiers are expected to be minimized, i.e. not contain default script/region subtags -- until we implement the minimizeSubtags algorithm from ICU/CLDR: fonttools#930
@anthrotype
Member

Here are a few useful links, in case we work on this someday.

we want an equivalent of ICU's uloc_minimizeSubtags function:
http://www.icu-project.org/apiref/icu4c/uloc_8h.html#acecda5c650c9a3a4e43900c676558e17
https://github.com/unicode-org/icu/blob/91d38d14e84feed46782aded16ce2038e92ce05e/icu4c/source/common/loclikely.cpp#L950

given a locale, find out which fields (language, script, or region) may be superfluous, in the sense that they contain the likely tags. For example, "en_Latn" can be simplified down to "en" since "Latn" is the likely script for "en", etc.

The algorithm is described here:
http://www.unicode.org/reports/tr35/#Likely_Subtags

The CLDR "likelySubtags" data can be found here:
http://www.unicode.org/cldr/charts/latest/supplemental/likely_subtags.html
https://github.com/unicode-cldr/cldr-core/blob/master/supplemental/likelySubtags.json
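For what it's worth, a rough pure-Python sketch of the two TR35 steps ("Add Likely Subtags" and "Remove Likely Subtags"), using a tiny hand-written subset of the likelySubtags data rather than the real CLDR table, and ignoring alias canonicalization:

```python
# Tiny illustrative subset of CLDR likelySubtags, keyed by bare language
# subtag (the real table also has und_*, lang_region, ... keys).
LIKELY_SUBTAGS = {
    "fa": ("fa", "Arab", "IR"),
    "en": ("en", "Latn", "US"),
    "zh": ("zh", "Hans", "CN"),
}

def add_likely_subtags(lang, script=None, region=None):
    """TR35 'Add Likely Subtags': fill in missing script/region with likely values."""
    likely = LIKELY_SUBTAGS.get(lang)
    if likely is None:
        return (lang, script, region)
    return (lang, script or likely[1], region or likely[2])

def minimize_subtags(lang, script=None, region=None):
    """TR35 'Remove Likely Subtags': drop fields the maximized form predicts."""
    maximized = add_likely_subtags(lang, script, region)
    for candidate in ((lang, None, None),     # language alone
                      (lang, None, region),   # language + region
                      (lang, script, None)):  # language + script
        if add_likely_subtags(*candidate) == maximized:
            return tuple(t for t in candidate if t)
    return tuple(t for t in maximized if t)

assert minimize_subtags("fa", "Arab", "IR") == ("fa",)       # fa-Arab-IR -> fa
assert minimize_subtags("fa", "Arab", "AF") == ("fa", "AF")  # fa-Arab-AF -> fa-AF
assert minimize_subtags("en", "Latn") == ("en",)             # en-Latn -> en
```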

@brawer
Collaborator
brawer commented Jan 14, 2019

Here’s the upstream source for the likelySubtags data: https://www.unicode.org/repos/cldr/trunk/common/supplemental/likelySubtags.xml
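Assuming that file's <likelySubtag from="..." to="..."/> element structure, a small loader sketch might look like:

```python
import xml.etree.ElementTree as ET

def load_likely_subtags(path):
    """Map a minimal tag (e.g. 'fa') to its maximized form (e.g. 'fa_Arab_IR')."""
    root = ET.parse(path).getroot()
    return {el.get("from"): el.get("to") for el in root.iter("likelySubtag")}

# likely = load_likely_subtags("likelySubtags.xml")
# likely["fa"]  ->  "fa_Arab_IR"
```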

@behdad
Member Author
behdad commented Jan 14, 2019

Here’s the upstream source for the likelySubtags data: https://www.unicode.org/repos/cldr/trunk/common/supplemental/likelySubtags.xml

I think we want this in HarfBuzz as well. Maybe something @dscorbett can look into?

@brawer
Collaborator
brawer commented Jan 14, 2019

And here’s the upstream source for aliasing deprecated tags (iw → he), overlong tags (eng → en), etc. Note that some items such as the country code SU for the Soviet Union have multiple replacements; it’s probably reasonable to use the first item in the list (in this case, alias SU to Russia, RU).

https://www.unicode.org/repos/cldr/trunk/common/supplemental/supplementalMetadata.xml

Note that ISO 639 and therefore also the IANA language subtag registry have a concept of macro-languages. When ICU sees a language subtag arb, it aliases the tag to ar (and likewise, cmn to zh) because of the <languageAlias reason="macrolanguage"> elements in the CLDR supplementalMetadata file. Go’s x/text library, on the other hand, preserves the input code and special-cases macrolanguages at matching time. The reason behind ICU’s behavior is that every programmer on the planet means Mandarin when they say “Chinese”, so it’s (in ICU’s perspective) best to alias cmn to zh. Personally I’d recommend following ICU, i.e. treat macrolanguage aliases in CLDR like all the other aliases, simply because ICU is more widely used than Go. But no strong opinions, just wanted to point out this quirk.
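A rough loader sketch for that file, assuming the <languageAlias type="..." replacement="..."/> and <territoryAlias .../> element structure and taking only the first replacement when several are listed:

```python
import xml.etree.ElementTree as ET

def load_aliases(path):
    """Return (language_aliases, territory_aliases), keeping only the first replacement."""
    root = ET.parse(path).getroot()
    languages = {el.get("type"): el.get("replacement").split()[0]
                 for el in root.iter("languageAlias")}
    territories = {el.get("type"): el.get("replacement").split()[0]
                   for el in root.iter("territoryAlias")}
    return languages, territories

# languages["iw"]   -> "he"   (deprecated code)
# languages["cmn"]  -> "zh"   (macrolanguage alias, per ICU's behavior)
# territories["SU"] -> "RU"   (first of several replacements)
```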

@behdad
Member Author
behdad commented Jan 14, 2019

The likelySubtags data maps Latin in Iran to Turkmen. In reality, it would most probably just be a Latin transcription of Persian. I assume other scripts / countries have the same issue...

@behdad
Member Author
behdad commented Jan 14, 2019

But that's something for CLDR, not us...

@brawer
Collaborator
brawer commented Jan 14, 2019

Bug reports welcome: https://unicode.org/cldr/trac/newticket

@brawer
Collaborator
brawer commented Jan 15, 2019

behdad closed this as completed Apr 1, 2022