Improve BCP-47 matching in name table · Issue #930 · fonttools/fonttools · GitHub

Improve BCP-47 matching in name table #930

Closed

behdad opened this issue Apr 19, 2017 · 19 comments
@behdad
Member
behdad commented Apr 19, 2017

Currently the name table thinks "fa-IR" cannot be encoded in Windows names, because the table only has a "fa" entry. How should this be resolved? Keep dropping the last item in the langcode while there's no match?
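For illustration, a minimal sketch of that truncation fallback; the mapping here is only an illustrative stand-in for the real Windows language-ID table in fontTools:

```python
# Illustrative subset standing in for fontTools' hard-coded Windows language IDs.
_WINDOWS_LANGUAGES = {"en": 0x0409, "fa": 0x0429}

def windows_langid(bcp47_tag):
    """Return a Windows LCID for a BCP-47 tag, dropping trailing subtags until a match."""
    subtags = bcp47_tag.split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in _WINDOWS_LANGUAGES:
            return _WINDOWS_LANGUAGES[candidate]
        subtags.pop()  # e.g. "fa-IR" -> "fa", then retry
    return None

assert windows_langid("fa-IR") == 0x0429
```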

@brawer
Collaborator
brawer commented Apr 20, 2017

Unicode defines an algorithm to minimize BCP47 language identifiers, with supporting data in CLDR. For example, this will transform all of fa-IR, fa-Arab and fa-Arab-IR to fa. The algorithm preserves some subtags; for example, fa-Arab-AF gets mapped to fa-AF because Iran (and not Afghanistan) is assumed to be the default region for the Farsi language.

ICU has implemented this algorithm, so we could call ICU. Alternatively, we could write a pure Python implementation somewhere in fonttools.misc. I think we should do the latter, because the algorithm isn’t all that complicated, and ICU would add a rather hefty dependency to fonttools.

I wouldn’t recommend blindly stripping language tags until we find a match in the old-style enums. The very point of BCP47 is to go beyond limited enums, and the reason for adding ltag to TrueType (respectively name format 1 to OpenType) was to support arbitrary BCP47 tags.

@behdad
Member Author
behdad commented Apr 20, 2017

I agree a fontTools.bcp47 would be nice.

@brawer
Collaborator
brawer commented May 1, 2017

Here are two independent Python implementations. They both seem a bit heavy to me, but I’m putting the links here for reference.

@anthrotype
Member
anthrotype commented Sep 16, 2017

I see that varLib's fvar builder is not using that addMultilingualNames method yet, and only adds English names (via the addNames method). For the time being, couldn't we just use the hard-coded mappings we already have, or must we wait for full support for BCP-47 tag normalization?

@brawer
Collaborator
brawer commented Sep 16, 2017

Agree, I think the current function is OK for now. It can always be improved later.

@anthrotype
Member

The addMultilingualNames method adds Macintosh (platform 1) nameIDs only when it can't find a match for a given BCP-47 tag among Windows (platform 3) language ids.

The addNames method currently used by default adds both Mac and Windows (English) nameIDs.

If we un-comment that line and use the former instead of the latter, that means we would no longer add both sets of nameIDs for fvar axes names, but only one set (most often, the Windows ones).

Would that be ok? @robmck-ms reported an issue with OSX requiring platformID=1 names for variable fonts: #683

Is it only the postScriptNameID which OSX requires as single-byte encoding, or also axisNameID and subfamilyNameID?
#756
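For reference, a rough usage sketch of the two name-table methods being discussed; the font path and strings are placeholders, and the exact keyword arguments may differ between fontTools versions:

```python
from fontTools.ttLib import TTFont

font = TTFont("MyVariable.ttf")  # placeholder path
name = font["name"]

# addName records a single (English) string for both the Mac and Windows
# platforms and returns the assigned name ID.
weight_id = name.addName("Weight")

# addMultilingualName takes a dict keyed by BCP-47 tags; passing the font
# lets it create an 'ltag' table when a Mac record needs an arbitrary tag.
axis_id = name.addMultilingualName({"en": "Weight", "fa": "وزن"}, ttFont=font)
```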

@brawer
Collaborator
brawer commented Sep 16, 2017

Uhm, why not try and see what works?

@justvanrossum
Collaborator

The addMultilingualNames method adds Macintosh (platform 1) nameIDs only when it can't find a match for a given BCP-47 tag among Windows (platform 3) language ids.

In the context of #683 I think addMultilingualNames should just always add platform 1 versions. Or make it optional via a flag in the method signature. I don't understand what sense it makes to only add them if a windows version can't be made.

As it stands, addName by default adds both platform 1 and platform 3 versions, but addMultilingualNames only adds platform 3 (in the usual case), which appears inconsistent.

@anthrotype
Member

I agree, let's make it add both by default

@justvanrossum
Collaborator

I can make a PR.

@anthrotype
Member

thanks!

anthrotype added a commit to anthrotype/fonttools that referenced this issue Jan 13, 2019
…elname'

now that addMultilingualName method also adds mac names by default, we can use it in varLib instead of addName.
The language identifiers are expected to be minimized, i.e. not contain default script/region subtags -- until we implement the minimizeSubtags algorithm from ICU/CLDR: fonttools#930
@anthrotype
Member

Here are a few useful links, in case we work on this someday.

we want an equivalent of ICU's uloc_minimizeSubtags function:
http://www.icu-project.org/apiref/icu4c/uloc_8h.html#acecda5c650c9a3a4e43900c676558e17
https://github.com/unicode-org/icu/blob/91d38d14e84feed46782aded16ce2038e92ce05e/icu4c/source/common/loclikely.cpp#L950

given a locale, find out which fields (language, script, or region) may be superfluous, in the sense that they contain the likely tags. For example, "en_Latn" can be simplified down to "en" since "Latn" is the likely script for "en", etc.

The algorithm is described here:
http://www.unicode.org/reports/tr35/#Likely_Subtags

The CLDR "likelySubtags" data can be found here:
http://www.unicode.org/cldr/charts/latest/supplemental/likely_subtags.html
https://github.com/unicode-cldr/cldr-core/blob/master/supplemental/likelySubtags.json
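For what it's worth, a rough pure-Python sketch of the two TR35 steps ("Add Likely Subtags" and "Remove Likely Subtags"), using a tiny hand-written subset of the likelySubtags data rather than the real CLDR table, and ignoring alias canonicalization:

```python
# Tiny illustrative subset of CLDR likelySubtags, keyed by bare language
# subtag (the real table also has und_*, lang_region, ... keys).
LIKELY_SUBTAGS = {
    "fa": ("fa", "Arab", "IR"),
    "en": ("en", "Latn", "US"),
    "zh": ("zh", "Hans", "CN"),
}

def add_likely_subtags(lang, script=None, region=None):
    """TR35 'Add Likely Subtags': fill in missing script/region with likely values."""
    likely = LIKELY_SUBTAGS.get(lang)
    if likely is None:
        return (lang, script, region)
    return (lang, script or likely[1], region or likely[2])

def minimize_subtags(lang, script=None, region=None):
    """TR35 'Remove Likely Subtags': drop fields the maximized form predicts."""
    maximized = add_likely_subtags(lang, script, region)
    for candidate in ((lang, None, None),     # language alone
                      (lang, None, region),   # language + region
                      (lang, script, None)):  # language + script
        if add_likely_subtags(*candidate) == maximized:
            return tuple(t for t in candidate if t)
    return tuple(t for t in maximized if t)

assert minimize_subtags("fa", "Arab", "IR") == ("fa",)       # fa-Arab-IR -> fa
assert minimize_subtags("fa", "Arab", "AF") == ("fa", "AF")  # fa-Arab-AF -> fa-AF
assert minimize_subtags("en", "Latn") == ("en",)             # en-Latn -> en
```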

@brawer
Collaborator
brawer commented Jan 14, 2019

Here’s the upstream source for the likelySubtags data: https://www.unicode.org/repos/cldr/trunk/common/supplemental/likelySubtags.xml
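Assuming that file's <likelySubtag from="..." to="..."/> element structure, a small loader sketch might look like:

```python
import xml.etree.ElementTree as ET

def load_likely_subtags(path):
    """Map a minimal tag (e.g. 'fa') to its maximized form (e.g. 'fa_Arab_IR')."""
    root = ET.parse(path).getroot()
    return {el.get("from"): el.get("to") for el in root.iter("likelySubtag")}

# likely = load_likely_subtags("likelySubtags.xml")
# likely["fa"]  ->  "fa_Arab_IR"
```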

@behdad
Member Author
behdad commented Jan 14, 2019

Here’s the upstream source for the likelySubtags data: https://www.unicode.org/repos/cldr/trunk/common/supplemental/likelySubtags.xml

I think we want this in HarfBuzz as well. Maybe something @dscorbett can look into?

@brawer
Collaborator
brawer commented Jan 14, 2019

And here’s the upstream source for aliasing deprecated tags (iw → he), overlong tags (eng → en), etc. Note that some items such as the country code SU for the Soviet Union have multiple replacements; it’s probably reasonable to use the first item in the list (in this case, alias SU to Russia, RU).

https://www.unicode.org/repos/cldr/trunk/common/supplemental/supplementalMetadata.xml

Note that ISO 639 and therefore also the IANA language subtag registry have a concept of macro-languages. When ICU sees a language subtag arb, it aliases the tag to ar (and likewise, cmn to zh) because of the <languageAlias reason="macrolanguage"> elements in the CLDR supplementalMetadata file. Go’s x/text library, on the other hand, preserves the input code and special-cases macrolanguages at matching time. The reason behind ICU’s behavior is that every programmer on the planet means Mandarin when they say “Chinese”, so it’s (in ICU’s perspective) best to alias cmn to zh. Personally I’d recommend following ICU, i.e. treat macrolanguage aliases in CLDR like all the other aliases, simply because ICU is more widely used than Go. But no strong opinions, just wanted to point out this quirk.
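A rough loader sketch for that file, assuming the <languageAlias type="..." replacement="..."/> and <territoryAlias .../> element structure and taking only the first replacement when several are listed:

```python
import xml.etree.ElementTree as ET

def load_aliases(path):
    """Return (language_aliases, territory_aliases), keeping only the first replacement."""
    root = ET.parse(path).getroot()
    languages = {el.get("type"): el.get("replacement").split()[0]
                 for el in root.iter("languageAlias")}
    territories = {el.get("type"): el.get("replacement").split()[0]
                   for el in root.iter("territoryAlias")}
    return languages, territories

# languages["iw"]   -> "he"   (deprecated code)
# languages["cmn"]  -> "zh"   (macrolanguage alias, per ICU's behavior)
# territories["SU"] -> "RU"   (first of several replacements)
```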

@behdad
Member Author
behdad commented Jan 14, 2019

The likelySubtags data maps Latin in Iran to Turkmen. In reality, it would most probably just be a Latin transcription of Persian. I assume other scripts / countries have the same issue...

@behdad
Member Author
behdad commented Jan 14, 2019

But that's something for CLDR, not us...

@brawer
Collaborator
brawer commented Jan 14, 2019

Bug reports welcome: https://unicode.org/cldr/trac/newticket

@brawer
Collaborator
brawer commented Jan 15, 2019

behdad closed this as completed Apr 1, 2022