Unicode Frequently Asked Questions

Variation Sequences

General variation sequences FAQ

Q: What are variation sequences?

Every character in Unicode can be displayed with many different glyphs: An "a" can be displayed with or without the top "hook" (a versus ɑ). A not-equals sign (≠) can be displayed with an angled or vertical slash, and so on.

In some situations, however, it is important to indicate in plain text that only a subset of the possible glyphs for a character should be used, such as a vertical slash for ≠. The variation sequences are a standardized mechanism for requesting such an appearance.

Q: What is the structure of a variation sequence?

Variation sequences consist of a base character followed by a variation selector.
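The structure can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the code-point ranges assume VS1 to VS16 (U+FE00..U+FE0F), VS17 to VS256 (U+E0100..U+E01EF), and the Mongolian free variation selectors (U+180B..U+180D, U+180F). It finds each selector that directly follows a base character; it does not check whether the resulting sequence is a valid one.

```python
def is_variation_selector(ch: str) -> bool:
    """Return True if ch is one of the assumed variation selector code points."""
    cp = ord(ch)
    return (
        0xFE00 <= cp <= 0xFE0F         # VS1..VS16
        or 0xE0100 <= cp <= 0xE01EF    # VS17..VS256
        or 0x180B <= cp <= 0x180D      # Mongolian FVS1..FVS3
        or cp == 0x180F                # Mongolian FVS4
    )


def find_variation_sequences(text: str) -> list[tuple[str, str]]:
    """Return (base, selector) pairs for each selector that follows a base."""
    pairs = []
    for prev, cur in zip(text, text[1:]):
        # A selector only forms a sequence with a non-selector before it.
        if is_variation_selector(cur) and not is_variation_selector(prev):
            pairs.append((prev, cur))
    return pairs


sample = "A \u222A\uFE00 B \u2260"
for base, selector in find_variation_sequences(sample):
    print(f"<{ord(base):04X}, {ord(selector):04X}>")  # prints <222A, FE00>
```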

Q: What variation sequences are valid?

Only those listed in StandardizedVariants.txt, emoji-variation-sequences.txt, or the registered sequences listed in the Ideographic Variation Database (IVD).
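A validity check therefore amounts to a table lookup. The sketch below assumes the semicolon-separated layout used by StandardizedVariants.txt (a sequence of hexadecimal code points, then a description, with "#" starting a comment); the two sample lines are illustrative, the function names are hypothetical, and the IVD uses its own file, which is not handled here. Verify the field layout against the actual data files before relying on it.

```python
SAMPLE_DATA = """\
# Illustrative sample only, not the full data file.
2229 FE00; with serifs; # INTERSECTION
222A FE00; with serifs; # UNION
"""


def parse_sequences(text: str) -> dict:
    """Map (base, selector) code point pairs to their descriptions."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        code_points = tuple(int(h, 16) for h in fields[0].split())
        if len(code_points) != 2:              # expect base + one selector
            continue
        table[code_points] = fields[1] if len(fields) > 1 else ""
    return table


def is_listed(base: str, selector: str, table: dict) -> bool:
    """Return True if <base, selector> appears in the parsed table."""
    return (ord(base), ord(selector)) in table


table = parse_sequences(SAMPLE_DATA)
for (base_cp, selector_cp), description in table.items():
    print(f"<{base_cp:04X}, {selector_cp:04X}> {description}")
```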

Q: What's the difference between standardized variation sequences and registered Ideographic Variation Sequences (IVSes)?

There is no difference in how these two types of variation sequences are used or supported by implementations. Standardized sequences, as the name implies, are defined in the Unicode Standard, as listed in StandardizedVariants.txt. The variation sequences which affect emoji presentation are listed in emoji-variation-sequences.txt, and are documented in UTS #51, Unicode Emoji. IVSes are formally registered by their submitters according to the procedures listed in UTS #37, Unicode Ideographic Variation Database. They are listed in the Ideographic Variation Database (IVD). After an IVS has been registered, it can be used by anyone.

Q: Can I define my own variation sequences?

No, that is the equivalent of trying to define an unassigned code point to be your own character. Private use characters should be used instead.

Q: Which glyph variations can be represented with variation sequences?

Only a selected subset of glyph variations has a corresponding variation sequence; in most cases, it is up to a font to select the specific glyph from within a range of acceptable or recognizable shapes defined by convention.

Sometimes a font needs to follow particular conventions to be useful. For example, any font for IPA must display the "a" at U+0061 with a handle glyph, not a bowl glyph. Otherwise, it would be impossible to express the distinction from the IPA character with the bowl a at U+0251. The same applies for the contrasting shapes used for “g”. In a mathematical context, similar considerations apply to Greek characters, where particular forms of beta, theta, phi, etc. have been encoded for mathematical purposes, so the font must consistently supply the “other” shape for the regular character while for general text, the convention would allow free alternation of these forms. [AF]

Q: When are variation sequences not appropriate?

Variation sequences are inappropriate if two different shapes of a character carry very distinct meaning. This was the case for IPA, where a separate character was encoded for the hooked “g” (U+0261) and also for the “ɑ” that doesn't have a handle (U+0251), instead of defining variation sequences to represent the different glyphic presentations.

Q: How are the glyphs for variation sequences described?

A standardized variation sequence, such as <222A, FE00>, associates a sequence with a description, such as “UNION with serifs”. Here, “with serifs” indicates that the presence of serifs distinguishes the glyph variant from the ordinary glyph (which does not have serifs). In this case, a mathematical operator, the form without serifs is predominant. In other cases the glyph variants occur with more nearly equal frequency; there it would be problematic to assign a variation sequence to only one of them, because the other is not necessarily a “default.”

The appearance of the variant glyph is not as tightly restricted as the design of a logo, for example. It still can vary in all aspects, except that it is expected to retain its distinguishing characteristic—and it should remain a recognizable glyph for the character.

In order to standardize a variation sequence, the variant glyph at a minimum needs to be identified and described. It should also be applicable generically rather than restricted to a single font; the many stylistic variations of the ampersand found only in Poetica Ampersand, for example, would not qualify. [AF]

Q: What is the difference between stylistic glyph variants and positional forms?

Some characters adopt different shapes depending on the characters around them. These are called positional forms. Unlike “random” stylistic variations, these are standard forms for these characters, in the sense that a reader can look at the shape and say “this is the final form of character xxx.”

Where the display of positional forms is predictable, such as in Arabic, variation sequences are not necessary. In cases where positional variants need to be displayed outside their normal context, this rendering can be handled with two special characters ZWJ (ZERO WIDTH JOINER) and ZWNJ (ZERO WIDTH NON-JOINER), instead of variation sequences. Mongolian is more complex, so there are special variation selectors for it. For more information, see Section 13.5, Mongolian in The Unicode Standard.

Q: What are examples of positional forms encoded as separate characters?

In Greek, the small sigma has a special form, which is used at the end of words. It was given an explicit character code in early Greek character sets, so Unicode continued this practice. With identifiers, like IDNs, where words are often written together without spaces, the final sigma can also appear in the middle of a string. Such exceptions defy easy automation. In the Latin script, the contrast between “long s” and regular (round) “s” is in some sense positional, but the rules are not easy to automate, and even then exceptions would apply. Therefore, again, an explicit character was encoded. Similar characters are encoded for Hebrew.

Variation sequence display and support FAQ

Q: How should variation sequences be displayed?

When they are valid variation sequences, they should be displayed as illustrated in the Unicode code charts, the emoji charts, or in the Ideographic Variation Database. When a variation sequence is not valid or its display is not supported, the base character is displayed as usual, and the variation selector is invisible. See Display of Unsupported Characters.

Q: What about applications that don't support variation sequences?

Applications not supporting variation sequences should act as if the variation selector is not present. That normally applies to all text processes such as searching, sorting, parsing, and so forth.
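One way to act as if the selector is not present is to remove selectors before comparing or sorting. The following is a minimal sketch, not a definitive implementation: the function name is hypothetical, and the character class assumes the selector ranges U+FE00..U+FE0F, U+E0100..U+E01EF, U+180B..U+180D, and U+180F.

```python
import re

# Assumed set of variation selector code points (see the note above).
_VS_PATTERN = re.compile(
    "[\uFE00-\uFE0F\U000E0100-\U000E01EF\u180B-\u180D\u180F]"
)


def strip_variation_selectors(text: str) -> str:
    """Return text with every variation selector removed."""
    return _VS_PATTERN.sub("", text)


# Sorting with the stripped form as the key ignores the selectors,
# while the stored strings keep them.
words = ["b\u222A\uFE00", "a\u222A", "b\u222A"]
print(sorted(words, key=strip_variation_selectors))
```

Using the stripped form only as a comparison key, rather than rewriting the stored text, keeps the original sequences intact for applications that do support them.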

Q: How can variation sequences be handled in fonts?

For handling variation sequences with OpenType fonts, see “Format 14: Unicode Variation Sequences” in the OpenType specification.

The following font development tools are helpful for implementing and verifying variation sequences in OpenType fonts via the Format 14 'cmap' subtable:

A significant number of OpenType fonts now support variation sequences. Please consult the font's documentation to determine the extent to which variation sequences are supported.

Q: What changes does a browser developer need to make to support variation sequences?

Browsers generally use a font fallback mechanism to display web pages. This allows users to read text when the font specified in the web page is unavailable or doesn't support all the characters that are referenced on that web page. A simple but insufficient mechanism is to display characters in a font up until the first character that can't be displayed. Such a mechanism fails with variation sequences. A better mechanism is to treat a combining character sequence as a single entity for the purpose of font substitution. Because variation selectors have the General_Category property value of Nonspacing_Mark, this treatment allows variation sequences to be handled correctly. This applies more generally, to developers of any OS or application, and not only to browser developers.
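The better mechanism can be sketched as follows. The font model here is purely hypothetical: each font is a name plus a set of strings it can render, where a string is either a single character or a complete variation sequence. Clusters are simplified to a base character plus any following characters with General_Category Mn, which is narrower than a full combining character sequence.

```python
import unicodedata


def clusters(text: str) -> list:
    """Group each character with the nonspacing marks (Mn) that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) == "Mn":
            out[-1] += ch      # attach the mark or selector to its base
        else:
            out.append(ch)     # start a new cluster
    return out


def pick_font(cluster: str, fonts: list):
    """Choose a font for the whole cluster, never for part of it."""
    # First choice: a font that supports the cluster as a unit.
    for name, supported in fonts:
        if cluster in supported:
            return name
    # Otherwise use a font for the base alone; the selector stays invisible.
    for name, supported in fonts:
        if cluster[0] in supported:
            return name
    return None


# Hypothetical fonts, listed in fallback order.
fonts = [
    ("BodyFont", {"A", "\u222A"}),
    ("MathFont", {"\u222A", "\u222A\uFE00"}),
]
for cluster in clusters("A\u222A\uFE00"):
    print([f"{ord(c):04X}" for c in cluster], pick_font(cluster, fonts))
```

In this sketch the sequence <222A, FE00> goes to MathFont as a unit, even though BodyFont comes first in the fallback order and supports U+222A on its own.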

Q: How should variation sequences be handled in search?

There are a number of different methods. The first and simplest method is to ignore any variation selectors when doing a search. Another method is to have a query without variation selectors match terms with any variation selector, while a query with a specific variation selector matches only terms with that variation selector. Thus, a query for a base character X alone matches X, <X, VS1>, and <X, VS2>, whereas a query for <X, VS1> matches only <X, VS1>.
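The second method might look like the sketch below. It is an illustration under stated assumptions: the function names are hypothetical, matching is done on whole terms, only the selector ranges U+FE00..U+FE0F and U+E0100..U+E01EF are recognized, and if several selectors follow one base only the last is kept.

```python
def is_vs(ch: str) -> bool:
    """Assumed selector ranges: VS1..VS16 and VS17..VS256."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def units(text: str) -> list:
    """Split text into (base, selector) pairs; selector is '' if absent."""
    out = []
    for ch in text:
        if is_vs(ch) and out:
            out[-1] = (out[-1][0], ch)
        else:
            out.append((ch, ""))
    return out


def matches(query: str, term: str) -> bool:
    """A bare base matches any selector; an explicit selector must match."""
    q, t = units(query), units(term)
    if len(q) != len(t):
        return False
    for (q_base, q_sel), (t_base, t_sel) in zip(q, t):
        if q_base != t_base:
            return False
        if q_sel and q_sel != t_sel:
            return False
    return True


print(matches("\u222A", "\u222A\uFE00"))   # True: query has no selector
print(matches("\u222A\uFE00", "\u222A"))   # False: query requires VS1
```

The first method, ignoring selectors entirely, corresponds to comparing only the base characters of each pair.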

Q: How should variation sequences be handled in IMEs (input method editors for CJK)?

They can be listed as separate options to choose from, just like single code points. However, if there are many options it may be worth having a pull-down or fly-out menu associated with the base character.

Standardized variation sequences FAQ

Q: How can I propose a standardized variation sequence?

You can initiate the process of requesting a variation sequence by submitting an inquiry via the contact form. A thorough understanding of how variation selectors are used will make a proposal more likely to be accepted by the UTC. For background information, read Section 23.4, Variation Selectors; UTR #25, Unicode Support for Mathematics; and UAX #34, Unicode Named Character Sequences, as well as the rest of this FAQ. [AF]

Q: I'm proposing an addition to a historic script that is a variant of an existing character. Should I propose it as a new character or as a new variation sequence?

Variation sequences provide a means to specify a certain significant glyphic variation of a character, without encoding each variation as a separate character. This is particularly useful whenever such distinction is not universally necessary.

Because the character itself is part of the variation sequence, one should be able to search and find all the instances of that particular character, independent of variation in its appearance, a task which would be more complicated if the variants were encoded as separate characters. If you can replace the variant by the existing character without significantly distorting the content of the text, then a variation sequence is the appropriate way to represent the variant, and you should propose your addition as a variation sequence.

For historic scripts, the variation sequence provides a useful tool, because it can show mistaken or nonce glyphs and relate them to the base character. It can also be used to reflect the views of scholars, who may see the relation between the glyphs and base characters differently. Also, new variation sequences can be added for new variant appearances (and their relation to the base characters) as more evidence is discovered.

Q: In what situations does Unicode define variation sequences?

Standardized variation sequences are intended as an exceptional mechanism to deal with certain difficult edge cases where the character versus glyph question cannot be decided. To qualify as a standardized variant, an entity must, in most contexts, clearly be the same character as its base. That means that substituting the base character is not only harmless to the meaning of the text, but ideally not even noticeable to many readers. [AF]

Q: Will standardized variation sequences be defined for italics or bold?

The Unicode Technical Committee is on record that defining variation selectors for italics or bold is out of scope: "variation selector characters are not used for and will not be designated for textual effects which are inherently scoped across spans of text, as is the case for italic styling." [AF]

Ideographic Variation Sequence (IVS) FAQ

Q: How can I register an IVS?

Registrations are subject to the requirements and process specified in UTS #37. [AF]

Q: Can Ideographic Variation Sequences (IVSes) be registered for non-variant or standard forms of CJK Unified Ideographs?

Yes. The Han Unification process that resulted in the standard repertoire of CJK Unified Ideographs treats all glyphic forms that can be represented by a CJK Unified Ideograph as variants. By definition, therefore, the glyph chosen as the standard representation of a CJK Unified Ideograph is itself a variant. To emphasize this point, starting with Unicode 5.2, the Unicode code charts no longer show a single glyph, but instead typically show several national variants for each CJK Unified Ideograph. The purpose of a registered IVS is to allow one to pin down a CJK Unified Ideograph to a more specific glyphic form, regardless of whether that glyph is commonly considered a variant or non-variant form.