Unicode Frequently Asked Questions

Coping with Change

Q: I don't see why I should update to the latest version of the Unicode Standard. Are there any important new characters?

There are important changes in almost every release of the Unicode Standard. If your implementation is still at an earlier version, then you are missing many characters added in more recent versions. For example, a significant number of characters important for support of languages in India and Southeast Asia have been added in recent versions. For East Asia, characters have been added to fill out compatibility with major standards such as JIS X 0213, GB 18030, and HKSCS. Additionally, many symbols, including very popular emoji, appear in each new version. All of the characters are important to some user community.

Q: Which characters exactly were added?

The file https://www.unicode.org/Public/UCD/latest/ucd/DerivedAge.txt lists which characters were added in each successive version of the standard. For information on new scripts being added, see the list of Supported Scripts.
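As a rough sketch of how that file can be consumed, the Python below groups code point ranges by the version in which they were added. It assumes the usual UCD data-file layout (a code point or `first..last` range, a semicolon, the version, then a `#` comment). The function names and the three sample lines are illustrative only; the sample is written in the file's layout and is not a verbatim excerpt of any particular release.

```python
from collections import defaultdict


def parse_derived_age(lines):
    """Map each version string to a list of (first, last) code point ranges."""
    ranges_by_version = defaultdict(list)
    for line in lines:
        # Everything after '#' is a comment; blank lines carry no data.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        code_field, version = (part.strip() for part in data.split(";"))
        if ".." in code_field:
            first, last = (int(cp, 16) for cp in code_field.split(".."))
        else:
            first = last = int(code_field, 16)
        ranges_by_version[version].append((first, last))
    return dict(ranges_by_version)


def count_by_version(ranges_by_version):
    """Count how many code points each version contributed."""
    return {
        version: sum(last - first + 1 for first, last in ranges)
        for version, ranges in ranges_by_version.items()
    }


# Illustrative lines in the DerivedAge.txt layout (not a verbatim excerpt).
SAMPLE = """\
# sample data
0041..005A    ; 1.1 #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
20AC          ; 2.1 #       EURO SIGN
1F600         ; 6.1 #       GRINNING FACE
""".splitlines()

ages = parse_derived_age(SAMPLE)
counts = count_by_version(ages)
```

To process the real file, the same function can be applied to its lines after downloading it; comparing the versions it reports with the version your implementation supports shows which ranges you are missing.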

Q: Fonts and input methods or keyboards are really expensive to produce. Do I have to support all the new characters for them?

Supporting the latest version of Unicode does not require that you have fonts or keyboards for all the characters. You always have a choice of what repertoire of Unicode characters you want to support in your product. Fonts and keyboards can be added incrementally. That said, you will probably want to monitor additions for popular characters, such as emoji, as your users may expect you to support them.

Q: But what else would I want to support in the latest version of the standard?

Even if you are not supplying keyboards and fonts you will probably need your software to handle the properties of the new characters correctly. And if you support emoji, you may want to support new emoji sequences.

Q: Why should I support Unicode properties?

Unicode properties are widely used under the covers. Text parsers use them to separate letters from punctuation and symbols. Anything that uses regular expressions, such as XML Schema, uses them. They are used in uppercase/lowercase conversions and in case-insensitive matching. They also coordinate with the latest versions of the Unicode Collation Algorithm, for sorting.

In globalization coding guidelines, we strongly recommend that hard-coded expressions like:

if ('a' <= x && x <= 'z' || 'A' <= x && x <= 'Z') doSomething();

should normally change to use appropriate Unicode properties, something like the following (depending on what was originally meant):

if (getCategory(x) == LETTER) doSomething(); or
if (getCategory(x) == LETTER && getScript(x) == LATIN) doSomething();

Using an old version of Unicode will mean that new characters will be ignored in such processing, or included where they are not meant to be. Importantly, fixes in properties — even for old characters — are made over time, and using the latest version of the properties ensures that you have the most accurate data you can.
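For a concrete comparison, here is a minimal Python sketch of the hard-coded range test next to a property-based test. It uses the standard `unicodedata` module, whose tables reflect whichever Unicode version shipped with the Python build in use. The function names are hypothetical, and only the General_Category check is shown, because the Script property is not exposed by the Python standard library.

```python
import unicodedata


def is_ascii_letter(ch):
    """The hard-coded test: true only for the letters a-z and A-Z."""
    return "a" <= ch <= "z" or "A" <= ch <= "Z"


def is_letter(ch):
    """Property-based test: General_Category is one of Lu, Ll, Lt, Lm, Lo."""
    return unicodedata.category(ch).startswith("L")


# Characters that the range test misses but the property test accepts.
missed = [ch for ch in "aZéдß1_" if is_letter(ch) and not is_ascii_letter(ch)]
```

Here `missed` collects the letters from the sample string that the range test rejects, which is the kind of silent gap the property-based form avoids.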

Q: Is it cost-effective to update the Unicode character properties in my product?

There are good reasons to always update the Unicode character properties to the latest version when you can, since the cost is rather small (typically just updating data tables) compared to the benefits. For servers and middleware, support for new Unicode characters will typically amount to just updating the property tables appropriately.
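One way to see which version of the property tables a runtime carries is to ask it directly. The sketch below does this for Python, where `unicodedata.unidata_version` reports the version of the bundled Unicode Character Database; the helper names are hypothetical, and other platforms expose this information differently.

```python
import unicodedata


def table_version():
    """Return the bundled Unicode data version as a tuple of integers."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


def is_assigned(ch):
    """False when these tables give ch General_Category Cn.

    Cn covers reserved (not yet assigned) code points and noncharacters,
    so a character added in a later Unicode version reports False here
    until the tables are updated.
    """
    return unicodedata.category(ch) != "Cn"
```

A check such as `table_version() >= (15, 0)` can then guard code paths that depend on properties of recently added characters.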

Q: How do I find out about all the different versions of Unicode?

Documentation of the contents of each version of the Unicode Standard is found on the Enumerated Versions page. That page also provides links to blog posts which provide information about what was especially important in each new version.

Q: How do I cite the Unicode Standard in my references?

See Versions of the Unicode Standard.

Q: How much does the Unicode Standard change between different versions?

Characters can be added in each major or minor version of the standard. Properties and other specifications can be added or changed. However, all changes are subject to the Unicode stability policy. See the Character Encoding Stability Policy for more information.

Q: Will there be incompatible changes between versions?

Some aspects of the Unicode Standard or related specifications may occasionally change in ways that produce different results or require updated processing. Generally these are called out as "Migration Issues" on the main page for each Unicode version.

Typically, the changes are additive or otherwise upwardly compatible. Some aspects are subject to Unicode Consortium Policies, such as the policy for Character Encoding Stability, that either strictly prohibit changes, or place strict limits on the type of allowed changes. For example, the character codes and names will never change once published. There are also constraints regarding properties themselves. For example, the set of possible values that the General_Category property can have is fixed and will never change.

However, properties that are not constrained can and will be updated, often because better information has come to light on how certain characters are used. Similar changes may be made to many of the algorithms published by the Unicode Consortium, and updating to them may at times require patching an implementation; often all that is necessary would be to incorporate updated property data tables. [AF]