[Unicode]  Technical Reports
 

Unicode Standard Annex #24

Script Names

Version 4.0.0
Authors Mark Davis (mark.davis@us.ibm.com)
Date 2003-04-17
This Version http://www.unicode.org/reports/tr24/tr24-5.html
Previous Version http://www.unicode.org/reports/tr24/tr24-4.html
Latest Version http://www.unicode.org/reports/tr24/tr24
Tracking Number 5


Summary

This document provides an assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Standard Annex. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

Contents

  1. Introduction
  2. Usage Model
  3. Values
  4. Data File
References
Modifications

1 Introduction

The Unicode Character Database (UCD) provides data for a mapping from Unicode characters to script names. This information is useful for mechanisms such as regular expressions, where it produces much better results than simple matches on block names. Using block names to distinguish characters raises a number of problems.

For more information, see Character Blocks in UTR #18: Unicode Regular Expression Guidelines [UTR18].

2 Usage Model

Although script names are generally much more useful than simple block names, they cannot be applied blindly. The script assignment is oriented particularly towards mechanisms such as regular expressions, and is not intended for unrelated purposes such as graphology or history. The definition of script names in the data file does not preclude assigning scripts in other ways that are appropriate for those purposes.

The script name data provides a mapping from each Unicode code point to either a specific script such as Cyrillic, or to one of two special values, Common or Inherited.

The script names form a full partition of the code space: every code point is assigned a single script name. As new scripts are added to the standard, additional script names will be added.

In many cases, programs will override the script name based upon the context of the surrounding characters, especially for the case of Common. A simple heuristic is to use the script of the preceding character, which works well in many cases. However, this may not always produce optimal results: for example, in the text "... gamma (γ) is ..." this heuristic would cause matching parentheses to be in different scripts. Thus more sophisticated programs may use more complex heuristics.
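The preceding-character heuristic can be sketched as follows. This is a minimal illustration, not a definitive implementation: `TOY_RANGES`, `script_of`, and `resolve_common` are hypothetical names, and the table is a small hand-written subset standing in for the full script data.

```python
# Illustrative subset only (an assumption of this sketch); a real program
# would load the complete script data from the UCD.
TOY_RANGES = [
    (0x0041, 0x005A, "Latin"),   # A..Z
    (0x0061, 0x007A, "Latin"),   # a..z
    (0x03B1, 0x03C9, "Greek"),   # small alpha..small omega
]

def script_of(ch):
    """Return the script of one character; unlisted code points are Common."""
    cp = ord(ch)
    for first, last, name in TOY_RANGES:
        if first <= cp <= last:
            return name
    return "Common"

def resolve_common(text):
    """Override Common with the script of the preceding character."""
    resolved = []
    previous = "Common"          # nothing precedes the first character
    for ch in text:
        script = script_of(ch)
        if script == "Common":
            script = previous    # inherit from the character before
        previous = script
        resolved.append((ch, script))
    return resolved
```

For the text "gamma (γ) is", this sketch assigns the opening parenthesis to Latin and the closing parenthesis to Greek, which is the mismatch described above.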

In general, programs should only use specific script values in conjunction with both Common and Inherited. That is, to distinguish a sequence of characters appropriate for Greek, one would use:

((Greek | Common) (Inherited | Me | Mn)?)*

That is, characters that are either in Greek or in Common, each optionally followed by a character that is in Inherited or has general category Me or Mn. Specific languages may commonly use multiple scripts, so for Japanese one might use:

((Hiragana | Katakana | Han | Latin | Common) (Inherited | Me | Mn)?)*
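Where a regular expression engine does not expose script values directly, the same test can be written by hand. The sketch below is one possible reading of the patterns above, under stated assumptions: `TOY_RANGES`, `script_of`, `is_mark`, and `is_run` are hypothetical names, and the table is a small illustrative subset rather than the real data file.

```python
import unicodedata

# Illustrative subset only (an assumption of this sketch).
TOY_RANGES = [
    (0x0041, 0x005A, "Latin"),       # A..Z
    (0x0061, 0x007A, "Latin"),       # a..z
    (0x0300, 0x0301, "Inherited"),   # combining grave and acute accents
    (0x03B1, 0x03C9, "Greek"),       # small alpha..small omega
]

def script_of(ch):
    """Return the script of one character; unlisted code points are Common."""
    cp = ord(ch)
    for first, last, name in TOY_RANGES:
        if first <= cp <= last:
            return name
    return "Common"

def is_mark(ch):
    """The (Inherited | Me | Mn) alternative of the pattern."""
    return script_of(ch) == "Inherited" or unicodedata.category(ch) in ("Mn", "Me")

def is_run(text, scripts):
    """True if text matches ((scripts | Common) (Inherited | Me | Mn)?)*."""
    allowed = set(scripts) | {"Common"}
    may_take_mark = False            # True right after a base character
    for ch in text:
        if script_of(ch) in allowed:
            may_take_mark = True     # base character of an allowed script
        elif may_take_mark and is_mark(ch):
            may_take_mark = False    # the single optional mark is consumed
        else:
            return False
    return True
```

Passing several script names, for example `{"Greek", "Latin"}`, corresponds to widening the first alternative as in the Japanese pattern.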

Given this usage model, the current data is weighted on inclusiveness: a character is in a specific script (rather than Common or Inherited) only when it is clearly not used within other scripts. As more data on individual characters is collected, characters may move from the Common group to a more specific script (including Inherited).

The script property is useful in regular expression syntax for easy specification of spans of text that consist of a single script (or a mixture of scripts). However, users should be careful not to misapply it. The script values form a full partition of the Unicode code space, but that partition does not exhaust the possibilities for useful and relevant script-like subsets of Unicode characters.

For example, a user might wish to define a regular expression to span typical mathematical expressions, but the subset of Unicode characters used in mathematics does not correspond to any particular script. Instead, it requires use of the Math property, other character properties, and particular subsets of Latin, Greek, and Cyrillic letters. For information on other character properties, see the UCD.

The script property values may also be useful in providing user feedback to help signal possible spoofing, where visually-similar characters (confusable characters) are substituted in an attempt to mislead a user. For example, a domain name such as macchiato.com could be spoofed with macchiatο.com (with some Greek characters) or maссhiato.com (with some Cyrillic characters). The user can be alerted to odd cases by displaying mixed scripts with different color, highlighting, or boundary marks, such as macchiatο.com or maссhiato.com.
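A program might detect such mixtures before deciding how to display them. The following is a minimal sketch under stated assumptions: `TOY_RANGES`, `script_of`, `specific_scripts`, and `is_mixed_script` are hypothetical names, and the table covers only a few basic letters rather than the full script data.

```python
# Illustrative subset only (an assumption of this sketch).
TOY_RANGES = [
    (0x0041, 0x005A, "Latin"),      # A..Z
    (0x0061, 0x007A, "Latin"),      # a..z
    (0x03B1, 0x03C9, "Greek"),      # small alpha..small omega
    (0x0430, 0x044F, "Cyrillic"),   # small a..small ya
]

def script_of(ch):
    """Return the script of one character; unlisted code points are Common."""
    cp = ord(ch)
    for first, last, name in TOY_RANGES:
        if first <= cp <= last:
            return name
    return "Common"

def specific_scripts(text):
    """Set of specific scripts used in text, ignoring Common.

    A fuller version would ignore Inherited as well; the toy table
    contains no Inherited entries.
    """
    return {script_of(ch) for ch in text} - {"Common"}

def is_mixed_script(text):
    """True when more than one specific script appears in text."""
    return len(specific_scripts(text)) > 1
```

With this table, `is_mixed_script("macchiato.com")` is false, while the same name with U+03BF GREEK SMALL LETTER OMICRON or U+0441 CYRILLIC SMALL LETTER ES substituted is flagged as mixed.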

Possible spoofing is not limited to mixtures of scripts. Even in ASCII, there are confusable characters such as 0 and O, or 1 and l. Thus the use of script values would need to be augmented with other information such as general category values, plus exception lists of individual characters that are not distinguished by other Unicode properties.

3 Values

For illustration, Table 1 lists some of the Script Name values used in the data file. For a complete list of values, see [Scripts]. The names are not case-sensitive, and the order in which the scripts are listed here or in the data file is not significant.

Although Braille is not a script in the same sense that Latin or Greek is, it is given a script name in [Scripts]. This is useful because of the nature of the application of these script names, as in matching spans of similar characters in regular expressions.

In the Property Value Aliases file [PropValue], corresponding codes from ISO 15924: Code for the Representation of Names of Scripts [ISO15924] are provided as short names for the scripts.

Table 1: Script Values
Script Name   ISO 15924
COMMON        Zyyy
INHERITED     Qaai
LATIN         Latn (Latf, Latg)
CYRILLIC      Cyrl (Cyrs)
ARMENIAN      Armn
HEBREW        Hebr
ARABIC        Arab
SYRIAC        Syrc (Syrj, Syrn, Syre)
GEORGIAN      Geor (Geon, Geoa)
...           ...

Note: ISO 15924 provides an enumeration of four-letter script codes. In some cases the match between these script names and the ISO 15924 codes is not precise, since the goals are somewhat different. ISO 15924 is aimed primarily at the bibliographic identification of scripts; because of that it occasionally identifies varieties of scripts that may be useful for book cataloging, but which are not considered distinct as scripts in the Unicode Standard. For example, ISO 15924 has separate script codes for the Fraktur and Gaelic varieties of the Latin script.

Where there are no corresponding ISO 15924 codes, the "private use" ones starting with Q are used. Such values are likely to change in the future. In such a case, the Q-names will be retained as aliases in the UCD for backwards compatibility.

4 Data File

The Scripts.txt data file is available at [Scripts]. The format of the file is similar to that of Blocks.txt [Blocks]. The fields are separated by semicolons. The first field contains either a single code point, or the first and last code points in a range separated by "..". The second field provides the script name for that range. The comment (after a #) indicates the general category, and the character name. For a range, it adds the count in square brackets and uses the names for the first and last characters in the range. For example:

0B01;       ORIYA # Mn ORIYA SIGN CANDRABINDU
0B02..0B03; ORIYA # Mc [2] ORIYA SIGN ANUSVARA..ORIYA SIGN VISARGA

The value COMMON is the default value, given to all code points that are not explicitly mentioned in the data file.
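The format described above is simple to read programmatically. The sketch below parses the two example lines and applies the COMMON default; the names `SAMPLE`, `parse_scripts`, and `script_of` are hypothetical, and a real program would read the full Scripts.txt file instead of an embedded sample.

```python
import bisect

# The two example lines from this section, standing in for the full file.
SAMPLE = """\
0B01;       ORIYA # Mn ORIYA SIGN CANDRABINDU
0B02..0B03; ORIYA # Mc [2] ORIYA SIGN ANUSVARA..ORIYA SIGN VISARGA
"""

def parse_scripts(text):
    """Return a sorted list of (first, last, script) tuples."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()    # drop the trailing comment
        if not data:
            continue                            # blank or comment-only line
        code_field, script = (field.strip() for field in data.split(";"))
        first, _, last = code_field.partition("..")
        # A single code point is treated as a range of length one.
        ranges.append((int(first, 16), int(last or first, 16), script))
    ranges.sort()
    return ranges

def script_of(code_point, ranges):
    """Look up a code point; anything not listed defaults to COMMON."""
    # Find the last range whose first code point is <= code_point.
    i = bisect.bisect_right(ranges, (code_point, 0x110000, "")) - 1
    if i >= 0 and ranges[i][0] <= code_point <= ranges[i][1]:
        return ranges[i][2]
    return "COMMON"
```

Keeping the ranges sorted allows a binary search per lookup, which matters once the table holds the entire file rather than two lines.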

There is an additional set of Script Charts [Charts] that can be used to see the assignment of scripts. These charts show the entire range of Unicode characters broken down by script name (and general category where the script is Common or Inherited). If your browser is not set up for Unicode, see Display Problems.

References

[Blocks] Blocks.txt
For the latest version, see:
http://www.unicode.org/Public/UNIDATA/Blocks.txt
For other versions, see:
http://www.unicode.org/standard/versions/
[Charts] Script Charts
http://www.unicode.org/reports/tr24/charts/
[Feedback] Reporting Errors and Requesting Information Online
http://www.unicode.org/reporting.html
[FAQ] Unicode Frequently Asked Questions
http://www.unicode.org/faq/
For answers to common questions on technical issues.
[Glossary] Unicode Glossary
http://www.unicode.org/glossary/
For explanations of terminology used in this and other documents.
[ISO15924] ISO 15924: Code for the Representation of Names of Scripts
http://www.evertype.com/standards/iso15924/
[PropValue] Property Value Aliases data file
For the latest version, see:
http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
For other versions, see:
http://www.unicode.org/standard/versions/
[Reports] Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[Scripts] Scripts data file
For the latest version, see:
http://www.unicode.org/Public/UNIDATA/Scripts.txt
For other versions, see:
http://www.unicode.org/standard/versions/
[UCD] Unicode Character Database
http://www.unicode.org/ucd
For an overview of the Unicode Character Database and a list of its associated files.
[Unicode] The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA, Addison-Wesley, 2003. 0-321-18578-1.
[UTR18] UTR #18: Unicode Regular Expression Guidelines
http://www.unicode.org/reports/tr18/
[Versions] Versions of the Unicode Standard
http://www.unicode.org/standard/versions
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.

Modifications

The following summarizes modifications from the previous version of this document.

5
  • Changed to Proposed Update UAX
  • Added note on the stability of Q names
  • Abbreviated the list of names, so that people would not get the mistaken impression that it was complete.
  • Added note on Braille.
  • Added note on Mn, Me characters
  • Added note on use of scripts with regard to spoofing.
  • Minor edits
4
  • Updated references, including reference to Property Value Aliases
  • Clarified that the list is for illustration only; the definitive values are in the UCD.
  • Minor edits
3
  • Minor link editing only