For reasons of compatibility, the row of the BMP with R = 00 has been given the structure of an 8-bit code according to ISO/IEC 2022. This requires that
This enables the coded representation of a control function to be obtained by a simple algorithm from its coded representation in an 8-bit code in accordance with ISO/IEC 2022. The algorithm is described elsewhere in this guide.
The graphic characters in the remaining 190 code positions of row 00 are allocated in accordance with the 8-bit code specified in
That code, and therefore row 00 of the BMP, contains graphic characters used for general purpose applications in typical office environments in at least the following languages:
This incorporation of ISO/IEC 8859-1 in particular makes the cells 21-7E of row 00 have the same allocations as the graphic characters of ASCII, which in its internationally standardized form is also known as the International Reference Version (IRV) of:
To aid its interpretation and development, the Basic Multilingual Plane is divided into five zones corresponding to the following code positions:
The R-zone terminates at FFFD as positions FFFE and FFFF are reserved; see the section of this guide on the 4-octet code structure of the UCS.
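The zone boundaries can be sketched as a small lookup table. The ranges below are assembled from the block tables later in this section (A-zone ending at 4DFF, I-zone 4E00-9FFF, O-zone A000-D7FF, R-zone E000-FFFD). The S-zone range D800-DFFF is an inference from the statement that the O-zone formerly extended to DFFF. The names `BMP_ZONES` and `bmp_zone` are illustrative only and do not come from the standard.

```python
from typing import Optional

# (zone letter, first cell, last cell), all inclusive.
# Assumption: the S-zone range is inferred from the text, not quoted from it.
BMP_ZONES = [
    ("A", 0x0000, 0x4DFF),
    ("I", 0x4E00, 0x9FFF),
    ("O", 0xA000, 0xD7FF),
    ("S", 0xD800, 0xDFFF),
    ("R", 0xE000, 0xFFFD),
]


def bmp_zone(code_position: int) -> Optional[str]:
    """Return the zone letter of a BMP code position, or None for FFFE/FFFF."""
    if not 0 <= code_position <= 0xFFFF:
        raise ValueError("not a code position of the BMP")
    for zone, first, last in BMP_ZONES:
        if first <= code_position <= last:
            return zone
    return None  # FFFE and FFFF are reserved and fall in no zone
```

For instance, `bmp_zone(0x4E00)` yields `"I"`, while `bmp_zone(0xFFFE)` yields `None`.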
Each zone has a distinctive use:
The transformation format UTF-16 was introduced by Amendment 1
to the first edition of ISO/IEC 10646-1, which also created the
S-zone by a splitting of the O-zone. Prior to that amendment the
O-zone extended to code position DFFF. UTF-16 extends the two-octet
coding of the BMP into a variable-length coding. In that coding
the characters of all zones of the BMP (P=00) other than the S-zone
are encoded in two octets while in addition characters of any
of the sixteen planes P=01 to P=10 (remember that 10 here is a
hexadecimal value) are encoded in four octets.
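A minimal sketch of this variable-length coding follows. The text above does not spell out the arithmetic, so the code assumes the usual UTF-16 definition: a position in planes 01 to 10 is reduced by 10000 (hexadecimal) and the resulting 20-bit value is split between a unit in D800-DBFF and a unit in DC00-DFFF, both in the S-zone. The function name `utf16_units` is hypothetical.

```python
from typing import List


def utf16_units(code_position: int) -> List[int]:
    """Return the 16-bit units that UTF-16 uses for one UCS code position."""
    if 0 <= code_position <= 0xFFFF:
        if 0xD800 <= code_position <= 0xDFFF:
            # S-zone cells are building blocks for pairs, not characters.
            raise ValueError("S-zone positions do not represent characters")
        return [code_position]              # one unit, i.e. two octets
    if 0x10000 <= code_position <= 0x10FFFF:
        offset = code_position - 0x10000    # 20-bit value
        high = 0xD800 + (offset >> 10)      # upper 10 bits
        low = 0xDC00 + (offset & 0x3FF)     # lower 10 bits
        return [high, low]                  # two units, i.e. four octets
    raise ValueError("outside planes 00 to 10")
```

A BMP position such as 4E00 comes back unchanged as a single unit, whereas the first position of plane 01, 10000, becomes the pair D800, DC00.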
The A-zone is structured into named blocks, each consisting of a consecutive range of cells. Each block is allocated to a related set of characters, although a block may contain individual cells that are currently unallocated. The characters in the UCS from a particular script may be grouped together in a single block (such as BENGALI) or they may be divided among several blocks (such as BASIC ARABIC and ARABIC EXTENDED). The characters of the Latin script occupy the first four named blocks (BASIC LATIN, LATIN-1-SUPPLEMENT, LATIN EXTENDED-A and LATIN EXTENDED-B); in addition there is one further block of Latin characters, LATIN EXTENDED ADDITIONAL, which occurs further into the code table.
Separate from the block structure, but closely related to it, is the concept of a collection of characters. A collection is the subset of characters allocated to one or more specified ranges of cells. The difference between a block and a collection is that the cells of a collection need not be consecutive and two collections may overlap. Collections are assigned both a name and a number. Blocks divide the code space into separate areas that are allocated for a coherent purpose. Collections put blocks and/or individual characters together to form subsets of practical significance. A user may then put several collections together to form a subset meeting a particular need, such as communication in English and Hebrew.
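The relationship between collections and subsets can be illustrated with a short sketch. The cell ranges are copied from the block tables in this section; the representation as lists of inclusive ranges and the names `make_subset` and `in_subset` are assumptions made for illustration. Whether BASIC LATIN plus BASIC HEBREW is actually sufficient for a given English and Hebrew application is likewise an assumption.

```python
from typing import Iterable, List, Tuple

CellRange = Tuple[int, int]  # inclusive (first cell, last cell)

# Collections expressed as lists of cell ranges.
BASIC_LATIN: List[CellRange] = [(0x0020, 0x007E)]
BASIC_HEBREW: List[CellRange] = [(0x05D0, 0x05EA)]
# A collection need not be consecutive: HEBREW EXTENDED spans two blocks.
HEBREW_EXTENDED: List[CellRange] = [(0x0590, 0x05CF), (0x05EB, 0x05FF)]


def make_subset(*collections: Iterable[CellRange]) -> List[CellRange]:
    """Combine several collections into one subset (a sorted list of ranges)."""
    return sorted(r for collection in collections for r in collection)


def in_subset(code_position: int, subset: Iterable[CellRange]) -> bool:
    """Report whether a code position lies in any range of the subset."""
    return any(first <= code_position <= last for first, last in subset)


# Illustrative subset for text in English and Hebrew.
english_hebrew = make_subset(BASIC_LATIN, BASIC_HEBREW)
```

With this subset, `in_subset(0x05D0, english_hebrew)` is true, while a cell of HEBREW EXTENDED-A such as 0590 is excluded until the HEBREW EXTENDED collection is added as well.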
The following table shows the blocks and collections of the first
nine rows of the A-zone, comprising cells 0000-08FF. It gives
both the name and the range of cells that comprise the block.
With the exception of the collection HEBREW EXTENDED, which is
formed from two blocks, there is a one-to-one correspondence between
blocks and collections for the characters in these nine rows.
The table also gives the number assigned to the collection in
the first column; the collection name is the same as that of the
block.
1 | BASIC LATIN | 0020-007E |
2 | LATIN-1-SUPPLEMENT | 00A0-00FF |
3 | LATIN EXTENDED-A | 0100-017F |
4 | LATIN EXTENDED-B | 0180-024F |
5 | IPA EXTENSIONS | 0250-02AF |
6 | SPACING MODIFIER LETTERS | 02B0-02FF |
7† | COMBINING DIACRITICAL MARKS | 0300-036F |
8 | BASIC GREEK | 0370-03CF |
9 | GREEK SYMBOLS AND COPTIC | 03D0-03FF |
10† | CYRILLIC | 0400-04FF |
(Reserved for future standardization) | 0500-052F | |
11 | ARMENIAN | 0530-058F |
HEBREW EXTENDED-A (31 further Hebrew characters have been allocated to previously reserved cells in this block by Amd. 7) | 0590-05CF | |
12 | BASIC HEBREW | 05D0-05EA |
HEBREW EXTENDED-B | 05EB-05FF | |
13* | HEBREW EXTENDED (This collection comprises the two blocks HEBREW EXTENDED-A and HEBREW EXTENDED-B) | |
14* | BASIC ARABIC | 0600-065F |
15* | ARABIC EXTENDED | 0660-06FF |
85 | SYRIAC (added by Amd.27, hence the out-of-sequence number) | 0700-074F |
(Reserved for future standardization) | 0750-077F | |
86* | THAANA (added by Amd.24, hence the out-of-sequence number) | 0780-07BF |
(Reserved for future standardization) | 07C0-08FF |
Certain characters in the blocks LATIN-1-SUPPLEMENT and LATIN EXTENDED-B have had their names changed by Technical Corrigendum 1 (1996) since the publication of the first edition of the standard in 1993. In the first of these blocks the characters affected are:
In the other block the affected characters are these same characters with added diacritical marks MACRON or ACUTE. The same name changes will be made in the next editions of the parts of ISO/IEC 8859 in which these characters appear.
The next five rows, 09-0D, are allocated to scripts that require the two special characters
in the coding of languages written in those scripts. As with rows 00-08, there is a collection corresponding to each block, but for these rows the collection consists of the characters allocated to that block together with these two special characters.
The following table shows the blocks and collections of rows 09-0D
of the A-zone, comprising cells 0900-0DFF. It gives both the name
and the range of cells that comprise the block. The table also
gives the number assigned to the collection that consists of the
characters allocated to the block together with the additional
characters at positions 200C and 200D. The collection name is
the same as that of the block on which it is based.
16* | DEVANAGARI | 0900-097F |
17* | BENGALI | 0980-09FF |
18* | GURMUKHI | 0A00-0A7F |
19* | GUJARATI | 0A80-0AFF |
20* | ORIYA | 0B00-0B7F |
21* | TAMIL | 0B80-0BFF |
22* | TELUGU | 0C00-0C7F |
23* | KANNADA | 0C80-0CFF |
24* | MALAYALAM | 0D00-0D7F |
84* | SINHALA (added by Amd.21, hence the out-of-sequence number) | 0D80-0DFF |
The remainder of the first 32 rows, namely rows 0E-1F, are either
reserved or allocated to further scripts that correspond to collections
on a one-to-one basis without additional characters. These are
shown in the following table:
25* | THAI | 0E00-0E7F |
26* | LAO | 0E80-0EFF |
72* | BASIC TIBETAN (added by Amd.6, hence the out-of-sequence number) | 0F00-0FBF |
(Reserved for future standardization) | 0FC0-109F | |
28 | GEORGIAN EXTENDED (note that the collection number is out of sequence) | 10A0-10CF |
27 | BASIC GEORGIAN | 10D0-10FF |
29 | HANGUL JAMO | 1100-11FF |
73 | ETHIOPIC (added by Amd.10, hence the out-of-sequence number) | 1200-137F |
(Reserved for future standardization) | 1380-139F | |
75 | CHEROKEE (added by Amd.12, hence the out-of-sequence number) | 13A0-13FF |
74 | UNIFIED CANADIAN ABORIGINAL SYLLABICS (added by Amd.11, hence the out-of-sequence number) | 1400-167F |
82 | OGHAM (added by Amd.20, hence the out-of-sequence number) | 1680-169F |
83 | RUNIC (added by Amd.19, hence the out-of-sequence number) | 16A0-16FF |
87* | BURMESE (added by Amd.26, hence the out-of-sequence number) | 1700-177F |
88* | KHMER (added by Amd.25, hence the out-of-sequence number) | 1780-17FF |
(Reserved for future standardization) | 1800-1DFF | |
30 | LATIN EXTENDED ADDITIONAL (one additional Latin character has been allocated to a previously reserved cell in this block by Amd.7) | 1E00-1EFF |
31 | GREEK EXTENDED | 1F00-1FFF |
The next nine rows of the A-zone contain symbols of various
sorts and for various scripts, including technical and special
purpose symbols. These take up rows 20-28 and they are followed
by a further seven rows that are at present unallocated. This
area of the A-zone is structured as follows:
32 | GENERAL PUNCTUATION | 2000-206F |
33 | SUPERSCRIPTS AND SUBSCRIPTS | 2070-209F |
34 | CURRENCY SYMBOLS | 20A0-20CF |
35† | COMBINING DIACRITICAL MARKS FOR SYMBOLS | 20D0-20FF |
36 | LETTERLIKE SYMBOLS | 2100-214F |
37 | NUMBER FORMS | 2150-218F |
38 | ARROWS | 2190-21FF |
39 | MATHEMATICAL OPERATORS | 2200-22FF |
40 | MISCELLANEOUS TECHNICAL | 2300-23FF |
41 | CONTROL PICTURES | 2400-243F |
42 | OPTICAL CHARACTER RECOGNITION | 2440-245F |
43 | ENCLOSED ALPHANUMERICS | 2460-24FF |
44 | BOX DRAWING | 2500-257F |
45 | BLOCK ELEMENTS | 2580-259F |
46 | GEOMETRIC SHAPES | 25A0-25FF |
47 | MISCELLANEOUS SYMBOLS | 2600-26FF |
48 | DINGBATS | 2700-27BF |
(Reserved for future standardization) | 27C0-27FF | |
80 | BRAILLE PATTERNS (added by Amd.16) | 2800-28FF |
(7 more rows reserved for future standardization) | 2900-2FFF |
The next 30 rows contain alphabetic scripts and symbols that
are used by languages that also make use of ideographic scripts.
The reference to CJK in the titles of some of the blocks of these
rows is to unified Chinese/Japanese/Korean characters; see the
section on ideographic scripts for more information. The blocks
and collections of these rows are as follows:
49* | CJK SYMBOLS AND PUNCTUATION | 3000-303F |
50* | HIRAGANA | 3040-309F |
51 | KATAKANA | 30A0-30FF |
52 | BOPOMOFO | 3100-312F |
53 | HANGUL COMPATIBILITY JAMO | 3130-318F |
54 | CJK MISCELLANEOUS | 3190-319F |
55 | ENCLOSED CJK LETTERS AND MONTHS | 3200-32FF |
56 | CJK COMPATIBILITY | 3300-33FF |
81 | CJK UNIFIED IDEOGRAPHS EXTENSION A (Amd.17) | 3400-4DBF |
(Reserved for future standardization) | 4DC0-4DFF |
The CJK COMPATIBILITY block includes many symbols for scientific units that have been coded in Chinese national standards as if they were ideographs. Examples, together with their coding, are
The last 26 rows of the A-zone, rows 34-4D, now contain CJK Unified Ideographs Extension A (Amendment 17). However, these rows were
allocated in the first edition of ISO/IEC 10646-1 to the Hangul
syllabic script, divided into three blocks and corresponding collections
numbered 57-59. Amendment 5 to this first edition deleted these
allocations and created instead an allocation for a substantially
larger set of Hangul syllabic characters in the O-zone. This was
accepted as a violation of the principle that published allocations
would not be changed, but there were compelling reasons to adopt
this change. It will not be taken as a precedent for future changes
of a similar nature.
The I-zone of the BMP is allocated as a single block to Chinese/Japanese/Korean
unified ideographs, and it correspondingly forms a single collection.
For completeness this is shown in the following table:
60 | CJK UNIFIED IDEOGRAPHS | 4E00-9FFF |
An informative Annex S, which describes the unification procedure, has been added to ISO/IEC 10646-1 by Amendment 8. This section of the guide is based on that annex.
The I-zone contains 20992 code positions, of which 20902 are currently allocated to specific ideographs. These ideographs were derived from over 54000 ideographs found in various national and regional standards for coded character sets. A process of unification was applied in which single ideographs from two or more of the source standards were associated together and assigned to a single code position in the I-zone. The ideographs that are thus associated are described, for the purposes of the UCS, as unified. To preserve data integrity, ideographs that are separately encoded in any one of the source standards are not unified. Ideographs that are unrelated in historical derivation are likewise not unified. However, some ideographs encoded in two different standards for the same language may have been unified.
The unification process is based on the shapes of the ideographs, analyzed according to a systematic procedure. Any ideograph is composed of geometric elements which may themselves be composite structures and possibly ideographs in their own right. This enables the structure of an ideograph to be described by a component tree, where the top node is the ideograph itself and the bottom nodes are primitive elements. When two ideographs are compared, their component trees are compared to see if they agree in all of the following aspects:
If all of these aspects agree then the ideographs are considered to have the same abstract shape and are therefore unified. Annex S to ISO/IEC 10646-1 contains a listing of pairs or triples of ideographs that would have been unified under these rules except for the criteria concerning historical derivation or separate encoding in an existing standard.
Unified ideographs are named and listed in the code pages of ISO/IEC 10646-1 in a manner separate from that used for other scripts. For each unified ideograph, the listing reproduces all (which may be only one) of the graphic symbols (source ideographs) that have been unified into that code position. For each graphic symbol it specifies the source standard from which the graphic symbol is taken and the coded representation of the symbol in that standard. The name assigned to each unified ideograph is generated algorithmically by appending its two-octet coded representation, written in hexadecimal, to "CJK UNIFIED IDEOGRAPH-", for example CJK UNIFIED IDEOGRAPH-4E00.
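The naming rule is simple enough to sketch in a few lines. The function name `cjk_unified_name` is hypothetical, and the range check covers the whole I-zone (4E00-9FFF) rather than only the 20902 allocated positions, so it does not tell whether a given cell actually holds an ideograph.

```python
def cjk_unified_name(code_position: int) -> str:
    """Build the algorithmic name for a code position in the I-zone."""
    if not 0x4E00 <= code_position <= 0x9FFF:
        raise ValueError("code position is not in the I-zone")
    # Four upper-case hexadecimal digits: the two-octet coded representation.
    return "CJK UNIFIED IDEOGRAPH-%04X" % code_position
```

Calling `cjk_unified_name(0x4E00)` reproduces the example name given above.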
The information concerning CJK unified ideographs has now been replaced by Amd.13.
Amendment 5 to the first edition of ISO/IEC 10646-1 specified a change in the encoding of Hangul syllabic script. Prior to that Amendment, the last 26 rows of the A-zone (row numbers 34-4D) were allocated to the Hangul syllabic script and the entire O-zone was reserved for future standardization. Due to a major revision of the corresponding Korean national standard shortly after the final text of the first edition was agreed, it became necessary to accommodate substantially more syllabic characters into the UCS. To include these additional characters, the total space required would be almost 44 rows.
It was decided that this was a sufficiently exceptional circumstance to merit violating the principle that code positions, once allocated, should not be changed. The Hangul syllabic characters already encoded would be moved from the A-zone to the O-zone, where there was sufficient space to include both the original and the additional characters in a single block, with a corresponding single collection. The amendment contains the statement that this change is not intended to be regarded as a precedent for other changes of allocation in future editions. This statement will itself be incorporated into future editions.
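The figure of almost 44 rows can be checked with a little arithmetic. The sketch below assumes the range AC00-D7A3 that the O-zone table in this section gives for the relocated Hangul block, and compares it with the capacity of the 26 rows originally set aside in the A-zone. The variable names are illustrative.

```python
ROW_SIZE = 256  # cells in one row of the BMP

# Range of the relocated Hangul block, taken from the O-zone table.
first, last = 0xAC00, 0xD7A3
syllables = last - first + 1          # code positions in the block
rows_needed = syllables / ROW_SIZE    # rows required to hold them

# Rows 34-4D, the original A-zone allocation for Hangul syllables.
old_rows = 0x4D - 0x34 + 1
old_capacity = old_rows * ROW_SIZE

print(syllables, round(rows_needed, 2), old_rows, old_capacity)
# prints: 11172 43.64 26 6656
```

The 11172 positions therefore need nearly 44 rows, well beyond the 6656 cells that rows 34-4D could offer.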
Amendment 14 has added the syllables and radicals of the Yi script to the O-zone.
Following these amendments, the O-zone has the structure shown in the following table:
76 | YI SYLLABLES | A000-A48F |
77 | YI RADICALS | A490-A4CF |
(Reserved for future standardization) | A4D0-ABFF | |
71 | HANGUL EXTENDED | AC00-D7A3 |
(Reserved for future standardization) | D7A4-D7FF |
Amendment 5 contains a mapping table giving the correspondence between the code positions before and after this amendment for the characters originally allocated to rows 34-4D.
The Hangul syllabic characters are assigned names that follow
the naming rules used for alphabetic scripts, e.g. HANGUL SYLLABLE
GEOLH (KEOLH), rather than the algorithmic name structure used
for the CJK unified ideographs of the I-zone.
The R-zone is distinguished from the remainder of the BMP in that its code positions are allocated for use only in special circumstances. There are three distinct uses for the R-zone:
As with the other zones, it is divided into blocks and collections
but the block for private use consists, by its very nature, only
of unallocated code positions. The structure of this zone is as
follows:
61 | PRIVATE USE AREA | E000-F8FF |
62 | CJK COMPATIBILITY IDEOGRAPHS | F900-FAFF |
63* | ALPHABETIC PRESENTATION FORMS | FB00-FB4F |
64 | ARABIC PRESENTATION FORMS-A | FB50-FDFF |
(Reserved for future standardization) | FE00-FE1F | |
65† | COMBINING HALF MARKS | FE20-FE2F |
66 | CJK COMPATIBILITY FORMS | FE30-FE4F |
67 | SMALL FORM VARIANTS | FE50-FE6F |
68 | ARABIC PRESENTATION FORMS-B | FE70-FEFE |
(The single character at code position FEFF is not in any of the blocks into which the BMP is divided. Its significance is explained in the chapter of this guide on Serial Transmission of the UCS) | FEFF | |
69 | HALFWIDTH AND FULLWIDTH FORMS | FF00-FFEF |
70 | SPECIALS | FFF0-FFFD |
Recall that the final two positions FFFE, FFFF are required to
be left unused in every plane of the UCS. The collection numbered
200 is one of a number of special-purpose collections that have
been assigned numbers in the range 200-299. See the chapter of
this guide on repertoires and subsets for more information.