ISO/IEC JTC1 SC22 WG21 N2209 = 07-0069 - 2007-03-08
This document replaces N2159 = 07-0019 - 2007-01-10.
Many users of C++ need to manipulate Unicode character strings. While N2149 New Character Types for C++ addresses most low-level issues, it does not provide a mechanism to ensure that a string literal is encoded in UTF-8. For portable international code, the standard needs such a mechanism.
We propose to add a new lexical token for UTF-8 string literals. No new types or other language changes are required. In particular, we do not propose character literals.
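As an illustration of what such a token could look like in use, the following is a minimal sketch that assumes the u8 prefix later adopted in C++11 and a compiler supporting that standard or newer. The helper names code_units and code_unit are hypothetical and exist only for this example. The sketch checks that the code units of the literal are the UTF-8 encoding of the named characters, regardless of the implementation's ordinary execution character set.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// Hypothetical helper: value of the i-th code unit as an unsigned byte value.
// Works whether the element type is char (C++11 to C++17) or char8_t (C++20).
template <typename CharT, std::size_t N>
constexpr unsigned code_unit(const CharT (&s)[N], std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is 0xC3 0xA9 in UTF-8.
static_assert(code_units(u8"\u00E9") == 2, "two UTF-8 code units");
static_assert(code_unit(u8"\u00E9", 0) == 0xC3, "lead byte");
static_assert(code_unit(u8"\u00E9", 1) == 0xA9, "continuation byte");

// U+20AC (EURO SIGN) is 0xE2 0x82 0xAC in UTF-8.
static_assert(code_units(u8"\u20AC") == 3, "three UTF-8 code units");
```

The characters are written as universal-character-names so that the example does not depend on the encoding of the source file.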
Adoption of this paper requires all conforming implementations to have bytes of at least eight bits in size. We believe that all existing systems already conform.
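A program can verify this requirement on its own implementation. The sketch below uses the standard CHAR_BIT and UCHAR_MAX macros from <climits>; the function name byte_holds_utf8_code_unit is hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <climits>

// Fails to compile on an implementation whose bytes have fewer than eight bits.
static_assert(CHAR_BIT >= 8, "bytes must have at least eight bits");

// Hypothetical helper: true when one byte can represent every
// eight-bit UTF-8 code unit value (0x00 through 0xFF).
constexpr bool byte_holds_utf8_code_unit() {
    return UCHAR_MAX >= 0xFF;
}

static_assert(byte_holds_utf8_code_unit(), "a byte can hold any UTF-8 code unit");
```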
Note that this paper does not presume adoption of N2149 New Character Types for C++ and some editorial merge will be necessary.
Likewise, this paper does not presume adoption of N2053 Raw String Literals, for which some editorial merge will also be necessary.
See section 2.5 "Encoding Forms" in The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0). The online version (printing prohibited) is at http://www.unicode.org/versions/Unicode5.0.0/.
See Annex C of ISO 10646-1, which is online at http://www.dkuug.dk/JTC1/SC2/WG2/docs/n2005/n2005-2.doc.
See ISO/IEC 10646:2003, which is publicly available in several text and PDF files within a zip archive from http://standards.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003%28E%29.zip.
See UTF-8, UTF-16, UTF-32 & BOM.
To paragraph 1, edit
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set and is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.
To the grammar, edit
- " c-char-sequenceopt "
- L" c-char-sequenceopt "
To paragraph 1, edit
A string literal is a sequence of characters (as defined in 2.13.2) surrounded by double quotes, optionally beginning with the letter L, as in L"...". A string literal that does not begin with L is an ordinary string literal, also referred to as a narrow string literal. An ordinary string literal has type "array of n const char" and static storage duration (3.7), where n is the size of the string as defined below, and is initialized with the given characters. A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type "array of n const wchar_t" and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters.
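The types described in this paragraph can be checked at compile time. The sketch below assumes a compiler with decltype and static_assert (C++11 or later), which postdate this paper; the helper literal_size is hypothetical. Because a string literal is an lvalue, decltype yields a reference to the array type, and n counts the terminating null character.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// An ordinary (narrow) string literal has type "array of n const char".
static_assert(std::is_same<decltype("asdf"), const char (&)[5]>::value,
              "narrow literal: array of 5 const char");

// A wide string literal has type "array of n const wchar_t".
static_assert(std::is_same<decltype(L"asdf"), const wchar_t (&)[5]>::value,
              "wide literal: array of 5 const wchar_t");

// Hypothetical helper: the n in "array of n", including the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t literal_size(const CharT (&)[N]) {
    return N;
}

static_assert(literal_size("asdf") == 5, "four characters plus the null");
static_assert(literal_size(L"") == 1, "an empty literal still holds the null");
```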
In paragraph 3, edit
In translation phase 6 (2.1), adjacent string literals are concatenated. If a narrow string literal token is adjacent to a wide string literal token, the result is a wide string literal.
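The concatenation rule can be observed directly. The following sketch assumes a C++11 or later compiler, where mixing an unprefixed literal with a wide literal is well-defined; the function name mixed_concat_is_wide is hypothetical.

```cpp
#include <cassert>
#include <cwchar>
#include <type_traits>

// A narrow token adjacent to a wide token yields a wide string literal:
// four characters plus the terminating null, all of type const wchar_t.
static_assert(std::is_same<decltype("ab" L"cd"), const wchar_t (&)[5]>::value,
              "narrow + wide concatenates to a wide literal");

// Hypothetical helper: the concatenated literal compares equal to the
// same text written as a single wide literal.
inline bool mixed_concat_is_wide() {
    return std::wcscmp("ab" L"cd", L"abcd") == 0;
}
```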
Paragraph 5 already admits a multi-byte encoding of narrow string literals.
To paragraph 1, after the first sentence, add
Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character.