Document number:	N3463=12-0153
Date:	2012-11-02
Project:	Programming Language C++
Reply-to:	Beman Dawes <bdawes at acm dot org>

Portable Program Source Files

Even this simple program cannot be written portably in C++ and may fail to compile:

int main() {}

The problem is that "The set of physical source file characters accepted is implementation-defined" (2.2 Phases of translation [lex.phases] paragraph 1). So even though the above program's code is portable, actually compiling the file may require the user to perform a cumbersome encoding conversion and to convert every character not available in the compiler-accepted physical character set to the appropriate universal-character-name, alternative token, or trigraph.

This creates two problems:

1. It plays into the stereotype of C++ being unfriendly to Unicode, particularly UTF-8.
2. It can create problems for widely portable programs. Boost source files, for example, cannot use the copyright sign (©) or the names of some Boost developers even in comments because they cause some compilers to reject the source file.

The proposed fix is simple: continue to allow implementations their current latitude with regard to physical source file character sets, but add a requirement that the UTF-8 character encoding be accepted.

This ensures that a UTF-8 encoded source file is acceptable to all compilers on all systems. It does not address other character-set-related challenges in writing truly portable code, such as the effect of converting character or string literals composed of source character set members to the execution character set ([lex.phases], paragraph 5). Every journey begins with a single step, and that is all this proposal provides: the first step.

The proposed wording below is intended to require no changes to any existing source files and no changes to compilers that already recognize UTF-8 source files with a byte-order marker.

Proposed change to the standard

Change 2.2 Phases of translation [lex.phases] paragraph 1, bullet 1 as indicated:

The precedence among the syntax rules of translation is specified by the following phases.¹¹

1. An implementation accepts one or more physical source file character sets. One of the physical source file character sets accepted shall be the UTF-8 character encoding form of the Unicode character set with 0xEFBBBF byte-order marker. Any additional physical source file character sets accepted are implementation-defined. A source file shall contain only one physical source file character set; how that character set is determined is implementation-defined. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. ~~The set of physical source file characters accepted is implementation-defined.~~ Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

Rationale

UTF-8

UTF-8 is recommended as the required physical source file character set encoding form of the Unicode character set because it is specified by an ISO standard (ISO/IEC 10646, Annex D), is the de facto standard for such encodings, and is already supported by several C++ compilers.

Byte-order marker

The proposed wording requires a byte-order marker to assist compilers that wish to auto-detect UTF-8 encoding. Compilers are still permitted to accept UTF-8 source files without byte-order markers. The Unicode Technical Committee says of a 0xEFBBBF byte-order marker that "its presence does not affect conformance to the UTF-8 encoding scheme."

Digraphs and Trigraphs

Jeremiah Willcock asks, "would it make sense to disable digraphs and trigraphs in UTF-8 files...?" This proposal makes no changes to the handling of digraph and trigraph sequences, to avoid possible breakage of existing source files. But if the number of existing C++ source files that are UTF-8 encoded with a BOM and use digraphs or trigraphs is believed to be very small, the committee may wish to consider such a change.
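
For readers less familiar with what is at stake, here is a rough sketch of the phase 1 trigraph replacement that this proposal leaves untouched. The function name replace_trigraphs is hypothetical and the sketch operates on a plain string rather than a real token stream. It is written so that no trigraph sequence appears literally in its own source, since such a sequence would itself be replaced by a compiler that performs this step.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replace each of the nine trigraph sequences (two
// question marks followed by a selector character) with the single
// character it stands for. Other text is copied unchanged.
std::string replace_trigraphs(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i + 2 < in.size() && in[i] == '?' && in[i + 1] == '?') {
            char r = 0;
            switch (in[i + 2]) {
                case '=':  r = '#';  break;
                case '/':  r = '\\'; break;
                case '\'': r = '^';  break;
                case '(':  r = '[';  break;
                case ')':  r = ']';  break;
                case '!':  r = '|';  break;
                case '<':  r = '{';  break;
                case '>':  r = '}';  break;
                case '-':  r = '~';  break;
                default:   break;    // not a trigraph selector
            }
            if (r != 0) {
                out += r;
                i += 2;  // skip the two remaining characters of the trigraph
                continue;
            }
        }
        out += in[i];
    }
    return out;
}
```

Disabling this step for UTF-8 files would change the meaning of any existing UTF-8 source that relies on a sequence such as ??= standing for #, which is why the proposal leaves the decision to the committee.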

Acknowledgements

Thanks to Clark Nelson for reviewing a draft of this proposal. Further improvements were made in response to comments from Lawrence Crowl, Jeremiah Willcock, and Alberto Ganesh Barbati.