ISO/IEC JTC1 SC22 WG21 N3499 - 2012-12-19
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org
Problem
Solution
Constraints
    Program Ambiguity
    Lexical Language Compatibility
    Extension Language Compatibility
Existing Grammar
    2.3 Character sets [lex.charset]
    2.5 Preprocessing tokens [lex.pptoken]
    2.10 Preprocessing numbers [lex.ppnumber]
    2.11 Identifiers [lex.name]
    2.14.2 Integer literals [lex.icon]
    2.14.4 Floating literals [lex.fcon]
    2.14.8 User-defined literals [lex.ext]
    16 Preprocessing directives [cpp]
Approaches
    Remove User-Defined Literals
    Typographic
    Grave Accent
    Single Quote
    Underscore
        Double Underscore
        Scope Operator
        Non-Digit Literal Suffix
        Spacing
        Double Radix Point
        Backslash
Proposal
    2.10 Preprocessing numbers [lex.ppnumber]
    2.14.2 Integer literals [lex.icon]
    2.14.4 Floating literals [lex.fcon]
    2.14.8 User-defined literals [lex.ext]
References
Numeric literals of more than a few digits are hard to read. Consider the following tasks.
- Pronounce 7237498123.
- Compare 237498123 with 237499123 for equality.
- Decide whether 237499123 or 20249472 is larger.
The problem has a long-standing solution in writing and typography: digit separators. In the English-speaking world, commas are usually used to separate digits.
- Pronounce 7,237,498,123.
- Compare 237,498,123 with 237,499,123 for equality.
- Decide whether 237,499,123 or 20,249,472 is larger.
We wish to introduce digit separators into C++. The exact syntax is still open. The remainder of this paper discusses various approaches to the solution.
Constraints on digit separators arise from three distinct sources.
Adding digit separators introduces the potential for ambiguous C++ programs. We would prefer to avoid ambiguity, and failing that would prefer to have usable rules for disambiguating the source. In particular, the interaction with user-defined literals [N2747] [N2765] should be carefully considered.
The lexical structure of C++ is shared with C, Objective C/C++, and other tools through the preprocessor. Any introduction of digit separators should carefully consider compatibility with the existing lexical structure of these languages.
Richard Smith questions the value of compatibility here.
This problem only arises if:
- Someone is attempting to write a file which is to be shared between C++14 and other languages, and
- They include text in that header which simply does not work in those other languages.
I find it hard to believe that this will be a real problem, and it seems like a clear case of user error. (If you're writing a header which works in C and C++, the burden is on you to make sure it works in C).
This is not a new issue. The same problem already exists with C++11's raw string literals, and to a lesser extent with user-defined-literals and with C's hex floats (which allow 'p+' within pp-numbers).
C++ is often used as the basis for extended languages, notably Objective C/C++, but also many languages that are smaller and less widely used. Invalidating those extension languages has costs that are hard to predict.
The existing grammar provides both constraints and opportunities.
Paragraph 1 is as follows.
The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters: [Footnote: The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files. —end footnote]
a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ! = , \ " '
Of particular note,
the only printable ASCII characters
not used in the C++ basic character set
are
$ (dollar),
@ (commercial at sign), and
` (grave accent, back tick).
All of these characters have been used for extension characters.
Dollar has also been used as an identifier character,
e.g. in VAX/VMS system function names.
The grammar is as follows.
- preprocessing-token:
- header-name
- identifier
- pp-number
- character-literal
- user-defined-character-literal
- string-literal
- user-defined-string-literal
- preprocessing-op-or-punc
- each non-white-space character that cannot be one of the above
Paragraph two is of special note.
A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing token are: header names, identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments (2.8), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in Clause 16, in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.
The implication here is that no valid C++ program should have an isolated single or double quote character. Unfortunately, that information is less useful than it might appear, because an isolated single quote could be in use to signal an extension language interpretation.
The grammar is as follows.
- pp-number:
- digit
- . digit
- pp-number digit
- pp-number nondigit
- pp-number e sign
- pp-number E sign
- pp-number .
We would like numeric literals to fit within this syntax, as it would require the least change to existing tools, e.g., editor syntax highlighting and mouse word grabbing.
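To make the reach of the pp-number rule concrete, the following is a minimal sketch of a recognizer for the grammar above. The function names are hypothetical, and universal-character-names are left out for brevity. It only reports how many leading characters form a single pp-number, which is the property that existing tools rely on.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: length of the longest pp-number at the start of
// `text`, or 0 if `text` does not begin with one.
// Simplification: universal-character-names are not handled.
std::size_t pp_number_length(const std::string& text) {
    auto is_digit = [](char c) {
        return std::isdigit(static_cast<unsigned char>(c)) != 0;
    };
    auto is_nondigit = [](char c) {
        return std::isalpha(static_cast<unsigned char>(c)) != 0 || c == '_';
    };

    std::size_t i = 0;
    // pp-number: digit | . digit
    if (i < text.size() && is_digit(text[i])) {
        i += 1;
    } else if (i + 1 < text.size() && text[i] == '.' && is_digit(text[i + 1])) {
        i += 2;
    } else {
        return 0;
    }

    while (i < text.size()) {
        const char c = text[i];
        if ((c == 'e' || c == 'E') && i + 1 < text.size() &&
            (text[i + 1] == '+' || text[i + 1] == '-')) {
            i += 2;  // pp-number e sign | pp-number E sign
        } else if (is_digit(c) || is_nondigit(c) || c == '.') {
            i += 1;  // pp-number digit | pp-number nondigit | pp-number .
        } else {
            break;   // any other character ends the pp-number
        }
    }
    return i;
}

// Example checks; call from main() to run them.
void check_pp_number_examples() {
    assert(pp_number_length("0xe+1") == 5);  // one pp-number, not 0xe + 1
    assert(pp_number_length("1_000") == 5);  // underscore is a nondigit
    assert(pp_number_length("1'000") == 1);  // a single quote ends the token
}
```

Under this rule an underscore or a dot stays inside the token, while a single quote, a grave accent, or a backslash ends it. That is why some candidate separators need an extra pp-number production and others do not.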
The grammar is as follows.
- nondigit: one of
a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z _
- digit: one of
0 1 2 3 4 5 6 7 8 9
The implication in this grammar is that ignored code must still be made up of valid tokens.
The grammar is as follows.
- integer-literal:
- decimal-literal integer-suffixopt
- octal-literal integer-suffixopt
- hexadecimal-literal integer-suffixopt
- decimal-literal:
- nonzero-digit
- decimal-literal digit
- octal-literal:
- 0
- octal-literal octal-digit
- hexadecimal-literal:
- 0x hexadecimal-digit
- 0X hexadecimal-digit
- hexadecimal-literal hexadecimal-digit
- nonzero-digit: one of
1 2 3 4 5 6 7 8 9
- octal-digit: one of
0 1 2 3 4 5 6 7
- hexadecimal-digit: one of
0 1 2 3 4 5 6 7 8 9
a b c d e f
A B C D E F
This syntax is entirely contained within the pp-number syntax.
The grammar is as follows.
- floating-literal:
- fractional-constant exponent-partopt floating-suffixopt
- digit-sequence exponent-part floating-suffixopt
- fractional-constant:
- digit-sequenceopt . digit-sequence
- digit-sequence .
- exponent-part:
- e signopt digit-sequence
- E signopt digit-sequence
- sign: one of
+ -
- digit-sequence:
- digit
- digit-sequence digit
This syntax is entirely contained within the pp-number syntax.
The grammar is as follows.
- user-defined-literal:
- user-defined-integer-literal
- user-defined-floating-literal
- user-defined-string-literal
- user-defined-character-literal
- user-defined-integer-literal:
- decimal-literal ud-suffix
- octal-literal ud-suffix
- hexadecimal-literal ud-suffix
- user-defined-floating-literal:
- fractional-constant exponent-partopt ud-suffix
- digit-sequence exponent-part ud-suffix
- user-defined-string-literal:
- string-literal ud-suffix
- user-defined-character-literal:
- character-literal ud-suffix
- ud-suffix:
- identifier
The grammar is as follows.
- text-line:
- pp-tokensopt new-line
- pp-tokens:
- preprocessing-token
- pp-tokens preprocessing-token
The implication here is that #if-ignored program source
must still be made up of valid preprocessor tokens,
not arbitrary text.
Many preprocessors will skip arbitrary text, though.
There are several approaches to the solution. We evaluate them in turn.
At least Daveed Vandevoorde and N.M. Maclaren have suggested removing user-defined literals. However, removing a feature that we just introduced could be difficult.
There are three primary typographic conventions for digit separators: a comma, a base-line dot, and a (thin) space.
C++ already uses the comma for an operator,
and using it for a digit separator would introduce ambiguities
in expressions such as ++a-3,4-b++,
or even more simply, f(12,345).
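The simpler of the two cases can be checked directly in existing C++. The sketch below uses hypothetical overloads that report how many arguments they received. It shows that the comma in f(12,345) already separates two arguments, so it could not also serve as a digit separator.

```cpp
#include <cassert>

// Hypothetical overloads that report how many arguments they received.
int argument_count(int) { return 1; }
int argument_count(int, int) { return 2; }

// Example checks; call from main() to run them.
void check_comma_examples() {
    // The comma separates two arguments: 12 and 345.
    assert(argument_count(12,345) == 2);

    // The same holds in a braced initializer: two elements, not one.
    int values[] = {12,345};
    assert(sizeof(values) / sizeof(values[0]) == 2);
}
```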
C++ already uses the base-line dot as a radix point, and so it is essentially not usable as a digit separator.
Bjarne Stroustrup has suggested using a space as a separator.
- Pronounce 7 237 498 123.
- Compare 237 498 123 with 237 499 123 for equality.
- Decide whether 237 499 123 or 20 249 472 is larger.
While this approach is consistent with one common typographic style, it suffers from some compatibility problems.
Ville Voutilainen, among others, suggests using a grave accent (`) (back tick) as a digit separator.
- Pronounce 7`237`498`123.
- Compare 237`498`123 with 237`499`123 for equality.
- Decide whether 237`499`123 or 20`249`472 is larger.
This character is not part of the C++ basic source character set. The proposal has the advantage that introducing it for this purpose cannot yield any ambiguity with existing C++ code. There are two disadvantages. First, using this character in the language invalidates any meta-languages using this character to distinguish between the C++ base layer and any meta information. Second, existing preprocessors would not recognize the grave accent as part of a preprocessor number, and may thus yield incorrect results.
Daveed Vandevoorde suggests using a single quote [N2747]. The single quote can be thought of as an "upper comma".
- Pronounce 7'237'498'123.
- Compare 237'498'123 with 237'499'123 for equality.
- Decide whether 237'499'123 or 20'249'472 is larger.
There are two problems with this approach.
First, an odd number of single quotes would result in a line of text
that does not meet the preprocessor syntax for a token.
While most preprocessors do not tokenize lines
that are ignored in #if/#else,
some preprocessors are known to emit errors for such cases.
Second, existing preprocessors
would not recognize the single quote as part of a preprocessor number,
and may thus yield incorrect results.
Daveed Vandevoorde explains the incompatibility in more detail.
For example:
#if defined(__cplusplus)
double pie = 3.141'593;
#endif
In C, the preprocessor-tokens that are #if'ed out are (not including the double quotes) "double", "pie", "=", "3.141", "'", "593", and ";".
However, single and double quotes that aren't part of a larger preprocessor-token are deemed undefined behavior (C99, 6.4/3).
Typical C compilers (GCC, clang, EDG, and MSVC for example) have no problem with it (presumably they don't try to tokenize #if'ed-out lines), but James Dennett mentioned at least one older C compiler didn't like it.
Pete Becker points out that many tools, such as syntax highlighting in editors, rely on quotes being paired. The adaptability of the tools to new expressions is an open issue.
N.M. Maclaren suggests that single quote will lead to very bad error messages with some macro-based libraries.
The Ada programming language uses an underscore (technically, a low line) for the digit separator [AdaLRMnumlit] [AdaRDnumlit]. This approach seems to be used in VHDL and Verilog, also possibly in Algol68. (VHDL also appears to have literal suffixes.) This approach has been proposed more than once for C++, going at least as far back as 1993 [N0259].
- Pronounce 7_237_498_123.
- Compare 237_498_123 with 237_499_123 for equality.
- Decide whether 237_499_123 or 20_249_472 is larger.
In all known cases, the primary proposal has been to permit only a single underscore between digits [N0259] [N2281] [N3342]. However, [N0259] presents an option to permit underscores between the digit sequence and any prefix or suffix.
Underscores work well as a digit separator for C++03 [N0259] [N2281]. But with C++11, there exists a potential ambiguity with user-defined literals [N2747]. While the likely resolution will be some form of "max munch" rule, some mechanism must be present to disambiguate when max munch is too much. We use the term suffix separator to indicate this mechanism.
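The ambiguity can be observed in C++11 as it stands. The following is a small sketch with a hypothetical suffix `_beef` that returns the numeric part unchanged. It shows that an underscore inside a hexadecimal literal already begins a user-defined suffix.

```cpp
#include <cassert>

// Hypothetical literal operator: returns the numeric part unchanged.
constexpr unsigned long long operator""_beef(unsigned long long value) {
    return value;
}

// In C++11 the token 0xdead_beef is the literal 0xdead with suffix _beef,
// not the single number 0xdeadbeef.
static_assert(0xdead_beef == 0xdead, "the suffix _beef receives 0xdead");
static_assert(0xdead_beef != 0xdeadbeef, "the underscore is not a separator");

// Example checks; call from main() to run them.
void check_suffix_examples() {
    assert(0xdead_beef == 0xdead);
}
```

Any underscore-based separator therefore has to say what happens to programs like this one, which is the role of the suffix separator.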
[N2747] suggests a double underscore as a suffix separator.
Mike Miller provides more detail.
... one possibility that occurs to me would be to allow a trailing underscore in an integer literal. The ambiguity with user-defined literals would be resolved in favor of the plain integer literal; a user could disambiguate a user-defined literal by ending the integer part with a trailing underscore. (Double underscores would not be permitted in an integer literal.) Thus:
- 1_ => 1
- 1_2 => 12
- 1__2 => value 1 passed to operator "" _2
- 0xdead_bee_f => 0xdeadbeef
- 0xdead_bee__f => value 0xdeadbee passed to operator "" _f
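One way to read the rule in these examples is sketched below. The type and function names are hypothetical, the input is assumed to begin with a well-formed decimal or hexadecimal literal, and no validation is attempted. Text that the quoted examples do not cover, such as a single underscore followed by a non-digit, is simply returned as the suffix. That is an assumption of this sketch rather than part of the suggestion.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type for the sketch.
struct SplitLiteral {
    std::string digits;  // integer part, separators removed, prefix kept
    std::string suffix;  // remaining text, treated as the user-defined suffix
};

// Sketch of the trailing-underscore reading: a single underscore between
// digits is a separator; an underscore that is not followed by a digit ends
// the integer part and is dropped when it is final or doubled.
SplitLiteral split_trailing_underscore(const std::string& text) {
    const bool hex = text.size() > 2 && text[0] == '0' &&
                     (text[1] == 'x' || text[1] == 'X');
    auto is_digit = [hex](char c) {
        const unsigned char u = static_cast<unsigned char>(c);
        return hex ? std::isxdigit(u) != 0 : std::isdigit(u) != 0;
    };

    SplitLiteral result;
    std::size_t i = 0;
    if (hex) {
        result.digits = text.substr(0, 2);  // keep the 0x or 0X prefix
        i = 2;
    }
    while (i < text.size()) {
        if (is_digit(text[i])) {
            result.digits += text[i];
            ++i;
        } else if (text[i] == '_' && i + 1 < text.size() &&
                   is_digit(text[i + 1])) {
            ++i;  // separator between digits: ignored
        } else {
            break;
        }
    }
    // A trailing underscore closes the integer part and is not kept.
    if (i < text.size() && text[i] == '_' &&
        (i + 1 == text.size() || text[i + 1] == '_')) {
        ++i;
    }
    result.suffix = text.substr(i);
    return result;
}

// Example checks; call from main() to run them.
void check_trailing_underscore_examples() {
    assert(split_trailing_underscore("1__2").digits == "1");
    assert(split_trailing_underscore("1__2").suffix == "_2");
}
```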
The ambiguity with this approach arises when the suffix begins with one or more underscores.
John Spicer suggests something slightly different.
At some point I had suggested using underscore and having a special lookup rule so that something like 0xabc_de would look for the "de" user-defined literal operator, and if not found, would treat the "de" as part of the hex literal. If you wanted to force the use of the operator, you could write 0xabc__de. If you wanted to force the use of a _de operator, you would have to write 0xabc___de.
Another alternative would be to look for the "de" form and then the "_de" form if the first was not found. That way would only require the use of three underscores in cases where you had both a "de" and "_de" operator and wanted to force use of the second.
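The first of these lookup rules might be sketched as follows. Everything here is an assumption made for illustration: the names are hypothetical, the set of declared suffixes stands in for real name lookup, only the last run of underscores is examined, and the text after a single underscore is assumed to consist of hexadecimal digits as in the quoted example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical result type for the sketch.
struct Resolution {
    std::string digits;  // text of the numeric part
    std::string suffix;  // literal operator suffix to call, or empty
};

// Sketch of the first lookup rule: one underscore consults the declared
// suffixes, two underscores force the suffix, and three force a suffix
// that itself begins with an underscore.
Resolution resolve_with_lookup(const std::string& text,
                               const std::set<std::string>& declared) {
    const std::size_t last = text.find_last_of('_');
    if (last == std::string::npos) {
        return {text, ""};  // no underscore: plain literal
    }
    std::size_t first = last;
    while (first > 0 && text[first - 1] == '_') {
        --first;  // walk back to the start of the underscore run
    }
    const std::size_t run = last - first + 1;
    const std::string head = text.substr(0, first);
    const std::string tail = text.substr(last + 1);

    if (run == 1) {
        if (declared.count(tail) != 0) {
            return {head, tail};   // operator found: use it
        }
        return {head + tail, ""};  // not found: tail is more digits
    }
    if (run == 2) {
        return {head, tail};       // forced use of the operator
    }
    return {head, std::string(run - 2, '_') + tail};  // e.g. forces _de
}

// Example checks; call from main() to run them.
void check_lookup_examples() {
    assert(resolve_with_lookup("0xabc_de", {}).digits == "0xabcde");
    assert(resolve_with_lookup("0xabc_de", {"de"}).suffix == "de");
}
```

The sketch makes one cost of the rule visible: the value of 0xabc_de depends on which literal operators happen to be declared at that point in the program.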
[N2747] suggests the scope operator (::)
as a potential suffix separator.
The scope operator would be a pure syntactic extension,
as it could not otherwise follow a literal.
However, it would make substrings of a literal
separately subject to preprocessor symbol substitution.
[N3342] suggests disallowing
a leading underscore followed by a digit as a user-defined literal suffix.
The intent was to make a suffix separator unnecessary.
However, [N3448] points out
that [N3342] fails to disambiguate hexadecimal digits,
particularly in the example 0xdead_beef_db,
where db could be either decibel
or the hexadecimal digits d and b.
One could simply not allow user-defined literals with hexadecimal literals. However, this restriction is not desirable.
Discussions in the October 2012 standards meeting settled on using whitespace as the suffix separator. Unfortunately, that approach causes parsing problems for Objective C/C++.
Richard Smith explains.
An Objective-C message send works like this:
- message-expression:
- [ expression message-selector ]
- message-selector:
- identifier
- keyword-arguments
- keyword-arguments:
- identifieropt : expression keyword-argumentsopt
In particular, this is a valid Objective-C message send:
[self setValue: 0xff units: "cm"]
Hence any proposal which folds a pp-number followed by an identifier into a single literal will break a significant quantity of Objective-C code.
Doug Gregor elaborates.
There are two issues with allowing spaces between a literal and its suffix for Objective-C. One is a true ambiguity and one is a problem for error recovery.
The true ambiguity occurs because one can omit a parameter name from the method declaration, in which case there is no identifier before the ':' in the call. For example, one could have a message send that looks like this:
[a method:10 :11]
which calls the method "method::". Now, consider
[a method:10 _suffix:11]
Currently, this parses (unambiguously) as a message send to "method:_suffix:", i.e., it's parsed as
[a method:(10) _suffix:11] // _suffix is the name of the second argument; calls method:_suffix:
However, if we allow a space between a literal and its suffix, there is a second potential parse:
[a method:(10_suffix) :11] // _suffix is a suffix to the literal 10; calls method::
which is completely ambiguous.
The error-recovery issue is that Objective-C(++) parsers tend to rely heavily on the fact that an expression in C/C++ cannot be immediately followed by an identifier. If we see an expression followed by an identifier in an expression context, it's fairly likely that this is a message send for which the '[' has been dropped. For example, Clang detects these cases and automatically inserts the '[' for the user; this was one of the top error-recovery requests, and a regression here would be considered a major problem for our users.
Jeremiah Willcock suggests using ".." as the suffix separator.
This notation is already permitted by the pp-number syntax.
It is also not presently permitted by any numeric literal.
Its primary disadvantage seems to be that it is unfamiliar.
Clark Nelson suggests using "\" as the suffix separator.
This notation is not permitted by the pp-number syntax.
It is also not presently permitted by any numeric literal.
In this section we present likely wording edits, parameterized by the possible choices.
Edit the grammar as follows. Note that the additional rule for pp-number may not be necessary, depending on the specific chosen format.
- digit-separator:
- to be determined
- pp-number:
- digit
- . digit
- pp-number digit
- pp-number nondigit
- pp-number e sign
- pp-number E sign
- pp-number .
- pp-number digit-separator
Edit the grammar as follows.
- integer-literal:
- decimal-literal integer-suffixopt
- octal-literal integer-suffixopt
- hexadecimal-literal integer-suffixopt
- decimal-literal:
- nonzero-digit
- decimal-literal digit-separatoropt digit
- octal-literal:
- 0
- octal-literal digit-separatoropt octal-digit
- hexadecimal-literal:
- 0x hexadecimal-digit
- 0X hexadecimal-digit
- hexadecimal-literal digit-separatoropt hexadecimal-digit
- nonzero-digit: one of
1 2 3 4 5 6 7 8 9
- octal-digit: one of
0 1 2 3 4 5 6 7
- hexadecimal-digit: one of
0 1 2 3 4 5 6 7 8 9
a b c d e f
A B C D E F
Edit paragraph 1 as follows.
Note that each ?
will be replaced by the actual chosen digit separator character(s).
An integer literal is a sequence of digits that has no period or exponent part, with optional digit separators. These separators are ignored when determining its value. .... [Example: The number twelve can be written 12, 014, or 0XC. The literals 1048576, 1?048?576, 0X100000, 0x10?0000, and 0?004?000?000 all have the same value. —end example]
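The intended meaning of this wording can be sketched with a small evaluator that works on the text of a literal. The function name is hypothetical, the `?` placeholder from the wording stands in for the undecided separator, and integer suffixes are not handled. A separator is accepted only between two digits, which matches the grammar above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical evaluator: the value of an integer literal given as text,
// ignoring digit separators. Integer suffixes are not handled.
unsigned long long integer_literal_value(const std::string& text,
                                         char separator = '?') {
    auto is_hex_digit = [](char c) {
        return std::isxdigit(static_cast<unsigned char>(c)) != 0;
    };

    std::string digits;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] != separator) {
            digits += text[i];
            continue;
        }
        // The grammar only allows a separator between two digits.
        const bool digit_before = i > 0 && is_hex_digit(text[i - 1]);
        const bool digit_after =
            i + 1 < text.size() && is_hex_digit(text[i + 1]);
        if (!digit_before || !digit_after) {
            throw std::invalid_argument("misplaced digit separator");
        }
    }
    // Base 0 selects octal for a leading 0 and hexadecimal for 0x or 0X.
    return std::stoull(digits, nullptr, 0);
}

// Example checks; call from main() to run them.
void check_integer_examples() {
    assert(integer_literal_value("1?048?576") == 1048576ULL);
    assert(integer_literal_value("0?004?000?000") == 1048576ULL);
}
```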
Edit the grammar as follows.
- floating-literal:
- fractional-constant exponent-partopt floating-suffixopt
- digit-sequence exponent-part floating-suffixopt
- fractional-constant:
- digit-sequenceopt . digit-sequence
- digit-sequence .
- exponent-part:
- e signopt digit-sequence
- E signopt digit-sequence
- sign: one of
+ -
- digit-sequence:
- digit
- digit-sequence digit-separatoropt digit
Edit within paragraph 1 as follows.
Note that each ?
will be replaced by the actual chosen digit separator character(s).
.... The integer and fraction parts both consist of a sequence of decimal (base ten) digits, with optional digit separators. These separators are ignored when determining its value. [Example: The literals 1.602?176?565e-19 and 1.602176565e-19 have the same value. —end example] ....
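Because the separator appears only inside digit-sequence, it may not sit next to the radix point, the exponent letter, or the sign. The sketch below illustrates that restriction on the text of a literal, using a hypothetical function name and the `?` placeholder. Floating suffixes are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical evaluator: the value of a floating literal given as text,
// ignoring digit separators. Floating suffixes are not handled.
double floating_literal_value(const std::string& text, char separator = '?') {
    auto is_digit = [](char c) {
        return std::isdigit(static_cast<unsigned char>(c)) != 0;
    };

    std::string digits;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] != separator) {
            digits += text[i];
            continue;
        }
        // A separator must have a decimal digit on both sides, so it cannot
        // touch '.', 'e', 'E', or a sign.
        const bool digit_before = i > 0 && is_digit(text[i - 1]);
        const bool digit_after = i + 1 < text.size() && is_digit(text[i + 1]);
        if (!digit_before || !digit_after) {
            throw std::invalid_argument("misplaced digit separator");
        }
    }
    return std::stod(digits);
}

// Example checks; call from main() to run them.
void check_floating_examples() {
    assert(floating_literal_value("1.602?176?565e-19") ==
           floating_literal_value("1.602176565e-19"));
}
```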
Edit the grammar as follows.
- user-defined-literal:
- user-defined-integer-literal
- user-defined-floating-literal
- user-defined-string-literal
- user-defined-character-literal
- user-defined-integer-literal:
- decimal-literal separated-suffix
- octal-literal separated-suffix
- hexadecimal-literal separated-suffix
- user-defined-floating-literal:
- fractional-constant exponent-partopt separated-suffix
- digit-sequence exponent-part separated-suffix
- user-defined-string-literal:
- string-literal ud-suffix
- user-defined-character-literal:
- character-literal ud-suffix
- separated-suffix:
- literal-separatoropt ud-suffix
- literal-separator:
- to be determined
- ud-suffix:
- identifier
Edit paragraph 1 as follows.
Note that each ?
will be replaced by the actual chosen digit separator character(s)
and each ??
will be replaced by the actual chosen literal separator character(s).
If a token matches both user-defined-literal and another literal kind, it is treated as the latter. [Example: 123_km and 123??km are user-defined-literals, but 123?456 and 12LL are integer-literals —end example] ....
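The part of this example that does not involve the undecided separators can be checked in C++11 today. The sketch below uses a hypothetical `Kilometers` type and `_km` suffix. It shows that 12LL remains an ordinary integer literal, while 123_km goes through the literal operator.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical unit type and literal operator for the sketch.
struct Kilometers {
    unsigned long long value;
};

constexpr Kilometers operator""_km(unsigned long long value) {
    return Kilometers{value};
}

// 12LL could be read as 12 with ud-suffix LL, but it also matches
// integer-literal, so it is treated as an integer literal of type long long.
static_assert(std::is_same<decltype(12LL), long long>::value,
              "12LL is an integer-literal");

// 123_km matches only user-defined-literal.
static_assert(std::is_same<decltype(123_km), Kilometers>::value,
              "123_km is a user-defined-literal");

// Example checks; call from main() to run them.
void check_literal_kinds() {
    assert((123_km).value == 123);
}
```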