2 General
2.1 Scope
2.2 References
4 Characteristics of decimal floating types <decfloat.h>
5 Conversions
5.1 Conversions between decimal floating and integer
5.2 Conversions among decimal floating types, and between decimal float types and non-decimal floating types
5.3 Conversions between decimal floating and complex
5.4 Usual arithmetic conversions
5.5 Default argument promotion
7 Floating-point environment <fenv.h>
8 Arithmetic operations
8.1 Operators
8.2 Functions
8.3 Conversions
9 Library
9.1 Decimal mathematics <math.h>
9.2 New functions
9.2.1 divide_integer functions
9.2.2 remainder_near functions
9.2.3 quantize functions
9.2.4 round_to_integer functions
9.2.5 normalize functions
9.3 Formatted input/output specifiers
9.4 strtod32, strtod64, and strtod128 functions <stdlib.h>
9.5 wcstod32, wcstod64, and wcstod128 functions <wchar.h>
9.6 Type-generic macros <tgmath.h>
However, human computation and communication of numeric values almost always uses decimal arithmetic and decimal notations. Laboratory notes, scientific papers, legal documents, business reports and financial statements all record numeric values in decimal form. When numeric data are given to a program or are displayed to a user, binary to-and-from decimal conversion is required. There are inherent rounding errors involved in such conversions; decimal fractions cannot, in general, be represented exactly by binary floating-point values. These errors often cause usability and efficiency problems, depending on the application.
These problems are minor when the application domain accepts, or requires results to have, associated error estimates (as is the case with scientific applications). However, in business and financial applications, computations are required either to be exact (with no rounding errors) unless explicitly rounded, or to be supported by detailed analyses that can be audited for correctness. Such applications therefore have to take special care in handling any rounding errors introduced by the computations.
The most efficient way to avoid conversion error is to use decimal arithmetic. Currently, the IBM z-architecture (and its predecessors since System/360) is a widely used system that supports built-in decimal arithmetic. This, however, provides integer arithmetic only, so every number and computation must have separate scale information preserved and computed in order to maintain the required precision and value range. Such scaling is difficult to code and is error-prone; it affects execution time significantly, and the resulting program is often difficult to maintain and enhance.
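The scale bookkeeping described above can be illustrated with a minimal sketch. The type scaled_dec and the functions scaled_add and scaled_mul are hypothetical names introduced here for illustration; they are not part of any hardware interface or of this Technical Report, and overflow is not checked.

```c
#include <assert.h>
#include <stdint.h>

/* A scaled decimal integer: the value is units * 10^(-scale).
   Hypothetical type, for illustration only. */
typedef struct {
    int64_t units;
    int scale;
} scaled_dec;

/* 10^n for small non-negative n (no overflow checking in this sketch). */
static int64_t pow10_i64(int n)
{
    int64_t p = 1;
    assert(n >= 0 && n <= 18);
    while (n-- > 0)
        p *= 10;
    return p;
}

/* Addition: both operands must first be brought to a common scale. */
scaled_dec scaled_add(scaled_dec a, scaled_dec b)
{
    scaled_dec r;
    r.scale = a.scale > b.scale ? a.scale : b.scale;
    r.units = a.units * pow10_i64(r.scale - a.scale)
            + b.units * pow10_i64(r.scale - b.scale);
    return r;
}

/* Multiplication: the units multiply and the scales add. */
scaled_dec scaled_mul(scaled_dec a, scaled_dec b)
{
    scaled_dec r;
    r.units = a.units * b.units;
    r.scale = a.scale + b.scale;
    return r;
}
```

For example, adding 1.10 (units 110, scale 2) and 0.5 (units 5, scale 1) gives units 160 with scale 2, while multiplying them gives units 550 with scale 3. Every operation has to track the scale by hand in this way, which is the burden a decimal floating type removes.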
Even though the hardware may not provide decimal arithmetic operations, the support can still be emulated by software. Programming languages used for business applications either have native decimal types (such as PL/I, COBOL, C#, or Visual Basic) or provide decimal arithmetic libraries (such as the BigDecimal class in Java). The arithmetic used, nowadays, is almost invariably decimal floating-point; the COBOL 2002 ISO standard, for example, requires that all standard decimal arithmetic calculations use 32-digit decimal floating-point.
At present, all languages use software for decimal arithmetic. Even the best packages are slow, and can be 100 times slower than a corresponding hardware implementation, and in some cases much slower. At least one processor manufacturer, therefore, is adding decimal floating-point in hardware.
Arguably, the C language hits a sweet spot within the wide range of programming languages available today – it strikes an optimal balance between usability and performance. Its simple and expressive syntax makes it easy to program; and its close-to-the-hardware semantics makes it efficient. Despite the advent of newer programming languages, C is still often used together with other languages to code the computationally intensive part of an application. In many cases, entire business applications are written in C/C++. To maintain the vitality of C, the need for decimal arithmetic by the business and financial community cannot be ignored.
The importance of this has been recognized by the IEEE. The IEEE 754 standard is currently being revised, and the major change in that revision is the addition of decimal floating-point formats and arithmetic. These decimal data types are almost as efficient as the binary types, and are especially suitable for hardware implementation; it is possible that they will become the most widely used primitive data types once hardware implementations are available.
Historically there has been a close tie between IEEE-754 and C with respect to floating-point specification. With the revised IEEE-754 nearing the final approval stage, it is now the appropriate time for C to consider adding decimal types and arithmetic to its specification.
There are three components to the model:
- numbers - which represent the values which can be manipulated by, or be the results of, the core operations defined in the model
- operations - the core operations (such as addition, multiplication, etc.) which can be carried out on numbers
- context - which represents the user-selectable parameters and rules which govern the results of arithmetic operations (for example, the rounding mode to be used)
The model defines these components in the abstract. It neither defines the way in which operations are expressed (which might vary depending on the computer language or other interface being used), nor does it define the concrete representation (specific layout in storage, or in a processor's register, for example) of numbers or context.
From the perspective of the C language, numbers are represented by data types, operations are defined within expressions, and context is the floating-point environment specified in <fenv.h>. This Technical Report specifies how the C language implements these components.
Note: A description of the arithmetic model can be found in http://www2.hursley.ibm.com/decimal/decarith.html.
Note: A description of the encodings can be found in http://www2.hursley.ibm.com/decimal/decbits.html.
C99 specifies floating-point arithmetic using a two-layer organization. The first layer provides a specification using an abstract model. The representation of a floating-point number is specified in an abstract form in which the constituent components of the representation are defined (sign, exponent, significand) but not the internals of these components. In particular, the exponent range, the significand size and the base (or radix) are implementation-defined. This gives an implementation the flexibility to take advantage of its underlying hardware architecture. Furthermore, certain behaviors of operations are also implementation-defined, for example the handling of special numbers and of exceptions.
The reason for this approach is historical. At the time when C was first standardized, there were already various hardware implementations of floating-point arithmetic in common use. Specifying the exact details of a representation would have made most of the implementations existing at the time nonconforming.
C99 provides a binding to IEEE-754 by specifying an annex F and adopting that standard by reference. An implementation not conforming to IEEE-754 can choose to do so by not defining the macro __STDC_IEC_559__. This means not all implementations need to support IEEE-754, and the floating-point arithmetic need not be binary.
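A program can inspect both properties at translation time. The following is a minimal sketch using only the standard macros FLT_RADIX and __STDC_IEC_559__; the function names generic_radix and claims_iec559 are hypothetical and chosen here for illustration.

```c
#include <float.h>

/* Radix of the generic floating types on this implementation. */
int generic_radix(void)
{
    return FLT_RADIX;
}

/* Returns 1 if the implementation claims conformance to Annex F
   (IEC 60559 / IEEE-754 binding), 0 otherwise. */
int claims_iec559(void)
{
#ifdef __STDC_IEC_559__
    return 1;
#else
    return 0;
#endif
}
```

An implementation for which claims_iec559() returns 0 may still use any radix for float, double and long double, which is why the generic floating types cannot be relied upon for decimal arithmetic.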
This Technical Report specifies decimal floating-point arithmetic according to the IEEE-754R, with the constituent components of the representation defined. This is more stringent than the existing C99 approach for the floating types. Since it is expected that all decimal floating-point hardware implementations will conform to the revised IEEE 754, binding to this standard directly benefits both implementors and programmers.
This Technical Report does not specify binary floating-point arithmetic.
2.2.1 ISO/IEC 9899:1999, Information technology - Programming languages, their environments and system software interfaces - Programming Language C.
2.2.1.1 ISO/IEC 9899:1999, Technical Corrigendum 1 to Programming Language C.
2.2.2 ANSI/IEEE 754-1985 - IEEE Standard for Binary Floating-Point Arithmetic. The Institute of Electrical and Electronics Engineers, Inc., New York, 1985.
2.2.2.1 The IEEE 754 revision working group is currently revising the specification for floating-point arithmetic: ANSI/IEEE 754R - IEEE Standard for Floating-Point Arithmetic. The Institute of Electrical and Electronics Engineers, Inc. Draft.
2.2.3 ANSI/IEEE 854-1987 - IEEE Standard for Radix-Independent Floating-Point Arithmetic. The Institute of Electrical and Electronics Engineers, Inc., New York, 1987.
2.2.4 A Decimal Floating-Point Specification, Schwarz, Cowlishaw, Smith, and Webb, in the Proceedings of the 15th IEEE Symposium on Computer Arithmetic (Arith 15), IEEE, June 2001.
Note: Reference materials relating to IEEE-754R can be found in http://grouper.ieee.org/groups/754/ and http://www.validlab.com/754R/.
A single token is used as a type name to make it easy for C++ to implement the types as classes.
Within the type hierarchy, decimal floating types are base types, real types and arithmetic types.
The types float, double and long double are also called generic floating types for the purpose of this Technical Report.
Note: C does not specify a radix for float, double and long double. An implementation can choose the representation of float, double and long double to be the same as the decimal floating types. In any case, the decimal floating types are distinct from float, double and long double regardless of the representation.
Note: This Technical Report does not define decimal complex types or decimal imaginary types. The three complex types remain float _Complex, double _Complex and long double _Complex, and the three imaginary types remain float _Imaginary, double _Imaginary and long double _Imaginary.
Following are suggested changes to C99:
Change the first sentence of 6.2.5#10.
[10] There are three generic floating types, designated as float, double and long double.
Add the following paragraphs after 6.2.5#10.
[10a] There are three decimal floating types, designated as _Decimal32, _Decimal64 and _Decimal128. The set of values of the type _Decimal32 is a subset of the set of values of the type _Decimal64; the set of values of the type _Decimal64 is a subset of the set of values of the type _Decimal128. Support for _Decimal128 is optional. Decimal floating types are real floating types.
[10b] The generic floating types and decimal floating types are real floating types.
Add the following to 6.7.2 Type specifiers:
type-specifier:
_Decimal32
_Decimal64
_Decimal128
The characteristics of decimal floating types are defined in terms of a model specifying general decimal arithmetic (refer to 1.2). The encodings are specified in IEEE-754R (refer to 1.3).
The three decimal encoding formats defined in IEEE-754R correspond to the three decimal floating types as follows:
- _Decimal32 is a decimal32 number, which is encoded in four consecutive octets (32 bits)
- _Decimal64 is a decimal64 number, which is encoded in eight consecutive octets (64 bits)
- _Decimal128 is a decimal128 number, which is encoded in 16 consecutive octets (128 bits)
The finite numbers are defined by a sign, an exponent (which is a power of ten), and a decimal integer coefficient. The value of a finite number is given by (-1)^{sign} x coefficient x 10^{exponent}. Refer to IEEE-754R for details of the format.
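The value formula (-1)^{sign} x coefficient x 10^{exponent} can be made concrete with a small model. This is a sketch only: the struct dec_model and the function dec_model_value are hypothetical names, the struct is not an IEEE-754R encoding, and the result is merely a binary double approximation of the decimal value.

```c
#include <stdint.h>

/* Hypothetical model of a finite decimal number (not an IEEE-754R
   encoding): sign, decimal integer coefficient, power-of-ten exponent. */
typedef struct {
    int sign;             /* 0 for positive, 1 for negative */
    uint64_t coefficient; /* decimal integer coefficient */
    int exponent;         /* power of ten */
} dec_model;

/* Approximates (-1)^sign x coefficient x 10^exponent as a double. */
double dec_model_value(dec_model d)
{
    double v = (double)d.coefficient;
    int e = d.exponent;

    while (e > 0) { v *= 10.0; e--; }
    while (e < 0) { v /= 10.0; e++; }
    return d.sign ? -v : v;
}
```

Note that the triples (0, 120, -2) and (0, 12, -1) denote the same value, 1.20 and 1.2 respectively, but are distinct representations. This distinction is what the quantize, samequantum and normalize functions in section 9.2 operate on.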
These formats are characterized by the length of the coefficient, and the maximum and minimum exponent. The table below shows these characteristics by format:
Format | _Decimal32 | _Decimal64 | _Decimal128 |
Coefficient length in digits | 7 | 16 | 34 |
Maximum Exponent (E_{max}) | 96 | 384 | 6144 |
Minimum Exponent (E_{min}) | -95 | -383 | -6143 |
The new header <decfloat.h> defines several macros that expand to various limits and parameters of the decimal floating types. These macros have names and meanings similar to those of the corresponding macros in <float.h>.
Suggested change to C99.
Add the following after 5.2.4.2.2:
5.2.4.2.2a Characteristics of decimal floating types <decfloat.h>
[1] The characteristics of decimal floating types are defined in terms of the format described in IEEE-754R. The finite numbers are defined by a sign, an exponent (which is a power of ten), and a decimal integer coefficient. The value of a finite number is given by (-1)^{sign} x coefficient x 10^{exponent}. The macros defined in <decfloat.h> provide the characteristics of these representations, which are defined in the Decimal Arithmetic Encoding. The prefixes DEC32_, DEC64_, and DEC128_ are used to denote the types _Decimal32, _Decimal64, and _Decimal128 respectively.
[2] Except for assignment and casts, the values of operations with decimal floating operands and values subject to the usual arithmetic conversions and of decimal floating constants are evaluated to a format whose range and precision may be greater than required by the type. The use of evaluation formats is characterized by the implementation-defined value of DEC_EVAL_METHOD:
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the type;
1 evaluate operations and constants of type _Decimal32 and _Decimal64 to the range and precision of the _Decimal64 type, evaluate _Decimal128 operations and constants to the range and precision of the _Decimal128 type;
2 evaluate all operations and constants to the range and precision of the _Decimal128 type.
All other negative values for DEC_EVAL_METHOD characterize implementation-defined behavior.
[3] The values given in the following list shall be replaced by constant expressions suitable for use in #if preprocessing directives:
- number of digits in the coefficient
DEC32_MANT_DIG 7
DEC64_MANT_DIG 16
DEC128_MANT_DIG 34
- minimum exponent
DEC32_MIN_EXP -95
DEC64_MIN_EXP -383
DEC128_MIN_EXP -6143
- maximum exponent
DEC32_MAX_EXP 96
DEC64_MAX_EXP 384
DEC128_MAX_EXP 6144
- maximum representable finite decimal floating number (there are 6, 15 and 33 9's after the decimal points respectively)
DEC32_MAX 9.999999E96DF
DEC64_MAX 9.999999999999999E384DD
DEC128_MAX 9.999999999999999999999999999999999E6144DL
- the difference between 1 and the least value greater than 1 that is representable in the given floating point type
DEC32_EPSILON 1E-6DF
DEC64_EPSILON 1E-15DD
DEC128_EPSILON 1E-33DL
- minimum normalized positive decimal floating number
DEC32_MIN 1E-95DF
DEC64_MIN 1E-383DD
DEC128_MIN 1E-6143DL
- minimum denormalized positive decimal floating number
DEC32_DEN 0.000001E-95DF
DEC64_DEN 0.000000000000001E-383DD
DEC128_DEN 0.000000000000000000000000000000001E-6143DL
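The EPSILON and DEN values above follow directly from the coefficient length and the minimum exponent of each format. The sketch below derives their decimal exponents from the tabulated parameters; the struct dec_format, the constants DEC32_FMT, DEC64_FMT and DEC128_FMT, and the two functions are hypothetical names used only for this illustration.

```c
/* Parameters of a decimal format, as tabulated in this Technical Report. */
typedef struct {
    int mant_dig; /* coefficient length in digits */
    int max_exp;  /* E_max */
    int min_exp;  /* E_min */
} dec_format;

static const dec_format DEC32_FMT  = { 7,  96,   -95 };
static const dec_format DEC64_FMT  = { 16, 384,  -383 };
static const dec_format DEC128_FMT = { 34, 6144, -6143 };

/* Decimal exponent e such that the format's EPSILON equals 1E<e>:
   one unit in the last of mant_dig digits of 1.000...0. */
int dec_epsilon_exp10(dec_format f)
{
    return 1 - f.mant_dig;
}

/* Decimal exponent e such that the smallest denormalized positive
   number equals 1E<e>: the last digit position at the minimum exponent. */
int dec_den_exp10(dec_format f)
{
    return f.min_exp - (f.mant_dig - 1);
}
```

For _Decimal32 this gives 1E-6 for the epsilon and 1E-101 for the smallest denormalized number, the latter being the same value as 0.000001E-95.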
When the new type is a decimal floating type, we have these choices: the most positive/negative number representable, positive/negative infinity, and quiet NaN. The first provides no indication to the program that something exceptional has happened. The second provides an indication, and since other operations that produce infinity also raise an exception, an exception would be raised here for consistency. The third allows the program to detect the condition and provides a way for the implementation to encode the condition (for example, where it occurs). The third is used here.
When the new type is an unsigned integral type, the values that create problems are those less than 0 and those greater than Utype_MAX. There is no overflow/underflow processing for unsigned arithmetic. A possible choice for the result would be Utype_MAX. Also, common existing implementations do not raise signals for signed integer arithmetic. When the new type is a signed integral type, the values that create problems are those less than type_MIN and those greater than type_MAX. The result here could be type_MIN or type_MAX depending on whether the original value is negative or positive.
To make the behavior consistent among all real floating types, the suggested changes below apply to all real floating types, not just decimal floating types.
Suggested changes to C99.
Change the last sentence of 6.3.1.4 paragraph 1 to:
[1] ... If the value of the integral part cannot be represented by the integer type, the result is the largest representable number if the type is unsigned, and the most negative or positive number according to the sign of the floating point number if the type is signed.
Change the last sentence of 6.3.1.4 paragraph 2 to:
[2] ... If the value being converted is outside the range of values that can be represented, the result is quiet NaN.
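The proposed floating-to-integer rule of paragraph 1 can be sketched in portable C as follows. This models the suggested behavior only; it is not what C99 currently specifies (where the out-of-range case is undefined). The function names are hypothetical, the treatment of NaN is this sketch's own choice since the text above does not address it, and the code assumes that INT_MAX + 1 and UINT_MAX + 1 are exactly representable as double (true for 32-bit int with IEEE-754 double).

```c
#include <limits.h>

/* Conversion to int with the saturating behavior proposed above.
   Assumption: NaN yields 0, which the proposed text does not specify. */
int to_int_saturating(double x)
{
    if (x != x)                          /* NaN */
        return 0;
    if (x >= (double)INT_MAX + 1.0)      /* integral part too large */
        return INT_MAX;
    if (x <= (double)INT_MIN - 1.0)      /* integral part too small */
        return INT_MIN;
    return (int)x;                       /* integral part is representable */
}

/* Conversion to unsigned int: any unrepresentable integral part,
   negative or too large, yields the largest representable number. */
unsigned int to_uint_saturating(double x)
{
    if (x != x)                          /* NaN */
        return 0u;
    if (x <= -1.0 || x >= (double)UINT_MAX + 1.0)
        return UINT_MAX;
    return (unsigned int)x;
}
```

Values strictly between -1 and 0 have a representable integral part of zero, so they convert to 0 rather than saturating.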
Suggested change to C99:
Add after 6.3.1.5#2.
[3] When a _Decimal32 is promoted to _Decimal64 or _Decimal128, or a _Decimal64 is promoted to _Decimal128, the value is converted to the type being promoted to.
[4] When a _Decimal64 is demoted to _Decimal32, a _Decimal128 is demoted to _Decimal64 or _Decimal32, or conversion is performed among decimal and generic floating types other than the above, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is correctly rounded. If the value being converted is outside the range of values that can be represented, the result is dependent on the rounding mode. If the rounding mode is:
near, if the value being converted is less than the maximum representable value of a hypothetical representation having one more digit in the mantissa of the target type, the result is the maximum value of the target type (note 1); otherwise the absolute value of the result is one of HUGE_VAL, HUGE_VALF, HUGE_VALL, HUGE_VAL_D64, HUGE_VAL_D32 or HUGE_VAL_D128 depending on the result type and the sign is the same as the value being converted.
zero, the value is the most positive representable if the value being converted is positive, and the most negative number representable otherwise.
positive infinity, the value is same as zero if the value being converted is negative, and is same as near otherwise.
negative infinity, the value is same as near if the value being converted is negative, and is same as zero otherwise.
note 1: That is, the values that are between MAX and MAX*(1+ulp/10)
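The out-of-range rule above reduces to a small decision table on the rounding mode and the sign of the source value. The sketch below encodes it; the enumerations and the function overflow_magnitude are hypothetical names, and the narrow band just above the maximum value described in note 1 is deliberately not modelled.

```c
/* Rounding modes named as in the text above (hypothetical identifiers). */
typedef enum {
    ROUND_NEAR,
    ROUND_ZERO,
    ROUND_POS_INF,
    ROUND_NEG_INF
} round_mode;

/* Magnitude of the converted result; its sign is that of the source. */
typedef enum {
    RESULT_MAX_FINITE, /* most positive or most negative representable */
    RESULT_HUGE_VAL    /* the HUGE_VAL macro for the result type */
} overflow_result;

/* Result magnitude when the value converted lies well outside the
   range of the target type (beyond the band covered by note 1). */
overflow_result overflow_magnitude(round_mode mode, int negative)
{
    switch (mode) {
    case ROUND_NEAR:
        return RESULT_HUGE_VAL;
    case ROUND_ZERO:
        return RESULT_MAX_FINITE;
    case ROUND_POS_INF:
        return negative ? RESULT_MAX_FINITE : RESULT_HUGE_VAL;
    case ROUND_NEG_INF:
        return negative ? RESULT_HUGE_VAL : RESULT_MAX_FINITE;
    }
    return RESULT_HUGE_VAL; /* not reached for valid modes */
}
```

In every case the result keeps the sign of the value being converted; only the magnitude depends on the rounding mode.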
This is covered by C99 6.3.1.7.
One major difficulty of allowing mixed operation is in the determination of the common type. C99 does not specify exactly the range and precision of the generic real types. The pecking order between them and the decimal types is therefore unspecified. Given two (or more) mixed type operands, there is no simple rule to define a common type that would guarantee portability in general.
For example, we can define the common type to be the one with greater range (the suggested change below). But since a double type may have different range under different implementations, a program cannot assume the resulting type of an addition, say, involving both _Decimal64 and double. This imposes limitations on how to write portable programs.
If the generic real type is a type defined in IEEE-754R, and if we use the greater-range rule, the common type is easily determined. When mixing decimal and binary types of the same type size, the decimal type is the common type. When mixing types of different sizes, the common type is the one with the larger size. The suggested change below uses this approach but does not assume the generic real type to follow IEEE-754R. This guarantees consistent behavior among implementations that use IEEE-754 in their binary floating-point arithmetic, and at the same time provides reasonable behavior for those that don't. Annex C presents an alternate suggestion that disallows mixed operands.
Following are suggested changes to C99.
Insert the following to 6.3.1.8#1, after "This pattern is called the usual arithmetic conversions:"
6.3.1.8[1]
... This pattern is called the usual arithmetic conversions:
If one operand is a decimal floating type, no other operand shall have generic floating type, complex type or imaginary type:
First, if either operand is _Decimal128, the other operand is converted to _Decimal128.
Otherwise, if either operand is _Decimal64, the other operand is converted to _Decimal64.
Otherwise, if either operand is _Decimal32, the other operand is converted to _Decimal32.
If there are no decimal floating types in the operands:
First, if the corresponding real type of either operand is long double, the other operand is converted, ... <the rest of 6.3.1.8#1 remains the same>
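The decimal part of the rule above can be sketched as a function over operand categories. The enumerations and function names are hypothetical and serve only to illustrate the ordering of the checks; the case with no decimal operand is simply reported as falling through to the existing C99 rules.

```c
/* Category of an operand's type (hypothetical classification). */
typedef enum {
    KIND_INTEGER,
    KIND_GENERIC,    /* generic floating, complex or imaginary type */
    KIND_DECIMAL32,
    KIND_DECIMAL64,
    KIND_DECIMAL128
} operand_kind;

/* Outcome of the usual arithmetic conversions for two operands. */
typedef enum {
    COMMON_C99_RULES, /* no decimal operand: existing rules apply */
    COMMON_INVALID,   /* decimal mixed with generic: constraint violated */
    COMMON_DECIMAL32,
    COMMON_DECIMAL64,
    COMMON_DECIMAL128
} common_type;

static int is_decimal(operand_kind k)
{
    return k == KIND_DECIMAL32 || k == KIND_DECIMAL64
        || k == KIND_DECIMAL128;
}

common_type usual_arithmetic_common_type(operand_kind a, operand_kind b)
{
    if (!is_decimal(a) && !is_decimal(b))
        return COMMON_C99_RULES;
    if (a == KIND_GENERIC || b == KIND_GENERIC)
        return COMMON_INVALID;
    if (a == KIND_DECIMAL128 || b == KIND_DECIMAL128)
        return COMMON_DECIMAL128;
    if (a == KIND_DECIMAL64 || b == KIND_DECIMAL64)
        return COMMON_DECIMAL64;
    return COMMON_DECIMAL32;
}
```

Under this rule an integer operand is converted to the decimal type of the other operand, while a _Decimal64 combined with a double is rejected rather than silently converted.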
Suggested changes to C99.
Add the following to 6.4.4.2 floating-suffix.
floating-suffix: one of f l F L df dd dl DF DD DL
Add the following paragraph after 6.4.4.2#2:
6.4.4.2
...
[2a]
Constraints
The df, dd, dl, DF, DD and DL shall not be used in a hexadecimal-floating-constant.
Add the following paragraph after 6.4.4.2#4:
6.4.4.2
...
[4a] If a floating constant is suffixed by df or DF, it has type _Decimal32. If suffixed by dd or DD, it has type _Decimal64. If suffixed by dl or DL, it has type _Decimal128.
Suggested changes to C99.
Add the following after 7.6#7:
7.6
...
[7a] Each of the macros
FE_DEC_DOWNWARD
FE_DEC_TONEAREST
FE_DEC_TONEARESTFROMZERO
FE_DEC_TOWARDZERO
FE_DEC_UPWARD
is defined and used by the fegetround and fesetround functions for getting and setting the rounding mode of decimal floating-point operations.
Add the following paragraph after 7.6#5.
7.6
...
[5a] Each of the macros
FE_DEC_DIVBYZERO
FE_DEC_INEXACT
FE_DEC_INVALID
FE_DEC_OVERFLOW
FE_DEC_UNDERFLOW
is defined and used by the functions defined in C99 7.6.2; the exceptions they denote can occur as side effects of decimal floating-point operations.
Square root, min, max, fused multiply-add and remainder are implemented as library functions. Refer to section 9 below.
Conversions between different formats and to integer formats are covered under section 5.
The names of the functions are derived by adding the suffixes d32, d64 and d128 to the double version of the function name.
Suggested changes to C99:
Add at the end of 7.12 paragraph 3 the following macros.
7.12
[3] The macro
Add at the end of 7.12 paragraph 4 the following macro.
7.12
[4] ...
DEC_INFINITY
Add at the end of 7.12 paragraph 5 the following macro.
7.12
[5] ...
DEC_NAN
expands to quiet decimal floating NaN for the type _Decimal32.
7.12.10.4 The divide integer functions
Synopsis
#include <math.h>
_Decimal32 divide_integerd32 (_Decimal32 x, _Decimal32 y);
_Decimal64 divide_integerd64 (_Decimal64 x, _Decimal64 y);
_Decimal128 divide_integerd128(_Decimal128 x, _Decimal128 y);
Description
The divide_integer functions perform the divide-integer operation as defined in IEEE 754R.
Suggested addition to C99:
7.12.10.5 The remainder near functions
Synopsis
#include <math.h>
_Decimal32 remainder_neard32 (_Decimal32 x, _Decimal32 y);
_Decimal64 remainder_neard64 (_Decimal64 x, _Decimal64 y);
_Decimal128 remainder_neard128(_Decimal128 x, _Decimal128 y);
Description
The remainder_near functions perform the remainder-near operation as defined in IEEE 754R.
7.12.11.5 The quantize functions
Synopsis
#include <math.h>
_Decimal32 quantized32 (_Decimal32 x, _Decimal32 y);
_Decimal64 quantized64 (_Decimal64 x, _Decimal64 y);
_Decimal128 quantized128(_Decimal128 x, _Decimal128 y);
Description
The quantize functions perform the quantize operation as defined in IEEE 754R.
7.12.11.6 The samequantum functions
Synopsis
#include <math.h>
_Bool samequantumd32 (_Decimal32 x, _Decimal32 y);
_Bool samequantumd64 (_Decimal64 x, _Decimal64 y);
_Bool samequantumd128 (_Decimal128 x, _Decimal128 y);
Description
The samequantum functions perform the samequantum operation as defined in IEEE 754R.
Suggested addition to C99:
7.12.11.6 The round to integral functions
Synopsis
#include <math.h>
_Decimal32 round_to_integerd32 (_Decimal32 x, _Decimal32 y);
_Decimal64 round_to_integerd64 (_Decimal64 x, _Decimal64 y);
_Decimal128 round_to_integerd128(_Decimal128 x, _Decimal128 y);
Description
The round_to_integer functions perform the round-to-integer operation as defined in IEEE 754R.
7.12.15 The normalize functions
Synopsis
#include <math.h>
_Decimal32 normalized32 (_Decimal32 x);
_Decimal64 normalized64 (_Decimal64 x);
_Decimal128 normalized128 (_Decimal128 x);
Description
The normalize functions perform the normalize operation as defined in IEEE 754R.
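The intent of quantize, samequantum and normalize can be illustrated on a simple (sign, coefficient, exponent) model. This is a sketch under stated assumptions, not the IEEE 754R definition: the struct dec_num and the model_ functions are hypothetical names, rounding is fixed at round-half-even rather than taken from the current rounding mode, and NaNs, infinities, precision limits and coefficient overflow are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a finite decimal number. */
typedef struct {
    int sign;             /* 0 for positive, 1 for negative */
    uint64_t coefficient; /* decimal integer coefficient */
    int exponent;         /* power of ten */
} dec_num;

/* normalize: remove trailing zeros from the coefficient. */
dec_num model_normalize(dec_num x)
{
    if (x.coefficient == 0) {
        x.exponent = 0;
        return x;
    }
    while (x.coefficient % 10 == 0) {
        x.coefficient /= 10;
        x.exponent++;
    }
    return x;
}

/* samequantum: true when both operands have the same exponent. */
bool model_samequantum(dec_num x, dec_num y)
{
    return x.exponent == y.exponent;
}

/* quantize: the value of x expressed with the exponent of y,
   rounding half to even when digits have to be dropped. */
dec_num model_quantize(dec_num x, dec_num y)
{
    while (x.exponent > y.exponent) {   /* pad with trailing zeros */
        x.coefficient *= 10;
        x.exponent--;
    }
    if (x.exponent < y.exponent) {
        int drop = y.exponent - x.exponent;
        uint64_t divisor = 1, q, r;

        if (drop > 19) {                /* every digit is dropped */
            x.coefficient = 0;
            x.exponent = y.exponent;
            return x;
        }
        while (drop-- > 0)
            divisor *= 10;
        q = x.coefficient / divisor;
        r = x.coefficient % divisor;
        if (r > divisor / 2 || (r == divisor / 2 && (q & 1)))
            q++;
        x.coefficient = q;
        x.exponent = y.exponent;
    }
    return x;
}
```

For example, quantizing 12.345 (coefficient 12345, exponent -3) to the quantum of 0.01 gives 12.34 (coefficient 1234, exponent -2), and normalizing 12.00 (coefficient 1200, exponent -2) gives coefficient 12 with exponent 0.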
Similarly, the modifiers HD, D and LD can be appended to f, F, e, E, g, and G to form input specifiers that indicate the argument is a pointer to _Decimal32, _Decimal64 or _Decimal128 respectively.
Synopsis
#include <stdlib.h>
_Decimal32 strtod32 (const char * restrict nptr, char ** restrict endptr);
_Decimal64 strtod64 (const char * restrict nptr, char ** restrict endptr);
_Decimal128 strtod128(const char * restrict nptr, char ** restrict endptr);
Synopsis
#include <wchar.h>
_Decimal32 wcstod32 (const wchar_t * restrict nptr, wchar_t ** restrict endptr);
_Decimal64 wcstod64 (const wchar_t * restrict nptr, wchar_t ** restrict endptr);
_Decimal128 wcstod128(const wchar_t * restrict nptr, wchar_t ** restrict endptr);
If there is more than one argument, usual arithmetic conversions are applied so that both arguments have compatible types. Then,
Below is the suggested text for strtod32, strtod64, and strtod128, copied from C99 7.20.1.3 with editing. Editing is indicated by strike through (delete) and underline (change, new). Refer also to the handling of Signaling NaNs suggested by WG14 paper N1011.
7.20.1.5 The strtod32, strtod64, and strtod128 functions
Synopsis
[#1]
#include <stdlib.h>
_Decimal32 strtod32 (const char * restrict nptr, char ** restrict endptr);
_Decimal64 strtod64 (const char * restrict nptr, char ** restrict endptr);
_Decimal128 strtod128(const char * restrict nptr, char ** restrict endptr);
Description
[#2] The strtod32, strtod64, and strtod128 functions convert the initial portion of the string pointed to by nptr to _Decimal32, _Decimal64, and _Decimal128 representation, respectively. First, they decompose the input string into three parts: an initial, possibly empty, sequence of white-space characters (as specified by the isspace function), a subject sequence resembling a floating-point constant or representing an infinity or NaN; and a final string of one or more unrecognized characters, including the terminating null character of the input string. Then, they attempt to convert the subject sequence to a floating-point number, and return the result.
[#3] The expected form of the subject sequence is an optional plus or minus sign, then one of the following:
n-char-sequence:
The length of the n-char-sequence shall be shorter than D32_COEFF_DIG, D64_COEFF_DIG or D128_COEFF_DIG respectively depending on the return type. The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form. The subject sequence contains no characters if the input string is not of the expected form.
[#4] If the subject sequence has the expected form for a floating-point number, the sequence of characters starting with the first digit or the decimal-point character (whichever occurs first) is interpreted as a floating constant according to the rules of 6.4.4.2, except that it is not a hexadecimal floating number, that the decimal-point character is used in place of a period, and that if neither an exponent part nor a decimal-point character appears in a decimal floating point number, or if a binary exponent part does not appear in a hexadecimal floating point number, an exponent part of the appropriate type with value zero is assumed to follow the last digit in the string. If the subject sequence begins with a minus sign, the sequence is interpreted as negated. note 1) A character sequence INF or INFINITY is interpreted as an infinity, if representable in the return type, else like a floating constant that is too large for the range of the return type. A character sequence NAN or NAN(n-char-sequence-opt), or SNAN or SNAN(n-char-sequence-opt), is interpreted as a quiet NaN or signalling NaN respectively; the meaning of the n-char sequences is implementation-defined. note2) A pointer to the final string is stored in the object pointed to by endptr, provided that endptr is not a null pointer.
[#5] The value is converted according to F.5. The result from the conversion is correctly rounded.
[#6] In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.
[#7] If the subject sequence is empty or does not have the expected form, no conversion is performed; the value of nptr is stored in the object pointed to by endptr, provided that endptr is not a null pointer.
Recommended practice
[#8] If the subject sequence has the hexadecimal form, FLT_RADIX is not a power of 2, and the result is not exactly representable, the result should be one of the two numbers in the appropriate internal format that are adjacent to the hexadecimal floating source value, with the extra stipulation that the error should have a correct sign for the current rounding direction.
[#9] If the subject sequence has the decimal form and at most DEC128_COEFF_DIG (defined in <decfloat.h>) significant digits, the result should be correctly rounded. If the subject sequence D has the decimal form and more than DEC128_COEFF_DIG significant digits, consider the two bounding, adjacent decimal strings L and U, both having DEC128_COEFF_DIG significant digits, such that the values of L, D, and U satisfy L <= D <= U. The result should be one of the (equal or adjacent) values that would be obtained by correctly rounding L and U according to the current rounding direction, with the extra stipulation that the error with respect to D should have a correct sign for the current rounding direction. 252)
Returns
[#10] The functions return the converted value, if any. If no conversion could be performed, zero is returned. If the correct value is outside the range of representable values, plus or minus HUGE_VAL_D64, HUGE_VAL_D32, or HUGE_VAL_D128 is returned (according to the return type and sign of the value), and the value of the macro ERANGE is stored in errno. If the result underflows (7.12.1), the functions return a value whose magnitude is no greater than the smallest normalized positive number in the return type; whether errno acquires the value ERANGE is implementation-defined.
252 DECIMAL_DIG, defined in <float.h>, should be sufficiently large that L and U will usually round to the same internal floating value, but if not will round to adjacent values.
note1 It is unspecified whether a minus-signed sequence is converted to a negative number directly or by negating the value resulting from converting the corresponding unsigned sequence (see F.5); the two methods may yield different results if rounding is toward positive or negative infinity. In either case, the functions honor the sign of zero if floating-point arithmetic supports signed zeros. F.5 shall be followed.
note2 An implementation may use the n-char sequence to determine extra information to be represented in the NaN's significand. No signal should be raised at the point of returning the signaling NaN.
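The nptr/endptr protocol and the error reporting described above are inherited unchanged from strtod. Since the decimal variants are not generally available in existing libraries, the sketch below uses the standard strtod as a stand-in to show the calling pattern; the wrapper parse_number and its return codes are hypothetical. A call to strtod64 would be expected to follow the same shape with a _Decimal64 result.

```c
#include <errno.h>
#include <stdlib.h>

/* Parses a leading number from s using the strtod protocol.
   Returns 1 on success, 0 if no conversion could be performed,
   and -1 if the library reported a range error through errno.
   *rest receives the pointer to the final (unparsed) string. */
int parse_number(const char *s, double *out, const char **rest)
{
    char *end;

    errno = 0;
    *out = strtod(s, &end);
    *rest = end;
    if (end == s)
        return 0;   /* empty subject sequence or not of the expected form */
    if (errno == ERANGE)
        return -1;  /* out of range; underflow reporting may vary */
    return 1;
}
```

Clearing errno before the call is necessary because the functions store ERANGE only on error and never reset it. When no conversion is performed, zero is returned and the stored pointer equals nptr.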
7.24.4.1.3 The wcstod32, wcstod64, and wcstod128 functions
Synopsis
[#1]
#include <wchar.h>
_Decimal32 wcstod32 (const wchar_t * restrict nptr, wchar_t ** restrict endptr);
_Decimal64 wcstod64 (const wchar_t * restrict nptr, wchar_t ** restrict endptr);
_Decimal128 wcstod128(const wchar_t * restrict nptr, wchar_t ** restrict endptr);
Description
Similar to 7.20.1.5 in annex A, replacing references to character with wide character where appropriate.
Insert the following to 6.3.1.8#1, after "This pattern is called the usual arithmetic conversions:"
6.3.1.8[1]
... This pattern is called the usual arithmetic conversions:
If one operand is a decimal floating type and there are no complex types in the operands:
If one operand is a decimal floating type and the other is a generic floating type, the one with a smaller value range is converted to the other.
If either operand is _Decimal128 or long double, the other operand is converted to _Decimal128.
Otherwise, if either operand is _Decimal64 or double, the other operand is converted to _Decimal64.
Otherwise, if either operand is _Decimal32, the other operand is converted to _Decimal32.
If one operand is a decimal floating type and the other is a complex type, the decimal floating type is converted to the first type in the following list that can represent the value range: float, double, long double. It is converted to long double if no type in the list can represent its value range. In either case, the complex type is converted to a type whose corresponding real type is this converted type. Usual arithmetic conversions are then applied to the converted operands.
During any of the above conversions, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is correctly rounded. If the value being converted is outside the range of values that can be represented, the result is dependent on the rounding mode. If the rounding mode is:
near, if the value being converted is less than the maximum representable value of a hypothetical representation having one more digit in the mantissa of the target type, the result is the maximum value of the target type (note 1); otherwise the absolute value of the result is one of HUGE_VAL, HUGE_VALF, HUGE_VALL, HUGE_VAL_D64, HUGE_VAL_D32 or HUGE_VAL_D128 depending on the result type and the sign is the same as the value being converted.
note 1: That is, the values that are between MAX and MAX*(1+ulp/10)
zero, the value is the most positive representable if the value being converted is positive, and the most negative number representable otherwise.
positive infinity, the value is same as zero if the value being converted is negative, and is same as near otherwise.
negative infinity, the value is same as near if the value being converted is negative, and is same as zero otherwise.
First, if the corresponding real type of either operand is long double, the other operand is converted, ... <the rest of 6.3.1.8#1 remains the same>