"We are stuck with technology when what we really want is just stuff that works." ― Douglas Adams
1. Introduction
[P2845] made it possible to format and print std::filesystem::path with
correct handling of Unicode. Unfortunately, some common path accessors still
exhibit broken behavior, which results in mojibake and data loss. This paper
proposes deprecating these accessors, making the path API more reliable and
eliminating a common source of bugs.
2. Changes since R3
- Reintroduced system_string and display_string accessors per LEWG feedback.
- Renamed system_string to system_encoded_string.
- Added LEWG poll results for R3.
3. Changes since R2
- Added SG16 poll results for R2.
4. Changes since R1
- Added SG16 poll results for R0.
5. Changes since R0
- Removed system_string and display_string per SG16 feedback, focusing just on deprecating broken accessors.
6. LEWG poll results for R3
POLL: Re-add the function system_string which was presented in P2319R0
SF | F | N | A | SA |
---|---|---|---|---|
7 | 7 | 4 | 1 | 3 |
Outcome: Consensus in favor
POLL: Re-add the function display_string (result of calling format) which
was presented in P2319R0
SF | F | N | A | SA |
---|---|---|---|---|
13 | 7 | 2 | 0 | 0 |
Outcome: Strong consensus in favor
POLL: Approve the design (re-add system_string, display_string)
presented in "P2319R3 Prevent path presentation problems" (deprecating functions
in std::filesystem::path).
SF | F | N | A | SA |
---|---|---|---|---|
10 | 8 | 2 | 1 | 2 |
Outcome: Consensus in favor
7. SG16 poll results for R2
Poll 1: P2319R2: Forward to LEWG.
SF | F | N | A | SA |
---|---|---|---|---|
2 | 6 | 0 | 0 | 0 |
Strong consensus in favor.
From https://github.com/cplusplus/papers/issues/1987#issuecomment-2482417123:
The general consensus of the group is that portable code should be written to use
std::format() to format paths or to use the native() member function and
convert appropriately to the desired encoding.
8. SG16 poll results for R0
Poll 1: P2319R0: The string() member function of std::filesystem::path
should be deprecated.
SF | F | N | A | SA |
---|---|---|---|---|
2 | 4 | 1 | 0 | 0 |
Outcome: Consensus
Poll 2: P2319R0: The proposed system_string() member function should be
added to std::filesystem::path.
SF | F | N | A | SA |
---|---|---|---|---|
0 | 2 | 4 | 1 | 0 |
Outcome: No consensus
9. Problem
Consider the following example:
    std::filesystem::path p(L"Выявы"); // Выявы is Images in Belarusian.
    std::cout << p << std::endl;
    std::cout << p.string() << std::endl;

Even if all code pages and localization settings are set to Belarusian and both the source and literal encodings are UTF-8, this still results in mojibake on Windows:

    "┬√ т√"
    ┬√ т√
Unfortunately, we cannot change the behavior of iostreams, but at least the new
facilities such as std::format and std::print correctly handle Unicode in
paths. For example:
    std::filesystem::path p(L"Выявы");
    std::print("{}\n", p);
prints
Выявы
However, the string() accessor still exhibits the broken behavior, e.g.
    std::filesystem::path p(L"Выявы");
    std::print("{}\n", p.string());
prints
�����
The reason for this is that std::filesystem::path::string() transcodes the
path into the native encoding ([fs.path.type.cvt]) defined as:
The native encoding of an ordinary character string is the operating system dependent current encoding for pathnames ([fs.class.path]).
It is neither the literal encoding nor a locale encoding, and transcoding is usually lossy, which makes it almost never what you want. For example:
    std::filesystem::path p(L"Obrázky");
    std::string s = p.string();
throws std::system_error with the message "unknown error" on the same system,
which is a terrible user experience.
The string can be passed to system-specific APIs that accept paths provided that
the system encoding hasn’t changed in the meantime. But even this use case is
limited because the transcoding is lossy, and it is better to use an equivalent
std::filesystem API or native() instead.
On Windows, the native encoding is effectively the Active Code Page (ACP), which is separate from the console code page. This is why paths often cannot be correctly displayed. Even Windows documentation ([CODE-PAGES]) cautions against using code pages:
New Windows applications should use Unicode to avoid the inconsistencies of varied code pages and for ease of localization.
Encoding bugs are even present in standard library implementations, see e.g. [LWG4087], where a path in the "native" encoding is incorrectly combined with text in literal and potentially other encodings when constructing an exception message.
Moreover, the result of std::filesystem::path::string() is affected by a
runtime setting and may work in a test environment but easily break after
deployment. This is similar to one of the problems with std::locale but worse
because in this case C++ doesn't even provide a way to set or query the
encoding. It disproportionately affects non-English C++ users, making the
language less attractive for writing internationalized and localized software.
To summarize, std::filesystem::path::string() has the following problems:
- It uses an encoding that is generally incompatible with nearly all standard text processing and I/O facilities, including iostreams, std::format and std::print.
- It is extremely error-prone, causing easy-to-miss transcoding issues that may arise after the program is deployed in a different environment or after a runtime configuration change.
- It makes writing portable code hard because the issues may not be obvious on POSIX platforms, where string() is just an inefficient equivalent of native() with an extra memory allocation and copy.
10. Proposal
The current paper proposes deprecating std::filesystem::path::string() and
providing alternatives that make the target encoding clear:
- system_encoded_string(), returning std::string in the operating system dependent current encoding for pathnames (native ordinary encoding). Like the current string(), it is lossy and only useful for passing to legacy system APIs.
- display_string(), returning std::string in the literal encoding suitable for display, e.g. formatting with std::format and printing with std::print. It is lossless if the literal encoding is UTF-8 and the path is valid Unicode, which is the case for almost all paths on Windows.
We use "system" instead of "native" because the latter is ambiguous: it can refer either to the encoding or to the format (path separators, etc.).
Similarly, generic_string(), which has the same problems, is also deprecated
with generic_system_encoded_string() and generic_display_string()
alternatives provided.
system_encoded_string() can be used to simplify bulk bug-to-bug compatible
migration of a large existing codebase. This will make problematic call sites
easy to grep and fix incrementally, prioritizing more critical parts of the
codebase where potential data loss due to transcoding is especially undesirable.
There is usually a better way to accomplish the same task with non-legacy
APIs, e.g. using the lossless std::filesystem::remove that takes a path object
instead of std::remove:

    std::filesystem::remove(p); // Lossless, portable and more efficient.
Ideally, std::remove should be deprecated, but this is out of scope of the
current paper.
For lossless display, deprecated accessors can be replaced with
display_string() or formatting using new facilities such as std::format or
std::print.
11. Wording
Modify [https://eel.is/c++draft/fs.class.path.general]:
    ...
    // [fs.path.native.obs], native format observers
    const string_type& native() const noexcept;
    const value_type* c_str() const noexcept;
    operator string_type() const;

    template<class EcharT, class traits = char_traits<EcharT>,
             class Allocator = allocator<EcharT>>
      basic_string<EcharT, traits, Allocator>
        string(const Allocator& a = Allocator()) const;
    std::string string() const;
    std::string display_string() const;
    std::string system_encoded_string() const;
    std::wstring wstring() const;
    std::u8string u8string() const;
    std::u16string u16string() const;
    std::u32string u32string() const;

    // [fs.path.generic.obs], generic format observers
    template<class EcharT, class traits = char_traits<EcharT>,
             class Allocator = allocator<EcharT>>
      basic_string<EcharT, traits, Allocator>
        generic_string(const Allocator& a = Allocator()) const;
    std::string generic_string() const;
    std::string generic_display_string() const;
    std::string generic_system_encoded_string() const;
    std::wstring generic_wstring() const;
    std::u8string generic_u8string() const;
    std::u16string generic_u16string() const;
    std::u32string generic_u32string() const;
    ...
Modify [fs.path.native.obs]:
...
    std::string string() const;
    std::string system_encoded_string() const;
    std::wstring wstring() const;
    std::u8string u8string() const;
    std::u16string u16string() const;
    std::u32string u32string() const;

Returns: native().
Remarks: Conversion, if any, is performed as specified by [fs.path.cvt].

    std::string display_string() const;

Returns: format("{}", *this).
[Note: The returned string is suitable for use with formatting ([format.functions]) and print functions ([print.fun]). — end note]
Modify [fs.path.generic.obs]:
...
    std::string generic_string() const;
    std::string generic_system_encoded_string() const;
    std::wstring generic_wstring() const;
    std::u8string generic_u8string() const;
    std::u16string generic_u16string() const;
    std::u32string generic_u32string() const;

Returns: The pathname in the generic format.
Remarks: Conversion, if any, is specified by [fs.path.cvt].

    std::string generic_display_string() const;

Returns: format("{:g}", *this).
[Note: The returned string is suitable for use with formatting ([format.functions]) and print functions ([print.fun]). — end note]
Add a new subclause in Annex D:
Deprecated filesystem path format observers [depr.fs.path.obs]
The following member is defined in addition to those
specified in [fs.path.native.obs]:

    std::string string() const;

Returns: native().
Remarks: Conversion, if any, is performed as specified by [fs.path.cvt].
The following member is defined in addition to those
specified in [fs.path.generic.obs]:

    std::string generic_string() const;

Returns: The pathname in the generic format.
Remarks: Conversion, if any, is specified by [fs.path.cvt].