"Your scientists were so preoccupied with whether or not they could, they
didn’t stop to think if they should."
— Dr. Ian Malcolm
1. Introduction
P1030 (std::filesystem::path_view) is a paper with a long and troubled history,
consistently falling short of its original goals and, in some cases, even
regressing. While recent revisions have removed some of the more questionable
parts of the design, such as the use of locales, numerous critical issues remain
unresolved. This paper highlights some of these issues and argues that
standardizing path_view in its current form would not only perpetuate past
design flaws but also make future fixes nearly impossible. Additionally, it
points out the severe lack of implementation and practical usage experience with
the latest design.
2. Changes from R0
- Updated to reflect changes in P1030R7.
- Added more usage data to the Implementation and usage experience section.
- Added a benchmark to the Performance section.
3. Problems
3.1. Encoding
A significant portion of the initial revision of the paper ([P1030R0]) was
devoted to examining the issues surrounding std::filesystem::path and the use
of "ANSI encodings" (code pages - [WIN-CODEPAGE]) on Windows:
std::filesystem came originally from Boost.Filesystem, which in turn underwent three major revisions during the Boost peer review as it was such a lively debate. During those reviews, it was considered very important that paths were passed through, unmodified, to the system API. There are very good reasons for this, mainly that filesystems, for the most part, treat filenames as a bunch of bytes without interpreting them as anything. So any character reencoding could cause a path entered via copy-and-paste from the user to be unopenable, unless the bytes were passed through exactly.
This is a laudable aim, and it is preserved in this path view proposal. Unfortunately it has a most unfortunate side effect: on Microsoft Windows, std::filesystem::path when supplied with char not wchar_t, is considered to be in ANSI encoding. This is because the char accepting syscalls on Microsoft Windows consume ANSI for compatibility with Windows 3.1, and they simply thunk through to the UTF-16 accepting syscall after allocating a buffer and copying the input bytes into shorts. Therefore on Microsoft Windows, std::filesystem::path duly expands char input into its internal UTF-16 wchar_t storage via direct casting. It does not perform a UTF-8 to UTF-16 conversion.
Unfortunately any Microsoft Windows IDE or text editor that I have used recently defaults to creating C++ source files in UTF-8, exactly the same as on every other major platform including Linux and MacOS. This in turn means that source code with a char string literal such as "UTF♠stringΩliteral" makes a UTF-8 char string, not an ANSI char string, which is consistent across all the major platforms. Thus, std::filesystem::path’s behaviour on Microsoft Windows is quite surprising: your portable program will not work. What works on all the other platforms, without issue, does not work on Microsoft Windows, for no obvious reason to the uninitiated.
This author can only speak from his own personal experience, but what he has found over many years of practice in writing portable code based on std::filesystem::path is that one ends up inevitably using preprocessor macros to emit L"UTF♠stringΩliteral" when _WIN32 and _UNICODE are macro defined, and otherwise emit "UTF♠stringΩliteral". The reason is simple: the same string literal, with merely a L or not prefix, works identically on all platforms, no locale induced surprises, because we know that string literals in UTF source code will be in some UTF-x format. The side effect is spamming your ‘portable’ program code with string literal wrapper macros as if we were still writing for MFC, and/or #if defined(_WIN32) && defined(_UNICODE) all over your code. I do not find this welcome.
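The wrapper-macro workaround described in the quoted passage can be sketched roughly as follows. This is only an illustration of the pattern: the names PATH_LITERAL and sample_path are hypothetical and do not come from P1030 or any library, and the source file is assumed to be saved as UTF-8.

```cpp
#include <filesystem>

// PATH_LITERAL is a hypothetical helper name chosen for this sketch.
#if defined(_WIN32) && defined(_UNICODE)
#  define PATH_LITERAL(s) L##s  // emit a wide literal on Windows
#else
#  define PATH_LITERAL(s) s     // emit the narrow literal unchanged elsewhere
#endif

// Builds a path from the literal used in the quoted text without relying on
// the active code page to interpret narrow characters on Windows.
inline std::filesystem::path sample_path() {
  return std::filesystem::path(PATH_LITERAL("UTF♠stringΩliteral"));
}
```

Every path literal in a code base has to be wrapped this way, which is the clutter the quoted author objects to.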
R0 goes as far as to switch to UTF-8 as the default encoding for std::filesystem::path_view:
I propose that when char strings are supplied as a path string literal, and if and only if a conversion is needed, that we interpret those chars as UTF-8.
I know that this is a breaking change from std::filesystem::path, but I would argue that std::filesystem::path needs to be similarly changed. UTF-8 source code is very, very commonplace now, much more so than even a few years ago, and it is extremely likely that almost all new C++ written will be in UTF-8. So best to change std::filesystem::path appropriately, and if that is too great a breaking change, then these proposed path views are ‘fixed’ instead.
While this revision confuses source and literal encoding and presents an overly
ambitious solution, the problems described by the author are very real. In fact,
they have worsened as UTF-8 adoption has increased on Windows, particularly with
the ease of enabling UTF-8 via the /utf-8 compiler flag in MSVC.
Working with certain parts of std::filesystem::path is very error-prone for
the increasingly common case of literal encoding being UTF-8. Unfortunately,
later revisions of P1030 not only dropped any attempt to address this problem
but exacerbated it by adopting the legacy ANSI encoding throughout the API.
Worse still, this encoding has been embedded in the internal representation,
making it part of the ABI. This is a major regression compared to
std::filesystem::path, where the use of ANSI encoding is far more limited and
rightfully avoided in the internal representation.
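To make the pitfall concrete, here is a minimal sketch of the two ways a narrow string can reach std::filesystem::path. The helper names path_from_utf8 and path_from_native are hypothetical. The first always treats the bytes as UTF-8; the second uses the native narrow encoding, which on Windows is a legacy code page, so the two can disagree for non-ASCII input there.

```cpp
#include <filesystem>
#include <string>

// Interprets `bytes` as UTF-8 on every platform. With a C++20 library the
// char8_t constructor is used; otherwise the sketch falls back to
// std::filesystem::u8path, which is standard C++17 (deprecated in C++20).
inline std::filesystem::path path_from_utf8(const std::string& bytes) {
#if defined(__cpp_lib_char8_t)
  return std::filesystem::path(std::u8string(bytes.begin(), bytes.end()));
#else
  return std::filesystem::u8path(bytes);
#endif
}

// Interprets `bytes` in the native narrow encoding. On POSIX systems the
// bytes are stored unchanged; on Windows they are converted from the legacy
// code page, which corrupts UTF-8 input containing non-ASCII characters.
inline std::filesystem::path path_from_native(const std::string& bytes) {
  return std::filesystem::path(bytes);
}
```

For pure ASCII input both helpers produce the same path, which is why the problem often goes unnoticed until a non-ASCII file name appears.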
[P2319], which was recently approved by SG16 with strong support, proposes
to deprecate the most problematic (from the encoding standpoint) parts of
std::filesystem::path. [P1030R7] does the opposite and massively increases
the public API (and ABI) surface that relies on error-prone legacy code pages.
In addition to the problems described in [P1030R0], the use of ANSI encoding
makes it hard for std::filesystem::path_view to interoperate with modern
facilities such as C++20 std::format and C++23 std::print
(see Formatting and output).
Handling of transcoding errors was not specified until R7, where it was introduced in an inconsistent manner and was neither approved nor even seen by SG16.
3.2. Implementation and usage experience
[P1030R7] claims:
If you wish to use an implementation right now, a highly-conforming reference implementation of the proposed path view can be found at https://github.com/ned14/llfio/blob/master/include/llfio/v2.0/path_view.hpp.
Unfortunately, at the time of writing, important parts of the proposal are
missing from that implementation. Specifically, more than 80 new overloads (for
functions like std::filesystem::absolute to std::filesystem::weakly_canonical)
remain unimplemented. Even worse, the paper itself lacks wording for these
functions:
Wording note: The definitions for the function declared in the synopsis above are not provided at this time. All of them delegate to the overload taking a path.
Additionally, there is no implementation of a path-view-like equivalent that
was designed on-the-fly during one of the LEWG reviews. As a result, there is no
way to evaluate the effects of switching to path-view-like in these functions on
real-world user code.
As of November 2024, GitHub Search reports only 38 files using
llfio::path_view in C++ files, about half of which are in llfio itself or
its forks. This suggests that usage experience with this implementation, even in
its current form, which does not fully match the paper, is minimal. For
comparison, there are 144 thousand results for boost::filesystem::path despite
the latter being largely superseded by its standard counterpart.
The only notable open-source project that we could find that considered using
path_view is the Nix package manager ([NIX-ISSUE9205]). However,
they went with a different design that doesn’t exhibit the encoding, performance
and complexity problems of P1030.
3.3. Performance
path_view in its current form exacerbates encoding problems, but does it at
least offer performance improvements?
Unfortunately, path_view goes to great lengths to avoid providing any
performance benefits for existing users. This is achieved through obscure path-view-like overloads so that
existing C++ code would need to ‘opt in’ to using the path view overloads
This stands in stark contrast to the common use of std::string_view, which
typically allows users to avoid std::string allocations:

    void f(std::string_view s);
    f("foo"); // No allocation

    std::filesystem::file_size("/path/to/file"); // Allocates std::filesystem::path
                                                 // in P1030R7.
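The opt-in behaviour can be reproduced in isolation. The sketch below uses simplified stand-in types (owning_path, view, view_like) and a stand-in function (pick), all hypothetical, with the same kind of constraint as the path_view_like wrapper used in the benchmark later in this section. It is not the wording from P1030R7.

```cpp
#include <string>
#include <string_view>
#include <type_traits>

// Stand-in for an owning path type: constructing it copies the characters.
struct owning_path {
  std::string s;
  owning_path(const char* p) : s(p) {}
};

// Stand-in for a non-owning view type.
struct view {
  std::string_view sv;
  view(const char* p) : sv(p) {}
  view(std::string_view s) : sv(s) {}
};

// Accepts only arguments that convert to view but NOT to owning_path, so
// anything that already works with the owning overload keeps using it.
struct view_like {
  view v;
  template <typename T,
            std::enable_if_t<std::is_convertible_v<T, view> &&
                                 !std::is_convertible_v<T, owning_path>,
                             int> = 0>
  view_like(const T& p) : v(p) {}
};

enum class chosen { owning, viewing };

inline chosen pick(const owning_path&) { return chosen::owning; }
inline chosen pick(view_like) { return chosen::viewing; }
```

With these definitions, pick("foo") selects the owning overload and therefore copies the string, while pick(view("foo")) selects the viewing overload. Existing call sites that pass string literals see no change unless they are rewritten.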
Additionally, due to lazy transcoding, std::filesystem::path_view can be
slower than std::filesystem::path, which transcodes eagerly, when used
multiple times.
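The cost difference between eager and lazy conversion can be illustrated with a small counting sketch. All names here (conversion_count, widen, eager_path, lazy_view) are hypothetical, and widen is only a stand-in for real transcoding that is correct for ASCII input. The sketch models the case where a conversion is actually required on every use.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Counts how many times the stand-in conversion runs.
inline std::size_t conversion_count = 0;

// Stand-in for transcoding; widens each char and is only valid for ASCII.
inline std::wstring widen(std::string_view s) {
  ++conversion_count;
  return std::wstring(s.begin(), s.end());
}

// Converts once at construction and reuses the stored result.
struct eager_path {
  std::wstring native;
  explicit eager_path(std::string_view s) : native(widen(s)) {}
  const std::wstring& render() const { return native; }
};

// Keeps the original characters and converts again on every use.
struct lazy_view {
  std::string_view source;
  std::wstring render() const { return widen(source); }
};

// Number of conversions performed when an eager path is used `uses` times.
inline std::size_t eager_conversions(std::string_view s, int uses) {
  conversion_count = 0;
  eager_path p(s);
  for (int i = 0; i < uses; ++i) (void)p.render();
  return conversion_count;
}

// Number of conversions performed when a lazy view is used `uses` times.
inline std::size_t lazy_conversions(std::string_view s, int uses) {
  conversion_count = 0;
  lazy_view v{s};
  for (int i = 0; i < uses; ++i) (void)v.render();
  return conversion_count;
}
```

In this model the eager type converts once regardless of how often it is used, while the lazy type converts once per use.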
[P1030R7] says little about performance beyond hinting at avoiding memory allocations in some cases. Since no implementation is provided for most of the APIs, the claim is hard to evaluate. We therefore implemented a small subset of the API based on the specification, however vague, and benchmarked it:
    #include <benchmark/benchmark.h>
    #include <fmt/format.h>
    #include <llfio/llfio.hpp>

    #include <climits>       // PATH_MAX
    #include <cstdint>       // uintmax_t
    #include <cstring>       // strlen, strcpy
    #include <filesystem>
    #include <system_error>
    #include <type_traits>

    namespace llfio = llfio_v2_b1279174;

    // Declared here; the definition is not part of this listing.
    uintmax_t file_size_impl(const char* p, std::error_code* ec = nullptr);

    namespace fs {

    using llfio::path_view;

    // Accepts only arguments that convert to path_view but not to path.
    struct path_view_like {
      path_view view;
      template <typename T,
                std::enable_if_t<std::is_convertible_v<T, path_view> &&
                                     !std::is_convertible_v<T, std::filesystem::path>,
                                 int> = 0>
      path_view_like(const T& p) : view(p) {}
    };

    using std::filesystem::file_size;

    // Forwards to the path overload.
    uintmax_t file_size(path_view_like p) { return file_size(p.view.path()); }
    uintmax_t file_size(path_view_like p, std::error_code& ec) noexcept {
      return file_size(p.view.path(), ec);
    }

    // Avoids constructing std::filesystem::path.
    uintmax_t file_size_optimized(path_view_like p, std::error_code* ec = nullptr) {
      return file_size_impl(p.view.render_zero_terminated(p.view).c_str(), ec);
    }

    }  // namespace fs

    // Path-like type with a larger inline buffer.
    class fast_path {
     private:
      fmt::basic_memory_buffer<char, PATH_MAX> buf_;

     public:
      fast_path(const char* p) {
        auto len = strlen(p);
        buf_.resize(len + 1);
        strcpy(buf_.data(), p);
      }

      const char* c_str() const { return buf_.data(); }
    };

    inline uintmax_t file_size_optimized(const fast_path& p) {
      return file_size_impl(p.c_str());
    }

    const char* filename = __FILE__;

    static void path(benchmark::State& state) {
      for (auto _ : state) {
        std::filesystem::file_size(std::filesystem::path(filename));
      }
    }

    static void path_optimized(benchmark::State& state) {
      for (auto _ : state) {
        file_size_optimized(fast_path(filename));
      }
    }

    static void path_view(benchmark::State& state) {
      for (auto _ : state) {
        fs::file_size(fs::path_view(filename));
      }
    }

    static void path_view_optimized(benchmark::State& state) {
      for (auto _ : state) {
        fs::file_size_optimized(fs::path_view(filename));
      }
    }

    static void native_string(benchmark::State& state) {
      for (auto _ : state) {
        file_size_impl(filename);
      }
    }

    BENCHMARK(path);
    BENCHMARK(path_optimized);
    BENCHMARK(path_view);
    BENCHMARK(path_view_optimized);
    BENCHMARK(native_string);
    BENCHMARK_MAIN();
file_size_impl is an implementation of std::filesystem::file_size
taken from libc++.
Results on macOS compiled with Apple clang version 15.0.0 (clang-1500.3.9.4):
    Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or directory
    This does not affect benchmark measurements, only the metadata output.
    ***WARNING*** Failed to set thread affinity. Estimated CPU frequency may be incorrect.
    2024-07-07T11:42:23-07:00
    Running ./path-view-test
    Run on (8 X 24 MHz CPU s)
    CPU Caches:
      L1 Data 64 KiB
      L1 Instruction 128 KiB
      L2 Unified 4096 KiB (x8)
    Load Average: 1.94, 2.56, 2.58
    --------------------------------------------------------------
    Benchmark                    Time             CPU   Iterations
    --------------------------------------------------------------
    path                       812 ns          810 ns       717698
    path_optimized             765 ns          764 ns       910143
    path_view                  812 ns          809 ns       869576
    path_view_optimized        780 ns          779 ns       899396
    native_string              749 ns          748 ns       928123
The path_view benchmark uses the implementation suggested in the wording and, as
expected, has the same performance as the path overload it forwards to.
path_view_optimized avoids constructing std::filesystem::path and gives a minor
improvement of ~4%.
path_optimized demonstrates that an even bigger improvement can be
achieved without any of the complexity of path_view just by providing an
API-compatible version of path with a larger inline buffer. It gives a ~6%
improvement.
path_view as specified in R7 is inherently slower than a path(view)
implementation that uses a single representation because of an additional
runtime dispatch.
3.4. Formatting and output
Unlike std::filesystem::path, std::filesystem::path_view proposed by
the paper did not provide a formatter until R7, published in September 2024.
Sadly, the newly added formatter is not fully specified and even its current partial specification has major problems due to unfortunate choices in the latest design.
One issue is related to encoding. The representation of std::filesystem::path
uses a single encoding that remains constant at runtime, making it feasible
(though not trivial) to specify a good formatter. In contrast, path_view
complicates matters by using multiple representations with different encodings,
one of which can be a legacy encoding that can change at runtime. As a result,
there is no way to determine which encoding path_view was constructed with at
the time of use. This is conceptually similar to the Time of Check to Time of Use
([TOCTOU]) class of problems common in filesystem operations, which in this
case can lead to mojibake, data corruption and other problems.
Another issue is the binary representation, which is severely underspecified and may conflict with other representations, making output hard or impossible to round-trip, even within a single implementation. Writing as an author of the path formatter ([P2845]), it remains unclear from [P1030R7] how it is expected to work.
is defined in terms of path-from-binary which appears to have the
same problems.
3.5. Complexity
The proposed std::filesystem::path_view roughly doubles the API surface area
of std::filesystem, both in terms of its own definition and by proposing
to add an overload that takes path-view-like arguments for every existing
overload that takes const path&. For example:
    bool equivalent(const path& p1, const path& p2);
    bool equivalent(const path& p1, const path& p2, error_code& ec) noexcept;

    bool equivalent(path-view-like p1, path-view-like p2);
    bool equivalent(path-view-like p1, path-view-like p2, error_code& ec) noexcept;
Contrary to its name, the proposed std::filesystem::path_view is not truly a
view of std::filesystem::path in the same way that std::string_view can be
considered a view of std::string. path has a single representation that is
suitable for the current system. In contrast, path_view is effectively a
discriminated union of some (but not all) of the types from which path can be
constructed, with a lazy conversion to path. To further complicate things,
path_view is also constructible from inputs that path is not constructible from.
It is unclear what such an unusual API should be called, but it probably should
not be referred to as a "view."
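A rough sketch of what "discriminated union" means here, using std::variant. The type sketch_view and its members are hypothetical, and the set of alternatives is a subset chosen for illustration rather than the representation specified in P1030R7. Every operation has to dispatch at runtime on which alternative is active.

```cpp
#include <cstddef>
#include <string_view>
#include <variant>

// Hypothetical view-like type that remembers which kind of input it was
// constructed from instead of holding a single native representation.
struct sketch_view {
  std::variant<std::string_view, std::wstring_view, std::u16string_view> rep;

  // Number of code units in the active alternative; requires a runtime
  // dispatch via std::visit.
  std::size_t code_units() const {
    return std::visit([](auto sv) -> std::size_t { return sv.size(); }, rep);
  }

  // Label for the active alternative. For the narrow case the actual
  // encoding is not recorded anywhere in the object.
  std::string_view encoding() const {
    switch (rep.index()) {
      case 0: return "narrow";
      case 1: return "wide";
      default: return "UTF-16";
    }
  }
};
```

A type with a single representation, such as std::string_view over std::string, needs neither the dispatch nor the bookkeeping of the active alternative.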
3.6. Conclusion
In summary, the proposed std::filesystem::path_view presents significant
concerns that need to be resolved before standardization. Its design exacerbates
encoding problems and adds unnecessary complexity to the API. The reliance on
legacy code pages undermines modern practices and complicates
interoperability with other C++ facilities.
Additionally, the increased API surface area and the requirement for users to
opt in to specific overloads detract from its usability. To maximize the utility
of path_view, future revisions should focus on simplifying its design,
addressing encoding issues, enhancing compatibility with existing libraries and
getting actual implementation and usage experience. Standardizing the current
proposal risks introducing more problems than it solves.