Document #: | P3355R2 |
Date: | 2024-10-29 |
Project: | Programming Language C++ Library Evolution |
Reply-to: | Mark Hoemmen <mhoemmen@nvidia.com> |
Revision 0 submitted 2024-07-14
submdspan feature test macro __cpp_lib_submdspan from its current value (202403L, set by adoption of P2642R6 (“Padded mdspan layouts”) into the Working Draft for C++26).
Revision 1 to be submitted for the 2024-10-16 mailing
Remove “user-defined pair types as slices” feature. We will make this a separate paper.
Increment __cpp_lib_submdspan from its current value (202403L).
Implement suggestion during LEWG wording review to define a “wording macro” for all the slice specifier types that have unit stride. We call it unit-stride slice for M, where M is a layout mapping type. (The definition only depends on an index_type, but the wording is most natural when it depends on the layout mapping type for which submdspan-mapping-impl is a private member function.)
Revision 2 to be submitted for the next mailing
Use green and red text in Wording to make changes clearer.
Per pre-LWG feedback, change “unit-stride slice for decltype(*this)” to “unit-stride slice for mapping.”
We propose to change submdspan_mapping for the following layouts’ layout mappings: layout_left, layout_right, layout_left_padded, and layout_right_padded, so that a strided_slice slice with compile-time unit stride results in the returned mapping having the same layout as if the slice were a pair of integers. This preserves compile-time optimization information for common layouts.
This change needs to be merged into the Working Draft before C++26. Otherwise, it would be a breaking change.
Suppose that one wants to vectorize a 1-D array copy operation using mdspan and aligned_accessor (P2897). One has a copy_8_floats function that optimizes the special case of copying a contiguous array of 8 floats, where the start of the array is aligned to 8 * sizeof(float) (32) bytes. In practice, plenty of libraries exist to optimize 1-D array copy. This is just an example that simplifies the use cases for explicit 8-wide SIMD enough to show in a brief proposal.
template<class ElementType, size_t ext, size_t byte_alignment>
using aligned_array_view = mdspan<ElementType,
  extents<int, ext>, layout_right,
  aligned_accessor<ElementType, byte_alignment>>;
void
copy_8_floats(aligned_array_view<const float, 8, 32> src,
  aligned_array_view<float, 8, 32> dst)
{
  // One would instead use a hardware instruction for aligned copy,
  // or a "simd" or "unroll" pragma.
  for (int k = 0; k < 8; ++k) {
    dst[k] = src[k];
  }
}
The natural next step would be to use copy_8_floats to implement copying 1-D float arrays by the usual “strip-mining” approach.
template<class ElementType>
using array_view = mdspan<ElementType, dims<1, int>>;
void slow_copy(array_view<const float> src, array_view<float> dst)
{
  assert(src.extent(0) == dst.extent(0));
  for (int k = 0; k < src.extent(0); ++k) {
    dst[k] = src[k];
  }
}
template<size_t vector_length>
void strip_mined_copy(
  aligned_array_view<const float, dynamic_extent,
    vector_length * sizeof(float)> src,
  aligned_array_view<float, dynamic_extent,
    vector_length * sizeof(float)> dst)
{
  assert(src.extent(0) == dst.extent(0));
  assert(src.extent(0) % vector_length == 0);
  for (int beg = 0; beg < src.extent(0); beg += vector_length) {
    constexpr auto one = std::integral_constant<int, 1>{};
    constexpr auto vec_len = std::integral_constant<int, vector_length>{};
    // Using strided_slice lets the extent be a compile-time constant.
    // tuple{beg, beg + vector_length} would result in dynamic_extent.
    // The slice itself is not constexpr, because beg is a run-time value.
    auto vector_slice =
      strided_slice{.offset=beg, .extent=vec_len, .stride=one};
    // PROBLEM: Current wording makes this always layout_stride,
    // but we know that it could be layout_right.
    auto src_slice = submdspan(src, vector_slice);
    auto dst_slice = submdspan(dst, vector_slice);
    copy_8_floats(src_slice, dst_slice);
  }
}
void copy(array_view<const float> src, array_view<float> dst)
{
  assert(src.extent(0) == dst.extent(0));
  constexpr int vector_length = 8;
  // Handle possibly unaligned prefix of less than vector_length elements.
  // The lambda returns the index of the first element whose address
  // is aligned to vector_length * sizeof(float) bytes.
  auto aligned_starting_index = [](auto* ptr) {
    constexpr auto byte_alignment = vector_length * sizeof(float);
    auto ptr_value = reinterpret_cast<uintptr_t>(ptr);
    auto remainder = ptr_value % byte_alignment;
    return static_cast<int>(
      ((byte_alignment - remainder) % byte_alignment) / sizeof(float));
  };
  int src_beg = aligned_starting_index(src.data_handle());
  int dst_beg = aligned_starting_index(dst.data_handle());
  if (src_beg != dst_beg || src_beg > src.extent(0)) {
    slow_copy(src, dst);
    return;
  }
  slow_copy(submdspan(src, tuple{0, src_beg}),
    submdspan(dst, tuple{0, dst_beg}));
  // Strip-mine the aligned vector_length segments of the array.
  int src_end = src_beg + ((src.extent(0) - src_beg) / vector_length) * vector_length;
  int dst_end = dst_beg + ((dst.extent(0) - dst_beg) / vector_length) * vector_length;
  strip_mined_copy<8>(submdspan(src, tuple{src_beg, src_end}),
    submdspan(dst, tuple{dst_beg, dst_end}));
  // Handle suffix of less than vector_length elements.
  slow_copy(submdspan(src, tuple{src_end, src.extent(0)}),
    submdspan(dst, tuple{dst_end, dst.extent(0)}));
}
The strip_mined_copy function must use strided_slice to get slices of 8 elements at a time, rather than tuple. This ensures that the resulting extent is a compile-time constant 8, even though the slice starts at a run-time index beg.
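The index arithmetic behind the prefix, strip-mined body, and suffix of the copy example can be sketched independently of mdspan. This is only an illustration: the names elements_to_alignment, strip_mine_ranges, and partition_for_strip_mining are hypothetical and do not appear in the proposal, and the sketch assumes that the byte alignment is a multiple of the element size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of leading elements to skip so that
// ptr + result is aligned to byte_alignment bytes.
// Assumes byte_alignment is a multiple of sizeof(T).
template<class T>
std::size_t elements_to_alignment(const T* ptr, std::size_t byte_alignment)
{
  assert(byte_alignment % sizeof(T) == 0);
  const auto address = reinterpret_cast<std::uintptr_t>(ptr);
  const std::size_t remainder = address % byte_alignment;
  return ((byte_alignment - remainder) % byte_alignment) / sizeof(T);
}

// Hypothetical result type: [0, beg) is the unaligned prefix,
// [beg, end) is the strip-mined body, and [end, size) is the suffix.
struct strip_mine_ranges {
  std::size_t beg;
  std::size_t end;
};

// Split an array of `size` elements, given the length of its unaligned
// prefix, so that the body length is a multiple of vector_length.
inline strip_mine_ranges partition_for_strip_mining(
  std::size_t size, std::size_t prefix, std::size_t vector_length)
{
  // Clamp the prefix in case the array ends before the first aligned element.
  const std::size_t beg = prefix < size ? prefix : size;
  const std::size_t end = beg + ((size - beg) / vector_length) * vector_length;
  return {beg, end};
}
```

For an array of 21 floats whose first 32-byte-aligned element is at index 5, this gives a prefix of 5 elements, a body of 16 elements (two strips of 8), and an empty suffix.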
The current C++ Working Draft has two issues that hinder optimization of the above code.
(1) The above submdspan results always have layout_stride, even though we know that they are contiguous and thus should have layout_right.
(2) The submdspan operations in strip_mined_copy should result in aligned_accessor with 32-byte alignment, but instead give default_accessor. This is because aligned_accessor’s offset member function takes the offset as a size_t. This discards compile-time information, namely that the offset can be expressed as the product of some integer and the overalignment factor, where the overalignment factor is known at compile time.
This proposal fixes (1) for all layouts currently in the Working Draft that have a submdspan_mapping customization: layout_left, layout_right, layout_left_padded, and layout_right_padded. We can do that without breaking changes, as long as this proposal is merged before C++26 is finalized. After that, merging the proposal would be a breaking change.
This proposal does not fix (2), because that would require a breaking change to both the layout mapping requirements and the accessor requirements, and because it would complicate both of them quite a bit.
aligned_accessor::offset
Regarding (2), [mdspan.submdspan.submdspan] 6 says that submdspan(src, slices...) has effects equivalent to the following.
auto sub_map_offset = submdspan_mapping(src.mapping(), slices...);
return mdspan(src.accessor().offset(src.data(), sub_map_offset.offset),
  sub_map_offset.mapping,
  AccessorPolicy::offset_policy(src.accessor()));
The problem is AccessorPolicy::offset_policy(src.accessor()). The type offset_policy is the wrong type in this case, default_accessor<const float> instead of aligned_accessor<const float, 32>. If we want an offset with suitable compile-time alignment to have a different accessor type, then we would need at least the following changes.
The Standard Library would need a new type that represents the product of a compile-time integer (that is, an integral-constant-like type) and a “run-time” integer (an integral-not-bool type). It would need overloaded arithmetic operators that preserve this product form as much as possible. For example, 8x + 4 for a run-time integer x should result in 4y where y = 2x + 1 is a run-time integer.
At least the Standard layout mappings’ operator() would need to compute with these types and return them if possible. The layout mapping requirements would thus need to change, as currently operator() must return index_type (see [mdspan.layout.reqmts] 7).
aligned_accessor::offset would need an overload taking a type that expresses the product of a compile-time integer (of suitable alignment) and a run-time integer. The accessor requirements [mdspan.accessor.reqmts] may also need to change to permit this.
The definition of submdspan would need some way to get the accessor type corresponding to the new offset overload, instead of aligned_accessor::offset_policy (which in this case is default_accessor).
The work-around is to convert the result of submdspan by hand to use the desired accessor. In the above strip_mined_copy function, one would replace the line
copy_8_floats(src_slice, dst_slice);
with the following, which depends on aligned_accessor’s explicit constructor from default_accessor.
copy_8_floats(aligned_array_view<const float, 8, 32>{src_slice},
  aligned_array_view<float, 8, 32>{dst_slice});
Given that this work-around is easy to do, should only be needed for a few special cases, and avoids a big design change to the accessor policy requirements, we don’t propose trying to fix this issue in the C++ Working Draft.
Daisy Hollman’s original implementation of submdspan implemented strided slices in this way.
C++26 / IS.
Text in blockquotes is not proposed wording, but rather instructions for generating proposed wording.
__cpp_lib_submdspan feature test macro
In [version.syn], increase the value of the __cpp_lib_submdspan macro by replacing YYYYMML below with the integer literal encoding the appropriate year (YYYY) and month (MM).
#define __cpp_lib_submdspan YYYYMML // also in <mdspan>
submdspan_mapping results
Append the following to the end of [mdspan.sub.map.common]. Additions are shown in green text.
9 Given a layout mapping type M, a type S is a unit-stride slice for M if
Throughout [mdspan.sub], wherever the text says “Sk models index-pair-like<index_type> or is_convertible_v<Sk, full_extent_t> is true,” replace it with “Sk is a unit-stride slice for mapping.”
Additions are shown in green text and removals in red text. Apply the analogous transformation if the text says Sp or S0, but is otherwise the same. Make this set of changes in the following places.
[mdspan.sub.map.left] (1.3.2), (1.4), (1.4.1), and (1.4.3);
[mdspan.sub.map.right] (1.3.2), (1.4), (1.4.1), and (1.4.3);
[mdspan.sub.map.leftpad] (1.3.2), (1.4), (1.4.1), and (1.4.3); and
[mdspan.sub.map.rightpad] (1.3.2), (1.4), (1.4.1), and (1.4.3).
For example, here are the changes to [mdspan.sub.map.left]. The other sections have analogous changes.
Returns:
(1.1) submdspan_mapping_result{*this, 0}, if Extents::rank() == 0 is true;
(1.2) otherwise, submdspan_mapping_result{layout_left::mapping(sub_ext), offset}, if SubExtents::rank() == 0 is true;
(1.3) otherwise, submdspan_mapping_result{layout_left::mapping(sub_ext), offset}, if
(1.3.1) for each k in the range [0, SubExtents::rank() - 1), is_convertible_v<Sk, full_extent_t> is true; and
(1.3.2) for k equal to SubExtents::rank() - 1, Sk <del>models index-pair-like<index_type> or is_convertible_v<Sk, full_extent_t> is true</del><ins>is a unit-stride slice for mapping</ins>;
[Note 1: If the above conditions are true, all Sk with k larger than SubExtents::rank() - 1 are convertible to index_type. - end note]
(1.4) otherwise, submdspan_mapping_result{layout_left_padded<S_static>::mapping(sub_ext, stride(u + 1)), offset} if for a value u for which u + 1 is the smallest value p larger than zero for which Sp <del>models index-pair-like<index_type> or is_convertible_v<Sp, full_extent_t> is true</del><ins>is a unit-stride slice for mapping</ins>, the following conditions are met:
(1.4.1) S0 <del>models index-pair-like<index_type> or is_convertible_v<S0, full_extent_t> is true</del><ins>is a unit-stride slice for mapping</ins>; and
(1.4.2) for each k in the range [u + 1, u + SubExtents::rank() - 1), is_convertible_v<Sk, full_extent_t> is true; and
(1.4.3) for k equal to u + SubExtents::rank() - 1, Sk <del>models index-pair-like<index_type> or is_convertible_v<Sk, full_extent_t> is true</del><ins>is a unit-stride slice for mapping</ins>;
and where S_static is: