While reviewing the C++17 CD and preparing my CppCon talk, I found some issues
revolving around transform_reduce()
and the parallel form of inner_product()
.
1. inner_product()
Cannot Be Parallelized
The parallel version of inner_product()
(e.g. ExecutionPolicy
overload) is
not useful because the wording limits parallelization. 26.8.5 [inner.product]
p1 of the C++17 CD states that inner_product()
:
computes its result by initializing the accumulatoracc
with the initial valueinit
and then modifying it withacc = acc + (*i1) * (*i2)
oracc = binary_op1(acc, binary_op2(*i1,*i2))
Similar to accumulate()
, while an inner_product()
operation can be
parallelized in principle, the wording for inner_product()
prevents
parallelization for reduction operations that are noncommutative and nonassociative,
like floating point addition. For such operations, the wording forces each
iteration to depend on the prior iteration to ensure the correct ordering of
accumulation. This introduces a loopcarried dependency that makes it impossible
to parallelize.
2. inner_product()
is a Form of transform_reduce()
Just as reduce()
is the parallel counterpart of accumulate()
(same basic
operation but without a specific ordering), inner_product()
has a natural
counterpart. Consider this possible implementation of inner_product()
:
template <class InputIt1, class InputIt2, class T, class ReductionOperation, class BinaryTransformOperation> T inner_product(InputIt1 first1, InputIt1 last1, InputIt2 first2, T init, ReductionOperation reduce_op BinaryTransformOperation transform_op) { while (first1 != last1) { init = reduce_op(init, transform_op(*first1, *first2)); ++first1; ++first2; } return value; }
The application of transform_op
to the sequence is a binary transform()
:
template <class InputIt1, class InputIt2, class OutputIt, class BinaryTransformOperation> OutputIt transform(InputIt1 first1, InputIt1 last1, InputIt2 first2, OutputIt o_first, BinaryTransformOperation transform_op) { while (first1 != last1) { *o_first++ = transform_op(*first1++, *first2++); } return o_first; }
And the application of reduce_op
accumulates the transformed values:
template <class InputIt, class T, class ReductionOperation> T accumulate(InputIt first, InputIt last, T init, ReductionOperation reduce_op) { for (; first != last; ++first) { init = reduce_op(init, *first); } return init; }
So, inner_product()
is an
accumulate()
sreduce()
s
of a transform()
ed
sequence. It’s a transform_reduce()
!.
3. transform_reduce()
is Missing an Overload
However, transform_reduce()
is missing an overload for a binary transform;
only overloads with unary transform operations are currently specified by the
C++17 CD. I call this missing overload binarybinary transform_reduce()
, since both the reduction and transformation
operation are binary operations.
This overload, which is equivalent to parallel inner_product()
with weaker
ordering, is very useful. Many of the typical examples of transform_reduce()
usage that I’ve seen use tricks to perform a binarybinary transform_reduce()
using the unarybinary transform_reduce()
that is in the C++17 CD.
The typical transform_reduce()
dot product example (similar to what is
found in the original transform_reduce()
proposal [N3960]) looks like this:
std::vector<std::tuple<double, double> > XY = // ... double dot_product = std::transform_reduce( std::par_unseq, // Input sequence. XY.begin(), XY.end(), // Unary transformation operation. [](std::tuple<double, double> const& xy) { // Array of struct means this is tricky to execute on vector hardware: // // memory layout: X[0] Y[0] X[1] Y[1] X[2] Y[2] X[3] Y[3] ... // // op #0: load a pack of Xs (they aren’t contiguous; the load will be // strided, may need to access multiple // cache lines and may be harder for the // hardware prefetcher to handle) // op #1: load a pack of Ys (same as above) // op #2: multiply the pack of Xs by vector of Ys return std::get<0>(xy) * std::get(xy); }, double(0.0), // Initial value for reduction. // Binary reduction operation. std::plus<double>() );
Note that this is arrayofstructs NOT structofarrays. The HPX and THRUST dot product examples use iterator tricks (zip iterators or counting iterators) to switch to a structofarrays scheme. The zip iterator trick is shown below, using Boost:
std::vector<double> X = // ... std::vector<double> Y = // ... double dot_product = std::transform_reduce( std::par_unseq, // Input sequence. boost::make_zip_iterator(X.begin(), Y.begin()), boost::make_zip_iterator(X.end(), Y.end()), // Unary transformation operation. [](auto&& xy) // std::tuple<double, double>esque { // Struct of arrays means this is easier to execute on vector hardware: // // memory layout: X[0] X[1] X[2] X[3] ... Y[0] Y[1] Y[2] Y[3] ... // // op #0: load a pack of Xs (elements are contiguous, load will access // only one cache line and the hardware // prefetcher will easily track the memory // stream) // op #1: load a pack of Ys (same as above) // op #2: multiply the pack of Xs by the pack of Ys return std::get<0>(xy) * std::get<1>(xy); }, double(0.0), // Initial value for reduction. // Binary reduction operation. std::plus<double>() );
More examples of this pattern can be found in HPX (zip iterator example and counting iterator example) and THRUST (zip iterator example).
4. transform_reduce()
Parameter Order is Odd
The missing binarybinary transform_reduce()
would look like this:
template <class InputIt1, class InputIt2, class BinaryTransformOperation, class T, class ReductionOperation> // Always binary. T transform_reduce(InputIt1 first1, InputIt1 last1, InputIt2 first2, BinaryTransformOperation transform_op, T init, ReductionOperation reduce_op);
Note that the order of parameters, which is consistent with the other transform_reduce()
overloads, is inconsistent with inner_product()
, transform()
and reduce()
:
template <class InputIt1, class InputIt2, class T, class ReductionOperation, class BinaryTransformOperation> T inner_product(InputIt1 first1, InputIt1 last1, InputIt2 first2, T init, ReductionOperation reduce_op BinaryTransformOperation transform_op) template <class InputIt1, class InputIt2, class OutputIt, class BinaryTransformOperation> OutputIt transform(InputIt1 first1, InputIt1 last1, InputIt2 first2, OutputIt o_first, BinaryTransformOperation transform_op); template <class InputIt, class T, class ReductionOperation> T accumulate(InputIt first, InputIt last, T init, ReductionOperation reduce_op);
The inconsistencies are:

In
inner_product()
,accumulate()
andreduce()
(which is not shown but has the same interface asaccumulate()
), the initial valueinit
parameter comes after the iterator parameters and before the operation parameters. Intransform_reduce()
, it is in between the transformation operation parameter and the reduction operation parameter. 
In
inner_product()
, the reduction operation parameter comes before the transformation operation parameter. 
inner_product()
,accumulate()
andreduce()
have overloads which do not take any operations and useoperator+
for reduction andoperator*
for transformation.
5. Proposed Resolutions
I filed a US national body ballot comment about the issue with the parallel inner_product()
overload. The following solutions would resolve that comment.
5.1. Resolution 1: Rename inner_product()
to transform_reduce()
and Fix transform_reduce()
Rename the useless parallel inner_product()
with a binarybinary transform_reduce()
, and we should change the transform_reduce()
interface
to have the same parameter order as inner_product()
, accumulate()
, reduce()
and transform()
.
Specifically, we should:

Rename the
ExecutionPolicy
overload forinner_product()
totransform_reduce()
. 
Add a new serial binarybinary
transform_reduce()
overload. 
Add two new binarybinary
transform_reduce()
overloads (one serial overload and one parallelExecutionPolicy
overload). 
Change the parameter order for all
transform_reduce()
overloads so that they are consistent withinner_product()
,accumulate()
,reduce()
andtransform()
:
The initial value parameter comes after the iterator parameters and before the operation parameters.

The reduction operation parameter comes before the transformation operation parameters.


Add
transform_reduce()
overloads that do not take operation parameters and useoperator+
for reduction andoperator*
for transformation.
With these changes, a parallel dot product over structofarrays data can be written as:
std::vector<double> X = // ... std::vector<double> Y = // ... double dot_product = std::transform_reduce(std::par_unseq, x.begin(), x.end(), y.begin(), double(0.0));
A parallel word count could be written with binarybinary transform_reduce()
as:
bool is_word_beginning(char left, char right) { return std::isspace(left) && !std::isspace(right); } std::size_t word_count(std::string_view s) { if (s.empty()) return 0; // Count the number of characters that start a new word. return std::transform_reduce( std::par_unseq, // "Left" input sequence: s[0], s[1], ..., s[s.size()  2] s.begin(), s.end()  1, // "Right" input sequence: s[1], s[2], ..., s[s.size()  1] s.begin() + 1, // Initial value for reduction: if the first character // is not a space, then it’s the beginning of a word. std::size_t(!std::isspace(s.front()) ? 1 : 0), // Binary transformation operation. std::plus<std::size_t>(), // Binary transformation operation: Return 1 when we hit // a new word. is_word_beginning ); }
5.2. Resolution 2: Weaken inner_product()
Alternatively, we could weaken ordering required by inner_product()
, which
would make it equivalent to binarybinary transform_reduce()
. This approach
is not backwards compatible; implementations could
rewrite their serial inner_product()
in a way that uses a different ordering
than the one previously required, breaking user code. Additionally, different
implementations might provide different orderings, which could cause portably
issues.
However, this would still leave transform_reduce()
with an interface that is
inconsistent with other algorithms that implement very similar behavior. If we
ship transform_reduce()
with a broken interface, it will be difficult to fix
it later because we would have to break users code.
5.3. Resolution 3: Weaken inner_product()
and Fix transform_reduce()
Adopt Resolution 2, and also the parts of Resolution 1 that fix the
inconsistencies in transform_reduce()
:

Change the parameter order for all
transform_reduce()
overloads so that they are consistent withinner_product()
,accumulate()
,reduce()
andtransform()
:
The initial value parameter comes after the iterator parameters and before the operation parameters.

The reduction operation parameter comes before the transformation operation parameters.


Add
transform_reduce()
overloads that do not take operation parameters and useoperator+
for reduction andoperator*
for transformation.
5.4. Resolution 4: Remove Parallel inner_product()
The simplest approach, and least desirable, is to simply remove the parallel inner_product()
overloads. This removes useful functionality which was
intended to go into the standard (e.g. binary transform_reduce()
).
Additionally, it would leave transform_reduce()
s interface inconsistent with
the other algorithms, and it will be difficult to fix that interface in the
future because it would be a breaking change.
6. Acknowledgement
Thanks to:

Agustín Bergé for helping identify this issue.

Hartmut Kaiser, JF Bastien, Michael Garland and Jared Hoberock for providing feedback.