1. Overview
US 161 and US 184 identify issues relating to the ExecutionPolicy
(e.g.
parallel) signatures of the four preC++17 sequence algorithms (accumulate
, inner_product
, partial_sum
and adjacent_difference
) found in <numeric>
.
All of these issues revolve around the strength of the ordering requirements of
these algorithms and the new <numeric>
algorithms with
weaker ordering guarantees that were introduced in the Parallelism TS v1 and
C++17 (reduce
, transform_reduce
, inclusive_scan
, exclusive_scan
, transform_inclusive_scan
and transform_exclusive_scan
).
This paper proposes two types of changes:

Addition/removal of
<numeric>
sequence algorithm signatures. 
Changes to the order of parameters of
<numeric>
sequence algorithm signatures.
1.1. inner_product
Cannot Be Parallelized
Currently, accumulate
, inner_product
and partial_sum
use
"accumulatorstyle" wording:
26.8.2 Accumulate [accumulate]:template <class InputIterator, class T> T accumulate(InputIterator first, InputIterator last, T init); template <class InputIterator, class T, class BinaryOperation> T accumulate(InputIterator first, InputIterator last, T init, BinaryOperation binary_op);1 Effects: Computes its result by initializing the accumulator
acc
with the initial valueinit
and then modifies it withacc = acc + *i
oracc = binary_op(acc, *i)
for every iteratori
in the range[first, last)
in order.2 Requires:
T
shall meet the requirements ofCopyConstructible
(Table 22) andCopyAssignable
(Table 24) types. In the range[first, last]
,binary_op
shall neither modify elements nor invalidate iterators or subranges.
26.8.5 Inner product [inner.product]:template <class InputIterator1, class InputIterator2, class T> T inner_product(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init); template <class InputIterator1, class InputIterator2, class T, class BinaryOperation1, class BinaryOperation2> T inner_product(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init, BinaryOperation1 binary_op1, BinaryOperation2 binary_op2);1 Effects: Computes its result by initializing the accumulator
acc
with the initial valueinit
and then modifying it withacc = acc + (*i1) * (*i2)
oracc = binary_op1(acc, binary_op2(*i1, *i2))
for every iteratori1
in the range[first1, last1)
and iteratori2
in the range[first2, first2 + (last1  first1))
in order.2 Requires:
T
shall meet the requirements ofCopyConstructible
(Table 22) andCopyAssignable
(Table 24) types. In the ranges[first1, last1]
and[first2, first2 + (last1  first1)]
binary_op1
andbinary_op2
shall neither modify elements nor invalidate iterators or subranges.
26.8.6 Partial sum [partial.sum]template <class InputIterator, class OutputIterator> OutputIterator partial_sum(InputIterator first, InputIterator last, OutputIterator result); template <class InputIterator, class OutputIterator, class BinaryOperation> OutputIterator partial_sum(InputIterator first, InputIterator last, OutputIterator result, BinaryOperation binary_op);1 Effects: For a nonempty range, the function creates an accumulator
acc
whose type isInputIterator
's value type, initializes it with*first
, and assigns the result to*result
. For every iteratori
in[first + 1, last)
in order,acc
is then modified byacc = acc + *i
oracc = binary_op(acc, *i)
and the result is assigned to*(result + (i  first))
.2 Returns:
result + (last  first)
.3 Complexity: Exactly
(last  first)  1
applications of the binary operation.4 Requires:
InputIterator
's value type shall be constructible from the type of*first
. The result of the expressionacc + *i
orbinary_op(acc, *i)
shall be implicitly convertible toInputIterator
's value type.acc
shall be writable (24.2.1) to theresult
output iterator. In the ranges[first, last]
and[result, result + (last  first)]
binary_op
shall neither modify elements nor invalidate iterators or subranges.5 Remarks:
result
may be equal tofirst
.
This wording specifies that these algorithms perform the following operations:

Create an accumulator
acc
which is initialized withinit
. 
For (the) iterator(s) in the range(s) in order, modify
acc
by applying an update operation which takesacc
and the dereferenced value(s) of the iterator(s) as arguments.
The ordering requirement is necessary to ensure that these algorithms are welldefined for nonassociative and/or noncommutative operations, such as floatingpoint addition and multiplication (nonassociative and commutative), and subtraction (nonassociative and noncommutative).
Currently, the wording forces each iteration to depend on the prior iteration to ensure the correct ordering of accumulation. This introduces a loopcarried dependency that makes it impossible to parallelize.
To parallelize accumulate
and inner_product
we need to be able to
reorder applications of the operations, partition the workload into subtasks
and then combine the results of the subtasks together using the operator.
This, however, would give a nondeterministic result for nonassociative or
noncommutative operations.
For partial_sum
, reordering is not possible due to the nature of the
algorithm. However, partitioning is possible. Thus, you get a nondeterministic
result for nonassociative operations but it is fine for noncommutative
operations.
To allow parallelization of accumulate
and partial_sum
without relaxing
the existing ordering, new algorithms with weaker ordering were introduced
instead of adding ExecutionPolicy
signatures of the existing ones: reduce
(a weaker accumulate
) and inclusive_scan
(a weaker partial_sum
).
However, for inner_product
, no new algorithm with weaker ordering was
introduced and an ExecutionPolicy
signature was added instead. US 184
points out that this signature is not useful as it cannot be parallelized due
to the ordering constraints.
1.2. inner_product
is a Form of transform_reduce
Just as reduce
is the parallelizable counterpart of accumulate
(same basic
operation but without a specific ordering), inner_product
has a natural
counterpart. Consider this possible implementation of inner_product
:
template <class InputIt1, class InputIt2, class T, class ReductionOperation, class BinaryTransformOperation> T inner_product(InputIt1 first1, InputIt1 last1, InputIt2 first2, T init, ReductionOperation reduce_op, BinaryTransformOperation transform_op) { while (first1 != last1) { init = reduce_op(init, transform_op(*first1, *first2)); ++first1; ++first2; } return value; }
The application of transform_op
to the sequence is a binary transform
:
template <class InputIt1, class InputIt2, class OutputIt, class BinaryTransformOperation> OutputIt transform(InputIt1 first1, InputIt1 last1, InputIt2 first2, OutputIt o_first, BinaryTransformOperation transform_op) { while (first1 != last1) { *o_first++ = transform_op(*first1++, *first2++); } return o_first; }
And the application of reduce_op
accumulates the transformed values:
template <class InputIt, class T, class ReductionOperation> T accumulate(InputIt first, InputIt last, T init, ReductionOperation reduce_op) { for (; first != last; ++first) { init = reduce_op(init, *first); } return init; }
So, inner_product
is an
accumulate
sreduce
s
of a transform
ed sequence. It’s a transform_reduce
!.
1.3. transform_reduce
is Missing an Overload
However, transform_reduce
is missing an signature for a binary transform;
only signatures with unary transform operations are currently specified by the
C++17 CD. I call this missing signature binarybinary transform_reduce
, since both the reduction and transformation
operation are binary operations.
This signature, which is equivalent to parallel inner_product
with weaker
ordering, is very useful. Many of the typical examples of transform_reduce
usage that I’ve seen use tricks to perform a binarybinary transform_reduce
using the unarybinary transform_reduce
that is in the C++17 CD.
The typical transform_reduce
dot product example (similar to what is
found in the original transform_reduce
proposal [N3960]) looks like this:
std::vector<std::tuple<double, double> > XY = // ... double dot_product = std::transform_reduce( std::par_unseq, // Input sequence. XY.begin(), XY.end(), // Unary transformation operation. [](std::tuple<double, double> const& xy) { // Array of struct means this is tricky to execute on vector hardware: // // memory layout: X[0] Y[0] X[1] Y[1] X[2] Y[2] X[3] Y[3] ... // // op #0: load a pack of Xs (they aren’t contiguous; the load will be // strided, may need to access multiple // cache lines and may be harder for the // hardware prefetcher to handle) // op #1: load a pack of Ys (same as above) // op #2: multiply the pack of Xs by vector of Ys return std::get<0>(xy) * std::get(xy); }, double(0.0), // Initial value for reduction. // Binary reduction operation. std::plus<double>() );
Note that this is arrayofstructs NOT structofarrays. The HPX and THRUST dot product examples use iterator tricks (zip iterators or counting iterators) to switch to a structofarrays scheme. The zip iterator trick is shown below, using Boost:
std::vector<double> X = // ... std::vector<double> Y = // ... double dot_product = std::transform_reduce( std::par_unseq, // Input sequence. boost::make_zip_iterator(X.begin(), Y.begin()), boost::make_zip_iterator(X.end(), Y.end()), // Unary transformation operation. [](auto&& xy) // std::tuple<double, double>esque { // Struct of arrays means this is easier to execute on vector hardware: // // memory layout: X[0] X[1] X[2] X[3] ... Y[0] Y[1] Y[2] Y[3] ... // // op #0: load a pack of Xs (elements are contiguous, load will access // only one cache line and the hardware // prefetcher will easily track the memory // stream) // op #1: load a pack of Ys (same as above) // op #2: multiply the pack of Xs by the pack of Ys return std::get<0>(xy) * std::get<1>(xy); }, double(0.0), // Initial value for reduction. // Binary reduction operation. std::plus<double>() );
More examples of this pattern can be found in HPX (zip iterator example and counting iterator example) and THRUST (zip iterator example).
1.4. transform_reduce
Parameter Order is Odd
The missing binarybinary transform_reduce
would look like this:
template <class InputIt1, class InputIt2, class BinaryTransformOperation, class T, class ReductionOperation> // Always binary. T transform_reduce(InputIt1 first1, InputIt1 last1, InputIt2 first2, BinaryTransformOperation transform_op, T init, ReductionOperation reduce_op);
Note that the order of parameters, which is consistent with the other transform_reduce
signatures, is inconsistent with inner_product
, transform
and accumulate
:
template <class InputIt1, class UnaryTransformOperation, class T, class ReductionOperation> // Always binary. T transform_reduce(InputIt1 first1, InputIt1 last1, UnaryTransformOperation transform_op, T init, ReductionOperation reduce_op); template <class InputIt1, class InputIt2, class T> T inner_product(InputIt1 first1, InputIt1 last1, InputIt2 first2, T init); template <class InputIt1, class InputIt2, class T, class ReductionOperation, // Always binary. class BinaryTransformOperation> T inner_product(InputIt1 first1, InputIt1 last1, InputIt2 first2, T init, ReductionOperation reduce_op, BinaryTransformOperation transform_op); template <class InputIt1, class InputIt2, class OutputIt, class BinaryTransformOperation> OutputIt transform(InputIt1 first1, InputIt1 last1, InputIt2 first2, OutputIt o_first, BinaryTransformOperation transform_op); template <class InputIt1, class T, class ReductionOperation> T accumulate(InputIt1 first1, InputIt1 last1, T init, ReductionOperation reduce_op);
The inconsistencies are:

In
inner_product
,accumulate
andreduce
(which is not shown but has the same interface asaccumulate
), theinit
parameter comes after the iterator parameters and before the operation parameters. Intransform_reduce
, it is in between the transformation operation parameter and the reduction operation parameter. 
In
inner_product
, the reduction operation parameter comes before the transformation operation parameter. Intransform_reduce
, the reduction operation parameter comes before the transformation operation parameter.
1.5. transform_inclusive_scan
and transform_exclusive_scan
Parameter Order is Odd
Like transform_reduce
, the new transform_inclusive_scan
and transform_exclusive_scan
algorithms have parameter orders that are
inconsistent with existing algorithms that they are closely related to, like accumulate
, partial_sum
and inner_product
. The specific issues:

In
exclusive_scan
, theinit
parameter comes after the iterator parameters and before the operation parameters. Intransform_exclusive_scan
, theinit
parameter is after the transformation operation parameter and the reduction operation parameter. 
In
inclusive_scan
, the reduction operation parameter comes before the transformation operation parameter. Intransform_inclusive_scan
andtransform_exclusive_scan
, the reduction operation parameter comes before the transformation operation parameter.
2. Proposed Resolution
The following changes would resolve the issues with the <numeric>
raised by US 161 and US 184:

Remove the
ExecutionPolicy
signature forinner_product
. 
Add binarybinary
transform_reduce
signatures (four signatures: with and withoutExecutionPolicy
parameters, and with and without defaulted operations). 
Change the parameter order for all
transform_reduce
signatures so that:
The initial value parameter comes after the iterator parameters and before the operation parameters.

The reduction operation parameter comes before the transformation operation parameters.


Change the parameter order for
transform_*_scan
signatures so that:
For
transform_exclusive_scan
, the initial value parameter comes after the iterator parameters and before the operation parameters. 
For
transform_inclusive_scan
andtransform_exclusive_scan
, the reduction operation parameter comes before the transformation operation parameters.

With these changes, a parallel dot product over structofarrays data can be written as:
std::vector<double> X = // ... std::vector<double> Y = // ... double dot_product = std::transform_reduce(std::par_unseq, x.begin(), x.end(), y.begin(), double(0.0));
A parallel word count could be written with binarybinary transform_reduce
as:
bool is_word_beginning(char left, char right) { return std::isspace(left) && !std::isspace(right); } std::size_t word_count(std::string_view s) { if (s.empty()) return 0; // Count the number of characters that start a new word. return std::transform_reduce( std::par_unseq, // "Left" input sequence: s[0], s[1], ..., s[s.size()  2] s.begin(), s.end()  1, // "Right" input sequence: s[1], s[2], ..., s[s.size()  1] s.begin() + 1, // Initial value for reduction: if the first character // is not a space, then it’s the beginning of a word. std::size_t(!std::isspace(s.front()) ? 1 : 0), // Binary transformation operation. std::plus<std::size_t>(), // Binary transformation operation: Return 1 when we hit // a new word. is_word_beginning ); }
3. FAQ
Q: Why does this need to be fixed now?

inner_product
currently has ordering guarantees which prevent parallelization. If we fix them in the future, it will have to be a breaking change. 
The order of parameters in some of the new algorithms from the Parallelism TS v1 is inconsistent. If we fix them in the future, it will have to be a breaking change.
Q: The initial value parameter comes before the operation parameters in exclusive_scan
, but after the operation parameters in inclusive_scan
. Isn’t this also inconsistent?
Initially, I thought this was inconsistent. However, I’ve since learned from
LWG that this was done intentionally. The goal is to keep parameters fixed
relative to the other signatures for that algorithm. For example, since inclusive_scan
does not require an initial value parameter, it would be odd
to change from:
OutputIt inclusive_scan(InputIt, InputIt, OutputIt); OutputIt inclusive_scan(InputIt, InputIt, OutputIt, BinaryOperation); OutputIt inclusive_scan(InputIt, InputIt, OutputIt, BinaryOperation, T init);
to this:
OutputIt inclusive_scan(InputIt, InputIt, OutputIt); OutputIt inclusive_scan(InputIt, InputIt, OutputIt, BinaryOperation); OutputIt inclusive_scan(InputIt, InputIt, OutputIt, T init, BinaryOperation);
because it makes the order of BinaryOperation
inconsistent.
Q: Why does this paper not address the parallel signature of adjacent_difference
?
adjacent_difference
uses accumulator style wording and lives in
<numeric>, but it is not truly an accumulation, and has none of the
associativity and commutativity concerns of the other <numeric>
algorithms. SG1 preferred to follow the approach taken with for_each
and
relax the ordering requirement for the parallel signature of adjacent_difference
instead of adding a new algorithm with weaker guarantees. [P0574r1] contains the necessary changes for adjacent_difference
.
4. Wording
The following changes are relative to the latest working paper as of 03022017 ([N4640]) with the changes from [P0467r2] applied.
The � character is used to denote a placeholder section number which the editor shall choose.
4.1. Wording for transform_reduce
/inner_product
Apply the following changes to the declarations of transform_reduce
in the <numeric>
header synopsis in 26.8.1 [numerics.ops.overview]:
template <class InputIterator1, class InputIterator2, class T> T transform_reduce(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init); template <class InputIterator1, class InputIterator2, class T, class BinaryOperation1, class BinaryOperation2> T transform_reduce(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init, BinaryOperation1 binary_op1, BinaryOperation2 binary_op2); template <class InputIterator,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> T transform_reduce(InputIterator first, InputIterator last,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2, class T> T transform_reduce(ExecutionPolicy&& exec, // see 25.2.5 ForwardIterator1 first1, ForwardIterator1 last1, ForwardIterator2 first2, T init); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2, class T, class BinaryOperation1, class BinaryOperation2> T transform_reduce(ExecutionPolicy&& exec, // see 25.2.5 ForwardIterator1 first1, ForwardIterator1 last1, ForwardIterator2 first2, T init, BinaryOperation1 binary_op1, BinaryOperation2 binary_op2); template <class ExecutionPolicy, class ForwardIterator,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> T transform_reduce(ExecutionPolicy&& exec, // see 25.2.5 ForwardIterator first, ForwardIterator last,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op); template <class InputIterator1, class InputIterator2, class T> T inner_product(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init); template <class InputIterator1, class InputIterator2, class T, class BinaryOperation1, class BinaryOperation2> T inner_product(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init, BinaryOperation1 binary_op1, BinaryOperation2 binary_op2);template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class T>T inner_product(ExecutionPolicy&& exec, // see 25.2.5ForwardIterator1 first1, ForwardIterator1 last1,ForwardIterator2 first2, T init);template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class T, class BinaryOperation1, class BinaryOperation2>T inner_product(ExecutionPolicy&& exec, // see 25.2.5ForwardIterator1 first1, ForwardIterator1 last1,ForwardIterator2 first2, T init,BinaryOperation1 binary_op1,BinaryOperation2 binary_op2);
Apply the following changes to the definitions of transform_reduce
in 26.8.4 [transform.reduce]:
26.8.4 Transform reduce [transform.reduce]:template <class InputIterator1, class InputIterator2, class T> T transform_reduce(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2, class T> T transform_reduce(ExecutionPolicy&& exec, ForwardIterator1 first1, ForwardIterator1 last1, ForwardIterator2 first2, T init);� Effects: Equivalent to:return transform_reduce(first1, last1, first2, init, plus<>(), multiplies<>());
template <class InputIterator1, class InputIterator2, class T, class BinaryOperation1, class BinaryOperation2> T transform_reduce(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, T init, BinaryOperation1 binary_op1, BinaryOperation2 binary_op2); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2, class T, class BinaryOperation1, class BinaryOperation2> T transform_reduce(ExecutionPolicy&& exec, ForwardIterator1 first1, ForwardIterator1 last1, ForwardIterator2 first2, T init, BinaryOperation1 binary_op1, BinaryOperation2 binary_op2);� Returns:GENERALIZED_SUM(binary_op1, init, binary_op2(*i, *(first2 + (i  first1))), ...)
for every iteratori
in[first1, last1)
.� Requires: Neither
binary_op1
norbinary_op2
shall invalidate subranges, or modify elements in the ranges[first1, last1)
and[first2, first2 + (last1  first1))
.� Complexity:
O(last1  first1)
applications each ofbinary_op1
andbinary_op2
.template <class InputIterator,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> T transform_reduce(InputIterator first, InputIterator last,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op); template <class ExecutionPolicy, class ForwardIterator,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> T transform_reduce(ExecutionPolicy&& exec, ForwardIterator first, ForwardIterator last,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op);1 Returns:
GENERALIZED_SUM(binary_op, init, unary_op(*i), ...)
for every iteratori
in[first, last)
.2 Requires: Neither
unary_op
norbinary_op
shall invalidate subranges, or modify elements in the range[first, last)
.3 Complexity:
O(last  first)
applications each ofunary_op
andbinary_op
.4 Notes:
transform_reduce
does not applyunary_op
toinit
.
Direction to the Editor: Reorder the declarations and definitions of inner_product
to come before the declarations of transform_reduce()
in the <numeric>
header synopsis and class definitions in 26.8 to
maintain consistency with accumulate
/reduce
and partial_sum
/inclusive_scan
.
4.2. Wording for transform_exclusive_scan
Apply the following changes to the declarations of transform_exclusive_scan
in the <numeric>
header synopsis in 26.8.1
[numerics.ops.overview]:
template <class InputIterator, class OutputIterator,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> OutputIterator transform_exclusive_scan(InputIterator first, InputIterator last, OutputIterator result,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> ForwardIterator2 transform_exclusive_scan(ExecutionPolicy&& exec, // see 25.2.5 ForwardIterator1 first, ForwardIterator1 last, ForwardIterator2 result,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op);
Apply the following changes to the definitions of transform_exclusive_scan
in 26.8.9 [transform.exclusive.scan]:
template <class InputIterator, class OutputIterator,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> OutputIterator transform_exclusive_scan(InputIterator first, InputIterator last, OutputIterator result,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class UnaryOperation,class T, class BinaryOperation, class UnaryOperation> ForwardIterator2 transform_exclusive_scan(ExecutionPolicy&& exec, ForwardIterator1 first, ForwardIterator1 last, ForwardIterator2 result,UnaryOperation unary_op,T init, BinaryOperation binary_op, UnaryOperation unary_op);
4.3. Wording for transform_inclusive_scan
Apply the following changes to the declarations of transform_inclusive_scan
in the <numeric>
header synopsis in 26.8.1
[numerics.ops.overview]:
template <class InputIterator, class OutputIterator,class UnaryOperation,class BinaryOperation, class UnaryOperation> OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last, OutputIterator result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op); template <class InputIterator, class OutputIterator,class UnaryOperation,class BinaryOperation, class UnaryOperation, class T> OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last, OutputIterator result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op, T init); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class UnaryOperation,class BinaryOperation, class UnaryOperation> ForwardIterator2 transform_inclusive_scan(ExecutionPolicy&& exec, // see 25.2.5 ForwardIterator1 first, ForwardIterator1 last, ForwardIterator2 result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class UnaryOperation,class BinaryOperation, class UnaryOperation, class T> ForwardIterator2 transform_inclusive_scan(ExecutionPolicy&& exec, // see 25.2.5 ForwardIterator1 first, ForwardIterator1 last, ForwardIterator2 result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op, T init);
Apply the following changes to the definitions of transform_inclusive_scan
in 26.8.10 [transform.inclusive.scan]:
template <class InputIterator, class OutputIterator,class UnaryOperation,class BinaryOperation, class UnaryOperation> OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last, OutputIterator result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op); template <class InputIterator, class OutputIterator,class UnaryOperation,class BinaryOperation, class UnaryOperation, class T> OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last, OutputIterator result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op, T init); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class UnaryOperation,class BinaryOperation, class UnaryOperation> ForwardIterator2 transform_inclusive_scan(ExecutionPolicy&& exec, ForwardIterator1 first, ForwardIterator1 last, ForwardIterator2 result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op); template <class ExecutionPolicy, class ForwardIterator1, class ForwardIterator2,class UnaryOperation,class BinaryOperation, class UnaryOperation, class T> ForwardIterator2 transform_inclusive_scan(ExecutionPolicy&& exec, ForwardIterator1 first, ForwardIterator1 last, ForwardIterator2 result,UnaryOperation unary_op,BinaryOperation binary_op, UnaryOperation unary_op, T init);
5. Acknowledgement
Thanks to:

Agustín Bergé for helping identify this issue.

Alisdair Meredith for assisting with wording and tackling the related problem with
adjacent_difference
. 
Hartmut Kaiser, JF Bastien, Michael Garland and Jared Hoberock for providing feedback.