Document number: N3759

Date: 2013-08-30

Project: Programming Language C++, Library Working Group

Reply-to: Matthias Kretz <kretz@kde.org> <kretz@compeng.uni-frankfurt.de>

Introduction

Motivation

Problem

Proposal — SIMD Types

Acknowledgements

References

In Bristol N3571 (A Proposal to add Single Instruction Multiple Data Computation to the Standard Library) was discussed and the following straw poll was taken: “Should C++ include a fixed length vector type to abstract vector registers”

- 2/2/1/4/6 SF/WF/N/WA/SA
- Consensus not to move forward

This poll was assuming a slightly different approach than presented in this proposal. It did not make sense to take over the discussion then to talk about this approach. In/after that session I was told that I should write a new proposal.

Since many years the number of SIMD instructions and the size of SIMD registers have been growing. Newer microarchitectures introduce new operations to optimize certain (common or specialized) operations. Additionally, the size of SIMD registers has increased and may increase further in the future.

The typical minimal set of SIMD instructions for a given scalar data type comes down to the following:

- Load instructions: load N successive scalar values starting from a given address into a SIMD register. SSE examples:
movaps (%rax),%xmm0 movups 0x4(%rax),%xmm1

- Store instructions: store from a SIMD register to N successive scalar values at a given address. SSE examples:
movaps %xmm0,(%rax) movups %xmm1,0x4(%rax)

- Arithmetic instructions. SSE examples:
addps %xmm0,%xmm1 mulps %xmm1,%xmm1 divps %xmm1,%xmm0

- Compare instructions. SSE examples:
cmpeqps %xmm0,%xmm1

- Bitwise instructions. SSE examples:
andps %xmm0,%xmm1 xorps %xmm0,%xmm0

The set of available operations can differ considerably between different microarchitectures of the same CPU family. Furthermore there are different SIMD register sizes. Future extensions will certainly add more instructions and larger SIMD registers.

There is no need to motivate SIMD programming. It is very much needed, the open question is only: “How?”.

There have been several approaches to vectorization. I'd like to only discuss the merits of SIMD types here.

SIMD registers and operations are the low-level ingredients to SIMD programming. Higher-level abstractions can be built on top of these. If the lowest-level access to SIMD is not provided, users of C++ will be constrained to work within the limits of the provided abstraction.

In some cases the compiler might generate better code if only the intent is stated instead of an exact sequence of operations. Thus higher-level abstractions might seem preferable to low-level SIMD types. In my experience this is not the case because programming with SIMD types makes intent very clear and compilers can optimize sequences of SIMD operations just like they can for scalar operations. SIMD types do not lead to an easy and obvious answer to efficient and easily usable data structures, though.

One major benefit from SIMD types is that the programmer can gain an intuition for SIMD. This subsequently influences further design of data structures and algorithms to better suit SIMD architectures.

There are already many users of SIMD intrinsics (and thus a primitive form of SIMD types). Providing a cleaner and portable SIMD API would provide many of them with a better alternative. Thus SIMD types in C++ would capture existing practice.

The challenge remains in providing *portable* SIMD types and operations.

C++ has no means to use SIMD operations directly. There are indirect uses through loop vectorization or optimized algorithms (that use extensions to C/C++ or assembler for their implementation).

All compiler vendors (that I worked with) add intrinsics support to their compiler products to make these operations accessible from C. These intrinsics are inherently not portable and most of the time very directly bound to a specific instruction. (Compilers are able to statically evaluate and optimize SIMD code written via intrinsics, though.)

Solutions that encode data parallel operations via the application of operations or algorithms on a larger set of data, quickly lead to inefficient use of the CPUs caches. Consider valarray:

1 std::valarrayA compiler will be able to vectorize lines 3 and 4.data(N); 2 // initialize it somehow 3 data *= 2.f; 4 data += 1.f;

§ 6.2 [stmt.expr]: "All side effects from an expression statement are completed before the next statement is executed." does not necessarily stop the compiler from combining the operations on lines 3 and 4. Still, (for a larger number of expressions and containers) it cannot be reasonably expected/required that compilers will be able to combine the statements. Thus, we must assume that line 3 will iterate over N values and only afterwards line 4 is allowed to iterate over the same memory again. But modern CPUs require small working-sets to make efficient use of their caches. Therefore N should not be too large. General and exact bounds for N do not exist, though.

Any solution to achieving smaller working sets requires some form of loop construct with current C++.

The following is not a solution for this example unless expression templates were used in the implementation:

... 3 data = data * 2.f + 1.f;

For the reasons sketched in the section above, Intel experimented with an approach that would split the algorithm in optimized working sets at runtime. This required special sections of code which generated the code at runtime that subsequently was executed to do the actual work. The semantics in different sections of the code were thus slightly different: Something which can be really surprising and hard to understand for developers.

Whenever you take the approach of expressing the algorithm on the whole large data, you will either end up with inefficient cache usage or a solution resembling Intel Array Building Blocks. (I'd be happy to learn of another — better — solution, of course.)

Therefore loops still remain the state of the art for cache efficient processing of large data sets. The loop count can be reduced by executing the loop body simultaneously on SIMD vectors. (Additionally the loop can be divided onto multiple cores.)

The main portability concern stems from different SIMD register widths for different targets. This is a real problem mainly when SIMD types are used in I/O. But this is comparable to differing endianness. It simply requires software to use a portable interchange format (e.g. SoA or AoS of scalar types).

Typically the compiler is told the target microarchitecture via flags. Obviously this will create code that is not guaranteed to run on older/incompatible CPUs. An implementation might decide to compile the code for several targets, though and decide for the best one to use at runtime. But it is also possible to achieve runtime selection of the target microarchitecture without help from the compiler (e.g. Krita uses Vc this way).

This is a pure library proposal. It does depend on implementation-specific language extensions, though (e.g. SIMD intrinsics/builtins). The proposal builds upon the experience from the Vc library [1, 2].

- Mask types and solutions for conditional assignment / predicates
- Load/store optimizations: streaming and prefetching
- Gather/scatter
- Shuffles, swizzles, vector shift/rotate
- Portable optimized (de)interleaving
- Iterators / Ranges (make iteration with SIMD types easier)
- Containers for SIMD
- “valarray<simd_vector<T>, N>”
- support for load/store with half-precision floats
- “fix” allocators to support SIMD types (over-aligned) throughout the C++ standard (
`new`

,`std::allocator`

)

Provide at least the following new types:

`int16_v`

`int32_v`

`uint16_v`

`uint32_v`

`float_v`

`double_v`

`int8_v`

`uint8_v`

`int64_v`

`uint64_v`

Each class has a single data member, an implementation-specific object to access the SIMD register (this could be `__m256`

with AVX intrinsics).
If the type is not supported for SIMD on the target platform, the data member will be a single scalar value.
For example `double_v`

has one `double`

member for ARM NEON.

The sizes of these SIMD types therefore depend on the natural size suggested by the architecture of the execution environment.
This is similar to § 3.9.1 [basic.fundamental] p2: “Plain `int`

s have the natural size suggested by the architecture of the execution environment”.

These types should be instantiations of a SIMD vector template class: `typedef simd_vector`

.
This makes generic code slightly easier to create, without the need for SFINAE:

template<typename T> void someFunction(simd_vector<T> v); // vs. template<typename V> typename std::enable_if<is_simd_vector<V>::value, void>::type someFunction(V v);

The SIMD types all need to be inside a namespace whose name depends on the ABI/target. For instance the symbols for SSE and AVX SIMD types must be different so that incompatible object files/libraries fail to link.

inline namespaceABI/target-dependent{

The `simd_vector<T>`

class:

template<typename T> class simd_vector { typedefimplementation-definedsimd_type; simd_type data; public: typedef T value_type; // the number of values in the SIMD vector static constexpr size_t size = sizeof(simd_type) / sizeof(value_type); // in Vc operator new / delete are overloaded to work around the fact that new ignores the alignof // of the allocated type. If this does not get solved in the standard the following will be needed: void *operator new(size_t size); void *operator new(size_t, void *p) { return p; } void *operator new[](size_t size); void *operator new[](size_t , void *p) { return p; } void operator delete(void *ptr, size_t); void operator delete(void *, void *) {} void operator delete[](void *ptr, size_t); void operator delete[](void *, void *) {} // init to zero float_v(); // copy simd_vector(const simd_vector &); simd_vector &operator=(const simd_vector &); // implicit conversion from compatible simd_vector<T> template<typename U> simd_vector(simd_vector<U>, typename enable_if<is_integral<value_type>::value && is_integral<U>::value && size == simd_vector<U>::size, void *>::type = nullptr); // static_cast from vectors of possibly (depending on target) different size // (dropping values or filling with 0 if the size is not equal) template<typename U> explicit simd_vector(simd_vector<U>, typename enable_if<!( is_integral<value_type>::value && is_integral<U>::value && size == simd_vector<U>::size), void *>::type = nullptr); // broadcast with implicit conversions simd_vector(value_type); // load member functions void load(const value_type *mem); template<typename U> void load(const U *mem) ; // load ctors (optional) explicit Vector(const value_type *mem); template<typename U> explicit Vector(const U *mem); // store functions void store(value_type *mem) const; template<typename U> void store(U *mem) const; // unary operators simd_vector &operator++(); simd_vector operator++(int); simd_vector &operator--(); simd_vector operator--(int); simd_vector operator~() const; simd_vector operator+() const; simd_vector<typename negate_type<T>type> operator-() const; // assignment operators simd_vector &operator+=(simd_vector<T> x); simd_vector &operator-=(simd_vector<T> x); simd_vector &operator*=(simd_vector<T> x); simd_vector &operator/=(simd_vector<T> x); simd_vector &operator%=(simd_vector<T> x); simd_vector &operator&=(simd_vector<T> x); simd_vector &operator|=(simd_vector<T> x); simd_vector &operator^=(simd_vector<T> x); simd_vector &operator<<=(simd_vector<T> x); simd_vector &operator>>=(simd_vector<T> x); simd_vector &operator<<=(int x); simd_vector &operator>>=(int x); // scalar entries accessimplementation-defined&operator[](size_t index); value_type operator[](size_t index) const; }; typedef simd_vector< float> float_v; typedef simd_vector< double> double_v; typedef simd_vector< int64_t> int64_v; typedef simd_vector<uint64_t> uint64_v; typedef simd_vector< int32_t> int32_v; typedef simd_vector<uint32_t> uint32_v; typedef simd_vector< int16_t> int16_v; typedef simd_vector<uint16_t> uint16_v; typedef simd_vector< int8_t> int8_v; typedef simd_vector< uint8_t> uint8_v; } // namespace

Implicit conversion between vectors and implicit conversion from scalars to vectors follows the rules of conversion between scalar types as closely as possible. The rules are:

- If
`U`

implicitly converts to`T`

then implicit conversion from`U`

to`simd_vector<T>`

also works. - If
`simd_vector<U>::size == simd_vector<T>::size`

works portably, then implicit conversion from`simd_vector<U>`

to`simd_vector<T>`

also works.

In Vc the following guarantees are made:

`float_v::size == int32_v::size == uint32_v::size`

`int16_v::size == uint16_v::size`

int64_v::size == uint64_v::size

int32_v::size == uint32_v::size

int16_v::size == uint16_v::size

int8_v::size == uint8_v::size

`simd_vector<T>`

types.
Explicit conversions between vectors of possibly different size are allowed and will either drop the last values or fill the remaining values in the destination with zeros. (Together with vector shift/rotate functions and a bit of care this allows portable code that casts between int and float vectors.)

The most important portable way to efficiently fill a complete SIMD vector is via loads.
A load requires a pointer to at least `size`

consecutive values in memory.
Whether load constructors should be specified needs to be discussed.
Their use does not explicitly state the operation:

float *data = ...; float_v a(data); // it is not obvious what this line of code does float_v b; b.load(data); // this is a lot clearerOn the other hand there should be a way to fill a vector on construction (e.g. to ease the use of

`const`

).
As for loads, the most important portable way to efficiently store the results from a SIMD vector is via store functions.

The overloads of load/store with `value_type*`

exists to support arguments that have an implicit conversion to `value_type*`

.

Loads and stores can optionally do a conversion from/to memory of a different type (e.g. `float_v fun(short *mem) { return float_v(mem); }`

).
This feature is important because:

- hardware may provide special support for converting loads/stores (e.g. Intel MIC)
- the pattern is otherwise much harder to use (e.g. using
`float`

for storage and`double_v`

for calculation)

All arithmetic, logical, and bit-wise operators are overloaded for combinations of different simd_vector<T> and builtin scalar types.

template<typename L, typename R> typename determine_return_type<L, R>::type operator+(L &&x, R &&y) { typedef typename determine_return_type<L, R>::type V; returnIt is possible to get correct automatic conversion. Theinternal_add_function(V(std::forward<L>(x)), V(std::forward<R>(y))); }

`determine_return_type`

type decides which type combinations are allowed and what conversions are done.
Consider:

void f(int32_v x, unsigned int y) { auto z = x * y; ...We expect the type of

`z`

in to be `uint32_v`

.
Analogue to `1 * 1u`

yielding an `unsigned int`

.
On the other hand the following example must always fail to compile:

void f(int32_v x, double_v y) { auto z = x * y; ...While

`1 * 1.`

is well-defined and yields a `double`

, the issue with SIMD vectors is that their number of entries is not guaranteed to match.
(Consider x holding four values and y holding two values.
The semantics of a multiplication of `x`

with `y`

is unclear.
It could mean [x0 * y0, x1 * y0, x2 * y1, x3 * y1], or [x0 * y0, x1 * y1, x2, x3], or [x0 * y0, x1 * y1, x2 * y0, x3 * y1], or [x0 * y0, x1 * y1], or [x2 * y0, x3 * y1].
Or, for a different target, it could even simply mean [x0 * y0].)
Via SFINAE the operators can be defined such that only the following type combinations work:

```
simd_vector<T> × is_convertible<From, simd_vector<T> && !is_narrowing_float_conversion<From, T>
```

A follow-up proposal will define the `determine_return_type`

class.
This proposal could also discuss operator implementations for better error reporting (via `static_assert`

) if a `simd_vector`

is combined with an incompatible second operand.

All the functions in `<cmath>`

can/should be overloaded to accept SIMD vectors as input.

With the `simd_vector<T>`

class it is possible to explicitly vectorize simple loops:

std::array<float, 1024> x_data, y_data, r_data, phi_data; // fill x and y with random values from -1 to 1 // The following loop converts the x,y coordinates into r,phi polar coordinates. for (int i = 0; i < 1024; i += float_v::size) { const float_v x(&x_data[i]); const float_v y(&y_data[i]); const float_v r = sqrt(x * x + y * y); const float_v phi = atan2(y, x) * 57.29578f; r.store(&r_data[i]); phi.store(&phi_data[i]); }already shows a few issues that will be solved in follow-up proposals:

- The alignment of the float arrays is not guaranteed to work for aligned SIMD loads and stores.
- The loads and stores are not important to the algorithm, but still they take up most of the code in the loop.
- If the size of the arrays is not a multiple of the SIMD width, out-of-bounds accesses would result or an extra scalar epilogue were required.

If this example is compiled for an x86 target with SSE2 instructions the loop body calculates four results in parallel.
If it is compiled for an x86 target with AVX instructions the loop body calculates eight results in parallel.
If the target does not support `float`

SIMD vectors the loop body calculates just one result.

This is a (slightly reduced) Kalman-Filter example from CERN/RHIC track reconstruction. It is used to fit the track parameters of several particles in parallel.

struct Covariance { float_v C00, C10, C11, C20, C21, C22, C30, C31, C32, C33, C40, C41, C42, C43, C44; }; struct Track { float_v x, y, tx, ty, qp, z; float_v chi2; Covariance C; float_v NDF; ... }; struct HitInfo { float_v cos_phi, sin_phi, sigma2, sigma216; }; void Filter(Track &track, const HitInfo &info, float_v m) { Covariance &C = track.C; const float_v residual = info.cos_phi * track.x + info.sin_phi * track.y - m; // ζ = Hr - m const float_v F0 = info.cos_phi * C.C00 + info.sin_phi * C.C10; // CHᵀ const float_v F1 = info.cos_phi * C.C10 + info.sin_phi * C.C11; const float_v F2 = info.cos_phi * C.C20 + info.sin_phi * C.C21; const float_v F3 = info.cos_phi * C.C30 + info.sin_phi * C.C31; const float_v F4 = info.cos_phi * C.C40 + info.sin_phi * C.C41; const float_v HCH = F0 * info.cos_phi + F1 * info.sin_phi; // HCHᵀ const float_v wi = 1.f / (info.sigma2 + HCH); const float_v zetawi = residual * wi; // (V + HCHᵀ)⁻¹ ζ const float_v K0 = F0 * wi; const float_v K1 = F1 * wi; const float_v K2 = F2 * wi; const float_v K3 = F3 * wi; const float_v K4 = F4 * wi; track. x -= F0 * zetawi; // r -= CHᵀ (V + HCHᵀ)⁻¹ ζ track. y -= F1 * zetawi; track.tx -= F2 * zetawi; track.ty -= F3 * zetawi; track.qp -= F4 * zetawi; C.C00 -= K0 * F0; // C -= CHᵀ (V + HCHᵀ)⁻¹ HC C.C10 -= K1 * F0; C.C11 -= K1 * F1; C.C20 -= K2 * F0; C.C21 -= K2 * F1; C.C22 -= K2 * F2; C.C30 -= K3 * F0; C.C31 -= K3 * F1; C.C32 -= K3 * F2; C.C33 -= K3 * F3; C.C40 -= K4 * F0; C.C41 -= K4 * F1; C.C42 -= K4 * F2; C.C43 -= K4 * F3; C.C44 -= K4 * F4; track.chi2 += residual * zetawi; // χ² += ζ (V + HCHᵀ)⁻¹ ζ track.NDF += 1; }In the

`Filter`

function a new measurement (`m`

) is added to the `Track`

state.
In the data structures used here, the objects each contain the data of `float_v::size`

tracks.
So instead of working with one track at a time, the code explicitly states that multiple tracks can be filtered together and that their data is stored interleaved in memory.
(The corresponding track (t`Track`

object (with e.g. SSE) looks like this:
`[x`_{t0} x_{t1} x_{t2} x_{t3}
y_{t0} y_{t1} y_{t2} y_{t3}
tx_{t0} ...
]

With AVX:
`[x`_{t0} x_{t1} x_{t2} x_{t3}
x_{t4} x_{t5} x_{t6} x_{t7}
y_{t0} y_{t1} y_{t2} y_{t3}
y_{t4} ...
]

)
- Thanks to Chandler Carruth for a good discussion about SIMD in C++ and comments on the structure and focus of this proposal.
- Thanks to Matthias Bach, David Rohr, and Volker Lindenstruth for API discussions and encouragement.
- Thanks to Igor Kulakov and Maksym Zyzak for all the feedback about usage of Vc in RHIC software.
- This work was supported by GSI Helmholtzzentrum für Schwerionenforschung and the Hessian LOEWE initiative through the Helmholtz International Center for FAIR (HIC for FAIR).

[2] Vc Repository http://code.compeng.uni-frankfurt.de/projects/vc