A constexpr bitwise operations library for C++

Introduction

This proposal adds support for low level bitwise and logical operations to C++.

Impact on the standard

This proposal is a pure library extension. It does not require any changes in the core language and does not depend on any other library extensions. The proposal is composed entirely of free functions. The proposed functions are added to the <cmath> and <memory> headers. No new headers are introduced.

While this proposal can be implemented entirely in standard C++14, optimal implementations will require additional support from the compiler to detect and replace function calls with native instructions when available. See [BitOpsRef] for a reference implementation written in C++14.

Motivation

The C and C++ languages provide an abstraction over the machine. The machine provides the common arithmetic and logical operations, which are accessed using the built in operators inherited from C. These operations are the primitives which are used to implement higher level abstractions.

We construct algorithms by combining these basic operations. Sometimes significant performance benefits can be gained by directly manipulating the binary quantities contained within the registers which make up this numerical abstraction. Many online and print references including [Anderson01], [Dietz01], [Neumann01], [Warren01], and [HACKMEM] are devoted to discovering these algorithms and implementing them efficiently.

Hardware vendors have understood the importance of high performance bitwise manipulation routines. Many of them have provided additional hardware which can perform these bitwise operations directly with a single instruction. These instructions are often much more efficient than computing the algorithm manually in C or assembly.

Other bitwise manipulation algorithms can be implemented using clever but non-intuitive combinations of arithmetic and logical operations. Most importantly, for some bitwise algorithms, the most efficient implementation varies between hardware platforms. These differences create an unreasonably large maintenance burden on the programmer who wishes to write efficient and portable code.

As a motivating example, consider the various implementations of the count trailing zeroes algorithm presented in [Kostjuchenko01]. In order to implement an SSE2 optimized strlen() function, the author had to implement, test, and profile many different versions of count trailing zeroes. None of them take advantage of native instructions.

One who wishes to exploit the bsf or tzcnt instructions on Intel must rely on non-standard compiler intrinsics or inline assembly. One must also provide backup implementations for other platforms which do not have such instructions. Adding support for native instructions requires a nest of #ifdefs and deep knowledge of different processor architectures. This is a heavy cost in programmer time.

Bitwise algorithms are general purpose tools which can be used in a wide variety of domains and are the key to unlocking high performance in many important algorithms. A bitwise operations library has been badly needed in the C and C++ standard libraries for many years.

We present a bitwise operations library which exposes these native instructions wherever possible and provides backup implementations in C++ if they do not exist. Because this library would be part of the standard, these commonly used routines could be implemented once and profiled on each platform. Finally, this library offers an interface which takes full advantage of the latest processor features including ARMv8 and Intel BMI2.

Design Goals and Scope

There are seemingly endless ways one can manipulate binary quantities. How does one go about choosing which ones to include in a library and which ones to exclude? How does one choose proper names for each function? Which algorithms can be trivially converted into single instructions by the optimizer and which actually require the programmer to declare their use through a function call? We will address these questions with the following design goals.

Design Goal 1: Provide the programmer with better access to the machine

In 1970, the Digital Equipment Corporation announced the PDP11. This 16 bit machine has 3 instructions of interest, ROR (rotate right), ROL (rotate left), and SWAB (swap bytes). These operations along with their later 32 and 64 bit variants are provided by many more modern machines as will be shown. As of 2013, the programmer still does not have direct access to these instructions in modern C and C++.

Therefore, the first and most important goal of this proposal is to provide the programmer with better access to the machine via a new set of primitives which go beyond simple arithmetic and logical operations. We will present new functions for the standard library which can be implemented using only few instructions if supported by the machine, using backup implementations if no such support is provided by the hardware.

Design Goal 2: Provide a reusable library of generic bitwise manipulation routines

In designing this proposal, we wish not just to limit ourselves to operations which may have native machine instruction implementations on at least one platform. We would like to provide a standard library of primitives which are commonly found to be reimplemented time and time again in different code bases. The standard library already provides a rich set of generic containers and algorithms. What is missing is a set of bitwise manipulation primitives.

Of particular emphasis are algorithms whose most efficient implementations depend on the implementations of other bitwise operations. A motivating example is checking whether a number is a power of 2.

Consider the following implementations:

bool ispow2(unsigned x) { return popcount(x) == 1; }
bool ispow2(unsigned x) { return x != 0 && (x & (x -1)) == 0; }

In the above example, popcount() is the population count or number of 1 bits in x. On a machine with a popcount instruction, the first implementation uses less instructions and no branches. Without a popcount instruction, the second version is the better choice as computing popcount requires much more than a few logical operations and comparisons [Dietz01]. In order to implement ispow2(), the programmer is faced with the same set of dilemnas as with the count trailing zeroes example from the Motivation section.

Glossary of Terms

The following terminology is used in the remainder of this document to describe the technical aspects of this proposal.

Technical Specification

We will now describe the additions to <cmath> and <memory>. This is a procedural library implemented entirely using constexpr templated free functions. In addition, each function has been qualified with noexcept. These operations are most often used in highly optimized numerical code where the overhead of exception handling would be inappropriate. For functions which have pre-conditions on their inputs, we have opted for undefined return values if these pre-conditions are ever violated. The functions are classified into different groups to aid analysis and discussion and each group will be presented one at a time.

We have chosen to support all signed and unsigned integral types in this proposal. It is often suggested that signed integers represent "numbers" and unsigned integers represent "bit fields" and that they should never be used together. While we generally agree with this philosophy, many of these algorithms have real use cases for signed and unsigned integral values. The primary danger of using both signed and unsigned integers comes from the pitfalls of comparing signed and unsigned values. None of the functions in this proposal require or encourage comparing or combining of signed and unsigned types. The template arguments for each proposed function are named integral to indicate generic support for all builtin integral types, signed and unsigned. Functions which take more than one integral argument of different types will use a single letter suffix, for example integrall and integralr for left and right hand arguments.

With regards to signed integers, this proposal does not require signed integers be implemented using 2's compliment. However, the design of this proposal considers the practical reality that almost all modern hardware does in fact use 2's compliment. All example code, including the reference implementation [BitOpsRef] assume 2's compliment signed integers with undefined behavior on overflow and underflow. Adding support for other signed representations is an exercise left for the reader.

Each section will describe the full technical specifications of the functions in that group, noting their return values and undefined behavior if any. We will also discuss the background and justification for each of the functions and list some applications where necessary. For the more complicated algorithms, examples will be provided to help illustrate how they work.

cmath Header Additions

The following sections describe the additions to the <cmath> header.

Explicit shifts

Bit shifting is provided in C++ with operator<< and operator>> for integral types. It is a very simplistic abstraction with many deficiencies and some subtle caveats.

First as noted earlier, there is no primitive for rotational shifts even though these shifts can be found in the instruction set of almost every machine. Second, operator>> for signed types has implementation defined behavior with regards to filling in the high order bits, making it nearly useless when writing portable code. Writing a portable arithmetic right shift cumbersome at best and inefficient at worst. Finally, performing a logical right shift on a signed quantity is also cumbersome because it requires casts which obscure the meaning of the code.

List of Functions

//SHift Logical Left
template <class integral>
constexpr integral shll(integral x, int s) noexcept;
//SHift Logical Right
template <class integral>
constexpr integral shlr(integral x, int s) noexcept;
//SHift Arithmetic Left
template <class integral>
constexpr integral shal(integral x, int s) noexcept;
//SHift Arithmetic Right
template <class integral>
constexpr integral shar(integral x, int s) noexcept;
//ROTate Left
template <class integral>
constexpr integral rotl(integral x, int s) noexcept;
//ROTate Right
template <class integral>
constexpr integral rotr(integral x, int s) noexcept;

Bit Counting Algorithms

Bit counting is used to construct efficient implementations of other higher level algorithms. Many of these operations have native support on a wide variety of modern and antiquated hardware. Some example applications will be provided below.

List of functions

//CouNT Trailing 0's
template <class integral>
constexpr int cntt0(integral x) noexcept;
//CouNT Leading 0's
template <class integral>
constexpr int cntl0(integral x) noexcept;
//CouNT Trailing 1's
template <class integral>
constexpr int cntt1(integral x) noexcept;
//CouNT Leading 1's
template <class integral>
constexpr int cntl1(integral x) noexcept;
//POPulation COUNT
template <class integral>
constexpr int popcount(integral x) noexcept;
//PARITY
template <class integral>
constexpr int parity(integral x) noexcept;

Applications

One application of cntt0 is in computing the greatest common divisor of 2 numbers. Credit goes to Howard Hinnant for bringing this to our attention.

template <typename unsigned-integral>
T gcd(T x, T y) 
{ 
    if (x == 0) 
        return y; 
    if (y == 0) 
        return x; 
    int cf2 = std::cntt0(x | y); 
    x >>= std::cntt0(x); 
    while (true) 
    { 
        y >>= std::cntt0(y); 
        if (x == y) 
            break; 
        if (x > y) 
            std::swap(x, y); 
        if (x == 1) 
            break; 
        y -= x; 
    } 
    return x << cf2; 
}

As mentioned earlier, we can use popcount() to detect whether or not an integer (signed or unsigned) is a power of 2.

template <typename integral>
bool ispow2(integral x) {
  return x > 0 && popcount(x) == 1;
}

Rightmost bit manipulation

The following functions perform simple manipulations on the rightmost bits of the given quantity. All of these operations can be trivially implemented using a few arithmetic and logical operations. Therefore, these functions are only provided as usability wrappers in order to allow the programmer to avoid having to spend time looking them up, reimplementing, and/or unit testing them. Because their simple implementations, we have included the implementations in the function listing. Credit goes to [Warren01] and [ChessProg] for providing these implementations and insight into their usage.

Most of the operations in the section are implemented in hardware on Intel and AMD machines which have Intel BMI and/or AMD TBM extensions (see Survey of Hardware Support). All of these functions were tested on gcc 4.8 using the provided C++ implementations. We found that on BMI and TBM enabled compilation, the optimizer was successfully able to compile the C++ expression a single BMI or TBM instruction. Therefore an implementation of this section can simply use the provided trivial implementations and rely on the optimizer for hardware support if available.

List of Functions

//ReSeT Least Significant 1 Bit
template <class integral>
constexpr integral rstls1b(integral x) noexcept;
//SET Least Significant 0 Bit
template <class integral>
constexpr integral setls0b(integral x) noexcept;
//ISOlate Least Significant 1 Bit
template <class integral>
constexpr integral isols1b(integral x) noexcept;
//ISOlate Least Significant 0 Bit
template <class integral>
constexpr integral isols0b(integral x) noexcept;
//ReSeT Trailing 1's
template <class integral>
constexpr integral rstt1(integral x) noexcept;
//SET Trailing 0's
template <class integral>
constexpr integral sett0(integral x) noexcept;
//MaSK Trailing 0's
template <class integral>
constexpr integral maskt0(integral x) noexcept;
//MaSK Trailing 1's
template <class integral>
constexpr integral maskt1(integral x) noexcept;
//MaSK Trailing 0's and Least Significant 1 Bit
template <class integral>
constexpr integral maskt0ls1b(integral x) noexcept;
//MaSK Trailing 1's and Least Significant 0 Bit
template <class integral>
constexpr integral maskt1ls0b(integral x) noexcept;

Single Bit Manipulation

Most programmers have at some point in their career have needed to index and manipulate a single bit within a given integral quantity. These functions are trivial to implement and are provided only for usability.

List of Functions

//SET BIT
template <class integral>
constexpr integral setbit(integral x, int b) noexcept;
//ReSeT BIT
template <class integral>
constexpr integral rstbit(integral x, int b) noexcept;
//FLIP BIT
template <class integral>
constexpr integral flipbit(integral x, int b) noexcept;
//TEST BIT
template <class integral>
constexpr bool testbit(integral x, int b) noexcept;

Range of bits manipulation

The following operations manipulate ranges of bits above or below a given index. One of them is implemented in hardware (see Survey of Hardware Support), the rest are provided for usability and completeness.

List of functions

//ReSeT BITS Greater than or Equal to
template <class integral>
constexpr integral rstbitsge(integral x, int b) noexcept;
//ReSeT BITS Less than or Equal to
template <class integral>
constexpr integral rstbitsle(integral x, int b) noexcept;
//SET BITS Greater than or Equal to
template <class integral>
constexpr integral setbitsge(integral x, int b) noexcept;
//SET BITS Less than or Equal to
template <class integral>
constexpr integral setbitsle(integral x, int b) noexcept;
//FLIP BITS Greater than or Equal to
template <class integral>
constexpr integral flipbitsge(integral x, int b) noexcept;
//FLIP BITS Less than or Equal to
template <class integral>
constexpr integral flipbitsle(integral x, int b) noexcept;

Bitwise and Bytewise Permutations

These functions provide a generic interface for permuting the bits and bytes in a word. Each function takes the following form:

template <typename integral>
constexpr integral permute_bits(integral x, /*arg2, arg3, ...*/ subword_bits=1, num_swar_words=1) noexcept;

In the above example, x is the value to permute, with additional arguments following it if needed. Each function operates on subwords of size subword_bits measured in bits. In this case, permute_bits(x) will perform the permutation on all of the bits of x, permute_bits(x, 2) will permute each pair of bits in x, permute_bits(x, 4) each nibble in x, and finally permute_bits(x, CHAR_BIT) will permute the bytes of x.

The num_swar_words parameter (Number of Simd Within A Register Words) enables parallel operation on multiple words within x. The size in bits of each individual word to permute will be sizeof(integral) * CHAR_BIT / num_swar_words. For example, permute_bits<uint32_t>(x, 1, 2) will independently permute the 16 high order bits of x and the 16 low order bits of x. Another example, permute_bits<uint32_t>(x, 2, 4) will permute the pairs of bits in each byte (assuming CHAR_BIT == 8) of x.

Finally, for each bitwise permutation routine, we also provide a corresponding bytewise permutation routine. These are provided for usability. They operate exactly like their bitwise cousins, except that their subword size is computed in bytes instead of bits.

template <typename integral>
constexpr integral permute_bytes(integral x, /*arg2, arg3, ...*/ subword_bytes=1, num_swar_words=1) noexcept;

The bytewise routines are trivially implemented as simple wrappers over the bitwise routines where we perform the simple conversion: subword_bits = subword_bytes * CHAR_BITS.

List of Functions

template <class integral>
constexpr integral reverse_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral reverse_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral outer_perfect_shuffle_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral outer_perfect_shuffle_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral outer_perfect_unshuffle_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral outer_perfect_unshuffle_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral inner_perfect_shuffle_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral inner_perfect_shuffle_bytes(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral inner_perfect_unshuffle_bits(integral x, int subword_bits=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral inner_perfect_unshuffle_bytes(integral x, int subword_bytes=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral deposit_bits_right(integral x, integral mask, int subword_bits=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral deposit_bytes_right(integral x, integral mask, int subword_right=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral deposit_bits_left(integral x, integral mask, int subword_bits=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral deposit_bytes_left(integral x, integral mask, int subword_bytes=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral extract_bits_right(integral x, integral mask, int subword_bits=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral extract_bytes_right(integral x, integral mask, int subword_bytes=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral extract_bits_left(integral x, integral mask, int subword_bits=1, int num_swar_words=1) noexcept;
template <class integral>
constexpr integral extract_bytes_left(integral x, integral mask, int subword_bytes=1, int num_swar_words=1) noexcept;

Examples

The following table shows how some example 8 bit binary values would be permuted by each function. We use the C++14 binary literal syntax here. The bits of a given value are represented by letters to show how a generic value would be permuted. For a more detailed treatment of these operations, refer to [Neumann01]] and Chapter 7 of [Warren01].

Applications

Power of 2 manipulation

The following functions detect and compute powers of 2.

List of Functions

//IS POWer of 2
template <class integral>
constexpr bool ispow2(integral x) noexcept;
//CEILing Power of 2
template <class integral>
constexpr integral ceilp2(integral x) noexcept;
//FLOOR Power of 2
template <class integral>
constexpr integral floorp2(integral x) noexcept;

Applications

Saturated arithmetic

Saturated arithmetic is useful in digital signal processing applications [EmbedGuru01]. It is also provided as a hardware instruction on some machines. In our efforts to better expose hardware features, we have included saturated addition and subtraction functions in this proposal.

List of Functions

//SATurated ADDition
template <class integral_l, class integral_r>
constexpr auto satadd(integral_l l, integral_r r) noexcept -&gt; decltype(l + r);
//SATurated SUBtraction
template <class integral_l, class integral_r>
constexpr auto satsub(integral_l l, integral_r r) noexcept -&gt; decltype(l - r);

memory Header Additions

This section describes the additions to the <memory> header.

Alignment helpers

These are primitives used for aligning objects in memory. They supplement use cases with which std::align is not designed to handle. These are very useful operations and all of them have trivial implementations. They can often be found in operating system kernels and device drivers reimplemented time and time again as macros.

List of Functions

template <class integral>
constexpr bool is_aligned(integral x, size_t align) noexcept;
bool is_aligned(void* val, size_t align) noexcept;
template <class integral>
constexpr integral align_up(integral x, size_t align) noexcept;
void* align_up(void* val, size_t align) noexcept;
template <class integral>
constexpr integral align_down(integral x, size_t align) noexcept;
void* align_down(void* val, size_t align) noexcept;

Applications and std::align

We currently have std::align in the standard for doing alignment calculations. The function std::align has one very specific use case, that is to carve out an aligned buffer of a known size within a larger buffer. In order to use std::align, the user must a priori know the size of the aligned buffer they require. Unfortunately in some use cases, even calculating the size of this buffer as an input to std::align itself requires doing alignment calculations. Consider the following example of using aligned SIMD registers to process a memory buffer. The alignment calculations here cannot be done with std::align.

void process(char* b, char* e) {
  char* pb = std::min((char*)std::align_up(b, sizeof(simd16)), e);
  char* pe = (char*)std::align_down(e, sizeof(simd16));

  for(char* p = b; p < pb; ++p) {
    process1(p);
  }
  for(char* p = pb; p < pe; p += sizeof(simd16)) {
    simd16 x = simd16_aligned_load(p);
    process16(x);
    simd16_aligned_store(x, p);
  }
  for(char* p = pe; p < e; ++p) {
    process1(p);
  }
}

We conclude that std::align is much too specific for general alignment calculations. It has a very narrow use case and should only be considered as a helper function for when that use case is needed.

Implementation

Guidelines for Implementors

Those who wish to implement the functions provided by this proposal must consider the following guidelines:

Survey of Hardware Support

The following is a list of compiler intrinsics and native instructions which can be used to implement the proposal on various platforms. Several machine architectures were surveyed for their instruction references. The purpose of this section is to demonstrate the current state of the art on many different machines. We have also noted when one operation is trivially implementable from another bitops proposal operation.

Open Questions

Naming

Naming is one of the most difficult problems in software development. One one extreme are terse names such as std::ctz() for Count Trailing Zeroes. This naming style mimics assembler mnemonics and is also an artifact of the old days when programming languages had limits on the length of names of identifiers.

These short names do have some merits. They reduce the amount of typing required by the programmer and more importantly they can be used within complex expressions. The downside is the ambiguity that can come with some short names. Consider a hypothetical Count Leading Sign bits function std::cls(). This name could be interpreted in other contexts such as CLear Screen.

On the other extreme are verbose names such as std::mask_least_significant_1_bit_and_trailing_zeroes(). While these names remove all ambiguity they are very cumbersome to type. They also cannot be used easily in complex expressions with other operations.

We have opted to make a compromise. The current naming scheme adheres to the following rules:

As always, the naming question is continuously up for debate and reconsideration. Some other styles have been suggested on the std-proposals discussion forum.

Support for std::bitset

Many people have suggesting adding support for std::bitset. While this is certainly a good idea, we believe that is outside of the scope of this proposal. Once this proposal is finished and the interface is agreed upon, adding a follow up proposal for std::bitset would be easy to do.

Support for C

This library would also be very useful for the C community. Many of these bitwise operations are used by embedded developers and they often choose to implement in C. While C compatibility is a noble goal, we do not want to make sacrifices to the C++ interface in the name of C compatibility. Particularly with regards to templates, overloading, and constexpr. This is first and foremost a C++ proposal which takes advantage of the latest C++ techniques to provide a modern procedural interface.

If the C community shows interest we may consider a C interface that uses the generic macro feature. This may allow interoperability, using macros for C and templates for C++. The constexpr qualifier could be used in the C++ version while inline is used in the C version. If the C community shows interest, we will consider a joint C proposal and flesh out the technical details of the interface and compatibility.

Acknowledgements

Thank you to everyone on the std proposals forum for feedback and suggestions.

References