Document Number: P2629R0
Date: 2022-07-08
Reply to: Gonzalo Brito Gadeschi <gonzalob _at_ nvidia.com>
Authors: Gonzalo Brito Gadeschi
Audience: Concurrency

barrier token-less split arrive/wait

Motivation

This proposal extends std::barrier with better support for data-pipelining. These pipelines are very common in, e.g., Deep Learning and HPC applications [0].

In Deep Learning pipelines targetting modern NUMA systems, two groups of threads, “producers” and “consumers”, synchronize access to the shared resources of a pipeline stage using a pair of barriers. Consumer threads wait on the “consumer barrier”, process data once unblocked, and arrive on the “producer barrier” to signal that shared resources are safe to reuse. Producer threads wait on the “producer barrier”, reuse shared resources to load the next data set, and arrive on the “consumer barrier” to signal consumer threads that this data set is ready for use.

In this synchronization pattern, consumer threads wait on the consumer barrier without arriving on it, and arrive on the producer barrier without waiting on it; and vice-versa for producer threads.

However, std::barrier APIs require threads to arrive on a barrier before waiting on it. This proposal allows producer and consumer threads to synchronize efficiently.

Tony tables

Real applications implement multi-stage pipelines with multiple shared resources. The following table shows a single stage of a consumer-producer pipeline involving one consumer thread, one producer thread, one data buffer as a shared resource, and a pair of std::barriers.

Note: the code of each thread runs in an infinite loop.

Before After
Producer Thread Consumer Thread Producer Thread Consumer Thread
producer.arrive_and_wait();
produce(data);
consumer.arrive();






consumer.arrive_and_wait();
consume(data);
producer.arrive();
producer.wait_parity(pp);
produce(data);
consumer.arrive_and_discard();






consumer.wait_parity(pp);
consume(data);
producer.arrive_and_discard();

Before this proposal:

These steps cause threads to unnecessarily:

  1. modify a barrier: by calling arrive_and_wait when they only need to wait,
  2. stall at the arrive until the barrier_token is constructed when they will immediately discard it afterwards, and
  3. cast [[nodiscard]] away: to avoid the warning produced by discarding the token.

After this proposal:

This proposal addresses the three pre-existing issues raised above.

User guide

The arrive_and_discard API is semantically equivalent to static_cast<void>(barrier.arrive()) but is easier to implement efficiently.

This section therefore focuses on how to use:

void wait_parity(bool phase_parity) const;

The phase completion step of std::barrier advances the barrier to the “next phase” (thread.barrier#class-1.3).

This proposal guarantees that:

This proposal defines that the parity of:

The wait_parity(bool phase_parity) API waits on the barrier phase parity changing from the specified phase_parity to a different parity. This is analogous to the atomic::wait(value) API, which waits on the atomic value changing from value to a different value.

Teaching examples

This example uses a barrier to print text in order:

void houston() { std::barrier barrier(2); // A std::thread apollo13([&barrier] { std::cout << "Okay, Houston, we've had a problem here" << std::endl; static_cast<void>(barrier.arrive()); // B }); bool phase_parity = false; barrier.arrive_and_wait(); // C std::cout << "This is Houston. Say again, please." << std::endl; apollo13.join(); }

The initial phase of a freshly initialized barrier is 0 and its parity is false. To implement this example with the proposed APIs we can:

void houston() { std::barrier barrier(1); // A std::thread apollo13([&barrier] { std::cout << "Okay, Houston, we've had a problem here" << std::endl; barrier.arrive_and_discard(); // B }); bool phase_parity = false; barrier.wait_parity(); // C std::cout << "This is Houston. Say again, please." << std::endl; apollo13.join(); }

The following example shows a thread that uses wait_pairty to wait on all barrier phases by alternating the phase_parity it waits on, on every iteration:

bool phase_parity = false; while(true) { barrier.wait_parity(phase_parity); phase_parity = !phase_parity; other_barrier.arrive_and_discard(); }

This thread arrives on some other_barrier, to make sure that other threads arriving on barrier don’t cause it to “skip” waiting on a particular phase.

Producer-consumer pipeline example

The following code snippet (try it out in compiler-explorer) shows a full working example of the producer-consumer pipeline described above and showcases how to combine these APIs:

#include <thread> #include <barrier> int data = -1; constexpr int max_count = 5; std::barrier producer(1); std::barrier consumer(1); void produce() { int count = 0; // Producer thread tracks the barrier phase separately. // On the first iteration, the producer thread waits // on "phase 1" of the barrier, which is an odd phase: bool phase = true; do { data = count++; // Once the data is produced, it notifies the consumer // by arriving on the "consumer" barrier using // "arrive_and_discard", because this thread does not // need to wait on that barrier: consumer.arrive_and_discard() if (count == max_count) return; producer.wait_parity(phase); // After unblocking from the barrier, the phase to // be waited on in the next iteration is updated: phase = !phase; } while(true); } void consume() { int count = 0; bool phase = true; while(true) { consumer.wait_parity(phase); phase = !phase; assert(data == count++); producer.arrive_and_discard(); if (count == max_count) return; } } int main() { std::thread t(produce); consume(); t.join(); return 0; }

Rationale

Applications cannot reuse std::barrier to solve these problems, and instead must re-implement it to add these APIs.

This proposal chooses to extend std::barrier with two APIs to enable applications to re-use it.

Rationale for waiting without a token

The proposed wait_parity API is a replacement for arrive_and_wait in certain situations.

The arrive_and_wait API has significant latency:

The wait_parity API enables applications that “know” the parity of the barrier phase they want to wait on, and that are able to do so safely, to just wait on the barrier, without having to arrive on it first.

Rationale for arriving without a token

The arrive_and_discard API is a replacement for arrive in situations where the calling thread is not going to wait at the barrier.

The arrive API:

The roundtrip latency of sending the message and waiting for the response is significant in large NUMA architectures.

P2588barrier’s phase completion guarantees” proposes relaxing the specification of barrier to enable implementations to complete the phase in one of the threads that waits at the barrier by requiring that at least one thread waits for the phase to flip.

On an implementation that completes the phase on a thread that waits, and on a proper hardware architecture with support for far atomics, arrive_and_discard is fire-and-forget. It sends a message to modify the barrier count, but it does not need to wait for a response.

During the call to wait, the implementation then picks one thread, and uses it to complete the phase, unblocking all other threads when the phase completes.

The reference implementation of this proposal on compiler-explorer provides such an implementation.

On large NUMA architectures and data-pipelining applications, the consumer barrier can be placed close to the consumer threads, and the producer barrier can be placed close to the producer threads.

This enables threads to poll the barrier locally and benefit from HW acceleration to do so, but increases the latency of arriving at the barrier.

The arrive_and_discard API brings the latency to zero on such an implementation, because the thread does not need to wait for a response to be able to make progress.

Impact

Impact on users

None, the semantics of existing code do not change, and it does not change the ABI.

Impact on implementations

The arrive_and_discard API imposes minimal overhead on implementations, since it can be naively implemented as:

void arrive_and_discard(ptrdiff_t update = 1) { static_cast<void>(arrive(update)); }

The implementation impact of wait_parity is “to be determined”. In the reference implementation, it is implemented naively as:

void wait_parity(bool phase_parity) const { wait(arrival_token { .phase = phase_parity }); }

Impact on/from other proposals

P2588 proposes relaxing std::barrier’s phase completion guarantees. Depending on the outcome of this proposal, wait_parity would, just like wait, be able to complete the barrier’s phase. In that case, it might make sense to not make wait_parity a const member function.

Unresolved questions

References

Proposed wording

thread.barrier.class:

namespace std {
  template<class CompletionFunction = see below>
  class barrier {
  public:
    using arrival_token = see below;

    static constexpr ptrdiff_t max() noexcept;

    constexpr explicit barrier(ptrdiff_t expected,
                               CompletionFunction f = CompletionFunction());
    ~barrier();

    barrier(const barrier&) = delete;
    barrier& operator=(const barrier&) = delete;

    [[nodiscard]] arrival_token arrive(ptrdiff_t update = 1);
    void arrive_and_discard(ptrdiff_t update = 1);
    void wait(arrival_token&& arrival) const;
    void wait_parity(bool phase_parity) const;

    void arrive_and_wait();
    void arrive_and_drop();

  private:
    CompletionFunction completion;      // exposition only
  };
}
  1. The initial phase of a barrier is the 0-th phase. The successor of the n-th barrier phase is the (n+1)-th phase.
  2. Each barrier phase consists of the following steps:
    1. The expected count is decremented by each call to arrive or arrive_and_drop.
    2. When the expected count reaches zero, the phase completion step is run. For the specialization with the default value of the CompletionFunction template parameter, the completion step is run as part of the call to arrive or arrive_and_drop that caused the expected count to reach zero. For other specializations, the completion step is run on one of the threads that arrived at the barrier during the phase.
    3. When the completion step finishes, the expected count is reset to what was specified by the expected argument to the constructor, possibly adjusted by calls to arrive_and_drop, and the nextsuccessor of the current phase starts.
  3. Each phase defines a phase synchronization point. Threads that arrive at the barrier during the phase can block on the phase synchronization point by calling wait, and will remain blocked until the phase completion step is run.
  4. The phase completion step that is executed at the end of each phase has the following effects:
    1. Invokes the completion function, equivalent to completion().
    2. Unblocks all threads that are blocked on the phase synchronization point.

      The end of the completion step strongly happens before the returns from all calls that were unblocked by the completion step. For specializations that do not have the default value of the CompletionFunction template parameter, the behavior is undefined if any of the barrier object’s member functions other than wait are called while the completion step is in progress.
  5. Concurrent invocations of the member functions of barrier, other than its destructor, do not introduce data races. The member functions arrive and arrive_and_drop execute atomically.
  6. CompletionFunction shall meet the Cpp17MoveConstructible (Table 30) and Cpp17Destructible (Table 34) requirements. is_nothrow_invocable_v<CompletionFunction&> shall be true.
  7. The default value of the CompletionFunction template parameter is an unspecified type, such that, in addition to satisfying the requirements of CompletionFunction, it meets the Cpp17DefaultConstructible requirements (Table 29) and completion() has no effects.
  8. barrier::arrival_token is an unspecified type, such that it meets the Cpp17MoveConstructible (Table 30), Cpp17MoveAssignable (Table 32), and Cpp17Destructible (Table 34) requirements.
static constexpr ptrdiff_t max() noexcept;
constexpr explicit barrier(ptrdiff_t expected,
                           CompletionFunction f = CompletionFunction());
[[nodiscard]] arrival_token arrive(ptrdiff_t update = 1);
void arrive_and_discard(ptrdiff_t update = 1);
void wait(arrival_token&& arrival) const;
void wait_parity(bool phase_parity) const;

UNRESOLVED QUESTION: if P2588 is accepted, we might not want to make wait_parity const.

void arrive_and_wait();
void arrive_and_drop();