diff --git a/libcudacxx/docs/extended_api/synchronization_primitives/barrier.md b/libcudacxx/docs/extended_api/synchronization_primitives/barrier.md
index ac9ccfe78f..ab3e39c070 100644
--- a/libcudacxx/docs/extended_api/synchronization_primitives/barrier.md
+++ b/libcudacxx/docs/extended_api/synchronization_primitives/barrier.md
@@ -84,22 +84,34 @@ This may complete the current phase.
 
 ### Phase Completion of a `cuda::barrier` with tx-count support
 
-The requirements for completion of the current phase are described below. Upon
-completion of the current phase, the phase transitions to the subsequent phase
-as described below.
-
-**Current phase completion requirements**. A `cuda::barrier` with tx-count
-support completes the current phase when all of the following conditions are
-met:
-
-1. The count of the pending arrivals has reached zero.
-2. The tx-count has reached zero.
-
-**Phase transition**. When a `cuda::barrier` with tx-count support completes the
-current phase, the following actions are performed atomically:
-
-1. The `cuda::barrier` object transitions to the next phase.
-2. The pending arrival count is reinitialized to the expected arrival count.
+Differences to phase completion of `std::barrier` are highlighted in bold in the
+text below.
+
+A barrier is a thread coordination mechanism whose lifetime consists of a
+sequence of barrier phases, where each phase allows at most an expected number
+of threads to block until the expected number of threads **and the expected number
+of transaction-based asynchronous operations** arrive at the barrier.
+
+Each barrier phase consists of the following steps:
+
+1. The `expected count` is decremented by each call to `arrive`, `arrive_and_drop`, or
+   **`cuda::device::barrier_arrive_tx`**.
+2. **The transaction count is incremented by each call to
+   `cuda::device::barrier_arrive_tx` and decremented by the completion of
+   transaction-based asynchronous operations such as `cuda::device::memcpy_async_tx`**.
+3. Exactly once after **both** the expected count and **the transaction count** reach
+   zero, a thread executes the completion step during its call to `arrive`,
+   `arrive_and_drop`, or `wait`, except that it is implementation-defined whether
+   the step executes if no thread calls `wait`.
+4. When the completion step finishes, the expected count is reset to what was
+   specified by the expected argument to the constructor, possibly adjusted by
+   calls to `arrive_and_drop`, **the transaction count is reset to zero**, and the
+   next phase starts.
+
+Concurrent invocations of the member functions of barrier **and the non-member
+barrier APIs in `cuda::device`**, other than its destructor, do not introduce data
+races. The member functions `arrive` and `arrive_and_drop`, and the non-member
+function **`cuda::device::barrier_arrive_tx`**, execute atomically.
 
 ## Implementation-Defined Behavior
 
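
The four phase steps in the added text can be traced with a small host-side model. The sketch below is only an illustration and not the libcu++ implementation: `TxBarrierModel` and its member names are hypothetical, it is single-threaded (real barriers update these counters atomically), and it assumes the transaction count is a plain signed integer of outstanding units such as bytes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical single-threaded model of the phase bookkeeping described in the
// patch. It only tracks the two counters; it does not block or synchronize.
struct TxBarrierModel {
    std::ptrdiff_t expected;      // count that each new phase starts with
    std::ptrdiff_t pending;       // arrivals still outstanding in this phase
    std::ptrdiff_t tx_count = 0;  // outstanding transaction units (e.g. bytes)
    unsigned phase = 0;           // number of completed phases

    explicit TxBarrierModel(std::ptrdiff_t expected_count)
        : expected(expected_count), pending(expected_count) {}

    // Step 1: stands in for `arrive`.
    void arrive() {
        assert(pending > 0);
        --pending;
        try_complete();
    }

    // Step 1 plus the adjustment from step 4: stands in for `arrive_and_drop`.
    void arrive_and_drop() {
        assert(pending > 0);
        --pending;
        --expected;  // later phases expect one fewer arrival
        try_complete();
    }

    // Steps 1 and 2: stands in for `cuda::device::barrier_arrive_tx`.
    void arrive_tx(std::ptrdiff_t tx) {
        tx_count += tx;
        arrive();
    }

    // Step 2: stands in for completion of a transaction-based async operation.
    void complete_tx(std::ptrdiff_t tx) {
        tx_count -= tx;
        try_complete();
    }

private:
    // Steps 3 and 4: complete the phase only when both counters are zero,
    // then reset both counters and start the next phase.
    void try_complete() {
        if (pending == 0 && tx_count == 0) {
            pending = expected;
            tx_count = 0;
            ++phase;
        }
    }
};
```

In this model, a barrier constructed with an expected count of 2 stays in phase 0 after `arrive()` followed by `arrive_tx(128)`, because 128 transaction units are still outstanding. Only the matching `complete_tx(128)` advances it to phase 1.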