Replace phase completion section

Instead of using the PTX docs as a reference, use the C++ barrier docs as a reference.
NVIDIA · Sep 12, 2023 · 8e2f2ef
1 parent 331ed7c
Showing 1 changed file with 28 additions and 16 deletions.
diff --git a/libcudacxx/docs/extended_api/synchronization_primitives/barrier.md b/libcudacxx/docs/extended_api/synchronization_primitives/barrier.md
@@ -84,22 +84,34 @@ This may complete the current phase.
 
 ### Phase Completion of a `cuda::barrier` with tx-count support
 
-The requirements for completion of the current phase are described below. Upon
-completion of the current phase, the phase transitions to the subsequent phase
-as described below.
-
-**Current phase completion requirements**. A `cuda::barrier` with tx-count
-support completes the current phase when all of the following conditions are
-met:
-
-1. The count of the pending arrivals has reached zero.
-2. The tx-count has reached zero.
-
-**Phase transition**. When a `cuda::barrier` with tx-count support completes the
-current phase, the following actions are performed atomically:
-
-1. The `cuda::barrier` object transitions to the next phase.
-2. The pending arrival count is reinitialized to the expected arrival count.
+Differences from the phase completion of `std::barrier` are highlighted in bold in
+the text below.
+
+A barrier is a thread coordination mechanism whose lifetime consists of a
+sequence of barrier phases, where each phase allows at most an expected number
+of threads to block until the expected number of threads **and the expected number
+of transaction-based asynchronous operations** arrive at the barrier.
+
+Each barrier phase consists of the following steps:
+
+1. The expected count is decremented by each call to `arrive`, `arrive_and_drop`, or
+   **`cuda::device::barrier_arrive_tx`**.
+2. **The transaction count is incremented by each call to
+   `cuda::device::barrier_arrive_tx` and decremented by the completion of
+   transaction-based asynchronous operations such as `cuda::memcpy_async_tx`**.
+3. Exactly once after **both** the expected count and **the transaction count** reach
+   zero, a thread executes the completion step during its call to `arrive`,
+   `arrive_and_drop`, or `wait`, except that it is implementation-defined whether
+   the step executes if no thread calls `wait`.
+4. When the completion step finishes, the expected count is reset to what was
+   specified by the expected argument to the constructor, possibly adjusted by
+   calls to `arrive_and_drop`, **the transaction count is reset to zero**, and the
+   next phase starts.
+
+Concurrent invocations of the member functions of `barrier`, other than its
+destructor, **and of the non-member barrier APIs in `cuda::device`** do not introduce
+data races. The member functions `arrive` and `arrive_and_drop`, and the non-member
+function **`cuda::device::barrier_arrive_tx`**, execute atomically.
 
 ## Implementation-Defined Behavior
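
Reviewer note: the four phase steps added in this diff can be read as a small state machine. The sketch below is a toy, single-threaded host-side model of that state machine, written only to make the counting rules concrete. It is not the libcu++ implementation and does not call any CUDA API. The names `TxBarrierModel`, `arrive_tx`, and `complete_tx` are hypothetical, each arrival is assumed to decrement the expected count by exactly one, and the model completes the phase eagerly as soon as both counts reach zero, whereas the text lets the completion step run during a later `arrive`, `arrive_and_drop`, or `wait`.

```cpp
#include <cstddef>

// Toy model of the phase steps described in the diff (hypothetical names).
struct TxBarrierModel {
    std::ptrdiff_t expected;      // constructor argument, adjusted by drops
    std::ptrdiff_t pending;       // arrivals still outstanding in this phase
    std::ptrdiff_t tx_count = 0;  // outstanding transaction units
    std::size_t phase = 0;        // index of the current phase

    explicit TxBarrierModel(std::ptrdiff_t n) : expected(n), pending(n) {}

    // Step 1: each arrival decrements the pending (expected) count.
    void arrive() { --pending; try_complete(); }

    // Step 1, plus a permanent reduction of the count used for later phases.
    void arrive_and_drop() { --expected; --pending; try_complete(); }

    // Steps 1 and 2: arrive and announce tx units that will complete later.
    void arrive_tx(std::ptrdiff_t tx) { tx_count += tx; --pending; try_complete(); }

    // Step 2: an asynchronous operation finished and retires its tx units.
    void complete_tx(std::ptrdiff_t tx) { tx_count -= tx; try_complete(); }

private:
    // Steps 3 and 4: complete once both counts are zero, then reset.
    void try_complete() {
        if (pending == 0 && tx_count == 0) {
            ++phase;
            pending = expected;  // reset to the (possibly dropped) expected count
            tx_count = 0;        // already zero here; kept to mirror step 4
        }
    }
};
```

In this model, a barrier constructed with an expected count of 2 stays in phase 0 after one `arrive_tx(64)` and one `arrive()`, because 64 transaction units are still outstanding. It only moves to phase 1 once `complete_tx(64)` runs.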