-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement cub::DeviceFind::FindIf
#2405
base: main
Are you sure you want to change the base?
Conversation
|
I would love to see a benchmark comparison of
Do I understand correctly, that |
cub::DeviceFind::FindIf
cub::DeviceFind::FindIf
The name For your benchmark you could name this value |
@bernhardmgruber |
🟨 CI finished in 1h 52m: Pass: 97%/259 | Total: 1d 08h | Avg: 7m 28s | Max: 35m 24s | Hits: 99%/20079
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental | |
pycuda | |
CUDA C Core Library |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | pycuda |
+/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 259)
# | Runner |
---|---|
186 | linux-amd64-cpu16 |
42 | linux-amd64-gpu-v100-latest-1 |
16 | linux-arm64-cpu16 |
15 | windows-amd64-cpu16 |
cub/cub/device/device_find_if.cuh
Outdated
for (int i = 0; i < elements_per_thread; ++i) | ||
{ | ||
auto index = tile_offset + threadIdx.x + i * blockDim.x; | ||
|
||
if (index < num_items) | ||
{ | ||
// early exit | ||
if (sresult < index) | ||
{ | ||
return; | ||
} | ||
|
||
if (pred(*(begin + index))) | ||
{ | ||
atomicMin(result, index); | ||
return; | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: a few observations about this code:
- when one thread finds an element satisfying the predicate, it updates the atomic and exits while the remaining CTA will have to read the atomic again on the next iteration
- if every thread finds a predicate, we'll have at most the number of atomics matching the thread block size
Instead, you could experiment with an approach where threads first issue atomic min for shared memory, then outside of the loop over elements per thread whole block synchronizes, and if a minimal index
is less than int max, main thread issues global atomic min and the whole thread block exists.
This could make an iteration over a single tile a bit more expensive for the case when no predicate where found, but on the other hand we'll address (1) because now remaining threads will see that they have to exit without accessing global memory, and (2) because now at most one thread of a given CTA issues an atomic.
cub/cub/device/device_find_if.cuh
Outdated
if (sresult < index) | ||
{ | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: I'd hoist this before elements per thread loop. The invariant here is that sresult
is only updated before elements loop. If sresult
was smaller than tile offset, it'll keep being smaller for any tile_offset + i
. If it was larger, it'll be larger for any tile_offset + i
, so there should be no reason to read from shared memory on every iteration.
79569ad
to
ce3b44c
Compare
ce3b44c
to
6f9fba8
Compare
🟨 CI finished in 4h 42m: Pass: 94%/259 | Total: 5d 01h | Avg: 28m 13s | Max: 1h 39m | Hits: 99%/20079
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental | |
pycuda | |
CUDA C Core Library |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | pycuda |
+/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 259)
# | Runner |
---|---|
186 | linux-amd64-cpu16 |
42 | linux-amd64-gpu-v100-latest-1 |
16 | linux-arm64-cpu16 |
15 | windows-amd64-cpu16 |
|
||
#include <cub/config.cuh> | ||
|
||
#include "device_launch_parameters.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should come after the other includes I assume?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah still a silly draft, but why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because any header other than the config one should come after the system header guard so we do not accidentally emit warnings to users
|
||
template <typename IterBegin, typename IterEnd, typename Pred> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing some documentation here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is super early draft on the performance analysis stage still. Until we make sure that this is the right implementation to go (not sure yet) I won't be adding any docs - unless it makes it easier for the crowd to understand the new algo - but I will add a detailed explanation on the description for that.
{ | ||
auto index = tile_offset + threadIdx.x + i * blockDim.x; | ||
|
||
if (index < num_items) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider turning this into an early return
if (index < num_items) | |
if (index >= num_items) { | |
continue; // maybe even break? | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole loob body is within the if
so an early return would make tings clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok ok got it. I had to look it with a fresh eye and some morning coffee to get it.
f2dc70c
to
612a9c2
Compare
🟨 CI finished in 8h 06m: Pass: 94%/259 | Total: 5d 02h | Avg: 28m 19s | Max: 1h 34m | Hits: 99%/20079
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental | |
pycuda | |
CUDA C Core Library |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | pycuda |
+/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 259)
# | Runner |
---|---|
186 | linux-amd64-cpu16 |
42 | linux-amd64-gpu-v100-latest-1 |
16 | linux-arm64-cpu16 |
15 | windows-amd64-cpu16 |
bool found = false; | ||
for (int i = 0; i < elements_per_thread; ++i) | ||
{ | ||
auto index = tile_offset + threadIdx.x + i * blockDim.x; | ||
int index = i % WORDS + (i / WORDS) * block_threads * WORDS + threadIdx.x * WORDS + tile_offset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expression for the evaluation of this index can of course be simplified further. It is as is for educational purposes.
🟨 CI finished in 2h 35m: Pass: 97%/259 | Total: 5d 01h | Avg: 28m 11s | Max: 1h 41m | Hits: 99%/20079
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
pycuda | |
CUDA C Core Library |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
+/- | pycuda |
+/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 259)
# | Runner |
---|---|
186 | linux-amd64-cpu16 |
42 | linux-amd64-gpu-v100-latest-1 |
16 | linux-arm64-cpu16 |
15 | windows-amd64-cpu16 |
This is a draft to track the work progress on
cub::DeviceFind::FindIf
which should ultimately be used to improvethrust::all_of
.The algorithm is still a prototype until we make sure that it performs better than reduce. Initial bench results are encouraging. Once finalized, it will be modeled in the CUB coding standards.