-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor cuda::ceil_div
to take two different types
#2376
Conversation
2ba30db
to
6df219e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, thx a lot!
the code is pretty inefficient on GPU. Should we create another PR for that? |
@miscco and @gonzalobg spent some time inspecting SASS while coming up with the current version. I am curious about your suggestion to improve it :) Be careful not overflow, so the classic |
this was the solution I proposed a while ago template<typename T>
//HOST_DEVICE_NODISCARD
constexpr T ceil_div(T value, T div) {
//ASSERT_OR_ASSUME(is_zero_or_positive(value))
//ASSERT_OR_ASSUME(div > 0)
using U = ::cuda::std::__make_unsigned_t<T>;
auto value1 = static_cast<U>(value);
auto div1 = static_cast<U>(div);
auto ret1 = ::cuda::std::is_unsigned<T>::value ? (value1 / div1) + (value1 % div1 > 0)
: (value1 + div1 - 1) / div1; // faster
auto ret = static_cast<T>(ret1);
//ASSERT_OR_ASSUME(ret >= value / div)
return ret;
} Performance notes:
|
I think Let's get this PR in and then please propose the improved version! Thx! |
I believe the optimization from @fbusato points to a flaw in the API, which is what to do with negative values. Right now, we handle negative integer values by rounding up towards zero, which is actually incorrect. The question I have is whether we want to restrict the API towards positive values |
0dfbab1
to
d902a11
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle, looks good. I cannot assess though whether the new version for signed integers (a + b - 1) / b
is strictly better.
🟨 CI finished in 4h 34m: Pass: 96%/417 | Total: 8d 10h | Avg: 29m 10s | Max: 1h 17m | Hits: 32%/38811
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
pycuda | |
CUDA C Core Library |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | pycuda |
+/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 417)
# | Runner |
---|---|
304 | linux-amd64-cpu16 |
62 | linux-amd64-gpu-v100-latest-1 |
28 | linux-arm64-cpu16 |
23 | windows-amd64-cpu16 |
d902a11
to
edca157
Compare
@miscco please keep this function working only with non-negative numbers. Ceiling division with negative numbers is extremely rare and prohibits optimizations (fast path) |
Yeah I somehow overlooked that we in fact do assert that |
🟨 CI finished in 11h 35m: Pass: 99%/433 | Total: 8d 21h | Avg: 29m 38s | Max: 1h 18m | Hits: 20%/38228
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
pycuda | |
CUDA C Core Library |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | pycuda |
+/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 433)
# | Runner |
---|---|
320 | linux-amd64-cpu16 |
62 | linux-amd64-gpu-v100-latest-1 |
28 | linux-arm64-cpu16 |
23 | windows-amd64-cpu16 |
This replaces `cub::DivideAndRoundUp`
039ed72
to
2c29f0f
Compare
🟨 CI finished in 7h 30m: Pass: 97%/433 | Total: 12d 19h | Avg: 42m 33s | Max: 1h 52m | Hits: 31%/41653
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
pycuda | |
CUDA C Core Library |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | pycuda |
+/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 433)
# | Runner |
---|---|
320 | linux-amd64-cpu16 |
62 | linux-amd64-gpu-v100-latest-1 |
28 | linux-arm64-cpu16 |
23 | windows-amd64-cpu16 |
🟨 CI finished in 1d 09h: Pass: 97%/433 | Total: 8d 17h | Avg: 29m 03s | Max: 1h 42m | Hits: 52%/41653
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
pycuda | |
CUDA C Core Library |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | pycuda |
+/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 433)
# | Runner |
---|---|
320 | linux-amd64-cpu16 |
62 | linux-amd64-gpu-v100-latest-1 |
28 | linux-arm64-cpu16 |
23 | windows-amd64-cpu16 |
🟨 CI finished in 5h 15m: Pass: 99%/433 | Total: 5d 00h | Avg: 16m 44s | Max: 1h 57m | Hits: 84%/41657
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
pycuda | |
CUDA C Core Library |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | pycuda |
+/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 433)
# | Runner |
---|---|
320 | linux-amd64-cpu16 |
62 | linux-amd64-gpu-v100-latest-1 |
28 | linux-arm64-cpu16 |
23 | windows-amd64-cpu16 |
🟩 CI finished in 7h 47m: Pass: 100%/433 | Total: 4d 23h | Avg: 16m 35s | Max: 1h 57m | Hits: 84%/41657
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
CUDA Experimental | |
pycuda | |
CUDA C Core Library |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | pycuda |
+/- | CUDA C Core Library |
🏃 Runner counts (total jobs: 433)
# | Runner |
---|---|
320 | linux-amd64-cpu16 |
62 | linux-amd64-gpu-v100-latest-1 |
28 | linux-arm64-cpu16 |
23 | windows-amd64-cpu16 |
We already use a similar function in cub.
Deprecate that and replace it with
cuda::ceil_div