Prune CUB's ChainedPolicy by __CUDA_ARCH_LIST__ #2154
base: main

Conversation
Force-pushed: a2d8c87 → dfbb1ae
Force-pushed: dfbb1ae → 7f36941
Force-pushed: 71a0f9a → 04e42c9
The unit tests now list all virtual architectures, since the list was shorter than I expected.
Force-pushed: 04e42c9 → 4809a7e
Force-pushed: 4809a7e → c5f9680
I reworked the feature to now only ever instantiate the PTX versions that appear in
Force-pushed: f6522ab → 75d0f10
🟨 CI finished in 6h 48m: Pass: 80%/250 | Total: 6d 04h | Avg: 35m 38s | Max: 1h 27m | Hits: 64%/17277
Modified projects: CUB. Projects with modified dependencies: CUB, Thrust, pycuda.
🏃 Runner counts (total jobs: 250): 178 linux-amd64-cpu16, 41 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16.
🟨 CI finished in 1d 15h: Pass: 98%/250 | Total: 4d 21h | Avg: 28m 17s | Max: 1h 11m | Hits: 64%/17277
Modified projects: CUB. Projects with modified dependencies: CUB, Thrust, pycuda.
🏃 Runner counts (total jobs: 250): 178 linux-amd64-cpu16, 41 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16.
Thanks for the great work, I love the idea! 💚
Since the mechanism is at the core of CUB, I want to make sure all works as expected. I left a few comments that I hope will further improve test coverage.
Force-pushed: 9c627c6 → 8766281
🟨 CI finished in 9h 27m: Pass: 83%/250 | Total: 1d 23h | Avg: 11m 19s | Max: 48m 33s | Hits: 98%/16565
Modified projects: CUB. Projects with modified dependencies: CUB, Thrust, pycuda.
🏃 Runner counts (total jobs: 250): 178 linux-amd64-cpu16, 41 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16.
Force-pushed: bd97bb1 → bedf081
🟩 CI finished in 1d 00h: Pass: 100%/250 | Total: 6d 00h | Avg: 34m 43s | Max: 1h 26m | Hits: 64%/17355
Modified projects: CUB, Thrust. Projects with modified dependencies: CUB, Thrust, pycuda.
🏃 Runner counts (total jobs: 250): 178 linux-amd64-cpu16, 41 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16.
@gevtushenko: @elstehle said he wouldn't want to merge without your approval, so we are waiting for your approval to merge this PR.
🟨 CI finished in 12h 04m: Pass: 99%/250 | Total: 5d 23h | Avg: 34m 21s | Max: 1h 11m | Hits: 64%/17373
Modified projects: CUB, Thrust. Projects with modified dependencies: CUB, Thrust, pycuda.
🏃 Runner counts (total jobs: 250): 178 linux-amd64-cpu16, 41 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16.
Force-pushed: 3aaad6e → df56483
Force-pushed: 44e810f → 5b3ef92
🟨 CI finished in 4h 01m: Pass: 98%/250 | Total: 5d 21h | Avg: 33m 57s | Max: 1h 17m | Hits: 65%/16657
Modified projects: CUB, Thrust. Projects with modified dependencies: CUB, Thrust, pycuda.
🏃 Runner counts (total jobs: 250): 178 linux-amd64-cpu16, 41 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16.
Force-pushed: 5b3ef92 → 524f81c
🟨 CI finished in 7h 42m: Pass: 99%/251 | Total: 5d 23h | Avg: 34m 12s | Max: 1h 05m | Hits: 64%/17373
Modified projects: CUB, Thrust. Projects with modified dependencies: CUB, Thrust, pycuda.
🏃 Runner counts (total jobs: 251): 178 linux-amd64-cpu16, 42 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16.
Co-authored-by: Elias Stehle <[email protected]>
```
/home/coder/cccl/thrust/thrust/cmake/../../thrust/iterator/detail/transform_input_output_iterator.inl:68:9: error: writing 1 byte into a region of size 0 [-Werror=stringop-overflow=]
   68 |     *io = output_function(x);
      |     ~~~~^~~~~~~~~~~~~~~~~~~~~
```
Force-pushed: 524f81c → c2b6385
🟨 CI finished in 4h 12m: Pass: 99%/251 | Total: 3d 15h | Avg: 21m 01s | Max: 1h 40m | Hits: 72%/17373
Modified projects: CUB, Thrust. Projects with modified dependencies: CUB, Thrust, pycuda.
🏃 Runner counts (total jobs: 251): 178 linux-amd64-cpu16, 42 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16.
🟨 CI finished in 4h 49m: Pass: 99%/251 | Total: 3d 16h | Avg: 21m 03s | Max: 1h 40m | Hits: 72%/17373
Modified projects: CUB, Thrust. Projects with modified dependencies: CUB, Thrust, pycuda.
🏃 Runner counts (total jobs: 251): 178 linux-amd64-cpu16, 42 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16.
🟩 CI finished in 5h 24m: Pass: 100%/251 | Total: 3d 16h | Avg: 21m 08s | Max: 1h 40m | Hits: 72%/17373
Modified projects: CUB, Thrust. Projects with modified dependencies: CUB, Thrust, pycuda.
🏃 Runner counts (total jobs: 251): 178 linux-amd64-cpu16, 42 linux-amd64-gpu-v100-latest-1, 16 linux-arm64-cpu16, 15 windows-amd64-cpu16.
@gevtushenko ping.

Per @elstehle's request, I tested this successfully with libcudf 24.10 without our current scan-tuning patch, and it worked well for us.

@gevtushenko ping.
```cpp
    // we instantiate invoke_static for each CudaArches, but only call the one matching device_ptx_version
    cudaError_t e = cudaSuccess;
    const cudaError_t dummy[] = {
      (device_ptx_version == CudaArches ? (e = invoke_static<CudaArches>(op, ::cuda::std::true_type{}))
```
suggestion: this approach works because `device_ptx_version` is not the CC of the current device but rather the PTX version closest to the current device's CC from the set of PTX versions that `cub::EmptyKernel` was compiled for. When the magic namespace is enabled, this property guarantees that `device_ptx_version` is one of the `CudaArches`, because `cub::EmptyKernel` was compiled for `CudaArches`. At some point we discussed switching to querying the CC of the current device directly instead of using the empty kernel. This change would be one of the reasons not to do that, because then `device_ptx_version` could be outside of `CudaArches`, leaving the algorithm not executed when someone compiled for, say, Ampere but tried running the code on Ada. I'd suggest adding a note somewhere on `PtxVersion` saying that we should always query the CC from the empty kernel for that reason.
Here was the issue for the alternative approach: #898
```cpp
  CUB_RUNTIME_FUNCTION _CCCL_FORCEINLINE static cudaError_t runtime_to_compiletime(int device_ptx_version, FunctorT& op)
  {
    // we instantiate invoke_static for each CudaArches, but only call the one matching device_ptx_version
    cudaError_t e = cudaSuccess;
```
important: there are a few potential situations that could lead to `device_ptx_version` being outside of `CudaArches`. For instance, the change to the `PtxVersion` function I described below, or disabled namespace magic. Such a situation would lead to this function returning `cudaSuccess` while not invoking any algorithm. Given that it'd be a corrupted use case that we want to report to the user, I'd prefer having something other than `cudaSuccess` as the default value here.
```diff
@@ -635,18 +636,79 @@ struct ChainedPolicy
   template <typename FunctorT>
   CUB_RUNTIME_FUNCTION _CCCL_FORCEINLINE static cudaError_t Invoke(int device_ptx_version, FunctorT& op)
   {
+    // __CUDA_ARCH_LIST__ is only available from CTK 11.5 onwards
+#ifdef __CUDA_ARCH_LIST__
```
suggestion: as described below, this is safe only when we have namespace magic. Although this leads to UB when linking TUs compiled for different architecture sets, the old chained policy would at least find a policy to get executed; the new chained policy would not. So, I'd prefer something along the lines of replacing `#ifdef __CUDA_ARCH_LIST__` with `#if defined(__CUDA_ARCH_LIST__) && !defined(CUB_DISABLE_NAMESPACE_MAGIC)`. But RAPIDS is going to disable namespace magic, so a change like that would make this PR less useful. I guess if we change `cudaError_t e = cudaSuccess;` to return an actual error as suggested below, we'll at least catch the problem at runtime. I'd recommend adding a note close to the error we are going to return when no arch from `CudaArches` matches the device PTX version, suggesting users re-enable namespace magic or wrap the namespace.
Motivated by @gevtushenko and @elstehle explaining to me why CUB instantiates so many kernels and why having many tuning policies is bad, here is a mitigation: when the macro `__CUDA_ARCH_LIST__` is available, we know at compile time what runtime values the PTX version can have, so we can prune the dispatches CUB generates from the tuning policies to only those versions. This should give us faster compilation and allow us to use tuning policies more liberally.

Compile time and binary size of `cub.example.device.radix_sort` before and after: