Implement cub::DeviceFind::FindIf #2405

Draft
wants to merge 8 commits into
base: main

gonidelis
Member

This is a draft to track progress on cub::DeviceFind::FindIf, which should ultimately be used to improve thrust::all_of.

The algorithm is still a prototype until we make sure that it performs better than reduce. Initial bench results are encouraging. Once finalized, it will be reworked to follow the CUB coding standards.

@gonidelis
Member Author

gonidelis commented Sep 11, 2024

# Benchmark Results

## cub::FindIf

### [0] NVIDIA H200

|  T  |     Elements     | RelativeMismatchPosition | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |
|-----|------------------|--------------------------|---------|------------|--------|------------|-------|
| I32 |     2^16 = 65536 |                        1 |  22928x |  26.790 us | 22.92% |  21.814 us | 2.06% |
| I32 |   2^20 = 1048576 |                        1 |  19632x |  31.313 us | 23.28% |  25.470 us | 3.40% |
| I32 |  2^24 = 16777216 |                        1 |  10800x |  51.443 us | 11.07% |  46.351 us | 1.28% |
| I32 | 2^28 = 268435456 |                        1 |   1074x | 471.441 us |  1.32% | 465.567 us | 0.36% |
| I32 |     2^16 = 65536 |                      0.5 |  23024x |  26.862 us | 23.81% |  21.717 us | 2.03% |
| I32 |   2^20 = 1048576 |                      0.5 |  19792x |  31.141 us | 23.58% |  25.273 us | 3.86% |
| I32 |  2^24 = 16777216 |                      0.5 |  17152x |  34.301 us | 17.73% |  29.160 us | 1.59% |
| I32 | 2^28 = 268435456 |                      0.5 |   2064x | 247.413 us |  2.15% | 242.283 us | 0.35% |
| I32 |     2^16 = 65536 |                        0 |  54416x |  14.286 us | 55.68% |   9.189 us | 3.01% |
| I32 |   2^20 = 1048576 |                        0 |  52288x |  14.649 us | 53.34% |   9.565 us | 3.21% |
| I32 |  2^24 = 16777216 |                        0 |  33568x |  19.990 us | 34.29% |  14.901 us | 1.96% |
| I32 | 2^28 = 268435456 |                        0 |  33632x |  20.006 us | 34.67% |  14.874 us | 2.54% |

## thrust::count_if

### [0] NVIDIA H200

|  T  |     Elements     | RelativeMismatchPosition | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |
|-----|------------------|--------------------------|---------|------------|--------|------------|-------|
| I32 |     2^16 = 65536 |                        1 |  24528x |  24.829 us | 22.00% |  20.390 us | 2.94% |
| I32 |   2^20 = 1048576 |                        1 |  22480x |  26.721 us | 20.33% |  22.244 us | 2.62% |
| I32 |  2^24 = 16777216 |                        1 |  11856x |  46.720 us | 10.84% |  42.222 us | 1.89% |
| I32 | 2^28 = 268435456 |                        1 |   1855x | 274.137 us |  1.77% | 269.582 us | 0.49% |
| I32 |     2^16 = 65536 |                      0.5 |  24176x |  25.121 us | 21.73% |  20.685 us | 3.23% |
| I32 |   2^20 = 1048576 |                      0.5 |  22048x |  27.172 us | 19.97% |  22.689 us | 2.65% |
| I32 |  2^24 = 16777216 |                      0.5 |  11744x |  47.062 us | 10.71% |  42.587 us | 1.97% |
| I32 | 2^28 = 268435456 |                      0.5 |   1859x | 273.367 us |  1.69% | 268.966 us | 0.41% |
| I32 |     2^16 = 65536 |                        0 |  24160x |  25.209 us | 22.02% |  20.707 us | 3.19% |
| I32 |   2^20 = 1048576 |                        0 |  22144x |  27.092 us | 20.12% |  22.594 us | 2.75% |
| I32 |  2^24 = 16777216 |                        0 |  11744x |  47.111 us | 10.85% |  42.584 us | 2.02% |
| I32 | 2^28 = 268435456 |                        0 |   1862x | 273.046 us |  1.70% | 268.609 us | 0.38% |

@bernhardmgruber
Contributor

bernhardmgruber commented Sep 11, 2024

> until we make sure that it performs better than reduce.

I would love to see a benchmark comparison of thrust::all_of before and after your improvements at some point!

> Initial bench results are encouraging.

Do I understand correctly that cub::FindIf should outperform thrust::count_if? I am seeing the opposite in some of the numbers you posted.

@jrhemstad jrhemstad changed the title from "Imlement cub::DeviceFind::FindIf" to "Implement cub::DeviceFind::FindIf" Sep 11, 2024
@bernhardmgruber
Contributor

The name CommonPrefixRatio comes from the old thrust::equal benchmark, where it denoted the fraction of equal elements at the beginning of both ranges. A CommonPrefixRatio of 1 meant all elements were the same in both ranges. A value of 0.5 meant that the first 50% of the range was equal, after which the mismatch occurs.

For your benchmark you could name this value RelativeMismatchPosition, or MismatchAt, or something like that. It's about where you expect the mismatch to happen.

@gonidelis
Member Author

@bernhardmgruber Because of the early exit, cub::FindIf is expected to perform better than thrust::count_if at least in the middle cases, RelativeMismatchPosition = 0.5 (thanks for the name suggestion). Along the input-size axis, this is true for small and medium-sized inputs, but not for large inputs.
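
To make the early-exit argument concrete, here is a minimal host-side C++ sketch that counts predicate evaluations for a find-style scan versus a count-style scan. It is not the CUB implementation: the helpers `make_input`, `find_if_evals`, and `count_if_evals` and the `mismatch_at` parameter are hypothetical names chosen for illustration, and a sequential loop only approximates the work a parallel kernel does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: builds n zeros with a single 1 placed at the relative
// position `mismatch_at` in [0, 1]. A position of 1 means "no match at all".
std::vector<int> make_input(std::size_t n, double mismatch_at)
{
  assert(mismatch_at >= 0.0 && mismatch_at <= 1.0);
  std::vector<int> data(n, 0);
  const auto pos = static_cast<std::size_t>(mismatch_at * static_cast<double>(n));
  if (pos < n)
  {
    data[pos] = 1;
  }
  return data;
}

// Find-style scan: stops at the first element equal to 1.
// Returns the number of predicate evaluations performed.
std::size_t find_if_evals(const std::vector<int>& data)
{
  std::size_t evals = 0;
  for (int value : data)
  {
    ++evals;
    if (value == 1)
    {
      break; // early exit
    }
  }
  return evals;
}

// Count-style scan: has to visit every element, wherever the match is.
// (A real count_if would also accumulate the number of matches.)
std::size_t count_if_evals(const std::vector<int>& data)
{
  std::size_t evals = 0;
  for (int value : data)
  {
    (void) value;
    ++evals;
  }
  return evals;
}
```

In this model a match at relative position 0.5 lets the find-style scan stop after roughly half the elements, while a position of 1 forces both scans over the full range, so any extra bookkeeping needed for the early exit would be pure overhead in that case.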

Contributor

🟨 CI finished in 1h 52m: Pass: 97%/259 | Total: 1d 08h | Avg: 7m 28s | Max: 35m 24s | Hits: 99%/20079
  • 🟨 cub: Pass: 95%/136 | Total: 19h 26m | Avg: 8m 34s | Max: 35m 24s

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  95%/128 | Total: 18h 52m | Avg:  8m 50s | Max: 35m 24s
      🟩 arm64              Pass: 100%/8   | Total: 33m 44s | Avg:  4m 13s | Max:  4m 33s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  7m 19s | Avg:  3m 39s | Max:  3m 49s
      🔍 nvcc               Pass:  95%/134 | Total: 19h 19m | Avg:  8m 39s | Max: 35m 24s
    🚨 cxx_family: MSVC 🚨
      🟩 Clang              Pass: 100%/63  | Total:  9h 04m | Avg:  8m 38s | Max: 30m 14s
      🟩 GCC                Pass: 100%/64  | Total:  9h 07m | Avg:  8m 33s | Max: 35m 24s
      🟩 Intel              Pass: 100%/3   | Total: 16m 34s | Avg:  5m 31s | Max:  5m 42s
      🔥 MSVC               Pass:   0%/6   | Total: 57m 26s | Avg:  9m 34s | Max: 10m 42s
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  94%/103 | Total:  8h 05m | Avg:  4m 42s | Max: 10m 42s
      🟩 DeviceLaunch       Pass: 100%/8   | Total:  2h 28m | Avg: 18m 35s | Max: 19m 52s
      🟩 GraphCapture       Pass: 100%/8   | Total:  2h 07m | Avg: 15m 54s | Max: 19m 10s
      🟩 HostLaunch         Pass: 100%/8   | Total:  2h 28m | Avg: 18m 32s | Max: 22m 38s
      🟩 SmallGMem          Pass: 100%/1   | Total: 35m 24s | Avg: 35m 24s | Max: 35m 24s
      🟩 TestGPU            Pass: 100%/8   | Total:  3h 41m | Avg: 27m 38s | Max: 30m 14s
    🟨 ctk
      🟨 11.1               Pass:  93%/15  | Total:  1h 01m | Avg:  4m 04s | Max: 10m 42s
      🟩 11.8               Pass: 100%/3   | Total: 13m 50s | Avg:  4m 36s | Max:  4m 57s
      🟨 12.6               Pass:  95%/118 | Total: 18h 11m | Avg:  9m 14s | Max: 35m 24s
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  7m 19s | Avg:  3m 39s | Max:  3m 49s
      🟨 nvcc11.1           Pass:  93%/15  | Total:  1h 01m | Avg:  4m 04s | Max: 10m 42s
      🟩 nvcc11.8           Pass: 100%/3   | Total: 13m 50s | Avg:  4m 36s | Max:  4m 57s
      🟨 nvcc12.6           Pass:  95%/116 | Total: 18h 04m | Avg:  9m 20s | Max: 35m 24s
    🟨 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 26m 22s | Avg:  4m 23s | Max:  5m 25s
      🟩 Clang10            Pass: 100%/3   | Total: 15m 38s | Avg:  5m 12s | Max:  5m 21s
      🟩 Clang11            Pass: 100%/4   | Total: 17m 44s | Avg:  4m 26s | Max:  4m 45s
      🟩 Clang12            Pass: 100%/4   | Total: 18m 19s | Avg:  4m 34s | Max:  4m 43s
      🟩 Clang13            Pass: 100%/4   | Total: 18m 26s | Avg:  4m 36s | Max:  4m 46s
      🟩 Clang14            Pass: 100%/4   | Total: 17m 47s | Avg:  4m 26s | Max:  4m 39s
      🟩 Clang15            Pass: 100%/4   | Total: 18m 36s | Avg:  4m 39s | Max:  4m 53s
      🟩 Clang16            Pass: 100%/4   | Total: 19m 03s | Avg:  4m 45s | Max:  5m 13s
      🟩 Clang17            Pass: 100%/4   | Total: 18m 14s | Avg:  4m 33s | Max:  4m 43s
      🟩 Clang18            Pass: 100%/26  | Total:  6h 14m | Avg: 14m 24s | Max: 30m 14s
      🟩 GCC6               Pass: 100%/2   | Total:  6m 57s | Avg:  3m 28s | Max:  3m 31s
      🟩 GCC7               Pass: 100%/6   | Total: 23m 23s | Avg:  3m 53s | Max:  4m 31s
      🟩 GCC8               Pass: 100%/6   | Total: 23m 30s | Avg:  3m 55s | Max:  4m 27s
      🟩 GCC9               Pass: 100%/6   | Total: 24m 35s | Avg:  4m 05s | Max:  4m 32s
      🟩 GCC10              Pass: 100%/4   | Total: 19m 01s | Avg:  4m 45s | Max:  5m 07s
      🟩 GCC11              Pass: 100%/7   | Total: 32m 33s | Avg:  4m 39s | Max:  4m 57s
      🟩 GCC12              Pass: 100%/4   | Total: 18m 46s | Avg:  4m 41s | Max:  4m 54s
      🟩 GCC13              Pass: 100%/29  | Total:  6h 38m | Avg: 13m 45s | Max: 35m 24s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 16m 34s | Avg:  5m 31s | Max:  5m 42s
      🟥 MSVC14.16          Pass:   0%/1   | Total: 10m 42s | Avg: 10m 42s | Max: 10m 42s
      🟥 MSVC14.29          Pass:   0%/2   | Total: 18m 33s | Avg:  9m 16s | Max:  9m 24s
      🟥 MSVC14.39          Pass:   0%/3   | Total: 28m 11s | Avg:  9m 23s | Max:  9m 39s
    🟨 std
      🟩 11                 Pass: 100%/35  | Total:  4h 34m | Avg:  7m 50s | Max: 30m 14s
      🟨 14                 Pass:  92%/38  | Total:  5h 08m | Avg:  8m 07s | Max: 26m 43s
      🟨 17                 Pass:  94%/38  | Total:  5h 38m | Avg:  8m 54s | Max: 35m 24s
      🟨 20                 Pass:  96%/25  | Total:  4h 05m | Avg:  9m 48s | Max: 28m 21s
    🟨 gpu
      🟨 v100               Pass:  95%/136 | Total: 19h 26m | Avg:  8m 34s | Max: 35m 24s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 13m 50s | Avg:  4m 36s | Max:  4m 57s
      🟩 90a                Pass: 100%/4   | Total: 14m 45s | Avg:  3m 41s | Max:  3m 49s
    
  • 🟩 thrust: Pass: 100%/122 | Total: 12h 35m | Avg: 6m 11s | Max: 21m 04s | Hits: 99%/20079

    🟩 cpu
      🟩 amd64              Pass: 100%/114 | Total: 12h 00m | Avg:  6m 19s | Max: 21m 04s | Hits:  99%/20079 
      🟩 arm64              Pass: 100%/8   | Total: 35m 04s | Avg:  4m 23s | Max:  5m 14s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 08m | Avg:  4m 35s | Max: 16m 04s | Hits:  99%/2231  
      🟩 11.8               Pass: 100%/3   | Total: 13m 21s | Avg:  4m 27s | Max:  4m 36s
      🟩 12.6               Pass: 100%/104 | Total: 11h 13m | Avg:  6m 28s | Max: 21m 04s | Hits:  99%/17848 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  8m 38s | Avg:  4m 19s | Max:  4m 23s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 08m | Avg:  4m 35s | Max: 16m 04s | Hits:  99%/2231  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 13m 21s | Avg:  4m 27s | Max:  4m 36s
      🟩 nvcc12.6           Pass: 100%/102 | Total: 11h 04m | Avg:  6m 30s | Max: 21m 04s | Hits:  99%/17848 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  8m 38s | Avg:  4m 19s | Max:  4m 23s
      🟩 nvcc               Pass: 100%/120 | Total: 12h 26m | Avg:  6m 13s | Max: 21m 04s | Hits:  99%/20079 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 29m 37s | Avg:  4m 56s | Max:  6m 26s
      🟩 Clang10            Pass: 100%/3   | Total: 17m 33s | Avg:  5m 51s | Max:  6m 18s
      🟩 Clang11            Pass: 100%/4   | Total: 18m 32s | Avg:  4m 38s | Max:  5m 00s
      🟩 Clang12            Pass: 100%/4   | Total: 18m 43s | Avg:  4m 40s | Max:  5m 00s
      🟩 Clang13            Pass: 100%/4   | Total: 19m 25s | Avg:  4m 51s | Max:  5m 11s
      🟩 Clang14            Pass: 100%/4   | Total: 18m 48s | Avg:  4m 42s | Max:  4m 49s
      🟩 Clang15            Pass: 100%/4   | Total: 19m 15s | Avg:  4m 48s | Max:  5m 06s
      🟩 Clang16            Pass: 100%/4   | Total: 18m 36s | Avg:  4m 39s | Max:  4m 49s
      🟩 Clang17            Pass: 100%/4   | Total: 18m 12s | Avg:  4m 33s | Max:  4m 56s
      🟩 Clang18            Pass: 100%/18  | Total:  2h 00m | Avg:  6m 41s | Max: 14m 53s
      🟩 GCC6               Pass: 100%/2   | Total:  6m 57s | Avg:  3m 28s | Max:  3m 31s
      🟩 GCC7               Pass: 100%/6   | Total: 23m 55s | Avg:  3m 59s | Max:  4m 49s
      🟩 GCC8               Pass: 100%/6   | Total: 24m 19s | Avg:  4m 03s | Max:  4m 33s
      🟩 GCC9               Pass: 100%/6   | Total: 24m 58s | Avg:  4m 09s | Max:  4m 54s
      🟩 GCC10              Pass: 100%/4   | Total: 18m 08s | Avg:  4m 32s | Max:  4m 41s
      🟩 GCC11              Pass: 100%/7   | Total: 31m 38s | Avg:  4m 31s | Max:  4m 40s
      🟩 GCC12              Pass: 100%/4   | Total: 19m 06s | Avg:  4m 46s | Max:  5m 03s
      🟩 GCC13              Pass: 100%/20  | Total:  2h 19m | Avg:  6m 59s | Max: 15m 30s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 16m 57s | Avg:  5m 39s | Max:  5m 54s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 16m 04s | Avg: 16m 04s | Max: 16m 04s | Hits:  99%/2231  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 27m 01s | Avg: 13m 30s | Max: 13m 39s | Hits:  99%/4462  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  1h 47m | Avg: 17m 53s | Max: 21m 04s | Hits:  99%/13386 
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total:  4h 59m | Avg:  5m 26s | Max: 14m 53s
      🟩 GCC                Pass: 100%/55  | Total:  4h 48m | Avg:  5m 15s | Max: 15m 30s
      🟩 Intel              Pass: 100%/3   | Total: 16m 57s | Avg:  5m 39s | Max:  5m 54s
      🟩 MSVC               Pass: 100%/9   | Total:  2h 30m | Avg: 16m 42s | Max: 21m 04s | Hits:  99%/20079 
    🟩 gpu
      🟩 v100               Pass: 100%/122 | Total: 12h 35m | Avg:  6m 11s | Max: 21m 04s | Hits:  99%/20079 
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  8h 50m | Avg:  5m 09s | Max: 16m 11s | Hits:  99%/13386 
      🟩 TestCPU            Pass: 100%/11  | Total:  1h 58m | Avg: 10m 46s | Max: 21m 04s | Hits:  99%/6693  
      🟩 TestGPU            Pass: 100%/8   | Total:  1h 46m | Avg: 13m 15s | Max: 15m 30s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 13m 21s | Avg:  4m 27s | Max:  4m 36s
      🟩 90a                Pass: 100%/4   | Total: 15m 53s | Avg:  3m 58s | Max:  4m 20s
    🟩 std
      🟩 11                 Pass: 100%/31  | Total:  2h 33m | Avg:  4m 57s | Max: 15m 30s
      🟩 14                 Pass: 100%/35  | Total:  3h 50m | Avg:  6m 34s | Max: 19m 59s | Hits:  99%/8924  
      🟩 17                 Pass: 100%/34  | Total:  3h 37m | Avg:  6m 23s | Max: 21m 01s | Hits:  99%/6693  
      🟩 20                 Pass: 100%/22  | Total:  2h 34m | Avg:  7m 00s | Max: 21m 04s | Hits:  99%/4462  
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
pycuda
CUDA C Core Library

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- pycuda
+/- CUDA C Core Library

🏃‍ Runner counts (total jobs: 259)

# Runner
186 linux-amd64-cpu16
42 linux-amd64-gpu-v100-latest-1
16 linux-arm64-cpu16
15 windows-amd64-cpu16

Comment on lines 72 to 92
for (int i = 0; i < elements_per_thread; ++i)
{
  auto index = tile_offset + threadIdx.x + i * blockDim.x;

  if (index < num_items)
  {
    // early exit
    if (sresult < index)
    {
      return;
    }

    if (pred(*(begin + index)))
    {
      atomicMin(result, index);
      return;
    }
  }
}
Collaborator

suggestion: a few observations about this code:

  1. when one thread finds an element satisfying the predicate, it updates the atomic and exits, while the rest of the CTA has to read the atomic again on the next iteration
  2. if every thread finds an element satisfying the predicate, we'll issue up to as many global atomics as there are threads in the thread block

Instead, you could experiment with an approach where threads first issue an atomic min on shared memory. Then, outside of the loop over elements per thread, the whole block synchronizes, and if the minimal index is less than int max, the main thread issues a global atomic min and the whole thread block exits.

This could make an iteration over a single tile a bit more expensive when no element satisfies the predicate, but on the other hand it addresses (1), because the remaining threads will now see that they have to exit without accessing global memory, and (2), because at most one thread of a given CTA now issues a global atomic.
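
The difference in global atomic traffic can be sketched with a sequential host-side model. This is an illustration only, not the kernel from this PR: `TileStats`, `direct_global_min`, and `block_then_global_min` are hypothetical names, a plain `int` stands in for the shared-memory variable, `std::min` stands in for `atomicMin`, and the model leaves out the `sresult` early-exit check.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

// Outcome of processing one tile in this host-side model.
struct TileStats
{
  int result;         // smallest matching index seen, INT_MAX if none
  int global_atomics; // how many global atomicMin calls the tile would issue
};

// Original scheme: each thread that finds a match issues its own global
// atomicMin and returns.
TileStats direct_global_min(const std::vector<int>& data, int tile_offset, int block_size, int items_per_thread)
{
  assert(block_size > 0 && items_per_thread > 0);
  const int num_items = static_cast<int>(data.size());
  TileStats stats{INT_MAX, 0};
  for (int tid = 0; tid < block_size; ++tid) // one pass per simulated thread
  {
    for (int i = 0; i < items_per_thread; ++i)
    {
      const int index = tile_offset + tid + i * block_size;
      if (index >= num_items)
      {
        break;
      }
      if (data[index] != 0) // stands in for pred(*(begin + index))
      {
        stats.result = std::min(stats.result, index); // models atomicMin(result, index)
        ++stats.global_atomics;
        break; // the thread returns after its first match
      }
    }
  }
  return stats;
}

// Suggested scheme: threads reduce into a block-local minimum first, then a
// single thread publishes it with one global atomicMin.
TileStats block_then_global_min(const std::vector<int>& data, int tile_offset, int block_size, int items_per_thread)
{
  assert(block_size > 0 && items_per_thread > 0);
  const int num_items = static_cast<int>(data.size());
  int block_min       = INT_MAX; // stands in for a __shared__ variable
  for (int tid = 0; tid < block_size; ++tid)
  {
    for (int i = 0; i < items_per_thread; ++i)
    {
      const int index = tile_offset + tid + i * block_size;
      if (index >= num_items)
      {
        break;
      }
      if (data[index] != 0)
      {
        block_min = std::min(block_min, index); // models a shared-memory atomicMin
        break;
      }
    }
  }
  // A real kernel would need a block-wide barrier (__syncthreads()) here.
  TileStats stats{INT_MAX, 0};
  if (block_min < INT_MAX)
  {
    stats.result         = block_min; // one thread issues the global atomicMin
    stats.global_atomics = 1;
  }
  return stats;
}
```

With a 128-thread block and an input in which every element matches, the first model issues 128 global atomics for the tile and the second issues one; both report the same minimal index. Whether the extra barrier pays off on real hardware is what the benchmarks would have to show.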

Comment on lines 79 to 84
if (sresult < index)
{
  return;
}
Collaborator

suggestion: I'd hoist this out of the elements-per-thread loop. The invariant here is that sresult is only updated before the elements loop. If sresult was smaller than the tile offset, it'll stay smaller for any tile_offset + i. If it was larger, it'll be larger for any tile_offset + i, so there should be no reason to read from shared memory on every iteration.
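
A small host-side model of one thread's work shows what the hoisting buys. It is a sketch under stated assumptions: `ThreadWork`, `check_every_iteration`, `check_once_per_tile`, and `no_match_yet` are hypothetical names, `snapshot` plays the role of `sresult`, and the predicate is assumed never to match for this thread, so only the bookkeeping is counted.

```cpp
#include <cassert>
#include <limits>

// Value of the snapshot while no match has been recorded yet.
constexpr int no_match_yet = std::numeric_limits<int>::max();

struct ThreadWork
{
  int snapshot_reads; // reads of the shared-memory snapshot (sresult)
  int pred_evals;     // predicate evaluations
};

// Snapshot compared against the element index on every iteration.
ThreadWork check_every_iteration(
  int snapshot, int tile_offset, int tid, int block_size, int items_per_thread, int num_items)
{
  assert(block_size > 0);
  ThreadWork work{0, 0};
  for (int i = 0; i < items_per_thread; ++i)
  {
    const int index = tile_offset + tid + i * block_size;
    if (index >= num_items)
    {
      break;
    }
    ++work.snapshot_reads;
    if (snapshot < index)
    {
      return work; // an earlier match is already known
    }
    ++work.pred_evals;
  }
  return work;
}

// Snapshot compared once against the tile offset, before the loop.
ThreadWork check_once_per_tile(
  int snapshot, int tile_offset, int tid, int block_size, int items_per_thread, int num_items)
{
  assert(block_size > 0);
  ThreadWork work{1, 0};
  if (snapshot < tile_offset)
  {
    return work; // the whole tile lies past a known match
  }
  for (int i = 0; i < items_per_thread; ++i)
  {
    const int index = tile_offset + tid + i * block_size;
    if (index >= num_items)
    {
      break;
    }
    ++work.pred_evals;
  }
  return work;
}
```

In this model the hoisted version reads the snapshot once per tile instead of once per element. The two versions differ only when the snapshot falls inside the current tile: the per-iteration check can stop partway through, while the hoisted check evaluates the predicate for the rest of that one tile.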

@gonidelis gonidelis force-pushed the cub_find_if branch 2 times, most recently from 79569ad to ce3b44c September 16, 2024 23:59
Contributor

🟨 CI finished in 4h 42m: Pass: 94%/259 | Total: 5d 01h | Avg: 28m 13s | Max: 1h 39m | Hits: 99%/20079
  • 🟨 cub: Pass: 88%/136 | Total: 4d 12h | Avg: 47m 41s | Max: 1h 39m

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  88%/128 | Total:  4d 05h | Avg: 47m 22s | Max:  1h 39m
      🟩 arm64              Pass: 100%/8   | Total:  7h 01m | Avg: 52m 41s | Max: 54m 14s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 57m | Avg: 58m 50s | Max: 59m 38s
      🔍 nvcc               Pass:  88%/134 | Total:  4d 10h | Avg: 47m 31s | Max:  1h 39m
    🟨 ctk
      🟨 11.1               Pass:  93%/15  | Total:  1h 07m | Avg:  4m 30s | Max: 14m 31s
      🟩 11.8               Pass: 100%/3   | Total: 13m 21s | Avg:  4m 27s | Max:  4m 36s
      🟨 12.6               Pass:  88%/118 | Total:  4d 10h | Avg: 54m 16s | Max:  1h 39m
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 57m | Avg: 58m 50s | Max: 59m 38s
      🟨 nvcc11.1           Pass:  93%/15  | Total:  1h 07m | Avg:  4m 30s | Max: 14m 31s
      🟩 nvcc11.8           Pass: 100%/3   | Total: 13m 21s | Avg:  4m 27s | Max:  4m 36s
      🟨 nvcc12.6           Pass:  87%/116 | Total:  4d 08h | Avg: 54m 11s | Max:  1h 39m
    🟨 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  2h 30m | Avg: 25m 07s | Max: 48m 51s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 15m | Avg: 45m 00s | Max: 47m 01s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 55m | Avg: 43m 51s | Max: 45m 13s
      🟩 Clang12            Pass: 100%/4   | Total:  3h 01m | Avg: 45m 24s | Max: 47m 22s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 00m | Avg: 45m 14s | Max: 47m 45s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 53m | Avg: 43m 20s | Max: 43m 54s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 52m | Avg: 43m 14s | Max: 43m 46s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 57m | Avg: 44m 28s | Max: 47m 08s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 55m | Avg: 43m 51s | Max: 45m 31s
      🟨 Clang18            Pass:  84%/26  | Total:  1d 07h | Avg:  1h 11m | Max:  1h 34m
      🟩 GCC6               Pass: 100%/2   | Total:  7m 00s | Avg:  3m 30s | Max:  3m 41s
      🟩 GCC7               Pass: 100%/6   | Total:  2h 23m | Avg: 23m 52s | Max: 45m 11s
      🟩 GCC8               Pass: 100%/6   | Total:  2h 54m | Avg: 29m 00s | Max:  1h 12m
      🟩 GCC9               Pass: 100%/6   | Total:  2h 28m | Avg: 24m 42s | Max: 46m 30s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 58m | Avg: 44m 35s | Max: 47m 43s
      🟩 GCC11              Pass: 100%/7   | Total:  3h 14m | Avg: 27m 49s | Max: 46m 54s
      🟩 GCC12              Pass: 100%/4   | Total:  3h 03m | Avg: 45m 51s | Max: 48m 13s
      🟨 GCC13              Pass:  82%/29  | Total:  1d 06h | Avg:  1h 04m | Max:  1h 39m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 17m | Avg: 45m 54s | Max: 48m 09s
      🟥 MSVC14.16          Pass:   0%/1   | Total: 14m 31s | Avg: 14m 31s | Max: 14m 31s
      🟥 MSVC14.29          Pass:   0%/2   | Total: 24m 02s | Avg: 12m 01s | Max: 12m 25s
      🟥 MSVC14.39          Pass:   0%/3   | Total: 38m 46s | Avg: 12m 55s | Max: 13m 28s
    🟨 cxx_family
      🟨 Clang              Pass:  93%/63  | Total:  2d 08h | Avg: 53m 42s | Max:  1h 34m
      🟨 GCC                Pass:  92%/64  | Total:  2d 00h | Avg: 45m 06s | Max:  1h 39m
      🟩 Intel              Pass: 100%/3   | Total:  2h 17m | Avg: 45m 54s | Max: 48m 09s
      🟥 MSVC               Pass:   0%/6   | Total:  1h 17m | Avg: 12m 53s | Max: 14m 31s
    🟨 jobs
      🟨 Build              Pass:  94%/103 | Total:  2d 13h | Avg: 35m 43s | Max:  1h 12m
      🟩 DeviceLaunch       Pass: 100%/8   | Total: 11h 01m | Avg:  1h 22m | Max:  1h 25m
      🟩 GraphCapture       Pass: 100%/8   | Total: 10h 59m | Avg:  1h 22m | Max:  1h 29m
      🟩 HostLaunch         Pass: 100%/8   | Total: 11h 01m | Avg:  1h 22m | Max:  1h 27m
      🟥 SmallGMem          Pass:   0%/1   | Total:  1h 39m | Avg:  1h 39m | Max:  1h 39m
      🟥 TestGPU            Pass:   0%/8   | Total: 12h 02m | Avg:  1h 30m | Max:  1h 36m
    🟨 gpu
      🟨 v100               Pass:  88%/136 | Total:  4d 12h | Avg: 47m 41s | Max:  1h 39m
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 13m 21s | Avg:  4m 27s | Max:  4m 36s
      🟩 90a                Pass: 100%/4   | Total: 15m 11s | Avg:  3m 47s | Max:  3m 55s
    🟨 std
      🟨 11                 Pass:  94%/35  | Total:  1d 02h | Avg: 45m 56s | Max:  1h 30m
      🟨 14                 Pass:  86%/38  | Total:  1d 03h | Avg: 43m 33s | Max:  1h 33m
      🟨 17                 Pass:  86%/38  | Total:  1d 06h | Avg: 47m 27s | Max:  1h 39m
      🟨 20                 Pass:  88%/25  | Total: 23h 38m | Avg: 56m 44s | Max:  1h 36m
    
  • 🟩 thrust: Pass: 100%/122 | Total: 13h 27m | Avg: 6m 37s | Max: 22m 34s | Hits: 99%/20079

    🟩 cpu
      🟩 amd64              Pass: 100%/114 | Total: 12h 49m | Avg:  6m 45s | Max: 22m 34s | Hits:  99%/20079 
      🟩 arm64              Pass: 100%/8   | Total: 37m 43s | Avg:  4m 42s | Max:  5m 21s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 12m | Avg:  4m 50s | Max: 18m 34s | Hits:  99%/2231  
      🟩 11.8               Pass: 100%/3   | Total: 14m 18s | Avg:  4m 46s | Max:  4m 59s
      🟩 12.6               Pass: 100%/104 | Total: 12h 00m | Avg:  6m 55s | Max: 22m 34s | Hits:  99%/17848 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  8m 51s | Avg:  4m 25s | Max:  4m 26s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 12m | Avg:  4m 50s | Max: 18m 34s | Hits:  99%/2231  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 14m 18s | Avg:  4m 46s | Max:  4m 59s
      🟩 nvcc12.6           Pass: 100%/102 | Total: 11h 51m | Avg:  6m 58s | Max: 22m 34s | Hits:  99%/17848 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  8m 51s | Avg:  4m 25s | Max:  4m 26s
      🟩 nvcc               Pass: 100%/120 | Total: 13h 18m | Avg:  6m 39s | Max: 22m 34s | Hits:  99%/20079 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 30m 48s | Avg:  5m 08s | Max:  5m 55s
      🟩 Clang10            Pass: 100%/3   | Total: 16m 46s | Avg:  5m 35s | Max:  5m 57s
      🟩 Clang11            Pass: 100%/4   | Total: 18m 12s | Avg:  4m 33s | Max:  4m 37s
      🟩 Clang12            Pass: 100%/4   | Total: 18m 15s | Avg:  4m 33s | Max:  4m 46s
      🟩 Clang13            Pass: 100%/4   | Total: 18m 05s | Avg:  4m 31s | Max:  4m 39s
      🟩 Clang14            Pass: 100%/4   | Total: 18m 18s | Avg:  4m 34s | Max:  4m 43s
      🟩 Clang15            Pass: 100%/4   | Total: 19m 15s | Avg:  4m 48s | Max:  5m 14s
      🟩 Clang16            Pass: 100%/4   | Total: 18m 38s | Avg:  4m 39s | Max:  5m 00s
      🟩 Clang17            Pass: 100%/4   | Total: 18m 45s | Avg:  4m 41s | Max:  4m 52s
      🟩 Clang18            Pass: 100%/18  | Total:  2h 15m | Avg:  7m 30s | Max: 18m 36s
      🟩 GCC6               Pass: 100%/2   | Total:  7m 06s | Avg:  3m 33s | Max:  3m 50s
      🟩 GCC7               Pass: 100%/6   | Total: 24m 38s | Avg:  4m 06s | Max:  4m 44s
      🟩 GCC8               Pass: 100%/6   | Total: 24m 11s | Avg:  4m 01s | Max:  4m 44s
      🟩 GCC9               Pass: 100%/6   | Total: 25m 26s | Avg:  4m 14s | Max:  4m 52s
      🟩 GCC10              Pass: 100%/4   | Total: 18m 45s | Avg:  4m 41s | Max:  5m 00s
      🟩 GCC11              Pass: 100%/7   | Total: 33m 43s | Avg:  4m 49s | Max:  5m 02s
      🟩 GCC12              Pass: 100%/4   | Total: 20m 07s | Avg:  5m 01s | Max:  5m 10s
      🟩 GCC13              Pass: 100%/20  | Total:  2h 32m | Avg:  7m 36s | Max: 19m 31s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 17m 50s | Avg:  5m 56s | Max:  6m 06s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 18m 34s | Avg: 18m 34s | Max: 18m 34s | Hits:  99%/2231  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 32m 57s | Avg: 16m 28s | Max: 16m 46s | Hits:  99%/4462  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  1h 59m | Avg: 19m 57s | Max: 22m 34s | Hits:  99%/13386 
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total:  5h 12m | Avg:  5m 40s | Max: 18m 36s
      🟩 GCC                Pass: 100%/55  | Total:  5h 06m | Avg:  5m 34s | Max: 19m 31s
      🟩 Intel              Pass: 100%/3   | Total: 17m 50s | Avg:  5m 56s | Max:  6m 06s
      🟩 MSVC               Pass: 100%/9   | Total:  2h 51m | Avg: 19m 01s | Max: 22m 34s | Hits:  99%/20079 
    🟩 gpu
      🟩 v100               Pass: 100%/122 | Total: 13h 27m | Avg:  6m 37s | Max: 22m 34s | Hits:  99%/20079 
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  9h 14m | Avg:  5m 22s | Max: 18m 50s | Hits:  99%/13386 
      🟩 TestCPU            Pass: 100%/11  | Total:  2h 05m | Avg: 11m 23s | Max: 22m 34s | Hits:  99%/6693  
      🟩 TestGPU            Pass: 100%/8   | Total:  2h 07m | Avg: 15m 59s | Max: 19m 31s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 14m 18s | Avg:  4m 46s | Max:  4m 59s
      🟩 90a                Pass: 100%/4   | Total: 15m 40s | Avg:  3m 55s | Max:  4m 02s
    🟩 std
      🟩 11                 Pass: 100%/31  | Total:  2h 46m | Avg:  5m 22s | Max: 19m 26s
      🟩 14                 Pass: 100%/35  | Total:  4h 09m | Avg:  7m 08s | Max: 22m 03s | Hits:  99%/8924  
      🟩 17                 Pass: 100%/34  | Total:  3h 45m | Avg:  6m 38s | Max: 22m 34s | Hits:  99%/6693  
      🟩 20                 Pass: 100%/22  | Total:  2h 45m | Avg:  7m 30s | Max: 21m 58s | Hits:  99%/4462  
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
pycuda
CUDA C Core Library

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- pycuda
+/- CUDA C Core Library

🏃‍ Runner counts (total jobs: 259)

# Runner
186 linux-amd64-cpu16
42 linux-amd64-gpu-v100-latest-1
16 linux-arm64-cpu16
15 windows-amd64-cpu16


#include <cub/config.cuh>

#include "device_launch_parameters.h"
Collaborator

That should come after the other includes, I assume?

Member Author

Yeah still a silly draft, but why?

Collaborator

@miscco miscco Sep 17, 2024

Because any header other than the config one should come after the system header guard, so we do not accidentally emit warnings to users.

cub/cub/device/device_find_if.cuh
Comment on lines +56 to +57

template <typename IterBegin, typename IterEnd, typename Pred>
Collaborator

Missing some documentation here

Member Author

@gonidelis gonidelis Sep 17, 2024

The code is still a very early draft at the performance-analysis stage. Until we make sure that this is the right implementation to go with (not sure yet), I won't be adding any docs, unless it makes it easier for the crowd to understand the new algo; for that, I will add a detailed explanation to the PR description.

{
  auto index = tile_offset + threadIdx.x + i * blockDim.x;

  if (index < num_items)
Collaborator

Consider turning this into an early return

Suggested change
- if (index < num_items)
+ if (index >= num_items) {
+   continue; // maybe even break?
+ }

Collaborator

The whole loop body is within the if, so an early return would make things clearer.

Member Author

@gonidelis gonidelis Sep 17, 2024

OK, OK, got it. I had to look at it with fresh eyes and some morning coffee to get it.
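
On the "maybe even break?" question: with the strided indexing used in this loop, the index only grows with `i`, so once it is out of range every later iteration is out of range too. The host-side sketch below checks that both guard-clause policies visit the same elements. The names `OutOfRange` and `visited_indices` are hypothetical, and the sketch assumes a positive block size and no integer overflow in the index computation.

```cpp
#include <cassert>
#include <vector>

// What the guard clause does when an index falls outside the input.
enum class OutOfRange
{
  Continue,
  Break
};

// Returns the indices one simulated thread would process in a single tile.
std::vector<int> visited_indices(
  int tile_offset, int tid, int block_size, int items_per_thread, int num_items, OutOfRange policy)
{
  assert(block_size > 0);
  std::vector<int> visited;
  for (int i = 0; i < items_per_thread; ++i)
  {
    const int index = tile_offset + tid + i * block_size;
    if (index >= num_items) // guard clause replacing the enclosing if
    {
      if (policy == OutOfRange::Break)
      {
        break; // later indices are larger still, so nothing is skipped
      }
      continue;
    }
    visited.push_back(index); // the loop body proper would go here
  }
  return visited;
}
```

For thread 5 of a 128-thread block over 300 items, both policies visit indices 5, 133, and 261; `break` just skips the remaining no-op iteration.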

Contributor

🟨 CI finished in 8h 06m: Pass: 94%/259 | Total: 5d 02h | Avg: 28m 19s | Max: 1h 34m | Hits: 99%/20079
  • 🟨 cub: Pass: 88%/136 | Total: 4d 12h | Avg: 47m 44s | Max: 1h 34m

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  88%/128 | Total:  4d 05h | Avg: 47m 21s | Max:  1h 34m
      🟩 arm64              Pass: 100%/8   | Total:  7h 11m | Avg: 53m 53s | Max: 57m 36s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 05m
      🔍 nvcc               Pass:  88%/134 | Total:  4d 10h | Avg: 47m 29s | Max:  1h 34m
    🟨 ctk
      🟨 11.1               Pass:  93%/15  | Total:  1h 07m | Avg:  4m 31s | Max: 14m 48s
      🟩 11.8               Pass: 100%/3   | Total: 14m 06s | Avg:  4m 42s | Max:  4m 54s
      🟨 12.6               Pass:  88%/118 | Total:  4d 10h | Avg: 54m 19s | Max:  1h 34m
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 05m
      🟨 nvcc11.1           Pass:  93%/15  | Total:  1h 07m | Avg:  4m 31s | Max: 14m 48s
      🟩 nvcc11.8           Pass: 100%/3   | Total: 14m 06s | Avg:  4m 42s | Max:  4m 54s
      🟨 nvcc12.6           Pass:  87%/116 | Total:  4d 08h | Avg: 54m 09s | Max:  1h 34m
    🟨 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  2h 27m | Avg: 24m 34s | Max: 46m 01s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 16m | Avg: 45m 28s | Max: 47m 07s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 59m | Avg: 44m 49s | Max: 46m 31s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 58m | Avg: 44m 42s | Max: 48m 46s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 01m | Avg: 45m 15s | Max: 48m 00s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 03m | Avg: 45m 46s | Max: 47m 53s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 57m | Avg: 44m 17s | Max: 46m 06s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 54m | Avg: 43m 38s | Max: 45m 19s
      🟩 Clang17            Pass: 100%/4   | Total:  3h 01m | Avg: 45m 25s | Max: 47m 22s
      🟨 Clang18            Pass:  84%/26  | Total:  1d 07h | Avg:  1h 12m | Max:  1h 34m
      🟩 GCC6               Pass: 100%/2   | Total:  6m 59s | Avg:  3m 29s | Max:  3m 34s
      🟩 GCC7               Pass: 100%/6   | Total:  2h 22m | Avg: 23m 44s | Max: 44m 38s
      🟩 GCC8               Pass: 100%/6   | Total:  2h 20m | Avg: 23m 27s | Max: 43m 37s
      🟩 GCC9               Pass: 100%/6   | Total:  2h 27m | Avg: 24m 31s | Max: 47m 41s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 59m | Avg: 44m 52s | Max: 45m 38s
      🟩 GCC11              Pass: 100%/7   | Total:  3h 13m | Avg: 27m 36s | Max: 46m 19s
      🟩 GCC12              Pass: 100%/4   | Total:  3h 34m | Avg: 53m 39s | Max:  1h 17m
      🟨 GCC13              Pass:  82%/29  | Total:  1d 06h | Avg:  1h 03m | Max:  1h 34m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 18m | Avg: 46m 08s | Max: 48m 33s
      🟥 MSVC14.16          Pass:   0%/1   | Total: 14m 48s | Avg: 14m 48s | Max: 14m 48s
      🟥 MSVC14.29          Pass:   0%/2   | Total: 25m 11s | Avg: 12m 35s | Max: 12m 36s
      🟥 MSVC14.39          Pass:   0%/3   | Total: 41m 00s | Avg: 13m 40s | Max: 14m 26s
    🟨 cxx_family
      🟨 Clang              Pass:  93%/63  | Total:  2d 08h | Avg: 54m 09s | Max:  1h 34m
      🟨 GCC                Pass:  92%/64  | Total:  1d 23h | Avg: 44m 42s | Max:  1h 34m
      🟩 Intel              Pass: 100%/3   | Total:  2h 18m | Avg: 46m 08s | Max: 48m 33s
      🟥 MSVC               Pass:   0%/6   | Total:  1h 20m | Avg: 13m 29s | Max: 14m 48s
    🟨 jobs
      🟨 Build              Pass:  94%/103 | Total:  2d 13h | Avg: 36m 04s | Max:  1h 17m
      🟩 DeviceLaunch       Pass: 100%/8   | Total: 11h 02m | Avg:  1h 22m | Max:  1h 27m
      🟩 GraphCapture       Pass: 100%/8   | Total: 10h 44m | Avg:  1h 20m | Max:  1h 25m
      🟩 HostLaunch         Pass: 100%/8   | Total: 11h 08m | Avg:  1h 23m | Max:  1h 29m
      🟥 SmallGMem          Pass:   0%/1   | Total:  1h 34m | Avg:  1h 34m | Max:  1h 34m
      🟥 TestGPU            Pass:   0%/8   | Total: 11h 48m | Avg:  1h 28m | Max:  1h 34m
    🟨 gpu
      🟨 v100               Pass:  88%/136 | Total:  4d 12h | Avg: 47m 44s | Max:  1h 34m
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 14m 06s | Avg:  4m 42s | Max:  4m 54s
      🟩 90a                Pass: 100%/4   | Total: 15m 27s | Avg:  3m 51s | Max:  4m 05s
    🟨 std
      🟨 11                 Pass:  94%/35  | Total:  1d 03h | Avg: 46m 34s | Max:  1h 28m
      🟨 14                 Pass:  86%/38  | Total:  1d 03h | Avg: 42m 41s | Max:  1h 26m
      🟨 17                 Pass:  86%/38  | Total:  1d 06h | Avg: 47m 56s | Max:  1h 34m
      🟨 20                 Pass:  88%/25  | Total: 23h 38m | Avg: 56m 44s | Max:  1h 33m
    
  • 🟩 thrust: Pass: 100%/122 | Total: 13h 45m | Avg: 6m 46s | Max: 25m 49s | Hits: 99%/20079

    🟩 cpu
      🟩 amd64              Pass: 100%/114 | Total: 13h 10m | Avg:  6m 56s | Max: 25m 49s | Hits:  99%/20079 
      🟩 arm64              Pass: 100%/8   | Total: 34m 56s | Avg:  4m 22s | Max:  4m 48s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 13m | Avg:  4m 52s | Max: 20m 32s | Hits:  99%/2231  
      🟩 11.8               Pass: 100%/3   | Total: 13m 40s | Avg:  4m 33s | Max:  4m 53s
      🟩 12.6               Pass: 100%/104 | Total: 12h 18m | Avg:  7m 06s | Max: 25m 49s | Hits:  99%/17848 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  9m 18s | Avg:  4m 39s | Max:  4m 44s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 13m | Avg:  4m 52s | Max: 20m 32s | Hits:  99%/2231  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 13m 40s | Avg:  4m 33s | Max:  4m 53s
      🟩 nvcc12.6           Pass: 100%/102 | Total: 12h 09m | Avg:  7m 09s | Max: 25m 49s | Hits:  99%/17848 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  9m 18s | Avg:  4m 39s | Max:  4m 44s
      🟩 nvcc               Pass: 100%/120 | Total: 13h 36m | Avg:  6m 48s | Max: 25m 49s | Hits:  99%/20079 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 28m 11s | Avg:  4m 41s | Max:  5m 43s
      🟩 Clang10            Pass: 100%/3   | Total: 17m 05s | Avg:  5m 41s | Max:  6m 17s
      🟩 Clang11            Pass: 100%/4   | Total: 19m 09s | Avg:  4m 47s | Max:  4m 58s
      🟩 Clang12            Pass: 100%/4   | Total: 18m 41s | Avg:  4m 40s | Max:  4m 48s
      🟩 Clang13            Pass: 100%/4   | Total: 19m 37s | Avg:  4m 54s | Max:  5m 48s
      🟩 Clang14            Pass: 100%/4   | Total: 18m 53s | Avg:  4m 43s | Max:  5m 01s
      🟩 Clang15            Pass: 100%/4   | Total: 19m 43s | Avg:  4m 55s | Max:  5m 25s
      🟩 Clang16            Pass: 100%/4   | Total: 19m 09s | Avg:  4m 47s | Max:  5m 03s
      🟩 Clang17            Pass: 100%/4   | Total: 18m 47s | Avg:  4m 41s | Max:  4m 56s
      🟩 Clang18            Pass: 100%/18  | Total:  2h 28m | Avg:  8m 15s | Max: 23m 35s
      🟩 GCC6               Pass: 100%/2   | Total:  7m 07s | Avg:  3m 33s | Max:  3m 49s
      🟩 GCC7               Pass: 100%/6   | Total: 24m 04s | Avg:  4m 00s | Max:  4m 41s
      🟩 GCC8               Pass: 100%/6   | Total: 25m 11s | Avg:  4m 11s | Max:  4m 38s
      🟩 GCC9               Pass: 100%/6   | Total: 25m 03s | Avg:  4m 10s | Max:  5m 09s
      🟩 GCC10              Pass: 100%/4   | Total: 18m 43s | Avg:  4m 40s | Max:  5m 02s
      🟩 GCC11              Pass: 100%/7   | Total: 33m 20s | Avg:  4m 45s | Max:  5m 28s
      🟩 GCC12              Pass: 100%/4   | Total: 19m 33s | Avg:  4m 53s | Max:  5m 07s
      🟩 GCC13              Pass: 100%/20  | Total:  2h 23m | Avg:  7m 09s | Max: 16m 05s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 18m 12s | Avg:  6m 04s | Max:  6m 29s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 20m 32s | Avg: 20m 32s | Max: 20m 32s | Hits:  99%/2231  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 34m 31s | Avg: 17m 15s | Max: 17m 34s | Hits:  99%/4462  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  2h 08m | Avg: 21m 23s | Max: 25m 49s | Hits:  99%/13386 
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total:  5h 27m | Avg:  5m 57s | Max: 23m 35s
      🟩 GCC                Pass: 100%/55  | Total:  4h 56m | Avg:  5m 23s | Max: 16m 05s
      🟩 Intel              Pass: 100%/3   | Total: 18m 12s | Avg:  6m 04s | Max:  6m 29s
      🟩 MSVC               Pass: 100%/9   | Total:  3h 03m | Avg: 20m 22s | Max: 25m 49s | Hits:  99%/20079 
    🟩 gpu
      🟩 v100               Pass: 100%/122 | Total: 13h 45m | Avg:  6m 46s | Max: 25m 49s | Hits:  99%/20079 
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  9h 19m | Avg:  5m 26s | Max: 20m 32s | Hits:  99%/13386 
      🟩 TestCPU            Pass: 100%/11  | Total:  2h 11m | Avg: 11m 56s | Max: 25m 49s | Hits:  99%/6693  
      🟩 TestGPU            Pass: 100%/8   | Total:  2h 14m | Avg: 16m 49s | Max: 23m 35s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 13m 40s | Avg:  4m 33s | Max:  4m 53s
      🟩 90a                Pass: 100%/4   | Total: 16m 09s | Avg:  4m 02s | Max:  4m 19s
    🟩 std
      🟩 11                 Pass: 100%/31  | Total:  2h 46m | Avg:  5m 22s | Max: 23m 35s
      🟩 14                 Pass: 100%/35  | Total:  4h 03m | Avg:  6m 58s | Max: 22m 44s | Hits:  99%/8924  
      🟩 17                 Pass: 100%/34  | Total:  4h 05m | Avg:  7m 13s | Max: 25m 20s | Hits:  99%/6693  
      🟩 20                 Pass: 100%/22  | Total:  2h 49m | Avg:  7m 41s | Max: 25m 49s | Hits:  99%/4462  
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    

👃 Inspect Changes

Modifications in project?

|     | Project             |
|-----|---------------------|
|     | CCCL Infrastructure |
|     | libcu++             |
| +/- | CUB                 |
|     | Thrust              |
|     | CUDA Experimental   |
|     | pycuda              |
|     | CUDA C Core Library |

Modifications in project or dependencies?

|     | Project             |
|-----|---------------------|
|     | CCCL Infrastructure |
|     | libcu++             |
| +/- | CUB                 |
| +/- | Thrust              |
|     | CUDA Experimental   |
| +/- | pycuda              |
| +/- | CUDA C Core Library |

🏃‍ Runner counts (total jobs: 259)

| #   | Runner                        |
|-----|-------------------------------|
| 186 | linux-amd64-cpu16             |
| 42  | linux-amd64-gpu-v100-latest-1 |
| 16  | linux-arm64-cpu16             |
| 15  | windows-amd64-cpu16           |

bool found = false;
for (int i = 0; i < elements_per_thread; ++i)
{
- auto index = tile_offset + threadIdx.x + i * blockDim.x;
+ int index = i % WORDS + (i / WORDS) * block_threads * WORDS + threadIdx.x * WORDS + tile_offset;
Member Author

This index expression can of course be simplified further; it is left as is for educational purposes.
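As a rough illustration of what "simplified further" could mean, the sketch below writes both index formulas from the diff as host-side functions and adds one possible factored form. The function names are hypothetical, and `words`, `block_threads`, `thread`, and `tile_offset` are passed as plain parameters so the code runs without a GPU. The reading that each thread owns runs of `WORDS` consecutive elements follows from the formula itself; the motivation for that layout is not stated in the thread, so treat any interpretation beyond the arithmetic as an assumption.

```cpp
#include <cstddef>

// Original layout from the diff: in each iteration, consecutive
// threads touch consecutive elements.
constexpr std::size_t striped_index(
  std::size_t i, std::size_t thread, std::size_t block_threads, std::size_t tile_offset)
{
  return tile_offset + thread + i * block_threads;
}

// New layout, copied term by term from the diff: each thread owns
// runs of `words` consecutive elements.
constexpr std::size_t chunked_index(
  std::size_t i, std::size_t thread, std::size_t block_threads, std::size_t words, std::size_t tile_offset)
{
  return i % words + (i / words) * block_threads * words + thread * words + tile_offset;
}

// One possible simplification (an assumption, not the PR's final code):
// factor `words` out of the two middle terms.
constexpr std::size_t chunked_index_simplified(
  std::size_t i, std::size_t thread, std::size_t block_threads, std::size_t words, std::size_t tile_offset)
{
  return tile_offset + ((i / words) * block_threads + thread) * words + i % words;
}

// Spot check: i = 5, thread = 2, block_threads = 4, words = 4, tile_offset = 0.
static_assert(chunked_index(5, 2, 4, 4, 0) == 25, "1 + 16 + 8 + 0");
static_assert(chunked_index_simplified(5, 2, 4, 4, 0) == 25, "(1 * 4 + 2) * 4 + 1");
```

When the number of elements per thread is a multiple of `words`, both chunked forms visit every index of the tile exactly once, so the factored form changes only how the index is computed, not which element each thread reads.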

Contributor

🟨 CI finished in 2h 35m: Pass: 97%/259 | Total: 5d 01h | Avg: 28m 11s | Max: 1h 41m | Hits: 99%/20079
  • 🟨 cub: Pass: 95%/136 | Total: 4d 12h | Avg: 47m 51s | Max: 1h 41m

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  95%/128 | Total:  4d 05h | Avg: 47m 22s | Max:  1h 41m
      🟩 arm64              Pass: 100%/8   | Total:  7h 24m | Avg: 55m 32s | Max: 59m 13s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 04m
      🔍 nvcc               Pass:  95%/134 | Total:  4d 10h | Avg: 47m 38s | Max:  1h 41m
    🚨 cxx_family: MSVC 🚨
      🟩 Clang              Pass: 100%/63  | Total:  2d 08h | Avg: 53m 43s | Max:  1h 31m
      🟩 GCC                Pass: 100%/64  | Total:  2d 00h | Avg: 45m 26s | Max:  1h 41m
      🟩 Intel              Pass: 100%/3   | Total:  2h 17m | Avg: 45m 46s | Max: 46m 50s
      🔥 MSVC               Pass:   0%/6   | Total:  1h 18m | Avg: 13m 04s | Max: 14m 25s
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  94%/103 | Total:  2d 13h | Avg: 35m 50s | Max:  1h 04m
      🟩 DeviceLaunch       Pass: 100%/8   | Total: 11h 17m | Avg:  1h 24m | Max:  1h 28m
      🟩 GraphCapture       Pass: 100%/8   | Total: 10h 49m | Avg:  1h 21m | Max:  1h 25m
      🟩 HostLaunch         Pass: 100%/8   | Total: 11h 03m | Avg:  1h 22m | Max:  1h 26m
      🟩 SmallGMem          Pass: 100%/1   | Total:  1h 41m | Avg:  1h 41m | Max:  1h 41m
      🟩 TestGPU            Pass: 100%/8   | Total: 12h 04m | Avg:  1h 30m | Max:  1h 32m
    🟨 ctk
      🟨 11.1               Pass:  93%/15  | Total:  1h 04m | Avg:  4m 17s | Max: 14m 25s
      🟩 11.8               Pass: 100%/3   | Total: 13m 53s | Avg:  4m 37s | Max:  4m 53s
      🟨 12.6               Pass:  95%/118 | Total:  4d 11h | Avg: 54m 29s | Max:  1h 41m
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 04m
      🟨 nvcc11.1           Pass:  93%/15  | Total:  1h 04m | Avg:  4m 17s | Max: 14m 25s
      🟩 nvcc11.8           Pass: 100%/3   | Total: 13m 53s | Avg:  4m 37s | Max:  4m 53s
      🟨 nvcc12.6           Pass:  95%/116 | Total:  4d 09h | Avg: 54m 22s | Max:  1h 41m
    🟨 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  2h 26m | Avg: 24m 27s | Max: 47m 10s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 15m | Avg: 45m 06s | Max: 47m 07s
      🟩 Clang11            Pass: 100%/4   | Total:  3h 00m | Avg: 45m 14s | Max: 47m 46s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 53m | Avg: 43m 18s | Max: 43m 58s
      🟩 Clang13            Pass: 100%/4   | Total:  2h 57m | Avg: 44m 29s | Max: 48m 16s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 54m | Avg: 43m 32s | Max: 44m 01s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 00m | Avg: 45m 13s | Max: 48m 18s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 58m | Avg: 44m 33s | Max: 49m 00s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 52m | Avg: 43m 14s | Max: 43m 49s
      🟩 Clang18            Pass: 100%/26  | Total:  1d 07h | Avg:  1h 11m | Max:  1h 31m
      🟩 GCC6               Pass: 100%/2   | Total:  7m 05s | Avg:  3m 32s | Max:  3m 36s
      🟩 GCC7               Pass: 100%/6   | Total:  2h 20m | Avg: 23m 28s | Max: 44m 26s
      🟩 GCC8               Pass: 100%/6   | Total:  2h 19m | Avg: 23m 13s | Max: 43m 13s
      🟩 GCC9               Pass: 100%/6   | Total:  2h 21m | Avg: 23m 34s | Max: 44m 16s
      🟩 GCC10              Pass: 100%/4   | Total:  3h 01m | Avg: 45m 27s | Max: 47m 37s
      🟩 GCC11              Pass: 100%/7   | Total:  3h 09m | Avg: 27m 05s | Max: 45m 18s
      🟩 GCC12              Pass: 100%/4   | Total:  2h 54m | Avg: 43m 40s | Max: 44m 17s
      🟩 GCC13              Pass: 100%/29  | Total:  1d 08h | Avg:  1h 06m | Max:  1h 41m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 17m | Avg: 45m 46s | Max: 46m 50s
      🟥 MSVC14.16          Pass:   0%/1   | Total: 14m 25s | Avg: 14m 25s | Max: 14m 25s
      🟥 MSVC14.29          Pass:   0%/2   | Total: 25m 57s | Avg: 12m 58s | Max: 13m 57s
      🟥 MSVC14.39          Pass:   0%/3   | Total: 38m 05s | Avg: 12m 41s | Max: 13m 01s
    🟨 std
      🟩 11                 Pass: 100%/35  | Total:  1d 02h | Avg: 45m 54s | Max:  1h 31m
      🟨 14                 Pass:  92%/38  | Total:  1d 03h | Avg: 43m 46s | Max:  1h 31m
      🟨 17                 Pass:  94%/38  | Total:  1d 06h | Avg: 47m 29s | Max:  1h 41m
      🟨 20                 Pass:  96%/25  | Total: 23h 53m | Avg: 57m 20s | Max:  1h 32m
    🟨 gpu
      🟨 v100               Pass:  95%/136 | Total:  4d 12h | Avg: 47m 51s | Max:  1h 41m
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 13m 53s | Avg:  4m 37s | Max:  4m 53s
      🟩 90a                Pass: 100%/4   | Total: 40m 53s | Avg: 10m 13s | Max: 14m 20s
    
  • 🟩 thrust: Pass: 100%/122 | Total: 12h 58m | Avg: 6m 22s | Max: 25m 02s | Hits: 99%/20079

    🟩 cpu
      🟩 amd64              Pass: 100%/114 | Total: 12h 23m | Avg:  6m 31s | Max: 25m 02s | Hits:  99%/20079 
      🟩 arm64              Pass: 100%/8   | Total: 34m 44s | Avg:  4m 20s | Max:  4m 53s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 10m | Avg:  4m 40s | Max: 18m 21s | Hits:  99%/2231  
      🟩 11.8               Pass: 100%/3   | Total: 14m 34s | Avg:  4m 51s | Max:  5m 31s
      🟩 12.6               Pass: 100%/104 | Total: 11h 33m | Avg:  6m 40s | Max: 25m 02s | Hits:  99%/17848 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  4m 47s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 10m | Avg:  4m 40s | Max: 18m 21s | Hits:  99%/2231  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 14m 34s | Avg:  4m 51s | Max:  5m 31s
      🟩 nvcc12.6           Pass: 100%/102 | Total: 11h 24m | Avg:  6m 42s | Max: 25m 02s | Hits:  99%/17848 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  4m 47s
      🟩 nvcc               Pass: 100%/120 | Total: 12h 49m | Avg:  6m 24s | Max: 25m 02s | Hits:  99%/20079 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 27m 33s | Avg:  4m 35s | Max:  5m 30s
      🟩 Clang10            Pass: 100%/3   | Total: 17m 02s | Avg:  5m 40s | Max:  6m 05s
      🟩 Clang11            Pass: 100%/4   | Total: 18m 49s | Avg:  4m 42s | Max:  5m 01s
      🟩 Clang12            Pass: 100%/4   | Total: 18m 24s | Avg:  4m 36s | Max:  4m 50s
      🟩 Clang13            Pass: 100%/4   | Total: 18m 13s | Avg:  4m 33s | Max:  5m 09s
      🟩 Clang14            Pass: 100%/4   | Total: 18m 14s | Avg:  4m 33s | Max:  5m 03s
      🟩 Clang15            Pass: 100%/4   | Total: 19m 02s | Avg:  4m 45s | Max:  5m 01s
      🟩 Clang16            Pass: 100%/4   | Total: 18m 31s | Avg:  4m 37s | Max:  4m 44s
      🟩 Clang17            Pass: 100%/4   | Total: 18m 15s | Avg:  4m 33s | Max:  4m 51s
      🟩 Clang18            Pass: 100%/18  | Total:  2h 06m | Avg:  7m 01s | Max: 15m 32s
      🟩 GCC6               Pass: 100%/2   | Total:  6m 51s | Avg:  3m 25s | Max:  3m 33s
      🟩 GCC7               Pass: 100%/6   | Total: 23m 06s | Avg:  3m 51s | Max:  4m 23s
      🟩 GCC8               Pass: 100%/6   | Total: 24m 29s | Avg:  4m 04s | Max:  4m 46s
      🟩 GCC9               Pass: 100%/6   | Total: 24m 27s | Avg:  4m 04s | Max:  5m 02s
      🟩 GCC10              Pass: 100%/4   | Total: 17m 24s | Avg:  4m 21s | Max:  4m 39s
      🟩 GCC11              Pass: 100%/7   | Total: 33m 24s | Avg:  4m 46s | Max:  5m 31s
      🟩 GCC12              Pass: 100%/4   | Total: 19m 11s | Avg:  4m 47s | Max:  5m 08s
      🟩 GCC13              Pass: 100%/20  | Total:  2h 18m | Avg:  6m 54s | Max: 15m 11s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 17m 10s | Avg:  5m 43s | Max:  5m 57s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 18m 21s | Avg: 18m 21s | Max: 18m 21s | Hits:  99%/2231  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 33m 52s | Avg: 16m 56s | Max: 18m 05s | Hits:  99%/4462  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  2h 01m | Avg: 20m 17s | Max: 25m 02s | Hits:  99%/13386 
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total:  5h 00m | Avg:  5m 27s | Max: 15m 32s
      🟩 GCC                Pass: 100%/55  | Total:  4h 46m | Avg:  5m 13s | Max: 15m 11s
      🟩 Intel              Pass: 100%/3   | Total: 17m 10s | Avg:  5m 43s | Max:  5m 57s
      🟩 MSVC               Pass: 100%/9   | Total:  2h 53m | Avg: 19m 19s | Max: 25m 02s | Hits:  99%/20079 
    🟩 gpu
      🟩 v100               Pass: 100%/122 | Total: 12h 58m | Avg:  6m 22s | Max: 25m 02s | Hits:  99%/20079 
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  9h 02m | Avg:  5m 15s | Max: 18m 21s | Hits:  99%/13386 
      🟩 TestCPU            Pass: 100%/11  | Total:  2h 04m | Avg: 11m 18s | Max: 25m 02s | Hits:  99%/6693  
      🟩 TestGPU            Pass: 100%/8   | Total:  1h 51m | Avg: 13m 58s | Max: 15m 32s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 14m 34s | Avg:  4m 51s | Max:  5m 31s
      🟩 90a                Pass: 100%/4   | Total: 15m 40s | Avg:  3m 55s | Max:  4m 15s
    🟩 std
      🟩 11                 Pass: 100%/31  | Total:  2h 26m | Avg:  4m 43s | Max: 10m 45s
      🟩 14                 Pass: 100%/35  | Total:  3h 58m | Avg:  6m 49s | Max: 21m 38s | Hits:  99%/8924  
      🟩 17                 Pass: 100%/34  | Total:  3h 52m | Avg:  6m 50s | Max: 22m 44s | Hits:  99%/6693  
      🟩 20                 Pass: 100%/22  | Total:  2h 40m | Avg:  7m 18s | Max: 25m 02s | Hits:  99%/4462  
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    

👃 Inspect Changes

Modifications in project?

|     | Project             |
|-----|---------------------|
|     | CCCL Infrastructure |
|     | libcu++             |
| +/- | CUB                 |
| +/- | Thrust              |
|     | CUDA Experimental   |
|     | pycuda              |
|     | CUDA C Core Library |

Modifications in project or dependencies?

|     | Project             |
|-----|---------------------|
|     | CCCL Infrastructure |
|     | libcu++             |
| +/- | CUB                 |
| +/- | Thrust              |
|     | CUDA Experimental   |
| +/- | pycuda              |
| +/- | CUDA C Core Library |

🏃‍ Runner counts (total jobs: 259)

| #   | Runner                        |
|-----|-------------------------------|
| 186 | linux-amd64-cpu16             |
| 42  | linux-amd64-gpu-v100-latest-1 |
| 16  | linux-arm64-cpu16             |
| 15  | windows-amd64-cpu16           |

Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Build and benchmark a prototype CUB cub::DeviceFind::FindIf kernel
4 participants