Add support for large num_items to device_select.cuh #1422

Open
7 of 9 tasks
Tracked by #50
elstehle opened this issue Feb 21, 2024 · 13 comments · May be fixed by #2400
@elstehle
Collaborator

elstehle commented Feb 21, 2024

@elstehle
Collaborator Author

cub.bench.select.if.base: signed versus unsigned offset types

[0] Tesla V100-SXM2-32GB

T{ct} OffsetT{ct} Elements{io} Entropy Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^16 1 8.930 us 6.37% 8.984 us 6.42% 0.054 us 0.61% PASS
I8 I32 2^20 1 14.559 us 3.37% 14.431 us 3.13% -0.128 us -0.88% PASS
I8 I32 2^24 1 104.069 us 0.95% 99.623 us 0.98% -4.446 us -4.27% FAIL
I8 I32 2^28 1 1.540 ms 0.50% 1.474 ms 0.50% -66.427 us -4.31% FAIL
I8 I32 2^16 0.544 8.799 us 5.77% 8.847 us 5.62% 0.048 us 0.55% PASS
I8 I32 2^20 0.544 14.434 us 3.36% 14.151 us 3.22% -0.283 us -1.96% PASS
I8 I32 2^24 0.544 96.522 us 0.84% 91.525 us 0.82% -4.997 us -5.18% FAIL
I8 I32 2^28 0.544 1.414 ms 0.50% 1.334 ms 0.50% -79.674 us -5.64% FAIL
I8 I32 2^16 0 8.477 us 5.46% 8.453 us 5.29% -0.023 us -0.28% PASS
I8 I32 2^20 0 13.636 us 3.62% 13.511 us 3.32% -0.125 us -0.92% PASS
I8 I32 2^24 0 87.201 us 0.68% 82.173 us 0.72% -5.028 us -5.77% FAIL
I8 I32 2^28 0 1.245 ms 0.45% 1.161 ms 0.49% -83.970 us -6.74% FAIL
I8 I64 2^16 1 8.770 us 5.82% 8.824 us 5.70% 0.055 us 0.62% PASS
I8 I64 2^20 1 14.659 us 3.37% 14.847 us 3.50% 0.189 us 1.29% PASS
I8 I64 2^24 1 108.105 us 0.73% 108.920 us 0.75% 0.814 us 0.75% FAIL
I8 I64 2^28 1 1.607 ms 0.50% 1.617 ms 0.50% 10.795 us 0.67% FAIL
I8 I64 2^16 0.544 8.773 us 5.82% 8.813 us 5.74% 0.040 us 0.46% PASS
I8 I64 2^20 0.544 14.354 us 3.24% 14.406 us 3.03% 0.052 us 0.36% PASS
I8 I64 2^24 0.544 101.642 us 0.68% 102.324 us 0.70% 0.683 us 0.67% PASS
I8 I64 2^28 0.544 1.506 ms 0.50% 1.512 ms 0.50% 5.677 us 0.38% PASS
I8 I64 2^16 0 8.488 us 5.48% 8.546 us 5.74% 0.058 us 0.69% PASS
I8 I64 2^20 0 13.986 us 3.81% 14.147 us 3.47% 0.161 us 1.15% PASS
I8 I64 2^24 0 91.343 us 0.59% 92.670 us 0.59% 1.327 us 1.45% FAIL
I8 I64 2^28 0 1.308 ms 0.50% 1.326 ms 0.50% 17.942 us 1.37% FAIL
I16 I32 2^16 1 9.279 us 4.37% 8.820 us 5.69% -0.459 us -4.94% FAIL
I16 I32 2^20 1 15.774 us 3.30% 15.618 us 3.03% -0.157 us -0.99% PASS
I16 I32 2^24 1 122.989 us 1.22% 119.690 us 1.18% -3.299 us -2.68% FAIL
I16 I32 2^28 1 1.829 ms 0.52% 1.778 ms 0.56% -50.697 us -2.77% FAIL
I16 I32 2^16 0.544 9.018 us 4.72% 8.902 us 5.44% -0.117 us -1.29% PASS
I16 I32 2^20 0.544 15.707 us 3.28% 15.336 us 2.72% -0.371 us -2.36% PASS
I16 I32 2^24 0.544 112.747 us 1.15% 108.236 us 1.21% -4.512 us -4.00% FAIL
I16 I32 2^28 0.544 1.655 ms 0.50% 1.589 ms 0.50% -66.748 us -4.03% FAIL
I16 I32 2^16 0 8.669 us 5.92% 8.632 us 5.90% -0.036 us -0.42% PASS
I16 I32 2^20 0 15.141 us 3.40% 15.083 us 3.30% -0.058 us -0.39% PASS
I16 I32 2^24 0 92.163 us 0.78% 88.110 us 0.79% -4.053 us -4.40% FAIL
I16 I32 2^28 0 1.291 ms 0.50% 1.220 ms 0.50% -71.447 us -5.53% FAIL
I16 I64 2^16 1 9.331 us 4.01% 9.062 us 4.54% -0.269 us -2.88% PASS
I16 I64 2^20 1 16.097 us 3.19% 16.344 us 2.91% 0.247 us 1.53% PASS
I16 I64 2^24 1 125.353 us 1.04% 126.075 us 1.01% 0.722 us 0.58% PASS
I16 I64 2^28 1 1.863 ms 0.50% 1.872 ms 0.50% 8.985 us 0.48% PASS
I16 I64 2^16 0.544 8.977 us 5.02% 9.088 us 4.37% 0.111 us 1.24% PASS
I16 I64 2^20 0.544 16.029 us 3.34% 15.873 us 3.28% -0.156 us -0.98% PASS
I16 I64 2^24 0.544 115.299 us 0.90% 116.190 us 0.88% 0.891 us 0.77% PASS
I16 I64 2^28 0.544 1.699 ms 0.50% 1.713 ms 0.50% 13.120 us 0.77% FAIL
I16 I64 2^16 0 8.705 us 5.94% 8.801 us 5.77% 0.096 us 1.10% PASS
I16 I64 2^20 0 15.414 us 3.20% 15.579 us 3.07% 0.165 us 1.07% PASS
I16 I64 2^24 0 96.224 us 0.69% 97.288 us 0.69% 1.064 us 1.11% FAIL
I16 I64 2^28 0 1.356 ms 0.50% 1.371 ms 0.50% 15.056 us 1.11% FAIL
I32 I32 2^16 1 8.994 us 4.89% 9.424 us 4.74% 0.430 us 4.78% FAIL
I32 I32 2^20 1 18.774 us 2.70% 18.926 us 2.83% 0.152 us 0.81% PASS
I32 I32 2^24 1 182.142 us 0.77% 182.708 us 0.80% 0.567 us 0.31% PASS
I32 I32 2^28 1 2.796 ms 0.58% 2.800 ms 0.58% 4.156 us 0.15% PASS
I32 I32 2^16 0.544 9.109 us 4.29% 9.000 us 4.99% -0.109 us -1.20% PASS
I32 I32 2^20 0.544 18.963 us 2.86% 18.956 us 2.84% -0.007 us -0.04% PASS
I32 I32 2^24 0.544 151.466 us 0.94% 151.512 us 0.96% 0.046 us 0.03% PASS
I32 I32 2^28 0.544 2.268 ms 0.50% 2.271 ms 0.50% 2.676 us 0.12% PASS
I32 I32 2^16 0 8.807 us 5.82% 8.717 us 5.94% -0.090 us -1.02% PASS
I32 I32 2^20 0 18.128 us 2.80% 18.081 us 2.87% -0.046 us -0.26% PASS
I32 I32 2^24 0 108.646 us 1.01% 108.790 us 1.03% 0.144 us 0.13% PASS
I32 I32 2^28 0 1.496 ms 1.19% 1.497 ms 1.19% 1.699 us 0.11% PASS
I32 I64 2^16 1 9.131 us 4.19% 9.234 us 3.16% 0.103 us 1.13% PASS
I32 I64 2^20 1 19.115 us 3.03% 19.104 us 2.75% -0.011 us -0.06% PASS
I32 I64 2^24 1 184.830 us 0.90% 184.204 us 0.86% -0.626 us -0.34% PASS
I32 I64 2^28 1 2.833 ms 0.59% 2.824 ms 0.57% -8.893 us -0.31% PASS
I32 I64 2^16 0.544 9.129 us 4.30% 9.218 us 3.37% 0.089 us 0.98% PASS
I32 I64 2^20 0.544 19.130 us 4.56% 19.192 us 2.78% 0.062 us 0.32% PASS
I32 I64 2^24 0.544 154.892 us 1.11% 153.039 us 0.99% -1.852 us -1.20% FAIL
I32 I64 2^28 0.544 2.328 ms 0.50% 2.297 ms 0.50% -30.851 us -1.33% FAIL
I32 I64 2^16 0 8.831 us 9.11% 8.880 us 5.58% 0.049 us 0.55% PASS
I32 I64 2^20 0 18.127 us 4.29% 18.174 us 2.76% 0.046 us 0.26% PASS
I32 I64 2^24 0 114.131 us 1.04% 112.163 us 0.98% -1.969 us -1.72% FAIL
I32 I64 2^28 0 1.592 ms 0.99% 1.551 ms 1.09% -40.839 us -2.57% FAIL
I64 I32 2^16 1 9.868 us 7.14% 9.986 us 4.67% 0.118 us 1.19% PASS
I64 I32 2^20 1 29.292 us 2.98% 29.363 us 2.40% 0.071 us 0.24% PASS
I64 I32 2^24 1 348.962 us 0.50% 348.794 us 0.50% -0.168 us -0.05% PASS
I64 I32 2^28 1 5.458 ms 0.50% 5.456 ms 0.50% -1.461 us -0.03% PASS
I64 I32 2^16 0.544 10.521 us 6.66% 10.416 us 4.33% -0.106 us -1.01% PASS
I64 I32 2^20 0.544 27.335 us 2.77% 27.205 us 2.22% -0.130 us -0.48% PASS
I64 I32 2^24 0.544 279.669 us 0.61% 279.501 us 0.60% -0.167 us -0.06% PASS
I64 I32 2^28 0.544 4.311 ms 0.50% 4.311 ms 0.50% 0.102 us 0.00% PASS
I64 I32 2^16 0 9.751 us 7.74% 9.680 us 5.36% -0.070 us -0.72% PASS
I64 I32 2^20 0 26.911 us 2.95% 26.729 us 2.39% -0.182 us -0.68% PASS
I64 I32 2^24 0 189.068 us 0.91% 188.791 us 0.86% -0.277 us -0.15% PASS
I64 I32 2^28 0 2.765 ms 0.95% 2.763 ms 0.96% -1.896 us -0.07% PASS
I64 I64 2^16 1 10.448 us 6.37% 10.422 us 4.26% -0.026 us -0.25% PASS
I64 I64 2^20 1 29.609 us 2.86% 29.594 us 2.39% -0.014 us -0.05% PASS
I64 I64 2^24 1 350.438 us 0.54% 350.396 us 0.50% -0.041 us -0.01% PASS
I64 I64 2^28 1 5.473 ms 0.50% 5.476 ms 0.50% 2.974 us 0.05% PASS
I64 I64 2^16 0.544 10.174 us 7.09% 10.107 us 4.57% -0.066 us -0.65% PASS
I64 I64 2^20 0.544 27.675 us 2.91% 27.617 us 2.14% -0.059 us -0.21% PASS
I64 I64 2^24 0.544 281.827 us 0.63% 281.893 us 0.60% 0.067 us 0.02% PASS
I64 I64 2^28 0.544 4.341 ms 0.50% 4.343 ms 0.50% 1.952 us 0.04% PASS
I64 I64 2^16 0 10.170 us 6.47% 10.243 us 4.08% 0.074 us 0.72% PASS
I64 I64 2^20 0 27.191 us 3.03% 27.292 us 2.39% 0.101 us 0.37% PASS
I64 I64 2^24 0 192.592 us 0.85% 193.404 us 0.83% 0.812 us 0.42% PASS
I64 I64 2^28 0 2.823 ms 0.91% 2.839 ms 0.90% 15.763 us 0.56% PASS
I128 I32 2^16 1 12.180 us 5.89% 12.219 us 3.21% 0.039 us 0.32% PASS
I128 I32 2^20 1 39.499 us 2.27% 39.611 us 1.50% 0.112 us 0.28% PASS
I128 I32 2^24 1 361.077 us 0.64% 362.579 us 0.61% 1.502 us 0.42% PASS
I128 I32 2^28 1 5.524 ms 0.65% 5.550 ms 0.64% 25.811 us 0.47% PASS
I128 I32 2^16 0.544 12.135 us 5.42% 12.280 us 3.33% 0.145 us 1.20% PASS
I128 I32 2^20 0.544 39.472 us 1.92% 39.598 us 1.51% 0.127 us 0.32% PASS
I128 I32 2^24 0.544 361.046 us 0.65% 362.558 us 0.60% 1.512 us 0.42% PASS
I128 I32 2^28 0.544 5.524 ms 0.65% 5.550 ms 0.64% 25.909 us 0.47% PASS
I128 I32 2^16 0 12.166 us 5.36% 12.219 us 3.33% 0.054 us 0.44% PASS
I128 I32 2^20 0 39.407 us 1.91% 39.574 us 1.49% 0.167 us 0.42% PASS
I128 I32 2^24 0 360.967 us 0.67% 362.596 us 0.59% 1.629 us 0.45% PASS
I128 I32 2^28 0 5.524 ms 0.65% 5.550 ms 0.64% 25.893 us 0.47% PASS
I128 I64 2^16 1 11.785 us 6.30% 11.779 us 4.43% -0.006 us -0.05% PASS
I128 I64 2^20 1 40.619 us 1.98% 40.602 us 1.62% -0.016 us -0.04% PASS
I128 I64 2^24 1 401.233 us 0.51% 401.529 us 0.50% 0.296 us 0.07% PASS
I128 I64 2^28 1 6.193 ms 0.50% 6.202 ms 0.50% 9.428 us 0.15% PASS
I128 I64 2^16 0.544 11.804 us 6.61% 11.827 us 4.41% 0.024 us 0.20% PASS
I128 I64 2^20 0.544 40.619 us 1.94% 40.582 us 1.60% -0.037 us -0.09% PASS
I128 I64 2^24 0.544 401.259 us 0.50% 401.523 us 0.50% 0.264 us 0.07% PASS
I128 I64 2^28 0.544 6.193 ms 0.50% 6.202 ms 0.50% 9.147 us 0.15% PASS
I128 I64 2^16 0 11.730 us 6.34% 11.789 us 4.45% 0.059 us 0.50% PASS
I128 I64 2^20 0 40.562 us 1.99% 40.597 us 1.61% 0.035 us 0.09% PASS
I128 I64 2^24 0 401.268 us 0.51% 401.452 us 0.50% 0.184 us 0.05% PASS
I128 I64 2^28 0 6.193 ms 0.50% 6.202 ms 0.50% 9.374 us 0.15% PASS
F32 I32 2^16 1 9.134 us 7.85% 9.137 us 4.31% 0.003 us 0.03% PASS
F32 I32 2^20 1 18.950 us 3.76% 18.986 us 2.81% 0.036 us 0.19% PASS
F32 I32 2^24 1 182.801 us 0.85% 183.047 us 0.82% 0.247 us 0.13% PASS
F32 I32 2^28 1 2.940 ms 0.67% 2.944 ms 0.67% 3.611 us 0.12% PASS
F32 I32 2^16 0.544 8.929 us 7.87% 8.858 us 5.62% -0.072 us -0.80% PASS
F32 I32 2^20 0.544 18.370 us 4.03% 18.467 us 2.77% 0.097 us 0.53% PASS
F32 I32 2^24 0.544 125.099 us 1.19% 125.454 us 1.08% 0.356 us 0.28% PASS
F32 I32 2^28 0.544 1.808 ms 0.69% 1.813 ms 0.69% 4.483 us 0.25% PASS
F32 I32 2^16 0 8.788 us 8.72% 8.752 us 5.89% -0.036 us -0.41% PASS
F32 I32 2^20 0 18.181 us 3.77% 18.225 us 2.71% 0.044 us 0.24% PASS
F32 I32 2^24 0 108.643 us 1.13% 109.101 us 1.02% 0.458 us 0.42% PASS
F32 I32 2^28 0 1.496 ms 1.19% 1.504 ms 1.17% 7.968 us 0.53% PASS
F32 I64 2^16 1 9.351 us 6.98% 9.285 us 3.20% -0.065 us -0.70% PASS
F32 I64 2^20 1 19.256 us 4.11% 19.305 us 2.79% 0.050 us 0.26% PASS
F32 I64 2^24 1 185.170 us 0.97% 184.398 us 0.86% -0.772 us -0.42% PASS
F32 I64 2^28 1 2.957 ms 0.66% 2.953 ms 0.67% -4.099 us -0.14% PASS
F32 I64 2^16 0.544 9.046 us 7.47% 9.006 us 5.02% -0.040 us -0.44% PASS
F32 I64 2^20 0.544 18.534 us 3.57% 18.536 us 2.63% 0.001 us 0.01% PASS
F32 I64 2^24 0.544 129.131 us 1.12% 128.212 us 1.01% -0.918 us -0.71% PASS
F32 I64 2^28 0.544 1.872 ms 0.72% 1.863 ms 0.59% -8.846 us -0.47% PASS
F32 I64 2^16 0 8.874 us 6.76% 8.844 us 5.66% -0.029 us -0.33% PASS
F32 I64 2^20 0 18.164 us 2.87% 18.125 us 2.82% -0.039 us -0.22% PASS
F32 I64 2^24 0 114.475 us 0.96% 112.431 us 0.99% -2.044 us -1.79% FAIL
F32 I64 2^28 0 1.596 ms 0.99% 1.557 ms 1.09% -38.802 us -2.43% FAIL
F64 I32 2^16 1 10.226 us 4.18% 10.275 us 4.19% 0.049 us 0.48% PASS
F64 I32 2^20 1 29.350 us 2.46% 29.400 us 2.46% 0.050 us 0.17% PASS
F64 I32 2^24 1 348.778 us 0.50% 349.010 us 0.50% 0.232 us 0.07% PASS
F64 I32 2^28 1 5.456 ms 0.50% 5.459 ms 0.50% 2.836 us 0.05% PASS
F64 I32 2^16 0.544 9.717 us 5.36% 9.791 us 5.36% 0.074 us 0.76% PASS
F64 I32 2^20 0.544 26.550 us 2.29% 26.659 us 2.25% 0.109 us 0.41% PASS
F64 I32 2^24 0.544 222.605 us 0.72% 222.988 us 0.73% 0.383 us 0.17% PASS
F64 I32 2^28 0.544 3.348 ms 0.51% 3.353 ms 0.50% 4.932 us 0.15% PASS
F64 I32 2^16 0 9.652 us 5.30% 9.726 us 5.36% 0.074 us 0.76% PASS
F64 I32 2^20 0 26.766 us 2.41% 26.859 us 2.38% 0.093 us 0.35% PASS
F64 I32 2^24 0 188.788 us 0.87% 189.233 us 0.89% 0.445 us 0.24% PASS
F64 I32 2^28 0 2.761 ms 0.96% 2.769 ms 0.95% 7.647 us 0.28% PASS
F64 I64 2^16 1 10.611 us 4.71% 10.536 us 4.47% -0.076 us -0.71% PASS
F64 I64 2^20 1 29.736 us 2.44% 29.825 us 2.47% 0.089 us 0.30% PASS
F64 I64 2^24 1 350.297 us 0.50% 350.369 us 0.50% 0.072 us 0.02% PASS
F64 I64 2^28 1 5.472 ms 0.50% 5.473 ms 0.50% 0.601 us 0.01% PASS
F64 I64 2^16 0.544 10.331 us 6.59% 10.223 us 4.05% -0.109 us -1.05% PASS
F64 I64 2^20 0.544 26.938 us 2.84% 26.847 us 2.22% -0.091 us -0.34% PASS
F64 I64 2^24 0.544 224.604 us 0.74% 224.608 us 0.71% 0.004 us 0.00% PASS
F64 I64 2^28 0.544 3.378 ms 0.50% 3.380 ms 0.50% 1.763 us 0.05% PASS
F64 I64 2^16 0 10.189 us 7.05% 10.105 us 4.44% -0.084 us -0.82% PASS
F64 I64 2^20 0 27.264 us 2.89% 27.206 us 2.36% -0.058 us -0.21% PASS
F64 I64 2^24 0 192.524 us 0.91% 192.890 us 0.83% 0.366 us 0.19% PASS
F64 I64 2^28 0 2.820 ms 0.90% 2.829 ms 0.90% 8.648 us 0.31% PASS

@elstehle
Collaborator Author

elstehle commented Feb 21, 2024

Seeing some noticeable performance drops for:

  • i64 elements with 64-bit offset type
  • f64 elements with 64-bit offset type
  • i128 elements with 32-bit offset type

cub.bench.select.flagged.base: signed versus unsigned offset types

[0] Tesla V100-SXM2-32GB

T{ct} OffsetT{ct} Elements{io} Entropy Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^16 1 8.820 us 6.44% 8.876 us 6.24% 0.056 us 0.64% PASS
I8 I32 2^20 1 15.357 us 2.70% 15.417 us 2.69% 0.059 us 0.39% PASS
I8 I32 2^24 1 111.282 us 0.93% 111.720 us 0.91% 0.438 us 0.39% PASS
I8 I32 2^28 1 1.655 ms 0.50% 1.659 ms 0.50% 3.890 us 0.24% PASS
I8 I32 2^16 0.544 8.710 us 5.91% 8.810 us 5.78% 0.099 us 1.14% PASS
I8 I32 2^20 0.544 15.212 us 3.09% 15.213 us 3.01% 0.001 us 0.01% PASS
I8 I32 2^24 0.544 108.060 us 0.98% 108.216 us 0.98% 0.156 us 0.14% PASS
I8 I32 2^28 0.544 1.576 ms 0.64% 1.576 ms 0.64% 0.018 us 0.00% PASS
I8 I32 2^16 0 8.417 us 5.08% 8.483 us 5.46% 0.066 us 0.78% PASS
I8 I32 2^20 0 14.724 us 3.44% 14.796 us 3.49% 0.071 us 0.48% PASS
I8 I32 2^24 0 93.078 us 0.74% 92.757 us 0.72% -0.321 us -0.34% PASS
I8 I32 2^28 0 1.314 ms 0.10% 1.310 ms 0.11% -3.783 us -0.29% FAIL
I8 I64 2^16 1 8.813 us 5.74% 8.809 us 5.79% -0.004 us -0.05% PASS
I8 I64 2^20 1 16.088 us 3.21% 16.115 us 3.12% 0.026 us 0.16% PASS
I8 I64 2^24 1 119.316 us 0.75% 120.151 us 0.73% 0.835 us 0.70% PASS
I8 I64 2^28 1 1.789 ms 0.50% 1.801 ms 0.50% 11.664 us 0.65% FAIL
I8 I64 2^16 0.544 8.819 us 5.72% 8.846 us 5.62% 0.027 us 0.31% PASS
I8 I64 2^20 0.544 16.069 us 3.18% 16.043 us 3.16% -0.026 us -0.16% PASS
I8 I64 2^24 0.544 116.079 us 0.84% 117.019 us 0.82% 0.940 us 0.81% PASS
I8 I64 2^28 0.544 1.711 ms 0.50% 1.723 ms 0.50% 11.844 us 0.69% FAIL
I8 I64 2^16 0 8.585 us 5.82% 8.603 us 5.86% 0.018 us 0.21% PASS
I8 I64 2^20 0 15.388 us 2.98% 15.441 us 3.01% 0.054 us 0.35% PASS
I8 I64 2^24 0 101.388 us 0.58% 102.082 us 0.60% 0.694 us 0.68% FAIL
I8 I64 2^28 0 1.449 ms 0.09% 1.461 ms 0.09% 11.505 us 0.79% FAIL
I16 I32 2^16 1 8.806 us 5.79% 8.805 us 5.80% -0.001 us -0.02% PASS
I16 I32 2^20 1 16.388 us 2.82% 16.433 us 2.96% 0.044 us 0.27% PASS
I16 I32 2^24 1 135.161 us 0.96% 135.245 us 0.92% 0.084 us 0.06% PASS
I16 I32 2^28 1 2.039 ms 0.50% 2.041 ms 0.50% 2.161 us 0.11% PASS
I16 I32 2^16 0.544 8.731 us 5.92% 8.854 us 5.62% 0.123 us 1.41% PASS
I16 I32 2^20 0.544 16.014 us 3.30% 15.970 us 3.24% -0.044 us -0.28% PASS
I16 I32 2^24 0.544 130.677 us 1.07% 130.215 us 1.10% -0.461 us -0.35% PASS
I16 I32 2^28 0.544 1.958 ms 0.57% 1.952 ms 0.56% -6.222 us -0.32% PASS
I16 I32 2^16 0 8.476 us 5.43% 8.560 us 5.77% 0.085 us 1.00% PASS
I16 I32 2^20 0 15.868 us 3.29% 15.895 us 3.30% 0.027 us 0.17% PASS
I16 I32 2^24 0 101.776 us 0.81% 101.793 us 0.82% 0.017 us 0.02% PASS
I16 I32 2^28 0 1.422 ms 0.14% 1.423 ms 0.14% 0.381 us 0.03% PASS
I16 I64 2^16 1 8.847 us 5.62% 8.997 us 4.93% 0.149 us 1.69% PASS
I16 I64 2^20 1 17.185 us 3.16% 16.984 us 3.22% -0.201 us -1.17% PASS
I16 I64 2^24 1 140.330 us 0.81% 137.644 us 0.80% -2.685 us -1.91% FAIL
I16 I64 2^28 1 2.120 ms 0.50% 2.078 ms 0.50% -41.668 us -1.97% FAIL
I16 I64 2^16 0.544 8.776 us 5.84% 8.942 us 5.22% 0.167 us 1.90% PASS
I16 I64 2^20 0.544 16.810 us 3.29% 16.431 us 2.72% -0.379 us -2.25% PASS
I16 I64 2^24 0.544 136.831 us 0.90% 134.197 us 0.95% -2.634 us -1.93% FAIL
I16 I64 2^28 0.544 2.045 ms 0.52% 2.000 ms 0.54% -44.590 us -2.18% FAIL
I16 I64 2^16 0 8.637 us 5.91% 8.665 us 5.96% 0.028 us 0.32% PASS
I16 I64 2^20 0 16.245 us 3.05% 16.169 us 3.20% -0.075 us -0.46% PASS
I16 I64 2^24 0 108.601 us 0.73% 105.566 us 0.77% -3.035 us -2.79% FAIL
I16 I64 2^28 0 1.538 ms 0.10% 1.486 ms 0.12% -51.870 us -3.37% FAIL
I32 I32 2^16 1 9.143 us 4.47% 9.240 us 4.38% 0.097 us 1.06% PASS
I32 I32 2^20 1 20.295 us 3.35% 20.258 us 3.32% -0.037 us -0.18% PASS
I32 I32 2^24 1 203.477 us 0.63% 203.377 us 0.61% -0.100 us -0.05% PASS
I32 I32 2^28 1 3.137 ms 0.54% 3.138 ms 0.54% 1.604 us 0.05% PASS
I32 I32 2^16 0.544 9.208 us 4.74% 9.293 us 4.82% 0.084 us 0.92% PASS
I32 I32 2^20 0.544 20.517 us 3.75% 20.583 us 3.82% 0.066 us 0.32% PASS
I32 I32 2^24 0.544 180.354 us 0.89% 180.218 us 0.89% -0.136 us -0.08% PASS
I32 I32 2^28 0.544 2.736 ms 0.50% 2.734 ms 0.50% -1.667 us -0.06% PASS
I32 I32 2^16 0 8.902 us 5.67% 8.897 us 5.65% -0.004 us -0.05% PASS
I32 I32 2^20 0 19.585 us 2.80% 19.577 us 2.78% -0.007 us -0.04% PASS
I32 I32 2^24 0 126.788 us 0.72% 126.793 us 0.71% 0.005 us 0.00% PASS
I32 I32 2^28 0 1.784 ms 0.13% 1.785 ms 0.13% 0.470 us 0.03% PASS
I32 I64 2^16 1 9.177 us 4.60% 9.261 us 4.32% 0.084 us 0.92% PASS
I32 I64 2^20 1 20.962 us 4.07% 20.697 us 3.31% -0.265 us -1.26% PASS
I32 I64 2^24 1 205.537 us 0.61% 204.144 us 0.59% -1.393 us -0.68% FAIL
I32 I64 2^28 1 3.176 ms 0.52% 3.149 ms 0.53% -27.667 us -0.87% FAIL
I32 I64 2^16 0.544 9.108 us 4.98% 10.090 us 4.55% 0.982 us 10.78% FAIL
I32 I64 2^20 0.544 21.098 us 4.47% 21.086 us 3.71% -0.012 us -0.05% PASS
I32 I64 2^24 0.544 183.903 us 0.86% 181.918 us 0.81% -1.985 us -1.08% FAIL
I32 I64 2^28 0.544 2.796 ms 0.50% 2.762 ms 0.50% -34.490 us -1.23% FAIL
I32 I64 2^16 0 8.927 us 5.71% 8.970 us 5.54% 0.042 us 0.47% PASS
I32 I64 2^20 0 19.952 us 3.08% 19.882 us 2.93% -0.069 us -0.35% PASS
I32 I64 2^24 0 131.903 us 0.67% 129.572 us 0.69% -2.330 us -1.77% FAIL
I32 I64 2^28 0 1.869 ms 0.12% 1.829 ms 0.11% -39.822 us -2.13% FAIL
I64 I32 2^16 1 10.117 us 5.11% 9.943 us 4.91% -0.174 us -1.72% PASS
I64 I32 2^20 1 31.058 us 2.49% 31.013 us 2.50% -0.045 us -0.14% PASS
I64 I32 2^24 1 370.652 us 0.48% 370.409 us 0.50% -0.243 us -0.07% PASS
I64 I32 2^28 1 5.805 ms 0.50% 5.804 ms 0.50% -1.308 us -0.02% PASS
I64 I32 2^16 0.544 9.993 us 4.96% 9.940 us 4.98% -0.053 us -0.53% PASS
I64 I32 2^20 0.544 28.720 us 2.21% 28.642 us 2.13% -0.078 us -0.27% PASS
I64 I32 2^24 0.544 306.502 us 0.65% 306.278 us 0.62% -0.224 us -0.07% PASS
I64 I32 2^28 0.544 4.752 ms 0.50% 4.752 ms 0.50% 0.217 us 0.00% PASS
I64 I32 2^16 0 9.705 us 5.35% 9.727 us 5.34% 0.022 us 0.23% PASS
I64 I32 2^20 0 28.156 us 2.17% 28.144 us 2.13% -0.011 us -0.04% PASS
I64 I32 2^24 0 207.893 us 0.53% 207.850 us 0.52% -0.043 us -0.02% PASS
I64 I32 2^28 0 3.068 ms 0.13% 3.070 ms 0.11% 1.659 us 0.05% PASS
I64 I64 2^16 1 10.227 us 4.04% 10.403 us 4.00% 0.176 us 1.72% PASS
I64 I64 2^20 1 31.023 us 2.35% 31.654 us 2.12% 0.630 us 2.03% PASS
I64 I64 2^24 1 372.213 us 0.48% 379.513 us 0.54% 7.299 us 1.96% FAIL
I64 I64 2^28 1 5.828 ms 0.50% 5.918 ms 0.50% 90.110 us 1.55% FAIL
I64 I64 2^16 0.544 10.179 us 4.23% 10.324 us 4.19% 0.146 us 1.43% PASS
I64 I64 2^20 0.544 28.675 us 2.16% 29.474 us 3.23% 0.799 us 2.79% FAIL
I64 I64 2^24 0.544 308.244 us 0.62% 316.340 us 0.71% 8.096 us 2.63% FAIL
I64 I64 2^28 0.544 4.782 ms 0.50% 4.888 ms 0.50% 105.948 us 2.22% FAIL
I64 I64 2^16 0 10.014 us 4.63% 10.098 us 6.74% 0.084 us 0.84% PASS
I64 I64 2^20 0 28.304 us 2.15% 28.871 us 2.62% 0.567 us 2.00% PASS
I64 I64 2^24 0 209.709 us 0.45% 218.592 us 0.55% 8.883 us 4.24% FAIL
I64 I64 2^28 0 3.098 ms 0.10% 3.246 ms 0.12% 147.642 us 4.77% FAIL
I128 I32 2^16 1 12.393 us 3.27% 12.385 us 5.18% -0.009 us -0.07% PASS
I128 I32 2^20 1 52.571 us 1.33% 54.041 us 1.58% 1.469 us 2.80% FAIL
I128 I32 2^24 1 718.611 us 0.33% 747.554 us 0.40% 28.943 us 4.03% FAIL
I128 I32 2^28 1 11.388 ms 0.50% 11.864 ms 0.50% 475.777 us 4.18% FAIL
I128 I32 2^16 0.544 12.433 us 3.81% 12.629 us 6.03% 0.196 us 1.58% PASS
I128 I32 2^20 0.544 46.625 us 1.79% 47.986 us 2.15% 1.361 us 2.92% FAIL
I128 I32 2^24 0.544 581.132 us 0.55% 619.256 us 0.60% 38.124 us 6.56% FAIL
I128 I32 2^28 0.544 9.150 ms 0.62% 9.781 ms 0.50% 631.109 us 6.90% FAIL
I128 I32 2^16 0 12.118 us 3.82% 12.048 us 6.14% -0.070 us -0.58% PASS
I128 I32 2^20 0 41.273 us 1.47% 41.965 us 1.94% 0.692 us 1.68% FAIL
I128 I32 2^24 0 382.251 us 0.32% 428.911 us 0.36% 46.660 us 12.21% FAIL
I128 I32 2^28 0 5.846 ms 0.09% 6.611 ms 0.09% 765.571 us 13.10% FAIL
I128 I64 2^16 1 11.925 us 4.32% 12.051 us 5.92% 0.127 us 1.06% PASS
I128 I64 2^20 1 53.803 us 1.27% 53.840 us 1.65% 0.037 us 0.07% PASS
I128 I64 2^24 1 738.681 us 0.40% 738.197 us 0.38% -0.484 us -0.07% PASS
I128 I64 2^28 1 11.703 ms 0.50% 11.698 ms 0.50% -4.917 us -0.04% PASS
I128 I64 2^16 0.544 12.058 us 4.33% 12.176 us 6.14% 0.117 us 0.97% PASS
I128 I64 2^20 0.544 47.704 us 1.75% 47.740 us 2.10% 0.036 us 0.07% PASS
I128 I64 2^24 0.544 607.297 us 0.64% 606.423 us 0.64% -0.874 us -0.14% PASS
I128 I64 2^28 0.544 9.583 ms 0.50% 9.576 ms 0.50% -7.073 us -0.07% PASS
I128 I64 2^16 0 11.783 us 4.46% 11.698 us 7.18% -0.085 us -0.72% PASS
I128 I64 2^20 0 42.010 us 1.54% 41.931 us 1.98% -0.079 us -0.19% PASS
I128 I64 2^24 0 418.861 us 0.33% 418.295 us 0.36% -0.566 us -0.14% PASS
I128 I64 2^28 0 6.457 ms 0.08% 6.451 ms 0.08% -6.630 us -0.10% FAIL
F32 I32 2^16 1 9.203 us 4.37% 9.199 us 8.07% -0.004 us -0.05% PASS
F32 I32 2^20 1 20.471 us 3.35% 20.450 us 4.37% -0.021 us -0.10% PASS
F32 I32 2^24 1 203.478 us 0.63% 203.665 us 0.75% 0.187 us 0.09% PASS
F32 I32 2^28 1 3.137 ms 0.54% 3.138 ms 0.54% 1.635 us 0.05% PASS
F32 I32 2^16 0.544 9.167 us 4.75% 9.313 us 9.39% 0.146 us 1.60% PASS
F32 I32 2^20 0.544 20.520 us 3.77% 20.558 us 4.95% 0.038 us 0.19% PASS
F32 I32 2^24 0.544 180.348 us 0.87% 180.288 us 0.94% -0.060 us -0.03% PASS
F32 I32 2^28 0.544 2.736 ms 0.50% 2.734 ms 0.50% -1.678 us -0.06% PASS
F32 I32 2^16 0 8.871 us 5.96% 8.968 us 9.46% 0.097 us 1.09% PASS
F32 I32 2^20 0 19.514 us 2.73% 19.576 us 4.06% 0.061 us 0.31% PASS
F32 I32 2^24 0 126.795 us 0.71% 126.776 us 0.93% -0.019 us -0.01% PASS
F32 I32 2^28 0 1.784 ms 0.13% 1.785 ms 0.13% 0.769 us 0.04% PASS
F32 I64 2^16 1 9.175 us 4.59% 9.424 us 8.57% 0.249 us 2.72% PASS
F32 I64 2^20 1 20.943 us 4.00% 21.025 us 4.57% 0.082 us 0.39% PASS
F32 I64 2^24 1 205.801 us 0.62% 204.511 us 0.67% -1.290 us -0.63% FAIL
F32 I64 2^28 1 3.176 ms 0.52% 3.149 ms 0.53% -27.577 us -0.87% FAIL
F32 I64 2^16 0.544 9.157 us 4.83% 9.431 us 8.67% 0.274 us 2.99% PASS
F32 I64 2^20 0.544 21.073 us 4.45% 21.065 us 4.87% -0.007 us -0.03% PASS
F32 I64 2^24 0.544 184.207 us 0.85% 182.129 us 0.97% -2.077 us -1.13% FAIL
F32 I64 2^28 0.544 2.796 ms 0.50% 2.761 ms 0.50% -34.517 us -1.23% FAIL
F32 I64 2^16 0 8.998 us 5.44% 9.086 us 9.59% 0.087 us 0.97% PASS
F32 I64 2^20 0 20.073 us 3.05% 19.984 us 4.29% -0.089 us -0.44% PASS
F32 I64 2^24 0 131.995 us 0.69% 129.567 us 0.89% -2.428 us -1.84% FAIL
F32 I64 2^28 0 1.869 ms 0.12% 1.829 ms 0.13% -39.656 us -2.12% FAIL
F64 I32 2^16 1 10.246 us 4.19% 10.128 us 7.23% -0.118 us -1.15% PASS
F64 I32 2^20 1 31.172 us 2.48% 31.335 us 3.38% 0.163 us 0.52% PASS
F64 I32 2^24 1 370.658 us 0.49% 370.547 us 0.50% -0.111 us -0.03% PASS
F64 I32 2^28 1 5.805 ms 0.50% 5.804 ms 0.50% -1.364 us -0.02% PASS
F64 I32 2^16 0.544 10.071 us 5.15% 10.145 us 7.97% 0.074 us 0.74% PASS
F64 I32 2^20 0.544 28.678 us 2.15% 28.797 us 3.05% 0.119 us 0.42% PASS
F64 I32 2^24 0.544 306.363 us 0.63% 306.311 us 0.64% -0.053 us -0.02% PASS
F64 I32 2^28 0.544 4.752 ms 0.50% 4.752 ms 0.50% 0.597 us 0.01% PASS
F64 I32 2^16 0 9.730 us 5.38% 9.759 us 8.32% 0.029 us 0.29% PASS
F64 I32 2^20 0 28.192 us 2.19% 28.126 us 2.93% -0.066 us -0.23% PASS
F64 I32 2^24 0 207.864 us 0.52% 207.919 us 0.65% 0.056 us 0.03% PASS
F64 I32 2^28 0 3.068 ms 0.14% 3.069 ms 0.12% 1.558 us 0.05% PASS
F64 I64 2^16 1 10.393 us 3.97% 10.634 us 7.92% 0.240 us 2.31% PASS
F64 I64 2^20 1 31.128 us 2.36% 31.797 us 2.74% 0.669 us 2.15% PASS
F64 I64 2^24 1 372.297 us 0.47% 379.746 us 0.61% 7.449 us 2.00% FAIL
F64 I64 2^28 1 5.828 ms 0.50% 5.918 ms 0.50% 89.889 us 1.54% FAIL
F64 I64 2^16 0.544 10.372 us 4.66% 10.526 us 7.99% 0.154 us 1.48% PASS
F64 I64 2^20 0.544 28.620 us 2.10% 29.502 us 3.09% 0.882 us 3.08% FAIL
F64 I64 2^24 0.544 308.197 us 0.63% 316.426 us 0.70% 8.230 us 2.67% FAIL
F64 I64 2^28 0.544 4.782 ms 0.50% 4.887 ms 0.50% 105.565 us 2.21% FAIL
F64 I64 2^16 0 9.998 us 4.70% 10.204 us 7.74% 0.205 us 2.05% PASS
F64 I64 2^20 0 28.309 us 2.10% 28.931 us 3.21% 0.622 us 2.20% FAIL
F64 I64 2^24 0 209.813 us 0.48% 218.718 us 0.57% 8.905 us 4.24% FAIL
F64 I64 2^28 0 3.098 ms 0.09% 3.245 ms 0.13% 147.587 us 4.76% FAIL

@elstehle
Collaborator Author

elstehle commented Apr 3, 2024

Currently blocked by #1454.

Turns out that there's some performance degradation from simply moving DeviceSelect to use choose_offset_t (see results above). Similarly, there's a performance downside of as much as 50% if we were to use i32 and i64 offset types here.

Given there's no easy choice for the offset type here, we want to revisit #1454 and come to a conclusion on a broader approach to offset type handling first, before continuing on this endeavour.
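
To make the offset-type discussion concrete, here is a minimal sketch of the idea behind picking an offset type from the type of num_items. It assumes the helper maps narrow integer types to a 32-bit unsigned offset and wider ones to a 64-bit unsigned offset. The name `choose_offset_sketch_t` and the 4-byte threshold are illustrative assumptions, not CUB's actual definition.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for an offset-type chooser (not CUB's definition).
// Integral types of up to 4 bytes map to a 32-bit unsigned offset,
// anything wider maps to a 64-bit unsigned offset.
template <typename NumItemsT>
struct choose_offset_sketch
{
  static_assert(std::is_integral<NumItemsT>::value, "NumItemsT must be an integral type");
  using type =
    typename std::conditional<sizeof(NumItemsT) <= 4, std::uint32_t, std::uint64_t>::type;
};

template <typename NumItemsT>
using choose_offset_sketch_t = typename choose_offset_sketch<NumItemsT>::type;

// Width in bits of the offset type chosen for a given num_items type.
template <typename NumItemsT>
constexpr int offset_bits()
{
  return static_cast<int>(sizeof(choose_offset_sketch_t<NumItemsT>) * 8);
}
```

With a chooser like this, the kernel instantiation follows whatever type the caller passes for num_items, which is why the signed-versus-unsigned and 32-versus-64-bit comparisons above matter.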

@bhack

bhack commented Apr 30, 2024

We have some tickets potentially related to this in PyTorch, like:
pytorch/pytorch#51871

@bhack

bhack commented May 6, 2024

Do you have an ETA for this?

@jrhemstad
Collaborator

Hey @bhack, this is something we're actively working on. Are there other specific algorithms that you're interested in?

@bhack

bhack commented May 6, 2024

Personally, I hit this specific one through the underlying nonzero implementation, as have other PyTorch users/developers.
I don't know if @ezyang has a more complete overview of other related priorities in the PyTorch context.

@bhack

bhack commented May 6, 2024

For nonzero, since sum already seems to be covered in the table at:
#50 (comment)

cub::DeviceSelect::Flagged is the only one still needed for large N:

    cub::DeviceSelect::Flagged(nullptr, temp_storage_bytes, counting_itr, itr,
      out_temp.mutable_data_ptr<int64_t>(), (int*)num_nonzeros.get(), N, stream);
    temp_storage = allocator.allocate(temp_storage_bytes);
    cub::DeviceSelect::Flagged(temp_storage.get(), temp_storage_bytes, counting_itr, itr,
      out_temp.mutable_data_ptr<int64_t>(), (int*)num_nonzeros.get(), N, stream);
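
To illustrate why large N is a problem for the call above: assuming num_items is taken as a 32-bit signed int (the limitation this issue is about), any N above INT_MAX cannot be represented in that parameter. A caller-side guard could look like the sketch below. `fits_in_int32` is a hypothetical helper, not part of CUB or PyTorch.

```cpp
#include <cstdint>
#include <limits>

// Returns true when n can be passed as a 32-bit signed num_items
// without being truncated or wrapping to a negative value.
inline bool fits_in_int32(std::int64_t n)
{
  return n >= 0 && n <= std::numeric_limits<std::int32_t>::max();
}
```

A caller would check `fits_in_int32(N)` before the two Flagged calls and fall back to some other path (or report an error) when it returns false.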

@bhack

bhack commented Aug 2, 2024

@elstehle We are having another problem related to this with the just-released (but popular) model by Meta, SAM2:
facebookresearch/segment-anything-2#44

Any progress on this?
Basically, PyTorch nonzero ops rely on this.

@elstehle
Collaborator Author

elstehle commented Aug 2, 2024

Thank you for letting us know that this came up again, in another, very recent model, @bhack. We understand that this is of great importance to the community.

Unfortunately, there's no straightforward solution that would not see significant slowdowns (in some cases 50% performance drops) when moving from 32-bit to 64-bit offset types. We are currently investigating options that mitigate performance drops when using 64-bit offset types. One such option is tracked in #2136.

@PointKernel
Member

Hi @elstehle

I noticed num_items in cub::DeviceSelect::UniqueByKey is a template parameter so the API can handle inputs larger than INT_MAX. Could we do something similar to cub::DeviceSelect::If as well?

This will unblock NVIDIA/cuCollections#576 and rapidsai/cudf#16526.

@elstehle
Collaborator Author

elstehle commented Aug 13, 2024

I noticed num_items in cub::DeviceSelect::UniqueByKey is a template parameter so the API can handle inputs larger than INT_MAX. Could we do something similar to cub::DeviceSelect::If as well?

In theory, yes, we could just make num_items a template parameter, but the algorithm's performance is very sensitive to the offset type used in the kernel template instantiation. We see a worst-case slowdown of 2.7x when simply switching to 64-bit offset types (see benchmark data).

So, we're trying various ways to mitigate these performance drops that come from using a wider offset type.

With a more sophisticated approach, we were able to mitigate this slowdown to only 1.3x. However, this is still more than we would like to tolerate, if possible.

We will likely pursue a streaming approach for DeviceSelect and DevicePartition, similar to this experimental PR here that showed some promising results of a worst-case slowdown of only 4% for 2^24 number of items and more compared to main.
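
For readers unfamiliar with the streaming idea, here is a host-only C++ sketch of the general pattern: split the input into chunks small enough for 32-bit offsets, run the selection per chunk, and keep only the chunk base and the running output count in 64 bits. This is an illustration under those assumptions, not the implementation in the experimental PR. `select_flagged_chunk` and `select_flagged_streaming` are hypothetical names, and the per-chunk loop stands in for a device-side selection pass.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Selects flagged items from one chunk using only 32-bit offsets.
// Stand-in for a single selection pass; returns the number of items selected.
inline std::int32_t select_flagged_chunk(const std::int64_t* in, const char* flags,
                                         std::int32_t chunk_items, std::int64_t* out)
{
  std::int32_t num_selected = 0;
  for (std::int32_t i = 0; i < chunk_items; ++i)
  {
    if (flags[i])
    {
      out[num_selected++] = in[i];
    }
  }
  return num_selected;
}

// Streams over the input in chunks of at most max_chunk items.
// Preconditions: flags.size() >= in.size() and max_chunk > 0.
// Only the chunk base and the running output count are 64-bit.
inline std::int64_t select_flagged_streaming(const std::vector<std::int64_t>& in,
                                             const std::vector<char>& flags,
                                             std::vector<std::int64_t>& out,
                                             std::int32_t max_chunk)
{
  const std::int64_t num_items = static_cast<std::int64_t>(in.size());
  out.resize(in.size()); // worst case: every item is selected
  std::int64_t total_selected = 0;
  for (std::int64_t base = 0; base < num_items; base += max_chunk)
  {
    const std::int32_t chunk_items =
      static_cast<std::int32_t>(std::min<std::int64_t>(max_chunk, num_items - base));
    // Each chunk appends its selected items right after the previous chunk's output.
    total_selected += select_flagged_chunk(in.data() + base, flags.data() + base,
                                           chunk_items, out.data() + total_selected);
  }
  out.resize(static_cast<std::size_t>(total_selected));
  return total_selected;
}
```

In a real setting max_chunk would be close to INT_MAX, so inputs that already fit in 32 bits take a single pass and pay little extra. The small-chunk case is only useful for testing the stitching logic.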

@PointKernel
Member

We will likely pursue a streaming approach for DeviceSelect and DevicePartition, similar to this experimental PR here that showed some promising results of a worst-case slowdown of only 4% for 2^24 number of items and more compared to main.

I like the streaming idea. The performance degradation with small inputs is IMO negligible since the overall runtime is no more than one millisecond. Thanks for the great work!

PointKernel added a commit to NVIDIA/cuCollections that referenced this issue Aug 16, 2024
Fix #576

This PR fixes the large input retrieve_all bug with a method similar to
the streaming approach mentioned in
NVIDIA/cccl#1422 (comment).

To be reverted once the CCCL fix is in place.
Projects
Status: In Review