[FEA]: Add Thrust build option to disable dynamic offset type dispatch #1958

jrhemstad · 2024-07-08T23:15:12Z

Is this a duplicate?

I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

Thrust

Is your feature request related to a problem? Please describe.

Today, Thrust algorithms introspect the size of the input sequence to perform a dynamic dispatch between two independent instantiations of the same kernel. The only difference between the kernels is the "offset type", i.e., the type used for index calculations into the input sequence.

This is done to balance correctness and performance.

For correctness, Thrust needs to handle input sequences larger than what can be represented by int, i.e., numeric_limits<int>::max() (aka INT_MAX), which equals 2^31 - 1. For a sequence of 4-byte integers, that limit corresponds to about 8.6 GB. That's big, but by no means unreasonable given the size of GPU memory these days.
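A quick compile-time check of that arithmetic, written as a plain C++17 sketch (it assumes a 32-bit `int`, which holds on the platforms CUDA targets). The first element count that no longer fits in a signed 32-bit offset is 2^31; at 4 bytes per element that is exactly 8 GiB, or roughly 8.6 GB:

```cpp
#include <climits>
#include <cstdint>
#include <limits>

// INT_MAX is the largest element count a signed 32-bit offset can index.
static_assert(std::numeric_limits<int>::max() == INT_MAX, "same constant");
static_assert(std::int64_t{INT_MAX} == (std::int64_t{1} << 31) - 1,
              "INT_MAX is 2^31 - 1 when int is 32 bits wide");

// Smallest sequence length that overflows a signed 32-bit offset: 2^31.
constexpr std::int64_t first_overflowing_count = std::int64_t{INT_MAX} + 1;

// Bytes occupied by that many 4-byte integers.
constexpr std::int64_t bytes_at_threshold =
    first_overflowing_count * static_cast<std::int64_t>(sizeof(std::int32_t));

// 8'589'934'592 bytes == 8 GiB, i.e. roughly 8.6 GB.
static_assert(bytes_at_threshold == 8'589'934'592, "8 GiB");
```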

Thrust's kernels (implemented in CUB) therefore need to handle indexing into sequences larger than INT_MAX, which requires a 64-bit integer type like int64_t or uint64_t instead of a 32-bit integer type like int32_t or uint32_t. This small change can have a big performance impact on some algorithms/kernels. Therefore, we couldn't just switch everything from int to int64_t without potentially causing significant performance regressions for existing users.

For more detail, see: #47

Thrust's dynamic dispatch to the two kernel instantiations can negatively impact downstream users' build times and binary sizes. This has led projects like RAPIDS cuDF to patch CCCL source code as part of their build process to disable this dispatch.

Describe the solution you'd like

Thrust should expose build options like THRUST_FORCE_64BIT_OFFSET_TYPE and THRUST_FORCE_32BIT_OFFSET_TYPE that would be functionally equivalent to the patch that RAPIDS applies today. They would eliminate the dual instantiation and dynamic dispatch by instantiating the kernel only with the specified offset type.
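A sketch of how the two proposed options could collapse the dispatch to a single instantiation. The macro names are the ones proposed in this issue and should be read as placeholders; `dispatch_offset` and the `kernel` callable are hypothetical. The forced 32-bit path throws `std::length_error`, which is one possible way to mirror the throwing behavior of the RAPIDS patch (the exception type that patch uses is not stated here):

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Placeholder option names taken from this proposal, not a shipped API.
#if defined(THRUST_FORCE_32BIT_OFFSET_TYPE) && defined(THRUST_FORCE_64BIT_OFFSET_TYPE)
#error "Define at most one forced offset-type option"
#endif

// Hypothetical entry point: invokes `kernel` with the offset type to use.
// With a forced option only one instantiation of `kernel` is generated;
// without one, both are, and the choice is made at run time.
template <typename Kernel>
std::int64_t dispatch_offset(std::size_t num_items, Kernel&& kernel) {
  [[maybe_unused]] constexpr auto int32_max =
      static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
#if defined(THRUST_FORCE_64BIT_OFFSET_TYPE)
  // Only the 64-bit variant exists; every size is representable.
  return kernel(static_cast<std::int64_t>(num_items));
#elif defined(THRUST_FORCE_32BIT_OFFSET_TYPE)
  // Only the 32-bit variant exists; reject sizes it cannot index.
  if (num_items > int32_max) {
    throw std::length_error("input exceeds the forced 32-bit offset type");
  }
  return kernel(static_cast<std::int32_t>(num_items));
#else
  // Default: today's behavior, both variants compiled, chosen by size.
  if (num_items <= int32_max) {
    return kernel(static_cast<std::int32_t>(num_items));
  }
  return kernel(static_cast<std::int64_t>(num_items));
#endif
}
```

On a 64-bit host, compiled without either macro and given a `width` lambda that returns `sizeof(n)`, `dispatch_offset(10, width)` returns 4 and a count of 2^32 returns 8; with `-DTHRUST_FORCE_32BIT_OFFSET_TYPE`, the 2^32 call throws instead.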

One question I wasn't sure about:

What should happen when you specify THRUST_FORCE_32BIT_OFFSET_TYPE and pass an input sequence where `distance(begin, end) > INT_MAX` (i.e., 2^31 or more elements)?
- UB?
- Throw? (this is what RAPIDS' patch is doing today)
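The two candidate behaviors can be contrasted with a pair of hypothetical conversion helpers (plain C++17; `to_offset32_unchecked` and `to_offset32_checked` are made-up names). Without a check, converting an oversized `std::size_t` to `std::int32_t` yields a wrapped value (implementation-defined before C++20, modular since), so the algorithm would silently process the wrong range. The checked version costs a single comparison once the count is known:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

constexpr std::size_t kInt32Max =
    static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());

// "UB" option: no release-mode check. An oversized count is narrowed to a
// wrong value; only debug builds (assert enabled) would catch the misuse.
inline std::int32_t to_offset32_unchecked(std::size_t num_items) {
  assert(num_items <= kInt32Max);
  return static_cast<std::int32_t>(num_items);
}

// "Throw" option: always check, and fail loudly instead of truncating.
inline std::int32_t to_offset32_checked(std::size_t num_items) {
  if (num_items > kInt32Max) {
    throw std::length_error("input size exceeds the 32-bit offset type");
  }
  return static_cast<std::int32_t>(num_items);
}
```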

Describe alternatives you've considered

No response

Additional context

This provides no benefit beyond what RAPIDS does today, other than that RAPIDS would no longer have to maintain a CCCL patch, which simplifies testing RAPIDS in CCCL's CI.


bdice · 2024-07-09T03:49:13Z

> What should happen when you specify THRUST_FORCE_32BIT_OFFSET_TYPE and pass an input sequence where `distance(begin, end) > INT_MAX`?

We chose to throw in the cuDF patch, but I don't think we rely on this behavior. This patch only applies to cuDF, not all of RAPIDS, and I think we already have safety guarantees from cuDF's size_type being a 32-bit type. I think cuDF could accept UB if that is a better solution for some reason, but I lean towards throwing unless someone knows of a reason not to (e.g., making it UB might allow us to skip the distance check, but that's probably not very costly). I would advocate for adopting the RAPIDS patch pretty much as-is for the case of THRUST_FORCE_32BIT_OFFSET_TYPE.

bernhardmgruber · 2024-07-09T09:18:52Z

CMake-interface-wise I would opt for a single tri-state option, e.g. THRUST_OFFSET_TYPE, with string values 32, 64 and auto (default).

> What should happen when you specify THRUST_FORCE_32BIT_OFFSET_TYPE and pass an input sequence where `distance(begin, end) > INT_MAX`?

UB is really the worst option, because it leaves bugs on the user side undetected. Since Thrust mostly uses random-access and contiguous iterators, and we already compute the distance in many cases when calling the CUB API, the cost is negligible, and I strongly advocate for detecting overflow and erroring out.

Whether an exception is the right tool is a different question, since we communicate CUDA errors using error codes. However, (I think) allocation failures are already reported by throwing std::bad_alloc, so we already mix error-reporting techniques, and adding a new exception should be fine. Also, the user is not expected to handle it: if a user passes a larger range than an int32 can hold while promising not to (via the CMake option), that is a bug, not an expected-but-unlikely error condition. If a user expects that they may overrun a 32-bit-sized range, they should instead check the range's size before making the Thrust call, or configure CMake to promote the offset type to 64-bit. Therefore, an exception is a good tool, since we are detecting an exceptional circumstance.

jrhemstad · 2024-07-09T15:22:10Z

> Whether an exception is the right tool is a different question, since we communicate CUDA errors using error codes

Thrust throws exceptions today already, so this isn't a problem.

My preference is to throw as well, but I wanted to leave it open for someone to disagree.

jrhemstad · 2024-07-09T15:23:49Z

> CMake-interface-wise I would opt for a single tri-state option, e.g. THRUST_OFFSET_TYPE, with string values 32, 64 and auto (default).

As far as I'm concerned, this build option is a stop-gap solution and the ergonomics aren't very important to me.

The long-term solution will be for people to specify the desired offset type via the execution policy, but that will be more involved for us to implement.
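One way the execution-policy approach could look, sketched with entirely hypothetical names (`offset_policy`, `run_with_policy`); this is not an existing Thrust API, only an illustration that carrying the offset type in the policy's type lets each call site instantiate exactly one variant, with no build option and no runtime dispatch:

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <type_traits>

// Hypothetical policy tag: the offset type travels with the policy's type.
template <typename OffsetT>
struct offset_policy {
  static_assert(std::is_integral<OffsetT>::value, "offset type must be integral");
  using offset_type = OffsetT;
};

// Hypothetical algorithm entry point: one instantiation per policy type.
template <typename OffsetT, typename Kernel>
std::int64_t run_with_policy(offset_policy<OffsetT>, std::size_t num_items,
                             Kernel&& kernel) {
  // Still reject sizes the chosen offset type cannot represent.
  if (num_items > static_cast<std::size_t>(std::numeric_limits<OffsetT>::max())) {
    throw std::length_error("input too large for the policy's offset type");
  }
  return kernel(static_cast<OffsetT>(num_items));
}
```

With a `width` lambda that returns `sizeof(n)`, `run_with_policy(offset_policy<std::int32_t>{}, 10, width)` returns 4 and `run_with_policy(offset_policy<std::int64_t>{}, 10, width)` returns 8.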

The current dispatch mechanism trades compile time and binary size for performance and flexibility. Allow users to tune that depending on their needs. Fixes NVIDIA#1958

jrhemstad added the feature request label Jul 8, 2024

jrhemstad mentioned this issue Jul 8, 2024

[EPIC] RAPIDS Should not need to patch CCCL #1939 (Open)

jrhemstad mentioned this issue Jul 22, 2024

Thrust large input support #49 (Open)

jrhemstad mentioned this issue Aug 27, 2024

Update patches for CCCL 2.6 rapidsai/cudf#16668 (Open)

jrhemstad assigned miscco Aug 27, 2024

miscco mentioned this issue Aug 28, 2024

Make the thrust dispatch mechanisms configurable #2310 (Merged)

miscco closed this as completed in #2310 Aug 30, 2024
