Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA]: Add Thrust build option to disable dynamic offset type dispatch #1958

Closed
1 task done
Tracked by #49
jrhemstad opened this issue Jul 8, 2024 · 4 comments · Fixed by #2310
Closed
1 task done
Tracked by #49

[FEA]: Add Thrust build option to disable dynamic offset type dispatch #1958

jrhemstad opened this issue Jul 8, 2024 · 4 comments · Fixed by #2310
Assignees
Labels
feature request New feature or request.

Comments

@jrhemstad
Copy link
Collaborator

Is this a duplicate?

Area

Thrust

Is your feature request related to a problem? Please describe.

Today, Thrust algorithms introspects the size of the input sequence to perform a dynamic dispatch between two independent instantiations of the same kernel. The only difference between the kernels is the type used for the "offset type", or the type that is used for index calculations into the input sequence.

This is done for a balance between correctness and performance.

For correctness, Thrust needs to be able to handle input sequences larger than what can be represented by int, e.g., numeric_limits<int>::max() aka INT_MAX equal to 2^31 - 1. For a sequence of 4B integers, this would be ~8.5GB. That's big, but by no means unreasonable with the size of GPU memory these days.

This means Thrust's kernels (CUB) need to be able to handle indexing into sequences larger than INT_MAX. This means using a 64 bit integer type like int64_t or uint64_t instead of a 32 bit integer type like int32_t or uint32_t. This small change can have a big performance impact on some algorithms/kernels. Therefore, we couldn't just switch everything using int to use int64_t without potentially causing significant performance regressions for existing users.

For more detail, see: #47

Thrust's dynamic dispatch to the two kernel instantiations can negatively impact downstream user's build times and binary sizes. This has lead to projects like RAPIDS cuDF to patch CCCL source code as part of their build process to disable this dispatch.

Describe the solution you'd like

Thrust should expose a build option like THRUST_FORCE_64BIT_OFFSET_TYPE and THRUST_FORCE_32BIT_OFFSET_TYPE that would be functionality equivalent to the patch that RAPIDS applies today. It will eliminate the dual instantiation and dynamic dispatch by only instantiating the kernel with the specified offset type.

One question I wasn't sure about:

  • What should happen when you specify THRUST_FORCE_32BIT_OFFSET_TYPE and you pass an input sequence where `distance(begin,end) >= (2<<31 - 1)?
    • UB?
    • Throw? (this is what RAPIDS' patch is doing today)

Describe alternatives you've considered

No response

Additional context

This will provide no added benefits over what RAPIDS is doing today other than the fact that RAPIDS no longer has to maintain a CCCL patch, which simplifies CCCL's testing of RAPIDS in our CI.

@bdice
Copy link
Contributor

bdice commented Jul 9, 2024

What should happen when you specify THRUST_FORCE_32BIT_OFFSET_TYPE and you pass an input sequence where distance(begin,end) >= (2<<31 - 1)?

We chose to throw in the cuDF patch but I don't think we rely on this behavior. This patch only applies to cuDF, not all of RAPIDS, and I think we have safety guarantees from cuDF's size_type being a 32-bit type already. I think cuDF could accept UB if that is a better solution for some reason but I lean towards throwing, unless someone knows of a reason not to do so (e.g. making it UB might allow us to skip the distance check? But that's probably not very costly). I would advocate for adopting the RAPIDS patch pretty much as-is for the case of THRUST_FORCE_32BIT_OFFSET_TYPE.

@bernhardmgruber
Copy link
Contributor

CMake-interface-wise I would opt for a single tri-state option, e.g. THRUST_OFFSET_TYPE, with string values 32, 64 and auto (default).

What should happen when you specify THRUST_FORCE_32BIT_OFFSET_TYPE and you pass an input sequence where distance(begin,end) >= (2<<31 - 1)?

UB is really the worst option, because it leaves bugs on the userside undetected. Since Thrust uses mostly random access and contiguous iteartors and we already compute the distance in many cases when calling the CUB API, the cost is negligible and I strongly advocate for detecting overflow and erroring out.

Whether an exception is the right tool is a different question, since we communicate CUDA errors using error codes. However, (I think) allocation failures are already reported by throwing std::bad_alloc, so we already mix error reporting techniques. Adding a new exception should be fine. Also, because the user is not expected to handle it: If a user passes a larger range then what an int32 can hold while promising to not do that (via the cmake option), it's a bug and not an expected but unlikely error condition. If a user expects that they can overrun the 32-bit-sized range, then they should rather check their range's size before making the Thrust call, or configure CMake to handle these cases by promoting the offset type to 64-bit. Therefore, an exception is a good tool, since we are detecting an exceptional circumstance.

@jrhemstad
Copy link
Collaborator Author

Whether an exception is the right tool is a different question, since we communicate CUDA errors using error codes

Thrust throws exceptions today already, so this isn't a problem.

My preference is to throw as well, but wanted to leave it open for someone to disagree.

@jrhemstad
Copy link
Collaborator Author

jrhemstad commented Jul 9, 2024

CMake-interface-wise I would opt for a single tri-state option, e.g. THRUST_OFFSET_TYPE, with string values 32, 64 and auto (default).

As far as I'm concerned, this build option is a stop-gap solution and the ergonomics aren't very important to me.

The long term solution will be for people to specify the desired offset type via the execution policy, but that will be more involved for us to implement.

miscco added a commit to miscco/cccl that referenced this issue Aug 28, 2024
The current dispatch mechanisms trades compile time and binary size for performance and flexibility.

Allow users to tune that depending on their needs

Fixes NVIDIA#1958
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request.
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants