
Enable hash computation from variable length keys #327

Merged (19 commits) on Jul 12, 2023

Conversation

sleeepyjack (Collaborator)

This PR enables hash computation of key types whose sizes are only known at runtime.

@sleeepyjack added the type: feature request label on Jun 30, 2023
@sleeepyjack (Collaborator, Author)

Murmur3 doesn't play nice with the dynamic size. For now, I just implemented this idea for xxhash.

@sleeepyjack (Collaborator, Author)

I'm seeing a few build warnings in the tests:

```
In member function ‘constexpr cuco::detail::XXHash_64<Key>::result_type cuco::detail::XXHash_64<Key>::operator()(const Key&, Extent) const [with Extent = long unsigned int; Key = int]’,
    inlined from ‘void CATCH2_INTERNAL_TEMPLATE_TEST_8() [with Hash = cuco::detail::XXHash_64<int>]’ at /workspaces/cuCollections/tests/utility/hash_test.cu:170:33:
/workspaces/cuCollections/include/cuco/detail/hash_functions/xxhash.cuh:296:38: warning: ‘<anonymous>’ may be used uninitialized [-Wmaybe-uninitialized]
  296 |         v2 += blocks8[pipeline_offset + 1] * prime2;
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~~^
/workspaces/cuCollections/tests/utility/hash_test.cu: In function ‘void CATCH2_INTERNAL_TEMPLATE_TEST_8() [with Hash = cuco::detail::XXHash_64<int>]’:
/workspaces/cuCollections/tests/utility/hash_test.cu:170:75: note: ‘<anonymous>’ declared here
  170 |     CHECK(hash(42) == hash(42, key_size));
      |                                                                           ^
In member function ‘constexpr cuco::detail::XXHash_64<Key>::result_type cuco::detail::XXHash_64<Key>::operator()(const Key&, Extent) const [with Extent = long unsigned int; Key = int]’,
    inlined from ‘void CATCH2_INTERNAL_TEMPLATE_TEST_8() [with Hash = cuco::detail::XXHash_64<int>]’ at /workspaces/cuCollections/tests/utility/hash_test.cu:170:33:
/workspaces/cuCollections/include/cuco/detail/hash_functions/xxhash.cuh:299:38: warning: ‘<anonymous>’ may be used uninitialized [-Wmaybe-uninitialized]
  299 |         v3 += blocks8[pipeline_offset + 2] * prime2;
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~~^
/workspaces/cuCollections/tests/utility/hash_test.cu: In function ‘void CATCH2_INTERNAL_TEMPLATE_TEST_8() [with Hash = cuco::detail::XXHash_64<int>]’:
/workspaces/cuCollections/tests/utility/hash_test.cu:170:75: note: ‘<anonymous>’ declared here
  170 |     CHECK(hash(42) == hash(42, key_size));
      |                                                                           ^
In member function ‘constexpr cuco::detail::XXHash_64<Key>::result_type cuco::detail::XXHash_64<Key>::operator()(const Key&, Extent) const [with Extent = long unsigned int; Key = int]’,
    inlined from ‘void CATCH2_INTERNAL_TEMPLATE_TEST_8() [with Hash = cuco::detail::XXHash_64<int>]’ at /workspaces/cuCollections/tests/utility/hash_test.cu:170:33:
/workspaces/cuCollections/include/cuco/detail/hash_functions/xxhash.cuh:302:38: warning: ‘<anonymous>’ may be used uninitialized [-Wmaybe-uninitialized]
  302 |         v4 += blocks8[pipeline_offset + 3] * prime2;
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~~^
/workspaces/cuCollections/tests/utility/hash_test.cu: In function ‘void CATCH2_INTERNAL_TEMPLATE_TEST_8() [with Hash = cuco::detail::XXHash_64<int>]’:
/workspaces/cuCollections/tests/utility/hash_test.cu:170:75: note: ‘<anonymous>’ declared here
  170 |     CHECK(hash(42) == hash(42, key_size));
```

@PointKernel (Member) commented Jun 30, 2023

> Murmur3 doesn't play nice with the dynamic size. For now, I just implemented this idea for xxhash.

What is the issue with murmurhash? We could take the cudf code as a reference (https://github.com/rapidsai/cudf/blob/d14b6cce9cc39793c118f065b113c83d0210ceb6/cpp/include/cudf/detail/utilities/hash_functions.cuh#L195-L285). Eventually, I would like to get rid of all hash function details in libcudf and rely on cuco hashers instead.

@sleeepyjack (Collaborator, Author)

Dynamic murmurhash3 passes the unit test now (see diff 4a95fa8) but I'm getting even more warnings:

```
/workspaces/cuCollections/include/cuco/detail/hash_functions/murmurhash3.cuh:183:13: warning: this statement may fall through [-Wimplicit-fallthrough=]
  183 |       case 3: k1 ^= tail[2] << 16;
      |               ~~~^~~~~~~~~~~~~~~~
/workspaces/cuCollections/include/cuco/detail/hash_functions/murmurhash3.cuh:184:1: note: here
  184 |       case 2: k1 ^= tail[1] << 8;
      | ^
/workspaces/cuCollections/include/cuco/detail/hash_functions/murmurhash3.cuh:184:13: warning: this statement may fall through [-Wimplicit-fallthrough=]
  184 |       case 2: k1 ^= tail[1] << 8;
      |               ~~~^~~~~~~~~~~~~~~
...
```

@PointKernel (Member)

> Dynamic murmurhash3 passes the unit test now (see diff 4a95fa8) but I'm getting even more warnings: […]

`[[fallthrough]]`
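For reference, a minimal sketch of the suggested annotation, assuming the tail handling follows the reference MurmurHash3 implementation (the helpers `rotl32`, `c1`, `c2`, and `h1` are taken from that reference and are not necessarily cuco's exact names):

```cpp
// Sketch: mark the intentional fall-through between tail cases so that
// -Wimplicit-fallthrough no longer warns.
switch (size & 3) {
  case 3: k1 ^= tail[2] << 16; [[fallthrough]];
  case 2: k1 ^= tail[1] << 8; [[fallthrough]];
  case 1:
    k1 ^= tail[0];
    k1 *= c1;
    k1 = rotl32(k1, 15);
    k1 *= c2;
    h1 ^= k1;
}
```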

Review thread on include/cuco/detail/hash_functions/murmurhash3.cuh (outdated, resolved)
```cpp
 * @return A resulting hash value for `key`
 */
template <typename Extent>
constexpr result_type __host__ __device__ operator()(Key const& key, Extent size) const noexcept
```
Member:

As discussed offline, we would provide a member like

```cpp
constexpr result_type __host__ __device__ compute_hash(Key const& key, Extent size) const noexcept;
```

instead of overloading the `()` operator.

Per `std::hash`:

> Specializations of `std::hash` should define an `operator()` that takes a single argument `key` of type `Key`.

Collaborator (Author):

Technically we still meet the standard requirements/interface. This is just another overload of the operator. The standard does not prohibit additional functionality.
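A condensed sketch of the resulting interface, with both overloads side by side (the two-argument signature is taken from the diff above; the rest of the class body is abbreviated):

```cpp
template <typename Key>
struct XXHash_64 {
  using result_type = std::uint64_t;

  // std::hash-style entry point: a single argument of type Key.
  constexpr result_type __host__ __device__ operator()(Key const& key) const noexcept;

  // Additional overload introduced by this PR: hashes a key whose byte size
  // is only known at runtime.
  template <typename Extent>
  constexpr result_type __host__ __device__ operator()(Key const& key, Extent size) const noexcept;
};
```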

@sleeepyjack (Collaborator, Author) commented Jul 1, 2023

Proof that, with the changes introduced by this PR, the compiler is still able to apply the same optimizations in the static-extent case (the PTX output is identical): https://godbolt.org/z/zGd4G4heG

Edit: updated link incorporating the latest changes: https://godbolt.org/z/31srn8f3e

Comment on lines 151 to 170
```diff
-  const uint32_t* const blocks = (const uint32_t*)(data + nblocks * 4);
-  for (int i = -nblocks; i; i++) {
-    uint32_t k1 = blocks[i];  // getblock32(blocks,i);
+  for (std::remove_const_t<decltype(nblocks)> i = 0; size >= 4 && i < nblocks; i++) {
+    std::uint32_t k1 = load_chunk<std::uint32_t>(data, i);
```
Collaborator (Author):

@PointKernel Here's the essential change I had to make to eliminate the compiler warnings.
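For context, `load_chunk` stands for a memcpy-based block read; a sketch of what such a helper could look like (only the call site `load_chunk<std::uint32_t>(data, i)` appears in the diff, so the body and parameter types here are assumptions):

```cpp
#include <cstddef>
#include <cstring>

// Sketch: copy the index-th sizeof(T)-byte chunk out of the byte stream
// instead of dereferencing a type-punned pointer, so the read is well
// defined and no longer triggers -Wmaybe-uninitialized.
template <typename T>
__host__ __device__ T load_chunk(char const* bytes, std::size_t index) noexcept
{
  T chunk;
  memcpy(&chunk, bytes + index * sizeof(T), sizeof(T));
  return chunk;
}
```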

@PointKernel (Member) left a comment:

Conflicts to resolve (due to the bug fix in #326)

@sleeepyjack Can you please add a benchmark verifying the memcpy won't affect runtime performance?

Otherwise LGTM

@sleeepyjack (Collaborator, Author)

Performance difference between the current dev branch (Ref) and this branch (Cmp):

[0] NVIDIA RTX A6000

| Hash | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| cuco::detail::MurmurHash3_32 | 3.852 ms | 0.01% | 3.747 ms | 0.01% | -105.309 us | -2.73% | FAIL |
| cuco::detail::MurmurHash3_32 | 7.526 ms | 0.01% | 7.527 ms | 0.00% | 0.138 us | 0.00% | PASS |
| cuco::detail::MurmurHash3_32<large_key<32> > | 33.123 ms | 0.00% | 33.123 ms | 0.00% | -0.257 us | -0.00% | PASS |
| cuco::detail::XXHash_32 | 3.158 ms | 0.02% | 2.974 ms | 0.02% | -183.604 us | -5.81% | FAIL |
| cuco::detail::XXHash_32 | 6.253 ms | 0.01% | 6.253 ms | 0.01% | 0.397 us | 0.01% | PASS |
| cuco::detail::XXHash_32<large_key<32> > | 26.792 ms | 0.00% | 26.792 ms | 0.00% | 0.108 us | 0.00% | PASS |
| cuco::detail::XXHash_64 | 6.620 ms | 0.01% | 6.379 ms | 0.01% | -241.088 us | -3.64% | FAIL |
| cuco::detail::XXHash_64 | 8.156 ms | 0.01% | 8.705 ms | 0.01% | 548.905 us | 6.73% | FAIL |
| cuco::detail::XXHash_64<large_key<32> > | 64.476 ms | 0.00% | 64.475 ms | 0.00% | -0.467 us | -0.00% | PASS |
| cuco::detail::MurmurHash3_fmix32 | 2.772 ms | 0.01% | 2.772 ms | 0.01% | 0.017 us | 0.00% | PASS |
| cuco::detail::MurmurHash3_fmix64 | 4.633 ms | 0.01% | 4.633 ms | 0.01% | 0.095 us | 0.00% | PASS |

@sleeepyjack (Collaborator, Author)

rerun tests

@sleeepyjack (Collaborator, Author) commented Jul 4, 2023

OK, so it turns out that the commit that introduced the memcpy (3aedeb0) blew everything up: the compiler segfaults in every TU that uses a hasher inside a map/set. Interestingly, hash_test and hash_bench work fine. Am I doing something stupid here? cc @jrhemstad

@sleeepyjack (Collaborator, Author)

With CTK 12.2 everything compiles just fine - no compiler segfaults. Does anyone know of a recently fixed nvcc bug causing this issue?

```cpp
 * @return A resulting hash value for `key`
 */
template <typename Extent>
constexpr result_type __host__ __device__ operator()(Key const& key, Extent size) const noexcept
```
Collaborator:

This feels weird. Normally, I'd expect such an esoteric use case to provide a custom hash function as opposed to trying to twist the standard hash function signature to make it work for this kind of situation.

@sleeepyjack (Collaborator, Author), Jul 6, 2023:

Just discussed this with @PointKernel. I will change the signature to

```cpp
template <typename Extent>
constexpr result_type __host__ __device__ compute_bytes(std::byte const* data, Extent size) const noexcept;
```

Collaborator (Author):

The idea is that a user can use this function to build a custom hasher around our hashers.
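For illustration, a user-defined hasher wrapping `compute_bytes` could look roughly like this (a sketch; the key type `raw_bytes_key` and everything other than the `compute_bytes` signature above are made up for the example):

```cpp
#include <cstddef>

// Hypothetical variable-length key: a pointer to the bytes plus a runtime size.
struct raw_bytes_key {
  std::byte const* data;
  std::size_t size;
};

// Custom hasher built around the proposed compute_bytes member of a cuco hasher.
template <typename Hash>
struct raw_bytes_hasher {
  Hash hash{};

  constexpr typename Hash::result_type __host__ __device__
  operator()(raw_bytes_key const& key) const noexcept
  {
    return hash.compute_bytes(key.data, key.size);
  }
};
```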

@sleeepyjack (Collaborator, Author), Jul 7, 2023:

A slightly more modern approach would be:

```cpp
template <class T, std::size_t N>
constexpr result_type __host__ __device__ compute_hash(cuda::std::span<T, N> data) const noexcept;
```

Edit: We're still waiting for cuda::std::span support: #332
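Once that lands, calling the span-based overload could look roughly like this (purely illustrative; it assumes #332 is resolved and that the hasher exposes `compute_hash` as sketched above):

```cpp
#include <cuda/std/cstddef>
#include <cuda/std/span>

// Illustrative only: hash `size` bytes through a dynamic-extent span.
template <typename Hasher>
__host__ __device__ typename Hasher::result_type hash_bytes(Hasher const& hash,
                                                            cuda::std::byte const* data,
                                                            cuda::std::size_t size)
{
  return hash.compute_hash(cuda::std::span<cuda::std::byte const>{data, size});
}
```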

Collaborator (Author):

Addressed in 74a3739

@sleeepyjack (Collaborator, Author)

Updated benchmarks (reference is current dev branch):

[0] NVIDIA RTX A6000

| Hash | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| cuco::detail::MurmurHash3_32 | 3.123 ms | 0.97% | 3.094 ms | 0.67% | -29.645 us | -0.95% | FAIL |
| cuco::detail::MurmurHash3_32 | 6.541 ms | 1.08% | 6.506 ms | 1.27% | -35.056 us | -0.54% | PASS |
| cuco::detail::MurmurHash3_32<large_key<32> > | 27.241 ms | 0.62% | 27.031 ms | 0.82% | -209.693 us | -0.77% | FAIL |
| cuco::detail::XXHash_32 | 2.619 ms | 0.73% | 2.602 ms | 0.24% | -16.662 us | -0.64% | FAIL |
| cuco::detail::XXHash_32 | 5.501 ms | 1.17% | 5.503 ms | 1.95% | 2.386 us | 0.04% | PASS |
| cuco::detail::XXHash_32<large_key<32> > | 22.452 ms | 0.77% | 22.218 ms | 0.58% | -234.048 us | -1.04% | FAIL |
| cuco::detail::XXHash_64 | 5.504 ms | 0.24% | 5.437 ms | 0.46% | -67.710 us | -1.23% | FAIL |
| cuco::detail::XXHash_64 | 6.784 ms | 0.00% | 6.714 ms | 0.33% | -69.711 us | -1.03% | FAIL |
| cuco::detail::XXHash_64<large_key<32> > | 55.766 ms | 0.53% | 55.026 ms | 0.63% | -739.887 us | -1.33% | FAIL |
| cuco::detail::MurmurHash3_fmix32 | 2.251 ms | 1.56% | 2.231 ms | 1.47% | -20.388 us | -0.91% | PASS |
| cuco::detail::MurmurHash3_fmix64 | 3.832 ms | 0.51% | 3.794 ms | 0.55% | -37.373 us | -0.98% | FAIL |

Labels: type: feature request

3 participants