From 52a967fd00cc0ecccbac64eb3af57a5b730f3179 Mon Sep 17 00:00:00 2001 From: "Ralf W. Grosse-Kunstleve" Date: Thu, 19 Sep 2024 04:54:29 -0700 Subject: [PATCH 1/5] Replace `CUDA C++ Core Libraries` with `CUDA Core Compute Libraries` (only in README.md). (#2424) * Replace `CUDA C++ Core Libraries` with `CUDA Core Compute Libraries`. * Remove a couple `C++`, based on suggestion by @miscco --------- Co-authored-by: Michael Schellenberger Costa --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index b4649dbde2..828586824e 100644 --- a/README.md +++ b/README.md @@ -3,9 +3,9 @@ |[Contributor Guide](https://github.com/NVIDIA/cccl/blob/main/CONTRIBUTING.md)|[Dev Containers](https://github.com/NVIDIA/cccl/blob/main/.devcontainer/README.md)|[Discord](https://discord.gg/nvidiadeveloper)|[Godbolt](https://godbolt.org/z/x4G73af9a)|[GitHub Project](https://github.com/orgs/NVIDIA/projects/6)|[Documentation](https://nvidia.github.io/cccl)| |-|-|-|-|-|-| -# CUDA C++ Core Libraries (CCCL) +# CUDA Core Compute Libraries (CCCL) -Welcome to the CUDA C++ Core Libraries (CCCL) where our mission is to make CUDA C++ more delightful. +Welcome to the CUDA Core Compute Libraries (CCCL) where our mission is to make CUDA more delightful. This repository unifies three essential CUDA C++ libraries into a single, convenient repository: @@ -19,7 +19,7 @@ For more information about the decision to unify these projects, see the [announ ## Overview -The concept for the CUDA C++ Core Libraries (CCCL) grew organically out of the Thrust, CUB, and libcudacxx projects that were developed independently over the years with a similar goal: to provide high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers. +The concept for the CUDA Core Compute Libraries (CCCL) grew organically out of the Thrust, CUB, and libcudacxx projects that were developed independently over the years with a similar goal: to provide high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers. Naturally, there was a lot of overlap among the three projects, and it became clear the community would be better served by unifying them into a single repository. - **Thrust** is the C++ parallel algorithms library which inspired the introduction of parallel algorithms to the C++ Standard Library. Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs via configurable backends that allow using multiple parallel programming frameworks (such as CUDA, TBB, and OpenMP). 
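As a rough illustration of the high-level interface described in the bullet above (a minimal sketch, not taken from this patch; assumes a standard Thrust installation, compiled as a .cu file with nvcc using the default CUDA backend):

#include <thrust/device_vector.h>
#include <thrust/sort.h>

#include <cstdio>

int main()
{
  thrust::device_vector<int> d(4);
  d[0] = 3; d[1] = 1; d[2] = 2; d[3] = 0;       // element access copies each value to the device
  thrust::sort(d.begin(), d.end());             // parallel sort on the configured device backend
  std::printf("%d\n", static_cast<int>(d[0]));  // prints 0
  return 0;
}

The same source can instead target multicore CPUs by selecting the TBB or OpenMP backend (for example via THRUST_DEVICE_SYSTEM) rather than CUDA.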
@@ -413,7 +413,7 @@ For a detailed overview of the CI pipeline, see [ci-overview.md](ci-overview.md) ## Related Projects -Projects that are related to CCCL's mission to make CUDA C++ more delightful: +Projects that are related to CCCL's mission to make CUDA more delightful: - [cuCollections](https://github.com/NVIDIA/cuCollections) - GPU accelerated data structures like hash tables - [NVBench](https://github.com/NVIDIA/nvbench) - Benchmarking library tailored for CUDA applications - [stdexec](https://github.com/nvidia/stdexec) - Reference implementation for Senders asynchronous programming model From d1911020ea04ec45583b430fb87604807e66221d Mon Sep 17 00:00:00 2001 From: Michael Schellenberger Costa Date: Thu, 19 Sep 2024 16:15:11 +0200 Subject: [PATCH 2/5] Move the cuda atomic.h file (#2418) --- libcudacxx/include/cuda/{std/__cuda => __atomic}/atomic.h | 6 +++--- libcudacxx/include/cuda/atomic | 7 +++++-- 2 files changed, 8 insertions(+), 5 deletions(-) rename libcudacxx/include/cuda/{std/__cuda => __atomic}/atomic.h (97%) diff --git a/libcudacxx/include/cuda/std/__cuda/atomic.h b/libcudacxx/include/cuda/__atomic/atomic.h similarity index 97% rename from libcudacxx/include/cuda/std/__cuda/atomic.h rename to libcudacxx/include/cuda/__atomic/atomic.h index 2ce5e4f580..aed5253770 100644 --- a/libcudacxx/include/cuda/std/__cuda/atomic.h +++ b/libcudacxx/include/cuda/__atomic/atomic.h @@ -8,8 +8,8 @@ // //===----------------------------------------------------------------------===// -#ifndef _LIBCUDACXX___CUDA_ATOMIC_H -#define _LIBCUDACXX___CUDA_ATOMIC_H +#ifndef _CUDA___ATOMIC_ATOMIC_H +#define _CUDA___ATOMIC_ATOMIC_H #include @@ -138,4 +138,4 @@ inline _CCCL_HOST_DEVICE void atomic_signal_fence(memory_order __m) _LIBCUDACXX_END_NAMESPACE_CUDA -#endif // _LIBCUDACXX___CUDA_ATOMIC_H +#endif // _CUDA___ATOMIC_ATOMIC_H diff --git a/libcudacxx/include/cuda/atomic b/libcudacxx/include/cuda/atomic index 06dd1c785c..6f7d89899c 100644 --- a/libcudacxx/include/cuda/atomic +++ b/libcudacxx/include/cuda/atomic @@ -4,14 +4,14 @@ // under the Apache License v2.0 with LLVM Exceptions. // See https://llvm.org/LICENSE.txt for license information. // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -// SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. +// SPDX-FileCopyrightText: Copyright (c) 2023-24 NVIDIA CORPORATION & AFFILIATES. 
// //===----------------------------------------------------------------------===// #ifndef _CUDA_ATOMIC #define _CUDA_ATOMIC -#include +#include #if defined(_CCCL_IMPLICIT_SYSTEM_HEADER_GCC) # pragma GCC system_header @@ -21,4 +21,7 @@ # pragma system_header #endif // no system header +#include +#include + #endif // _CUDA_ATOMIC From 445fd71ad66435ff7db7df166b6cd7e36a7c7f9c Mon Sep 17 00:00:00 2001 From: Eric Niebler Date: Thu, 19 Sep 2024 07:15:47 -0700 Subject: [PATCH 3/5] `uninitialized_buffer::get_resource` returns a ref to an `any_resource` that can be copied (#2431) * `uninitialized_buffer::get_resource` returns a ref to an `any_resource` that can be copied * Also update `uninintialized_async_buffer` * Fix doc string --------- Co-authored-by: Michael Schellenberger Costa --- .../uninitialized_async_buffer.cuh | 15 +++--- .../__container/uninitialized_buffer.cuh | 13 ++--- .../containers/uninitialized_async_buffer.cu | 50 +++++++++++++++++++ cudax/test/containers/uninitialized_buffer.cu | 46 +++++++++++++++++ 4 files changed, 110 insertions(+), 14 deletions(-) diff --git a/cudax/include/cuda/experimental/__container/uninitialized_async_buffer.cuh b/cudax/include/cuda/experimental/__container/uninitialized_async_buffer.cuh index eea30c1b69..54115e4ccd 100644 --- a/cudax/include/cuda/experimental/__container/uninitialized_async_buffer.cuh +++ b/cudax/include/cuda/experimental/__container/uninitialized_async_buffer.cuh @@ -72,7 +72,8 @@ template class uninitialized_async_buffer { private: - ::cuda::experimental::mr::async_any_resource<_Properties...> __mr_; + using __async_resource = ::cuda::experimental::mr::async_any_resource<_Properties...>; + __async_resource __mr_; ::cuda::stream_ref __stream_ = {}; size_t __count_ = 0; void* __buf_ = nullptr; @@ -127,9 +128,7 @@ public: //! @param __count The desired size of the buffer. //! @note Depending on the alignment requirements of `T` the size of the underlying allocation might be larger //! than `count * sizeof(T)`. Only allocates memory when \p __count > 0 - uninitialized_async_buffer(::cuda::experimental::mr::async_any_resource<_Properties...> __mr, - const ::cuda::stream_ref __stream, - const size_t __count) + uninitialized_async_buffer(__async_resource __mr, const ::cuda::stream_ref __stream, const size_t __count) : __mr_(_CUDA_VSTD::move(__mr)) , __stream_(__stream) , __count_(__count) @@ -205,12 +204,12 @@ public: } //! @rst - //! Returns an :ref:`asnyc_resource_ref ` to the resource used - //! to allocate the buffer + //! Returns a \c const reference to the :ref:`any_async_resource ` + //! that holds the memory resource used to allocate the buffer //! @endrst - _CCCL_NODISCARD _CUDA_VMR::async_resource_ref<_Properties...> get_resource() const noexcept + _CCCL_NODISCARD const __async_resource& get_resource() const noexcept { - return _CUDA_VMR::async_resource_ref<_Properties...>{const_cast(this)->__mr_}; + return __mr_; } //! 
@brief Returns the stored stream diff --git a/cudax/include/cuda/experimental/__container/uninitialized_buffer.cuh b/cudax/include/cuda/experimental/__container/uninitialized_buffer.cuh index 9c88df1d95..9cce7e706e 100644 --- a/cudax/include/cuda/experimental/__container/uninitialized_buffer.cuh +++ b/cudax/include/cuda/experimental/__container/uninitialized_buffer.cuh @@ -63,7 +63,8 @@ template class uninitialized_buffer { private: - ::cuda::experimental::mr::any_resource<_Properties...> __mr_; + using __resource = ::cuda::experimental::mr::any_resource<_Properties...>; + __resource __mr_; size_t __count_ = 0; void* __buf_ = nullptr; @@ -116,7 +117,7 @@ public: //! @note Depending on the alignment requirements of `T` the size of the underlying allocation might be larger //! than `count * sizeof(T)`. //! @note Only allocates memory when \p __count > 0 - uninitialized_buffer(::cuda::experimental::mr::any_resource<_Properties...> __mr, const size_t __count) + uninitialized_buffer(__resource __mr, const size_t __count) : __mr_(_CUDA_VSTD::move(__mr)) , __count_(__count) , __buf_(__count_ == 0 ? nullptr : __mr_.allocate(__get_allocation_size(__count_))) @@ -188,13 +189,13 @@ public: } //! @rst - //! Returns a :ref:`resource_ref ` to the resource used to - //! allocate the buffer + //! Returns a \c const reference to the :ref:`any_resource ` + //! that holds the memory resource used to allocate the buffer //! @endrst _CCCL_EXEC_CHECK_DISABLE - _CCCL_NODISCARD _CCCL_HOST_DEVICE _CUDA_VMR::resource_ref<_Properties...> get_resource() const noexcept + _CCCL_NODISCARD _CCCL_HOST_DEVICE const __resource& get_resource() const noexcept { - return _CUDA_VMR::resource_ref<_Properties...>{const_cast(this)->__mr_}; + return __mr_; } //! @brief Swaps the contents with those of another \c uninitialized_buffer diff --git a/cudax/test/containers/uninitialized_async_buffer.cu b/cudax/test/containers/uninitialized_async_buffer.cu index b32d322827..8e1f6e304b 100644 --- a/cudax/test/containers/uninitialized_async_buffer.cu +++ b/cudax/test/containers/uninitialized_async_buffer.cu @@ -156,3 +156,53 @@ TEMPLATE_TEST_CASE( } } } + +// A test resource that keeps track of the number of resources are +// currently alive. 
+struct test_async_memory_resource : cudax::mr::async_memory_resource +{ + static int count; + + test_async_memory_resource() + { + ++count; + } + + test_async_memory_resource(const test_async_memory_resource& other) + : cudax::mr::async_memory_resource{other} + { + ++count; + } + + ~test_async_memory_resource() + { + --count; + } +}; + +int test_async_memory_resource::count = 0; + +TEST_CASE("uninitialized_async_buffer's memory resource does not dangle", "[container]") +{ + cuda::experimental::stream stream{}; + cudax::uninitialized_async_buffer buffer{ + cudax::mr::async_memory_resource{}, stream, 0}; + + { + CHECK(test_async_memory_resource::count == 0); + + cudax::uninitialized_async_buffer src_buffer{ + test_async_memory_resource{}, stream, 1024}; + + CHECK(test_async_memory_resource::count == 1); + + cudax::uninitialized_async_buffer dst_buffer{ + src_buffer.get_resource(), stream, 1024}; + + CHECK(test_async_memory_resource::count == 2); + + buffer = ::cuda::std::move(dst_buffer); + } + + CHECK(test_async_memory_resource::count == 1); +} diff --git a/cudax/test/containers/uninitialized_buffer.cu b/cudax/test/containers/uninitialized_buffer.cu index c924750a8a..e8aecad470 100644 --- a/cudax/test/containers/uninitialized_buffer.cu +++ b/cudax/test/containers/uninitialized_buffer.cu @@ -200,3 +200,49 @@ TEST_CASE("uninitialized_buffer is usable with cudax::launch", "[container]") cudax::launch(stream, dimensions, const_kernel, buffer); } } + +// A test resource that keeps track of the number of resources are +// currently alive. +struct test_device_memory_resource : cuda::mr::device_memory_resource +{ + static int count; + + test_device_memory_resource() + { + ++count; + } + + test_device_memory_resource(const test_device_memory_resource& other) + : cuda::mr::device_memory_resource{other} + { + ++count; + } + + ~test_device_memory_resource() + { + --count; + } +}; + +int test_device_memory_resource::count = 0; + +TEST_CASE("uninitialized_buffer's memory resource does not dangle", "[container]") +{ + cudax::uninitialized_buffer buffer{cuda::mr::device_memory_resource{}, 0}; + + { + CHECK(test_device_memory_resource::count == 0); + + cudax::uninitialized_buffer src_buffer{test_device_memory_resource{}, 1024}; + + CHECK(test_device_memory_resource::count == 1); + + cudax::uninitialized_buffer dst_buffer{src_buffer.get_resource(), 1024}; + + CHECK(test_device_memory_resource::count == 2); + + buffer = ::cuda::std::move(dst_buffer); + } + + CHECK(test_device_memory_resource::count == 1); +} From b07f036979db8e82a6743abecbc68a4df4d04bb8 Mon Sep 17 00:00:00 2001 From: Michael Schellenberger Costa Date: Thu, 19 Sep 2024 16:54:01 +0200 Subject: [PATCH 4/5] Refactor `cuda::ceil_div` to take two different types (#2376) * Rework `cuda::ceil_div` to take different types This replaces `cub::DivideAndRoundUp` --- cub/cub/agent/agent_radix_sort_histogram.cuh | 2 +- cub/cub/device/device_for.cuh | 2 +- .../dispatch/dispatch_adjacent_difference.cuh | 4 +- .../device/dispatch/dispatch_batch_memcpy.cuh | 4 +- cub/cub/device/dispatch/dispatch_for.cuh | 4 +- .../device/dispatch/dispatch_histogram.cuh | 2 +- cub/cub/device/dispatch/dispatch_merge.cuh | 5 +- .../device/dispatch/dispatch_merge_sort.cuh | 5 +- .../device/dispatch/dispatch_radix_sort.cuh | 14 +-- .../dispatch/dispatch_reduce_by_key.cuh | 4 +- cub/cub/device/dispatch/dispatch_rle.cuh | 6 +- cub/cub/device/dispatch/dispatch_scan.cuh | 4 +- .../device/dispatch/dispatch_scan_by_key.cuh | 4 +- .../dispatch/dispatch_segmented_sort.cuh | 8 +- 
.../device/dispatch/dispatch_select_if.cuh | 6 +- .../device/dispatch/dispatch_spmv_orig.cuh | 14 +-- .../dispatch/dispatch_three_way_partition.cuh | 6 +- .../dispatch/dispatch_unique_by_key.cuh | 6 +- cub/cub/grid/grid_even_share.cuh | 2 +- cub/cub/util_math.cuh | 8 +- .../example_device_decoupled_look_back.cu | 2 +- cub/test/catch2_large_array_sort_helper.cuh | 2 +- cub/test/catch2_radix_sort_helper.cuh | 2 +- cub/test/catch2_segmented_sort_helper.cuh | 2 +- .../catch2_test_device_decoupled_look_back.cu | 2 +- cub/test/catch2_test_vsmem.cu | 2 +- cub/test/test_util.h | 2 +- libcudacxx/include/cuda/__cmath/ceil_div.h | 87 +++++++++++++++++++ libcudacxx/include/cuda/cmath | 17 +--- .../cuda/std/__type_traits/underlying_type.h | 3 + .../test/libcudacxx/cuda/cmath.pass.cpp | 57 ++++++++++-- thrust/thrust/system/cuda/detail/extrema.h | 2 +- thrust/thrust/system/cuda/detail/reduce.h | 2 +- .../thrust/system/cuda/detail/reduce_by_key.h | 2 +- thrust/thrust/system/cuda/detail/unique.h | 2 +- 35 files changed, 208 insertions(+), 88 deletions(-) create mode 100644 libcudacxx/include/cuda/__cmath/ceil_div.h diff --git a/cub/cub/agent/agent_radix_sort_histogram.cuh b/cub/cub/agent/agent_radix_sort_histogram.cuh index 5004334bfe..87f5e0790d 100644 --- a/cub/cub/agent/agent_radix_sort_histogram.cuh +++ b/cub/cub/agent/agent_radix_sort_histogram.cuh @@ -242,7 +242,7 @@ struct AgentRadixSortHistogram // Within a portion, avoid overflowing (u)int32 counters. // Between portions, accumulate results in global memory. constexpr OffsetT MAX_PORTION_SIZE = 1 << 30; - OffsetT num_portions = cub::DivideAndRoundUp(num_items, MAX_PORTION_SIZE); + OffsetT num_portions = ::cuda::ceil_div(num_items, MAX_PORTION_SIZE); for (OffsetT portion = 0; portion < num_portions; ++portion) { // Reset the counters. 
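The call-site change in this patch is mechanical: cub::DivideAndRoundUp(n, d) becomes ::cuda::ceil_div(n, d), which after this refactor accepts two different integer types and returns the dividend's type. A minimal host-side sketch of the new call pattern (illustrative values only, not taken from the diff; assumes CCCL's <cuda/cmath> is on the include path):

#include <cuda/cmath>   // provides cuda::ceil_div

#include <cassert>
#include <cstddef>

int main()
{
  const std::size_t num_items = 1000;  // dividend: a non-negative integral value
  const int tile_size         = 256;   // divisor: positive; may be a different integral (or enum) type
  const std::size_t num_tiles = cuda::ceil_div(num_items, tile_size);
  assert(num_tiles == 4);              // (1000 + 256 - 1) / 256 == 4; result keeps the dividend's type
  return 0;
}

Enum divisors are forwarded through their underlying type by the third overload added in libcudacxx/include/cuda/__cmath/ceil_div.h below.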
diff --git a/cub/cub/device/device_for.cuh b/cub/cub/device/device_for.cuh index 0e0bcaa36c..3d8c4286bb 100644 --- a/cub/cub/device/device_for.cuh +++ b/cub/cub/device/device_for.cuh @@ -154,7 +154,7 @@ private: if (is_aligned(unwrapped_first)) { // Vectorize loads - const OffsetT num_vec_items = cub::DivideAndRoundUp(num_items, wrapped_op_t::vec_size); + const OffsetT num_vec_items = ::cuda::ceil_div(num_items, wrapped_op_t::vec_size); return detail::for_each::dispatch_t::dispatch( num_vec_items, diff --git a/cub/cub/device/dispatch/dispatch_adjacent_difference.cuh b/cub/cub/device/dispatch/dispatch_adjacent_difference.cuh index d4ae6ecddd..af41c7137c 100644 --- a/cub/cub/device/dispatch/dispatch_adjacent_difference.cuh +++ b/cub/cub/device/dispatch/dispatch_adjacent_difference.cuh @@ -205,7 +205,7 @@ struct DispatchAdjacentDifference : public SelectedPolicy do { constexpr int tile_size = AdjacentDifferencePolicyT::ITEMS_PER_TILE; - const int num_tiles = static_cast(DivideAndRoundUp(num_items, tile_size)); + const int num_tiles = static_cast(::cuda::ceil_div(num_items, tile_size)); std::size_t first_tile_previous_size = MayAlias * num_tiles * sizeof(InputT); @@ -244,7 +244,7 @@ struct DispatchAdjacentDifference : public SelectedPolicy using AgentDifferenceInitT = AgentDifferenceInit; constexpr int init_block_size = AgentDifferenceInitT::BLOCK_THREADS; - const int init_grid_size = DivideAndRoundUp(num_tiles, init_block_size); + const int init_grid_size = ::cuda::ceil_div(num_tiles, init_block_size); #ifdef CUB_DETAIL_DEBUG_ENABLE_LOG _CubLog("Invoking DeviceAdjacentDifferenceInitKernel" diff --git a/cub/cub/device/dispatch/dispatch_batch_memcpy.cuh b/cub/cub/device/dispatch/dispatch_batch_memcpy.cuh index df84b2d6b9..78681953d2 100644 --- a/cub/cub/device/dispatch/dispatch_batch_memcpy.cuh +++ b/cub/cub/device/dispatch/dispatch_batch_memcpy.cuh @@ -453,7 +453,7 @@ struct DispatchBatchMemcpy : SelectedPolicy * ActivePolicyT::AgentSmallBufferPolicyT::BUFFERS_PER_THREAD; // The number of thread blocks (or tiles) required to process all of the given buffers - BlockOffsetT num_tiles = DivideAndRoundUp(num_buffers, TILE_SIZE); + BlockOffsetT num_tiles = ::cuda::ceil_div(num_buffers, TILE_SIZE); using BlevBufferSrcsOutT = ::cuda::std::_If>; using BlevBufferDstOutT = ::cuda::std::_If>; @@ -528,7 +528,7 @@ struct DispatchBatchMemcpy : SelectedPolicy BlevBufferTileOffsetsOutItT d_blev_block_offsets = blev_buffer_block_alloc.get(); // Kernels' grid sizes - BlockOffsetT init_grid_size = DivideAndRoundUp(num_tiles, INIT_KERNEL_THREADS); + BlockOffsetT init_grid_size = ::cuda::ceil_div(num_tiles, INIT_KERNEL_THREADS); BlockOffsetT batch_memcpy_grid_size = num_tiles; // Kernels diff --git a/cub/cub/device/dispatch/dispatch_for.cuh b/cub/cub/device/dispatch/dispatch_for.cuh index 4af6864b03..de0189490f 100644 --- a/cub/cub/device/dispatch/dispatch_for.cuh +++ b/cub/cub/device/dispatch/dispatch_for.cuh @@ -99,7 +99,7 @@ struct dispatch_t : PolicyHubT constexpr int items_per_thread = ActivePolicyT::for_policy_t::items_per_thread; const auto tile_size = static_cast(block_threads * items_per_thread); - const auto num_tiles = cub::DivideAndRoundUp(num_items, tile_size); + const auto num_tiles = ::cuda::ceil_div(num_items, tile_size); #ifdef CUB_DETAIL_DEBUG_ENABLE_LOG _CubLog("Invoking detail::for_each::dynamic_kernel<<<%d, %d, 0, %lld>>>(), " @@ -144,7 +144,7 @@ struct dispatch_t : PolicyHubT constexpr int items_per_thread = ActivePolicyT::for_policy_t::items_per_thread; const auto tile_size = 
static_cast(block_threads * items_per_thread); - const auto num_tiles = cub::DivideAndRoundUp(num_items, tile_size); + const auto num_tiles = ::cuda::ceil_div(num_items, tile_size); #ifdef CUB_DETAIL_DEBUG_ENABLE_LOG _CubLog("Invoking detail::for_each::static_kernel<<<%d, %d, 0, %lld>>>(), " diff --git a/cub/cub/device/dispatch/dispatch_histogram.cuh b/cub/cub/device/dispatch/dispatch_histogram.cuh index 1839385b19..e658fdb455 100644 --- a/cub/cub/device/dispatch/dispatch_histogram.cuh +++ b/cub/cub/device/dispatch/dispatch_histogram.cuh @@ -361,7 +361,7 @@ struct dispatch_histogram // Get grid dimensions, trying to keep total blocks ~histogram_sweep_occupancy int pixels_per_tile = block_threads * pixels_per_thread; - int tiles_per_row = static_cast(cub::DivideAndRoundUp(num_row_pixels, pixels_per_tile)); + int tiles_per_row = static_cast(::cuda::ceil_div(num_row_pixels, pixels_per_tile)); int blocks_per_row = CUB_MIN(histogram_sweep_occupancy, tiles_per_row); int blocks_per_col = (blocks_per_row > 0) ? int(CUB_MIN(histogram_sweep_occupancy / blocks_per_row, num_rows)) : 0; diff --git a/cub/cub/device/dispatch/dispatch_merge.cuh b/cub/cub/device/dispatch/dispatch_merge.cuh index 2c16d85144..8db48787d6 100644 --- a/cub/cub/device/dispatch/dispatch_merge.cuh +++ b/cub/cub/device/dispatch/dispatch_merge.cuh @@ -238,7 +238,7 @@ struct dispatch_t typename choose_merge_agent:: type; - const auto num_tiles = cub::DivideAndRoundUp(num_items1 + num_items2, agent_t::policy::ITEMS_PER_TILE); + const auto num_tiles = ::cuda::ceil_div(num_items1 + num_items2, agent_t::policy::ITEMS_PER_TILE); void* allocations[2] = {nullptr, nullptr}; { const std::size_t merge_partitions_size = (1 + num_tiles) * sizeof(Offset); @@ -263,8 +263,7 @@ struct dispatch_t { const Offset num_partitions = num_tiles + 1; constexpr int threads_per_partition_block = 256; // TODO(bgruber): no policy? 
- const int partition_grid_size = - static_cast(cub::DivideAndRoundUp(num_partitions, threads_per_partition_block)); + const int partition_grid_size = static_cast(::cuda::ceil_div(num_partitions, threads_per_partition_block)); auto error = CubDebug( THRUST_NS_QUALIFIER::cuda_cub::launcher::triple_chevron( diff --git a/cub/cub/device/dispatch/dispatch_merge_sort.cuh b/cub/cub/device/dispatch/dispatch_merge_sort.cuh index 59deb2e529..307c53c1f0 100644 --- a/cub/cub/device/dispatch/dispatch_merge_sort.cuh +++ b/cub/cub/device/dispatch/dispatch_merge_sort.cuh @@ -508,7 +508,7 @@ struct DispatchMergeSort : SelectedPolicy do { constexpr auto tile_size = merge_sort_helper_t::policy_t::ITEMS_PER_TILE; - const auto num_tiles = cub::DivideAndRoundUp(num_items, tile_size); + const auto num_tiles = ::cuda::ceil_div(num_items, tile_size); const auto merge_partitions_size = static_cast(1 + num_tiles) * sizeof(OffsetT); const auto temporary_keys_storage_size = static_cast(num_items * sizeof(KeyT)); @@ -597,8 +597,7 @@ struct DispatchMergeSort : SelectedPolicy const OffsetT num_partitions = num_tiles + 1; constexpr int threads_per_partition_block = 256; - const int partition_grid_size = - static_cast(cub::DivideAndRoundUp(num_partitions, threads_per_partition_block)); + const int partition_grid_size = static_cast(::cuda::ceil_div(num_partitions, threads_per_partition_block)); error = CubDebug(detail::DebugSyncStream(stream)); if (cudaSuccess != error) diff --git a/cub/cub/device/dispatch/dispatch_radix_sort.cuh b/cub/cub/device/dispatch/dispatch_radix_sort.cuh index fc0d8b8c22..1da6febf20 100644 --- a/cub/cub/device/dispatch/dispatch_radix_sort.cuh +++ b/cub/cub/device/dispatch/dispatch_radix_sort.cuh @@ -2091,10 +2091,10 @@ struct DispatchRadixSort : SelectedPolicy // portions handle inputs with >=2**30 elements, due to the way lookback works // for testing purposes, one portion is <= 2**28 elements constexpr PortionOffsetT PORTION_SIZE = ((1 << 28) - 1) / ONESWEEP_TILE_ITEMS * ONESWEEP_TILE_ITEMS; - int num_passes = cub::DivideAndRoundUp(end_bit - begin_bit, RADIX_BITS); - OffsetT num_portions = static_cast(cub::DivideAndRoundUp(num_items, PORTION_SIZE)); - PortionOffsetT max_num_blocks = cub::DivideAndRoundUp( - static_cast(CUB_MIN(num_items, static_cast(PORTION_SIZE))), ONESWEEP_TILE_ITEMS); + int num_passes = ::cuda::ceil_div(end_bit - begin_bit, RADIX_BITS); + OffsetT num_portions = static_cast(::cuda::ceil_div(num_items, PORTION_SIZE)); + PortionOffsetT max_num_blocks = + ::cuda::ceil_div(static_cast(CUB_MIN(num_items, static_cast(PORTION_SIZE))), ONESWEEP_TILE_ITEMS); size_t value_size = KEYS_ONLY ? 0 : sizeof(ValueT); size_t allocation_sizes[] = { @@ -2237,7 +2237,7 @@ struct DispatchRadixSort : SelectedPolicy PortionOffsetT portion_num_items = static_cast( CUB_MIN(num_items - portion * PORTION_SIZE, static_cast(PORTION_SIZE))); - PortionOffsetT num_blocks = cub::DivideAndRoundUp(portion_num_items, ONESWEEP_TILE_ITEMS); + PortionOffsetT num_blocks = ::cuda::ceil_div(portion_num_items, ONESWEEP_TILE_ITEMS); error = CubDebug(cudaMemsetAsync(d_lookback, 0, num_blocks * RADIX_DIGITS * sizeof(AtomicOffsetT), stream)); if (cudaSuccess != error) @@ -2429,7 +2429,7 @@ struct DispatchRadixSort : SelectedPolicy // Pass planning. 
Run passes of the alternate digit-size configuration until we have an even multiple of our // preferred digit size int num_bits = end_bit - begin_bit; - int num_passes = cub::DivideAndRoundUp(num_bits, pass_config.radix_bits); + int num_passes = ::cuda::ceil_div(num_bits, pass_config.radix_bits); bool is_num_passes_odd = num_passes & 1; int max_alt_passes = (num_passes * pass_config.radix_bits) - num_bits; int alt_end_bit = CUB_MIN(end_bit, begin_bit + (max_alt_passes * alt_pass_config.radix_bits)); @@ -3055,7 +3055,7 @@ struct DispatchSegmentedRadixSort : SelectedPolicy int radix_bits = ActivePolicyT::SegmentedPolicy::RADIX_BITS; int alt_radix_bits = ActivePolicyT::AltSegmentedPolicy::RADIX_BITS; int num_bits = end_bit - begin_bit; - int num_passes = CUB_MAX(DivideAndRoundUp(num_bits, radix_bits), 1); + int num_passes = CUB_MAX(::cuda::ceil_div(num_bits, radix_bits), 1); bool is_num_passes_odd = num_passes & 1; int max_alt_passes = (num_passes * radix_bits) - num_bits; int alt_end_bit = CUB_MIN(end_bit, begin_bit + (max_alt_passes * alt_radix_bits)); diff --git a/cub/cub/device/dispatch/dispatch_reduce_by_key.cuh b/cub/cub/device/dispatch/dispatch_reduce_by_key.cuh index 8ae232e8d1..00d7280701 100644 --- a/cub/cub/device/dispatch/dispatch_reduce_by_key.cuh +++ b/cub/cub/device/dispatch/dispatch_reduce_by_key.cuh @@ -308,7 +308,7 @@ struct DispatchReduceByKey // Number of input tiles int tile_size = block_threads * items_per_thread; - int num_tiles = static_cast(cub::DivideAndRoundUp(num_items, tile_size)); + int num_tiles = static_cast(::cuda::ceil_div(num_items, tile_size)); // Specify temporary storage allocation requirements size_t allocation_sizes[1]; @@ -344,7 +344,7 @@ struct DispatchReduceByKey } // Log init_kernel configuration - int init_grid_size = CUB_MAX(1, cub::DivideAndRoundUp(num_tiles, INIT_KERNEL_THREADS)); + int init_grid_size = CUB_MAX(1, ::cuda::ceil_div(num_tiles, INIT_KERNEL_THREADS)); #ifdef CUB_DETAIL_DEBUG_ENABLE_LOG _CubLog("Invoking init_kernel<<<%d, %d, 0, %lld>>>()\n", init_grid_size, INIT_KERNEL_THREADS, (long long) stream); diff --git a/cub/cub/device/dispatch/dispatch_rle.cuh b/cub/cub/device/dispatch/dispatch_rle.cuh index 2ca3527b60..2a6a0b3b64 100644 --- a/cub/cub/device/dispatch/dispatch_rle.cuh +++ b/cub/cub/device/dispatch/dispatch_rle.cuh @@ -312,7 +312,7 @@ struct DeviceRleDispatch // Number of input tiles int tile_size = block_threads * items_per_thread; - int num_tiles = static_cast(cub::DivideAndRoundUp(num_items, tile_size)); + int num_tiles = static_cast(::cuda::ceil_div(num_items, tile_size)); // Specify temporary storage allocation requirements size_t allocation_sizes[1]; @@ -347,7 +347,7 @@ struct DeviceRleDispatch } // Log device_scan_init_kernel configuration - int init_grid_size = CUB_MAX(1, cub::DivideAndRoundUp(num_tiles, INIT_KERNEL_THREADS)); + int init_grid_size = CUB_MAX(1, ::cuda::ceil_div(num_tiles, INIT_KERNEL_THREADS)); #ifdef CUB_DETAIL_DEBUG_ENABLE_LOG _CubLog("Invoking device_scan_init_kernel<<<%d, %d, 0, %lld>>>()\n", @@ -401,7 +401,7 @@ struct DeviceRleDispatch // Get grid size for scanning tiles dim3 scan_grid_size; scan_grid_size.z = 1; - scan_grid_size.y = cub::DivideAndRoundUp(num_tiles, max_dim_x); + scan_grid_size.y = ::cuda::ceil_div(num_tiles, max_dim_x); scan_grid_size.x = CUB_MIN(num_tiles, max_dim_x); // Log device_rle_sweep_kernel configuration diff --git a/cub/cub/device/dispatch/dispatch_scan.cuh b/cub/cub/device/dispatch/dispatch_scan.cuh index 7d2fc4ac17..d1efaa01cd 100644 --- 
a/cub/cub/device/dispatch/dispatch_scan.cuh +++ b/cub/cub/device/dispatch/dispatch_scan.cuh @@ -382,7 +382,7 @@ struct DispatchScan : SelectedPolicy // Number of input tiles int tile_size = Policy::BLOCK_THREADS * Policy::ITEMS_PER_THREAD; - int num_tiles = static_cast(cub::DivideAndRoundUp(num_items, tile_size)); + int num_tiles = static_cast(::cuda::ceil_div(num_items, tile_size)); // Specify temporary storage allocation requirements size_t allocation_sizes[1]; @@ -424,7 +424,7 @@ struct DispatchScan : SelectedPolicy } // Log init_kernel configuration - int init_grid_size = cub::DivideAndRoundUp(num_tiles, INIT_KERNEL_THREADS); + int init_grid_size = ::cuda::ceil_div(num_tiles, INIT_KERNEL_THREADS); #ifdef CUB_DETAIL_DEBUG_ENABLE_LOG _CubLog("Invoking init_kernel<<<%d, %d, 0, %lld>>>()\n", init_grid_size, INIT_KERNEL_THREADS, (long long) stream); diff --git a/cub/cub/device/dispatch/dispatch_scan_by_key.cuh b/cub/cub/device/dispatch/dispatch_scan_by_key.cuh index 5dfffa5e77..b1d295a604 100644 --- a/cub/cub/device/dispatch/dispatch_scan_by_key.cuh +++ b/cub/cub/device/dispatch/dispatch_scan_by_key.cuh @@ -389,7 +389,7 @@ struct DispatchScanByKey : SelectedPolicy // Number of input tiles int tile_size = Policy::BLOCK_THREADS * Policy::ITEMS_PER_THREAD; - int num_tiles = static_cast(cub::DivideAndRoundUp(num_items, tile_size)); + int num_tiles = static_cast(::cuda::ceil_div(num_items, tile_size)); // Specify temporary storage allocation requirements size_t allocation_sizes[2]; @@ -435,7 +435,7 @@ struct DispatchScanByKey : SelectedPolicy } // Log init_kernel configuration - int init_grid_size = cub::DivideAndRoundUp(num_tiles, INIT_KERNEL_THREADS); + int init_grid_size = ::cuda::ceil_div(num_tiles, INIT_KERNEL_THREADS); #ifdef CUB_DETAIL_DEBUG_ENABLE_LOG _CubLog("Invoking init_kernel<<<%d, %d, 0, %lld>>>()\n", init_grid_size, INIT_KERNEL_THREADS, (long long) stream); #endif // CUB_DETAIL_DEBUG_ENABLE_LOG diff --git a/cub/cub/device/dispatch/dispatch_segmented_sort.cuh b/cub/cub/device/dispatch/dispatch_segmented_sort.cuh index 702df00df3..80d8973c75 100644 --- a/cub/cub/device/dispatch/dispatch_segmented_sort.cuh +++ b/cub/cub/device/dispatch/dispatch_segmented_sort.cuh @@ -581,10 +581,10 @@ CUB_RUNTIME_FUNCTION _CCCL_VISIBILITY_HIDDEN cudaError_t DeviceSegmentedSortCont const unsigned int small_segments = group_sizes[1]; const unsigned int medium_segments = static_cast(num_segments) - (large_segments + small_segments); - const unsigned int small_blocks = DivideAndRoundUp(small_segments, SmallAndMediumPolicyT::SEGMENTS_PER_SMALL_BLOCK); + const unsigned int small_blocks = ::cuda::ceil_div(small_segments, SmallAndMediumPolicyT::SEGMENTS_PER_SMALL_BLOCK); const unsigned int medium_blocks = - DivideAndRoundUp(medium_segments, SmallAndMediumPolicyT::SEGMENTS_PER_MEDIUM_BLOCK); + ::cuda::ceil_div(medium_segments, SmallAndMediumPolicyT::SEGMENTS_PER_MEDIUM_BLOCK); const unsigned int small_and_medium_blocks_in_grid = small_blocks + medium_blocks; @@ -1311,7 +1311,7 @@ struct DispatchSegmentedSort : SelectedPolicy * To avoid these issues, we have to use extra memory. The extra memory * holds temporary storage for writing intermediate results of each stage. * Since we iterate over digits in keys, we potentially need: - * `sizeof(KeyT) * num_items * DivideAndRoundUp(sizeof(KeyT),RADIX_BITS)` + * `sizeof(KeyT) * num_items * cuda::ceil_div(sizeof(KeyT),RADIX_BITS)` * auxiliary memory bytes. To reduce the auxiliary memory storage * requirements, the algorithm relies on a double buffer facility. 
The * idea behind it is in swapping destination and source buffers at each @@ -1476,7 +1476,7 @@ private: { constexpr int byte_size = 8; constexpr int num_bits = sizeof(KeyT) * byte_size; - const int num_passes = DivideAndRoundUp(num_bits, radix_bits); + const int num_passes = ::cuda::ceil_div(num_bits, radix_bits); return num_passes; } diff --git a/cub/cub/device/dispatch/dispatch_select_if.cuh b/cub/cub/device/dispatch/dispatch_select_if.cuh index 50a2022184..37528662a0 100644 --- a/cub/cub/device/dispatch/dispatch_select_if.cuh +++ b/cub/cub/device/dispatch/dispatch_select_if.cuh @@ -415,7 +415,7 @@ struct DispatchSelectIf : SelectedPolicy constexpr auto block_threads = VsmemHelperT::agent_policy_t::BLOCK_THREADS; constexpr auto items_per_thread = VsmemHelperT::agent_policy_t::ITEMS_PER_THREAD; constexpr int tile_size = block_threads * items_per_thread; - int num_tiles = static_cast(cub::DivideAndRoundUp(num_items, tile_size)); + int num_tiles = static_cast(::cuda::ceil_div(num_items, tile_size)); const auto vsmem_size = num_tiles * VsmemHelperT::vsmem_per_block; do @@ -462,7 +462,7 @@ struct DispatchSelectIf : SelectedPolicy } // Log scan_init_kernel configuration - int init_grid_size = CUB_MAX(1, cub::DivideAndRoundUp(num_tiles, INIT_KERNEL_THREADS)); + int init_grid_size = CUB_MAX(1, ::cuda::ceil_div(num_tiles, INIT_KERNEL_THREADS)); #ifdef CUB_DETAIL_DEBUG_ENABLE_LOG _CubLog( @@ -504,7 +504,7 @@ struct DispatchSelectIf : SelectedPolicy // Get grid size for scanning tiles dim3 scan_grid_size; scan_grid_size.z = 1; - scan_grid_size.y = cub::DivideAndRoundUp(num_tiles, max_dim_x); + scan_grid_size.y = ::cuda::ceil_div(num_tiles, max_dim_x); scan_grid_size.x = CUB_MIN(num_tiles, max_dim_x); // Log select_if_kernel configuration diff --git a/cub/cub/device/dispatch/dispatch_spmv_orig.cuh b/cub/cub/device/dispatch/dispatch_spmv_orig.cuh index 0519dcc739..e6aeb9f9f1 100644 --- a/cub/cub/device/dispatch/dispatch_spmv_orig.cuh +++ b/cub/cub/device/dispatch/dispatch_spmv_orig.cuh @@ -624,7 +624,7 @@ struct DispatchSpmv } constexpr int threads_in_block = EMPTY_MATRIX_KERNEL_THREADS; - const int blocks_in_grid = cub::DivideAndRoundUp(spmv_params.num_rows, threads_in_block); + const int blocks_in_grid = ::cuda::ceil_div(spmv_params.num_rows, threads_in_block); #ifdef CUB_DETAIL_DEBUG_ENABLE_LOG _CubLog("Invoking spmv_empty_matrix_kernel<<<%d, %d, 0, %lld>>>()\n", @@ -661,7 +661,7 @@ struct DispatchSpmv // Get search/init grid dims int degen_col_kernel_block_size = INIT_KERNEL_THREADS; - int degen_col_kernel_grid_size = cub::DivideAndRoundUp(spmv_params.num_rows, degen_col_kernel_block_size); + int degen_col_kernel_grid_size = ::cuda::ceil_div(spmv_params.num_rows, degen_col_kernel_block_size); #ifdef CUB_DETAIL_DEBUG_ENABLE_LOG _CubLog("Invoking spmv_1col_kernel<<<%d, %d, 0, %lld>>>()\n", @@ -720,8 +720,8 @@ struct DispatchSpmv int segment_fixup_tile_size = segment_fixup_config.block_threads * segment_fixup_config.items_per_thread; // Number of tiles for kernels - int num_merge_tiles = cub::DivideAndRoundUp(num_merge_items, merge_tile_size); - int num_segment_fixup_tiles = cub::DivideAndRoundUp(num_merge_tiles, segment_fixup_tile_size); + int num_merge_tiles = ::cuda::ceil_div(num_merge_items, merge_tile_size); + int num_segment_fixup_tiles = ::cuda::ceil_div(num_merge_tiles, segment_fixup_tile_size); // Get SM occupancy for kernels int spmv_sm_occupancy; @@ -738,10 +738,10 @@ struct DispatchSpmv } // Get grid dimensions - dim3 spmv_grid_size(CUB_MIN(num_merge_tiles, max_dim_x), 
cub::DivideAndRoundUp(num_merge_tiles, max_dim_x), 1); + dim3 spmv_grid_size(CUB_MIN(num_merge_tiles, max_dim_x), ::cuda::ceil_div(num_merge_tiles, max_dim_x), 1); dim3 segment_fixup_grid_size( - CUB_MIN(num_segment_fixup_tiles, max_dim_x), cub::DivideAndRoundUp(num_segment_fixup_tiles, max_dim_x), 1); + CUB_MIN(num_segment_fixup_tiles, max_dim_x), ::cuda::ceil_div(num_segment_fixup_tiles, max_dim_x), 1); // Get the temporary storage allocation requirements size_t allocation_sizes[3]; @@ -777,7 +777,7 @@ struct DispatchSpmv // Get search/init grid dims int search_block_size = INIT_KERNEL_THREADS; - int search_grid_size = cub::DivideAndRoundUp(num_merge_tiles + 1, search_block_size); + int search_grid_size = ::cuda::ceil_div(num_merge_tiles + 1, search_block_size); if (search_grid_size < sm_count) // if (num_merge_tiles < spmv_sm_occupancy * sm_count) diff --git a/cub/cub/device/dispatch/dispatch_three_way_partition.cuh b/cub/cub/device/dispatch/dispatch_three_way_partition.cuh index be01d7508c..e77f82e064 100644 --- a/cub/cub/device/dispatch/dispatch_three_way_partition.cuh +++ b/cub/cub/device/dispatch/dispatch_three_way_partition.cuh @@ -237,7 +237,7 @@ struct DispatchThreeWayPartitionIf // Number of input tiles int tile_size = block_threads * items_per_thread; - int num_tiles = static_cast(DivideAndRoundUp(num_items, tile_size)); + int num_tiles = static_cast(::cuda::ceil_div(num_items, tile_size)); // Specify temporary storage allocation requirements size_t allocation_sizes[1]; // bytes needed for tile status descriptors @@ -281,7 +281,7 @@ struct DispatchThreeWayPartitionIf } // Log three_way_partition_init_kernel configuration - int init_grid_size = CUB_MAX(1, DivideAndRoundUp(num_tiles, INIT_KERNEL_THREADS)); + int init_grid_size = CUB_MAX(1, ::cuda::ceil_div(num_tiles, INIT_KERNEL_THREADS)); #ifdef CUB_DETAIL_DEBUG_ENABLE_LOG _CubLog("Invoking three_way_partition_init_kernel<<<%d, %d, 0, %lld>>>()\n", @@ -319,7 +319,7 @@ struct DispatchThreeWayPartitionIf // Get grid size for scanning tiles dim3 scan_grid_size; scan_grid_size.z = 1; - scan_grid_size.y = DivideAndRoundUp(num_tiles, max_dim_x); + scan_grid_size.y = ::cuda::ceil_div(num_tiles, max_dim_x); scan_grid_size.x = CUB_MIN(num_tiles, max_dim_x); // Log select_if_kernel configuration diff --git a/cub/cub/device/dispatch/dispatch_unique_by_key.cuh b/cub/cub/device/dispatch/dispatch_unique_by_key.cuh index c943034221..a9c4008beb 100644 --- a/cub/cub/device/dispatch/dispatch_unique_by_key.cuh +++ b/cub/cub/device/dispatch/dispatch_unique_by_key.cuh @@ -385,7 +385,7 @@ struct DispatchUniqueByKey : SelectedPolicy constexpr auto block_threads = VsmemHelperT::agent_policy_t::BLOCK_THREADS; constexpr auto items_per_thread = VsmemHelperT::agent_policy_t::ITEMS_PER_THREAD; int tile_size = block_threads * items_per_thread; - int num_tiles = static_cast(cub::DivideAndRoundUp(num_items, tile_size)); + int num_tiles = static_cast(::cuda::ceil_div(num_items, tile_size)); const auto vsmem_size = num_tiles * VsmemHelperT::vsmem_per_block; // Specify temporary storage allocation requirements @@ -423,7 +423,7 @@ struct DispatchUniqueByKey : SelectedPolicy // Log init_kernel configuration num_tiles = CUB_MAX(1, num_tiles); - int init_grid_size = cub::DivideAndRoundUp(num_tiles, INIT_KERNEL_THREADS); + int init_grid_size = ::cuda::ceil_div(num_tiles, INIT_KERNEL_THREADS); #ifdef CUB_DETAIL_DEBUG_ENABLE_LOG _CubLog("Invoking init_kernel<<<%d, %d, 0, %lld>>>()\n", init_grid_size, INIT_KERNEL_THREADS, (long long) stream); @@ -464,7 +464,7 @@ struct 
DispatchUniqueByKey : SelectedPolicy // Get grid size for scanning tiles dim3 scan_grid_size; scan_grid_size.z = 1; - scan_grid_size.y = cub::DivideAndRoundUp(num_tiles, max_dim_x); + scan_grid_size.y = ::cuda::ceil_div(num_tiles, max_dim_x); scan_grid_size.x = CUB_MIN(num_tiles, max_dim_x); // Log select_if_kernel configuration diff --git a/cub/cub/grid/grid_even_share.cuh b/cub/cub/grid/grid_even_share.cuh index b194d62630..83ff8f92e3 100644 --- a/cub/cub/grid/grid_even_share.cuh +++ b/cub/cub/grid/grid_even_share.cuh @@ -130,7 +130,7 @@ public: this->block_offset = num_items_; // Initialize past-the-end this->block_end = num_items_; // Initialize past-the-end this->num_items = num_items_; - this->total_tiles = static_cast(cub::DivideAndRoundUp(num_items_, tile_items)); + this->total_tiles = static_cast(::cuda::ceil_div(num_items_, tile_items)); this->grid_size = CUB_MIN(total_tiles, max_grid_size); int avg_tiles_per_block = total_tiles / grid_size; // leftover grains go to big blocks: diff --git a/cub/cub/util_math.cuh b/cub/cub/util_math.cuh index e5b8444466..d170664c80 100644 --- a/cub/cub/util_math.cuh +++ b/cub/cub/util_math.cuh @@ -42,6 +42,7 @@ # pragma system_header #endif // no system header +#include #include CUB_NAMESPACE_BEGIN @@ -76,17 +77,16 @@ _CCCL_HOST_DEVICE _CCCL_FORCEINLINE OffsetT safe_add_bound_to_max(OffsetT lhs, O * * Effectively performs `(n + d - 1) / d`, but is robust against the case where * `(n + d - 1)` would overflow. + * deprecated [Since 2.8.0] `cub::DivideAndRoundUp` is deprecated. Use `cuda::ceil_div` instead. */ template +CUB_DEPRECATED_BECAUSE("Use cuda::ceil_div instead") _CCCL_HOST_DEVICE _CCCL_FORCEINLINE constexpr NumeratorT DivideAndRoundUp(NumeratorT n, DenominatorT d) { - // TODO(bgruber): implement using ::cuda::ceil_div static_assert( cub::detail::is_integral_or_enum::value && cub::detail::is_integral_or_enum::value, "DivideAndRoundUp is only intended for integral types."); - - // Static cast to undo integral promotion. - return static_cast(n / d + (n % d != 0 ? 
1 : 0)); + return ::cuda::ceil_div(n, d); } constexpr _CCCL_HOST_DEVICE int Nominal4BItemsToItemsCombined(int nominal_4b_items_per_thread, int combined_bytes) diff --git a/cub/examples/device/example_device_decoupled_look_back.cu b/cub/examples/device/example_device_decoupled_look_back.cu index 505d430ae6..806a9159d6 100644 --- a/cub/examples/device/example_device_decoupled_look_back.cu +++ b/cub/examples/device/example_device_decoupled_look_back.cu @@ -105,7 +105,7 @@ void decoupled_look_back_example(int blocks_in_grid) scan_tile_state_t tile_status; tile_status.Init(blocks_in_grid, d_temp_storage, temp_storage_bytes); constexpr unsigned int threads_in_init_block = 256; - const unsigned int blocks_in_init_grid = cub::DivideAndRoundUp(blocks_in_grid, threads_in_init_block); + const unsigned int blocks_in_init_grid = ::cuda::ceil_div(blocks_in_grid, threads_in_init_block); init_kernel<<>>(tile_status, blocks_in_grid); // Launch decoupled look-back diff --git a/cub/test/catch2_large_array_sort_helper.cuh b/cub/test/catch2_large_array_sort_helper.cuh index cf17b4b160..547a5f0c3b 100644 --- a/cub/test/catch2_large_array_sort_helper.cuh +++ b/cub/test/catch2_large_array_sort_helper.cuh @@ -398,7 +398,7 @@ private: using summary_t = detail::summary; const std::size_t max_summary_mem = num_items * (sizeof(KeyType) + sizeof(ValueType)); - const std::size_t max_summaries = cub::DivideAndRoundUp(max_summary_mem, sizeof(summary_t)); + const std::size_t max_summaries = ::cuda::ceil_div(max_summary_mem, sizeof(summary_t)); return max_summaries; } diff --git a/cub/test/catch2_radix_sort_helper.cuh b/cub/test/catch2_radix_sort_helper.cuh index 61b02fc6f1..4bdcec3992 100644 --- a/cub/test/catch2_radix_sort_helper.cuh +++ b/cub/test/catch2_radix_sort_helper.cuh @@ -461,7 +461,7 @@ template void generate_segment_offsets(c2h::seed_t seed, c2h::device_vector& offsets, std::size_t num_items) { const std::size_t num_segments = offsets.size() - 1; - const OffsetT expected_segment_length = static_cast(cub::DivideAndRoundUp(num_items, num_segments)); + const OffsetT expected_segment_length = static_cast(::cuda::ceil_div(num_items, num_segments)); const OffsetT max_segment_length = (expected_segment_length * 2) + 1; c2h::gen(seed, offsets, OffsetT{0}, max_segment_length); thrust::exclusive_scan( diff --git a/cub/test/catch2_segmented_sort_helper.cuh b/cub/test/catch2_segmented_sort_helper.cuh index 1e454f3fa6..4892914305 100644 --- a/cub/test/catch2_segmented_sort_helper.cuh +++ b/cub/test/catch2_segmented_sort_helper.cuh @@ -1374,7 +1374,7 @@ inline c2h::device_vector generate_random_offsets(c2h::seed_t seed, int max_items, int max_segment, int num_segments) { C2H_TIME_SCOPE("generate_random_offsets"); - const int expected_segment_length = cub::DivideAndRoundUp(max_items, num_segments); + const int expected_segment_length = ::cuda::ceil_div(max_items, num_segments); const int max_segment_length = CUB_MIN(max_segment, (expected_segment_length * 2) + 1); c2h::device_vector offsets(num_segments + 1); diff --git a/cub/test/catch2_test_device_decoupled_look_back.cu b/cub/test/catch2_test_device_decoupled_look_back.cu index dab95aa187..ff778a3023 100644 --- a/cub/test/catch2_test_device_decoupled_look_back.cu +++ b/cub/test/catch2_test_device_decoupled_look_back.cu @@ -147,7 +147,7 @@ CUB_TEST("Decoupled look-back works with various message types", "[decoupled loo REQUIRE(status == cudaSuccess); constexpr unsigned int threads_in_init_block = 256; - const unsigned int blocks_in_init_grid = 
cub::DivideAndRoundUp(num_tiles, threads_in_init_block); + const unsigned int blocks_in_init_grid = ::cuda::ceil_div(num_tiles, threads_in_init_block); init_kernel<<>>(tile_status, num_tiles); REQUIRE(cudaSuccess == cudaPeekAtLastError()); REQUIRE(cudaSuccess == cudaDeviceSynchronize()); diff --git a/cub/test/catch2_test_vsmem.cu b/cub/test/catch2_test_vsmem.cu index 91069fcab1..da83e444d4 100644 --- a/cub/test/catch2_test_vsmem.cu +++ b/cub/test/catch2_test_vsmem.cu @@ -302,7 +302,7 @@ struct dispatch_dummy_algorithm_t : SelectedPolicy constexpr auto block_threads = vsmem_helper_t::agent_policy_t::BLOCK_THREADS; constexpr auto items_per_thread = vsmem_helper_t::agent_policy_t::ITEMS_PER_THREAD; constexpr auto tile_size = block_threads * items_per_thread; - const auto num_tiles = cub::DivideAndRoundUp(num_items, tile_size); + const auto num_tiles = ::cuda::ceil_div(num_items, tile_size); const auto total_vsmem = num_tiles * vsmem_helper_t::vsmem_per_block; // Get device ordinal diff --git a/cub/test/test_util.h b/cub/test/test_util.h index 01a0de2db4..3bea08ebae 100644 --- a/cub/test/test_util.h +++ b/cub/test/test_util.h @@ -1392,7 +1392,7 @@ void InitializeSegments(OffsetT num_items, int num_segments, OffsetT* h_segment_ return; } - OffsetT expected_segment_length = CUB_NS_QUALIFIER::DivideAndRoundUp(num_items, OffsetT(num_segments)); + OffsetT expected_segment_length = ::cuda::ceil_div(num_items, OffsetT(num_segments)); OffsetT offset = 0; for (int i = 0; i < num_segments; ++i) { diff --git a/libcudacxx/include/cuda/__cmath/ceil_div.h b/libcudacxx/include/cuda/__cmath/ceil_div.h new file mode 100644 index 0000000000..d6ee4f74f3 --- /dev/null +++ b/libcudacxx/include/cuda/__cmath/ceil_div.h @@ -0,0 +1,87 @@ +//===----------------------------------------------------------------------===// +// +// Part of libcu++, the C++ Standard Library for your entire system, +// under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. +// +//===----------------------------------------------------------------------===// + +#ifndef _CUDA___CMATH_CEIL_DIV_H +#define _CUDA___CMATH_CEIL_DIV_H + +#include + +#if defined(_CCCL_IMPLICIT_SYSTEM_HEADER_GCC) +# pragma GCC system_header +#elif defined(_CCCL_IMPLICIT_SYSTEM_HEADER_CLANG) +# pragma clang system_header +#elif defined(_CCCL_IMPLICIT_SYSTEM_HEADER_MSVC) +# pragma system_header +#endif // no system header + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +_LIBCUDACXX_BEGIN_NAMESPACE_CUDA + +//! @brief Divides two numbers \p __a and \p __b, rounding up if there is a remainder +//! @param __a The dividend +//! @param __b The divisor +//! @pre \p __a must be non-negative +//! @pre \p __b must be positive +template = 0, + _CUDA_VSTD::__enable_if_t<_CCCL_TRAIT(_CUDA_VSTD::is_integral, _Up), int> = 0> +_CCCL_NODISCARD _LIBCUDACXX_HIDE_FROM_ABI _CCCL_CONSTEXPR_CXX14 _Tp ceil_div(const _Tp __a, const _Up __b) noexcept +{ + _LIBCUDACXX_DEBUG_ASSERT(__b > _Tp(0), "cuda::ceil_div: b must be positive"); + using _UCommon = _CUDA_VSTD::__make_unsigned_t<_CUDA_VSTD::__common_type_t<_Tp, _Up>>; + const auto __res = static_cast<_UCommon>(__a) / static_cast<_UCommon>(__b); + return static_cast<_Tp>(__res + (__res * static_cast<_UCommon>(__b) != static_cast<_UCommon>(__a))); +} + +//! 
@brief Divides two numbers \p __a and \p __b, rounding up if there is a remainder +//! @param __a The dividend +//! @param __b The divisor +//! @pre \p __a must be non-negative +//! @pre \p __b must be positive +template = 0, + _CUDA_VSTD::__enable_if_t<_CCCL_TRAIT(_CUDA_VSTD::is_integral, _Up), int> = 0> +_CCCL_NODISCARD _LIBCUDACXX_HIDE_FROM_ABI _CCCL_CONSTEXPR_CXX14 _Tp ceil_div(const _Tp __a, const _Up __b) noexcept +{ + _LIBCUDACXX_DEBUG_ASSERT(__a >= _Tp(0), "cuda::ceil_div: a must be non negative"); + _LIBCUDACXX_DEBUG_ASSERT(__b > _Tp(0), "cuda::ceil_div: b must be positive"); + using _UCommon = _CUDA_VSTD::__make_unsigned_t<_CUDA_VSTD::__common_type_t<_Tp, _Up>>; + // Due to the precondition `__a >= 0` we can safely cast to unsigned without danger of overflowing + return static_cast<_Tp>((static_cast<_UCommon>(__a) + static_cast<_UCommon>(__b) - 1) / static_cast<_UCommon>(__b)); +} + +//! @brief Divides two numbers \p __a and \p __b, rounding up if there is a remainder, \p __b is an enum +//! @param __a The dividend +//! @param __b The divisor +//! @pre \p __a must be non-negative +//! @pre \p __b must be positive +template = 0, + _CUDA_VSTD::__enable_if_t<_CCCL_TRAIT(_CUDA_VSTD::is_enum, _Up), int> = 0> +_CCCL_NODISCARD _LIBCUDACXX_HIDE_FROM_ABI _CCCL_CONSTEXPR_CXX14 _Tp ceil_div(const _Tp __a, const _Up __b) noexcept +{ + return ::cuda::ceil_div(__a, static_cast<_CUDA_VSTD::__underlying_type_t<_Up>>(__b)); +} + +_LIBCUDACXX_END_NAMESPACE_CUDA + +#endif // _CUDA___CMATH_CEIL_DIV_H diff --git a/libcudacxx/include/cuda/cmath b/libcudacxx/include/cuda/cmath index 0b613ef83f..3de1cc6e92 100644 --- a/libcudacxx/include/cuda/cmath +++ b/libcudacxx/include/cuda/cmath @@ -21,22 +21,7 @@ # pragma system_header #endif // no system header -#include -#include +#include #include -#include - -_LIBCUDACXX_BEGIN_NAMESPACE_CUDA - -template = 0> -_CCCL_NODISCARD _LIBCUDACXX_HIDE_FROM_ABI _CCCL_CONSTEXPR_CXX14 _Tp ceil_div(const _Tp __a, const _Tp __b) noexcept -{ - _LIBCUDACXX_DEBUG_ASSERT(__a >= _Tp(0), "cuda::ceil_div: a must be non negative"); - _LIBCUDACXX_DEBUG_ASSERT(__b > _Tp(0), "cuda::ceil_div: b must be positive"); - const _Tp __res = static_cast<_Tp>(__a / __b); - return static_cast<_Tp>(__res + (__res * __b != __a)); -} - -_LIBCUDACXX_END_NAMESPACE_CUDA #endif // _CUDA_CMATH diff --git a/libcudacxx/include/cuda/std/__type_traits/underlying_type.h b/libcudacxx/include/cuda/std/__type_traits/underlying_type.h index 2e8691d945..039a80ae7f 100644 --- a/libcudacxx/include/cuda/std/__type_traits/underlying_type.h +++ b/libcudacxx/include/cuda/std/__type_traits/underlying_type.h @@ -43,6 +43,9 @@ template struct underlying_type : __underlying_type_impl<_Tp, is_enum<_Tp>::value> {}; +template +using __underlying_type_t = typename underlying_type<_Tp>::type; + # if _CCCL_STD_VER > 2011 template using underlying_type_t = typename underlying_type<_Tp>::type; diff --git a/libcudacxx/test/libcudacxx/cuda/cmath.pass.cpp b/libcudacxx/test/libcudacxx/cuda/cmath.pass.cpp index deb24dfabd..bf27a71b68 100644 --- a/libcudacxx/test/libcudacxx/cuda/cmath.pass.cpp +++ b/libcudacxx/test/libcudacxx/cuda/cmath.pass.cpp @@ -19,20 +19,67 @@ # include #endif // !TEST_COMPILER_NVRTC -template +template __host__ __device__ TEST_CONSTEXPR_CXX14 void test() { constexpr T maxv = cuda::std::numeric_limits::max(); - assert(cuda::ceil_div(T(0), T(1)) == T(0)); - assert(cuda::ceil_div(T(1), T(1)) == T(1)); - assert(cuda::ceil_div(T(126), T(64)) == T(2)); + // ensure that we return the right type + 
   static_assert(cuda::std::is_same::value, "");
+
+  assert(cuda::ceil_div(T(0), U(1)) == T(0));
+  assert(cuda::ceil_div(T(1), U(1)) == T(1));
+  assert(cuda::ceil_div(T(126), U(64)) == T(2));
 
   // ensure that we are resilient against overflow
-  assert(cuda::ceil_div(maxv, T(1)) == maxv);
+  assert(cuda::ceil_div(maxv, U(1)) == maxv);
   assert(cuda::ceil_div(maxv, maxv) == T(1));
 }
 
+template
+__host__ __device__ TEST_CONSTEXPR_CXX14 void test()
+{
+  // Builtin integer types:
+  test();
+  test();
+  test();
+
+  test();
+  test();
+
+  test();
+  test();
+
+  test();
+  test();
+
+  test();
+  test();
+
+#if !defined(TEST_COMPILER_NVRTC)
+  // cstdint types:
+  test();
+  test();
+  test();
+  test();
+
+  test();
+  test();
+  test();
+  test();
+
+  test();
+  test();
+  test();
+  test();
+#endif // !TEST_COMPILER_NVRTC
+
+#if !defined(TEST_HAS_NO_INT128_T)
+  test();
+  test();
+#endif // !TEST_HAS_NO_INT128_T
+}
+
 __host__ __device__ TEST_CONSTEXPR_CXX14 bool test()
 {
   // Builtin integer types:
diff --git a/thrust/thrust/system/cuda/detail/extrema.h b/thrust/thrust/system/cuda/detail/extrema.h
index c987b4800c..515971a337 100644
--- a/thrust/thrust/system/cuda/detail/extrema.h
+++ b/thrust/thrust/system/cuda/detail/extrema.h
@@ -273,7 +273,7 @@ cudaError_t THRUST_RUNTIME_FUNCTION doit_step(
   else if (reduce_plan.grid_mapping == cub::GRID_MAPPING_DYNAMIC)
   {
     // Work is distributed dynamically
-    size_t num_tiles = cub::DivideAndRoundUp(num_items, reduce_plan.items_per_tile);
+    size_t num_tiles = ::cuda::ceil_div(num_items, reduce_plan.items_per_tile);
 
     // if not enough to fill the device with threadblocks
     // then fill the device with threadblocks
diff --git a/thrust/thrust/system/cuda/detail/reduce.h b/thrust/thrust/system/cuda/detail/reduce.h
index e66c90883c..298c2bc10f 100644
--- a/thrust/thrust/system/cuda/detail/reduce.h
+++ b/thrust/thrust/system/cuda/detail/reduce.h
@@ -713,7 +713,7 @@ cudaError_t THRUST_RUNTIME_FUNCTION doit_step(
   else if (reduce_plan.grid_mapping == cub::GRID_MAPPING_DYNAMIC)
   {
     // Work is distributed dynamically
-    size_t num_tiles = cub::DivideAndRoundUp(num_items, reduce_plan.items_per_tile);
+    size_t num_tiles = ::cuda::ceil_div(num_items, reduce_plan.items_per_tile);
 
     // if not enough to fill the device with threadblocks
     // then fill the device with threadblocks
diff --git a/thrust/thrust/system/cuda/detail/reduce_by_key.h b/thrust/thrust/system/cuda/detail/reduce_by_key.h
index 526ae8b7dc..cdbcb82e90 100644
--- a/thrust/thrust/system/cuda/detail/reduce_by_key.h
+++ b/thrust/thrust/system/cuda/detail/reduce_by_key.h
@@ -808,7 +808,7 @@ THRUST_RUNTIME_FUNCTION cudaError_t doit_step(
   // Number of input tiles
   int tile_size = reduce_by_key_plan.items_per_tile;
-  Size num_tiles = cub::DivideAndRoundUp(num_items, tile_size);
+  Size num_tiles = ::cuda::ceil_div(num_items, tile_size);
 
   size_t vshmem_size = core::vshmem_size(reduce_by_key_plan.shared_memory_size, num_tiles);
diff --git a/thrust/thrust/system/cuda/detail/unique.h b/thrust/thrust/system/cuda/detail/unique.h
index 1cf58247d3..c3ad771561 100644
--- a/thrust/thrust/system/cuda/detail/unique.h
+++ b/thrust/thrust/system/cuda/detail/unique.h
@@ -509,7 +509,7 @@ static cudaError_t THRUST_RUNTIME_FUNCTION doit_step(
   typename get_plan::type unique_plan = unique_agent::get_plan(stream);
 
   int tile_size = unique_plan.items_per_tile;
-  size_t num_tiles = cub::DivideAndRoundUp(num_items, tile_size);
+  size_t num_tiles = ::cuda::ceil_div(num_items, tile_size);
   size_t vshmem_size = core::vshmem_size(unique_plan.shared_memory_size, num_tiles);
 
From ee94bb9154a8556f285da0b3c74dcfc51b10fb8c Mon Sep 17 00:00:00 2001
From: Allison Piper
Date: Thu, 19 Sep 2024 12:18:23 -0400
Subject: [PATCH 5/5] Reduce PR testing matrix. (#2436)

* Remove file annotation from verbose matrix warnings.

* Allow 'min', 'max', 'minmax' values for matrix `std` tags.

* Error when no supported stds available.

* Reduce PR testing matrix.

  1. Temporarily remove current nightly matrix pending NVKS bringup.
  2. Move current per-PR matrix to nightly.
  3. Reduce the number of jobs in the PR matrix while maintaining decent coverage.

  Before: (total jobs: 437)

  | 320 | `linux-amd64-cpu16`
  |  66 | `linux-amd64-gpu-v100-latest-1`
  |  28 | `linux-arm64-cpu16`
  |  23 | `windows-amd64-cpu16`

  After (total jobs: 183)

  | 126 | `linux-amd64-cpu16`
  |  26 | `linux-amd64-gpu-v100-latest-1`
  |  21 | `windows-amd64-cpu16`
  |  10 | `linux-arm64-cpu16`

* Restore old build matrix.
---
 .github/actions/workflow-build/build-workflow.py | 32 ++++++++---
 ci/matrix.yaml                                   | 57 ++++++++++++-------
 2 files changed, 59 insertions(+), 30 deletions(-)

diff --git a/.github/actions/workflow-build/build-workflow.py b/.github/actions/workflow-build/build-workflow.py
index cd2aad01d9..da3ba64420 100755
--- a/.github/actions/workflow-build/build-workflow.py
+++ b/.github/actions/workflow-build/build-workflow.py
@@ -334,6 +334,8 @@ def lookup_supported_stds(matrix_job):
     if 'project' in matrix_job:
         project = get_project(matrix_job['project'])
         stds = stds & set(project['stds'])
+    if len(stds) == 0:
+        raise Exception(error_message_with_matrix_job(matrix_job, "No supported stds found."))
     return sorted(list(stds))
@@ -626,18 +628,18 @@ def finalize_workflow_dispatch_groups(workflow_dispatch_groups_orig):
                 matching_consumers = merged_consumers[producer_index]
 
                 producer_name = producer['name']
-                print(f"::notice file=ci/matrix.yaml::Merging consumers for duplicate producer '{producer_name}' in '{group_name}'",
+                print(f"::notice::Merging consumers for duplicate producer '{producer_name}' in '{group_name}'",
                       file=sys.stderr)
                 consumer_names = ", ".join([consumer['name'] for consumer in matching_consumers])
-                print(f"::notice file=ci/matrix.yaml::Original consumers: {consumer_names}", file=sys.stderr)
+                print(f"::notice::Original consumers: {consumer_names}", file=sys.stderr)
                 consumer_names = ", ".join([consumer['name'] for consumer in consumers])
-                print(f"::notice file=ci/matrix.yaml::Duplicate consumers: {consumer_names}", file=sys.stderr)
+                print(f"::notice::Duplicate consumers: {consumer_names}", file=sys.stderr)
                 # Merge if unique:
                 for consumer in consumers:
                     if not dispatch_job_in_container(consumer, matching_consumers):
                         matching_consumers.append(consumer)
                 consumer_names = ", ".join([consumer['name'] for consumer in matching_consumers])
-                print(f"::notice file=ci/matrix.yaml::Merged consumers: {consumer_names}", file=sys.stderr)
+                print(f"::notice::Merged consumers: {consumer_names}", file=sys.stderr)
             else:
                 merged_producers.append(producer)
                 merged_consumers.append(consumers)
@@ -653,7 +655,7 @@ def finalize_workflow_dispatch_groups(workflow_dispatch_groups_orig):
         unique_standalone_jobs = []
         for job_json in standalone_jobs:
             if dispatch_job_in_container(job_json, unique_standalone_jobs):
-                print(f"::notice file=ci/matrix.yaml::Removing duplicate standalone job '{job_json['name']}' in '{group_name}'",
+                print(f"::notice::Removing duplicate standalone job '{job_json['name']}' in '{group_name}'",
                       file=sys.stderr)
             else:
                 unique_standalone_jobs.append(job_json)
@@ -663,12 +665,12 @@ def finalize_workflow_dispatch_groups(workflow_dispatch_groups_orig):
         for two_stage_job in two_stage_jobs:
             for producer in two_stage_job['producers']:
                 if remove_dispatch_job_from_container(producer, unique_standalone_jobs):
-                    print(f"::notice file=ci/matrix.yaml::Removing standalone job '{producer['name']}' " +
+                    print(f"::notice::Removing standalone job '{producer['name']}' " +
                           f"as it appears as a producer in '{group_name}'", file=sys.stderr)
             for consumer in two_stage_job['consumers']:
                 if remove_dispatch_job_from_container(producer, unique_standalone_jobs):
-                    print(f"::notice file=ci/matrix.yaml::Removing standalone job '{consumer['name']}' " +
+                    print(f"::notice::Removing standalone job '{consumer['name']}' " +
                           f"as it appears as a consumer in '{group_name}'", file=sys.stderr)
 
         standalone_jobs = list(unique_standalone_jobs)
@@ -864,8 +866,20 @@ def set_derived_tags(matrix_job):
         gpu = get_gpu(matrix_job['gpu'])
         matrix_job['sm'] = gpu['sm']
 
-    if 'std' in matrix_job and matrix_job['std'] == 'all':
-        matrix_job['std'] = lookup_supported_stds(matrix_job)
+    if 'std' in matrix_job:
+        if matrix_job['std'] == 'all':
+            matrix_job['std'] = lookup_supported_stds(matrix_job)
+        elif matrix_job['std'] == 'min':
+            matrix_job['std'] = min(lookup_supported_stds(matrix_job))
+        elif matrix_job['std'] == 'max':
+            matrix_job['std'] = max(lookup_supported_stds(matrix_job))
+        elif matrix_job['std'] == 'minmax':
+            stds = lookup_supported_stds(matrix_job)
+            if len(stds) == 1:
+                matrix_job['std'] = stds[0]
+            else:
+                matrix_job['std'] = [min(stds), max(stds)]
+
 
     # Add all deps before applying project job maps:
     for job in matrix_job['jobs']:
diff --git a/ci/matrix.yaml b/ci/matrix.yaml
index e3102f8487..e412929197 100644
--- a/ci/matrix.yaml
+++ b/ci/matrix.yaml
@@ -13,11 +13,17 @@ workflows:
     # Old CTK
     - {jobs: ['build'], std: 'all', ctk: '11.1', cxx: ['gcc6', 'gcc7', 'gcc8', 'gcc9', 'clang9', 'msvc2017']}
     - {jobs: ['build'], std: 'all', ctk: '11.8', cxx: ['gcc11'], sm: '60;70;80;90'}
-    # Current CTK
+    # Current CTK build-only
     - {jobs: ['build'], std: 'all', cxx: ['gcc7', 'gcc8', 'gcc9', 'gcc10', 'gcc11', 'gcc12']}
     - {jobs: ['build'], std: 'all', cxx: ['clang9', 'clang10', 'clang11', 'clang12', 'clang13', 'clang14', 'clang15', 'clang16', 'clang17']}
     - {jobs: ['build'], std: 'all', cxx: ['intel', 'msvc2019']}
-    - {jobs: ['test'], std: 'all', cxx: ['gcc13', 'clang18', 'msvc2022']}
+    # Current CTK testing:
+    - {jobs: ['test'], project: ['libcudacxx', 'thrust'], std: 'minmax', cxx: ['gcc']}
+    - {jobs: ['test'], project: ['libcudacxx', 'thrust'], std: 'max', cxx: ['clang', 'msvc']}
+    # Split up cub tests:
+    - {jobs: ['test_nolid', 'test_lid0'], project: ['cub'], std: 'minmax', cxx: ['gcc']}
+    - {jobs: ['test_lid1', 'test_lid2'], project: ['cub'], std: 'max', cxx: ['gcc']}
+    - {jobs: ['test_nolid', 'test_lid0'], project: ['cub'], std: 'max', cxx: ['clang', 'msvc']}
     # Modded builds:
     - {jobs: ['build'], std: 'all', cxx: ['gcc', 'clang'], cpu: 'arm64'}
     - {jobs: ['build'], std: 'all', cxx: ['gcc'], sm: '90a'}
@@ -36,33 +42,41 @@ workflows:
     - {jobs: ['build'], project: 'cudax', ctk: ['12.0' ], std: 17, cxx: ['gcc12'], sm: "90"}
    - {jobs: ['build'], project: 'cudax', ctk: [ 'curr'], std: 17, cxx: ['gcc13'], sm: "90a"}
     - {jobs: ['build'], project: 'cudax', ctk: [ 'curr'], std: 'all', cxx: ['gcc13', 'clang16'], cpu: 'arm64'}
-    - {jobs: ['test'], project: 'cudax', ctk: ['12.0', 'curr'], std: 'all', cxx: ['gcc12']}
-    - {jobs: ['test'], project: 'cudax', ctk: ['12.0' ], std: 'all', cxx: ['clang14']}
-    - {jobs: ['test'], project: 'cudax', ctk: [ 'curr'], std: 'all', cxx: ['clang18']}
+    - {jobs: ['test'], project: 'cudax', ctk: ['12.0' ], std: 'min', cxx: ['gcc12']}
+    - {jobs: ['test'], project: 'cudax', ctk: [ 'curr'], std: 'all', cxx: ['gcc12']}
+    - {jobs: ['test'], project: 'cudax', ctk: ['12.0' ], std: 'max', cxx: ['clang14']}
+    - {jobs: ['test'], project: 'cudax', ctk: [ 'curr'], std: 'max', cxx: ['clang18']}
     # Python jobs:
     - {jobs: ['test'], project: 'pycuda', ctk: ['12.5']}
     # cccl-infra:
     - {jobs: ['infra'], project: 'cccl', ctk: '11.1', cxx: ['gcc6', 'clang9']}
     - {jobs: ['infra'], project: 'cccl', ctk: 'curr', cxx: ['gcc', 'clang']}
-    # Edge-case jobs
-    - {jobs: ['limited'], project: 'cub', std: 17}
 
   nightly:
-    - {jobs: ['test'], ctk: '11.1', gpu: 'v100', sm: 'gpu', cxx: 'gcc6', std: [11]}
-    - {jobs: ['test'], ctk: '11.1', gpu: 't4', sm: 'gpu', cxx: 'clang9', std: [17]}
-    - {jobs: ['test'], ctk: '11.8', gpu: 'rtx2080', sm: 'gpu', cxx: 'gcc11', std: [17]}
-    - {jobs: ['test'], ctk: 'curr', gpu: 'rtxa6000', sm: 'gpu', cxx: 'gcc7', std: [14]}
-    - {jobs: ['test'], ctk: 'curr', gpu: 'l4', sm: 'gpu', cxx: 'gcc13', std: 'all'}
-    - {jobs: ['test'], ctk: 'curr', gpu: 'rtx4090', sm: 'gpu', cxx: 'clang9', std: [11]}
-    # H100 runners are currently flakey, only build since those use CPU-only runners:
-    - {jobs: ['build'], ctk: 'curr', gpu: 'h100', sm: 'gpu', cxx: 'gcc12', std: [11, 20]}
-    - {jobs: ['build'], ctk: 'curr', gpu: 'h100', sm: 'gpu', cxx: 'clang18', std: [17]}
+    # Increased test coverage compared to nightlies:
+    - {jobs: ['test'], std: 'all', cxx: ['gcc13', 'clang18', 'msvc2022']}
+    - {jobs: ['test'], project: 'cudax', ctk: ['12.0', 'curr'], std: 'all', cxx: ['gcc12']}
+    - {jobs: ['test'], project: 'cudax', ctk: ['12.0' ], std: 'all', cxx: ['clang14']}
+    - {jobs: ['test'], project: 'cudax', ctk: [ 'curr'], std: 'all', cxx: ['clang18']}
+    # Edge-case jobs
+    - {jobs: ['limited'], project: 'cub', std: 17}
 
-    # nvrtc:
-    - {jobs: ['nvrtc'], ctk: 'curr', gpu: 't4', sm: 'gpu', cxx: 'gcc13', std: [20], project: ['libcudacxx']}
-    - {jobs: ['nvrtc'], ctk: 'curr', gpu: 'rtxa6000', sm: 'gpu', cxx: 'gcc13', std: [20], project: ['libcudacxx']}
-    - {jobs: ['nvrtc'], ctk: 'curr', gpu: 'l4', sm: 'gpu', cxx: 'gcc13', std: 'all', project: ['libcudacxx']}
-    - {jobs: ['nvrtc'], ctk: 'curr', gpu: 'h100', sm: 'gpu', cxx: 'gcc13', std: [11, 20], project: ['libcudacxx']}
+#   # These are waiting on the NVKS nodes:
+#   - {jobs: ['test'], ctk: '11.1', gpu: 'v100', sm: 'gpu', cxx: 'gcc6', std: [11]}
+#   - {jobs: ['test'], ctk: '11.1', gpu: 't4', sm: 'gpu', cxx: 'clang9', std: [17]}
+#   - {jobs: ['test'], ctk: '11.8', gpu: 'rtx2080', sm: 'gpu', cxx: 'gcc11', std: [17]}
+#   - {jobs: ['test'], ctk: 'curr', gpu: 'rtxa6000', sm: 'gpu', cxx: 'gcc7', std: [14]}
+#   - {jobs: ['test'], ctk: 'curr', gpu: 'l4', sm: 'gpu', cxx: 'gcc13', std: 'all'}
+#   - {jobs: ['test'], ctk: 'curr', gpu: 'rtx4090', sm: 'gpu', cxx: 'clang9', std: [11]}
+#   # H100 runners are currently flakey, only build since those use CPU-only runners:
+#   - {jobs: ['build'], ctk: 'curr', gpu: 'h100', sm: 'gpu', cxx: 'gcc12', std: [11, 20]}
+#   - {jobs: ['build'], ctk: 'curr', gpu: 'h100', sm: 'gpu', cxx: 'clang18', std: [17]}
+#
+#   # nvrtc:
+#   - {jobs: ['nvrtc'], ctk: 'curr', gpu: 't4', sm: 'gpu', cxx: 'gcc13', std: [20], project: ['libcudacxx']}
+#   - {jobs: ['nvrtc'], ctk: 'curr', gpu: 'rtxa6000', sm: 'gpu', cxx: 'gcc13', std: [20], project: ['libcudacxx']}
+#   - {jobs: ['nvrtc'], ctk: 'curr', gpu: 'l4', sm: 'gpu', cxx: 'gcc13', std: 'all', project: ['libcudacxx']}
+#   - {jobs: ['nvrtc'], ctk: 'curr', gpu: 'h100', sm: 'gpu', cxx: 'gcc13', std: [11, 20], project: ['libcudacxx']}
 
 # Any generated jobs that match the entries in `exclude` will be removed from the final matrix for all workflows.
 exclude:
@@ -256,6 +270,7 @@ tags:
   project: { default: ['libcudacxx', 'cub', 'thrust'] }
   # C++ standard
   # If set to 'all', all stds supported by the ctk/compilers/project are used.
+  # If set to 'min', 'max', or 'minmax', the minimum, maximum, or both stds are used.
   # If set, will be passed to script with `-std `.
   std: { required: false }
   # GPU architecture
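
The `set_derived_tags()` hunk above is where the new 'min', 'max', and 'minmax' values for the `std` tag are resolved. The following Python sketch illustrates that resolution logic in isolation; the stand-alone `resolve_std()` helper and the hard-coded `supported` list are illustrative assumptions and are not part of the patch or of build-workflow.py itself.

# Minimal sketch of how a matrix 'std' tag resolves against the stds a job supports,
# mirroring the behavior added to set_derived_tags() in the diff above.
def resolve_std(requested, supported):
    """Resolve a 'std' tag ('all', 'min', 'max', 'minmax', or explicit value)."""
    supported = sorted(supported)
    if not supported:
        # The patch raises when the ctk/compiler/project intersection is empty.
        raise Exception("No supported stds found.")
    if requested == 'all':
        return supported
    if requested == 'min':
        return min(supported)
    if requested == 'max':
        return max(supported)
    if requested == 'minmax':
        # Collapse to a single value when only one std is supported.
        return supported[0] if len(supported) == 1 else [min(supported), max(supported)]
    return requested  # explicit value(s) pass through unchanged


if __name__ == "__main__":
    supported = [11, 14, 17, 20]  # e.g. what lookup_supported_stds() might return
    assert resolve_std('all', supported) == [11, 14, 17, 20]
    assert resolve_std('min', supported) == 11
    assert resolve_std('max', supported) == 20
    assert resolve_std('minmax', supported) == [11, 20]
    assert resolve_std('minmax', [17]) == 17

Resolving 'minmax' to only the two ends of the supported range is what lets the reduced PR matrix keep coverage at both extremes of the C++ standard range while cutting the total job count, while the nightly matrix keeps `std: 'all'` for the exhaustive sweep.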