[BUG]: BlockLoad/WarpLoad never attempts to vectorize even using vectorized algorithm. #431

iclementine opened this issue Sep 12, 2023 · 3 comments

Comments


iclementine commented Sep 12, 2023

Type of Bug

Something else

Component

CUB

Describe the bug

BlockLoad/WarpLoad never attempts to vectorize, even when using the BLOCK_LOAD_VECTORIZE/WARP_LOAD_VECTORIZE algorithms.

The implementation dispatches through

InternalLoad(temp_storage, linear_tid).Load(block_itr, items);

and the specialization of LoadInternal for BLOCK_LOAD_VECTORIZE has a bug: both pointer overloads of Load declare a template parameter InputIteratorT that cannot be deduced from the call arguments. Template argument deduction therefore fails for them, they drop out of the overload set, and the call always resolves to the generic overload, which skips vectorization.

https://github.com/NVIDIA/cccl/blob/3ef9dd9642da2d4e0b3ff77e445e73d7aabd4687/cub/cub/block/block_load.cuh#L793C1-L810C1

        /// Load a linear segment of items from memory, specialized for native pointer types (attempts vectorization)
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            InputT               *block_ptr,                     ///< [in] The thread block's base input iterator for loading from
            InputT               (&items)[ITEMS_PER_THREAD])     ///< [out] Data to load
        {
            InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, items);
        }


        /// Load a linear segment of items from memory, specialized for native pointer types (attempts vectorization)
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            const InputT         *block_ptr,                     ///< [in] The thread block's base input iterator for loading from
            InputT               (&items)[ITEMS_PER_THREAD])     ///< [out] Data to load
        {
            InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, items);
        }

WarpLoad has similar issues.

https://github.com/NVIDIA/cccl/blob/3ef9dd9642da2d4e0b3ff77e445e73d7aabd4687/cub/cub/warp/warp_load.cuh#L320C1-L330C6

    template <typename InputIteratorT>
    __device__ __forceinline__ void Load(InputT *block_ptr, InputT (&items)[ITEMS_PER_THREAD])
    {
      InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, items);
    }


    template <typename InputIteratorT>
    __device__ __forceinline__ void Load(const InputT *block_ptr, InputT (&items)[ITEMS_PER_THREAD])
    {
      InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, items);
    }

I tried removing that template parameter, which enables vectorized loads. But I found another issue: some functions such as ThreadLoad (which is used by BlockLoad) do not accept const T* as an input parameter, which prevents passing a const T* to BlockLoad::Load in some cases. The non-constness propagates all the way up to user code. I hope that will be fixed too.

https://github.com/NVIDIA/cccl/blob/3ef9dd9642da2d4e0b3ff77e445e73d7aabd4687/cub/cub/block/block_load.cuh#L185C1-L190C6

    #pragma unroll
    for (int ITEM = 0; ITEM < VECTORS_PER_THREAD; ITEM++)
    {
        vec_items[ITEM] = ThreadLoad<MODIFIER>(vec_ptr + ITEM);
    }

https://github.com/NVIDIA/cccl/blob/3ef9dd9642da2d4e0b3ff77e445e73d7aabd4687/cub/cub/thread/thread_load.cuh#L301C1-L311C2

https://github.com/NVIDIA/cccl/blob/3ef9dd9642da2d4e0b3ff77e445e73d7aabd4687/cub/cub/thread/thread_load.cuh#L348C1-L360C1

template <typename T>
__device__ __forceinline__ T ThreadLoad(
    T                       *ptr,
    Int2Type<LOAD_DEFAULT>  /*modifier*/,
    Int2Type<true>          /*is_pointer*/)
{
    return *ptr;
}
/**
 * ThreadLoad definition for LOAD_VOLATILE modifier on pointer types
 */
template <typename T>
__device__ __forceinline__ T ThreadLoad(
    T                       *ptr,
    Int2Type<LOAD_VOLATILE> /*modifier*/,
    Int2Type<true>          /*is_pointer*/)
{
    // Apply tags for partial-specialization
    return ThreadLoadVolatilePointer(ptr, Int2Type<Traits<T>::PRIMITIVE>());
}

This issue is forwarded from NVIDIA/cub#739.

How to Reproduce

  1. Write a kernel that loads a block of data using the vectorized algorithm.
template <typename InputT, int BLOCK_SIZE>
__global__ void load_and_verify(const InputT *input_ptr, int numel) {
  using BlockLoadT =
      cub::BlockLoad<InputT, BLOCK_SIZE, 4, cub::BLOCK_LOAD_VECTORIZE>;
  BlockLoadT block_load;
  InputT thread_local_elems[4];
  int block_offset =
      blockIdx.x * (4 * BLOCK_SIZE); // offset of mem for this block to load
  block_load.Load(input_ptr + block_offset, thread_local_elems);
  cub::CTA_SYNC();
}
  2. Allocate a thrust device vector and pass its data pointer to that kernel. Since the address of the data is aligned and the size is a power of two, we can expect a vectorized load.
int main() {
  int numel = 1 << 10;
  std::vector<float> input(numel);
  std::iota(input.begin(), input.end(), 0.0);

  thrust::device_vector<float> ginput(input);

  constexpr int BLOCK_SIZE = 64;
  constexpr int ITEMS_PER_BLOCK = 4;
  int grid_size = (numel + BLOCK_SIZE * ITEMS_PER_BLOCK - 1) /
                  (BLOCK_SIZE * ITEMS_PER_BLOCK);

  load_and_verify<float, BLOCK_SIZE>
      <<<grid_size, BLOCK_SIZE>>>(
        reinterpret_cast<float*>(thrust::raw_pointer_cast(ginput.data())),
                                  numel);
  cudaDeviceSynchronize();
  return 0;
}

But with cuda-gdb we can see that a vectorized load is not used; it falls back to a direct load.

Expected behavior

A vectorized load should be used. More specifically, the code below should be invoked.

https://github.com/NVIDIA/cccl/blob/3ef9dd9642da2d4e0b3ff77e445e73d7aabd4687/cub/cub/block/block_load.cuh#L793C1-L800C10

Reproduction link

No response

Operating System

Ubuntu Linux 23.04

nvidia-smi output

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:31:00.0 Off |                  N/A |
| 30%   31C    P8              30W / 350W |      3MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

NVCC version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
@iclementine iclementine added the bug Something isn't working right. label Sep 12, 2023
@github-actions github-actions bot added the needs triage Issues that require the team's attention label Sep 12, 2023
github-actions bot commented:

Hi @iclementine!

Thanks for submitting this issue - the CCCL team has been notified and we'll get back to you as soon as we can!
In the mean time, feel free to add any relevant information to this issue.

gevtushenko (Collaborator) commented:

Copying some important information here:

I can verify that vectorization doesn't happen due to the extra template parameter. Removing the extra template parameters leads to vectorized loads, but I don't think we can fix the issue right now, because the documented contract for the vectorized load is:

    /* - The following conditions will prevent vectorization and loading will fall
     *   back to cub::BLOCK_LOAD_DIRECT:
     *   - The block input offset is not quadword-aligned
     */

There are no checks of the input alignment. When given a pointer that's not aligned to the vectorized type, block / warp load leads to invalid loads. I believe this has to be addressed before enabling vectorization.

I suggest we file another issue to check input pointer alignment and fall back to direct load, and mark this issue as blocked. Since adding extra checks might affect performance, I suggest we also consider providing a cuda::aligned_size_t overload.

@gevtushenko gevtushenko added blocked This PR cannot be merged due to various reasons and removed needs triage Issues that require the team's attention labels Sep 12, 2023
victoroliv2 commented:

Hi folks, it seems that BLOCK_LOAD_VECTORIZE is not working due to this bug. I tried it with Halfx8 and don't see STG.E.128.SYS instructions being used.

If the issue is how to guarantee alignment, perhaps the documentation should be updated to say that using an unaligned pointer in BlockLoad::Load() is undefined behavior? That would be equivalent to the behavior of loads/stores using vectorized types.
