[BUG]: BlockLoad/WarpLoad never attempts to vectorize even using vectorized algorithm. #431

iclementine opened this issue Sep 12, 2023 · 3 comments

Comments


iclementine commented Sep 12, 2023

Type of Bug

Something else

Component

CUB

Describe the bug

BlockLoad/WarpLoad never attempts to vectorize, even when using the BLOCK_LOAD_VECTORIZE/WARP_LOAD_VECTORIZE algorithms.

The implementation dispatches through

InternalLoad(temp_storage, linear_tid).Load(block_itr, items);

and the specialization of LoadInternal for BLOCK_LOAD_VECTORIZE has a bug: both pointer overloads of Load declare a template parameter InputIteratorT that cannot be deduced from the call arguments. Template argument deduction therefore fails for them, they drop out of the overload set, and the call always resolves to the generic overload, which skips vectorization.

https://github.com/NVIDIA/cccl/blob/3ef9dd9642da2d4e0b3ff77e445e73d7aabd4687/cub/cub/block/block_load.cuh#L793C1-L810C1

        /// Load a linear segment of items from memory, specialized for native pointer types (attempts vectorization)
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            InputT               *block_ptr,                     ///< [in] The thread block's base input iterator for loading from
            InputT               (&items)[ITEMS_PER_THREAD])     ///< [out] Data to load
        {
            InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, items);
        }


        /// Load a linear segment of items from memory, specialized for native pointer types (attempts vectorization)
        template <typename InputIteratorT>
        __device__ __forceinline__ void Load(
            const InputT         *block_ptr,                     ///< [in] The thread block's base input iterator for loading from
            InputT               (&items)[ITEMS_PER_THREAD])     ///< [out] Data to load
        {
            InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, items);
        }

WarpLoad has similar issues.

https://github.com/NVIDIA/cccl/blob/3ef9dd9642da2d4e0b3ff77e445e73d7aabd4687/cub/cub/warp/warp_load.cuh#L320C1-L330C6

    template <typename InputIteratorT>
    __device__ __forceinline__ void Load(InputT *block_ptr, InputT (&items)[ITEMS_PER_THREAD])
    {
      InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, items);
    }


    template <typename InputIteratorT>
    __device__ __forceinline__ void Load(const InputT *block_ptr, InputT (&items)[ITEMS_PER_THREAD])
    {
      InternalLoadDirectBlockedVectorized<LOAD_DEFAULT>(linear_tid, block_ptr, items);
    }

I tried removing that template parameter, which enables vectorized loads. But I found another issue: some functions such as ThreadLoad (which is used by BlockLoad) do not accept const T* as an input parameter, which prevents passing a const T* to BlockLoad::Load in some cases. The non-constness propagates all the way up to user code. I hope that will be fixed too.

https://github.com/NVIDIA/cccl/blob/3ef9dd9642da2d4e0b3ff77e445e73d7aabd4687/cub/cub/block/block_load.cuh#L185C1-L190C6

    #pragma unroll
    for (int ITEM = 0; ITEM < VECTORS_PER_THREAD; ITEM++)
    {
        vec_items[ITEM] = ThreadLoad<MODIFIER>(vec_ptr + ITEM);
    }

https://github.com/NVIDIA/cccl/blob/3ef9dd9642da2d4e0b3ff77e445e73d7aabd4687/cub/cub/thread/thread_load.cuh#L301C1-L311C2

https://github.com/NVIDIA/cccl/blob/3ef9dd9642da2d4e0b3ff77e445e73d7aabd4687/cub/cub/thread/thread_load.cuh#L348C1-L360C1

template <typename T>
__device__ __forceinline__ T ThreadLoad(
    T                       *ptr,
    Int2Type<LOAD_DEFAULT>  /*modifier*/,
    Int2Type<true>          /*is_pointer*/)
{
    return *ptr;
}
/**
 * ThreadLoad definition for LOAD_VOLATILE modifier on pointer types
 */
template <typename T>
__device__ __forceinline__ T ThreadLoad(
    T                       *ptr,
    Int2Type<LOAD_VOLATILE> /*modifier*/,
    Int2Type<true>          /*is_pointer*/)
{
    // Apply tags for partial-specialization
    return ThreadLoadVolatilePointer(ptr, Int2Type<Traits<T>::PRIMITIVE>());
}

This issue is forwarded from NVIDIA/cub#739.

How to Reproduce

  1. Write a kernel that loads a block of data using the vectorized algorithm.
template <typename InputT, int BLOCK_SIZE>
__global__ void load_and_verify(const InputT *input_ptr, int numel) {
  using BlockLoadT =
      cub::BlockLoad<InputT, BLOCK_SIZE, 4, cub::BLOCK_LOAD_VECTORIZE>;
  BlockLoadT block_load;
  InputT thread_local_elems[4];
  int block_offset =
      blockIdx.x * (4 * BLOCK_SIZE); // offset of mem for this block to load
  block_load.Load(input_ptr + block_offset, thread_local_elems);
  cub::CTA_SYNC();
}
  2. Allocate a thrust device vector and pass its data pointer to that kernel. Since the address of the data is aligned and the size is a power of two, we can expect a vectorized load.
int main() {
  int numel = 1 << 10;
  std::vector<float> input(numel);
  std::iota(input.begin(), input.end(), 0.0);

  thrust::device_vector<float> ginput(input);

  constexpr int BLOCK_SIZE = 64;
  constexpr int ITEMS_PER_BLOCK = 4;
  int grid_size = (numel + BLOCK_SIZE * ITEMS_PER_BLOCK - 1) /
                  (BLOCK_SIZE * ITEMS_PER_BLOCK);

  load_and_verify<float, BLOCK_SIZE>
      <<<grid_size, BLOCK_SIZE>>>(
        reinterpret_cast<float*>(thrust::raw_pointer_cast(ginput.data())),
                                  numel);
  cudaDeviceSynchronize();
  return 0;
}

But with cuda-gdb we can see that a vectorized load is not used; it falls back to a direct load.

Expected behavior

A vectorized load should be used. More specifically, the code below should be invoked.

https://github.com/NVIDIA/cccl/blob/3ef9dd9642da2d4e0b3ff77e445e73d7aabd4687/cub/cub/block/block_load.cuh#L793C1-L800C10

Reproduction link

No response

Operating System

Ubuntu Linux 23.04

nvidia-smi output

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:31:00.0 Off |                  N/A |
| 30%   31C    P8              30W / 350W |      3MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

NVCC version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
@iclementine iclementine added the bug Something isn't working right. label Sep 12, 2023
@github-actions github-actions bot added the needs triage Issues that require the team's attention label Sep 12, 2023
github-actions bot commented:

Hi @iclementine!

Thanks for submitting this issue - the CCCL team has been notified and we'll get back to you as soon as we can!
In the mean time, feel free to add any relevant information to this issue.

gevtushenko (Collaborator) commented:

Copying some important information here:

I can verify that vectorization doesn't happen due to the extra template parameter. Removing the extra template parameters leads to vectorized loads, but I don't think we can fix the issue right now, because the documented contract for the vectorized load is:

    /* - The following conditions will prevent vectorization and loading will fall
     *   back to cub::BLOCK_LOAD_DIRECT:
     *   - The block input offset is not quadword-aligned
     */

There are no checks of the input alignment. When given a pointer that's not aligned to the vectorized type, block / warp load leads to invalid loads. I believe this has to be addressed before enabling vectorization.

I suggest we file another issue to check input pointer alignment and fall back to direct load, and mark this issue as blocked. Since adding extra checks might affect performance, I suggest we also consider providing a cuda::aligned_size_t overload.

@gevtushenko gevtushenko added blocked This PR cannot be merged due to various reasons and removed needs triage Issues that require the team's attention labels Sep 12, 2023
victoroliv2 commented:

Hi folks, it seems that BLOCK_LOAD_VECTORIZE is not working due to this bug. I tried it with Halfx8 and don't see STG.E.128.SYS instructions being used.

If the issue is how to guarantee alignment, perhaps the documentation should be updated to say that using an unaligned pointer in BlockLoad::Load() is undefined behavior? That would be equivalent to the behavior of loads/stores using vectorized types.
