question about what guarantees the correctness of the c11_atomics atomic_flag test #2052

Open
karolherbst opened this issue Aug 16, 2024 · 2 comments

Comments

@karolherbst
Contributor

Below is the kernel executed in the first atomic_flag test, and I'm wondering what in the OpenCL spec guarantees that it executes as the CTS expects.

More specifically, what prevents a thread from racing with a thread from a different subgroup on destMemory[cnt]? One thread could enter the iteration for e.g. cnt == 5 while another thread has just reached atomic_flag_clear_explicit(&destMemory[cnt], ...) for the same cnt, which means that two threads will enter the critical section; the second one just won't do anything.
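To make the interleaving concrete, here is the per-slot pattern in question, annotated with the two-visitor timeline I mean (a simplified excerpt of the kernel below, not new code):

// Threads A and B may both reach this iteration for the same cnt:
//   1. A: test_and_set returns false -> A enters the "critical section"
//   2. A: writes oldValues[csIndex], then clears the flag again
//   3. B: test_and_set now also returns false -> B enters as well,
//      but sees oldValues[csIndex] != 1000000000 and does nothing
bool set = atomic_flag_test_and_set_explicit(&destMemory[cnt],
                                             memory_order_relaxed,
                                             memory_scope_work_group);
if (!set)
{
  // ... first-visitor check and write to oldValues[csIndex] ...
  atomic_flag_clear_explicit(&destMemory[cnt], memory_order_relaxed,
                             memory_scope_work_group);
}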

I'm seeing this behavior with rusticl on zink on radv, and at the moment it's not clear to me whether that's my bug or the CTS's. Removing the atomic_flag_clear_explicit(&destMemory[cnt], ...) call makes only one thread execute the "critical section" for each value of cnt, but then the test fails later.

So what actually guarantees the correctness of this test here, or would the test need to be rewritten? I'm also mildly wondering what clvk did in order to pass this test, or whether it was never a problem there in the first place. I can probably come up with a fix to guarantee that behavior; I'm just wondering whether there is some undefined behavior at play here.

Program source:
__kernel void test_atomic_kernel(uint threadCount, uint numDestItems, __global int *finalDest, __global int *oldValues, volatile __local atomic_flag *destMemory)
{
  uint tid = get_global_id(0);

  // initialize atomics not reachable from host (first thread
  // is doing this, other threads are waiting on barrier)
  if (get_local_id(0) == 0)
    for (uint dstItemIdx = 0; dstItemIdx < numDestItems; dstItemIdx++)
    {
      if (finalDest[dstItemIdx])
        atomic_flag_test_and_set_explicit(destMemory + dstItemIdx,
                                          memory_order_relaxed,
                                          memory_scope_work_group);
      else
        atomic_flag_clear_explicit(destMemory + dstItemIdx,
                                   memory_order_relaxed,
                                   memory_scope_work_group);
    }
  barrier(CLK_LOCAL_MEM_FENCE);

  uint cnt, stop = 0;
  for (cnt = 0; !stop && cnt < threadCount; cnt++) // each thread must find critical section where it is the first visitor
  {
    bool set = atomic_flag_test_and_set_explicit(&destMemory[cnt], memory_order_relaxed, memory_scope_work_group);
    atomic_work_item_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE, memory_order_acquire, memory_scope_work_group);
    if (!set)
    {
      uint csIndex = get_enqueued_local_size(0) * get_group_id(0) + cnt;
      // verify that thread is the first visitor
      if (oldValues[csIndex] == 1000000000)
      {
        oldValues[csIndex] = tid; // set the winner id for this critical section
        stop = 1;
      }
      atomic_work_item_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE, memory_order_release, memory_scope_work_group);
      atomic_flag_clear_explicit(&destMemory[cnt], memory_order_relaxed, memory_scope_work_group);
    }
  }

  // Copy final values to host reachable buffer
  barrier(CLK_LOCAL_MEM_FENCE);
  if (get_local_id(0) == 0) // first thread in workgroup
    for (uint dstItemIdx = 0; dstItemIdx < numDestItems; dstItemIdx++)
      finalDest[dstItemIdx] =
          atomic_flag_test_and_set_explicit(destMemory + dstItemIdx,
                                            memory_order_relaxed,
                                            memory_scope_work_group);
}
@karolherbst
Contributor Author

I dug a bit deeper, and it seems I have threads racing on the if (oldValues[csIndex] == 1000000000) check. Should that be an atomic operation instead?
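For illustration, a hedged sketch of what an atomic version of that check could look like (hypothetical, not the CTS code; it assumes oldValues were declared as volatile __global atomic_int * instead of __global int *):

// Claim the critical section with a single CAS: only one thread can ever
// observe the 1000000000 sentinel and install its tid.
int expected = 1000000000; // sentinel meaning "no winner yet"
if (atomic_compare_exchange_strong_explicit(&oldValues[csIndex], &expected,
                                            (int)tid,
                                            memory_order_relaxed,
                                            memory_order_relaxed,
                                            memory_scope_device)) // device scope chosen conservatively for this sketch
  stop = 1;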

@karolherbst
Contributor Author

Mhh, no, that would mean that something with atomic_flag_test_and_set_explicit is going wrong, which is kind of weird. Maybe there is a coherency issue somewhere. Anyway, it's probably not a CTS bug then; I initially thought something else was going on.
