[FEA]: Ampere mbarrier support for barriers with non-default completion function #419

Open
1 task done
gonzalobg opened this issue Sep 8, 2023 · 6 comments
Labels
feature request New feature or request.

Comments

@gonzalobg
Collaborator

Is this a duplicate?

Area

libcu++

Is your feature request related to a problem? Please describe.

We are not using mbarriers for cuda::barrier<thread_scope_block, userdefined>, that is, block-scope barriers with a user-defined completion function. We should use mbarrier for those, since on Hopper the user-defined completion function is required to automatically perform an expect_tx operation during the phase completion step.

Describe the solution you'd like

Use mbarrier for cuda::barrier<thread_scope_block, userdefined> as well, for the reason given in the problem description above.

Describe alternatives you've considered

No response

Additional context

No response

gonzalobg added the "feature request" label Sep 8, 2023
@griwes
Collaborator

griwes commented Sep 8, 2023

I don't know how to implement what you are asking for without a bunch of additional state and additional atomics + branches on, well, somewhat hot paths. If a barrier is completed by an async thread, who calls the completion function? Do you want every call to arrive to do a cmpxchg on some additional state that tracks the phase, and branch on the result of that to invoke this function?

I also don't understand where this need comes from; a user will be able to do a barrier_arrive_tx with the count that will perform the expect_tx operation. Why is there a need to do this automatically on phase flip? Can you provide an example of code that would use the completion function this way?

@gonzalobg
Collaborator Author

> If a barrier is completed by an async thread, who calls the completion function?

An async thread that completes a cuda::barrier does not run the phase completion step; that's done by a thread that waits. If no thread ever waits, the phase completion step doesn't need to run.

There are a bunch of ways to implement this:

  • Artificially modify the arrival count so that no arrival completes the mbarrier. Waiters poll for either the minimum value or the mbarrier changing phase. Waiters that observe the minimum value attempt to win a CAS; one wins, runs the phase completion step, and arrives again, unblocking the other waiters.
  • mbarrier pair: waiters wait for a pair of mbarriers to flip. The second mbarrier has an expected count of 1. Only one of the threads unblocked from the first mbarrier wins a CAS, executes the completion function, and arrives at the second mbarrier. Every thread that waits on the first mbarrier and observes that it has changed phase tests the second mbarrier before trying to win the CAS.
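
A minimal host-side sketch of the mbarrier-pair idea, assuming plain std::atomic counters as stand-ins for the two mbarriers. PairedBarrier, count_completions, and every member name are hypothetical; the sketch models only thread arrivals (no async arrivals) and is not the libcu++ implementation:

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical model: first_/pending_ play the role of the first mbarrier,
// second_ plays the role of the second mbarrier (expected count of 1).
class PairedBarrier {
public:
    PairedBarrier(int expected, std::function<void()> completion)
        : expected_(expected), pending_(expected), completion_(std::move(completion)) {
        assert(expected > 0);
    }

    void arrive_and_wait() {
        const unsigned phase = first_.load();

        // First barrier: the last arrival re-arms the count and flips the phase.
        if (pending_.fetch_sub(1) == 1) {
            pending_.store(expected_);
            first_.store(phase + 1);
        }
        while (first_.load() == phase) std::this_thread::yield();

        // Test the second barrier before trying to win the CAS.
        if (second_.load() != phase) return;

        unsigned unclaimed = phase;
        if (owner_.compare_exchange_strong(unclaimed, phase + 1)) {
            completion_();             // exactly one waiter runs the completion step
            second_.store(phase + 1);  // the single arrival at the second barrier
        } else {
            while (second_.load() == phase) std::this_thread::yield();
        }
    }

private:
    const int expected_;
    std::atomic<int> pending_;
    std::atomic<unsigned> first_{0};   // phase of the first barrier
    std::atomic<unsigned> owner_{0};   // CAS target used to elect one waiter per phase
    std::atomic<unsigned> second_{0};  // phase of the second barrier
    std::function<void()> completion_;
};

// Runs `threads` workers through `phases` phases and returns how many times
// the completion step executed.
int count_completions(int threads, int phases) {
    int completions = 0;  // only ever written by the elected waiter
    PairedBarrier bar(threads, [&] { ++completions; });
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int p = 0; p < phases; ++p) bar.arrive_and_wait();
        });
    }
    for (auto& th : pool) th.join();
    return completions;
}
```

The extra CAS sits on the waiting side only, after the first barrier has already flipped, so arrivals themselves stay a single atomic decrement in this model. The sketch uses the default sequentially consistent ordering throughout for simplicity.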

> I also don't understand where this need comes from;

If the question is, what are completion functions useful for, we have a tutorial showing how to use them to perform a reduction in the programming guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#completion-function

@griwes
Collaborator

griwes commented Sep 8, 2023

> If the question is, what are completion functions useful for

No, I mean specifically the case you are describing with expect_tx.

@gonzalobg
Collaborator Author

I'm not describing anything about expect_tx in this issue. I think you are looking for: #420

@griwes
Collaborator

griwes commented Sep 8, 2023

You used expect_tx as a motivating example here, but I guess that #420 is that motivating example.

@gonzalobg
Collaborator Author

Yes, that example is common enough that it deserves a built-in solution; this issue is just for general-purpose support for completion functions.
