
Remove CUDART hijack #1730

Open
2 of 3 tasks
muraj opened this issue Jul 26, 2024 · 12 comments
Assignees: muraj
Labels: best effort (indicates the milestone tag for an issue is a goal rather than a commitment), Realm (issues pertaining to Realm)

Comments

@muraj

muraj commented Jul 26, 2024

This issue tracks progress on removing the need for the CUDART hijack and eventually removing it from the Realm codebase. Currently known use cases are (this list will be updated as new use cases come up):

muraj added the Realm and best effort labels on Jul 26, 2024
muraj self-assigned this on Jul 26, 2024
@elliottslaughter
Contributor

#1059 is a blocker on the Regent side.

@lightsighter
Contributor

Noting here that the only way Legion Prof accurately represents the time that CUDA kernels for a GPU task spend on the device today is by relying on Realm's hijack. If the hijack is disabled, Realm currently over-approximates the lifetime of the GPU kernels by assuming they are launched and start running on the GPU as soon as the GPU task starts, which is not always the case. Detecting when the first kernel is launched and enqueuing a CUDA event right before it is probably going to be challenging without the hijack.

@muraj
Author

muraj commented Jul 28, 2024

@lightsighter If I were to replace the current timestamp reporting for kernel launches with a CUPTI activity timestamp difference, would that cover the Legion profiler use case? I am thinking I can enable the following activity record and retrieve its completion field to get the information requested:
https://docs.nvidia.com/cupti/api/structCUpti__ActivityKernel9.html
I can then use the correlationId to correlate the record to the currently running task using a CUPTI callback:
https://docs.nvidia.com/cupti/api/structCUpti__CallbackData.html

These should not add any overhead beyond what is already in place with the Realm CUDART hijack, but it does require CUPTI to be installed, which requires a CUDA Toolkit to be installed locally on the system somewhere. CUPTI is ABI stable IIRC, so a dynamic loader can be built and we can dynamically detect its presence on the system (and toggle it via a command-line argument or the like if you'd like). Is that acceptable?
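
For concreteness, a rough sketch of what the CUPTI side of this could look like (hypothetical code, not taken from Realm; the activity struct version, the set of launch callbacks handled, and error handling all depend on the CUPTI release):

```cpp
// Hypothetical sketch: enable CUPTI concurrent-kernel activity records and
// correlate them back to the launching driver-API call via the callback API.
// Error checking and buffer lifetime management are elided.
#include <cupti.h>
#include <cstdio>
#include <cstdlib>

static void CUPTIAPI buffer_requested(uint8_t **buffer, size_t *size,
                                      size_t *max_num_records) {
  *size = 8 * 1024 * 1024;   // CUPTI fills this buffer with activity records
  *buffer = (uint8_t *)malloc(*size);
  *max_num_records = 0;      // 0 = fit as many records as possible
}

static void CUPTIAPI buffer_completed(CUcontext ctx, uint32_t stream_id,
                                      uint8_t *buffer, size_t size,
                                      size_t valid_size) {
  CUpti_Activity *record = nullptr;
  while (cuptiActivityGetNextRecord(buffer, valid_size, &record) == CUPTI_SUCCESS) {
    if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
      // Struct version (Kernel9 here) depends on the installed CUPTI release.
      CUpti_ActivityKernel9 *k = (CUpti_ActivityKernel9 *)record;
      // k->start / k->end are device-side timestamps (ns); correlationId ties
      // this record to the launch seen in the callback below.
      printf("kernel corr=%u start=%llu end=%llu\n", k->correlationId,
             (unsigned long long)k->start, (unsigned long long)k->end);
    }
  }
  free(buffer);
}

static void CUPTIAPI api_callback(void *userdata, CUpti_CallbackDomain domain,
                                  CUpti_CallbackId cbid, const void *cbdata) {
  const CUpti_CallbackData *cb = (const CUpti_CallbackData *)cbdata;
  // Only one launch entry point is handled here; real code would cover the
  // other cuLaunch* variants too.
  if (domain == CUPTI_CB_DOMAIN_DRIVER_API &&
      cbid == CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel &&
      cb->callbackSite == CUPTI_API_ENTER) {
    // A real implementation would map cb->correlationId to the Realm task
    // currently running on this thread instead of printing it.
    printf("launch corr=%u\n", cb->correlationId);
  }
}

void enable_cupti_kernel_timing() {
  CUpti_SubscriberHandle subscriber;
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
  cuptiActivityRegisterCallbacks(buffer_requested, buffer_completed);
  cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)api_callback, nullptr);
  cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_DRIVER_API);
  // Later, after the work of interest: cuptiActivityFlushAll(0);
}
```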

@elliottslaughter I'm not sure I understand what the actual issue is here, would it be possible for you to summarize it in the issue? I've added myself to the issue and I can talk to @magnatelee about it next week for more clarity.

@lightsighter
Contributor

> These should not add any overhead beyond what is already in place with the Realm CUDART hijack, but it does require CUPTI to be installed, which requires a CUDA Toolkit to be installed locally on the system somewhere. CUPTI is ABI stable IIRC, so a dynamic loader can be built and we can dynamically detect its presence on the system (and toggle it via a command-line argument or the like if you'd like). Is that acceptable?

That seems reasonable to me, but I'll let other folks comment as well, as I'm not the only one who expects this to work. Note that we don't need to profile every kernel. We mainly want this profiling response from Realm to provide tight bounds on when the kernels for a GPU task are actually running on the device:
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/profiling.h?ref_type=heads#L127-147
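
For reference, a minimal usage sketch of the measurement being discussed, assuming hypothetical task IDs and processors; the request/response classes and the OperationTimelineGPU struct are the ones declared in realm/profiling.h linked above, though the exact accessor overloads may vary by Realm version:

```cpp
// Hypothetical usage sketch: request OperationTimelineGPU on a GPU task launch
// and read it back in the profiling response task.
#include <realm.h>

using namespace Realm;

enum { GPU_TASK_ID = 100, PROF_RESPONSE_TASK_ID = 101 };  // assumed IDs

void launch_with_gpu_timeline(Processor gpu_proc, Processor prof_proc) {
  ProfilingRequestSet prs;
  // Deliver a profiling response to PROF_RESPONSE_TASK_ID on prof_proc,
  // carrying the GPU timeline of the task being launched.
  prs.add_request(prof_proc, PROF_RESPONSE_TASK_ID)
     .add_measurement<ProfilingMeasurements::OperationTimelineGPU>();
  gpu_proc.spawn(GPU_TASK_ID, /*args=*/nullptr, /*arglen=*/0, prs);
}

void prof_response_task(const void *args, size_t arglen,
                        const void *userdata, size_t userlen, Processor p) {
  ProfilingResponse resp(args, arglen);
  if (resp.has_measurement<ProfilingMeasurements::OperationTimelineGPU>()) {
    ProfilingMeasurements::OperationTimelineGPU *timeline =
        resp.get_measurement<ProfilingMeasurements::OperationTimelineGPU>();
    // These are the bounds Legion Prof wants to be tight: ideally the start of
    // the first kernel and the end of the last kernel on the device.
    long long gpu_time_ns = timeline->end_time - timeline->start_time;
    (void)gpu_time_ns;
    delete timeline;
  }
}
```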

@muraj
Author

muraj commented Jul 28, 2024

@lightsighter Unless I'm mistaken, the hijack doesn't seem to come into play with this. When we add an OperationTimelineGPU to the operation, it enables this path:
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/cuda/cuda_module.cc#L1597

This just either adds a stream callback on the task stream, or records an event and schedules a background worker to poll that event and record the CPU time.
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/cuda/cuda_module.cc#L1214

No tight bounding is done as far as I can tell.
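
A simplified illustration of the mechanism described above (not the actual cuda_module.cc code), using the driver API: events recorded around the task body on the task stream, with a background poller recording the CPU time at which each completes. It also shows why the start bound is not tight when the stream is idle:

```cpp
// Sketch of the event-plus-background-worker path, with made-up helper names.
#include <cuda.h>
#include <chrono>
#include <thread>

static long long now_ns() {
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
             std::chrono::steady_clock::now().time_since_epoch()).count();
}

struct GpuTaskTimeline { long long start_ns = 0, end_ns = 0; };

void run_gpu_task_with_timeline(CUstream task_stream,
                                void (*task_body)(CUstream),
                                GpuTaskTimeline *out) {
  CUevent start_ev, end_ev;
  cuEventCreate(&start_ev, CU_EVENT_DEFAULT);
  cuEventCreate(&end_ev, CU_EVENT_DEFAULT);

  // Recorded before the task body runs: if the stream is idle, this event
  // completes immediately, so it marks "task start", not "first kernel start".
  cuEventRecord(start_ev, task_stream);
  task_body(task_stream);              // the task launches its kernels here
  cuEventRecord(end_ev, task_stream);  // completes after the last kernel

  // Stand-in for Realm's background worker: poll each event and record the
  // CPU timestamp at which it is observed to have completed.
  std::thread poller([=]() {
    while (cuEventQuery(start_ev) == CUDA_ERROR_NOT_READY)
      std::this_thread::yield();
    out->start_ns = now_ns();
    while (cuEventQuery(end_ev) == CUDA_ERROR_NOT_READY)
      std::this_thread::yield();
    out->end_ns = now_ns();
    cuEventDestroy(start_ev);
    cuEventDestroy(end_ev);
  });
  poller.join();
}
```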

@eddy16112
Contributor

If I remember correctly, we do not record tight bounds for each kernel; we only record the start of the first kernel and the end of the last kernel.

@muraj
Author

muraj commented Jul 28, 2024

@eddy16112 I understand that, but where is this done? If it's done with GPUWorkStart, then it's not tightly bounded at all.

@eddy16112
Contributor

We record one event before and one event after the task body, and use the first event as the start and the second as the end.

@muraj
Author

muraj commented Jul 28, 2024

Yes, but those events are queued before the task is run. If the stream is idle at the start of the task body (which it must be IIUC), then the event is recorded immediately, not when the first kernel starts. This logic also has nothing to do with the hijack.

@lightsighter
Contributor

Right, I think the CUDA runtime hijack used to do that by enqueuing the event only upon the first kernel launch and not at the start of the task, but that code seems to have been lost. It could have happened any time in the last 9 years, since I wrote the first version of the hijack and then stopped working on it myself.

I'll note that we should probably fix this regardless of whether it is for getting rid of the hijack or not. I know for certain that @jjwilke could use it right now for running code without the hijack.
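
For clarity, a hypothetical sketch of the "enqueue the event only upon the first kernel launch" behavior described above, written as an LD_PRELOAD-style interposer rather than the actual hijack implementation; the thread-local task state and event creation are assumptions:

```cpp
// Hypothetical interposer on cudaLaunchKernel (the real hijack links its own
// cudart entry points instead): record the task's start event lazily, right
// before the first kernel launch it sees.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for RTLD_NEXT
#endif
#include <cuda_runtime_api.h>
#include <dlfcn.h>

struct TaskGpuState {
  cudaEvent_t start_event = nullptr;  // created by the runtime for each task
  bool start_recorded = false;
};
static thread_local TaskGpuState *current_task_state = nullptr;

extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 grid_dim,
                                        dim3 block_dim, void **args,
                                        size_t shared_mem, cudaStream_t stream) {
  // Resolve the real runtime entry point once.
  using launch_fn_t = cudaError_t (*)(const void *, dim3, dim3, void **,
                                      size_t, cudaStream_t);
  static launch_fn_t real_launch =
      (launch_fn_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");

  TaskGpuState *st = current_task_state;
  if (st && !st->start_recorded) {
    // First launch from this task: enqueue the start event right before it,
    // so its timestamp tracks the first kernel rather than the task start.
    cudaEventRecord(st->start_event, stream);
    st->start_recorded = true;
  }
  return real_launch(func, grid_dim, block_dim, args, shared_mem, stream);
}
```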

@muraj
Author

muraj commented Jul 29, 2024

@lightsighter Can you file a separate issue for this with all the requirements and the steps to test whether it is set up properly? I haven't gotten familiar enough with Legion Prof to be sure I can reproduce the issue and confirm it's fixed.

@lightsighter
Contributor

I've started an issue on improving the precision of the GPU profiling here: #1732
