-
Notifications
You must be signed in to change notification settings - Fork 730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Fix resource leak related to SYCL_FALLBACK_ASSERT #12532
[SYCL] Fix resource leak related to SYCL_FALLBACK_ASSERT #12532
Conversation
intel#6837 enabled asynchronous buffer destruction for buffers constructed without host data. However, initial fallback assert implementation in intel#3767 predates it and as such had to place the buffer inside `queue_impl` to avoid unintended synchronization point. I don't know if there was the same crash observed on the end-to-end test added as part of this PR prior to intel#3767, but it doesn't even matter because the "new" implementation is both simpler and doesn't result in a crash. I suspect that without it (with the buffer for fallback assert implementation being a data member of `sycl::queue_impl`) we had a cyclic dependency somewhere leading to resource leak and ultimately to the assert in `DeviceGlobalUSMMem::~DeviceGlobalUSMMem()`.
✅ With the latest revision this PR passed the C/C++ code formatter. |
The newly added test is failing. Should I wait for further changes before review? |
My bet is those are yet another issue. AMD/HIP is totally different because it has The only remaining failure that could possibly be related is the leak on Lin/gen12 configuration, but even that is improvement over previous state where we'd fail with an assert because nothing was released. Here, at least, it's just one queue and one event. |
HIP failure is unrelated, so are (likely) extra memory leaks found in CI. The manifestation of the original bug is that almost nothing was freed resulting in an assert in `DeviceGlobalUSMMem::~DeviceGlobalUSMMem()`. That is what this PR addresses and is being verified by the test even without UR_L0_LEAKS_DEBUG.
@maarquitos14 , I've removed the leak tracking from the test as those issues are distinct from what is being fixed here. @KseniyaTikhomirova , @sergey-semenov I know you are busy but I think you're the most experienced in this area, would appreciate if you can take a quick look too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, relying on the deferred release is much better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Post commit failure (Arc GPU):
Looks same as the failure in #12555 (comment) (on a different test). |
It's been unused since intel#12532 awaiting ABI breaking window for the final removal.
) It's been unused since #12532 awaiting ABI breaking window for the final removal.
#6837 enabled asynchronous buffer destruction for buffers constructed without host data. However, initial fallback assert implementation in
#3767 predates it and as such had to place the buffer inside
queue_impl
to avoid unintended synchronization point. I don't know if there was the same crash observed on the end-to-end test added as part of this PR prior to#3767, but it doesn't even matter because the "new" implementation is both simpler and doesn't result in a crash.
I suspect that without it (with the buffer for fallback assert implementation being a data member of
sycl::queue_impl
) we had a cyclic dependency somewhere leading to resource leak and ultimately to the assert inDeviceGlobalUSMMem::~DeviceGlobalUSMMem()
.