Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL] Fix resource leak related to SYCL_FALLBACK_ASSERT #12532

Merged
merged 4 commits into from
Jan 31, 2024

Conversation

aelovikov-intel
Copy link
Contributor

#6837 enabled asynchronous buffer destruction for buffers constructed without host data. However, initial fallback assert implementation in
#3767 predates it and as such had to place the buffer inside queue_impl to avoid unintended synchronization point. I don't know if there was the same crash observed on the end-to-end test added as part of this PR prior to
#3767, but it doesn't even matter because the "new" implementation is both simpler and doesn't result in a crash.

I suspect that without it (with the buffer for fallback assert implementation being a data member of sycl::queue_impl) we had a cyclic dependency somewhere leading to resource leak and ultimately to the assert in DeviceGlobalUSMMem::~DeviceGlobalUSMMem().

intel#6837 enabled asynchronous
buffer destruction for buffers constructed without host data. However,
initial fallback assert implementation in
intel#3767 predates it and as such had to
place the buffer inside `queue_impl` to avoid unintended synchronization
point. I don't know if there was the same crash observed on the
end-to-end test added as part of this PR prior to
intel#3767, but it doesn't even matter
because the "new" implementation is both simpler and doesn't result in a
crash.

I suspect that without it (with the buffer for fallback assert
implementation being a data member of `sycl::queue_impl`) we had a
cyclic dependency somewhere leading to resource leak and ultimately to
the assert in `DeviceGlobalUSMMem::~DeviceGlobalUSMMem()`.
Copy link
Contributor

github-actions bot commented Jan 29, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@maarquitos14
Copy link
Contributor

The newly added test is failing. Should I wait for further changes before review?

@aelovikov-intel
Copy link
Contributor Author

My bet is those are yet another issue. AMD/HIP is totally different because it has aspect::ext_oneapi_native_assert and different code path is executed. Memory leaks detection on Windows is unreliable and I should disable it there.

The only remaining failure that could possibly be related is the leak on Lin/gen12 configuration, but even that is improvement over previous state where we'd fail with an assert because nothing was released. Here, at least, it's just one queue and one event.

HIP failure is unrelated, so are (likely) extra memory leaks found in
CI. The manifestation of the original bug is that almost nothing was
freed resulting in an assert in
`DeviceGlobalUSMMem::~DeviceGlobalUSMMem()`. That is what this PR
addresses and is being verified by the test even without
UR_L0_LEAKS_DEBUG.
@aelovikov-intel
Copy link
Contributor Author

@maarquitos14 , I've removed the leak tracking from the test as those issues are distinct from what is being fixed here.

@KseniyaTikhomirova , @sergey-semenov I know you are busy but I think you're the most experienced in this area, would appreciate if you can take a quick look too.

Copy link
Contributor

@sergey-semenov sergey-semenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, relying on the deferred release is much better.

Copy link
Contributor

@maarquitos14 maarquitos14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

sycl/include/sycl/queue.hpp Show resolved Hide resolved
@aelovikov-intel aelovikov-intel merged commit b478d2f into intel:sycl Jan 31, 2024
19 of 20 checks passed
@aelovikov-intel aelovikov-intel deleted the fix-fallback-assert branch January 31, 2024 20:08
@aelovikov-intel
Copy link
Contributor Author

aelovikov-intel commented Jan 31, 2024

Post commit failure (Arc GPU):

Failed Tests (1):
  SYCL :: Plugin/sycl-partition-info.cpp
FAIL: SYCL :: Plugin/sycl-partition-info.cpp (1486 of 1857)
******************** TEST 'SYCL :: Plugin/sycl-partition-info.cpp' FAILED ********************
Exit Code: -6

Command Output (stdout):
--
# RUN: at line 1
/__w/llvm/llvm/toolchain/bin//clang++   -fsycl -fsycl-targets=spir64 /__w/llvm/llvm/llvm/sycl/test-e2e/Plugin/sycl-partition-info.cpp -o /__w/llvm/llvm/build-e2e/Plugin/Output/sycl-partition-info.cpp.tmp.out
# executed command: /__w/llvm/llvm/toolchain/bin//clang++ -fsycl -fsycl-targets=spir64 /__w/llvm/llvm/llvm/sycl/test-e2e/Plugin/sycl-partition-info.cpp -o /__w/llvm/llvm/build-e2e/Plugin/Output/sycl-partition-info.cpp.tmp.out
# note: command had no output on stdout or stderr
# RUN: at line 2
env ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /__w/llvm/llvm/build-e2e/Plugin/Output/sycl-partition-info.cpp.tmp.out
# executed command: env ONEAPI_DEVICE_SELECTOR=level_zero:gpu /__w/llvm/llvm/build-e2e/Plugin/Output/sycl-partition-info.cpp.tmp.out
# .---command stdout------------
# | Abort was called at 253 line in file:
# | ../../neo/level_zero/core/source/builtin/builtin_functions_lib_impl.cpp
# `-----------------------------
# .---command stderr------------
# | pure virtual method called
# | terminate called without an active exception
# `-----------------------------
# error: command failed with exit status: -6

Looks same as the failure in #12555 (comment) (on a different test).

aelovikov-intel added a commit to aelovikov-intel/llvm that referenced this pull request Mar 28, 2024
It's been unused since intel#12532 awaiting
ABI breaking window for the final removal.
aelovikov-intel added a commit that referenced this pull request Apr 1, 2024
)

It's been unused since #12532 awaiting
ABI breaking window for the final removal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants