
[BUG] Dask-CUDA does not work with Merlin/NVTabular #363

Open
rjzamora opened this issue Dec 18, 2023 · 3 comments

@rjzamora
Contributor

As pointed out by @oliverholworthy in #274 (comment), cuda.is_available() is used in merlin.core.compat to check for CUDA support. Unfortunately, this is a known problem for dask-cuda: the call can initialize a CUDA context in the client process before Dask-CUDA sets up its workers.

This most likely means that Merlin/NVTabular has not worked properly with Dask-CUDA for more than six months now. For example, the following code will produce an OOM error on 32GB V100s:

import time
from merlin.core.utils import Distributed
if __name__ == "__main__":
    with Distributed(rmm_pool_size="24GiB"):
        time.sleep(30)

You will also see an error if you don't import any Merlin/NVTabular code but call the offending cuda.is_available() yourself:

import time
from numba import cuda # This is fine
cuda.is_available() # This is NOT
from dask_cuda import LocalCUDACluster
if __name__ == "__main__":
    with LocalCUDACluster(rmm_pool_size="24GiB") as cluster:
        time.sleep(30)

Meanwhile, the code works fine if you don't use the offending command and don't import anything that also imports merlin.core.compat:

import time
from dask_cuda import LocalCUDACluster
if __name__ == "__main__":
    with LocalCUDACluster(rmm_pool_size="24GiB") as cluster:
        time.sleep(30)

@rjzamora
Contributor Author

cc @jperez999 @karlhigley

@rjzamora
Contributor Author

rjzamora commented Jan 10, 2024

@jperez999 - Do you think this line is actually necessary? If we already have HAS_GPU from the line above, maybe we can just do:

if not HAS_GPU:
    cuda = None
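
For context, a rough sketch of how that could fit together (the surrounding import logic and the PyNVML-based HAS_GPU check are assumptions for illustration, not the actual merlin.core.compat source):

# Hypothetical sketch, not the actual merlin.core.compat code
try:
    import pynvml

    pynvml.nvmlInit()
    HAS_GPU = pynvml.nvmlDeviceGetCount() > 0  # PyNVML does not create a CUDA context
    pynvml.nvmlShutdown()
except Exception:
    HAS_GPU = False

try:
    from numba import cuda  # importing numba.cuda is fine; calling cuda.is_available() is not
except ImportError:
    cuda = None

if not HAS_GPU:
    cuda = None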

@pentschev

Besides the above, I was looking at the code in more detail and I see the following block:

if kind == "free":
    return int(cuda.current_context().get_memory_info()[0])
else:
    return int(cuda.current_context().get_memory_info()[1])

This creates a new context on a GPU only to query the memory size, and a CUDA context should never be created before Dask initializes the cluster. Also note that pynvml_mem_size has an equivalent code block:

if kind == "free":
    size = int(pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(index)).free)
elif kind == "total":
    size = int(pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(index)).total)

The PyNVML code will NOT create a CUDA context and is safe to run before Dask. Is there a reason why you're using the Numba code block to query GPU memory instead of always using PyNVML for that?
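
For illustration, a minimal context-free sketch that always queries memory through PyNVML (the device_mem_size name and the kind/index parameters simply mirror the snippets above; this is not the actual Merlin implementation):

# Hypothetical sketch: query GPU memory without creating a CUDA context
import pynvml

def device_mem_size(kind="total", index=0):
    pynvml.nvmlInit()  # NVML initialization does not create a CUDA context
    try:
        info = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(index))
        return int(info.free if kind == "free" else info.total)
    finally:
        pynvml.nvmlShutdown()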
