
[DLRM v2] Using the model for the inference reference implementation #648

pgmpablo157321 opened this issue May 17, 2023 · 6 comments
pgmpablo157321 commented May 17, 2023

I am currently working on the reference implementation and am stuck deploying the model across multiple GPUs.

Here is a link to the PR: mlcommons/inference#1373
Here is a link to the file where the model is: https://github.com/mlcommons/inference/blob/7c64689b261f97a4fc3410bff584ac2439453bcc/recommendation/dlrm_v2/pytorch/python/backend_pytorch_native.py

Currently this works for a debugging model on a single GPU, but it fails when I try to run it with multiple GPUs. Here are the issues that I have:

  1. If I run the benchmark, it gets stuck at this line. That line needs to run once per rank, but I have not been able to run it on each rank, load the model into a variable, and keep it resident there to query it.
  2. Running the benchmark on CPU, I get the following error when making a prediction:
[...]'fbgemm' object has no attribute 'jagged_2d_to_dense' (this happens when importing torchrec)

or

[...]fbgemm object has no attribute 'bounds_check_indices' (this happens when making a prediction)

This may be because I am trying to load a sharded model in a different number of ranks. Do you know if that could be related?

I have tried PyTorch versions 1.12, 1.13, 2.0.0, and 2.0.1, and fbgemm versions 0.3.2 and 0.4.1.
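A minimal, torch-free sketch of the pattern issue 1 describes (load each rank's shard once, keep it resident, and serve queries against it) using only stdlib multiprocessing. `load_shard` here is a hypothetical stand-in for the real torchrec sharded-checkpoint restore in the backend, not the actual API:

```python
import multiprocessing as mp

def load_shard(rank):
    # Hypothetical placeholder: the real backend would initialize the
    # process group here and restore this rank's shard of the model.
    return {"rank": rank, "weights": f"shard-{rank}"}

def worker(rank, requests, results):
    model = load_shard(rank)  # load once, stays resident in this process
    for sample in iter(requests.get, None):  # serve until sentinel (None)
        results.put((rank, sample, model["weights"]))

def serve(world_size, samples):
    requests, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(r, requests, results))
             for r in range(world_size)]
    for p in procs:
        p.start()
    for s in samples:
        requests.put(s)
    for _ in procs:
        requests.put(None)  # one shutdown sentinel per worker
    out = [results.get() for _ in samples]
    for p in procs:
        p.join()
    return out
```

The key point is that `load_shard` runs exactly once per worker process and the result is queried repeatedly, rather than re-loading per request; the real implementation would additionally need every rank to participate in each query, since the embedding tables are model-parallel.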

@yuankuns

Hi Pablo,
You need to install fbgemm-gpu-cpu==0.3.2 to avoid this error.

@pgmpablo157321

I already have this version, but the error persists:

Name: fbgemm-gpu-cpu
Version: 0.3.2
Summary: 
Home-page: https://github.com/pytorch/fbgemm
Author: FBGEMM Team
Author-email: [email protected]
License: BSD-3
Location: /opt/conda/lib/python3.7/site-packages
Requires: 
Required-by:

@yuankuns

Have you tried to remove fbgemm-gpu as well?

@pgmpablo157321

@yuankuns When I try to remove fbgemm-gpu, I get the following import error:

ModuleNotFoundError: No module named 'fbgemm_gpu'

I managed to run the CPU version with fbgemm-gpu-cpu==0.3.2, fbgemm-gpu==0.4.1, and pytorch==1.13.1 on a machine with a GPU. Without a GPU, I get an fbgemm error like the ones I posted before.

@yuankuns

@pgmpablo157321 That's interesting, since there is no GPU on our server, and installing only fbgemm-gpu-cpu==0.3.2 works in our case.

@ShriyaPalsamudram

@pgmpablo157321 is this still an issue?
