Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fallback to slurm for TorchDistributedEnv #1706

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Conversation

jrapin
Copy link
Contributor

@jrapin jrapin commented Aug 3, 2022

No description provided.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 3, 2022
@jrapin jrapin changed the title Default to slurm for TorchDistributedEnv Fallback to slurm for TorchDistributedEnv Aug 3, 2022
self._job_env = JobEnvironment()
except RuntimeError as e:
if SlurmJobEnvironment._env["job_id"] in os.environ:
# identified a slurm env without submitit, so let's use it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, this is a really weird use case and a surprising thing to try to fix: this is basically to make it possible for users (that are not using submitit to launch jobs) to use a helper function from submitit...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is that a problem? I'm actually happy that we can avoid some inter-dependencies, it gives more freedom

Copy link

@fmassa fmassa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, this change enables someone not using submitit to still be able to retrieve those environment variables that are normally set by torchrun.

This seems a bit weird to me, as this is a helper function from within submitit, so I would expect it to only be relevant when using it in conjunction with submitit.

Maybe what we need to do instead is to see if we can setup those env vars in user code (maybe by using torchrun?).

@jrapin
Copy link
Contributor Author

jrapin commented Aug 23, 2022

From my understanding, this change enables someone not using submitit to still be able to retrieve those environment variables that are normally set by torchrun.

can torchrun be used from python and not commandline?

This seems a bit weird to me, as this is a helper function from within submitit, so I would expect it to only be relevant when using it in conjunction with submitit.
Maybe what we need to do instead is to see if we can setup those env vars in user code (maybe by using torchrun?).

i'm fine with it being in a user code, then again with only a couple of line changes we are able to accomodate more use cases easily, without duplicating code which can also bring some positive aspects :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants