Fallback to slurm for TorchDistributedEnv #1706
    self._job_env = JobEnvironment()
except RuntimeError as e:
    if SlurmJobEnvironment._env["job_id"] in os.environ:
        # identified a slurm env without submitit, so let's use it
As discussed, this is a really weird use case and a surprising thing to try to fix: it basically makes it possible for users who are not using submitit to launch jobs to use a helper function from submitit...
How is that a problem? I'm actually happy that we can avoid some inter-dependencies; it gives more freedom.
From my understanding, this change enables someone not using submitit to still be able to retrieve those environment variables that are normally set by torchrun.
This seems a bit weird to me, as this is a helper function from within submitit, so I would expect it to only be relevant when using it in conjunction with submitit.
Maybe what we need to do instead is to see if we can set up those env vars in user code (maybe by using torchrun?).
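For illustration, here is a rough sketch of what setting those variables up in user code could look like, without going through submitit or torchrun. The function name `torch_env_from_slurm` and its parameters are hypothetical and not part of submitit. It assumes the job was started with `srun`, so that Slurm exports `SLURM_PROCID`, `SLURM_LOCALID` and `SLURM_NTASKS`, and that the caller already knows the master address (resolving it from the node list is left out here).

```python
import os
from typing import Dict, Mapping


def torch_env_from_slurm(
    environ: Mapping[str, str],
    master_addr: str,
    master_port: int = 29500,  # assumption: any free port agreed on by all ranks works
) -> Dict[str, str]:
    """Map Slurm task variables to the variables torch.distributed's env:// init reads.

    Hypothetical helper, not submitit's implementation.
    """
    if "SLURM_JOB_ID" not in environ:
        raise RuntimeError("Not inside a Slurm job: SLURM_JOB_ID is not set")
    return {
        "RANK": environ["SLURM_PROCID"],         # global rank of this task
        "LOCAL_RANK": environ["SLURM_LOCALID"],  # rank of this task on its node
        "WORLD_SIZE": environ["SLURM_NTASKS"],   # total number of tasks in the job
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }
```

In user code this could be applied with `os.environ.update(torch_env_from_slurm(os.environ, master_addr=...))` before initializing the process group. It also shows the duplication that the fallback in this PR would avoid.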
Can torchrun be used from Python and not just from the command line?
I'm fine with it being in user code. Then again, with only a couple of changed lines we can easily accommodate more use cases without duplicating code, which also has some positive aspects :)
No description provided.