
Parallelization on multiple nodes on a HPC-cluster #2179

Closed
NormanTUD opened this issue Jan 25, 2024 · 1 comment

Labels
question Further information is requested

Comments

@NormanTUD

🚀 Feature Request

I have an HPC cluster at hand, and we want to use ax/botorch to optimize the hyperparameters of neural networks. Currently we use HyperOpt, which lets you run worker processes on different nodes that all communicate with a single MongoDB server; the server coordinates which hyperparameter configurations should be tried, tracks which ones have already been tried, and stores the results.

We have hundreds of nodes, and with HyperOpt we can run a worker on each of them, have it train a neural network on a given parameter configuration, and then use the result to find further promising points. We'd love to do something similar with ax/botorch, but I just cannot get it to work.

I've tried using multiprocessing, as suggested here -> facebook/Ax#896 , but it didn't work out for me. Depending on the code I tried, I got many different error messages, for example "DataRequiredError: All trials for current model have been generated, but not enough data has been observed to fit next model. Try again when more data are available." — and many more, too many to fit them all here.

I've also been looking through the documentation, and I thought I might use OptimizationLoop in ax together with run_async to create a temporary file that a worker can work on and return when done, but it turned out that the only thing this option does is trigger an assertion: assert not run_async, "OptimizationLoop does not yet support async.".

Is there any example of how I could do that? I'd prefer botorch, as it, if I understood it correctly, offers a more abstract interface, but as long as it's possible, if someone here tells me "use ax, it's easy with that", I'll gladly do that as well.

In short, again, what I have and what I want:

  • I have a huge cluster of computers, interconnected by a network
  • I want to use many of them as workers that try out promising hyperparameter configurations
  • I need the configurations to be executed and tested in parallel, coordinated by a single process on one of the nodes, communicating either over the network or over a shared filesystem (it doesn't really matter to me, as long as they can communicate at all)
  • It really needs to be parallelized, similar to how I can do it with workers in HyperOpt
  • The coordinating main process should collect the results from the workers and use them to generate new promising points for the workers to test

Is there any option, or an example that I was not able to find, showing how to do that? I'd really be happy if someone could just point me to a (very simple) example of how something like that could be achieved.

@NormanTUD NormanTUD added the enhancement New feature or request label Jan 25, 2024
@Balandat
Contributor

What you're looking for is the Ax Scheduler, which allows you to do just that, provided you have some way of deploying a trial to a machine, and then checking its status and returning the results of the training job. This tutorial should get you started: https://github.com/facebook/Ax/blob/main/tutorials/scheduler.ipynb
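For orientation, here is a rough sketch of the kind of Runner the Scheduler drives. The `my_cluster` module below is a placeholder for whatever you use to submit and poll jobs on your cluster (sbatch, submitit, etc.); the actual interfaces and a complete example are in the tutorial linked above.

```python
from typing import Any, Dict, Iterable, Set

from ax.core.base_trial import BaseTrial, TrialStatus
from ax.core.runner import Runner

import my_cluster  # placeholder: your own job-submission wrapper


class ClusterJobRunner(Runner):
    """Deploys each Ax trial as a training job on the cluster and reports its status."""

    def run(self, trial: BaseTrial) -> Dict[str, Any]:
        # Submit one training job per trial; the returned dict is stored as
        # run metadata on the trial and is available when polling later.
        parameters = trial.arm.parameters  # single-arm Trial; use trial.arms for batch trials
        job_id = my_cluster.submit(parameters)  # placeholder call
        return {"job_id": job_id}

    def poll_trial_status(
        self, trials: Iterable[BaseTrial]
    ) -> Dict[TrialStatus, Set[int]]:
        # Ask the cluster which jobs have finished and map trials to their status.
        status_dict: Dict[TrialStatus, Set[int]] = {}
        for trial in trials:
            job_id = trial.run_metadata["job_id"]
            if my_cluster.is_done(job_id):  # placeholder call
                status_dict.setdefault(TrialStatus.COMPLETED, set()).add(trial.index)
            else:
                status_dict.setdefault(TrialStatus.RUNNING, set()).add(trial.index)
        return status_dict
```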

The tutorial is somewhat abstract; here is a WIP submitit integration that allows scheduling jobs on a SLURM cluster: facebook/Ax#2125
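And a rough sketch of how the Scheduler is then driven; `experiment`, `generation_strategy`, and the trial counts are placeholders for your own setup, which the tutorial walks through.

```python
from ax.service.scheduler import Scheduler, SchedulerOptions

# Assumes an Ax `experiment` (search space, optimization config, a metric that
# fetches each job's result) and a `generation_strategy` already exist.
experiment.runner = ClusterJobRunner()

scheduler = Scheduler(
    experiment=experiment,
    generation_strategy=generation_strategy,  # e.g. Sobol warmup, then a BoTorch model
    options=SchedulerOptions(
        max_pending_trials=100,  # roughly: how many nodes to keep busy at once
    ),
)

# Trials run asynchronously: new candidates are generated as workers finish,
# which gives the HyperOpt-style parallelism you describe.
scheduler.run_n_trials(max_trials=500)
```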

@Balandat Balandat self-assigned this Jan 25, 2024
@Balandat Balandat added question Further information is requested and removed enhancement New feature or request labels Jan 25, 2024
@pytorch pytorch locked and limited conversation to collaborators Feb 14, 2024
@Balandat Balandat converted this issue into discussion #2204 Feb 14, 2024
