
Parallelization on multiple nodes on a HPC-cluster #2179

Closed
NormanTUD opened this issue Jan 25, 2024 · 1 comment

Labels
question Further information is requested

Comments

@NormanTUD

🚀 Feature Request

I have an HPC cluster at hand, and we want to use ax/botorch to optimize the hyperparameters of neural networks. Currently we use HyperOpt, which lets you run worker processes on different nodes that all communicate with a single MongoDB server; the server coordinates which hyperparameter configurations should be tried, tracks which ones have already been tried, and stores the results.

We have hundreds of nodes, and with HyperOpt we can run a worker on each of them, have it train a neural network on a given parameter configuration, and then use the result to find further promising points. We'd love to do something similar with ax/botorch, but I just cannot get it to work.

I've tried using multiprocessing, as suggested here -> facebook/Ax#896 , but it didn't work out for me. Depending on the code I tried, I got many different error messages, for example "DataRequiredError: All trials for current model have been generated, but not enough data has been observed to fit next model. Try again when more data are available." — and many more, too many to fit them all here.

I've also been looking through the documentation, and I thought I might use OptimizationLoop in ax together with run_async to create a temporary file that a worker can work on and return when done, but it turned out that the only thing this option does is trigger an assertion: assert not run_async, "OptimizationLoop does not yet support async.".

Is there any example of how I could do that? I'd prefer botorch, as it, if I understood it correctly, offers a more abstract interface, but as long as it's possible, if someone here tells me "use ax, it's easy with that", I'll gladly do that as well.

In short, again, what I have and what I want:

  • I have a huge cluster of computers, interconnected by a network
  • I want to use many of them as workers that try out promising hyperparameter configurations
  • I need the configurations to be executed and tested in parallel, coordinated by a single process on one of the nodes, communicating either over the network or over a shared filesystem (it doesn't really matter to me, as long as they can communicate at all)
  • It really needs to be parallelized, similar to how I can do it with workers in HyperOpt
  • The coordinating main process should collect the results from the workers and use them to generate new promising points for the workers to test

Is there any option, or an example that I was not able to find, showing how to do that? I'd really be happy if someone could just point me to a (very simple) example of how something like that could be achieved.

@NormanTUD NormanTUD added the enhancement New feature or request label Jan 25, 2024
@Balandat
Contributor

What you're looking for is the Ax Scheduler, which allows you to do just that, provided you have some way of deploying a trial to a machine, and then checking its status and returning the results of the training job. This tutorial should get you started: https://github.com/facebook/Ax/blob/main/tutorials/scheduler.ipynb
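For orientation, here is a rough sketch of the kind of Runner the Scheduler drives. The `my_cluster` module below is a placeholder for whatever you use to submit and poll jobs on your cluster (sbatch, submitit, etc.); the actual interfaces and a complete example are in the tutorial linked above.

```python
from typing import Any, Dict, Iterable, Set

from ax.core.base_trial import BaseTrial, TrialStatus
from ax.core.runner import Runner

import my_cluster  # placeholder: your own job-submission wrapper


class ClusterJobRunner(Runner):
    """Deploys each Ax trial as a training job on the cluster and reports its status."""

    def run(self, trial: BaseTrial) -> Dict[str, Any]:
        # Submit one training job per trial; the returned dict is stored as
        # run metadata on the trial and is available when polling later.
        parameters = trial.arm.parameters  # single-arm Trial; use trial.arms for batch trials
        job_id = my_cluster.submit(parameters)  # placeholder call
        return {"job_id": job_id}

    def poll_trial_status(
        self, trials: Iterable[BaseTrial]
    ) -> Dict[TrialStatus, Set[int]]:
        # Ask the cluster which jobs have finished and map trials to their status.
        status_dict: Dict[TrialStatus, Set[int]] = {}
        for trial in trials:
            job_id = trial.run_metadata["job_id"]
            if my_cluster.is_done(job_id):  # placeholder call
                status_dict.setdefault(TrialStatus.COMPLETED, set()).add(trial.index)
            else:
                status_dict.setdefault(TrialStatus.RUNNING, set()).add(trial.index)
        return status_dict
```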

The tutorial is somewhat abstract; here is a WIP submitit integration that allows scheduling jobs on a SLURM cluster: facebook/Ax#2125
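And a rough sketch of how the Scheduler is then driven; `experiment`, `generation_strategy`, and the trial counts are placeholders for your own setup, which the tutorial walks through.

```python
from ax.service.scheduler import Scheduler, SchedulerOptions

# Assumes an Ax `experiment` (search space, optimization config, a metric that
# fetches each job's result) and a `generation_strategy` already exist.
experiment.runner = ClusterJobRunner()

scheduler = Scheduler(
    experiment=experiment,
    generation_strategy=generation_strategy,  # e.g. Sobol warmup, then a BoTorch model
    options=SchedulerOptions(
        max_pending_trials=100,  # roughly: how many nodes to keep busy at once
    ),
)

# Trials run asynchronously: new candidates are generated as workers finish,
# which gives the HyperOpt-style parallelism you describe.
scheduler.run_n_trials(max_trials=500)
```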

@Balandat Balandat self-assigned this Jan 25, 2024
@Balandat Balandat added question Further information is requested and removed enhancement New feature or request labels Jan 25, 2024
@pytorch pytorch locked and limited conversation to collaborators Feb 14, 2024
@Balandat Balandat converted this issue into discussion #2204 Feb 14, 2024
