Run sweep on cluster

apjansen · June 17, 2022, 9:06am

I want to run a sweep on a cluster. Say I want to use just 1/4 of a node, which has 32 CPUs (it's an RNN, so CPUs are fine and cheaper). One CPU has enough memory for a single run, but of course I want to use all of them, so ideally I'd run 32 training loops in parallel.

How do I get this to work?

mohammadbakir · June 21, 2022, 5:47pm

Hi @apjansen,

Thank you for writing in with your question. W&B does support distributed training, see here. In addition, we highly recommend using wandb service, see here, which improves how W&B handles multiprocessing runs and makes them more reliable in a distributed training setting. Please let me know if you have additional questions.
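For the case in the question (many independent single-CPU runs rather than one distributed job), one possible approach is to start several `wandb agent` processes against the same sweep, so that each agent pulls its own hyperparameter sets. The following is only a rough sketch under some assumptions: the sweep has already been created, `wandb` is installed and logged in on the node, and the helper names (`build_agent_command`, `agent_env`, `launch_agents`) are hypothetical, not part of W&B. Limiting each process to one thread through `OMP_NUM_THREADS` is also an assumption about the training code's numerical backend.

```python
import os
import subprocess


def build_agent_command(sweep_path, runs_per_agent=None):
    """Build the argv for one `wandb agent` process (hypothetical helper).

    sweep_path is the sweep identifier, e.g. "entity/project/sweep_id".
    """
    cmd = ["wandb", "agent"]
    if runs_per_agent is not None:
        # --count caps how many runs this agent executes before exiting.
        cmd += ["--count", str(runs_per_agent)]
    cmd.append(sweep_path)
    return cmd


def agent_env(base_env=None):
    """Copy the environment and ask common math backends for one thread.

    Assumption: the training code honours these variables, so that
    32 agents on 32 CPUs do not oversubscribe the cores.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    env["MKL_NUM_THREADS"] = "1"
    return env


def launch_agents(sweep_path, n_agents, runs_per_agent=None):
    """Start n_agents agents in parallel and wait for all of them."""
    cmd = build_agent_command(sweep_path, runs_per_agent)
    env = agent_env()
    procs = [subprocess.Popen(cmd, env=env) for _ in range(n_agents)]
    # Return the exit code of every agent process.
    return [p.wait() for p in procs]
```

Calling `launch_agents("entity/project/sweep_id", 32)` from a job script would then start 32 agents on the allocated CPUs, where the sweep path is a placeholder for your own sweep. Whether this fits depends on how your scheduler assigns cores to the job.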

Regards,

Mohammad

mohammadbakir · June 24, 2022, 10:42pm

Hi @apjansen, following up on your request regarding distributed training. Is there anything from our docs on how to implement your process that I can help clarify?

mohammadbakir · June 30, 2022, 8:28pm

Hi @apjansen, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!

system · August 29, 2022, 8:28pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
