I want to run a sweep on a cluster, using just a quarter of a node that has 32 CPUs (it's an RNN, so CPUs are fine and cheaper). One CPU has enough memory for a single run, but of course I want to use all of them, so ideally I'd run 32 training loops in parallel.
How do I get this to work?
Hi @apjansen ,
Thank you for writing in with your question. W&B does support distributed training, see here. In addition, we highly recommend using wandb service, see here, which improves how W&B handles multiprocessing runs and thus increases reliability in a distributed training setting. Please let me know if you have additional questions.
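For the specific goal above of filling 32 CPUs with one sweep, the usual pattern is to create the sweep once and then start one `wandb agent` process per CPU; each agent pulls hyperparameter sets from the sweep queue independently. A minimal sketch, assuming a training script named `train.py` and a sweep config named `sweep.yaml` (both placeholder names; the method, metric, and parameters below are illustrative):

```shell
# sweep.yaml (illustrative config fragment):
#   program: train.py
#   method: random
#   metric:
#     name: val_loss
#     goal: minimize
#   parameters:
#     lr:
#       min: 0.0001
#       max: 0.1

# Create the sweep; `wandb sweep` prints a sweep ID of the form
# entity/project/sweep_id.
wandb sweep sweep.yaml

# Launch 32 agents in the background, one per CPU. Each agent runs
# train.py repeatedly with parameters drawn from the sweep.
for i in $(seq 1 32); do
  wandb agent <entity>/<project>/<sweep_id> &
done

# Wait for all agents to finish.
wait
```

On a managed cluster, the same loop can live inside a batch-scheduler job script requesting 32 CPUs, so all agents land on the same quarter-node allocation.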
Hi @apjansen , following up on your request regarding distributed training. Is there anything I can help clarify for you from our docs on how to implement your process?
Hi @apjansen, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!