Help with running a sweep agent on a multi-GPU machine with PyTorch DistributedDataParallel

Hello,

I need to run sweep agents on a multi-GPU machine with PyTorch DistributedDataParallel. This means I prepend torchrun --standalone --nproc_per_node=8 to the training script command. The desired behavior is: 1) only the master process calls wandb.agent(...), 2) the data populated in wandb.config gets passed from the master process to all other processes, and 3) when the end of a run is reached, the agent starts another run if the sweep count is > 1.

I was able to do 1) and 2) in a very hacky way, so I don't need help there, but if there are recommended ways of doing them, please let me know.
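For reference, here is a minimal sketch of the kind of thing I mean (it assumes the process group has already been initialized by whatever launches train(); names are placeholders for my actual code):

```python
# Hacky-but-workable sketch of 1) and 2); assumes init_process_group has
# already been called by the code that invokes train().
import torch.distributed as dist
import wandb

def train():
    rank = dist.get_rank()

    # 1) only the master process (rank 0) talks to wandb; inside a sweep run,
    #    wandb.init() populates wandb.config with the values the agent chose
    if rank == 0:
        wandb.init()
        config = wandb.config.as_dict()
    else:
        config = None

    # 2) broadcast the sweep config from rank 0 to every other process
    payload = [config]
    dist.broadcast_object_list(payload, src=0)
    config = payload[0]

    # ... build the model, wrap it in DistributedDataParallel, train with `config` ...

    if rank == 0:
        wandb.finish()
```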

3) remains unsolved for me. The primary problem is that only the master process calls wandb.agent(), and wandb.agent() calls the main training function where init_process_group and DDP are set up; so at the end of a run, only the master process gets re-run, because all the other processes never went through wandb.agent() and have already exited. Would really appreciate any help or pointers.
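To make the control flow concrete, here is a stripped-down sketch of the launch side (the sweep id is a placeholder, and train stands in for my real training function):

```python
# Launched with: torchrun --standalone --nproc_per_node=8 train.py
import torch.distributed as dist
import wandb

def train():
    pass  # placeholder for the training function sketched above

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    if dist.get_rank() == 0:
        # rank 0: the agent calls train() once per sweep run, up to `count` times
        wandb.agent(sweep_id="entity/project/sweep-id", function=train, count=3)
    else:
        # every other rank: train() runs exactly once and the process exits,
        # so the later runs on rank 0 have no peers left to synchronize with
        train()
    dist.destroy_process_group()
```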

You can find my entire training code here: yif-AI/utils/train.py at main · yiphei/yif-AI · GitHub. Problem 3) should be self-evident in this block: yif-AI/utils/train.py at main · yiphei/yif-AI · GitHub

Hi @yanyiphei, thanks for your question. Here are our reference docs about running sweeps in a multi-GPU environment, and here is a report that could be helpful.

@luis_bergua1 did you even read my post? The links you sent are entirely irrelevant. Or was I not clear? I can help clarify.

Hey @yanyiphei, apologies if I caused any confusion. We currently don't have any official examples of running multi-GPU sweeps with DDP, which is why I shared that docs page and the report; the recommended way is to set CUDA_VISIBLE_DEVICES for each process. Another option would be to use our Launch feature to push the sweep to a Launch Queue with the specified hyperparameters to sweep over.
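For reference, a rough sketch of that CUDA_VISIBLE_DEVICES pattern (not an official example; the sweep id and GPU count are placeholders) is one independent agent per GPU, each running its sweep runs as single-GPU jobs:

```python
# Rough sketch: one sweep agent per GPU, each pulling its own runs from the sweep.
import os
import subprocess

SWEEP_ID = "entity/project/sweep-id"
NUM_GPUS = 8

procs = []
for gpu in range(NUM_GPUS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    # each agent only sees one GPU and runs the training script as a normal,
    # non-distributed job for every run it picks up from the sweep
    procs.append(subprocess.Popen(["wandb", "agent", SWEEP_ID], env=env))

for p in procs:
    p.wait()
```

Note that this parallelizes the sweep across GPUs rather than running a single DDP job across all of them, so it doesn't directly cover the torchrun setup you described above.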