Hello,
I need to do sweep agent runs on a multi-GPU machine with PyTorch DistributedDataParallel. This means that I prepend `torchrun --standalone --nproc_per_node=8` to the training script command. The desired behavior is:

1) only the master process calls `wandb.agent(...)`,
2) the data populated in `wandb.config` is passed from the master process to all other processes, and
3) when the end of a run is reached, the agent starts another run as long as the sweep count > 1.
I was able to do 1) and 2) in a fairly hacky way, so I don't need help there, but if there are recommended ways of doing them, please let me know.
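To make 2) concrete, here is a runnable single-process sketch of the broadcast I mean (using the gloo backend so it runs on CPU; `broadcast_config` is just an illustrative name, not from my actual code):

```python
import os
import torch.distributed as dist

def broadcast_config(config, rank):
    """Send the sweep-populated config dict from rank 0 to every other rank."""
    # Only rank 0 has the real config (filled in by wandb.init under the agent);
    # broadcast_object_list pickles it and ships it to the other ranks.
    holder = [config if rank == 0 else None]
    dist.broadcast_object_list(holder, src=0)
    return holder[0]

# Single-process demo; under torchrun these env vars are set automatically.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

cfg = broadcast_config({"lr": 1e-3}, dist.get_rank())
print(cfg)  # -> {'lr': 0.001}

dist.destroy_process_group()
```

In the real script, rank 0 would call `dict(wandb.config)` after `wandb.init()` and pass that as `config`, while the other ranks pass `None`.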
3) remains unsolved for me. The primary problem is that only the master process calls `wandb.agent()`, and `wandb.agent()` calls the main training function where `init_process_group` and DDP are used; so at the end of the script, only the master process gets re-run (because all the other processes never went through `wandb.agent`). I would really appreciate any help or pointers.
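To make 3) concrete: the mismatch is that rank 0 re-enters the training function through the agent while every other rank runs it exactly once and then exits. A dependency-free toy illustration of that control flow (`fake_agent` just stands in for `wandb.agent`; the names are placeholders, not my actual code):

```python
# Rank 0's agent starts `count` runs back to back, but the other ranks
# call the training function directly, so they only ever run it once.
def fake_agent(function, count):
    # stands in for wandb.agent(sweep_id, function=..., count=...)
    for _ in range(count):
        function()

calls = {0: 0, 1: 0}  # how many "runs" each rank executes

def make_train(rank):
    def train():
        calls[rank] += 1  # one DDP training run on this rank
    return train

fake_agent(make_train(0), count=3)  # rank 0 goes through the agent
make_train(1)()                     # rank 1 does not

print(calls)  # -> {0: 3, 1: 1}  -- the ranks disagree on how many runs happen
```

After the first run, rank 0's second `init_process_group`/DDP setup would then hang waiting for peers that have already exited, which is exactly the failure I see.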
You can find my entire training code here: yif-AI/utils/train.py at main · yiphei/yif-AI · GitHub. Problem 3) should be self-evident in this block: yif-AI/utils/train.py at main · yiphei/yif-AI · GitHub