Hello,
I need to do sweep agent runs on a multi-GPU machine with PyTorch DistributedDataParallel. This means that I prepend `torchrun --standalone --nproc_per_node=8` to the training script command. The desired behavior is:

1) only the master process calls `wandb.agent(...)`,
2) the data populated in `wandb.config` is passed from the master process to all other processes, and
3) when the end of a run is reached, the agent starts another run as long as the sweep count > 1.
I was able to do 1) and 2) in a fairly hacky way, so I don't need help there, but if there are recommended ways of doing them, please let me know.
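To make 2) concrete, here is a runnable single-process sketch of the broadcast I mean (using the gloo backend so it runs on CPU; `broadcast_config` is just an illustrative name, not from my actual code):

```python
import os
import torch.distributed as dist

def broadcast_config(config, rank):
    """Send the sweep-populated config dict from rank 0 to every other rank."""
    # Only rank 0 has the real config (filled in by wandb.init under the agent);
    # broadcast_object_list pickles it and ships it to the other ranks.
    holder = [config if rank == 0 else None]
    dist.broadcast_object_list(holder, src=0)
    return holder[0]

# Single-process demo; under torchrun these env vars are set automatically.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

cfg = broadcast_config({"lr": 1e-3}, dist.get_rank())
print(cfg)  # -> {'lr': 0.001}

dist.destroy_process_group()
```

In the real script, rank 0 would call `dict(wandb.config)` after `wandb.init()` and pass that as `config`, while the other ranks pass `None`.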
3) remains unsolved for me. The primary problem is that only the master process calls `wandb.agent()`, and `wandb.agent()` calls the main training function where `init_process_group` and DDP are used; so at the end of the script, only the master process gets re-run (because all the other processes never went through `wandb.agent`). I would really appreciate any help or pointers.
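To make 3) concrete: the mismatch is that rank 0 re-enters the training function through the agent while every other rank runs it exactly once and then exits. A dependency-free toy illustration of that control flow (`fake_agent` just stands in for `wandb.agent`; the names are placeholders, not my actual code):

```python
# Rank 0's agent starts `count` runs back to back, but the other ranks
# call the training function directly, so they only ever run it once.
def fake_agent(function, count):
    # stands in for wandb.agent(sweep_id, function=..., count=...)
    for _ in range(count):
        function()

calls = {0: 0, 1: 0}  # how many "runs" each rank executes

def make_train(rank):
    def train():
        calls[rank] += 1  # one DDP training run on this rank
    return train

fake_agent(make_train(0), count=3)  # rank 0 goes through the agent
make_train(1)()                     # rank 1 does not

print(calls)  # -> {0: 3, 1: 1}  -- the ranks disagree on how many runs happen
```

After the first run, rank 0's second `init_process_group`/DDP setup would then hang waiting for peers that have already exited, which is exactly the failure I see.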
You can find my entire training code here: yif-AI/utils/train.py at main · yiphei/yif-AI · GitHub. Problem 3) should be self-evident in this block: yif-AI/utils/train.py at main · yiphei/yif-AI · GitHub