My model runs on multiple nodes/GPUs using the “ddp” strategy in PyTorch Lightning. I log my runs to wandb via WandbLogger in Lightning. I wonder if it is possible to use wandb sweeps in this setup? From the docs I got the impression that it is not — I hope I am mistaken.
Right now I call wandb.init() only on the rank-zero worker. I guess with sweeps this would not be possible, because all the workers would need to pick up the hyperparameters from wandb.config? Also there is the issue of running the agent from multiple nodes — will they step on each other’s toes?
Thanks in advance,
Hi @evgeny-tanhilevich, thanks for writing in and happy to help. We have relatively limited examples with multi-GPU training, because our examples are mostly backed by Colab, which only offers a single GPU. If you’re comfortable submitting metrics only from the rank-0 process, this shouldn’t be any harder, on the wandb side, than running a sweep with single-node/single-GPU agents. There’s no need for wandb to know about the processes driving the other GPUs. That approach is Method 1 here. To make this work well, you would need a synchronized interprocess communication channel, so that the rank-0 process can share wandb.config with all the other processes and they can send the data to be logged back to the rank-0 process.
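Since Lightning’s “ddp” strategy already initializes torch.distributed, one way to sketch that sharing step is to broadcast the rank-0 config as a Python object. This is only an illustration, not an official wandb API: `broadcast_config` is a hypothetical helper name, and the fallback branch simply returns the dict unchanged in a single-process run.

```python
def broadcast_config(config: dict) -> dict:
    """Share the rank-0 wandb.config with the other DDP workers.

    Hypothetical helper: assumes torch.distributed has been initialized
    (Lightning's "ddp" strategy does this). In a single-process run, or
    when torch is unavailable, the config is returned unchanged.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return config
    if not (dist.is_available() and dist.is_initialized()):
        return config
    # Only rank 0 fills in the payload; the broadcast overwrites it everywhere else.
    payload = [dict(config) if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(payload, src=0)
    return payload[0]
```

On rank 0 you would call this right after wandb.init() with the sweep-provided wandb.config; every other rank calls it with an empty dict and receives the rank-0 values.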
You can additionally look into parallelizing sweep agents: you can specify which CUDA-enabled GPU each sweep agent runs on, keeping everything separate.
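As a rough sketch of that parallel-agent setup, you can pin each `wandb agent` process to one GPU via CUDA_VISIBLE_DEVICES. The helper names (`agent_cmd`, `launch_agents`) and the sweep path are placeholders, not part of the wandb CLI itself:

```python
import os
import subprocess

def agent_cmd(sweep_path: str, gpu: int):
    """Build the command and environment for one sweep agent pinned to `gpu`.

    `sweep_path` is the usual "entity/project/sweep_id" string that
    `wandb agent` accepts; restricting CUDA_VISIBLE_DEVICES makes each
    agent see only its own GPU.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    return ["wandb", "agent", sweep_path], env

def launch_agents(sweep_path: str, gpu_ids):
    """Start one agent per GPU and wait for all of them to finish."""
    procs = []
    for gpu in gpu_ids:
        cmd, env = agent_cmd(sweep_path, gpu)
        procs.append(subprocess.Popen(cmd, env=env))
    for p in procs:
        p.wait()
```

Because each agent pulls its own hyperparameter set from the sweep server, the agents stay independent and won’t step on each other’s toes.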
Hi @evgeny-tanhilevich since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!