Showing total loss in distributed computing

Hello,

I am trying to use wandb to monitor my loss function, and I am running my code on multiple GPU nodes. When logging the loss I see 4 different run links, one for each node. However, I would like to see the total loss and have a single process report it through wandb. How should I handle this?

I would appreciate any feedback, as I am totally new to this community.
Thanks,

Hi @mkhoshle welcome to W&B community!

Is it one experiment distributed across 4 GPUs, or are you running 4 different experiments? Which framework are you using? Any code snippet or link to the workspace would help us investigate this further.

Would this example with PyTorch DDP fit your case? You may also find our reference docs here about distributed training useful.
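
For reference, a setup along those lines often looks roughly like this (a minimal sketch, not the exact example from the docs: the model and data are placeholders, and it assumes launching with `torchrun --nproc_per_node=4 train.py`):

```python
# Sketch: one wandb run for the whole DDP job, created on rank 0 only.
import torch
import torch.distributed as dist
import torch.nn as nn
import wandb

def main():
    dist.init_process_group("nccl")          # torchrun sets RANK / WORLD_SIZE / MASTER_ADDR
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    # Only rank 0 creates a wandb run, so the workspace shows a single run instead of four.
    run = wandb.init(project="ddp-demo") if rank == 0 else None

    model = nn.parallel.DistributedDataParallel(
        nn.Linear(10, 1).to(device), device_ids=[device.index]
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(100):
        x = torch.randn(32, 10, device=device)   # placeholder data
        y = torch.randn(32, 1, device=device)
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if rank == 0:                             # log from a single process
            wandb.log({"loss": loss.item()}, step=step)

    if run is not None:
        run.finish()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```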

Hi @thanos-wandb ,

Yes, I am running one experiment across multiple GPUs (e.g. 4 GPUs), and yes, PyTorch DistributedDataParallel is what I am using. I just need some guidance on when I should log from all processes and when I should log from only one. Which approach would be more useful and provide more insight?

Thanks

Hi @mkhoshle , this document may be helpful, listing two methods recommended by W&B for multiprocessing logging. Method 1, logging through the rank 0 process, would work for your case of only wanting to log a single value from a single process.
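
If you want the logged value to reflect all 4 GPUs rather than just rank 0's local loss, you can reduce it across processes before logging. Something like the sketch below (it assumes `torch.distributed` is already initialised and `wandb.init` was called on rank 0 as in the earlier snippet; `log_global_loss` is just an illustrative helper name, not a W&B API):

```python
# Sketch: sum the per-rank losses with all_reduce, then log the job-wide
# total (and mean) from rank 0 only.
import torch
import torch.distributed as dist
import wandb

def log_global_loss(local_loss: torch.Tensor, step: int):
    """Reduce the per-rank loss across all processes and log from rank 0."""
    loss = local_loss.detach().clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)   # every rank ends up with the sum
    if dist.get_rank() == 0:
        world_size = dist.get_world_size()
        wandb.log(
            {
                "loss/total": loss.item(),              # sum over the 4 GPUs
                "loss/mean": loss.item() / world_size,  # average per process
            },
            step=step,
        )
```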

Hi @mkhoshle , do you still need assistance with W&B multiprocessing?

Hi @mkhoshle , since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.