Showing total loss in distributed computing

Hello,

I am trying to use wandb to monitor my loss function, and I am running my code on multiple GPU nodes. When logging the loss I see 4 different run links, one for each node. However, I would like to see the total loss and have a single process report it through wandb. How should I handle this?

I would appreciate any feedback, as I am totally new to this community.
Thanks,

Hi @mkhoshle welcome to W&B community!

Is it one experiment distributed across 4 GPUs, or are you running 4 different experiments? Which framework are you using? Any code snippet or link to the workspace would help us investigate this further.

Would this example with PyTorch DDP fit your case? You may also find our reference docs here about distributed training useful.
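
For reference, a setup along those lines often looks roughly like this (a minimal sketch, not the exact example from the docs: the model and data are placeholders, and it assumes launching with `torchrun --nproc_per_node=4 train.py`):

```python
# Sketch: one wandb run for the whole DDP job, created on rank 0 only.
import torch
import torch.distributed as dist
import torch.nn as nn
import wandb

def main():
    dist.init_process_group("nccl")          # torchrun sets RANK / WORLD_SIZE / MASTER_ADDR
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    # Only rank 0 creates a wandb run, so the workspace shows a single run instead of four.
    run = wandb.init(project="ddp-demo") if rank == 0 else None

    model = nn.parallel.DistributedDataParallel(
        nn.Linear(10, 1).to(device), device_ids=[device.index]
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(100):
        x = torch.randn(32, 10, device=device)   # placeholder data
        y = torch.randn(32, 1, device=device)
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if rank == 0:                             # log from a single process
            wandb.log({"loss": loss.item()}, step=step)

    if run is not None:
        run.finish()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```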

Hi @thanos-wandb ,

Yes, I am running one experiment across multiple GPUs (e.g. 4 GPUs), and yes, PyTorch DistributedDataParallel is what I am using. I just need some guidance on when I should log from all processes and when I should log from only one. Which approach would be more useful and provide more insight?

Thanks

Hi @mkhoshle , this document may be helpful, listing two methods recommended by W&B for multiprocessing logging. Method 1, logging through the rank 0 process, would work for your case of only wanting to log a single value from a single process.
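
If you want the logged value to reflect all 4 GPUs rather than just rank 0's local loss, you can reduce it across processes before logging. Something like the sketch below (it assumes `torch.distributed` is already initialised and `wandb.init` was called on rank 0 as in the earlier snippet; `log_global_loss` is just an illustrative helper name, not a W&B API):

```python
# Sketch: sum the per-rank losses with all_reduce, then log the job-wide
# total (and mean) from rank 0 only.
import torch
import torch.distributed as dist
import wandb

def log_global_loss(local_loss: torch.Tensor, step: int):
    """Reduce the per-rank loss across all processes and log from rank 0."""
    loss = local_loss.detach().clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)   # every rank ends up with the sum
    if dist.get_rank() == 0:
        world_size = dist.get_world_size()
        wandb.log(
            {
                "loss/total": loss.item(),              # sum over the 4 GPUs
                "loss/mean": loss.item() / world_size,  # average per process
            },
            step=step,
        )
```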

Hi @mkhoshle , do you still need assistance with W&B multiprocessing?

Hi @mkhoshle , since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.