Wandb process not getting terminated properly

My process is not terminating properly (running in a multi-GPU setting). It tries to upload information but gets stuck for some reason. I have been facing this problem since yesterday, and I haven't changed the library version in that time (upgrading to the latest version did not resolve it either). Any help would be highly appreciated. I can disable wandb completely by passing mode="disabled" in the test setting, but I need it when running sweeps or logging training metrics.
P.S.: The same code was running just fine until yesterday.
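
For anyone reading along, here is a minimal sketch of the mode="disabled" toggle mentioned above. The `testing` flag, the helper names, and the "my-project" project name are made up for illustration; only `wandb.init(mode=...)` comes from the library.

```python
def wandb_mode(testing: bool) -> str:
    """Pick a wandb mode: no logging in tests, normal logging otherwise.

    `testing` is a hypothetical flag; wire it to however your script
    distinguishes test runs from training runs or sweeps.
    """
    return "disabled" if testing else "online"


def start_run(testing: bool, project: str = "my-project"):
    """Start a wandb run with the mode chosen above (illustrative wrapper)."""
    # Imported lazily so wandb_mode() can be used without wandb installed.
    import wandb

    # mode="disabled" turns wandb calls into no-ops; "online" syncs as usual.
    return wandb.init(project=project, mode=wandb_mode(testing))
```

Calling `start_run(testing=True)` gives a run object whose logging calls do nothing, so the rest of the script does not need separate code paths for tests.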

Hi @adi-iitd,

Thanks for your question! Could you share some details of your project so that I can get a better understanding of the issue? Specifically:

  • What OS are you using?
  • What version of python are you currently running?
  • What version of the wandb client are you using?
  • What library are you working with?

If possible, I would also appreciate a minimal script that replicates the issue, so that I can get to the root of the problem.

Thanks,
Ramit

Hi @ramit_goolry, thanks for the response.
OS: Linux 18.04
Python: 3.8
Wandb: First faced this issue in 0.12.4; later updated to 0.12.6, but the problem persists.
Library: PyTorch Lightning (DDP Setting) + HuggingFace

Hope this helps!

Hi @adi-iitd,

We have noticed that a few users have had similar issues with PyTorch Lightning DDP. Your issue is probably related to how PyTorch Lightning synchronizes GPUs with DDP. You can refer to this issue; there is a workaround there that might work for you.
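
As a rough illustration of one common pattern in DDP setups (not necessarily the workaround from the linked issue), the sketch below starts a run on the main process only and closes it explicitly when training ends. It assumes you call `wandb.init` yourself, and that the launcher exports `RANK` (as torchrun does) or `LOCAL_RANK` / `NODE_RANK` (as Lightning's DDP subprocesses typically do). The helper names, `train_fn`, and the project name are hypothetical, and this may not resolve the hang in every case.

```python
import os


def is_global_rank_zero(env=None) -> bool:
    """Best-effort check for the main process in a distributed run.

    Assumption: the launcher sets RANK, or LOCAL_RANK / NODE_RANK.
    Unset variables are treated as rank 0 (single-process run).
    """
    env = os.environ if env is None else env
    if "RANK" in env:
        return int(env["RANK"]) == 0
    return int(env.get("NODE_RANK", 0)) == 0 and int(env.get("LOCAL_RANK", 0)) == 0


def train_with_wandb(train_fn, project="my-project"):
    """Run train_fn, logging to wandb from rank 0 only (hypothetical wrapper)."""
    run = None
    if is_global_rank_zero():
        import wandb  # imported lazily; only the main process needs it

        run = wandb.init(project=project)
    try:
        # train_fn receives the run (or None on non-zero ranks).
        return train_fn(run)
    finally:
        if run is not None:
            # Close the run explicitly instead of relying on interpreter exit.
            run.finish()
```

Here `train_fn` should guard its logging with `if run is not None`, so the other ranks never touch wandb.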

We are currently working on updating our Documentation and Examples so that we have better guides for situations like this. Till then, I hope the link above is helpful!

Thanks,
Ramit
