Wandb process not getting terminated properly

adi-iitd · October 30, 2021, 1:24pm

My process is not terminating properly (running in a multi-GPU setting). It tries to upload information but gets stuck for some reason. I have been facing this problem since yesterday and haven't changed the library version (upgrading to the latest version did not resolve it either). Any help will be highly appreciated. I can disable wandb completely by passing mode = "disabled" in the test setting, but I need it when running sweeps or logging training metrics.
P.S.: The same code was running fine until yesterday.
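
For reference, the toggle described above could be sketched roughly as follows. `wandb.init` accepts a `mode` argument with the values "online", "offline", and "disabled"; the `wandb_mode` helper and its `stage` parameter are hypothetical names used only for illustration, not code from the actual project.

```python
def wandb_mode(stage: str) -> str:
    """Pick the value to pass as wandb.init(mode=...).

    `stage` is a hypothetical label ("train", "sweep", "test", ...).
    Logging is switched off only for the test stage.
    """
    return "disabled" if stage == "test" else "online"


# Intended use (requires wandb to be installed):
#   import wandb
#   run = wandb.init(project="my-project", mode=wandb_mode("test"))
```

With "disabled", the wandb calls become no-ops, which is why the hang does not show up in the test setting.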

ramit_goolry · November 1, 2021, 10:01pm

Hi @adi-iitd,

Thanks for your question! Could you share some details of your project so that I can get a better understanding of the issue? Specifically:

What OS are you using?
What version of Python are you currently running?
What version of the wandb client are you using?
What library are you working with?

If possible, I would also appreciate a minimal script that replicates the issue, so that I can get to the root of the problem.

Thanks,
Ramit

adi-iitd · November 4, 2021, 2:33am

Hi @ramit_goolry, thanks for the response.
OS: Linux 18.04
Python: 3.8
Wandb: First faced this issue in 0.12.4; I later updated to 0.12.6, but the problem is still not resolved.
Library: PyTorch Lightning (DDP Setting) + HuggingFace

Hope this helps!

ramit_goolry · November 4, 2021, 4:32pm

Hi @adi-iitd,

We have noticed that a few users have similar issues with PyTorch Lightning DDP. Your issue is probably related to how PyTorch Lightning synchronizes GPUs with DDP. You can refer to this issue; it describes a workaround that might work for you.
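
As a rough illustration of one common pattern for DDP setups (not necessarily the workaround from the linked issue), a run can be created only on the rank-0 process so that the other workers never start their own wandb run. This is a minimal sketch that assumes the launcher exports `RANK` or `LOCAL_RANK` as environment variables; `global_rank` and `init_wandb_on_rank_zero` are hypothetical helper names.

```python
import os
from typing import Mapping, Optional


def global_rank(env: Optional[Mapping[str, str]] = None) -> int:
    """Best-effort rank lookup from environment variables.

    Assumption: the launcher sets RANK (global) or LOCAL_RANK (per node).
    Falls back to 0 when neither is set, e.g. in a single-process run.
    Note: with only LOCAL_RANK available on several nodes, the first
    process on each node would be treated as rank 0.
    """
    env = os.environ if env is None else env
    for key in ("RANK", "LOCAL_RANK"):
        value = env.get(key)
        if value is not None:
            return int(value)
    return 0


def init_wandb_on_rank_zero(project: str, env: Optional[Mapping[str, str]] = None):
    """Call wandb.init only on rank 0; return None on every other rank."""
    if global_rank(env) != 0:
        return None
    import wandb  # imported lazily so non-zero ranks never load it

    return wandb.init(project=project)
```

The rank-0 process would then call `run.finish()` at the end of training, while the other ranks skip wandb entirely. PyTorch Lightning also provides a `rank_zero_only` decorator in `pytorch_lightning.utilities` that serves a similar purpose.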

We are currently working on updating our documentation and examples so that we have better guides for situations like this. Until then, I hope the link above is helpful!

Thanks,
Ramit

system · January 3, 2022, 4:33pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
