Training hangs with GPU Utilization 100% and wandb trying to sync

I’ve been trying to get wandb to work with PyTorch Lightning on multiple GPUs. It works fine at first, in the sense that the model trains and metrics are reported properly to the dashboard. However, after a couple of hours (and sometimes as little as 20 minutes), the system maxes out all of its resources and the whole training process freezes without making any progress. I used py-spy to generate the following dumps:
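(For reference, a dump like this can be captured with py-spy’s `dump` subcommand; `<PID>` below is a placeholder for the hung training process ID, not a value from this issue.)

```shell
# Install py-spy; it samples a running Python process without restarting it
pip install py-spy

# Print the current stack of every thread in the process
py-spy dump --pid <PID>

# Include native (C-extension) frames, which often show where a hang sits
py-spy dump --pid <PID> --native
```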

Hopefully they’ll help figure out where the issue is.

Thanks.

Here is the flamegraph as well:
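(In case it’s useful to anyone reproducing this, a flamegraph like the one above can be produced with py-spy’s `record` subcommand; `<PID>` is again a placeholder.)

```shell
# Sample the running process for a while and write a flamegraph SVG
py-spy record -o profile.svg --pid <PID>
```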

Hi @ndrwnaguib, thank you for reporting this. Could you please provide a bit more context on the training environment and resources, e.g. how many GPUs and the model size? Would it be possible to share the debug.log and debug-internal.log files from the run that’s hanging? I also checked your W&B account, and it seems you’re part of a team where the rate limits are higher. Could you log your experiments under the team, to rule out the possibility that this is caused by rate limits? Thanks!

Hi @ndrwnaguib, I wanted to follow up with you regarding this issue. Could you please provide some more information to help us debug this? Thanks!

Hi @ndrwnaguib, since we haven’t heard back from you in a while, I will go ahead and close this ticket for now. If the issue still persists, please provide the further information requested above, and we will be happy to reopen this and keep investigating!