Training hangs with GPU Utilization 100% and wandb trying to sync

I’ve been trying to get wandb to work with PyTorch Lightning on multiple GPUs. It works fine at first, in the sense that the model trains and metrics are reported properly to the dashboard. However, after a couple of hours (and sometimes as little as 20 minutes), the system maxes out all of its resources and the whole training process freezes without making any progress. I used py-spy to generate the following dumps:
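(For reference, a dump like this can be captured with py-spy’s `dump` subcommand; `<PID>` below is a placeholder for the hung training process ID, not a value from this issue.)

```shell
# Install py-spy; it samples a running Python process without restarting it
pip install py-spy

# Print the current stack of every thread in the process
py-spy dump --pid <PID>

# Include native (C-extension) frames, which often show where a hang sits
py-spy dump --pid <PID> --native
```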

Hopefully they’ll help figure out where the issue is.

Thanks.

Here is the flamegraph as well:
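(In case it’s useful to anyone reproducing this, a flamegraph like the one above can be produced with py-spy’s `record` subcommand; `<PID>` is again a placeholder.)

```shell
# Sample the running process for a while and write a flamegraph SVG
py-spy record -o profile.svg --pid <PID>
```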

Hi @ndrwnaguib, thank you for reporting this. Could you please provide a bit more context on the training environment and resources, e.g. how many GPUs and the model size? Would it be possible to share the debug.log and debug-internal.log files from the run that’s hanging? I also checked your W&B account, and it seems you’re part of a team where the rate limits are higher. Could you log your experiments under the team, to rule out the possibility that this is caused by rate limits? Thanks!

Hi @ndrwnaguib, I wanted to follow up with you regarding this issue. Could you please provide some more information to help us debug this? Thanks!

Hi @ndrwnaguib, since we haven’t heard back from you in a while, I will go ahead and close this ticket for now. If the issue still persists, please provide the further information requested above, and we will be happy to reopen this and keep investigating!