I’ve been trying to get wandb to work with PyTorch Lightning on multiple GPUs. It works fine in the sense that the model trains and metrics are reported correctly to the dashboard; however, after a couple of hours (and sometimes as little as 20 minutes), the system maxes out all of its resources and the whole training process freezes without making any progress. I used py-spy to generate the following dumps.
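For context, the setup looks roughly like the following. This is a minimal sketch rather than the actual training script: the model, dataset, and project name are placeholders, and the dumps were taken with `py-spy dump --pid <trainer PID>`.

```python
# Minimal sketch of the reported setup: a LightningModule trained with DDP on
# multiple GPUs while logging metrics to W&B. All names here are placeholders.
import torch
import torch.nn as nn
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
from torch.utils.data import DataLoader, TensorDataset


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        # Metrics logged here are what shows up on the W&B dashboard.
        self.log("train_loss", loss, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    loader = DataLoader(data, batch_size=64, num_workers=4)

    logger = WandbLogger(project="my-project")  # placeholder project name
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,          # multiple GPUs
        strategy="ddp",     # DistributedDataParallel
        max_epochs=10,
        logger=logger,
    )
    trainer.fit(ToyModel(), loader)
```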
Hi @ndrwnaguib, thank you for reporting this. Could you please provide a bit more context on the training environment and resources, e.g. how many GPUs and the model size? Would it be possible to share the debug.log and debug-internal.log files of the run that’s hanging? I also checked your W&B account, and it seems you’re part of a team where the rate limits are higher. Could you log your experiments under the team to rule out the possibility that this is caused by rate limits? Thanks!
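For example, pointing the run at the team entity would look something like this (a minimal sketch; `my-team` and `my-project` are placeholders, and `WandbLogger` forwards extra keyword arguments to `wandb.init()`):

```python
from pytorch_lightning.loggers import WandbLogger

# Log the run under the team entity instead of a personal account, so that the
# team's (higher) rate limits apply. Replace the placeholders with your own
# team and project names.
logger = WandbLogger(
    project="my-project",
    entity="my-team",
)
```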
Hi @ndrwnaguib, since we haven’t heard back from you in a while, I will go ahead and close this ticket for now. If the issue persists, please provide the further information requested above, and we will be happy to reopen it and keep investigating!