Wandb puts experiment to sleep, training just freezes

Hello!

Sadly, when we train our model with PyTorch, the training process just freezes. Everything works fine when we use wandb offline, but in online mode, more often than not, the training process freezes in the first few steps of a training or validation epoch. We have noticed that this problem arises when the training or validation epochs are short, taking less than 1 or 2 minutes. The progress bar freezes, the training stops, and we get no error message whatsoever. When we kill the run, wandb just tells us that it did not find a program path.
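For anyone comparing the two modes, here is a minimal sketch of one way to switch between them, assuming the mode is selected through the `WANDB_MODE` environment variable before wandb is initialized. The helper name `set_wandb_mode` is hypothetical and not part of wandb; the actual training script in this thread may select the mode differently.

```python
import os

# Values accepted here are the commonly used WANDB_MODE settings.
ALLOWED_MODES = {"online", "offline", "disabled"}


def set_wandb_mode(mode: str) -> str:
    """Set WANDB_MODE for this process and return the value that was set.

    Hypothetical helper: call it before wandb.init() or before the
    Lightning WandbLogger is created, so the setting is picked up.
    """
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown wandb mode: {mode!r}")
    os.environ["WANDB_MODE"] = mode
    return os.environ["WANDB_MODE"]


if __name__ == "__main__":
    # Offline: metrics are written to the local run directory only.
    print(set_wandb_mode("offline"))
```

Runs recorded offline can be uploaded afterwards with the `wandb sync` command pointed at the local run directory, which keeps the upload out of the training loop. Passing `mode="offline"` to `wandb.init` should have the same effect as the environment variable.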

When we inspect with htop, we see that the process running the training loop is in the sleeping state and never switches back to running. The corresponding wandb process is sleeping as well, but it shows some CPU usage. We therefore suspect that a file upload is stuck.
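htop only shows that the process is asleep, not where it is blocked. A rough sketch of how the Python side could be made to report its own stack traces, using the standard-library `faulthandler` module, is below. The function names and the 120-second default are assumptions for illustration; this only narrows down where the training process is waiting and does not fix the freeze.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s=120, file=None):
    """Dump all thread tracebacks every `timeout_s` seconds until cancelled.

    Hypothetical helper: call it once at the start of the training script.
    `file` must be a real file with a file descriptor (default: sys.stderr).
    """
    target = file if file is not None else sys.stderr
    # Watchdog: writes the tracebacks repeatedly, so a frozen run
    # leaves a record of where each thread is blocked.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    # On Unix, also allow an on-demand dump with: kill -USR1 <pid>
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=target, all_threads=True)


def disable_hang_diagnostics():
    """Stop the watchdog and remove the on-demand signal handler."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
```

If the dump shows the main thread waiting inside a wandb call, that would support the stuck-upload suspicion; if it is waiting elsewhere (for example in a data loader), the cause is more likely on the training side.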

As this issue seems to be very uncommon and we found no help in existing issue posts, we kindly ask for your help.

wandb version: 0.16.4
PyTorch version: 2.2.0+cu118
PyTorch Lightning: 2.2.0

Here are the debug files of a stuck run:
https://drive.google.com/drive/folders/1rmJh-Ep-8wuknLziOCaZHrzT_S-3Uphb?usp=sharing

Edit: Downgrading wandb to version 0.12.21 seems to fix this problem. Nonetheless, it would be interesting to know how this problem can be fixed without downgrading.

Thanks, Tobi

Hi @tobiasm! Apologies you are seeing this behavior! Thank you very much for sending over your wandb version as well as the debug logs. Unfortunately, I wasn’t able to find much in the logs.

"Edit: Downgrading wandb to version 0.12.21 seems to fix this problem. Nonetheless, it would be interesting to know how this problem can be fixed without downgrading."

That is a very interesting point, thank you for pointing it out.

Could you please send me a link to your workspace where you are experiencing this issue?

Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi, since we have not heard back from you, we are going to close this request. If you would like to reopen the conversation, please let us know! Unfortunately, at the moment, we do not receive notifications if a thread reopens on Discourse. So, please feel free to create a new ticket regarding your concern if you’d like to continue the conversation.