Hello!
Unfortunately, when we train our model with PyTorch, the training process freezes. Everything works fine when we use wandb in offline mode, but in online mode the training process freezes, more often than not, within the first few steps of a training or validation epoch. We have noticed that the problem arises when the training or validation epochs are short, taking less than 1 or 2 minutes. The progress bar freezes, the training stops, and we get no error message whatsoever. When we kill the run, wandb only tells us that it did not find a program path.
When we inspect the processes with htop, we see that the process running the training loop is in the sleeping state and never switches back to running. The corresponding wandb process is sleeping as well, but it still shows some CPU usage. We therefore suspect that a file upload is stuck.
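To narrow down where the training process is blocked, a stack dump on demand might help. The following is a minimal sketch using Python's standard `faulthandler` module. It assumes a Unix-like system (`faulthandler.register` is not available on Windows), and the helper name `enable_stack_dump` is only illustrative; it is not part of wandb or PyTorch Lightning.

```python
import faulthandler
import signal
import sys


def enable_stack_dump(sig=signal.SIGUSR1, out=sys.stderr):
    """Dump the Python tracebacks of all threads to `out` when `sig` arrives.

    `out` must be a file object with a real file descriptor and must stay
    open for as long as the handler is registered.
    """
    # faulthandler installs a low-level signal handler, so the dump is
    # written even if the main thread is stuck waiting on a lock or socket.
    faulthandler.register(sig, file=out, all_threads=True)
    return sig
```

Calling `enable_stack_dump()` once at the start of the training script and then running `kill -USR1 <pid>` against the frozen process should print the Python stack of every thread to stderr, which may show which call the training loop is waiting on.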
As this issue seems to be very uncommon and we found no help in existing issues, we kindly ask for your help.
wandb version: 0.16.4
PyTorch version: 2.2.0+cu118
PyTorch Lightning: 2.2.0
Here are the debug files of a stuck run:
https://drive.google.com/drive/folders/1rmJh-Ep-8wuknLziOCaZHrzT_S-3Uphb?usp=sharing
Edit: Downgrading wandb to version 0.12.21 seems to fix this problem. Nonetheless, we would be interested to know how it can be fixed without downgrading.
Thanks, Tobi