Wandb puts experiment to sleep, training just freezes

tobiasm · March 28, 2024, 1:08pm

Hello!

Sadly, when we train our model with PyTorch, the training process just freezes. Everything works fine when we use wandb offline, but in online mode the training process freezes, more often than not, in the first few steps of a training or validation epoch. We have noticed that this problem arises when the training or validation epochs are short, taking less than 1 or 2 minutes. The progress bar freezes, the training stops, and we get no error message whatsoever. When we kill the run, wandb just tells us that it did not find a program path.

When we inspect with htop, we see that the process running the training loop is in the sleeping state and never switches back to running. The corresponding wandb process is sleeping as well, but it shows some CPU usage. We therefore suspect that a file upload is stuck.
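To see where a sleeping training process is actually blocked, Python's standard `faulthandler` module can dump the stack of every thread. The following is a minimal sketch, not something taken from the run above: the helper names `install_hang_diagnostics` and `disarm_hang_diagnostics` and the 300-second default are assumptions chosen for illustration.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(timeout_s: float = 300.0, stream=sys.stderr) -> None:
    """Arm stack dumps for a process that may hang (hypothetical helper).

    `stream` must be a real file object with a file descriptor.
    """
    # On Unix, `kill -USR1 <pid>` then dumps the stacks of all threads on demand.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    # Additionally dump all thread stacks every `timeout_s` seconds until cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disarm_hang_diagnostics() -> None:
    """Stop the periodic dumps and remove the signal handler."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
```

Calling `install_hang_diagnostics()` once at the start of the training script would be enough; if the loop freezes, the dumped traceback shows which call each thread is waiting in, which may help confirm or rule out a stuck upload.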

As this issue seems to be very uncommon and we found no help in existing issue posts, we kindly ask for your help.

wandb version: 0.16.4
PyTorch version: 2.2.0+cu118
PyTorch Lightning: 2.2.0

Here are the debug files of a stuck run:
https://drive.google.com/drive/folders/1rmJh-Ep-8wuknLziOCaZHrzT_S-3Uphb?usp=sharing

Edit: Downgrading wandb to version 0.12.21 seems to fix this problem. Nonetheless, it would be interesting to know how this problem can be fixed without downgrading.
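Since offline mode works reliably here, one possible stopgap that avoids downgrading is to log offline and upload afterwards. This is a minimal sketch under assumptions: `enable_offline_logging` is a hypothetical helper, the run directory is a placeholder, and it relies on the `WANDB_MODE` environment variable and the `wandb sync` command.

```python
import os


def enable_offline_logging(run_dir: str) -> list[str]:
    """Switch wandb to offline mode and return the command to upload later.

    Hypothetical helper: call it before wandb is initialized.
    `run_dir` is the offline run directory that wandb writes locally.
    """
    # wandb reads WANDB_MODE when a run is initialized; "offline" keeps all data local.
    os.environ["WANDB_MODE"] = "offline"
    # After training has finished, running this command uploads the stored run.
    return ["wandb", "sync", run_dir]
```

The returned command can be run by hand or passed to `subprocess.run` once training is done, so a stalled upload can no longer block the training loop itself.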

Thanks, Tobi

artsiom · April 2, 2024, 2:59pm

Hi @tobiasm! Apologies you are seeing this behavior! Thank you very much for sending over your wandb version as well as the debug logs. Unfortunately, I wasn’t able to find much in the logs.

> Edit: Downgrading wandb to version 0.12.21 seems to fix this problem. Nonetheless, it would be interesting how this problem can be fixed without downgrading.

That is a very interesting point, thank you for pointing it out.

Could you please send me a link to your workspace where you are experiencing this issue?

artsiom · April 4, 2024, 9:30pm

Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

artsiom · April 8, 2024, 3:13pm

Hi, since we have not heard back from you, we are going to close this request. If you would like to reopen the conversation, please let us know! Unfortunately, at the moment, we do not receive notifications if a thread reopens on Discourse. So, please feel free to create a new ticket regarding your concern if you’d like to continue the conversation.
