Resolving Wandb Service Interruption When Using GPU for Model Training

I found that when I launch a Weights & Biases (wandb) service with simulated data alone, there are no issues with the service communication. However, as soon as I also load a model onto the GPU, the wandb service stops immediately (the same Broken pipe error shown in the log below). If I restart the wandb service at this point, it stops again automatically after a fixed interval of about 1 minute. Could this be related to the load balancer?

Training /chenhui/zhangwuhan/stage2/trained_model/qwen1.5_7b_5_5e-5_2_1k_plugin 0
BrokenPipeError: [Errno 32] Broken pipe
wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe
I hope the wandb team can resolve this issue as soon as possible.

Hello @endnone, thank you for reaching out; happy to help. Could you please provide the following so we can investigate further:

- SDK debug logs
- Output of wandb --version
- Code snippets
- Notebook environment you are currently using
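
For anyone collecting the same details, most of the items above can be gathered with a short script. This is only a minimal sketch: it assumes the SDK writes debug.log and debug-internal.log somewhere under a wandb/ directory next to the training script, and the helper names (installed_version, find_debug_logs, collect_report) are hypothetical, not part of the wandb API.

```python
import glob
import os
import platform
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def find_debug_logs(root: str = ".") -> list[str]:
    """Find debug*.log files under <root>/wandb (assumed log layout)."""
    pattern = os.path.join(root, "wandb", "**", "debug*.log")
    return sorted(glob.glob(pattern, recursive=True))


def collect_report(root: str = ".") -> dict:
    """Bundle the environment details a support request usually asks for."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "wandb": installed_version("wandb"),
        "debug_logs": find_debug_logs(root),
    }


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```

Running it from the directory that contains the training script prints the Python version, the platform, the installed wandb version, and the paths of any debug logs it finds, which can then be attached to the reply.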

I’ve uploaded the script and additional details. You can access them here: wandb_debug.

Please let me know if there’s anything else you need from my end to assist further.

Additionally, some similar issues have been discussed in this thread on the wandb GitHub issues page: wandb/issues/6449. It might provide some useful insights.

Hi @endnone, thank you for the detailed information. Just to clarify, is the Python script in this GitHub repo the one raising the BrokenPipeError for you? Additionally, the logs suggest that the Python process was killed due to a lack of available memory. May I ask how much RAM you have, and whether you run into a similar issue with a training run that doesn’t require as many resources?
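
One quick way to answer the RAM question on a Linux machine is to read /proc/meminfo. The snippet below is a rough sketch under that assumption (it will not work on macOS or Windows), and the function names parse_meminfo and memory_summary are made up for illustration.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: numeric value}.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_summary(text: str) -> str:
    """Summarize total and available memory in GiB."""
    info = parse_meminfo(text)
    total_gib = info.get("MemTotal", 0) / 1024**2
    avail_gib = info.get("MemAvailable", 0) / 1024**2
    return f"total={total_gib:.1f} GiB, available={avail_gib:.1f} GiB"


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(memory_summary(f.read()))
    except FileNotFoundError:
        print("/proc/meminfo not found; this sketch only works on Linux")
```

Running it once before loading the checkpoint and once after gives a rough picture of how much host memory the model load consumes.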

The issue mainly occurs when the wandb service is automatically terminated after loading a checkpoint on the GPU. If the wandb service is not started, there is enough memory to complete the script. RAM information is as follows:

Thank you for the additional information, @endnone. It looks like you have sufficient memory resources. How are you restarting the service, as you mentioned in your original post? Also, it seems you’re on an older SDK version, 0.15.5. Do you still see the same issue after upgrading to our most recent SDK release, 0.17.0?

Hi @endnone, we wanted to follow up with you regarding your support request, as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.