Hello,
I am running into a strange issue when using WandB to log experiments online. My goal is to make it work with PyTorch Lightning, but an annoying termination issue makes it impossible to run any experiments on WSL under Windows 11. The finish() operation remains stuck in a while True loop, so the training script, which waits for the wandb logger to finish, never ends.
My setup is WSL2 with Ubuntu 22.04 on Windows 11, Python 3.10 or 3.11 (neither works), and wandb 0.15.10. I managed to narrow the issue down to the following line in wandb/sdk/lib/mailbox.py:
283: found, abandoned = self._slot._get_and_clear(timeout=wait_timeout)
which returns found == None forever while waiting for the background process to signal the end, so the while True loop around these lines never exits. This only happens in the WSL environment; when I run on Windows, the logger finishes correctly and returns control to the command line.
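To make the failure mode concrete, here is a rough sketch of how I understand the polling pattern. It is a simplified model, not the actual wandb code: the Slot class, wait_for_result() and the deadline parameter are names I made up for illustration. The point is that a timed wait inside a while True loop never exits if nothing is ever delivered, unless the caller also enforces an overall deadline.

```python
import threading
import time


class Slot:
    """Hypothetical stand-in for the mailbox slot; NOT wandb's real class."""

    def __init__(self):
        self._event = threading.Event()
        self._result = None

    def deliver(self, result):
        # Called by the "background" side when the answer is ready.
        self._result = result
        self._event.set()

    def get_and_clear(self, timeout):
        # Return the result if it arrived within `timeout` seconds, else None.
        if self._event.wait(timeout):
            result, self._result = self._result, None
            self._event.clear()
            return result
        return None


def wait_for_result(slot, poll_interval=0.05, deadline=None):
    """Poll the slot until a result arrives.

    With deadline=None this mirrors the endless loop described above:
    if nothing is ever delivered, it never returns.
    """
    start = time.monotonic()
    while True:
        found = slot.get_and_clear(timeout=poll_interval)
        if found is not None:
            return found
        if deadline is not None and time.monotonic() - start >= deadline:
            return None  # give up instead of spinning forever
```

In this model, a slot that receives deliver("done") makes wait_for_result() return immediately, while a slot that never receives anything only returns when a deadline is passed in.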
You don’t have to do anything fancy between init() and finish(). Here is a minimal running example:
import wandb
wandb.init()
wandb.finish()
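For anyone who wants to check this on their own machine without a terminal that hangs, a small wrapper like the one below might help. It is only a sketch: run_with_timeout() is a helper name I made up, and the 120 second limit in the usage comment is an arbitrary assumption. It runs a snippet in a fresh interpreter using only the standard library and reports whether it exited or had to be killed.

```python
import subprocess
import sys


def run_with_timeout(code, timeout_s):
    """Run `code` in a fresh Python interpreter.

    Returns the child's exit code, or None if it did not finish within
    `timeout_s` seconds (subprocess.run kills the child in that case).
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


# The minimal example from above as a one-liner (requires wandb installed).
REPRO = "import wandb; wandb.init(); wandb.finish()"

# Example usage (timeout value is an assumption, adjust as needed):
# status = run_with_timeout(REPRO, timeout_s=120)
# print("hung" if status is None else f"exited with code {status}")
```

A result of None means the process was still running at the timeout, a 0 means a clean exit, and any other value means the snippet failed for a different reason (for example, wandb not being installed).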
I don’t know how to attach the debug files, so here is an excerpt from debug-internal:
2023-09-18 17:53:21,028 INFO HandlerThread:32321 [handler.py:handle_request_defer():170] handle defer: 10
2023-09-18 17:53:21,034 DEBUG SenderThread:32321 [sender.py:send_request():406] send_request: defer
2023-09-18 17:53:21,036 INFO SenderThread:32321 [sender.py:send_request_defer():608] handle sender defer: 10
2023-09-18 17:53:21,036 INFO SenderThread:32321 [file_pusher.py:finish():175] shutting down file pusher
2023-09-18 17:53:21,445 INFO wandb-upload_2:32321 [upload_job.py:push():131] Uploaded file /home/user/scripts/wandb/run-20230918_175313-m49jqceu/files/conda-environment.yaml
2023-09-18 17:53:21,514 INFO wandb-upload_1:32321 [upload_job.py:push():131] Uploaded file /home/user/scripts/wandb/run-20230918_175313-m49jqceu/files/requirements.txt
2023-09-18 17:53:21,844 INFO wandb-upload_0:32321 [upload_job.py:push():131] Uploaded file /home/user/scripts/wandb/run-20230918_175313-m49jqceu/files/config.yaml
2023-09-18 17:53:25,964 DEBUG HandlerThread:32321 [handler.py:handle_request():144] handle_request: keepalive
2023-09-18 17:53:26,038 DEBUG HandlerThread:32321 [handler.py:handle_request():144] handle_request: status_report
2023-09-18 17:53:30,965 DEBUG HandlerThread:32321 [handler.py:handle_request():144] handle_request: keepalive
2023-09-18 17:53:31,039 DEBUG HandlerThread:32321 [handler.py:handle_request():144] handle_request: status_report
...
this goes on forever
...
2023-09-18 17:54:41,056 DEBUG HandlerThread:32321 [handler.py:handle_request():144] handle_request: status_report
2023-09-18 17:54:45,985 DEBUG HandlerThread:32321 [handler.py:handle_request():144] handle_request: keepalive
2023-09-18 17:54:46,057 DEBUG HandlerThread:32321 [handler.py:handle_request():144] handle_request: status_report
2023-09-18 17:54:50,986 DEBUG HandlerThread:32321 [handler.py:handle_request():144] handle_request: keepalive
2023-09-18 17:54:51,058 DEBUG HandlerThread:32321 [handler.py:handle_request():144] handle_request: status_report
2023-09-18 17:54:56,060 DEBUG HandlerThread:32321 [handler.py:handle_request():144] handle_request: status_report
...
until I stop the process from the CMD with Ctrl+C
...
2023-09-18 17:54:58,109 WARNING StreamThr :32321 [internal.py:is_dead():414] Internal process exiting, parent pid 32298 disappeared
2023-09-18 17:54:58,109 ERROR StreamThr :32321 [internal.py:wandb_internal():152] Internal process shutdown.
2023-09-18 17:54:59,060 INFO SenderThread:32321 [sender.py:finish():1531] shutting down sender
2023-09-18 17:54:59,061 INFO HandlerThread:32321 [handler.py:finish():840] shutting down handler
2023-09-18 17:54:59,061 INFO WriterThread:32321 [datastore.py:close():294] close: /home/user/scripts/wandb/run-20230918_175313-m49jqceu/run-m49jqceu.wandb
2023-09-18 17:54:59,061 INFO SenderThread:32321 [file_pusher.py:finish():175] shutting down file pusher
2023-09-18 17:54:59,061 INFO SenderThread:32321 [file_pusher.py:join():181] waiting for file pusher
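Since most of the log is keepalive and status_report noise, a small filter makes it easier to see the last real event before shutdown stalls. This is a sketch that assumes the line layout shown in the excerpt above (timestamp, level, thread:pid, [file:function:line], message); LINE_RE, NOISE and meaningful_events() are names I chose for illustration.

```python
import re

# Matches lines shaped like the excerpt above, e.g.
# 2023-09-18 17:53:21,036 INFO SenderThread:32321 [file_pusher.py:finish():175] msg
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<thread>\S+)\s*:(?P<pid>\d+) "
    r"\[(?P<where>[^\]]+)\] "
    r"(?P<msg>.*)$"
)

# Heartbeat messages that repeat while the process is stuck.
NOISE = ("handle_request: keepalive", "handle_request: status_report")


def meaningful_events(lines):
    """Return (timestamp, thread, message) for every non-heartbeat log line.

    Lines that do not match the expected layout are skipped.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None or m["msg"] in NOISE:
            continue
        events.append((m["ts"], m["thread"], m["msg"]))
    return events
```

Feeding it the lines of the debug-internal file, for example via open(path).readlines() with whatever path the run directory has, leaves only the defer, upload and shutdown messages, so the gap between the last upload and the forced shutdown stands out.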
Can anyone help me with this? Is there a quick fix that I may have overlooked in another issue?
The similar issues I found only discuss finish() taking a long time to terminate, but in my case it never ends, even for this small example, which should not produce much overhead.
I suspect there is some issue with how WSL2 spawns the wandb background process. Additionally, running offline and then trying to wandb sync the offline runs also gets stuck at uploading. Without offline mode, the runs do get uploaded to the cloud, so I suspect it is just the termination of the background wandb runner process that fails. Stopping it with Ctrl+C flags the run online as Crashed.
Thank you in advance