wandb.finish() stuck forever on WSL2, keeping the main script running

Hello,

I am running into a strange issue when using WandB to log experiments online. My goal is to make it work with PyTorch Lightning; however, an annoying termination issue makes it impossible to run any experiments on WSL under Windows 11. The finish() operation remains stuck in a while True loop, and therefore the training script, which waits for the wandb logger to finish, never ends.

My setup is WSL2 with Ubuntu 22.04 on Windows 11, Python 3.10 or 3.11 (neither works), wandb 0.15.10. I managed to narrow the issue down to the following line in wandb/sdk/lib/mailbox.py:

283: found, abandoned = self._slot._get_and_clear(timeout=wait_timeout)

which returns found == None forever while waiting for the background process to signal that it is done, so the while True loop around these lines never exits. This only happens in the WSL environment; when I run on native Windows, the logger finishes correctly and control returns to the command line.
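
To illustrate the pattern described above, here is a minimal stand-alone sketch using only the standard library. It is not wandb's actual code: `Slot` and `wait_with_deadline` are hypothetical names, and the only assumption is that the real loop polls with a timeout and retries whenever nothing is found. If the other side never signals, a loop without the deadline check cannot exit; with an overall deadline the hang at least becomes a detectable failure.

```python
import threading
import time


class Slot:
    """Hypothetical one-shot mailbox slot: a value guarded by an Event."""

    def __init__(self):
        self._event = threading.Event()
        self._value = None

    def deliver(self, value):
        # Called by the "background" side when the result is ready.
        self._value = value
        self._event.set()

    def get(self, timeout):
        # Returns the value, or None if nothing arrived within `timeout`.
        if self._event.wait(timeout):
            return self._value
        return None


def wait_with_deadline(slot, poll_timeout, deadline):
    """Poll the slot repeatedly, but give up after `deadline` seconds."""
    start = time.monotonic()
    while True:
        found = slot.get(timeout=poll_timeout)
        if found is not None:
            return found
        if time.monotonic() - start >= deadline:
            # Without this check the loop spins forever when the
            # other side never calls deliver().
            return None
```

Calling `wait_with_deadline(Slot(), 0.1, 1.0)` on a slot that nobody delivers to returns None after about one second instead of blocking indefinitely.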

You don’t have to actually do anything fancy in between init() and finish(). Here is a minimal running example:

import wandb
wandb.init()
wandb.finish()
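
For anyone trying to reproduce this without getting a terminal stuck, one option is to run the snippet in a child interpreter under a timeout. This is only a sketch: `run_with_timeout` and `REPRO` are hypothetical names I made up for illustration, and the timeout you pick is arbitrary. It does not fix the hang; it only reports whether the child finished in time.

```python
import subprocess
import sys

# The minimal example from this post, as a one-liner for `python -c`.
# Running it requires wandb to be installed and logged in.
REPRO = "import wandb; wandb.init(); wandb.finish()"


def run_with_timeout(snippet, timeout_s):
    """Run `snippet` in a fresh interpreter; return (finished, returncode)."""
    try:
        proc = subprocess.run([sys.executable, "-c", snippet], timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising on timeout.
        return False, None
    return True, proc.returncode
```

For example, `run_with_timeout(REPRO, 120)` would return `(False, None)` if finish() is still stuck after two minutes. Note that only the direct child is killed on timeout, so a leftover background process may still need to be cleaned up by hand.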

I don’t know how to attach the debug files so here is an excerpt from debug-internal:

2023-09-18 17:53:21,028 INFO    HandlerThread:32321 [handler.py:handle_request_defer():170] handle defer: 10
2023-09-18 17:53:21,034 DEBUG   SenderThread:32321 [sender.py:send_request():406] send_request: defer
2023-09-18 17:53:21,036 INFO    SenderThread:32321 [sender.py:send_request_defer():608] handle sender defer: 10
2023-09-18 17:53:21,036 INFO    SenderThread:32321 [file_pusher.py:finish():175] shutting down file pusher
2023-09-18 17:53:21,445 INFO    wandb-upload_2:32321 [upload_job.py:push():131] Uploaded file /home/user/scripts/wandb/run-20230918_175313-m49jqceu/files/conda-environment.yaml
2023-09-18 17:53:21,514 INFO    wandb-upload_1:32321 [upload_job.py:push():131] Uploaded file /home/user/scripts/wandb/run-20230918_175313-m49jqceu/files/requirements.txt
2023-09-18 17:53:21,844 INFO    wandb-upload_0:32321 [upload_job.py:push():131] Uploaded file /home/user/scripts/wandb/run-20230918_175313-m49jqceu/files/config.yaml
2023-09-18 17:53:25,964 DEBUG   HandlerThread:32321 [handler.py:handle_request():144] handle_request: keepalive
2023-09-18 17:53:26,038 DEBUG   HandlerThread:32321 [handler.py:handle_request():144] handle_request: status_report
2023-09-18 17:53:30,965 DEBUG   HandlerThread:32321 [handler.py:handle_request():144] handle_request: keepalive
2023-09-18 17:53:31,039 DEBUG   HandlerThread:32321 [handler.py:handle_request():144] handle_request: status_report
...
this goes on forever
...
2023-09-18 17:54:41,056 DEBUG   HandlerThread:32321 [handler.py:handle_request():144] handle_request: status_report
2023-09-18 17:54:45,985 DEBUG   HandlerThread:32321 [handler.py:handle_request():144] handle_request: keepalive
2023-09-18 17:54:46,057 DEBUG   HandlerThread:32321 [handler.py:handle_request():144] handle_request: status_report
2023-09-18 17:54:50,986 DEBUG   HandlerThread:32321 [handler.py:handle_request():144] handle_request: keepalive
2023-09-18 17:54:51,058 DEBUG   HandlerThread:32321 [handler.py:handle_request():144] handle_request: status_report
2023-09-18 17:54:56,060 DEBUG   HandlerThread:32321 [handler.py:handle_request():144] handle_request: status_report
...
until I stop the process from the CMD with Ctrl+C
...
2023-09-18 17:54:58,109 WARNING StreamThr :32321 [internal.py:is_dead():414] Internal process exiting, parent pid 32298 disappeared
2023-09-18 17:54:58,109 ERROR   StreamThr :32321 [internal.py:wandb_internal():152] Internal process shutdown.
2023-09-18 17:54:59,060 INFO    SenderThread:32321 [sender.py:finish():1531] shutting down sender
2023-09-18 17:54:59,061 INFO    HandlerThread:32321 [handler.py:finish():840] shutting down handler
2023-09-18 17:54:59,061 INFO    WriterThread:32321 [datastore.py:close():294] close: /home/user/scripts/wandb/run-20230918_175313-m49jqceu/run-m49jqceu.wandb
2023-09-18 17:54:59,061 INFO    SenderThread:32321 [file_pusher.py:finish():175] shutting down file pusher
2023-09-18 17:54:59,061 INFO    SenderThread:32321 [file_pusher.py:join():181] waiting for file pusher
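
To make the idle loop in an excerpt like the one above easier to quantify, a small script can count the handle_request messages that follow the last "Uploaded file" line. This is a rough sketch: `summarize_after_last_upload` is a hypothetical helper, and the regular expression only assumes the line layout visible in this excerpt (timestamp, level, thread, bracketed source location, message).

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp, level, anything up to the bracketed location, then the message.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+\w+\s+.*?\[[^\]]+\] (.*)$"
)


def summarize_after_last_upload(lines):
    """Return (request counts, seconds elapsed) after the last upload line."""
    last_upload = None
    last_seen = None
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip "..." and other non-log lines
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        message = match.group(2)
        if message.startswith("Uploaded file"):
            last_upload = ts
            counts.clear()  # only count what follows the final upload
        elif last_upload is not None and message.startswith("handle_request: "):
            counts[message.split(": ", 1)[1]] += 1
        last_seen = ts
    if last_upload is None:
        return counts, None
    return counts, (last_seen - last_upload).total_seconds()
```

Feeding it the lines of debug-internal.log gives the number of keepalive and status_report requests and how long the process kept idling after the final upload.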

Can anyone help me with this? Is there a quick fix that I may have overlooked in another issue?

The similar issues I found only discuss finish() taking a long time to terminate, but in my case I don’t see it end at all, even for this small example, which should not produce much overhead.

I suspect there is some issue with how WSL2 spawns the wandb background process. Additionally, running offline and then trying to wandb sync the offline runs also gets stuck at uploading. Without offline mode, the runs do get uploaded to the cloud, so I suspect it is just the termination of the background wandb process that fails. When I stop it with Ctrl+C, the run is flagged online as Crashed.

Thank you in advance

Hi @theo-cheslerean, I appreciate you reporting this and am happy to help. I’ve marked this as a bug for our SDK team to review and will circle back once I get feedback from them. In the meantime, I recommend not using a WSL2 environment for your wandb experiments.

I’m having the same issue

Hi @mohammadbakir thanks for the recommendation. Any news on this bug?