wandb.init() timeout error

Hi! Recently my wandb.init() calls started timing out; this has happened several times now. I know there are similar posts, but I am asking here so that I can share my debug logs below.
Several other runs are fine. It seems to happen during longer runs, e.g. after a job has been running on a cluster for 40 hours before this wandb.init() call.

Any help is greatly appreciated!

Thanks,
Lorenz

My debug.log file:

2024-09-04 04:06:02,932 INFO    MainThread:60431 [wandb_init.py:init():561] calling init triggers
2024-09-04 04:06:02,932 INFO    MainThread:60431 [wandb_init.py:init():568] wandb.init called with sweep_config: {}
config: {...}
2024-09-04 04:06:02,932 INFO    MainThread:60431 [wandb_init.py:init():611] starting backend
2024-09-04 04:06:02,932 INFO    MainThread:60431 [wandb_init.py:init():615] setting up manager
2024-09-04 04:06:03,223 INFO    MainThread:60431 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-09-04 04:06:03,223 INFO    MainThread:60431 [wandb_init.py:init():623] backend started and connected
2024-09-04 04:06:03,308 INFO    MainThread:60431 [wandb_init.py:init():715] updated telemetry
2024-09-04 04:06:03,736 INFO    MainThread:60431 [wandb_init.py:init():748] communicating run to backend with 90.0 second timeout
2024-09-04 04:07:34,159 ERROR   MainThread:60431 [wandb_init.py:init():774] encountered error: Run initialization has timed out after 90.0 sec. 
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
2024-09-04 04:07:34,451 ERROR   MainThread:60431 [wandb_init.py:init():1199] Run initialization has timed out after 90.0 sec. 
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
Traceback (most recent call last):
  File "/lustre/home/.../lib/python3.11/site-packages/wandb/sdk/wandb_init.py", line 1181, in init
    run = wi.init()
          ^^^^^^^^^
  File "/lustre/home/.../lib/python3.11/site-packages/wandb/sdk/wandb_init.py", line 780, in init
    raise error
wandb.errors.CommError: Run initialization has timed out after 90.0 sec. 
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-

And internal debug log:

2024-09-04 04:06:13,359 INFO    StreamThr :11831 [internal.py:wandb_internal():86] W&B internal server running at pid: 11831, started at: 2024-09-04 04:06:08.197662
2024-09-04 04:06:15,837 DEBUG   HandlerThread:11831 [handler.py:handle_request():146] handle_request: status
2024-09-04 04:06:25,016 DEBUG   HandlerThread:11831 [handler.py:handle_request():146] handle_request: status_report
2024-09-04 04:06:25,304 INFO    WriterThread:11831 [datastore.py:open_for_write():87] open: /lustre/.../wandb/run-20240904_040602-wekm44m0/run-wekm44m0.wandb
2024-09-04 04:06:30,017 DEBUG   HandlerThread:11831 [handler.py:handle_request():146] handle_request: status_report
2024-09-04 04:06:35,048 DEBUG   HandlerThread:11831 [handler.py:handle_request():146] handle_request: status_report
2024-09-04 04:06:40,065 DEBUG   HandlerThread:11831 [handler.py:handle_request():146] handle_request: status_report
...
...
2024-09-04 04:18:02,263 DEBUG   HandlerThread:11831 [handler.py:handle_request():146] handle_request: status_report
2024-09-04 04:18:07,265 DEBUG   HandlerThread:11831 [handler.py:handle_request():146] handle_request: status_report
2024-09-04 04:18:10,139 WARNING StreamThr :11831 [internal.py:is_dead():414] Internal process exiting, parent pid 60431 disappeared
2024-09-04 04:18:10,140 ERROR   StreamThr :11831 [internal.py:wandb_internal():152] Internal process shutdown.
2024-09-04 04:18:10,314 INFO    WriterThread:11831 [datastore.py:close():296] close: /lustre/.../wandb/run-20240904_040602-wekm44m0/run-wekm44m0.wandb
2024-09-04 04:18:10,315 INFO    HandlerThread:11831 [handler.py:finish():866] shutting down handler
2024-09-04 04:18:10,338 INFO    SenderThread:11831 [sender.py:finish():1546] shutting down sender

Hi @lorenz-l-wolf,

Good day, and thank you for reaching out to us. Happy to help you with this!

May I know if you are still encountering the timeout issues? If so, could you please tell us your current SDK version? You can get it by running wandb --version. Thank you!

I have implemented a workaround: when wandb.init() doesn't manage to connect, the run just logs offline. It would still be good to fix the underlying issue, though. Current runs seem okay, but I assume the timeout is still happening and those runs are simply logging offline.
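
Roughly, the workaround looks like the sketch below. This is a simplified illustration rather than my exact code: the helper name `init_with_offline_fallback` and its `init_fn` / `retry_on` parameters are made up for this post, and it assumes that retrying the same init call with `mode="offline"` is acceptable for the run.

```python
def init_with_offline_fallback(init_fn, retry_on=(Exception,), **init_kwargs):
    """Try a normal (online) init first; on failure, retry in offline mode.

    init_fn     -- the init callable, e.g. wandb.init (passed in, so this
                   helper itself does not import wandb)
    retry_on    -- tuple of exception types that trigger the offline fallback
                   (hypothetical parameter; defaults to any Exception)
    init_kwargs -- keyword arguments forwarded unchanged to init_fn
    """
    try:
        return init_fn(**init_kwargs)
    except retry_on as err:
        print(f"Online init failed ({err!r}); falling back to offline mode")
        # Same arguments as before, but force offline logging.
        return init_fn(**{**init_kwargs, "mode": "offline"})
```

With wandb installed, this would be called as `init_with_offline_fallback(wandb.init, retry_on=(wandb.errors.CommError,), project="my-project")`, where `my-project` is a placeholder and `CommError` is the exception shown in the traceback above. Runs that ended up offline can be uploaded afterwards with `wandb sync`.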

I am using wandb version 0.16.6.

Thanks for your help!

Hi @lorenz-l-wolf, I am glad to hear that you were able to identify a workaround. Would you mind trying this on the latest version of the SDK to see whether you are still experiencing the same issue?