Hi! Recently my wandb.init() calls started timing out, and this has now happened several times. I know there are similar posts, but I'm asking here so I can share my debug logs below.
Several other runs are fine. It seems to happen during the longer runs, e.g. when the job has already been running on a cluster for 40 h before this wandb.init() call.
Any help is greatly appreciated!
Thanks,
Lorenz
My debug.log file:
2024-09-04 04:06:02,932 INFO MainThread:60431 [wandb_init.py:init():561] calling init triggers
2024-09-04 04:06:02,932 INFO MainThread:60431 [wandb_init.py:init():568] wandb.init called with sweep_config: {}
config: {...}
2024-09-04 04:06:02,932 INFO MainThread:60431 [wandb_init.py:init():611] starting backend
2024-09-04 04:06:02,932 INFO MainThread:60431 [wandb_init.py:init():615] setting up manager
2024-09-04 04:06:03,223 INFO MainThread:60431 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-09-04 04:06:03,223 INFO MainThread:60431 [wandb_init.py:init():623] backend started and connected
2024-09-04 04:06:03,308 INFO MainThread:60431 [wandb_init.py:init():715] updated telemetry
2024-09-04 04:06:03,736 INFO MainThread:60431 [wandb_init.py:init():748] communicating run to backend with 90.0 second timeout
2024-09-04 04:07:34,159 ERROR MainThread:60431 [wandb_init.py:init():774] encountered error: Run initialization has timed out after 90.0 sec.
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
2024-09-04 04:07:34,451 ERROR MainThread:60431 [wandb_init.py:init():1199] Run initialization has timed out after 90.0 sec.
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
Traceback (most recent call last):
  File "/lustre/home/.../lib/python3.11/site-packages/wandb/sdk/wandb_init.py", line 1181, in init
    run = wi.init()
          ^^^^^^^^^
  File "/lustre/home/.../lib/python3.11/site-packages/wandb/sdk/wandb_init.py", line 780, in init
    raise error
wandb.errors.CommError: Run initialization has timed out after 90.0 sec.
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
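To make the timing in the log above easier to read, here is a small sketch that finds the largest gap between consecutive timestamped lines. The three sample lines are copied from my debug.log; the helper names (`parse_ts`, `largest_gap`) are made up for illustration and are not part of wandb.

```python
import re
from datetime import datetime

# Matches the leading timestamp of a wandb log line, e.g. "2024-09-04 04:06:03,736 ".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")

# Three consecutive lines copied from the debug.log above.
SAMPLE_LINES = [
    "2024-09-04 04:06:03,308 INFO MainThread:60431 [wandb_init.py:init():715] updated telemetry",
    "2024-09-04 04:06:03,736 INFO MainThread:60431 [wandb_init.py:init():748] communicating run to backend with 90.0 second timeout",
    "2024-09-04 04:07:34,159 ERROR MainThread:60431 [wandb_init.py:init():774] encountered error: Run initialization has timed out after 90.0 sec.",
]


def parse_ts(line):
    """Return the line's timestamp, or None for continuation lines without one."""
    match = TS_RE.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def largest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the largest time gap."""
    stamped = [(parse_ts(line), line) for line in lines]
    stamped = [(ts, line) for ts, line in stamped if ts is not None]
    best = (0.0, None, None)
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, l0, l1)
    return best


if __name__ == "__main__":
    gap, before, after = largest_gap(SAMPLE_LINES)
    print(f"largest gap: {gap:.3f} s")
    print("  before:", before)
    print("  after: ", after)
```

On these lines the largest gap sits between "communicating run to backend" and the error, so the whole 90 s budget is spent waiting on that single step.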
And the internal debug log:
2024-09-04 04:06:13,359 INFO StreamThr :11831 [internal.py:wandb_internal():86] W&B internal server running at pid: 11831, started at: 2024-09-04 04:06:08.197662
2024-09-04 04:06:15,837 DEBUG HandlerThread:11831 [handler.py:handle_request():146] handle_request: status
2024-09-04 04:06:25,016 DEBUG HandlerThread:11831 [handler.py:handle_request():146] handle_request: status_report
2024-09-04 04:06:25,304 INFO WriterThread:11831 [datastore.py:open_for_write():87] open: /lustre/.../wandb/run-20240904_040602-wekm44m0/run-wekm44m0.wandb
2024-09-04 04:06:30,017 DEBUG HandlerThread:11831 [handler.py:handle_request():146] handle_request: status_report
2024-09-04 04:06:35,048 DEBUG HandlerThread:11831 [handler.py:handle_request():146] handle_request: status_report
2024-09-04 04:06:40,065 DEBUG HandlerThread:11831 [handler.py:handle_request():146] handle_request: status_report
...
...
2024-09-04 04:18:02,263 DEBUG HandlerThread:11831 [handler.py:handle_request():146] handle_request: status_report
2024-09-04 04:18:07,265 DEBUG HandlerThread:11831 [handler.py:handle_request():146] handle_request: status_report
2024-09-04 04:18:10,139 WARNING StreamThr :11831 [internal.py:is_dead():414] Internal process exiting, parent pid 60431 disappeared
2024-09-04 04:18:10,140 ERROR StreamThr :11831 [internal.py:wandb_internal():152] Internal process shutdown.
2024-09-04 04:18:10,314 INFO WriterThread:11831 [datastore.py:close():296] close: /lustre/.../wandb/run-20240904_040602-wekm44m0/run-wekm44m0.wandb
2024-09-04 04:18:10,315 INFO HandlerThread:11831 [handler.py:finish():866] shutting down handler
2024-09-04 04:18:10,338 INFO SenderThread:11831 [sender.py:finish():1546] shutting down sender
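A possible stopgap while the root cause is unclear is to retry the call with a longer timeout. The sketch below is only an assumption about what might help, not a confirmed fix. `init_with_retries` is a hypothetical helper (not part of wandb) that takes any zero-argument callable, so it runs without wandb installed. It catches `Exception` broadly for simplicity; in my case the error raised is `wandb.errors.CommError`.

```python
import time


def init_with_retries(init_fn, attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call init_fn() up to `attempts` times with exponential backoff.

    init_fn: zero-argument callable, e.g. a lambda wrapping wandb.init(...).
    sleep:   injectable so the helper can be tested without real waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return init_fn()
        except Exception as err:  # broad on purpose; narrow this in real code
            last_error = err
            if attempt < attempts - 1:
                # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Intended usage (assumes wandb is installed; the 300 s value is arbitrary):
#
#   import wandb
#   run = init_with_retries(
#       lambda: wandb.init(settings=wandb.Settings(init_timeout=300))
#   )
```

Passing the call as a lambda keeps the retry logic independent of wandb. Note that this only helps if the timeout is transient; it would not help if something on the node stays broken after 40 h.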