CommError: Run initialization has timed out after 90.0 sec

Hey,
I get this error when I try to train my model using wandb:
CommError: Run initialization has timed out after 90.0 sec. Please refer to the documentation for additional information: Frequently Asked Questions About Experiments

This is the content of debug.log:

2024-02-27 16:22:01,728 INFO MainThread:953563 [wandb_setup.py:_flush():76] Current SDK version is 0.16.2
2024-02-27 16:22:01,728 INFO MainThread:953563 [wandb_setup.py:_flush():76] Configure stats pid to 953563
2024-02-27 16:22:01,728 INFO MainThread:953563 [wandb_setup.py:_flush():76] Loading settings from /linkhome/rech/geniri01/ulf92ec/.config/wandb/settings
2024-02-27 16:22:01,750 INFO MainThread:953563 [wandb_setup.py:_flush():76] Loading settings from /gpfsdswork/projects/rech/aib/ulf92ec/DSI-QG-main/wandb/settings
2024-02-27 16:22:01,750 INFO MainThread:953563 [wandb_setup.py:_flush():76] Loading settings from environment variables: {}
2024-02-27 16:22:01,750 INFO MainThread:953563 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-02-27 16:22:01,750 INFO MainThread:953563 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program': ''}
2024-02-27 16:22:01,750 INFO MainThread:953563 [wandb_init.py:_log_setup():526] Logging user logs to /gpfsdswork/projects/rech/aib/ulf92ec/DSI-QG-main/wandb/run-20240227_162201-s27b6c1e/logs/debug.log
2024-02-27 16:22:01,750 INFO MainThread:953563 [wandb_init.py:_log_setup():527] Logging internal logs to /gpfsdswork/projects/rech/aib/ulf92ec/DSI-QG-main/wandb/run-20240227_162201-s27b6c1e/logs/debug-internal.log
2024-02-27 16:22:01,750 INFO MainThread:953563 [wandb_init.py:init():566] calling init triggers
2024-02-27 16:22:01,751 INFO MainThread:953563 [wandb_init.py:init():573] wandb.init called with sweep_config: {}
config: {}
2024-02-27 16:22:01,751 INFO MainThread:953563 [wandb_init.py:init():616] starting backend
2024-02-27 16:22:01,751 INFO MainThread:953563 [wandb_init.py:init():620] setting up manager
2024-02-27 16:22:01,752 INFO MainThread:953563 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-02-27 16:22:01,753 INFO MainThread:953563 [wandb_init.py:init():628] backend started and connected
2024-02-27 16:22:01,763 INFO MainThread:953563 [wandb_run.py:_label_probe_notebook():1294] probe notebook
2024-02-27 16:22:01,763 INFO MainThread:953563 [wandb_run.py:_label_probe_notebook():1304] Unable to probe notebook: 'NoneType' object has no attribute 'get'
2024-02-27 16:22:01,763 INFO MainThread:953563 [wandb_init.py:init():720] updated telemetry
2024-02-27 16:22:01,765 INFO MainThread:953563 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
2024-02-27 16:23:31,817 ERROR MainThread:953563 [wandb_init.py:init():779] encountered error: Run initialization has timed out after 90.0 sec.
Please refer to the documentation for additional information:
2024-02-27 16:23:33,832 ERROR MainThread:953563 [wandb_init.py:init():1194] Run initialization has timed out after 90.0 sec.
Please refer to the documentation for additional information:
Traceback (most recent call last):
  File "/linkhome/rech/geniri01/ulf92ec/.local/lib/python3.11/site-packages/wandb/sdk/wandb_init.py", line 1176, in init
    run = wi.init()
          ^^^^^^^^^
  File "/linkhome/rech/geniri01/ulf92ec/.local/lib/python3.11/site-packages/wandb/sdk/wandb_init.py", line 785, in init
    raise error
wandb.errors.CommError: Run initialization has timed out after 90.0 sec.
Please refer to the documentation for additional information:

Any ideas why I get this?

Hey @oussaidene-sma, thanks for flagging this! Would you mind sharing debug-internal.log as well so we can take a look to see what’s going on here?

Hey,
Thanks for the reply.
I’ve sent debug-internal.log to luis.bergua@wandb.ai (I found the address in another thread), since the file is too big to paste here.

Hey @oussaidene-sma, thanks! I just took a look, but it’s not clear why you’re getting the timeout error. Would you mind:

  • Sharing any specific details of your environment. Is this running on a local machine?
  • Setting the following environment variables and sharing the logs again:
    1. WANDB_HTTP_TIMEOUT=300
    2. WANDB_INIT_TIMEOUT=600
    3. WANDB_DEBUG=true
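
If exporting the variables in the shell is awkward (for example inside a batch job), a minimal sketch of setting them from Python is shown here. It assumes the settings are picked up when the run is initialized, so they need to be in place before `wandb.init` is called; `apply_debug_env` is a hypothetical helper name and `my-project` is a placeholder.

```python
import os

# Values taken from the list above. They are set before wandb is imported
# or initialized, on the assumption that they are read at init time.
DEBUG_ENV = {
    "WANDB_HTTP_TIMEOUT": "300",
    "WANDB_INIT_TIMEOUT": "600",
    "WANDB_DEBUG": "true",
}


def apply_debug_env(env=os.environ):
    """Copy the suggested settings into the given environment mapping."""
    env.update(DEBUG_ENV)
    return env


apply_debug_env()

# Afterwards, start the run as usual, e.g.:
# import wandb
# run = wandb.init(project="my-project")  # placeholder project name
```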

Thanks!

Hi @oussaidene-sma, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi @oussaidene-sma, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!

@luis_bergua1 sorry for the late reply. Is it possible that the issue stems from a firewall within the environment I’m using? If so, what steps can I take to resolve it?
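
One rough way to test the firewall theory is to check whether the machine can open a TCP connection to the W&B API host on port 443. The sketch below assumes the default cloud endpoint `api.wandb.ai` (a self-hosted or dedicated instance would use a different host), and `can_reach` is a hypothetical helper, not part of wandb.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # api.wandb.ai is assumed to be the endpoint in use (public cloud default).
    print("api.wandb.ai reachable:", can_reach("api.wandb.ai"))
```

If this prints False on the compute node but True on a machine with normal internet access, outbound traffic is probably being blocked. The check only covers a direct connection, so a setup that requires an HTTP proxy would need to be tested separately.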

Hey, I am facing an error in wandb. The error is:
wandb: Network error (ConnectionError), entering retry loop.
wandb: W&B API key is configured. Use wandb login --relogin to force relogin
wandb: Network error (ConnectionError), entering retry loop.
Problem at: /home/cvpr_int_1/.conda/envs/ip2p/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py 406 experiment
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/cvpr_int_1/.conda/envs/ip2p/lib/python3.8/site-packages/pytorch_lightning/trainer/setup.py:175: PossibleUserWarning: GPU available but not used. Set accelerator and devices using Trainer(accelerator='gpu', devices=1).
rank_zero_warn(

wandb: ERROR Run initialization has timed out after 90.0 sec.
wandb: ERROR Please refer to the documentation for additional information: Frequently Asked Questions About Experiments
Traceback (most recent call last):
  File "main.py", line 636, in <module>
    trainer_kwargs["logger"] = instantiate_from_config(logger_cfg)
  File "./stable_diffusion/ldm/util.py", line 88, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "/home/cvpr_int_1/.conda/envs/ip2p/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 360, in __init__
    _ = self.experiment
  File "/home/cvpr_int_1/.conda/envs/ip2p/lib/python3.8/site-packages/pytorch_lightning/loggers/logger.py", line 53, in experiment
    return get_experiment() or DummyExperiment()
  File "/home/cvpr_int_1/.conda/envs/ip2p/lib/python3.8/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/cvpr_int_1/.conda/envs/ip2p/lib/python3.8/site-packages/pytorch_lightning/loggers/logger.py", line 51, in get_experiment
    return fn(self)
  File "/home/cvpr_int_1/.conda/envs/ip2p/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 406, in experiment
    self._experiment = wandb.init(**self._wandb_init)
  File "/home/cvpr_int_1/.conda/envs/ip2p/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1200, in init
    raise e
  File "/home/cvpr_int_1/.conda/envs/ip2p/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1181, in init
    run = wi.init()
  File "/home/cvpr_int_1/.conda/envs/ip2p/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 780, in init
    raise error
wandb.errors.CommError: Run initialization has timed out after 90.0 sec.

Following is the debug.log file:
2024-07-01 17:06:29,916 INFO MainThread:780160 [wandb_setup.py:_flush():76] Current SDK version is 0.17.3
2024-07-01 17:06:29,916 INFO MainThread:780160 [wandb_setup.py:_flush():76] Configure stats pid to 780160
2024-07-01 17:06:29,916 INFO MainThread:780160 [wandb_setup.py:_flush():76] Loading settings from /home/cvpr_int_1/.config/wandb/settings
2024-07-01 17:06:29,916 INFO MainThread:780160 [wandb_setup.py:_flush():76] Loading settings from /home/cvpr_int_1/pix_to_pix/instruct-pix2pix/wandb/settings
2024-07-01 17:06:29,916 INFO MainThread:780160 [wandb_setup.py:_flush():76] Loading settings from environment variables: {}
2024-07-01 17:06:29,916 INFO MainThread:780160 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-07-01 17:06:29,917 INFO MainThread:780160 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'main.py', 'program_abspath': '/home/cvpr_int_1/pix_to_pix/instruct-pix2pix/main.py', 'program': 'main.py'}
2024-07-01 17:06:29,917 INFO MainThread:780160 [wandb_setup.py:_flush():76] Applying login settings: {}
2024-07-01 17:06:29,917 INFO MainThread:780160 [wandb_setup.py:_flush():76] Applying login settings: {'mode': 'offline'}
2024-07-01 17:06:29,917 INFO MainThread:780160 [wandb_init.py:_log_setup():520] Logging user logs to logs/train_default/wandb/offline-run-20240701_170629-train_default/logs/debug.log
2024-07-01 17:06:29,917 INFO MainThread:780160 [wandb_init.py:_log_setup():521] Logging internal logs to logs/train_default/wandb/offline-run-20240701_170629-train_default/logs/debug-internal.log
2024-07-01 17:06:29,917 INFO MainThread:780160 [wandb_init.py:init():560] calling init triggers
2024-07-01 17:06:29,917 INFO MainThread:780160 [wandb_init.py:init():567] wandb.init called with sweep_config: {}
config: {}
2024-07-01 17:06:29,917 INFO MainThread:780160 [wandb_init.py:init():610] starting backend
2024-07-01 17:06:29,917 INFO MainThread:780160 [wandb_init.py:init():614] setting up manager
2024-07-01 17:06:29,918 INFO MainThread:780160 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-07-01 17:06:29,919 INFO MainThread:780160 [wandb_init.py:init():622] backend started and connected
2024-07-01 17:06:29,922 INFO MainThread:780160 [wandb_init.py:init():711] updated telemetry
2024-07-01 17:06:30,094 INFO MainThread:780160 [wandb_init.py:init():744] communicating run to backend with 90.0 second timeout
2024-07-01 17:06:30,099 INFO MainThread:780160 [wandb_init.py:init():795] starting run threads in backend
2024-07-01 17:06:32,679 INFO MainThread:780160 [wandb_run.py:_console_start():2380] atexit reg
2024-07-01 17:06:32,679 INFO MainThread:780160 [wandb_run.py:_redirect():2235] redirect: wrap_raw
2024-07-01 17:06:32,679 INFO MainThread:780160 [wandb_run.py:_redirect():2300] Wrapping output streams.
2024-07-01 17:06:32,679 INFO MainThread:780160 [wandb_run.py:_redirect():2325] Redirects installed.
2024-07-01 17:06:32,680 INFO MainThread:780160 [wandb_init.py:init():838] run started, returning control to user process
2024-07-01 17:06:39,518 WARNING MsgRouterThr:780160 [router.py:message_loop():77] message_loop has been closed