Connection timed out always now

Hi, I’m trying to run the Stable Audio Open training code, and train.py started giving me this error a few days ago:

/content/stable-audio-tools
Found 158 files
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
wandb: Currently logged in as: kim-ake. Use `wandb login --relogin` to force relogin
Problem at: /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/wandb.py 399 experiment
Traceback (most recent call last):
  File "/content/stable-audio-tools/./train.py", line 128, in <module>
    main()
  File "/content/stable-audio-tools/./train.py", line 72, in main
    wandb_logger.watch(training_wrapper)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/wandb.py", line 411, in watch
    self.experiment.watch(model, log=log, log_freq=log_freq, log_graph=log_graph)
  File "/usr/local/lib/python3.10/dist-packages/lightning_fabric/loggers/logger.py", line 118, in experiment
    return fn(self)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/wandb.py", line 399, in experiment
    self._experiment = wandb.init(**self._wandb_init)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1171, in init
    raise e
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1152, in init
    run = wi.init()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 768, in init
    raise error
wandb.errors.CommError: Run initialization has timed out after 600.0 sec. 
Please refer to the documentation for additional information:

Debug.log reveals nothing specific:

2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Current SDK version is 0.15.4
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Configure stats pid to 14803
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Loading settings from /root/.config/wandb/settings
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Loading settings from /content/stable-audio-tools/wandb/settings
2024-07-06 18:07:43,859 WARNING MainThread:14803 [wandb_setup.py:_flush():76] Unknown environment variable: WANDB_HTTP_TIMEOUT
2024-07-06 18:07:43,859 WARNING MainThread:14803 [wandb_setup.py:_flush():76] Unknown environment variable: WANDB_DEBUG
2024-07-06 18:07:43,859 WARNING MainThread:14803 [wandb_setup.py:_flush():76] Unknown environment variable: WANDB_SERVICE
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'init_timeout': '600'}
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'train.py', 'program': '/content/stable-audio-tools/./train.py'}
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_init.py:_log_setup():507] Logging user logs to ./wandb/run-20240706_180743-oedjvz8g/logs/debug.log
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_init.py:_log_setup():508] Logging internal logs to ./wandb/run-20240706_180743-oedjvz8g/logs/debug-internal.log
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_init.py:init():547] calling init triggers
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_init.py:init():554] wandb.init called with sweep_config: {}
config: {}
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_init.py:init():596] starting backend
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_init.py:init():600] setting up manager
2024-07-06 18:07:43,862 INFO    MainThread:14803 [backend.py:_multiprocessing_setup():106] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-07-06 18:07:43,864 INFO    MainThread:14803 [wandb_init.py:init():606] backend started and connected
2024-07-06 18:07:43,866 INFO    MainThread:14803 [wandb_init.py:init():703] updated telemetry
2024-07-06 18:07:43,871 INFO    MainThread:14803 [wandb_init.py:init():736] communicating run to backend with 600.0 second timeout
2024-07-06 18:17:44,047 ERROR   MainThread:14803 [wandb_init.py:init():762] encountered error: Run initialization has timed out after 600.0 sec. 
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
2024-07-06 18:17:44,200 ERROR   MainThread:14803 [wandb_init.py:init():1170] Run initialization has timed out after 600.0 sec. 
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1152, in init
    run = wi.init()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 768, in init
    raise error
wandb.errors.CommError: Run initialization has timed out after 600.0 sec. 
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
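
One thing the log does show is how long the call actually waited. The sketch below takes the two relevant lines from the excerpt above and computes the gap between "communicating run to backend" and the error. The helper names (`parse_timestamp`, `elapsed_seconds`) are made up for illustration; the only assumption about the log format is that each line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp, as it does here.

```python
from datetime import datetime

# Two lines copied from the debug.log excerpt above.
LOG_LINES = [
    "2024-07-06 18:07:43,871 INFO    MainThread:14803 [wandb_init.py:init():736] "
    "communicating run to backend with 600.0 second timeout",
    "2024-07-06 18:17:44,047 ERROR   MainThread:14803 [wandb_init.py:init():762] "
    "encountered error: Run initialization has timed out after 600.0 sec. ",
]


def parse_timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Return the number of seconds between two timestamped log lines."""
    delta = parse_timestamp(end_line) - parse_timestamp(start_line)
    return delta.total_seconds()


wait = elapsed_seconds(LOG_LINES[0], LOG_LINES[1])
print(f"Waited {wait:.3f} s before the error was raised")
```

The gap comes out at just over the configured 600 seconds, which suggests the backend never answered during the whole window rather than rejecting the run early.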

I’m running on Colab with wandb 0.15.4, as required by Stable Audio Open. The curious thing is that this worked fine once or twice, but it has now failed on several different days.
I also set these variables:

os.environ['WANDB_HTTP_TIMEOUT'] = '300'
os.environ['WANDB_INIT_TIMEOUT'] = '600'
os.environ['WANDB_DEBUG'] = 'true'
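
The debug.log above hints at what happened to these variables. A minimal sketch, using only the lines already quoted in this post, that pulls out the names the settings loader reported as unknown (`unknown_env_vars` is a hypothetical helper, not part of wandb):

```python
import re

# Lines copied from the debug.log excerpt earlier in this post.
LOG_EXCERPT = """\
2024-07-06 18:07:43,859 WARNING MainThread:14803 [wandb_setup.py:_flush():76] Unknown environment variable: WANDB_HTTP_TIMEOUT
2024-07-06 18:07:43,859 WARNING MainThread:14803 [wandb_setup.py:_flush():76] Unknown environment variable: WANDB_DEBUG
2024-07-06 18:07:43,859 WARNING MainThread:14803 [wandb_setup.py:_flush():76] Unknown environment variable: WANDB_SERVICE
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'init_timeout': '600'}
"""

UNKNOWN_RE = re.compile(r"Unknown environment variable: (\w+)")


def unknown_env_vars(log_text: str) -> list[str]:
    """Return the variable names flagged as unknown, in log order."""
    return UNKNOWN_RE.findall(log_text)


ignored = unknown_env_vars(LOG_EXCERPT)
print("Flagged as unknown:", ignored)
```

So in this run only `init_timeout` was loaded as a setting from the environment; `WANDB_HTTP_TIMEOUT` and `WANDB_DEBUG` were flagged by the settings loader. Whether other parts of the SDK still read them is not something this log shows.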

I also logged in again from the terminal, but that did not help.

OK, interesting: I tried this code, suggested by Claude, in a new session and it worked perfectly, even after I downgraded wandb to 0.15.4 from the default 0.17 release.

import os
import wandb
from getpass import getpass

# Print wandb version
print(f"Weights & Biases version: {wandb.__version__}")

# Set environment variables
os.environ['WANDB_HTTP_TIMEOUT'] = '300'
os.environ['WANDB_INIT_TIMEOUT'] = '600'
os.environ['WANDB_DEBUG'] = 'true'

# Print environment variables
print("\nEnvironment Variables:")
print(f"WANDB_HTTP_TIMEOUT: {os.environ.get('WANDB_HTTP_TIMEOUT')}")
print(f"WANDB_INIT_TIMEOUT: {os.environ.get('WANDB_INIT_TIMEOUT')}")
print(f"WANDB_DEBUG: {os.environ.get('WANDB_DEBUG')}")

# Attempt to log in
print("\nAttempting to log in to Weights & Biases...")

try:
    # First, try logging in without an API key (in case you're already logged in)
    wandb.login()
except Exception as e:
    print(f"Login without API key failed. Error: {e}")
    print("Trying again with manual API key input...")
    
    # If that fails, prompt for API key
    api_key = getpass("Enter your W&B API key: ")
    try:
        wandb.login(key=api_key)
    except Exception as e:
        print(f"Login with API key failed. Error: {e}")
    else:
        print("Successfully logged in with provided API key.")
else:
    print("Successfully logged in without needing to provide API key.")

# Print system info
print("\nSystem Info:")
!python --version
!pip freeze | grep wandb

# Check connection to W&B servers
print("\nChecking connection to W&B servers...")
!curl -I https://api.wandb.ai

# Initialize a wandb run (this will test if everything is working)
print("\nInitializing a test W&B run...")
try:
    with wandb.init(project="test_project", job_type="test") as run:
        print("W&B run initialized successfully.")
        print(f"Run URL: {run.get_url()}")
except Exception as e:
    print(f"Failed to initialize W&B run. Error: {e}")

print("\nTest complete.")

I’ve been testing, and the run seems to hang at line 73 of train.py (line 72 in the traceback above):
wandb_logger.watch(training_wrapper)
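
To see what the Python side is doing while a call like this hangs, one option is the standard-library `faulthandler` module, which can dump every thread's stack after a delay. This is only a sketch: `call_with_hang_dump` and `slow_stand_in` are hypothetical names, the stand-in function replaces the real `watch` call so the example runs anywhere, and the dump only covers threads in this Python process, not any separate process wandb may start.

```python
import faulthandler
import os
import tempfile
import time


def call_with_hang_dump(fn, dump_path, dump_after=30.0):
    """Call fn(); if it is still running after dump_after seconds,
    write the stacks of all threads in this process to dump_path."""
    with open(dump_path, "w") as dump_file:
        faulthandler.dump_traceback_later(dump_after, exit=False, file=dump_file)
        try:
            return fn()
        finally:
            # Disarm the watchdog whether fn returned or raised.
            faulthandler.cancel_dump_traceback_later()


def slow_stand_in():
    """Stand-in for a slow call such as wandb_logger.watch(training_wrapper)."""
    time.sleep(0.5)
    return "done"


dump_path = os.path.join(tempfile.mkdtemp(), "hang_traceback.log")
result = call_with_hang_dump(slow_stand_in, dump_path, dump_after=0.1)

with open(dump_path) as f:
    dump_text = f.read()

print(result)
print(dump_text)
```

In train.py the same wrapper could be used as `call_with_hang_dump(lambda: wandb_logger.watch(training_wrapper), "hang_traceback.log")`, with a delay shorter than the 600 second init timeout so the dump is written before the error is raised.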

Hey @kim-ake! Thank you so much for your patience with this!

Thank you for reaching out to W&B and for flagging this to us.

From my side, I have a couple of suggestions/requests:

  • You mentioned that you downgraded to version 0.15.4. Which version were you on when you received those errors? Our current version is 0.17.5; do you see the same issue with it?
  • Have you considered using core in your original code? wandb-core is a new and improved backend for the W&B SDK that is more performant, versatile, and robust. You can enable it by calling: wandb.require("core")
  • Could you share more of the code in train.py and some context on how you implemented wandb_logger.watch(training_wrapper), specifically the training_wrapper? That would be extremely helpful for us to dig into this further.

Thank you so much for pushing through despite how frustrating this must have been. Let us know when you can share these details so we can help out!

Thanks,
W&B

Hi Kim,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Best,
Weights & Biases

Hi Kim, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!