Connection always times out now

Hi, I’m trying to run the Stable Audio Open training code, and train.py started giving me this error a few days ago:

/content/stable-audio-tools
Found 158 files
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
wandb: Currently logged in as: kim-ake. Use `wandb login --relogin` to force relogin
Problem at: /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/wandb.py 399 experiment
Traceback (most recent call last):
  File "/content/stable-audio-tools/./train.py", line 128, in <module>
    main()
  File "/content/stable-audio-tools/./train.py", line 72, in main
    wandb_logger.watch(training_wrapper)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/wandb.py", line 411, in watch
    self.experiment.watch(model, log=log, log_freq=log_freq, log_graph=log_graph)
  File "/usr/local/lib/python3.10/dist-packages/lightning_fabric/loggers/logger.py", line 118, in experiment
    return fn(self)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/wandb.py", line 399, in experiment
    self._experiment = wandb.init(**self._wandb_init)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1171, in init
    raise e
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1152, in init
    run = wi.init()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 768, in init
    raise error
wandb.errors.CommError: Run initialization has timed out after 600.0 sec. 
Please refer to the documentation for additional information:

Debug.log reveals nothing specific:

2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Current SDK version is 0.15.4
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Configure stats pid to 14803
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Loading settings from /root/.config/wandb/settings
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Loading settings from /content/stable-audio-tools/wandb/settings
2024-07-06 18:07:43,859 WARNING MainThread:14803 [wandb_setup.py:_flush():76] Unknown environment variable: WANDB_HTTP_TIMEOUT
2024-07-06 18:07:43,859 WARNING MainThread:14803 [wandb_setup.py:_flush():76] Unknown environment variable: WANDB_DEBUG
2024-07-06 18:07:43,859 WARNING MainThread:14803 [wandb_setup.py:_flush():76] Unknown environment variable: WANDB_SERVICE
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'init_timeout': '600'}
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'train.py', 'program': '/content/stable-audio-tools/./train.py'}
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_init.py:_log_setup():507] Logging user logs to ./wandb/run-20240706_180743-oedjvz8g/logs/debug.log
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_init.py:_log_setup():508] Logging internal logs to ./wandb/run-20240706_180743-oedjvz8g/logs/debug-internal.log
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_init.py:init():547] calling init triggers
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_init.py:init():554] wandb.init called with sweep_config: {}
config: {}
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_init.py:init():596] starting backend
2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_init.py:init():600] setting up manager
2024-07-06 18:07:43,862 INFO    MainThread:14803 [backend.py:_multiprocessing_setup():106] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-07-06 18:07:43,864 INFO    MainThread:14803 [wandb_init.py:init():606] backend started and connected
2024-07-06 18:07:43,866 INFO    MainThread:14803 [wandb_init.py:init():703] updated telemetry
2024-07-06 18:07:43,871 INFO    MainThread:14803 [wandb_init.py:init():736] communicating run to backend with 600.0 second timeout
2024-07-06 18:17:44,047 ERROR   MainThread:14803 [wandb_init.py:init():762] encountered error: Run initialization has timed out after 600.0 sec. 
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
2024-07-06 18:17:44,200 ERROR   MainThread:14803 [wandb_init.py:init():1170] Run initialization has timed out after 600.0 sec. 
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1152, in init
    run = wi.init()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 768, in init
    raise error
wandb.errors.CommError: Run initialization has timed out after 600.0 sec. 
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
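
To see where the ten minutes actually went, one option is to look for the largest gap between consecutive timestamps in debug.log. This is only a minimal sketch, assuming the `YYYY-MM-DD HH:MM:SS,mmm` prefix shown above; `largest_gap` is a hypothetical helper name, and the sample lines are copied from the log above:

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the debug.log lines above.
TS_RE = re.compile(r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

def largest_gap(lines):
    """Return (seconds, line_before, line_after) for the longest pause between timestamped lines."""
    best = (0.0, None, None)
    prev_ts, prev_line = None, None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # continuation lines such as "config: {}" carry no timestamp
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best

sample = [
    "2024-07-06 18:07:43,866 INFO    MainThread:14803 [wandb_init.py:init():703] updated telemetry",
    "2024-07-06 18:07:43,871 INFO    MainThread:14803 [wandb_init.py:init():736] communicating run to backend with 600.0 second timeout",
    "2024-07-06 18:17:44,047 ERROR   MainThread:14803 [wandb_init.py:init():762] encountered error: Run initialization has timed out after 600.0 sec. ",
]

gap, before, after = largest_gap(sample)
print(f"{gap:.3f} s between:\n  {before}\n  {after}")
```

On the lines above, the whole wait sits between "communicating run to backend" and the timeout error. The same helper could be pointed at the debug-internal.log file named in the log (for example `largest_gap(open(path))`), which may show more of what the backend process was doing during that gap.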

I’m running on Colab with wandb 0.15.4, as required by Stable Audio Open. The curious thing is that this worked fine once or twice, but for several days now it has failed every time.
I also set these variables:

os.environ['WANDB_HTTP_TIMEOUT'] = '300'
os.environ['WANDB_INIT_TIMEOUT'] = '600'
os.environ['WANDB_DEBUG'] = 'true'
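
Note that the debug.log above reports WANDB_HTTP_TIMEOUT and WANDB_DEBUG as unknown environment variables, and only `init_timeout` shows up as loaded, so with 0.15.4 two of these three settings appear to have no effect. A small sketch for pulling that information out of a debug.log, assuming the message wording shown in the log above (`env_report` is a hypothetical helper name):

```python
import ast
import re

# Message formats copied from the debug.log lines above.
UNKNOWN_RE = re.compile(r"Unknown environment variable: (\w+)")
LOADED_RE = re.compile(r"Loading settings from environment variables: (\{.*\})")

def env_report(lines):
    """Return (ignored_names, loaded_settings) as reported by wandb in debug.log."""
    ignored, loaded = [], {}
    for line in lines:
        m = UNKNOWN_RE.search(line)
        if m:
            ignored.append(m.group(1))
            continue
        m = LOADED_RE.search(line)
        if m:
            # The settings are logged as a Python dict literal.
            loaded.update(ast.literal_eval(m.group(1)))
    return ignored, loaded

sample = [
    "2024-07-06 18:07:43,859 WARNING MainThread:14803 [wandb_setup.py:_flush():76] Unknown environment variable: WANDB_HTTP_TIMEOUT",
    "2024-07-06 18:07:43,859 WARNING MainThread:14803 [wandb_setup.py:_flush():76] Unknown environment variable: WANDB_DEBUG",
    "2024-07-06 18:07:43,859 WARNING MainThread:14803 [wandb_setup.py:_flush():76] Unknown environment variable: WANDB_SERVICE",
    "2024-07-06 18:07:43,859 INFO    MainThread:14803 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'init_timeout': '600'}",
]

ignored, loaded = env_report(sample)
print("ignored:", ignored)
print("loaded:", loaded)
```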

I also logged in again from the terminal, but that did not help.

OK, interesting: I tried this code suggested by Claude in a new session, and it worked perfectly, even after I downgraded wandb to 0.15.4 from the default 0.17.x.

import os
import wandb
from getpass import getpass

# Print wandb version
print(f"Weights & Biases version: {wandb.__version__}")

# Set environment variables
os.environ['WANDB_HTTP_TIMEOUT'] = '300'
os.environ['WANDB_INIT_TIMEOUT'] = '600'
os.environ['WANDB_DEBUG'] = 'true'

# Print environment variables
print("\nEnvironment Variables:")
print(f"WANDB_HTTP_TIMEOUT: {os.environ.get('WANDB_HTTP_TIMEOUT')}")
print(f"WANDB_INIT_TIMEOUT: {os.environ.get('WANDB_INIT_TIMEOUT')}")
print(f"WANDB_DEBUG: {os.environ.get('WANDB_DEBUG')}")

# Attempt to log in
print("\nAttempting to log in to Weights & Biases...")

try:
    # First, try logging in without an API key (in case you're already logged in)
    wandb.login()
except Exception as e:
    print(f"Login without API key failed. Error: {e}")
    print("Trying again with manual API key input...")
    
    # If that fails, prompt for API key
    api_key = getpass("Enter your W&B API key: ")
    try:
        wandb.login(key=api_key)
    except Exception as e:
        print(f"Login with API key failed. Error: {e}")
    else:
        print("Successfully logged in with provided API key.")
else:
    print("Successfully logged in without needing to provide API key.")

# Print system info
print("\nSystem Info:")
!python --version
!pip freeze | grep wandb

# Check connection to W&B servers
print("\nChecking connection to W&B servers...")
!curl -I https://api.wandb.ai

# Initialize a wandb run (this will test if everything is working)
print("\nInitializing a test W&B run...")
try:
    with wandb.init(project="test_project", job_type="test") as run:
        print("W&B run initialized successfully.")
        print(f"Run URL: {run.get_url()}")
except Exception as e:
    print(f"Failed to initialize W&B run. Error: {e}")

print("\nTest complete.")

I’ve been testing, and it seems that the logging hangs at line 73 of train.py:
wandb_logger.watch(training_wrapper)
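
One way to check whether this hang is network-related would be to launch the same script with `WANDB_MODE=offline`, which tells wandb to log locally instead of contacting the server. This is a rough sketch, not a fix: `run_offline` is a hypothetical helper, and the child command below is only a stand-in that echoes the variable; the real check would pass train.py with its usual arguments instead.

```python
import os
import subprocess
import sys

def run_offline(cmd, timeout=None, capture=False):
    """Run cmd with WANDB_MODE=offline added to a copy of the current environment."""
    env = dict(os.environ, WANDB_MODE="offline")
    return subprocess.run(
        cmd,
        env=env,
        capture_output=capture,  # leave False to watch training output live
        text=True,
        timeout=timeout,         # raises subprocess.TimeoutExpired if the child hangs
    )

# Stand-in child process: it only prints the variable it received.
# For the real test, replace this list with the train.py command line.
result = run_offline(
    [sys.executable, "-c", "import os; print(os.environ['WANDB_MODE'])"],
    timeout=60,
    capture=True,
)
print(result.stdout.strip())
```

If the script gets past `wandb_logger.watch(...)` in offline mode, the problem is most likely in the connection between the Colab session and the W&B backend; if it still hangs, the cause is probably local to the process.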