Initial timeout after 90 sec

I cannot start a project. Whenever I call wandb.init, it runs for 90 seconds and then stops with the error “CommError: Run initialization has timed out after 90.0 sec.”

Hi there,

Thanks for reaching out about the init issues you’re experiencing. I’d love to help you get to the bottom of this. To better understand what’s going on, could you share a few more details?

  1. What system are you running this on? (e.g., notebook, local machine, cloud, etc.)
  2. Which version of our software are you using?
  3. Do you have a code snippet that demonstrates the problem?

This information will really help me pinpoint the cause and find a solution for you. Feel free to share whatever you’re comfortable with.

Looking forward to hearing back from you!

Best,
Jason

  1. ipynb, local machine, linux
  2. 0.17.8, also tried the latest one. same problem

wandb.init(project=EXP_NAME, entity='ahamedrobin45')
wandb.config.update(config.dict)

error: CommError: Run initialization has timed out after 90.0 sec.

Hi there! Thanks for clarifying. It would also help to look at the output.log, debug.log, and debug-internal.log files in the ./wandb/run-date_time-runid/ folder (in its files and logs subfolders), if you have access to the working directory where the experiment is running.
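For anyone collecting these logs, here is a minimal sketch that finds the most recent run folder and prints the tail of each log. It assumes the default ./wandb/run-*/logs/ layout mentioned above; the helper names latest_run_dir and tail_logs are made up for illustration and are not part of the wandb package.

```python
from pathlib import Path

# Log files that the SDK writes under <run folder>/logs/
LOG_NAMES = ("debug.log", "debug-internal.log")


def latest_run_dir(wandb_dir="wandb"):
    """Return the most recently modified run-* folder, or None if there is none."""
    runs = [p for p in Path(wandb_dir).glob("run-*") if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def tail_logs(run_dir, n=20):
    """Map each log file name to its last n lines; missing files are skipped."""
    tails = {}
    for name in LOG_NAMES:
        path = Path(run_dir) / "logs" / name
        if path.is_file():
            tails[name] = path.read_text(errors="replace").splitlines()[-n:]
    return tails


if __name__ == "__main__":
    run_dir = latest_run_dir()
    if run_dir is None:
        print("no run folders found under ./wandb")
    else:
        for name, lines in tail_logs(run_dir).items():
            print(f"== {run_dir / 'logs' / name} ==")
            print("\n".join(lines))
```

Run it from the directory where the training script was launched, and redact API keys or private paths before posting the output.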

Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi, I have the same error and can provide the .log files.

First, code of a.py:

import wandb
import random  # for demo script
import os

wandb_api_key = "xxx"  # redacted API key
wandb.login(key=wandb_api_key, relogin=True)

epochs = 10
lr = 0.01

run = wandb.init(name="jj")

offset = random.random() / 5
print(f"lr: {lr}")

# simulate a training run and log the metrics for each epoch
for epoch in range(2, epochs):
    acc = 1 - 2**-epoch - random.random() / epoch - offset
    loss = 2**-epoch + random.random() / epoch + offset
    print(f"epoch={epoch}, accuracy={acc}, loss={loss}")
    wandb.log({"accuracy": acc, "loss": loss})

When I run “python a.py” directly on the Linux machine, it works. But when I submit it with “sbatch b.sh”, it fails with the timeout!

b.sh:

#!/bin/bash -l

#SBATCH --nodes=1 # Resource requirements, job runtime, other options
#SBATCH --ntasks-per-node=1 #All #SBATCH lines have to follow uninterrupted
#SBATCH --time=24:00:00
#SBATCH --job-name=zephyr-7b_dpo_4gpu
#SBATCH --export=NONE # do not export environment from submitting shell
#SBATCH --output=zephyr-7b_dpo_4gpu.txt
#SBATCH --cpus-per-gpu=8
#SBATCH --gres=gpu:a40:1

cd xxx
source ~/.bashrc
conda activate simpo
echo "start to run"
python -c "import torch; print(torch.cuda.is_available())"

python a.py
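Since b.sh uses --export=NONE, the job does not inherit the submitting shell's environment, so any proxy or W&B variables set interactively may be missing inside the job. Whether that matters on this cluster is an assumption, not something the thread confirms. A small sketch to compare the two environments (the helper name network_env is hypothetical):

```python
import os

# Conventional proxy variable names, compared case-insensitively
PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")


def network_env(environ=None):
    """Collect proxy- and W&B-related variables from an environment mapping."""
    environ = os.environ if environ is None else environ
    found = {}
    for key, value in environ.items():
        if key.lower() in PROXY_VARS or key.startswith("WANDB_"):
            # never print the API key itself
            found[key] = "REDACTED" if key == "WANDB_API_KEY" else value
    return found


if __name__ == "__main__":
    for key, value in sorted(network_env().items()):
        print(f"{key}={value}")
```

Running it once in the login shell and once from inside the sbatch script would show whether the two environments differ.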

debug.log:
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_setup.py:_flush():79] Current SDK version is 0.18.3
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_setup.py:_flush():79] Configure stats pid to 1173237
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_setup.py:_flush():79] Loading settings from /home/hpc/v100dd/v100dd18/.config/wandb/settings
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_setup.py:_flush():79] Loading settings from /home/hpc/v100dd/v100dd18/dpo/trl/wandb/settings
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_setup.py:_flush():79] Loading settings from environment variables: {}
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_setup.py:_flush():79] Inferring run settings from compute environment: {'program_relpath': 'fei.py', 'program_abspath': '/home/hpc/v100dd/v100dd18/dpo/trl/fei.py', 'program': '/home/hpc/v100dd/v100dd18/dpo/trl/fei.py'}
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_setup.py:_flush():79] Applying login settings: {'api_key': 'REDACTED'}
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_setup.py:_flush():79] Applying login settings: {'api_key': 'REDACTED'}
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_setup.py:_flush():79] Applying login settings: {}
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_init.py:_log_setup():532] Logging user logs to /home/hpc/v100dd/v100dd18/dpo/trl/wandb/run-20241005_213857-9pvgjp31/logs/debug.log
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_init.py:_log_setup():533] Logging internal logs to /home/hpc/v100dd/v100dd18/dpo/trl/wandb/run-20241005_213857-9pvgjp31/logs/debug-internal.log
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_init.py:init():617] calling init triggers
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_init.py:init():624] wandb.init called with sweep_config: {}
config: {}
2024-10-05 21:38:57,704 INFO MainThread:1173237 [wandb_init.py:init():667] starting backend
2024-10-05 21:38:57,705 INFO MainThread:1173237 [wandb_init.py:init():671] sending inform_init request
2024-10-05 21:38:57,706 INFO MainThread:1173237 [backend.py:_multiprocessing_setup():104] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-10-05 21:38:57,707 INFO MainThread:1173237 [wandb_init.py:init():684] backend started and connected
2024-10-05 21:38:57,708 INFO MainThread:1173237 [wandb_init.py:init():779] updated telemetry
2024-10-05 21:38:57,709 INFO MainThread:1173237 [wandb_init.py:init():812] communicating run to backend with 90.0 second timeout
2024-10-05 21:39:55,999 INFO Thread-1 (wrapped_target):1173237 [retry.py:call():172] Retry attempt failed:
Traceback (most recent call last):
  File "/home/hpc/v100dd/v100dd18/miniconda3/envs/simpo/lib/python3.10/site-packages/urllib3/connection.py", line 199, in _new_conn
    sock = connection.create_connection(
  File "/home/hpc/v100dd/v100dd18/miniconda3/envs/simpo/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/home/hpc/v100dd/v100dd18/miniconda3/envs/simpo/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hpc/v100dd/v100dd18/miniconda3/envs/simpo/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/home/hpc/v100dd/v100dd18/miniconda3/envs/simpo/lib/python3.10/site-packages/urllib3/connectionpool.py", line 490, in _make_request
    raise new_e
  File "/home/hpc/v100dd/v100dd18/miniconda3/envs/simpo/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "/home/hpc/v100dd/v100dd18/miniconda3/envs/simpo/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "/home/hpc/v100dd/v100dd18/miniconda3/envs/simpo/lib/python3.10/site-packages/urllib3/connection.py", line 693, in connect
    self.sock = sock = self._new_conn()
  File "/home/hpc/v100dd/v100dd18/miniconda3/envs/simpo/lib/python3.10/site-packages/urllib3/connection.py", line 208, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x14e9b8a442e0>, 'Connection to api.wandb.ai timed out. (connect timeout=20)')
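The traceback above shows that the TCP connection to api.wandb.ai never opened within the 20-second connect timeout. A rough way to check this from inside the batch job, independent of wandb, is sketched below. The function name can_connect is made up for this example; the host and timeout are taken from the log, and port 443 is assumed because the connection is HTTPS.

```python
import socket


def can_connect(host="api.wandb.ai", port=443, timeout=20.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers timeouts, refused connections, and DNS failures
        return False


# Example call, e.g. added to a.py before wandb.init():
#   print("api.wandb.ai reachable:", can_connect())
```

If this prints False on the compute node but True on the login node, the node probably has no direct outbound internet access. Note that this is a direct connection test, so it can also fail on clusters where traffic is expected to go through an HTTP proxy. In that situation, one option is W&B's offline mode (setting WANDB_MODE=offline for the job and uploading later with wandb sync from a machine that has network access).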

debug-internal.log:

{"time":"2024-10-05T21:38:57.72198004+02:00","level":"INFO","msg":"using version","core version":"0.18.3"}
{"time":"2024-10-05T21:38:57.722000528+02:00","level":"INFO","msg":"created symlink","path":"/home/hpc/v100dd/v100dd18/dpo/trl/wandb/run-20241005_213857-9pvgjp31/logs/debug-core.log"}
{"time":"2024-10-05T21:38:57.726760521+02:00","level":"ERROR","msg":"dialing: google: could not find default credentials. See xx"}
{"time":"2024-10-05T21:38:57.756738939+02:00","level":"INFO","msg":"created new stream","id":"9pvgjp31"}
{"time":"2024-10-05T21:38:57.756755881+02:00","level":"INFO","msg":"stream: started","id":"9pvgjp31"}
{"time":"2024-10-05T21:38:57.756789084+02:00","level":"INFO","msg":"handler: started","stream_id":{"value":"9pvgjp31"}}
{"time":"2024-10-05T21:38:57.756821766+02:00","level":"INFO","msg":"writer: Do: started","stream_id":{"value":"9pvgjp31"}}
{"time":"2024-10-05T21:38:57.756815584+02:00","level":"INFO","msg":"sender: started","stream_id":{"value":"9pvgjp31"}}
{"time":"2024-10-05T21:39:27.760853135+02:00","level":"INFO","msg":"api: retrying error","error":"Post \"xx\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
{"time":"2024-10-05T21:40:00.230482509+02:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": context deadline exceeded"}
{"time":"2024-10-05T21:40:29.742902224+02:00","level":"INFO","msg":"stream: closing","id":"9pvgjp31"}
{"time":"2024-10-05T21:40:29.743215419+02:00","level":"INFO","msg":"handler: closed","stream_id":{"value":"9pvgjp31"}}
{"time":"2024-10-05T21:40:29.743245806+02:00","level":"INFO","msg":"writer: Close: closed","stream_id":{"value":"9pvgjp31"}}
{"time":"2024-10-05T21:40:29.743262648+02:00","level":"ERROR","msg":"sender: upsertRun:","error":"failed to upsert bucket: api: failed sending: POST https://api.wandb.ai/graphql giving up after 3 attempt(s): context canceled"}
{"time":"2024-10-05T21:40:29.743305961+02:00","level":"ERROR","msg":"runwork: ignoring record after close","work":{"Record":{"RecordType":{"Request":{"RequestType":{"Defer":{}}}},"control":{"always_send":true}}}}
{"time":"2024-10-05T21:40:29.745116376+02:00","level":"INFO","msg":"sender: closed","stream_id":{"value":"9pvgjp31"}}
{"time":"2024-10-05T21:40:29.745131004+02:00","level":"INFO","msg":"stream: closed","id":"9pvgjp31"}
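debug-internal.log holds one JSON object per line, so the relevant entries can be filtered out instead of reading the whole file. A minimal sketch, assuming the field names seen above (time, level, msg, error); error_entries is a hypothetical helper, not a wandb function:

```python
import json


def error_entries(lines):
    """Yield (time, msg, error) for entries at ERROR level or carrying an "error" field."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("level") == "ERROR" or "error" in entry:
            yield entry.get("time"), entry.get("msg"), entry.get("error")


# Example, with the path to the run's log file filled in by hand:
#   with open("wandb/run-.../logs/debug-internal.log") as fh:
#       for time, msg, error in error_entries(fh):
#           print(time, msg, error)
```

Applied to the log above, this would keep the "retrying error" and "upsertRun" entries, which point at requests to https://api.wandb.ai/graphql timing out.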

Hi there - can you please provide me a link to the project/workspace in question? Thanks!

Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi Rizwan, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!