I am still getting the same issue, although not with the wandb test script shown. I cannot predict what triggers the failure.
In one of my scripts, the main function starts like this:
import hashlib
import json
import os
import uuid

import hydra
import wandb
from dotenv import load_dotenv
from omegaconf import DictConfig, OmegaConf

load_dotenv()
DATA_ROOT = os.getenv("DATA_ROOT")


@hydra.main(version_base=None, config_path="conf", config_name="random-forest")
def main(cfg: DictConfig) -> None:
    # Intended to raise the wandb service start-up timeout to 300 s.
    os.environ["WANDB__SERVICE_WAIT"] = "300"
    wandb_cfg = OmegaConf.to_container(cfg, resolve=True, throw_on_missing=True)
    # Group runs by SLURM job ID (or a random UUID) plus a hash of the config.
    slurm_job_id = os.environ.get("SLURM_JOB_ID", uuid.uuid4())
    sorted_cfg = json.dumps(wandb_cfg, sort_keys=True)
    hashed_cfg = hashlib.sha256(sorted_cfg.encode("utf-8")).hexdigest()
    group = f"{slurm_job_id}_{hashed_cfg}"
    wandb.init(
        mode="online",
        project=wandb_cfg["wandb"]["project"],
        config=wandb_cfg,
        group=group,
        tags=wandb_cfg["wandb"]["tags"],
    )
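For reference, the group key built in that excerpt can be sketched in isolation. This is only an illustration: `make_group` and its `env` parameter are hypothetical names that are not in my script, and the config dicts are made-up examples. It shows that `sort_keys=True` makes the hash independent of key order, so two runs with the same config and the same SLURM job ID land in the same group.

```python
import hashlib
import json
import os
import uuid


def make_group(cfg: dict, env=None) -> str:
    """Build '<job id>_<sha256 of config>' as in the excerpt (hypothetical helper)."""
    env = os.environ if env is None else env
    # Fall back to a random UUID when SLURM_JOB_ID is not set, so every
    # process outside SLURM ends up with a different group.
    job_id = env.get("SLURM_JOB_ID") or str(uuid.uuid4())
    # sort_keys=True sorts nested keys too, so key order does not affect the hash.
    sorted_cfg = json.dumps(cfg, sort_keys=True)
    hashed_cfg = hashlib.sha256(sorted_cfg.encode("utf-8")).hexdigest()
    return f"{job_id}_{hashed_cfg}"


# Same config with keys in a different order, same (made-up) job ID.
a = make_group({"svr": {"C": 0.1, "kernel": "rbf"}}, env={"SLURM_JOB_ID": "123"})
b = make_group({"svr": {"kernel": "rbf", "C": 0.1}}, env={"SLURM_JOB_ID": "123"})
print(a == b)  # prints True
```

None of this touches the wandb service, so I do not expect the group computation to be related to the timeout.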
It fails before it gets to the wandb.init(...) call. Here is the standard output.
wandb: ERROR Find detailed error logs at: /projects/bbub/mjvolk3/torchcell/wandb/debug-cli.mjvolk3.log
Error: Timed out waiting for wandb service to start after 30.0 seconds. Try increasing the timeout with the `_service_wait` setting.
Here is the debug-internal.log
2024-04-30 19:40:13,544 INFO StreamThr :910101 [internal.py:wandb_internal():86] W&B internal server running at pid: 910101, started at: 2024-04-30 19:40:13.543642
2024-04-30 19:40:13,546 DEBUG HandlerThread:910101 [handler.py:handle_request():146] handle_request: status
2024-04-30 19:40:13,562 INFO WriterThread:910101 [datastore.py:open_for_write():87] open: /projects/bbub/mjvolk3/torchcell/wandb/offline-run-20240430_194013-nyj2wnqc/run-nyj2wnqc.wandb
2024-04-30 19:40:13,633 DEBUG HandlerThread:910101 [handler.py:handle_request():146] handle_request: run_start
2024-04-30 19:40:13,634 DEBUG HandlerThread:910101 [system_info.py:__init__():26] System info init
2024-04-30 19:40:13,634 DEBUG HandlerThread:910101 [system_info.py:__init__():41] System info init done
2024-04-30 19:40:13,634 INFO HandlerThread:910101 [system_monitor.py:start():194] Starting system monitor
2024-04-30 19:40:13,635 INFO SystemMonitor:910101 [system_monitor.py:_start():158] Starting system asset monitoring threads
2024-04-30 19:40:13,635 INFO HandlerThread:910101 [system_monitor.py:probe():214] Collecting system info
2024-04-30 19:40:13,635 INFO SystemMonitor:910101 [interfaces.py:start():190] Started cpu monitoring
2024-04-30 19:40:13,636 INFO SystemMonitor:910101 [interfaces.py:start():190] Started disk monitoring
2024-04-30 19:40:13,637 INFO SystemMonitor:910101 [interfaces.py:start():190] Started memory monitoring
2024-04-30 19:40:13,638 INFO SystemMonitor:910101 [interfaces.py:start():190] Started network monitoring
2024-04-30 19:40:13,682 DEBUG HandlerThread:910101 [system_info.py:probe():150] Probing system
2024-04-30 19:40:13,685 DEBUG HandlerThread:910101 [system_info.py:_probe_git():135] Probing git
2024-04-30 19:40:13,712 DEBUG HandlerThread:910101 [system_info.py:_probe_git():143] Probing git done
2024-04-30 19:40:13,712 DEBUG HandlerThread:910101 [system_info.py:probe():198] Probing system done
2024-04-30 19:40:13,712 DEBUG HandlerThread:910101 [system_monitor.py:probe():223] {'os': 'Linux-4.18.0-477.51.1.el8_8.x86_64-x86_64-with-glibc2.28', 'python': '3.11.7', 'heartbeatAt': '2024-05-01T00:40:13.682839', 'startedAt': '2024-05-01T00:40:13.516011', 'docker': None, 'cuda': None, 'args': (), 'state': 'running', 'program': '/projects/bbub/mjvolk3/torchcell/experiments/smf-dmf-tmf-001/svr.py', 'codePathLocal': 'experiments/smf-dmf-tmf-001/svr.py', 'codePath': 'experiments/smf-dmf-tmf-001/svr.py', 'git': {'remote': 'https://github.com/Mjvolk3/torchcell', 'commit': '536b456073f74a3517c452ed2fa40f740aa8d1a0'}, 'email': 'mjvolk3@illinois.edu', 'root': '/projects/bbub/mjvolk3/torchcell', 'host': 'cn004.delta.ncsa.illinois.edu', 'username': 'mjvolk3', 'executable': '/projects/bbub/miniconda3/envs/torchcell/bin/python', 'cpu_count': 128, 'cpu_count_logical': 128, 'cpu_freq': {'current': 2454.989375, 'min': 1500.0, 'max': 2450.0}, 'cpu_freq_per_core': [{'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 
2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2395.346, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 
'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 3243.016, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 
1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2394.373, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}], 'disk': {'/': {'total': 125.80015182495117, 'used': 6.478443145751953}}, 'memory': {'total': 251.6003074645996}}
2024-04-30 19:40:13,712 INFO HandlerThread:910101 [system_monitor.py:probe():224] Finished collecting system info
2024-04-30 19:40:13,712 INFO HandlerThread:910101 [system_monitor.py:probe():227] Publishing system info
2024-04-30 19:40:13,712 DEBUG HandlerThread:910101 [system_info.py:_save_conda():207] Saving list of conda packages installed into the current environment
And here is the debug.log.
2024-04-30 19:40:13,532 INFO MainThread:909868 [wandb_setup.py:_flush():76] Current SDK version is 0.16.6
2024-04-30 19:40:13,532 INFO MainThread:909868 [wandb_setup.py:_flush():76] Configure stats pid to 909868
2024-04-30 19:40:13,532 INFO MainThread:909868 [wandb_setup.py:_flush():76] Loading settings from /u/mjvolk3/.config/wandb/settings
2024-04-30 19:40:13,532 INFO MainThread:909868 [wandb_setup.py:_flush():76] Loading settings from /projects/bbub/mjvolk3/torchcell/wandb/settings
2024-04-30 19:40:13,532 INFO MainThread:909868 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'entity': 'zhao-group', 'project': 'torchcell_smf-dmf-tmf-001_trad-ml_svr_1e03', 'sweep_id': 'enfpup0l', 'root_dir': '/projects/bbub/mjvolk3/torchcell', 'run_id': 'nyj2wnqc', 'sweep_param_path': '/projects/bbub/mjvolk3/torchcell/wandb/sweep-enfpup0l/config-nyj2wnqc.yaml'}
2024-04-30 19:40:13,532 INFO MainThread:909868 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-04-30 19:40:13,532 INFO MainThread:909868 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'experiments/smf-dmf-tmf-001/svr.py', 'program_abspath': '/projects/bbub/mjvolk3/torchcell/experiments/smf-dmf-tmf-001/svr.py', 'program': '/projects/bbub/mjvolk3/torchcell/experiments/smf-dmf-tmf-001/svr.py'}
2024-04-30 19:40:13,533 INFO MainThread:909868 [wandb_init.py:_log_setup():521] Logging user logs to /projects/bbub/mjvolk3/torchcell/wandb/offline-run-20240430_194013-nyj2wnqc/logs/debug.log
2024-04-30 19:40:13,533 INFO MainThread:909868 [wandb_init.py:_log_setup():522] Logging internal logs to /projects/bbub/mjvolk3/torchcell/wandb/offline-run-20240430_194013-nyj2wnqc/logs/debug-internal.log
2024-04-30 19:40:13,533 INFO MainThread:909868 [wandb_init.py:init():561] calling init triggers
2024-04-30 19:40:13,533 INFO MainThread:909868 [wandb_init.py:init():568] wandb.init called with sweep_config: {'cell_dataset': {'aggregation': 'sum', 'graphs': None, 'is_pert': True, 'max_size': 1000, 'node_embeddings': ['nt_window_5979']}, 'data_module': {'batch_size': 16, 'num_workers': 6, 'pin_memory': True}, 'svr': {'C': 0.1, 'gamma': 0.01, 'kernel': 'rbf'}}
config: {'hydra_logging': {'loggers': {'logging_example': {'level': 'INFO'}}}, 'program': 'experiments/smf-dmf-tmf-001/svr.py', 'wandb': {'mode': 'online', 'project': 'torchcell_smf-dmf-tmf-001_trad-ml_svr', 'tags': []}, 'cell_dataset': {'graphs': None, 'node_embeddings': ['codon_frequency'], 'max_size': 1000.0, 'is_pert': True, 'aggregation': 'sum'}, 'data_module': {'batch_size': 16, 'num_workers': 6, 'pin_memory': True}, 'svr': {'kernel': 'linear', 'C': 1.0, 'gamma': 0.1}, 'command': ['python', 'experiments/smf-dmf-tmf-001/svr.py']}
2024-04-30 19:40:13,533 INFO MainThread:909868 [wandb_init.py:init():611] starting backend
2024-04-30 19:40:13,533 INFO MainThread:909868 [wandb_init.py:init():615] setting up manager
2024-04-30 19:40:13,542 INFO MainThread:909868 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-04-30 19:40:13,545 INFO MainThread:909868 [wandb_init.py:init():623] backend started and connected
2024-04-30 19:40:13,560 INFO MainThread:909868 [wandb_run.py:_config_callback():1347] config_cb None None {'cell_dataset': {'graphs': None, 'node_embeddings': ['nt_window_5979'], 'max_size': 1000, 'is_pert': True, 'aggregation': 'sum'}, 'data_module': {'batch_size': 16, 'num_workers': 6, 'pin_memory': True}, 'svr': {'kernel': 'rbf', 'C': 0.1, 'gamma': 0.01}}
2024-04-30 19:40:13,561 INFO MainThread:909868 [wandb_init.py:init():715] updated telemetry
2024-04-30 19:40:13,580 INFO MainThread:909868 [wandb_init.py:init():748] communicating run to backend with 90.0 second timeout
2024-04-30 19:40:13,585 INFO MainThread:909868 [wandb_init.py:init():799] starting run threads in backend
This is on wandb 0.16.6 (the SDK version reported in debug.log). The error says it timed out after 30.0 seconds, even though the script sets WANDB__SERVICE_WAIT to 300, so somehow the environment variable is not being recognized.
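One thing I could try is setting the variable at import time instead of inside main(), and printing what the process actually sees. This is a minimal sketch, not a confirmed fix. It assumes the 30 s timeout is read before main() runs, or by a different process such as the sweep agent (the debug.log shows a sweep_id), which I have not verified. `report_wandb_env` is a hypothetical helper, not part of wandb.

```python
import os

# Set the timeout as early as possible, before hydra or wandb are imported.
# setdefault keeps any value already exported by the shell or job script.
os.environ.setdefault("WANDB__SERVICE_WAIT", "300")


def report_wandb_env(env=None) -> dict:
    """Return every WANDB_* variable visible to this process (hypothetical helper)."""
    env = os.environ if env is None else env
    return {k: v for k, v in env.items() if k.startswith("WANDB_")}


# Print the variables so the job's standard output shows what was in effect.
print(report_wandb_env())
```

If the failing process is the agent rather than my script, the variable would presumably have to be exported in the shell or SLURM script that launches the agent, since setting os.environ inside the script only affects that script's own process and its children.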