Repeated Timeouts in Wandb Init

I am running training scripts on our university cluster and keep getting the following timeout error:

wandb.sdk.service.service.ServiceStartTimeoutError: Timed out waiting for wandb service to start after 30.0 seconds. Try increasing the timeout with the `_service_wait` setting.

I have tried increasing `_service_wait` as suggested, but I continue to get the error message saying only 30 seconds was waited. I have also tried setting the corresponding environment variable at the beginning of the script, with the same result. I've read some of the related posts, but it is unclear how those issues were resolved.

    wandb.init(
        # mode=wandb_cfg["wandb"]["mode"],
        mode="disabled",
        project=wandb_cfg["wandb"]["project"],
        config=wandb_cfg,
        group=group,
        tags=wandb_cfg["wandb"]["tags"],
        settings=wandb.Settings(
            _service_wait=600,
            init_timeout=600,
        ),
    )
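For context, here is how I understand the two timeout knobs are supposed to interact. `resolve_service_wait` is a hypothetical helper written only to illustrate the precedence as I understand it, not anything from the wandb SDK itself:

```python
import os


def resolve_service_wait(explicit=None, env=None, default=30.0):
    """Sketch of the expected precedence for the service-start timeout:
    an explicit settings value wins, then the WANDB__SERVICE_WAIT
    environment variable, then the 30-second default. Illustration only,
    not the SDK's actual resolution code."""
    if env is None:
        env = os.environ
    if explicit is not None:
        return float(explicit)
    if "WANDB__SERVICE_WAIT" in env:
        return float(env["WANDB__SERVICE_WAIT"])
    return default
```

Given that, either the `settings` value or the environment variable should have raised the timeout above 30 seconds, yet the error still reports 30.0 seconds.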

Hi @mjvolk3! Thank you for writing in!

Could you please check whether you are able to run a simple wandb script on your cluster?

import wandb
run = wandb.init()
run.log({"test":123})
run.finish()

Warmly,
Artsiom

Here is the slurm script:

#!/bin/bash
#SBATCH --mem=8g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1 # Increase the number of tasks per node to the desired number of agents
#SBATCH --cpus-per-task=1 # Adjust the number of CPUs per task based on your requirements
#SBATCH --partition=cpu
#SBATCH --account=bbub-delta-cpu
#SBATCH --job-name=wandb_test
#SBATCH --time=4:00:00
#SBATCH --constraint="scratch"
#SBATCH --mail-user=mjvolk3@illinois.edu
#SBATCH --mail-type="END"
#SBATCH --output=/projects/bbub/mjvolk3/torchcell/experiments/smf-dmf-tmf-001/slurm/output/%x_%j.out
#SBATCH --error=/projects/bbub/mjvolk3/torchcell/experiments/smf-dmf-tmf-001/slurm/output/%x_%j.out

module reset
source ~/.bashrc
cd /scratch/bbub/mjvolk3/torchcell
pwd
lscpu
cat /proc/meminfo
module list
conda activate /projects/bbub/miniconda3/envs/torchcell
python experiments/smf-dmf-tmf-001/wandb_test.py

This is the Python script, experiments/smf-dmf-tmf-001/wandb_test.py:

import wandb

run = wandb.init(mode="online", project="wandb_test")
run.log({"test": 123})
run.finish()

Here is the stdout.

Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: None
Tue Apr 30 19:44:35 CDT 2024 - Starting to source .bashrc
Tue Apr 30 19:44:35 CDT 2024 - Sourcing global definitions...
Tue Apr 30 19:44:36 CDT 2024 - Global definitions sourced.
Tue Apr 30 19:44:36 CDT 2024 - Setting up user-specific environment...
Tue Apr 30 19:44:36 CDT 2024 - User-specific environment set.
Tue Apr 30 19:44:36 CDT 2024 - Initializing Conda...
Tue Apr 30 19:44:37 CDT 2024 - Conda initialized.
Tue Apr 30 19:44:37 CDT 2024 - .bashrc sourced successfully.
/scratch/bbub/mjvolk3/torchcell
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        8
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7763 64-Core Processor
Stepping:            1
CPU MHz:             3242.946
CPU max MHz:         2450.0000
CPU min MHz:         1500.0000
BogoMIPS:            4890.67
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31
NUMA node2 CPU(s):   32-47
NUMA node3 CPU(s):   48-63
NUMA node4 CPU(s):   64-79
NUMA node5 CPU(s):   80-95
NUMA node6 CPU(s):   96-111
NUMA node7 CPU(s):   112-127
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
MemTotal:       263822040 kB
MemFree:        231266244 kB
MemAvailable:   244901908 kB
Buffers:            2116 kB
Cached:         19011140 kB
SwapCached:            0 kB
Active:          7399492 kB
Inactive:       13953312 kB
Active(anon):    2751608 kB
Inactive(anon):  6301988 kB
Active(file):    4647884 kB
Inactive(file):  7651324 kB
Unevictable:      147240 kB
Mlocked:          144212 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               224 kB
Writeback:             0 kB
AnonPages:       2484512 kB
Mapped:           541732 kB
Shmem:           6714040 kB
KReclaimable:    3584224 kB
Slab:            9510360 kB
SReclaimable:    3584224 kB
SUnreclaim:      5926136 kB
KernelStack:       38912 kB
PageTables:        46832 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    131911020 kB
Committed_AS:   11263752 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      374996 kB
VmallocChunk:          0 kB
Percpu:           417280 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1740800 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      601012 kB
DirectMap2M:    73574400 kB
DirectMap1G:    193986560 kB

Currently Loaded Modules:
  1) gcc/11.4.0      3) cuda/11.8.0         5) slurm-env/0.1
  2) openmpi/4.1.6   4) cue-login-env/1.0   6) default-s11

 

wandb: Currently logged in as: mjvolk3 (zhao-group). Use `wandb login --relogin` to force relogin
wandb: - Waiting for wandb.init()...
wandb: \ Waiting for wandb.init()...
wandb: | Waiting for wandb.init()...
wandb: / Waiting for wandb.init()...
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/__main__.py", line 3, in <module>
    cli.cli(prog_name="python -m wandb")
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/cli/cli.py", line 105, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/cli/cli.py", line 289, in service
    server.serve()
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/service/server.py", line 118, in serve
    mux.loop()
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/service/streams.py", line 423, in loop
    raise e
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/service/streams.py", line 421, in loop
    self._loop()
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/service/streams.py", line 414, in _loop
    self._process_action(action)
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/service/streams.py", line 376, in _process_action
    self._process_add(action)
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/service/streams.py", line 224, in _process_add
    stream.start_thread(thread)
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/service/streams.py", line 81, in start_thread
    self._wait_thread_active()
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/service/streams.py", line 86, in _wait_thread_active
    assert result
AssertionError
--- Logging error ---
Traceback (most recent call last):
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/logging/__init__.py", line 1114, in emit
    self.flush()
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/logging/__init__.py", line 1094, in flush
    self.stream.flush()
OSError: [Errno 5] Input/output error
Call stack:
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/threading.py", line 1002, in _bootstrap
    self._bootstrap_inner()
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/service/streams.py", line 49, in run
    self._target(**self._kwargs)
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/internal/internal.py", line 86, in wandb_internal
    logger.info(
Message: 'W&B internal server running at pid: %s, started at: %s'
Arguments: (3140307, datetime.datetime(2024, 4, 30, 19, 44, 46, 663076))
Problem at: /projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/wandb_init.py 849 getcaller
wandb: ERROR Run initialization has timed out after 90.0 sec. 
wandb: ERROR Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
Traceback (most recent call last):
  File "/scratch/bbub/mjvolk3/torchcell/experiments/smf-dmf-tmf-001/wandb_test.py", line 2, in <module>
    run = wandb.init(mode="online", project="wandb_test")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/wandb_init.py", line 1200, in init
    raise e
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/wandb_init.py", line 1181, in init
    run = wi.init()
          ^^^^^^^^^
  File "/projects/bbub/miniconda3/envs/torchcell/lib/python3.11/site-packages/wandb/sdk/wandb_init.py", line 780, in init
    raise error
wandb.errors.CommError: Run initialization has timed out after 90.0 sec. 
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-

Sweet, thank you for testing it out.

Could you please provide the debug.log and debug-internal.log files associated with the run where you are running into this issue? These files should be located in the wandb folder relative to your working directory.

Those runs seem to have produced no wandb directories. I was using version 0.16.6. Upon reverting to 0.16.0 I got a working run, at least for this script (see the related GitHub discussion).

Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: None
Wed May  1 23:02:30 CDT 2024 - Starting to source .bashrc
Wed May  1 23:02:30 CDT 2024 - Sourcing global definitions...
Wed May  1 23:02:31 CDT 2024 - Global definitions sourced.
Wed May  1 23:02:31 CDT 2024 - Setting up user-specific environment...
Wed May  1 23:02:31 CDT 2024 - User-specific environment set.
Wed May  1 23:02:31 CDT 2024 - Initializing Conda...
Wed May  1 23:02:58 CDT 2024 - Conda initialized.
Wed May  1 23:02:58 CDT 2024 - .bashrc sourced successfully.
/scratch/bbub/mjvolk3/torchcell
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        8
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7763 64-Core Processor
Stepping:            1
CPU MHz:             3172.048
CPU max MHz:         2450.0000
CPU min MHz:         1500.0000
BogoMIPS:            4890.75
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31
NUMA node2 CPU(s):   32-47
NUMA node3 CPU(s):   48-63
NUMA node4 CPU(s):   64-79
NUMA node5 CPU(s):   80-95
NUMA node6 CPU(s):   96-111
NUMA node7 CPU(s):   112-127
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
MemTotal:       263822044 kB
MemFree:        247510540 kB
MemAvailable:   247115268 kB
Buffers:            2112 kB
Cached:          7673448 kB
SwapCached:            0 kB
Active:          2675624 kB
Inactive:       10324256 kB
Active(anon):    2617728 kB
Inactive(anon):  9791620 kB
Active(file):      57896 kB
Inactive(file):   532636 kB
Unevictable:       80940 kB
Mlocked:           77868 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               476 kB
Writeback:             0 kB
AnonPages:       5402112 kB
Mapped:           899536 kB
Shmem:           7085012 kB
KReclaimable:     396628 kB
Slab:            2004304 kB
SReclaimable:     396628 kB
SUnreclaim:      1607676 kB
KernelStack:       37760 kB
PageTables:       159276 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    131911020 kB
Committed_AS:   18353004 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      296476 kB
VmallocChunk:          0 kB
Percpu:           400896 kB
HardwareCorrupted:     0 kB
AnonHugePages:   2017280 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      361396 kB
DirectMap2M:    24530944 kB
DirectMap1G:    243269632 kB

Currently Loaded Modules:
  1) gcc/11.4.0      3) cuda/11.8.0         5) slurm-env/0.1
  2) openmpi/4.1.6   4) cue-login-env/1.0   6) default-s11

 

wandb: Currently logged in as: mjvolk3 (zhao-group). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.16.6 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.0
wandb: Run data is saved locally in /scratch/bbub/mjvolk3/torchcell/wandb/run-20240501_230615-tzhyx1qz
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run magic-brook-1
wandb: ⭐️ View project at https://wandb.ai/zhao-group/wandb_test
wandb: 🚀 View run at https://wandb.ai/zhao-group/wandb_test/runs/tzhyx1qz
wandb: - 0.000 MB of 0.000 MB uploaded
wandb: \ 0.000 MB of 0.000 MB uploaded
wandb: | 0.023 MB of 0.023 MB uploaded
wandb:                                                                                
wandb: 
wandb: Run history:
wandb: test ▁
wandb: 
wandb: Run summary:
wandb: test 123
wandb: 
wandb: 🚀 View run magic-brook-1 at: https://wandb.ai/zhao-group/wandb_test/runs/tzhyx1qz
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240501_230615-tzhyx1qz/logs

Now that I think about it, these issues mostly started arising after I did an update.

I still have issues with other scripts where wandb never initializes even after 300 seconds, despite `os.environ["WANDB__SERVICE_WAIT"] = "300"`.

I am actually still getting the same issue, just not with the wandb test script shown above. I cannot predict what causes the failure.

On one of my scripts, the main function starts like this:

# imports added here for completeness; the original snippet started at load_dotenv()
import hashlib
import json
import os
import uuid

import hydra
import wandb
from dotenv import load_dotenv
from omegaconf import DictConfig, OmegaConf

load_dotenv()
DATA_ROOT = os.getenv("DATA_ROOT")

@hydra.main(version_base=None, config_path="conf", config_name="random-forest")
def main(cfg: DictConfig) -> None:
    os.environ["WANDB__SERVICE_WAIT"] = "300"
    wandb_cfg = OmegaConf.to_container(cfg, resolve=True, throw_on_missing=True)
    slurm_job_id = os.environ.get("SLURM_JOB_ID", uuid.uuid4())
    sorted_cfg = json.dumps(wandb_cfg, sort_keys=True)
    hashed_cfg = hashlib.sha256(sorted_cfg.encode("utf-8")).hexdigest()
    group = f"{slurm_job_id}_{hashed_cfg}"
    wandb.init(
        mode="online",
        project=wandb_cfg["wandb"]["project"],
        config=wandb_cfg,
        group=group,
        tags=wandb_cfg["wandb"]["tags"],
    )
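As a side note, the group-naming logic in that function can be pulled out and sanity-checked on its own. This is a sketch using only the standard library; `make_group` is a hypothetical helper name:

```python
import hashlib
import json
import uuid


def make_group(cfg: dict, slurm_job_id=None) -> str:
    """Deterministic W&B group name: identical configs always hash to
    the same suffix, so tasks launched from one sweep share a group."""
    job_id = slurm_job_id if slurm_job_id is not None else uuid.uuid4()
    sorted_cfg = json.dumps(cfg, sort_keys=True)
    hashed_cfg = hashlib.sha256(sorted_cfg.encode("utf-8")).hexdigest()
    return f"{job_id}_{hashed_cfg}"
```

Because the config is serialized with sort_keys=True, key order in the config dict does not change the group name.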

It fails before it gets to the init. Here is the standard output.

wandb: ERROR Find detailed error logs at: /projects/bbub/mjvolk3/torchcell/wandb/debug-cli.mjvolk3.log
Error: Timed out waiting for wandb service to start after 30.0 seconds. Try increasing the timeout with the `_service_wait` setting.

Here is the debug-internal.log:

2024-04-30 19:40:13,544 INFO    StreamThr :910101 [internal.py:wandb_internal():86] W&B internal server running at pid: 910101, started at: 2024-04-30 19:40:13.543642
2024-04-30 19:40:13,546 DEBUG   HandlerThread:910101 [handler.py:handle_request():146] handle_request: status
2024-04-30 19:40:13,562 INFO    WriterThread:910101 [datastore.py:open_for_write():87] open: /projects/bbub/mjvolk3/torchcell/wandb/offline-run-20240430_194013-nyj2wnqc/run-nyj2wnqc.wandb
2024-04-30 19:40:13,633 DEBUG   HandlerThread:910101 [handler.py:handle_request():146] handle_request: run_start
2024-04-30 19:40:13,634 DEBUG   HandlerThread:910101 [system_info.py:__init__():26] System info init
2024-04-30 19:40:13,634 DEBUG   HandlerThread:910101 [system_info.py:__init__():41] System info init done
2024-04-30 19:40:13,634 INFO    HandlerThread:910101 [system_monitor.py:start():194] Starting system monitor
2024-04-30 19:40:13,635 INFO    SystemMonitor:910101 [system_monitor.py:_start():158] Starting system asset monitoring threads
2024-04-30 19:40:13,635 INFO    HandlerThread:910101 [system_monitor.py:probe():214] Collecting system info
2024-04-30 19:40:13,635 INFO    SystemMonitor:910101 [interfaces.py:start():190] Started cpu monitoring
2024-04-30 19:40:13,636 INFO    SystemMonitor:910101 [interfaces.py:start():190] Started disk monitoring
2024-04-30 19:40:13,637 INFO    SystemMonitor:910101 [interfaces.py:start():190] Started memory monitoring
2024-04-30 19:40:13,638 INFO    SystemMonitor:910101 [interfaces.py:start():190] Started network monitoring
2024-04-30 19:40:13,682 DEBUG   HandlerThread:910101 [system_info.py:probe():150] Probing system
2024-04-30 19:40:13,685 DEBUG   HandlerThread:910101 [system_info.py:_probe_git():135] Probing git
2024-04-30 19:40:13,712 DEBUG   HandlerThread:910101 [system_info.py:_probe_git():143] Probing git done
2024-04-30 19:40:13,712 DEBUG   HandlerThread:910101 [system_info.py:probe():198] Probing system done
2024-04-30 19:40:13,712 DEBUG   HandlerThread:910101 [system_monitor.py:probe():223] {'os': 'Linux-4.18.0-477.51.1.el8_8.x86_64-x86_64-with-glibc2.28', 'python': '3.11.7', 'heartbeatAt': '2024-05-01T00:40:13.682839', 'startedAt': '2024-05-01T00:40:13.516011', 'docker': None, 'cuda': None, 'args': (), 'state': 'running', 'program': '/projects/bbub/mjvolk3/torchcell/experiments/smf-dmf-tmf-001/svr.py', 'codePathLocal': 'experiments/smf-dmf-tmf-001/svr.py', 'codePath': 'experiments/smf-dmf-tmf-001/svr.py', 'git': {'remote': 'https://github.com/Mjvolk3/torchcell', 'commit': '536b456073f74a3517c452ed2fa40f740aa8d1a0'}, 'email': 'mjvolk3@illinois.edu', 'root': '/projects/bbub/mjvolk3/torchcell', 'host': 'cn004.delta.ncsa.illinois.edu', 'username': 'mjvolk3', 'executable': '/projects/bbub/miniconda3/envs/torchcell/bin/python', 'cpu_count': 128, 'cpu_count_logical': 128, 'cpu_freq': {'current': 2454.989375, 'min': 1500.0, 'max': 2450.0}, 'cpu_freq_per_core': [{'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 
2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2395.346, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 
'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 3243.016, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 
1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2394.373, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}, {'current': 2450.0, 'min': 1500.0, 'max': 2450.0}], 'disk': {'/': {'total': 125.80015182495117, 'used': 6.478443145751953}}, 'memory': {'total': 251.6003074645996}}
2024-04-30 19:40:13,712 INFO    HandlerThread:910101 [system_monitor.py:probe():224] Finished collecting system info
2024-04-30 19:40:13,712 INFO    HandlerThread:910101 [system_monitor.py:probe():227] Publishing system info
2024-04-30 19:40:13,712 DEBUG   HandlerThread:910101 [system_info.py:_save_conda():207] Saving list of conda packages installed into the current environment

And here is the debug.log.

2024-04-30 19:40:13,532 INFO    MainThread:909868 [wandb_setup.py:_flush():76] Current SDK version is 0.16.6
2024-04-30 19:40:13,532 INFO    MainThread:909868 [wandb_setup.py:_flush():76] Configure stats pid to 909868
2024-04-30 19:40:13,532 INFO    MainThread:909868 [wandb_setup.py:_flush():76] Loading settings from /u/mjvolk3/.config/wandb/settings
2024-04-30 19:40:13,532 INFO    MainThread:909868 [wandb_setup.py:_flush():76] Loading settings from /projects/bbub/mjvolk3/torchcell/wandb/settings
2024-04-30 19:40:13,532 INFO    MainThread:909868 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'entity': 'zhao-group', 'project': 'torchcell_smf-dmf-tmf-001_trad-ml_svr_1e03', 'sweep_id': 'enfpup0l', 'root_dir': '/projects/bbub/mjvolk3/torchcell', 'run_id': 'nyj2wnqc', 'sweep_param_path': '/projects/bbub/mjvolk3/torchcell/wandb/sweep-enfpup0l/config-nyj2wnqc.yaml'}
2024-04-30 19:40:13,532 INFO    MainThread:909868 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-04-30 19:40:13,532 INFO    MainThread:909868 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'experiments/smf-dmf-tmf-001/svr.py', 'program_abspath': '/projects/bbub/mjvolk3/torchcell/experiments/smf-dmf-tmf-001/svr.py', 'program': '/projects/bbub/mjvolk3/torchcell/experiments/smf-dmf-tmf-001/svr.py'}
2024-04-30 19:40:13,533 INFO    MainThread:909868 [wandb_init.py:_log_setup():521] Logging user logs to /projects/bbub/mjvolk3/torchcell/wandb/offline-run-20240430_194013-nyj2wnqc/logs/debug.log
2024-04-30 19:40:13,533 INFO    MainThread:909868 [wandb_init.py:_log_setup():522] Logging internal logs to /projects/bbub/mjvolk3/torchcell/wandb/offline-run-20240430_194013-nyj2wnqc/logs/debug-internal.log
2024-04-30 19:40:13,533 INFO    MainThread:909868 [wandb_init.py:init():561] calling init triggers
2024-04-30 19:40:13,533 INFO    MainThread:909868 [wandb_init.py:init():568] wandb.init called with sweep_config: {'cell_dataset': {'aggregation': 'sum', 'graphs': None, 'is_pert': True, 'max_size': 1000, 'node_embeddings': ['nt_window_5979']}, 'data_module': {'batch_size': 16, 'num_workers': 6, 'pin_memory': True}, 'svr': {'C': 0.1, 'gamma': 0.01, 'kernel': 'rbf'}}
config: {'hydra_logging': {'loggers': {'logging_example': {'level': 'INFO'}}}, 'program': 'experiments/smf-dmf-tmf-001/svr.py', 'wandb': {'mode': 'online', 'project': 'torchcell_smf-dmf-tmf-001_trad-ml_svr', 'tags': []}, 'cell_dataset': {'graphs': None, 'node_embeddings': ['codon_frequency'], 'max_size': 1000.0, 'is_pert': True, 'aggregation': 'sum'}, 'data_module': {'batch_size': 16, 'num_workers': 6, 'pin_memory': True}, 'svr': {'kernel': 'linear', 'C': 1.0, 'gamma': 0.1}, 'command': ['python', 'experiments/smf-dmf-tmf-001/svr.py']}
2024-04-30 19:40:13,533 INFO    MainThread:909868 [wandb_init.py:init():611] starting backend
2024-04-30 19:40:13,533 INFO    MainThread:909868 [wandb_init.py:init():615] setting up manager
2024-04-30 19:40:13,542 INFO    MainThread:909868 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-04-30 19:40:13,545 INFO    MainThread:909868 [wandb_init.py:init():623] backend started and connected
2024-04-30 19:40:13,560 INFO    MainThread:909868 [wandb_run.py:_config_callback():1347] config_cb None None {'cell_dataset': {'graphs': None, 'node_embeddings': ['nt_window_5979'], 'max_size': 1000, 'is_pert': True, 'aggregation': 'sum'}, 'data_module': {'batch_size': 16, 'num_workers': 6, 'pin_memory': True}, 'svr': {'kernel': 'rbf', 'C': 0.1, 'gamma': 0.01}}
2024-04-30 19:40:13,561 INFO    MainThread:909868 [wandb_init.py:init():715] updated telemetry
2024-04-30 19:40:13,580 INFO    MainThread:909868 [wandb_init.py:init():748] communicating run to backend with 90.0 second timeout
2024-04-30 19:40:13,585 INFO    MainThread:909868 [wandb_init.py:init():799] starting run threads in backend

This is on wandb 0.16.0. You can see that the error still says it timed out after 30.0 s, which means the environment variable is somehow not being recognized.

Hi @mjvolk3, apologies for the delayed reply; this thread got completely buried.

When you are mentioning this:

I am actually still getting the same issue but not with the wandb test script shown. I cannot predict what causes the failure.

Do you mean this is still happening to you on 0.16.0?

Hi Michael,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Yes, I mean that it happens with 0.16.0. I think the environment variable os.environ["WANDB__SERVICE_WAIT"] = "300" is not being picked up properly by wandb.

Thank you so much for elaborating. I will try running this on my side to see whether I can reproduce the behavior where os.environ["WANDB__SERVICE_WAIT"] = "300" has no effect.