wandb.sdk.wandb_manager.ManagerConnectionRefusedError when running subprocess

Hi everyone.

I’m having a problem when my Python script launches a subprocess (with Popen) that runs a shell command (Slurm sbatch).
The sbatch command starts a job on a different node and looks like this:
import shutil
from subprocess import Popen

# `task` and `checkpoint_path` are defined earlier in the script.
p = Popen(
    [
        shutil.which("sbatch"),
        # Slurm resource and logging options
        '--mem=40G',
        '--gres=gpu:titan_xp:1',
        '--nodelist=tikgpu02',
        '--cpus-per-task=2',
        '--output=/home/pschlaepfer/denselp/slt/log/%j.out',
        '--error=/home/pschlaepfer/denselp/slt/log/%j.err',
        # Job script and its arguments
        '/home/pschlaepfer/denselp/slt/scripts/slt.sh',
        '--action=fine-tune-thf',
        '--max-length', '128',
        '--lr=4e-5',
        '--epochs=5',
        '--batch-size=16',
        '--task', task,
        '--pre-trained-path', checkpoint_path,
        '--wandb-mode=offline',
    ],
    start_new_session=True,
)

While initializing wandb, it throws the following error:

wandb.sdk.wandb_manager.ManagerConnectionRefusedError: Connection to wandb service failed since the process is not available.

(I attached the whole stack trace at the end.)

Wandb is initialized as follows:

experiment_name = f"job-id:{meta_config.job_id}"
run = wandb.init(
  project=wandb_project_choice+("-proto" if meta_config.is_debug_instance else ""),
  name=experiment_name,
  tags=[
    "job_id:"+str(meta_config.job_id)
  ],
  settings=wandb.Settings(start_method='fork'),
  dir=wandb_logging_dir_path,
  config=dict(experiment_config._asdict()) if type(experiment_config).__name__ == 'ExperimentConfig' else dict(experiment_config._as_dict()),
  reinit=True,
  mode="offline",
)

I’m using wandb version 0.16.0.

The script works when I run the sbatch command manually from the terminal.

Thank you very much for your help!

Stack trace:

Traceback (most recent call last):
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 115, in _service_connect
    svc_iface._svc_connect(port=port)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/service/service_sock.py", line 30, in _svc_connect
    self._sock_client.connect(port=port)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 102, in connect
    s.connect(("localhost", port))
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/pschlaepfer/denselp/slt/main.py", line 107, in <module>
    run = wandb.init(
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1185, in init
    raise e
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1162, in init
    wi.setup(kwargs)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 189, in setup
    self._wl = wandb_setup.setup(settings=setup_settings)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 327, in setup
    ret = _setup(settings=settings)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 320, in _setup
    wl = _WandbSetup(settings=settings)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 303, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 114, in __init__
    self._setup()
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 250, in _setup
    self._setup_manager()
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 277, in _setup_manager
    self._manager = wandb_manager._Manager(settings=self._settings)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 152, in __init__
    wandb._sentry.reraise(e)
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/analytics/sentry.py", line 154, in reraise
    raise exc.with_traceback(sys.exc_info()[2])
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 150, in __init__
    self._service_connect()
  File "/itet-stor/pschlaepfer/net_scratch/conda_envs/denselp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 124, in _service_connect
    raise ManagerConnectionRefusedError(message)
wandb.sdk.wandb_manager.ManagerConnectionRefusedError: Connection to wandb service failed since the process is not available.

Hi @phil-schlaepfer, when wandb initializes, it spins up a separate service process, and the error suggests that your main process cannot connect to it. I’m not aware of anything specific to Popen that would block this communication, though. Have you tried running this without setting start_method="fork"?
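
One way to narrow this down is to compare the environment the job sees when it is launched through Popen with what it sees when sbatch is run from the terminal. The sketch below is only a diagnostic aid and assumes it is pasted near the top of main.py, before wandb.init. The helper name `wandb_env_snapshot` is hypothetical and not part of wandb. It lists the WANDB_* variables visible to the process and masks anything that looks like a key, so the output can go into the job log.

```python
import os


def wandb_env_snapshot(environ=None):
    """Return the WANDB_* variables visible to this process, sorted by name.

    Hypothetical helper for debugging; it is not part of the wandb package.
    Values of variables whose name contains "KEY" are masked.
    """
    environ = os.environ if environ is None else environ
    snapshot = {}
    for name in sorted(environ):
        if not name.startswith("WANDB_"):
            continue
        snapshot[name] = "***" if "KEY" in name else environ[name]
    return snapshot


if __name__ == "__main__":
    # Print one NAME=value line per variable so two job logs can be diffed.
    for name, value in wandb_env_snapshot().items():
        print(f"{name}={value}")
```

If the two launches print different sets of variables, that difference is a reasonable place to start looking.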

Thank you,
Nate

Hi @phil-schlaepfer, I wanted to follow up and see if you had a chance to try this without using “fork”?

Thank you,
Nate

Same issue and situation here (starting a subprocess to run eval on another Slurm node). Removing start_method="fork" didn’t help.
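
Since removing start_method="fork" did not help, one possibility worth ruling out (an assumption, not something confirmed in this thread) is that the launching script has itself already called wandb.init. In that case wandb may have stored a token for the parent's local service process in the environment, and sbatch by default exports the submitting environment to the job, so the job on the other node could end up trying to connect to a service that only exists on the submitting node. The variable name `WANDB_SERVICE` below is my understanding of wandb's internals, so treat it as an assumption, and `env_without_wandb_service` is a hypothetical helper. A minimal sketch of launching a child process with that variable removed:

```python
import os
import subprocess
import sys


def env_without_wandb_service(environ=None, names=("WANDB_SERVICE",)):
    """Return a copy of the environment without the given variables.

    Hypothetical helper. "WANDB_SERVICE" is assumed to be the variable
    in which wandb keeps the token of an already running service process.
    """
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k not in names}


if __name__ == "__main__":
    # Simulate a parent process that already holds a service token.
    os.environ["WANDB_SERVICE"] = "placeholder-token"

    # The child receives the scrubbed copy and reports whether it still
    # sees the variable.
    subprocess.run(
        [sys.executable, "-c", "import os; print('WANDB_SERVICE' in os.environ)"],
        env=env_without_wandb_service(),
        check=True,
    )
```

In the original script this would correspond to passing env=env_without_wandb_service() to the Popen call that runs sbatch. An sbatch started by hand from a fresh terminal would not carry such a variable, which would be consistent with the manual run working.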
