Wandb fails at init (assert ports_found)

Hello,

I am running into an intermittent issue where some of my training runs fail, even though the exact same code is run each time. I get the following error:

Traceback (most recent call last):
  File "/project_dir/lib/python3.9/site-packages/wandb/sdk/wandb_init.py", line 1040, in init
    wi.setup(kwargs)
  File "/project_dir/lib/python3.9/site-packages/wandb/sdk/wandb_init.py", line 151, in setup
    self._wl = wandb_setup.setup()
  File "/project_dir/lib/python3.9/site-packages/wandb/sdk/wandb_setup.py", line 320, in setup
    ret = _setup(settings=settings)
  File "/project_dir/lib/python3.9/site-packages/wandb/sdk/wandb_setup.py", line 315, in _setup
    wl = _WandbSetup(settings=settings)
  File "/project_dir/lib/python3.9/site-packages/wandb/sdk/wandb_setup.py", line 301, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/project_dir/lib/python3.9/site-packages/wandb/sdk/wandb_setup.py", line 114, in __init__
    self._setup()
  File "/project_dir/lib/python3.9/site-packages/wandb/sdk/wandb_setup.py", line 242, in _setup
    self._setup_manager()
  File "/project_dir/lib/python3.9/site-packages/wandb/sdk/wandb_setup.py", line 273, in _setup_manager
    self._manager = wandb_manager._Manager(
  File "/project_dir/lib/python3.9/site-packages/wandb/sdk/wandb_manager.py", line 106, in __init__
    self._service.start()
  File "/project_dir/lib/python3.9/site-packages/wandb/sdk/service/service.py", line 106, in start
    self._launch_server()
  File "/project_dir/lib/python3.9/site-packages/wandb/sdk/service/service.py", line 102, in _launch_server
    assert ports_found
AssertionError

Unfortunately, this error occurs before a folder is created in /project_dir/wandb/, so I cannot find a more descriptive error message in debug-internal.log. As mentioned, the issue happens only intermittently. My code is being run on a compute cluster.

Hi @kfallah, happy to look into this for you. Could you please provide a bit more context on the training environment:

  • Brief description of your experiment setup and how you are using wandb to track your experiments
  • Example code block of this setup that might help us reproduce
  • Wandb Client Version you are using

Hello, thank you! I am using wandb 0.13.5 with python 3.9. Note that this is run on my school’s compute cluster with other potential wandb users. Also, note that this issue inconsistently occurs (same code run twice sometimes fails and sometimes does not). My original workflow was an ML script with a PyTorch training loop, where after reading a .json config file, I initialize wandb with the following:

    wandb.init(
        project=..., entity=..., config=config, mode="offline" if args.disable_wandb else "online"
    )

I also set the WANDB_API_KEY environment variable with my API key. My program throws the error on the wandb.init call.

To try and fix this, I deleted the wandb folder in my project directory and ran wandb login --relogin in the command line, and then got the following error:

Traceback (most recent call last):
  File "/home_dir_path/.conda/envs/manifold-contrastive/bin/wandb", line 8, in <module>
    sys.exit(cli())
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/storage/home/hcoda1/0/kfallah3/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/cli/cli.py", line 97, in wrapper
    return func(*args, **kwargs)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/cli/cli.py", line 236, in login
    wandb.setup(settings=login_settings)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/sdk/wandb_setup.py", line 312, in setup
    ret = _setup(settings=settings)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/sdk/wandb_setup.py", line 307, in _setup
    wl = _WandbSetup(settings=settings)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/sdk/wandb_setup.py", line 293, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/sdk/wandb_setup.py", line 106, in __init__
    self._setup()
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/sdk/wandb_setup.py", line 234, in _setup
    self._setup_manager()
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/sdk/wandb_setup.py", line 265, in _setup_manager
    self._manager = wandb_manager._Manager(
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/sdk/wandb_manager.py", line 108, in __init__
    self._service.start()
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/sdk/service/service.py", line 112, in start
    self._launch_server()
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/sdk/service/service.py", line 108, in _launch_server
    assert ports_found
AssertionError
 [kfallah@login-phoenix-slurm-3]% Traceback (most recent call last):
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/__main__.py", line 3, in <module>
    cli.cli(prog_name="python -m wandb")
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/cli/cli.py", line 97, in wrapper
    return func(*args, **kwargs)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/cli/cli.py", line 282, in service
    server.serve()
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/sdk/service/server.py", line 130, in serve
    self._inform_used_ports(grpc_port=grpc_port, sock_port=sock_port)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/sdk/service/server.py", line 65, in _inform_used_ports
    pf.write(self._port_fname)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/site-packages/wandb/sdk/service/port_file.py", line 25, in write
    f = tempfile.NamedTemporaryFile(prefix=bname, dir=dname, mode="w", delete=False)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/tempfile.py", line 545, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/home_dir_path/.conda/envs/manifold-contrastive/lib/python3.9/tempfile.py", line 255, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpcmvswfcr/port-105577.txt0fmtukes'

This is potentially a duplicate of GitHub issue #3911 in wandb/wandb, "[CLI]: Can't find port file when using wandb.require("service")".
I have pointed this issue out to the compute cluster administrators. Any potential workarounds I could use for now would be very helpful.

Hi @kfallah,

Thank you for the follow-up. I reviewed the GitHub issue you referenced, and your setup is very similar to that of others on that thread:

  • Running experiments through a cluster node (possible slow node)
  • You are hitting an error with wandb service trying to establish port connections

Try the following:

  • Increase the port wait timeout of wandb service by editing the installed wandb source. Change the 30 to 300, for example: time_max = time.time() + 300
  • Or disable wandb service by setting the environment variable WANDB_DISABLE_SERVICE=True. Wandb service was developed to improve the reliability of distributed jobs and is on by default in client versions 0.13.0+. If you can run without it, disable it if it becomes apparent that there is network-related interference.
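
For the second option, here is a minimal sketch of setting the variable from inside the training script rather than the shell. It assumes the variable is read when wandb sets itself up, so the safest place to set it is before wandb is imported. The helper name disable_wandb_service is made up for this example and is not part of wandb.

```python
import os


def disable_wandb_service(env=os.environ):
    """Set WANDB_DISABLE_SERVICE unless the caller has already set it.

    `disable_wandb_service` is a hypothetical helper for this example,
    not a wandb function. Returns the value that ends up in `env`.
    """
    env.setdefault("WANDB_DISABLE_SERVICE", "True")
    return env["WANDB_DISABLE_SERVICE"]


# Call this at the top of the script, before wandb is imported or initialized.
disable_wandb_service()

# Only after that, import and initialize wandb as usual, e.g.:
# import wandb
# wandb.init(project=..., entity=..., config=config)
```

Setting the variable in the job script (export WANDB_DISABLE_SERVICE=True) before launching Python should have the same effect and avoids touching the training code.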

What happens when you try the above?

Hi @kfallah, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!

Hi, I have the same issue as the OP. I increased the timeout and it did help (at least in one case so far). Context: I run some experiments on a shared server which is sometimes under heavy load (lots of CPU processes from other users). In that case, wandb.init typically fails (I can post the traceback if you are interested). When I run the same code on a less busy server, it works fine.

Hi @mohammadbakir, is there a better way to increase the timeout value without changing the source code, such as an environment variable or something else?

A feature where we can increase the timeout value without changing the source code would be useful.
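
To illustrate the request, here is a rough sketch of what a configurable wait could look like. This is not an existing wandb option and not wandb's actual code: the environment variable name EXAMPLE_PORT_WAIT_TIMEOUT and both helper functions are hypothetical. It only shows the general pattern of polling for a file until a deadline that is read from the environment.

```python
import os
import time


def timeout_from_env(name="EXAMPLE_PORT_WAIT_TIMEOUT", default=30.0, env=os.environ):
    """Read a timeout in seconds from a HYPOTHETICAL environment variable.

    Falls back to `default` if the variable is unset, not a number,
    or not positive.
    """
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def wait_for_file(path, timeout=30.0, poll_interval=0.2):
    """Poll until `path` exists or `timeout` seconds elapse.

    Returns True if the path was found, False otherwise.
    """
    time_max = time.monotonic() + timeout
    while time.monotonic() < time_max:
        if os.path.exists(path):
            return True
        time.sleep(poll_interval)
    # One last check in case the file appeared during the final sleep.
    return os.path.exists(path)
```

With something like this, a user on a busy node could set EXAMPLE_PORT_WAIT_TIMEOUT=300 in the job script and the wait would be extended without editing any installed package files.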
