When I launch wandb on some (but not all) of the nodes in my cluster, I get the following error. Moving to another node resolves the issue, but I can't launch wandb on the affected nodes, even though every node uses the exact same conda environment and wandb installation.
  File ".../lib/python3.11/site-packages/pytorch_lightning/loggers/wandb.py", line 360, in __init__
    _ = self.experiment
        ^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/pytorch_lightning/loggers/logger.py", line 53, in experiment
    return get_experiment() or DummyExperiment()
           ^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/lightning_utilities/core/rank_zero.py", line 43, in wrapped_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/pytorch_lightning/loggers/logger.py", line 51, in get_experiment
    return fn(self)
           ^^^^^^^^
  File "/lib/python3.11/site-packages/pytorch_lightning/loggers/wandb.py", line 406, in experiment
    self._experiment = wandb.init(**self._wandb_init)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/wandb/sdk/wandb_init.py", line 1219, in init
    raise Error("An unexpected error occurred") from error_seen
wandb.errors.Error: An unexpected error occurred
When I reproduce the issue outside of PyTorch Lightning, I get the more specific error below. However, I'm unable to get write permissions for that folder.
PermissionError: /tmp/wandb/run-20240525_232234-8un2rikz/logs is not writable
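Since the permissions on /tmp/wandb can't be changed, one thing that might be worth trying is pointing wandb at a directory the current user can write to. The sketch below is only an assumption about what could help here, not a confirmed fix: it tries a list of candidate directories, keeps the first one where a file can actually be created, and exports it through the WANDB_DIR environment variable, which wandb reads when choosing where to store run files. The helper names and the candidate paths are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def pick_writable_dir(candidates):
    """Return the first candidate directory that can be created and written to."""
    for candidate in candidates:
        path = Path(candidate).expanduser()
        try:
            path.mkdir(parents=True, exist_ok=True)
            # Create a real file instead of trusting os.access(), which can be
            # misleading on network filesystems.
            with tempfile.TemporaryFile(dir=path):
                pass
        except OSError:
            continue  # not creatable or not writable, try the next one
        return path
    raise RuntimeError("none of the candidate directories is writable")


def configure_wandb_dir(candidates):
    """Set WANDB_DIR to the first writable candidate and return that path."""
    path = pick_writable_dir(candidates)
    os.environ["WANDB_DIR"] = str(path)
    return path


# Example usage (hypothetical paths), to run before wandb.init() is called:
#
#   configure_wandb_dir([
#       Path.home() / "wandb_runs",
#       Path(tempfile.gettempdir()) / f"wandb-{os.getuid()}",  # per-user temp dir
#   ])
```

The same directory can also be passed explicitly as `wandb.init(dir=...)`, or as `save_dir` when constructing the `WandbLogger` in PyTorch Lightning. Whether this sidesteps the error on the affected nodes depends on why /tmp/wandb is not writable there (for example, it may have been created by a different user), which the traceback alone does not show.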