wandb.errors.Error: An unexpected error occurred

When I launch wandb on some (but not all) of the nodes in my cluster, I get the following error. Moving to another machine resolves the issue, but I still can’t launch wandb on the affected node, even though both machines use the exact same conda environment and wandb installation.

  File ".../lib/python3.11/site-packages/pytorch_lightning/loggers/wandb.py", line 360, in __init__
    _ = self.experiment
        ^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/pytorch_lightning/loggers/logger.py", line 53, in experiment
    return get_experiment() or DummyExperiment()
           ^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/lightning_utilities/core/rank_zero.py", line 43, in wrapped_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/pytorch_lightning/loggers/logger.py", line 51, in get_experiment
    return fn(self)
           ^^^^^^^^
  File "/lib/python3.11/site-packages/pytorch_lightning/loggers/wandb.py", line 406, in experiment
    self._experiment = wandb.init(**self._wandb_init)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/wandb/sdk/wandb_init.py", line 1219, in init
    raise Error("An unexpected error occurred") from error_seen
wandb.errors.Error: An unexpected error occurred

When I reproduce the issue outside of pytorch lightning, I get the following error. However, I’m unable to get permissions for that folder.

PermissionError: /tmp/wandb/run-20240525_232234-8un2rikz/logs is not writable

Hello, this is a reply from our support bot, designed to assist you with your Weights & Biases related queries. To reach a human, please reply to this message.

The error you’re encountering, wandb.errors.Error: An unexpected error occurred, during the initialization of wandb with PyTorch Lightning on some nodes of your cluster, suggests there might be an issue with the environment or configuration on those specific nodes. Here are some steps and best practices to troubleshoot and potentially resolve this issue:

1. Check wandb Version Compatibility

Ensure that the version of wandb you’re using is compatible with your version of PyTorch Lightning. It’s a good practice to keep both libraries up to date. If necessary, update wandb and PyTorch Lightning to their latest versions.

2. Verify Network Configuration

Since the issue occurs on specific nodes, there might be a network configuration issue that prevents wandb from communicating with its servers. Verify that the nodes in question have proper internet access and that any firewall or proxy settings allow wandb to make outbound connections.
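A quick way to compare nodes is to run the same connectivity check on a working node and a failing one. This is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper name, and the host assumes the public W&B service rather than a self-hosted server.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# api.wandb.ai is the public W&B endpoint; a self-hosted server would use a different host.
print(socket.gethostname(), "-> api.wandb.ai:443 reachable:", can_connect("api.wandb.ai"))
```

This only tests a direct TCP connection. A node that reaches the internet through an HTTP proxy can report False here and still work with wandb, so treat the result as a hint rather than a verdict.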

3. Environment Variables

Ensure that the WANDB_API_KEY is set correctly on all nodes. The API key is required for wandb to authenticate with its servers. You can set it in your environment variables or pass it directly in your script.
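To confirm the variable is actually visible to the Python process on each node, something like the following can be run there. It is a small sketch; `api_key_status` is a hypothetical helper, and it reports only the length of the key so the secret never ends up in logs.

```python
import os


def api_key_status(env=os.environ):
    """Report whether WANDB_API_KEY is present without printing its value."""
    key = env.get("WANDB_API_KEY", "")
    if not key:
        return "missing"
    return f"set ({len(key)} characters)"


print("WANDB_API_KEY:", api_key_status())
```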

4. Debugging Mode

Enable wandb’s debug mode to get more detailed logs that might pinpoint the issue. You can do this by setting the environment variable WANDB_DEBUG to true. This might provide more insights into what’s going wrong during initialization.

import os
os.environ["WANDB_DEBUG"] = "true"

5. Offline Mode

If network issues are suspected, try running wandb in offline mode to see if the initialization succeeds. This can help determine if the problem is network-related. You can enable offline mode by setting the environment variable WANDB_MODE to offline.

import os
os.environ["WANDB_MODE"] = "offline"

6. Clean Installation

On the nodes where you’re experiencing issues, try creating a fresh virtual environment and reinstall wandb and PyTorch Lightning. This can help rule out any issues related to corrupted installations or conflicts with other packages.

7. Check for Errors in PyTorch Lightning Integration

Ensure that the integration code between wandb and PyTorch Lightning is correctly set up. Misconfigurations or outdated integration code can lead to initialization errors.

8. Consult Logs and Documentation

Review the detailed logs generated by enabling WANDB_DEBUG and consult the wandb and PyTorch Lightning documentation for any known issues or additional troubleshooting steps.

If after following these steps the issue persists, consider reaching out to the wandb support team with detailed logs and information about your setup for further assistance.

-WandBot :robot:

Hi @seyuboglu, good day and thank you for reaching out to us. Happy to help you here.

The error appears to be a permission issue, and the fact that everything works once you move to another machine points the same way. Are you using a local instance of wandb?

PermissionError: /tmp/wandb/run-20240525_232234-8un2rikz/logs is not writable

Are you setting this path yourself? If so, can you let us know how you are doing it?
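If it helps to narrow this down, the sketch below checks whether a directory is really writable by the current user and, if not, points wandb at one that is through the WANDB_DIR environment variable. The helper names and the candidate list are assumptions for illustration. The guess behind it is that /tmp/wandb on a shared node may have been created by a different user, which would produce this kind of error, but that is only an inference from the path.

```python
import os
import tempfile


def is_writable(path):
    """Return True if a file can actually be created inside `path`."""
    try:
        # Creating a real temporary file is more reliable than os.access on network filesystems.
        with tempfile.TemporaryFile(dir=path):
            pass
        return True
    except OSError:
        # Missing directory, not a directory, or permission denied.
        return False


def first_writable(candidates):
    """Return the first writable directory from `candidates`, or None."""
    for path in candidates:
        if is_writable(path):
            return path
    return None


# Hypothetical candidates: the directory from the error message, then the home directory.
candidates = ["/tmp/wandb", os.path.expanduser("~")]
for path in candidates:
    print(path, "writable:", is_writable(path))

chosen = first_writable(candidates)
if chosen is not None:
    # WANDB_DIR must be set before wandb.init() runs; wandb stores run files under it.
    os.environ["WANDB_DIR"] = chosen
print("WANDB_DIR ->", chosen)
```

Running this on both a working node and a failing node should show whether the two differ in who can write to /tmp/wandb.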

On the machine where you are getting errors, please ensure that the version of wandb you’re using is the same as the version on the other machine that has no errors. That will tell us whether this is caused by a version mismatch.
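One way to compare the two machines is to print the installed versions from inside the same conda environment on each node and diff the output. This is a sketch using the standard library only; `environment_report` is a hypothetical helper, and the package names are assumed to be the PyPI distribution names `wandb` and `pytorch-lightning`.

```python
import platform
import socket
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("wandb", "pytorch-lightning")):
    """Collect the hostname, Python version, and package versions for this node."""
    report = {"host": socket.gethostname(), "python": platform.python_version()}
    for name in packages:
        report[name] = installed_version(name)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```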

I’m looking forward to your next update.

Best Regards,
Paulo

Hi @seyuboglu ,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Best,
Weights & Biases

Hi @seyuboglu, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!