I am trying to run wandb in a cluster environment (ComputeCanada), but I get a connection error.
When I run the code from the wandb Quickstart page on my laptop, I can see the loss and accuracy charts in my project on the wandb web interface, and everything works fine. However, when I run the same code on a cluster (ComputeCanada), I get the following error:
It seems like you’re encountering a network issue while trying to use wandb in a cluster environment. Here are a few suggestions that might help:
Update your SSL certificates: If you’re running the script on an Ubuntu server, run update-ca-certificates. wandb won’t sync training logs when it can’t verify the server’s SSL certificate, because sending data over an unverified connection would be a security vulnerability.
Offline mode: If your network is flaky, you can run training in offline mode and sync the files to wandb later from a machine that has Internet access. Set the environment variable WANDB_MODE=offline so that runs are saved locally without syncing, then upload them afterwards with wandb sync.
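Here is a minimal sketch of the offline workflow in Python, assuming the training script can set environment variables before it calls wandb.init(). The helper name enable_wandb_offline and the project name are hypothetical placeholders; WANDB_MODE and WANDB_DIR are environment variables that wandb reads.

```python
import os


def enable_wandb_offline(log_dir=None):
    """Make later wandb.init() calls save runs locally instead of syncing.

    enable_wandb_offline is a hypothetical helper name; WANDB_MODE and
    WANDB_DIR are environment variables read by wandb.
    """
    os.environ["WANDB_MODE"] = "offline"
    if log_dir is not None:
        # Optional: choose where the local run files are written.
        os.makedirs(log_dir, exist_ok=True)
        os.environ["WANDB_DIR"] = log_dir


# In the training script, call the helper before wandb.init():
#     enable_wandb_offline()
#     import wandb
#     run = wandb.init(project="my-project")  # placeholder project name
#
# Afterwards, from a machine with Internet access, upload the saved runs
# (adjust the path if WANDB_DIR was changed):
#     wandb sync wandb/offline-run-*
```

Setting the variable inside the script rather than in the shell keeps the job submission script unchanged, but exporting WANDB_MODE=offline in the job script should work equally well.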
Private Hosting: If the network issues persist, you might want to consider using W&B Private Hosting, which operates on your machine and doesn’t sync files to the cloud servers.
SSL CERTIFICATE_VERIFY_FAILED: This error could be due to your company’s firewall. You can set up local CAs and then use: export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
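The same setting can be applied from Python before wandb is imported. The sketch below is only an illustration: use_ca_bundle is a hypothetical helper name, and the default path is the one quoted above, which may differ on your cluster. It sets REQUESTS_CA_BUNDLE only when the file exists and Python's ssl module can load it.

```python
import os
import ssl


def use_ca_bundle(bundle_path="/etc/ssl/certs/ca-certificates.crt"):
    """Point requests-based clients at bundle_path if it is loadable.

    Returns True when REQUESTS_CA_BUNDLE was set, False otherwise.
    use_ca_bundle is a hypothetical helper name.
    """
    if not os.path.isfile(bundle_path):
        return False
    try:
        # Raises ssl.SSLError if the file holds no usable certificates.
        ssl.create_default_context(cafile=bundle_path)
    except (ssl.SSLError, OSError):
        return False
    os.environ["REQUESTS_CA_BUNDLE"] = bundle_path
    return True


if not use_ca_bundle():
    # Show where Python's ssl module looks for certificates by default,
    # which can help locate the right bundle on this machine.
    print(ssl.get_default_verify_paths())
```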
wandb Settings: If you’re using wandb version 0.13.0 or later, you can try changing the start method to “fork” using wandb.init(settings=wandb.Settings(start_method="fork")). For versions prior to 0.13.0, you can try using wandb.init(settings=wandb.Settings(start_method="thread")).
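If the same script runs in environments with different wandb versions, the choice can be made at runtime. This is a rough sketch, assuming the installed version still accepts the start_method setting; pick_start_method is a hypothetical helper, and it only inspects the major and minor version numbers.

```python
import re


def pick_start_method(wandb_version):
    """Return "fork" for wandb >= 0.13.0 and "thread" for older versions.

    pick_start_method is a hypothetical helper name.
    """
    match = re.match(r"(\d+)\.(\d+)", wandb_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {wandb_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return "fork" if (major, minor) >= (0, 13) else "thread"


# Usage in the training script:
#     import wandb
#     method = pick_start_method(wandb.__version__)
#     wandb.init(settings=wandb.Settings(start_method=method))
```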
Please try these suggestions and see if they resolve your issue. If the problem persists, I recommend reaching out to Weights & Biases support or community forums for further assistance.
Hi @pparv056, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!