Connection error when running on a cluster (Compute Canada)


I am trying to run wandb in a cluster environment (ComputeCanada), but I get a connection error.

When I run the code from the wandb Quickstart page on my laptop, the loss and accuracy charts appear in my project on the W&B dashboard and everything works fine. However, when I run the same code on the cluster (ComputeCanada), I get the following error:

I have already verified that I am logged in to my wandb account on the cluster:

import wandb
wandb: Currently logged in as: pparv056. Use wandb login --relogin to force relogin

Please let me know how I can solve this problem.
Thanks for your time in advance.

Hi Payam,

Thank you for reaching out. We’ll check this on our end and get back to you with updates.

Carlo Argel

Hi @pparv056

It seems like you’re encountering a network issue while trying to use wandb in a cluster environment. Here are a few suggestions that might help:

  1. Update your CA certificates: If you’re running the script on an Ubuntu server, run update-ca-certificates. wandb can’t sync training logs without a valid SSL certificate, because an unverified connection would be a security vulnerability.
  2. Offline mode: If your network is flaky, or the compute nodes have no Internet access, you can run training in offline mode and later sync the files to wandb from a machine that does have Internet access. Set the environment variable WANDB_MODE=offline to disable wandb syncing temporarily.
  3. Private Hosting: If the network issues persist, you might want to consider W&B Private Hosting, which runs on your own machine and doesn’t sync files to the cloud servers.
  4. SSL CERTIFICATE_VERIFY_FAILED: This error could be due to your company’s firewall. You can set up local CAs and then use: export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
  5. wandb Settings: If you’re using wandb version 0.13.0 or later, you can try changing the start method to “fork” using wandb.init(settings=wandb.Settings(start_method="fork")). For versions prior to 0.13.0, you can try using wandb.init(settings=wandb.Settings(start_method="thread")).
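As a rough illustration of suggestion 2, the sketch below sets WANDB_MODE=offline from inside Python before wandb is initialised. The helper name configure_wandb_offline and the optional run_dir argument are hypothetical and only for illustration; WANDB_MODE and WANDB_DIR are environment variables that wandb reads. It assumes the variables are set before wandb.init is called.

```python
import os


def configure_wandb_offline(run_dir=None, env=None):
    """Switch wandb to offline mode through environment variables.

    Hypothetical helper: call it before wandb.init().
    run_dir: optional directory where wandb should write its local run files.
    env: mapping to modify; defaults to os.environ (passing a dict eases testing).
    """
    env = os.environ if env is None else env
    # Offline mode: runs are written to local files and nothing is uploaded.
    env["WANDB_MODE"] = "offline"
    if run_dir is not None:
        # WANDB_DIR tells wandb where to store the local run files.
        env["WANDB_DIR"] = str(run_dir)
    return env


# Example: configure the current process, then start training as usual.
configure_wandb_offline()
```

After the job finishes, the run files can be copied to (or read from) a machine with Internet access, such as a login node, and uploaded with wandb sync, for example wandb sync wandb/offline-run-*, assuming the default local directory layout.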

Please try these suggestions and see if they resolve your issue. If the problem persists, I recommend reaching out to Weights & Biases support or community forums for further assistance.

Hi @pparv056,

I just wanted to follow up and check whether you still need assistance with this.

Carlo Argel

Hi @pparv056, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!
