Sweep on remote cluster GPUs

Hey. I’m trying to run a sweep on cluster GPUs by submitting it as a new job.
The problem is that the job runs, but keeps logging a network error:

wandb: Network error (ConnectionError), entering retry loop.

The script works fine if I run it “locally” on the cluster, i.e. without submitting a GPU job.
My intuition is that W&B doesn’t find my creds (.netrc file) on the node it’s running on. So I was wondering if there is a way to pass my API key directly to the wandb.agent function, so that the script is independent of its execution environment?

Thanks

Hi @andreakiro,
You can use the environment variables WANDB_API_KEY and WANDB_BASE_URL (the latter only if you are using a local server rather than wandb.ai), and wandb will look there instead of in a .netrc file. Alternatively, in Python you can call wandb.login(key=<your_api_key>), but we recommend caution with this: hard-coding the key into your script can expose it.
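As a rough sketch of how the two options can be combined, the snippet below reads the key from WANDB_API_KEY at runtime and passes it to wandb.login, so nothing is hard-coded in the script. The helper names read_api_key and run_agent, and the sweep_id and train_fn arguments, are placeholders for illustration and not part of wandb; it also assumes the job script exports WANDB_API_KEY before launching Python.

```python
import os


def read_api_key(env=None):
    """Return the W&B API key from WANDB_API_KEY, or raise if it is unset."""
    env = os.environ if env is None else env
    key = env.get("WANDB_API_KEY", "").strip()
    if not key:
        raise RuntimeError("WANDB_API_KEY is not set in this job's environment")
    return key


def run_agent(sweep_id, train_fn, count=None):
    """Log in with the key from the environment, then start the sweep agent.

    sweep_id and train_fn are hypothetical placeholders for your own sweep ID
    and training function.
    """
    # Imported here so the key check above can be used without wandb installed.
    import wandb

    wandb.login(key=read_api_key())
    wandb.agent(sweep_id, function=train_fn, count=count)
```

Failing early with a clear message when the variable is missing makes it easier to tell a credentials problem apart from a network problem on the compute node.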

I doubt that this is the cause of your issue, though, since a credentials problem usually shows up as an “Invalid or missing API key” error. If this doesn’t resolve the issue, could you run ping api.wandb.ai (or whichever endpoint you are creating runs against) on the GPU node to confirm that it can send network packets to our backend?
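If ping is not available inside the job, a similar check can be done from Python. This is only a sketch: it attempts a TCP connection on port 443 rather than sending ICMP packets as ping does, so it is a related but not identical test, and the function name can_reach is made up for this example.

```python
import socket


def can_reach(host="api.wandb.ai", port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling print(can_reach()) at the top of the submitted job and comparing the result with the same call on the login node would show whether the compute node is the part without network access.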

Thank you,
Nate

Hi @andreakiro, sometimes cluster admins disable network access on compute nodes for security reasons. You may need to load a proxy module (how to do this depends on your cluster) before running your sweep so that it can log during training.
Good luck!
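If your cluster exposes its proxy as a plain URL rather than a module, one possible approach is to set the standard proxy environment variables before wandb starts. This is a sketch under assumptions: configure_proxy is a made-up helper, the proxy address has to come from your cluster documentation, and whether traffic actually goes through it depends on your cluster setup.

```python
import os


def configure_proxy(proxy_url, env=None):
    """Set the usual proxy variables in env (os.environ by default)."""
    env = os.environ if env is None else env
    # Both spellings are set because tools differ in which one they read.
    for name in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
        env[name] = proxy_url
    return env
```

This would need to run before wandb is imported or initialized in the job, for example configure_proxy("http://proxy.example:3128"), where the address shown is purely a placeholder.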

Hi @andreakiro,
I just wanted to follow up and see if you were still looking for help with this and if you had a chance to try running ping against api.wandb.ai?
Thank you,
Nate

Hi Andrea, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!