Sweep on remote cluster GPUs

andreakiro · July 9, 2022, 5:40pm

Hey. I’m trying to run a sweep on a cluster’s GPUs by submitting it as a new job.
The problem is that the job runs, but keeps logging a network error:

wandb: Network error (ConnectionError), entering retry loop.

The script works fine if I run it “locally” on the cluster, i.e. without submitting a GPU job.
My intuition is that W&B doesn’t find my creds (.netrc file) on the node it’s running on. So I was wondering whether there is a way to pass my API key directly to the wandb.agent function, so that the script is independent of its execution environment?

Thanks

nathank · July 13, 2022, 1:45pm

Hi @andreakiro,
You can set the environment variables WANDB_API_KEY and WANDB_BASE_URL (the latter only if you are using a local server rather than wandb.ai), and wandb will read them instead of the .netrc file. Alternatively, in Python you can call wandb.login(key=<your_api_key>), but we recommend caution with this: hard-coding the API key into your script can expose it.
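
For illustration, here is a minimal sketch of the environment-variable route, assuming the job script exports WANDB_API_KEY before Python starts. The helper names resolve_api_key and run_agent are made up for this example and are not part of wandb. It checks the key early so a missing credential fails with a clear message, then logs in with wandb.login before starting the agent.

```python
import os


def resolve_api_key(env=os.environ):
    """Return the W&B API key from the environment, or raise a clear error."""
    key = env.get("WANDB_API_KEY", "").strip()
    if not key:
        raise RuntimeError("WANDB_API_KEY is not set in this job's environment")
    return key


def run_agent(sweep_id, train_fn, count=None):
    """Log in with the key from the environment, then start the sweep agent."""
    # Imported here so the key check above can run even where wandb is absent.
    import wandb

    wandb.login(key=resolve_api_key())
    wandb.agent(sweep_id, function=train_fn, count=count)
```

With this in place the key never appears in the script itself; it only has to be exported in the job submission script or the scheduler's environment settings.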

I’m a little suspicious that this is the cause of your issue, though, as a credentials problem usually shows up as an “Invalid or missing API key” error. If this doesn’t resolve it, could you run ping api.wandb.ai (or whichever endpoint you are creating runs against) on the GPU node to confirm that it can send network packets to our backend?
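
If ping is not available on the compute node, or ICMP is blocked while HTTPS is allowed, a rough equivalent can be run from Python inside the submitted job. This is only a sketch: the function name can_reach is made up here, and port 443 assumes the default HTTPS endpoint. It simply tries to open a TCP connection and reports whether that succeeded.

```python
import socket


def can_reach(host="api.wandb.ai", port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling can_reach() at the top of the job script and printing the result shows whether the node itself can open a connection to the backend, independently of any credentials.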

Thank you,
Nate

gsaltintas · July 18, 2022, 10:42pm

Hi @andreakiro, sometimes cluster admins disable network access on compute nodes for security reasons. You may need to load a proxy module (the details depend on your cluster) before running your sweep so that it gets logged during training.
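
If your cluster exposes an HTTP(S) proxy rather than a loadable module, one possible sketch is to set the standard proxy environment variables before wandb makes any requests. This assumes the wandb client goes through the requests library, which reads these variables by default. The helper name use_proxy and the proxy address in the usage note are placeholders, so ask your admins for the real one.

```python
import os


def use_proxy(proxy_url, env=os.environ):
    """Point HTTP(S) clients that honor the standard proxy variables at proxy_url."""
    # Both spellings are set because tools differ in which one they read.
    for name in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
        env[name] = proxy_url
```

For example, use_proxy("http://proxy.example:3128") would be called at the top of the job script, before importing wandb or starting the agent.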
Good luck!

nathank · July 20, 2022, 4:06am

Hi @andreakiro,
I just wanted to follow up and see if you were still looking for help with this and if you had a chance to try running ping against api.wandb.ai?
Thank you,
Nate

lesliewandb · July 25, 2022, 5:14pm

Hi Andrea, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

system · September 18, 2022, 4:06am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
