How does one do hyperparameter sweeps when using HPCs/clusters?

I saw the great video:

but I still wasn’t 100% sure how to use it on an HPC cluster. I understand there is a central sweep master on wandb’s servers sending commands, but how does it connect to the HPC/cluster?

There are some cases I am worried about:

  1. the HPC needs my password
  2. the HPC needs an ssh key
  3. a VPN is needed to connect to the HPC
  4. the HPC needs duo authentication
  5. the HPC uses a workload manager e.g. slurm or condor

It would be very nice to have a concrete example with some of these. Perhaps slurm + password is the most common (although I admit that condor behind a VPN wall + password is my real use case right now).

Hey,

I run W&B sweeps on HPC clusters with slurm as well.
Just prepare all your code and data on the cluster and create the sweep as explained above.
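For illustration, here is a minimal sketch of creating a sweep from Python on the cluster's login node, in case it helps. The metric name `val_loss`, the two hyperparameters, and the project name `my-hpc-project` are assumptions made up for this example, not anything from the thread. It also assumes you have already run `wandb login` on the cluster.

```python
# Sketch: define a sweep and register it with W&B.
# All names below (metric, parameters, project) are illustrative assumptions.

SWEEP_CONFIG = {
    "method": "random",  # other supported methods: "grid", "bayes"
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [32, 64, 128]},
    },
}


def create_sweep(project: str = "my-hpc-project") -> str:
    """Register the sweep with W&B and return its sweep ID."""
    # Imported lazily so the config can be inspected without wandb installed.
    import wandb

    return wandb.sweep(sweep=SWEEP_CONFIG, project=project)
```

Calling `create_sweep()` once returns the sweep ID that the agents are later started with. Your training script would need to log a metric with the same name as the one in the config.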

Then from the sweep page in your project, copy the wandb agent command.
Log in to your HPC, and put the wandb agent command in the queue.
It is especially useful on HPC clusters, as you have multiple nodes available: each agent you queue asks the W&B sweep server for its next set of hyperparameters, so separate nodes automatically work on different configurations.
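As a rough sketch of "putting the agent command in the queue" with slurm, the helper below writes a job-array batch script in which every array task starts one `wandb agent`. The helper name `build_sbatch_script`, the job name, the time limit, and the sweep path `my-entity/my-project/abc123` are hypothetical placeholders; the resources you actually need to request depend on your cluster.

```python
def build_sbatch_script(
    sweep_path: str,
    num_agents: int = 4,
    runs_per_agent: int = 5,
    time_limit: str = "04:00:00",
) -> str:
    """Return the text of a Slurm job-array script that starts one agent per task.

    sweep_path is the "entity/project/sweep_id" string shown on the sweep page.
    """
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")

    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=wandb-sweep",        # assumed job name
        f"#SBATCH --array=0-{num_agents - 1}",   # one array task per agent
        f"#SBATCH --time={time_limit}",
        "#SBATCH --output=sweep_%A_%a.out",      # %A = array job ID, %a = task index
        "",
        # Each agent fetches hyperparameters from the sweep server and
        # stops after runs_per_agent runs.
        f"wandb agent --count {runs_per_agent} {sweep_path}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sweep path; replace it with the one from your sweep page.
    print(build_sbatch_script("my-entity/my-project/abc123", num_agents=3))
```

You would save the printed text to a file and submit it with `sbatch`. The agents connect outward to W&B from the compute nodes, so the nodes need outbound internet access and your W&B credentials (from `wandb login`, or the `WANDB_API_KEY` environment variable). The password, ssh key, VPN, or Duo step is only needed for you to log in to the cluster and submit the job.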

Thank you for the help, Joris. Let us know if you still need further assistance, Brando.
