but I still wasn’t 100% sure how to use it on an HPC cluster. I understand there is a central sweep master on wandb’s servers sending commands, but how does it connect to the HPC cluster?
There are some cases I am worried about:
the HPC needs my password
the HPC needs an ssh key
the HPC needs a VPN to connect
the HPC needs duo authentication
the HPC uses a workload manager e.g. slurm or condor
It would be very nice to have a concrete example with some of these. Perhaps slurm + password is the most common (although I admit that condor behind a VPN wall + password is my real use case right now).
I run W&B sweeps on HPC clusters with slurm as well.
Just prepare all your code and data on the cluster and create the sweep as explained above.
Then from the sweep page in your project, copy the wandb agent command.
Log in to your HPC, and put the wandb agent command in the queue.
It is especially useful on HPC clusters as you have multiple nodes available. If you queue the agent command on several nodes, the wandb API automatically gives each node its own new set of hyperparameters!
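
To make the steps above concrete, here is a minimal sketch of a small Python helper that writes the slurm batch script for you. It is only an illustration under some assumptions: the sweep path `my-entity/my-project/abc123` is a placeholder for whatever the sweep page shows, the function name `make_agent_job` and the time limit are made up for this example, and I assume your cluster accepts slurm job arrays and that `WANDB_API_KEY` is already set in your environment (or that you ran `wandb login` once on the cluster).

```python
def make_agent_job(sweep_path: str, n_agents: int = 4, runs_per_agent: int = 10,
                   time_limit: str = "04:00:00") -> str:
    """Return the text of a slurm batch script that starts W&B sweep agents.

    sweep_path is the entity/project/sweep_id string copied from the sweep page.
    All defaults here are hypothetical; adjust them to your cluster.
    """
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=wandb-agent",
        # One array task per agent, so each agent can land on its own node.
        f"#SBATCH --array=0-{n_agents - 1}",
        f"#SBATCH --time={time_limit}",
        # %A is the array job ID, %a the array task index.
        "#SBATCH --output=agent_%A_%a.out",
        "",
        "# Assumes WANDB_API_KEY is set, or that `wandb login` was run beforehand.",
        # --count limits how many runs this agent executes before it exits.
        f"wandb agent --count {runs_per_agent} {sweep_path}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder sweep path; replace it with the one from your sweep page.
    print(make_agent_job("my-entity/my-project/abc123"))
```

Save the printed text as e.g. `agent.sh` and submit it with `sbatch agent.sh` once you are logged in to the cluster. As far as I understand it, the password, ssh key, VPN or duo step only matters for your own login: the agent running on the compute node contacts the W&B servers itself to ask for the next hyperparameters, so W&B never has to log in to your cluster. This does assume the compute nodes are allowed outbound internet access.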