Hello everyone,
I am doing HPO using W&B sweeps with Hydra configs on a SLURM cluster.
My workflow so far is as follows:
- I create a .yaml to define the parameters I want to sweep over (shown below)
- I create the sweep with:
wandb sweep sweeps/multiconv_sweep.yaml
- Then I use a bash script to send a couple of agents to the cluster (shown below)
Everything works great so far. However, after an agent finishes one run, it starts another. This is annoying because these follow-up runs hit the cluster time limit, so I'm getting plenty of crashed runs and I'm cluttering the cluster with runs that won't finish anyway. I tried to force the agent to stop by adding:
# End of training
wandb.finish()
log.info(f"Finished Training")
sys.exit("Finished Training")
but it doesn't work.
Can anyone here help me with this?
As far as I know there is no array solution for sending a defined number of agent jobs to the cluster, so I use the script sweep.sh and run the following on the command line a couple of times:
sbatch sweep.sh wandb agent teamID/ProjectID/SweepID
This works well.
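The repeated submission described above could be wrapped in a small loop. This is only a minimal sketch under stated assumptions: the function names and the DRY_RUN switch are hypothetical, and `--count 1` relies on the `wandb agent` CLI option that caps the number of runs per agent, which should make each agent exit after a single run instead of picking up another one.

```shell
#!/bin/bash
# Sketch: submit several single-run sweep agents to SLURM.
# Assumptions (not part of the original setup): the helper names,
# the DRY_RUN switch, and the use of `wandb agent --count 1`.

# Print the agent command, capped at a fixed number of runs.
build_agent_cmd() {
    local sweep_path="$1"      # e.g. teamID/ProjectID/SweepID
    local max_runs="${2:-1}"   # runs per agent, default 1
    echo "wandb agent --count ${max_runs} ${sweep_path}"
}

# Submit n_jobs batch jobs, each wrapping one agent in sweep.sh.
submit_agents() {
    local n_jobs="$1"
    local sweep_path="$2"
    local cmd i
    cmd="$(build_agent_cmd "${sweep_path}" 1)"
    for ((i = 1; i <= n_jobs; i++)); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only show what would be submitted.
            echo "sbatch sweep.sh ${cmd}"
        else
            # Word splitting of ${cmd} is intentional here.
            # shellcheck disable=SC2086
            sbatch sweep.sh ${cmd}
        fi
    done
}
```

Calling `DRY_RUN=1 submit_agents 3 teamID/ProjectID/SweepID` prints the three sbatch lines without submitting anything. Since sweep.sh already uses %A and %a in its log file names, a SLURM job array (for example `sbatch --array=1-3 sweep.sh ...`) might be an alternative to the loop, though that is untested against this setup.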
sweep.sh:
#!/bin/bash
# Define the partition on which the job shall run.
#SBATCH XXXX
# Define a name for your job
#SBATCH XXXXX
#SBATCH --output logs/%x-%A-%a.out
#SBATCH --error logs/%x-%A-%a.err
# Define the amount of memory required per node
#SBATCH --mem 32GB
#SBATCH --time 1-10:00:00
echo "Workingdir: $PWD"
echo "Started at $(date)"
# Running the job
start=$(date +%s)
srun "$@"
end=$(date +%s)
# Elapsed wall-clock time in seconds
runtime=$((end-start))
echo "Job execution complete."
echo "Runtime: $runtime seconds"
sweep_settings.yaml:
program: train.py
method: random
metric:
  goal: minimize
  name: sliced wasserstein distance
parameters:
  model.gan.beta:
    distribution: uniform
    max: 1.0
    min: 0.0
  [...]
command:
  - python
  - ${program}
  - ${args_no_hyphens}