Sweep agent will always start another run after finishing (on SLURM)

Hello everyone,

I am doing HPO using sweeps with Hydra configs on a SLURM cluster.
My workflow so far has been as follows:

  • I create a .yaml to define the parameters I want to sweep over (shown below)
  • I use: wandb sweep sweeps/multiconv_sweep.yaml to create the sweep
  • Then I use a bash script to send a couple of agents to the cluster (shown below)

Everything works great so far; however, after an agent finishes its run, it immediately starts another one. This is annoying because those extra runs hit the cluster's time limit → I get plenty of crashed runs and clutter the cluster with jobs that won't finish anyway. I tried to force the agent to stop by adding:

    # End of training
    wandb.finish()

    log.info("Finished Training")
    sys.exit("Finished Training")

but it doesn't work.
Can anyone here help me with this?

As far as I know there is no job-array solution for sending a defined number of agent jobs to the cluster, so I use the script sweep.sh and run
sbatch sweep.sh wandb agent teamID/ProjectID/SweepID
on the command line a couple of times; this works well (a small loop, sketched below, does the same thing).
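
Purely as an illustration (the count of five agents and the placeholder sweep path are assumptions), the repeated submission could be wrapped in a small shell loop:

for i in $(seq 1 5); do
    # Each iteration submits one SLURM job that starts one sweep agent
    sbatch sweep.sh wandb agent teamID/ProjectID/SweepID
done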

sweep.sh:

#!/bin/bash

# Define the partition on which the job shall run.
#SBATCH XXXX

# Define a name for your job
#SBATCH XXXXX 

#SBATCH --output logs/%x-%A-%a.out  
#SBATCH --error logs/%x-%A-%a.err  

# Define the amount of memory required per node
#SBATCH --mem 32GB
#SBATCH --time 1-10:00:00
echo "Workingdir: $PWD";
echo "Started at $(date)";


# Running the job
start=$(date +%s)

srun "$@"

end=$(date +%s)
runtime=$((end - start))

echo "Job execution complete in ${runtime} seconds."

sweep_settings.yaml

program: train.py
method: random
metric:
  goal: minimize
  name: sliced wasserstein distance

parameters:

  model.gan.beta:
    distribution: uniform
    max: 1.0
    min: 0.0

[...]

command:
  - python
  - ${program}
  - ${args_no_hyphens}
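
For context, ${args_no_hyphens} makes the agent append each sampled parameter as a plain key=value override, which is the form Hydra expects, so a generated command looks roughly like the following (the beta value is purely illustrative):

python train.py model.gan.beta=0.4273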

Hello, this is a reply from our support bot designed to assist you with your Weights & Biases related queries. To reach a human, please reply to this message.

To prevent a Weights & Biases sweep agent from starting new runs after completion, especially when using a SLURM cluster, you can specify the number of runs you want each agent to execute using the count parameter when starting the agent. This ensures that the agent will stop after executing the specified number of runs, which can help avoid the issue of agents starting new runs beyond your cluster’s time limit.

Here’s how you can modify your sweep.sh script to include the count parameter when starting a sweep agent. This example assumes you want each agent to execute only one run:

#!/bin/bash

# Define the partition on which the job shall run.
#SBATCH XXXX

# Define a name for your job
#SBATCH XXXXX

#SBATCH --output logs/%x-%A-%a.out
#SBATCH --error logs/%x-%A-%a.err

# Define the amount of memory required per node
#SBATCH --mem 32GB
#SBATCH --time 1-10:00:00
echo "Workingdir: $PWD";
echo "Started at $(date)";

# Running the job
start=$(date +%s)

# Start the W&B agent with count=1 so it executes exactly one run and then exits
srun wandb agent --count 1 "$1"

end=$(date +%s)
runtime=$((end - start))

echo "Job execution complete in ${runtime} seconds."

In this script, $1 is expected to be the sweep ID in the format teamID/ProjectID/SweepID. When you submit this script with sbatch, you should pass the sweep ID as an argument:

sbatch sweep.sh teamID/ProjectID/SweepID

This approach uses the --count parameter to control the number of runs each agent should execute, effectively preventing the agent from starting new runs after completion.
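
As an illustration (the count of five is arbitrary), submitting the script several times fills the sweep with that many single-run agents, each of which stops after its one run:

for i in $(seq 1 5); do
    # Each job runs exactly one sweep trial because of --count 1
    sbatch sweep.sh teamID/ProjectID/SweepID
done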

Regarding the use of wandb.finish(), it’s important to note that this function is used to mark the end of a run within your training script (train.py in your case), ensuring that all data is synced to the server before the script exits. It does not control the behavior of the sweep agent itself. The --count parameter is the correct way to limit the number of runs for each agent.
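
To make the distinction concrete, here is a minimal sketch (not taken from the original train.py) of how wandb.finish() typically fits into a training script; it closes the current run, while the decision to start another run rests entirely with the agent:

import logging
import wandb

log = logging.getLogger(__name__)

run = wandb.init()   # inside a sweep, the agent supplies the sampled config
# ... training loop using run.config ...
run.finish()         # syncs and closes this run only; it does not stop the agent
log.info("Finished Training")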

For managing Weights & Biases sweep agents on SLURM clusters, here are some best practices:

  • Use the --count parameter to specify the number of runs per agent, as shown above.
  • Ensure your SLURM scripts and job configurations are set up to handle the expected workload and runtime of your experiments.
  • Monitor your SLURM jobs and the W&B dashboard to ensure runs are completing successfully and to troubleshoot any issues that arise (example commands below).

By following these practices, you can more effectively manage your Weights & Biases sweeps on SLURM clusters and avoid issues with agents starting unwanted runs.
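
For the monitoring point, a couple of standard SLURM commands are usually enough; the log path below simply follows the #SBATCH --output pattern from the script above and is shown as an illustration:

squeue -u $USER                               # list your pending and running jobs
tail -f logs/<jobname>-<jobid>-<taskid>.out   # follow the live output of one job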

To reach a human please reply to this message.

-WandBot :robot:


Hi @sboehm, good day and thank you for reaching out to us! Happy to help you with this.

When using sweeps with the SLURM scheduling system, we recommend running wandb agent --count 1 SWEEP_ID in each of your scheduled jobs, which will run a single training job and then exit. This makes it easier to predict runtimes when requesting resources and takes advantage of the parallelism of hyperparameter search.

Could you check whether this approach can be applied to your current setup and works for you?

Thanks,
Paulo

Hey @paulo-sabile,
Thanks for your response. Actually, the bot's response was already pointing in this direction, and it works perfectly :slight_smile:

I don't know how on earth this bot was able to answer it so well; Gemini, Co-Pilot and Chat-GPT failed horribly at answering this question.

Thank you for your update, @sboehm, and we are very grateful for your feedback as well! I will now mark this request as solved. Please feel free to reach out to us again if you need assistance!

Have a great day ahead!

Best Regards,
Paulo