Start sweep and run from the same bash file (SLURM)

I want to initialize a wandb sweep and start the agents (in parallel) from a single bash file using SLURM.
I found some similar posts here, but they did not work for me, so I wanted to share my setup.
I created a new bash file named wandbid_from_slurmid.sh:

#!/bin/bash

# Define the function
extract_wandbid_id() {
    local outp_path="$HOME/slurmfiles"
    local outp_file_path wandbid

    # Loop until the wandbid is found
    while true; do
        # Pick the most recently modified SLURM output file for this job.
        # This is looked up on every iteration in case the file does not exist yet.
        outp_file_path=$(ls -tr "${outp_path}"/*"${SLURM_JOB_ID}"* 2>/dev/null | tail -n 1)

        # Check if the file contains the desired string and extract the wandbid
        if [ -n "$outp_file_path" ] && grep -q "Creating sweep with ID: " "$outp_file_path"; then
            # Extract the wandbid following "Creating sweep with ID: "
            wandbid=$(grep "Creating sweep with ID: " "$outp_file_path" | head -n 1 | sed 's/.*Creating sweep with ID: //')

            # WARNING: the caller captures this function's stdout via $(...),
            # so do not echo anything else inside this function
            echo "$wandbid"
            break
        else
            # Wait for 1 second before retrying
            sleep 1
        fi
    done
}
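One thing to be aware of: the loop above waits forever if the sweep is never created (for example, if config.yml is invalid). Below is a rough sketch of a variant that gives up after a timeout, together with a dry run against a fake output file so it can be tried without submitting a job. The names `wait_for_sweep_id`, `log_file` and `max_wait`, the sweep ID `abc123xy`, and the exact log line are made up for illustration; the real wandb output may carry a different prefix.

```shell
#!/bin/bash
# Sketch: poll a log file for the sweep ID, giving up after max_wait seconds.
# wait_for_sweep_id, log_file and max_wait are hypothetical names.
wait_for_sweep_id() {
    local log_file=$1 max_wait=${2:-60} waited=0 line
    while [ "$waited" -lt "$max_wait" ]; do
        # First matching line, or empty if the file is missing or has no match yet
        line=$(grep -m 1 "Creating sweep with ID: " "$log_file" 2>/dev/null)
        if [ -n "$line" ]; then
            # Strip everything up to and including the marker text
            echo "${line##*Creating sweep with ID: }"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1  # timed out without finding the ID
}

# Dry run against a fake output file (made-up content)
tmpdir=$(mktemp -d)
fake_log="$tmpdir/slurm_output_12345_0.out"
echo "wandb: Creating sweep with ID: abc123xy" > "$fake_log"
sweep_id=$(wait_for_sweep_id "$fake_log" 5)

# A file that never appears should make the function return non-zero
missing_status=0
wait_for_sweep_id "$tmpdir/does_not_exist.out" 1 >/dev/null || missing_status=$?

rm -rf "$tmpdir"
echo "extracted: $sweep_id, status for missing file: $missing_status"
```

In the job file, the non-zero return status could be used to exit early instead of holding the allocation until the time limit.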

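Then in my main .job file I have the following key lines. Note that the function looks for the output file in $HOME/slurmfiles, so the --output path has to point to that directory: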


#SBATCH --output=./slurmfiles/slurm_output_%A_%a.out

...

srun wandb sweep --project WANDB_PROJECT config.yml

source $HOME/wandbid_from_slurmid.sh

wandbid=$(extract_wandbid_id)

wandb_id="WANDB_ACCOUNT/WANDB_PROJECT/$wandbid"

srun wandb agent --count 2 "$wandb_id" &
srun wandb agent --count 2 "$wandb_id" &
wait
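If you want more than two agents, the repeated srun lines can be replaced by a loop. This is only a sketch: `NUM_AGENTS` and `DRY_RUN` are hypothetical variables I made up, and the sweep path is a placeholder. With `DRY_RUN=1` (the default here) it only prints the commands, so it can be checked outside of SLURM.

```shell
#!/bin/bash
# Sketch: launch NUM_AGENTS agents in the background, then wait for all of them.
# NUM_AGENTS and DRY_RUN are hypothetical names; the sweep path is a placeholder.
NUM_AGENTS=${NUM_AGENTS:-2}
DRY_RUN=${DRY_RUN:-1}
wandb_id=${wandb_id:-"WANDB_ACCOUNT/WANDB_PROJECT/abc123xy"}

launched=0
for i in $(seq 1 "$NUM_AGENTS"); do
    if [ "$DRY_RUN" -eq 1 ]; then
        # Only show what would be run
        echo "srun wandb agent --count 2 $wandb_id &"
    else
        srun wandb agent --count 2 "$wandb_id" &
    fi
    launched=$((launched + 1))
done

# Block until all background agents have finished
wait
echo "launched $launched agent(s)"
```

Depending on the cluster configuration, each srun step may also need its own resource flags so that the steps actually run side by side instead of queueing behind each other.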

I hope this helps someone else. Let me know if you have any feedback :slight_smile: