I want to initialize a wandb sweep and start the agents (in parallel) from a bash file using SLURM.
I found some similar posts here, but they did not work for me.
Therefore I wanted to share my setup. I created a new bash file named wandbid_from_slurmid.sh:
#!/bin/bash
# Define the function: prints the sweep ID found in this job's Slurm output file
extract_wandbid_id() {
    local outp_path="$HOME/slurmfiles"
    local outp_file_path wandbid
    # Flag to keep track of whether the ID has been found
    local found=0
    # Loop until the sweep ID is found
    while [ "$found" -eq 0 ]; do
        # Re-resolve the newest output file for this job on every pass, in case it does not exist yet
        outp_file_path=$(ls -tr "${outp_path}"/*"${SLURM_JOB_ID}"* 2>/dev/null | tail -n 1)
        # Check if the file contains the desired string and extract the sweep ID
        if [ -n "$outp_file_path" ] && grep -q "Creating sweep with ID: " "$outp_file_path"; then
            # Extract the ID following "Creating sweep with ID: "
            wandbid=$(grep "Creating sweep with ID: " "$outp_file_path" | head -n 1 | sed 's/.*Creating sweep with ID: //')
            # Mark as found
            found=1
            # WARNING: the caller captures stdout via $(...), so don't echo anything other than the ID
            echo "$wandbid"
            break
        else
            # Wait for 1 second before retrying
            sleep 1
        fi
    done
}
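One caveat with the loop above is that it polls forever if the expected line never appears. A minimal sketch of the same extraction with a timeout, assuming the log line has the form "wandb: Creating sweep with ID: <id>" (the function name `wait_for_sweep_id`, the 60-second default, and the sample ID `abc123xy` are all made up for illustration):

```bash
#!/bin/bash
# Hypothetical helper: poll a log file for the sweep ID, giving up after a timeout.
wait_for_sweep_id() {
    local log_file="$1"
    local timeout_s="${2:-60}"   # assumed default, adjust as needed
    local waited=0 id
    while [ "$waited" -lt "$timeout_s" ]; do
        if [ -f "$log_file" ]; then
            # Print only the text after the marker, first match only
            id=$(sed -n 's/.*Creating sweep with ID: //p' "$log_file" | head -n 1)
            if [ -n "$id" ]; then
                echo "$id"
                return 0
            fi
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "timed out waiting for sweep ID in $log_file" >&2
    return 1
}

# Demo on a made-up log file, so it runs without Slurm or wandb
demo_log=$(mktemp)
printf '%s\n' \
    'wandb: Creating sweep from: config.yml' \
    'wandb: Creating sweep with ID: abc123xy' > "$demo_log"
wait_for_sweep_id "$demo_log" 5
rm -f "$demo_log"
```

The error message goes to stderr so that a caller using `$(...)` still captures nothing but the ID.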
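Then in my main .job file I have the following key lines (note that the function searches $HOME/slurmfiles, so the relative --output path below only matches when the job is submitted from $HOME):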
#SBATCH --output=./slurmfiles/slurm_output_%A_%a.out
...
srun wandb sweep --project WANDB_PROJECT config.yml
source $HOME/wandbid_from_slurmid.sh
wandbid=$(extract_wandbid_id)
wandb_id=WANDB_ACCOUNT/WANDB_PROJECT/"$wandbid"
srun wandb agent --count 2 $wandb_id &
srun wandb agent --count 2 $wandb_id &
wait
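The background-and-wait pattern in the job file can be tried out without submitting anything. Below is a rough sketch as a dry run: `sweep_path` and `launch_agents` are hypothetical helpers, `my_entity`, `my_project` and `abc123xy` are placeholder values, and `echo` stands in for the real `srun wandb agent` call.

```bash
#!/bin/bash
# Hypothetical helper: build the ACCOUNT/PROJECT/SWEEP_ID path passed to `wandb agent`.
sweep_path() {
    local entity="$1" project="$2" sweep_id="$3"
    printf '%s/%s/%s\n' "$entity" "$project" "$sweep_id"
}

# Hypothetical helper: start N copies of a command in the background, then wait for all of them.
launch_agents() {
    local n_agents="$1"; shift
    local i
    for ((i = 0; i < n_agents; i++)); do
        "$@" &
    done
    wait
}

# Dry run: echo replaces the real command, so nothing is submitted to Slurm or wandb.
full_id=$(sweep_path my_entity my_project abc123xy)
launch_agents 2 echo "would run: srun wandb agent --count 2 $full_id"
```

Replacing the `echo ...` arguments with the real `srun wandb agent --count 2 "$full_id"` would give the same two parallel agents as in the job file, with the number of agents in one place.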
I hope this helps someone else out. Let me know if you have any feedback.