W&B sweep using Slurm in a multi-GPU setting

Hi, I am using Slurm to submit a sweep with the following batch script:

#!/bin/bash
#SBATCH --job-name=distributed
#SBATCH --account=[ACCOUNT]
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G


#SBATCH --time=00:05:00
#SBATCH --output=logs/gpu_multi_mpi%j.out 
#SBATCH --error=logs/gpu_multi_mpi%j.out

module purge
module load cuda
module load python/3.9

source ~/venv/bin/activate

wandb agent --count 10 [AGENT_ID]

Here is a copy of main.py:

if __name__ == "__main__":
    # parse the arguments
    parser = argparse.ArgumentParser(
        "Distributed Optimization Script", parents=[get_args_parser()]
    )
    args = parser.parse_args()
    mp.set_start_method("spawn", force=True)

    NODE_ID = os.environ["SLURM_NODEID"]
    rank = int(os.environ["SLURM_PROCID"])
    local_rank = int(os.environ["SLURM_LOCALID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    hostnames = hostlist.expand_hostlist(os.environ["SLURM_JOB_NODELIST"])
    n_nodes = len(hostnames)
    print(NODE_ID, rank, local_rank, world_size, hostnames)
    # get IDs of reserved GPU
    gpu_ids = os.environ["SLURM_STEP_GPUS"].split(",")
    print(f"GPU IDS: {gpu_ids}")

When I run this file as a regular Slurm job (not a sweep), gpu_ids is populated and everything works fine. However, when I attempt to sweep using the batch script shown earlier, I get a key error:

KeyError: 'SLURM_STEP_GPUS'

How can I access the IDs of the requested GPUs when running a sweep?

Hello zafirmk,

Thanks for getting in touch with support. I hope you’re having a great start to your week.

When running a sweep with Weights & Biases (W&B) on a Slurm-managed cluster, you may encounter differences in how environment variables such as SLURM_STEP_GPUS are handled compared to a standard Slurm job. This variable is typically populated when a Slurm step (a subset of a job) explicitly requests GPUs. When your script is launched by wandb agent, however, the environment it runs in might not automatically inherit these GPU-specific settings from the Slurm job.
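If you want to confirm what the agent's process actually sees, a quick diagnostic along these lines can help. This is only a sketch; which of these variables are set (if any) depends on your Slurm version and cluster configuration:

import os

# Print the GPU-related variables visible to the process started by wandb agent.
for var in ("SLURM_STEP_GPUS", "SLURM_JOB_GPUS", "CUDA_VISIBLE_DEVICES"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")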

Here are a few steps and modifications to ensure you get the GPU IDs correctly during a W&B sweep:

  1. Modify SLURM Script
    Ensure your Slurm script requests the appropriate resources and initializes any necessary modules and environment variables. It seems your script already does this well, but adding comments on each step might help ensure nothing crucial is missed:
#!/bin/bash
#SBATCH --job-name=distributed
#SBATCH --account=[ACCOUNT]
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2               # Request 2 GPUs
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=00:05:00
#SBATCH --output=logs/gpu_multi_mpi%j.out
#SBATCH --error=logs/gpu_multi_mpi%j.out

module purge
module load cuda
module load python/3.9

source ~/venv/bin/activate

Run the wandb agent:

wandb agent --count 10 [AGENT_ID]

  2. Modify Python Script
    Your Python script needs to handle the case where the SLURM_STEP_GPUS environment variable might not be set. You can modify your script to check for the existence of this variable and use a default or fetch GPU information differently if not available:
import os
import argparse
import torch
import torch.multiprocessing as mp
import hostlist


def get_args_parser():
    parser = argparse.ArgumentParser(add_help=False)
    # Add other arguments as needed
    return parser


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        "Distributed Optimization Script", parents=[get_args_parser()]
    )
    args = parser.parse_args()
    mp.set_start_method("spawn", force=True)

    NODE_ID = os.environ["SLURM_NODEID"]
    rank = int(os.environ["SLURM_PROCID"])
    local_rank = int(os.environ["SLURM_LOCALID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    hostnames = hostlist.expand_hostlist(os.environ["SLURM_JOB_NODELIST"])
    n_nodes = len(hostnames)
    print(NODE_ID, rank, local_rank, world_size, hostnames)

    # Handle SLURM_STEP_GPUS or fall back
    gpu_ids = os.getenv("SLURM_STEP_GPUS")
    if gpu_ids:
        gpu_ids = gpu_ids.split(",")
    else:
        # Fallback: use torch to find visible GPUs if SLURM_STEP_GPUS is not set
        gpu_ids = [str(i) for i in range(torch.cuda.device_count())]

    print(f"GPU IDS: {gpu_ids}")
  3. Testing and Debugging
  • Direct Slurm Job: Test your script directly with sbatch to ensure that GPUs are being correctly allocated and recognized.
  • W&B Sweep: Run your W&B sweep to check if the fallback logic correctly identifies available GPUs.
  4. Considerations
  • Environment Variables: Be aware that different Slurm configurations might handle environment variables differently. Consult your cluster’s documentation or system administrator for details.
  • Error Handling: It’s good practice to add robust error handling and logging, especially when dealing with distributed systems and external services like W&B (see the sketch after this list).
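For the error-handling point, one possible pattern is to log clearly whenever an expected variable is missing, so that differences between a plain Slurm job and a wandb agent run show up immediately. The require_env helper below is just an illustration, not an existing API:

import logging
import os

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("sweep")


def require_env(name):
    # Hypothetical helper: read an environment variable and warn if it is absent.
    value = os.environ.get(name)
    if value is None:
        logger.warning("%s is not set in this environment", name)
    return value


gpus = require_env("SLURM_STEP_GPUS") or require_env("CUDA_VISIBLE_DEVICES")
logger.info("GPU ids resolved to: %s", gpus)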

By implementing a fallback for GPU ID retrieval, your script should be more resilient to the differing environments that can occur between standard Slurm jobs and those initiated by tools like W&B.

Let me know if this helps and if there is anything else I can do for you!

Best,
Jason