Hello zafirmk,
Thanks for getting in touch with support. I hope you’re having a great start to your week.
When running a sweep with Weights & Biases (W&B) on a Slurm-managed cluster, you may encounter differences in how environment variables such as `SLURM_STEP_GPUS` are handled compared to a standard Slurm job. This variable is typically only populated when a Slurm step (a subset of a job) explicitly requests GPUs. However, when your script is launched by `wandb agent`, the environment it runs in might not automatically inherit these GPU-specific settings from the Slurm job.
Here are a few steps and modifications to ensure you get the GPU IDs correctly during a W&B sweep:
- Modify the Slurm Script
Ensure your Slurm script requests the appropriate resources and loads any necessary modules and environment setup. Your script already does this well, but adding a comment on each step can help ensure nothing crucial is missed:
```bash
#!/bin/bash
#SBATCH --job-name=distributed
#SBATCH --account=[ACCOUNT]
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2              # Request 2 GPUs
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=00:05:00
#SBATCH --output=logs/gpu_multi_mpi%j.out
#SBATCH --error=logs/gpu_multi_mpi%j.out

module purge
module load cuda
module load python/3.9
source ~/venv/bin/activate
```
Then run the wandb agent:

```bash
wandb agent --count 10 [AGENT_ID]
```
- Modify the Python Script
Your Python script needs to handle the case where the `SLURM_STEP_GPUS` environment variable is not set. You can modify it to check for the variable and, if it is missing, fall back to another way of fetching GPU information:
```python
import os
import argparse

import torch
import torch.multiprocessing as mp
import hostlist


def get_args_parser():
    parser = argparse.ArgumentParser(add_help=False)
    # Add other arguments as needed
    return parser


if __name__ == "__main__":
    parser = argparse.ArgumentParser("Distributed Optimization Script", parents=[get_args_parser()])
    args = parser.parse_args()
    mp.set_start_method("spawn", force=True)

    # Topology information provided by Slurm
    NODE_ID = os.environ["SLURM_NODEID"]
    rank = int(os.environ["SLURM_PROCID"])
    local_rank = int(os.environ["SLURM_LOCALID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    hostnames = hostlist.expand_hostlist(os.environ["SLURM_JOB_NODELIST"])
    n_nodes = len(hostnames)
    print(NODE_ID, rank, local_rank, world_size, hostnames)

    # Handle SLURM_STEP_GPUS or fall back
    gpu_ids = os.getenv("SLURM_STEP_GPUS")
    if gpu_ids:
        gpu_ids = gpu_ids.split(",")
    else:
        # Fallback: use torch to find visible GPUs if SLURM_STEP_GPUS is not set
        gpu_ids = [str(i) for i in range(torch.cuda.device_count())]

    print(f"GPU IDS: {gpu_ids}")
```
- Testing and Debugging
- Direct Slurm Job: Test your script directly with sbatch to ensure that GPUs are being correctly allocated and recognized.
- W&B Sweep: Run your W&B sweep to check if the fallback logic correctly identifies available GPUs.
- Considerations
- Environment Variables: Be aware that different Slurm configurations might handle environment variables differently. Consult your cluster’s documentation or system administrator for details.
- Error Handling: It's good practice to add robust error handling and logging, especially when dealing with distributed systems and external services like W&B; see the sketch after this list for one way to wrap the GPU-detection step.
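For example, here is a minimal sketch of what that could look like for the GPU-detection step; the function name `detect_gpu_ids` and the decision to raise when no devices are found are just illustrative choices:

```python
import logging
import os

import torch

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def detect_gpu_ids():
    """Return a list of GPU id strings, logging which source was used."""
    gpu_ids = os.getenv("SLURM_STEP_GPUS")
    if gpu_ids:
        logger.info("Using SLURM_STEP_GPUS=%s", gpu_ids)
        return gpu_ids.split(",")

    count = torch.cuda.device_count()
    if count == 0:
        # Fail loudly rather than silently continuing on CPU.
        raise RuntimeError("SLURM_STEP_GPUS is not set and torch sees no CUDA devices")
    logger.info("Falling back to torch.cuda.device_count()=%d", count)
    return [str(i) for i in range(count)]
```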
By implementing a fallback for GPU ID retrieval, your script should be more resilient to the differences in environment between standard Slurm jobs and runs launched by the W&B sweep agent.
Let me know if this helps and if there is anything else I can do for you!
Best,
Jason