I am using Sweeps, but I want each sweep run to use more than 1 GPU. For instance, in a simple case, I set just one sweep configuration by fixing all values to constants, and I use SLURM to spawn a job with 2 GPUs allocated. The sweep run still uses only 1 GPU even though I have 2 of them. How can I do this?
If I run the raw script without a sweep, it uses DDP to run in parallel across several GPUs as expected.
When using W&B Sweeps with multiple GPUs, you need to ensure that your training script is set up for distributed training. Since you're using PyTorch Lightning with the ddp_find_unused_parameters_true strategy, your script is already prepared for that.
However, when running sweeps with W&B, you also need to make sure that each sweep agent knows how many GPUs are available and is configured to use them. Here's how you can adjust your SLURM script and training function so that each sweep run uses multiple GPUs:
SLURM Script: Make sure your SLURM script requests multiple GPUs and sets the appropriate environment variables. For example:
#!/bin/bash
#SBATCH --job-name=my-sweep
#SBATCH --ntasks=1
# Request 2 GPUs
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=sweep-%j.out
# Load any modules or set up the environment if needed
module load cuda/10.1
# Run the W&B agent
wandb agent --count 1 your-entity/your-project/sweepID
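Before digging into the training code, it can help to confirm that the job actually sees both GPUs. Here is a minimal sanity check you could run inside the allocated job (assuming PyTorch is installed; this snippet is illustrative, not part of your existing script):

import os
import torch

# GPUs SLURM made visible to this job
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("SLURM_GPUS_ON_NODE:", os.environ.get("SLURM_GPUS_ON_NODE"))

# Number of GPUs PyTorch can actually use
print("torch.cuda.device_count():", torch.cuda.device_count())

If this reports 2 GPUs but the sweep run still uses only one, the issue is in how the Trainer is configured rather than in the SLURM allocation.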
Training Function: Modify your training function to use the number of GPUs allocated by SLURM. You can read the GPU count from the environment variables SLURM sets; the exact variable depends on how the GPUs were requested (for example, SLURM_GPUS_ON_NODE when using --gres, or SLURM_GPUS when using --gpus). Here's an example of how you might modify your training function:
import os

import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import WandbLogger

import wandb


def train():
    # Initialize a W&B run
    wandb.init()

    # Get the number of GPUs allocated by SLURM.
    # With --gres=gpu:N, SLURM sets SLURM_GPUS_ON_NODE; with --gpus=N it sets SLURM_GPUS.
    num_gpus = int(os.environ.get("SLURM_GPUS_ON_NODE", os.environ.get("SLURM_GPUS", 1)))

    # Logger and callbacks (as in your existing script)
    wandb_logger = WandbLogger()
    lr_monitor = LearningRateMonitor(logging_interval="step")

    # Set up the trainer
    trainer = pl.Trainer(
        logger=wandb_logger,
        accelerator="gpu",
        devices=num_gpus,  # Use the number of GPUs allocated by SLURM
        max_epochs=5,
        check_val_every_n_epoch=5,
        strategy="ddp_find_unused_parameters_true",
        callbacks=[lr_monitor],
        precision="16-mixed",
    )

    # Train the model
    # ...


# Define the sweep configuration
sweep_config = {
    # ....
}
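Once sweep_config is filled in, you can either register the sweep and launch it through the wandb agent CLI call in the SLURM script above, or create the sweep and start the agent directly from Python. A minimal sketch of the programmatic route (the entity and project names below are placeholders):

# Create the sweep once; this returns the sweep ID the agent will use
sweep_id = wandb.sweep(sweep_config, entity="your-entity", project="your-project")

# Run a single sweep run inside this SLURM job; train() builds the multi-GPU Trainer shown above
wandb.agent(sweep_id, function=train, count=1)

Either way, the multi-GPU behavior comes from the devices and strategy settings inside train(), not from the sweep itself.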
Let us know if that helps, and feel free to write back with any questions.
Hi @aletl, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!