Make each sweep run with more than 1 GPU

Hello,

I am using Sweeps, but I want each sweep run to use more than 1 GPU. For instance, in a simple case, I set just one Sweep configuration by fixing all values to constants, and I use SLURM to spawn a job with 2 GPUs allocated. The sweep run still uses just 1 GPU even though I have 2 available. How can I make it use both?

If I run just the raw script without a Sweep, it does use DDP to run in parallel across several GPUs.

trainer = pl.Trainer(
                    logger=wandb_logger,
                    accelerator='gpu',
                    max_epochs=5,
                    devices=-1,
                    check_val_every_n_epoch=5,
                    strategy="ddp_find_unused_parameters_true",
                    callbacks=[lr_monitor],
                    precision='16-mixed')

I am running the Sweeps from a script, by the way.

if __name__ == "__main__":
        
    setting = {
        'value': setting_selected
        }
     
    config.sweep_config['parameters']['setting'] = setting
    pprint.pprint(config.sweep_config)
    
    sweep_id = wandb.sweep(config.sweep_config, entity="shared", project="models-evaluation")
    
    wandb.agent(sweep_id, evaluate_model, count=1)

Thanks!

Hi @aletl,

When using W&B Sweeps with multiple GPUs, you need to ensure that your training script is set up for distributed training. Since you're using PyTorch Lightning with the ddp_find_unused_parameters_true strategy, your script is already prepared for that.

However, when running sweeps with W&B, you need to make sure that each sweep agent knows how many GPUs are available and is configured to use them. Here's how you can modify your SLURM script and training function so that each sweep run utilizes multiple GPUs:

SLURM Script: Make sure your SLURM script requests multiple GPUs and sets the appropriate environment variables. For example:

#!/bin/bash
#SBATCH --job-name=my-sweep
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2  # Request 2 GPUs
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=sweep-%j.out

# Load any modules or set up the environment if needed
module load cuda/10.1

# Run the W&B agent
wandb agent --count 1 your-entity/your-project/sweepID
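
Once the sweep has been created (for example, by calling wandb.sweep as in your script) and you have its ID, submit this script with sbatch (e.g. sbatch my-sweep.sh); the agent then runs inside the 2-GPU allocation, so every run it picks up has both GPUs available.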

Training Function: Modify your training function to use the number of GPUs allocated by SLURM. When GPUs are requested with --gres, SLURM typically exposes the per-node count in the SLURM_GPUS_ON_NODE environment variable (SLURM_GPUS is set when the --gpus option is used instead). Here's an example of how you might modify your training function:

import os
import wandb
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks import LearningRateMonitor

def train():
    # Initialize a W&B run (the sweep agent injects this run's config)
    wandb.init()

    # Get the number of GPUs allocated by SLURM.
    # SLURM_GPUS_ON_NODE is set when GPUs are requested with --gres;
    # fall back to 1 if the variable is not present.
    num_gpus = int(os.environ.get('SLURM_GPUS_ON_NODE', 1))

    # Logger and callbacks
    wandb_logger = WandbLogger()
    lr_monitor = LearningRateMonitor(logging_interval='epoch')

    # Set up the trainer
    trainer = pl.Trainer(
        logger=wandb_logger,
        accelerator='gpu',
        devices=num_gpus,  # Use the number of GPUs allocated by SLURM
        max_epochs=5,
        check_val_every_n_epoch=5,
        strategy="ddp_find_unused_parameters_true",
        callbacks=[lr_monitor],
        precision='16-mixed',
    )

    # Train the model
    # trainer.fit(model, datamodule=...)

# Define the sweep configuration
sweep_config = {
    # ...
}
Let us know if that helps and feel free to write back in for questions.

Hi @aletl, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!
