Hi everyone,
I’m using W&B Sweeps for hyperparameter tuning with PyTorch Lightning on two GPUs. The first run completes fine, but the sweep hangs at the start of the second run and never proceeds.
(When I run the same code on a single GPU, it works without any issues.)
Problem Description: after the first run completes, the next run stops right after the agent prints the run configuration and never starts training. Below is the log output where it gets stuck.
wandb: Agent Starting Run: 74qhtrgy with config:
wandb: activation_function: sigmoid
wandb: batch_size: 32
wandb: d_ff: 119
wandb: d_model: 56
wandb: dropout: 0.4621265312832923
wandb: epochs: 10
wandb: learning_rate: 0.045521224681459006
wandb: nhead: 7
wandb: num_layers: 7
wandb: weight_decay: 0.0067762018632429675
wandb: Tracking run with wandb version 0.17.6
wandb: Run data is saved locally in /home/bourzakismail/My Projects/wandb/run-20240813_145424-b1rmeete/files/wandb/run-20240813_145506-74qhtrgy
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run vibrant-sweep-2
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Code:
import os

import wandb
import lightning as L
from lightning.pytorch.loggers import WandbLogger
from lightning.pytorch.callbacks import EarlyStopping

# iTransformer, seq_len, and d_output are defined elsewhere in my script (omitted here).


def create_sweep_config():
    return {
        'method': 'bayes',
        'name': 'EXP1_iTransformer_SPEED_10_YEARS',
        'metric': {
            'name': 'val_loss',
            'goal': 'minimize'
        },
        'parameters': {
            'num_layers': {'distribution': 'int_uniform', 'min': 1, 'max': 10},
            'nhead': {'distribution': 'int_uniform', 'min': 2, 'max': 10},
            'd_model': {'distribution': 'int_uniform', 'min': 50, 'max': 200},
            'd_ff': {'distribution': 'int_uniform', 'min': 50, 'max': 200},
            'activation_function': {'values': ['relu', 'sigmoid', 'tanh']},
            'dropout': {'distribution': 'uniform', 'min': 0, 'max': 0.5},
            'learning_rate': {'distribution': 'uniform', 'min': 1e-5, 'max': 1e-1},
            'weight_decay': {'distribution': 'uniform', 'min': 1e-5, 'max': 1e-1},
            'batch_size': {'value': 32},
            'epochs': {'value': 10}
        }
    }
def train(config=None):
    wandb.init(config=config)
    config = wandb.config

    # Only the main process (no LOCAL_RANK set yet) redirects WANDB_DIR to the current run directory.
    if os.environ.get("LOCAL_RANK", None) is None:
        os.environ["WANDB_DIR"] = wandb.run.dir

    model = iTransformer(
        seq_len=seq_len,
        # nhead is rounded up to an even value, and d_model is scaled to be a multiple of it.
        d_model=(config.d_model if config.d_model % 2 == 0 else config.d_model + 1) * (config.nhead if config.nhead % 2 == 0 else config.nhead + 1),
        d_output=d_output,
        nhead=(config.nhead if config.nhead % 2 == 0 else config.nhead + 1),
        d_ff=config.d_ff,
        activation_function=config.activation_function,
        num_layers=config.num_layers,
        dropout=config.dropout,
        lr=config.learning_rate,
        weight_decay=config.weight_decay,
        batch_size=config.batch_size
    )

    wandb_logger = WandbLogger(log_model="all")
    trainer = L.Trainer(
        accelerator="gpu",
        devices=2,
        strategy='ddp',
        max_epochs=config.epochs,
        logger=wandb_logger,
        callbacks=[EarlyStopping(monitor="val_loss", mode="min", patience=10)]
    )
    trainer.fit(model)
    trainer.test(model)

    # Only rank 0 finishes the W&B run.
    if int(os.environ.get('LOCAL_RANK', 0)) == 0:
        wandb.finish()
def main():
    sweep_config = create_sweep_config()
    sweep_id = wandb.sweep(sweep_config, project="SpeedPL_iTransformer_Model_10_YEARS")
    wandb.agent(sweep_id, function=train, count=10)


if __name__ == "__main__":
    # Rank 0 launches the sweep agent; DDP worker processes (LOCAL_RANK set) call train() directly.
    if int(os.environ.get('LOCAL_RANK', 0)) == 0:
        main()
    else:
        train()
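In case it helps with diagnosing this, below is a minimal debugging sketch (not part of the script above) that I can call at the start and end of train() to see which process each sweep run is executing in. It only relies on LOCAL_RANK and wandb.run, which the script already uses; the helper name log_process_state is just a placeholder.

import os
import wandb

def log_process_state(tag):
    # Debug helper: report which process is running and whether a W&B run
    # is currently active at this point in the script.
    local_rank = os.environ.get("LOCAL_RANK")
    run_id = wandb.run.id if wandb.run is not None else None
    print(f"[{tag}] pid={os.getpid()} LOCAL_RANK={local_rank} wandb_run={run_id}", flush=True)

Calling log_process_state("train start") right after wandb.init() and log_process_state("train end") after trainer.test() should at least show whether the second run stalls on rank 0 or in the spawned DDP worker process.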
Environment:
Lightning version: 2.3.3
W&B version: 0.17.6
PyTorch version: 2.3.1+cu121
Python version: 3.10.12