Issue with W&B Sweeps and Lightning: Code Stops After First Run

Hi everyone,

I’m using W&B Sweeps for hyperparameter tuning with PyTorch Lightning on two GPUs. The first run works fine, but the sweep stops at the second run and doesn’t proceed any further. (When I run the same code on a single GPU, it works without any issues.)

Problem Description: After the first run completes, my code stops at the configuration-definition stage of the next run and does not proceed. Below is the log output where it gets stuck.

wandb: Agent Starting Run: 74qhtrgy with config:
wandb: activation_function: sigmoid
wandb: batch_size: 32
wandb: d_ff: 119
wandb: d_model: 56
wandb: dropout: 0.4621265312832923
wandb: epochs: 10
wandb: learning_rate: 0.045521224681459006
wandb: nhead: 7
wandb: num_layers: 7
wandb: weight_decay: 0.0067762018632429675
wandb: Tracking run with wandb version 0.17.6
wandb: Run data is saved locally in /home/bourzakismail/My Projects/wandb/run-20240813_145424-b1rmeete/files/wandb/run-20240813_145506-74qhtrgy
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run vibrant-sweep-2
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

Code:

# Imports assumed by the code below; iTransformer, seq_len and d_output
# come from elsewhere in my project.
import os

import lightning as L
import wandb
from lightning.pytorch.callbacks import EarlyStopping
from lightning.pytorch.loggers import WandbLogger

def create_sweep_config():
    return {
        'method': 'bayes',
        'name': 'EXP1_iTransformer_SPEED_10_YEARS',
        'metric': {
            'name': 'val_loss',
            'goal': 'minimize'   
        },
        'parameters': {
            'num_layers': {'distribution': 'int_uniform', 'min': 1, 'max': 10},
            'nhead': {'distribution': 'int_uniform', 'min': 2, 'max': 10},
            'd_model': {'distribution': 'int_uniform', 'min': 50, 'max': 200},
            'd_ff': {'distribution': 'int_uniform', 'min': 50, 'max': 200},
            'activation_function': {'values': ['relu', 'sigmoid', 'tanh']},
            'dropout': {'distribution': 'uniform', 'min': 0, 'max': 0.5},
            'learning_rate': {'distribution': 'uniform', 'min': 1e-5, 'max': 1e-1},
            'weight_decay': {'distribution': 'uniform', 'min': 1e-5, 'max': 1e-1},
            'batch_size': {'value': 32},
            'epochs': {'value': 10}
        }
    }
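
# Aside (a sketch only, not what create_sweep_config above uses): W&B sweeps also
# accept value-based log-uniform distributions, which can suit parameters that span
# several orders of magnitude, e.g.
#   'learning_rate': {'distribution': 'log_uniform_values', 'min': 1e-5, 'max': 1e-1},
#   'weight_decay':  {'distribution': 'log_uniform_values', 'min': 1e-5, 'max': 1e-1},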

def train(config=None):
    wandb.init(config=config)
    config = wandb.config

    if os.environ.get("LOCAL_RANK", None) is None:
        os.environ["WANDB_DIR"] = wandb.run.dir

    # Round nhead (and d_model) up to even values so d_model is always divisible
    # by the number of attention heads.
    nhead = config.nhead if config.nhead % 2 == 0 else config.nhead + 1
    d_model = (config.d_model if config.d_model % 2 == 0 else config.d_model + 1) * nhead

    model = iTransformer(
        seq_len=seq_len,
        d_model=d_model,
        d_output=d_output,
        nhead=nhead,
        d_ff=config.d_ff,
        activation_function=config.activation_function,
        num_layers=config.num_layers,
        dropout=config.dropout,
        lr=config.learning_rate,
        weight_decay=config.weight_decay,
        batch_size=config.batch_size
    )

    wandb_logger = WandbLogger(log_model="all")

    trainer = L.Trainer(
        accelerator="gpu",
        devices=2,
        strategy='ddp',
        max_epochs=config.epochs,
        logger=wandb_logger,
        callbacks=[EarlyStopping(monitor="val_loss", mode="min", patience=10)]
    )

    trainer.fit(model)
    trainer.test(model)
    if int(os.environ.get('LOCAL_RANK', 0)) == 0:
        wandb.finish()

def main():
    sweep_config = create_sweep_config()
    sweep_id = wandb.sweep(sweep_config, project="SpeedPL_iTransformer_Model_10_YEARS")
    wandb.agent(sweep_id, function=train, count=10)

if __name__ == "__main__":
    # Rank 0 drives the sweep; DDP worker processes re-launched by Lightning
    # (with LOCAL_RANK set) skip the agent and call train() directly.
    if int(os.environ.get('LOCAL_RANK', 0)) == 0:
        main()
    else:
        train()
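
For what it’s worth, here is a minimal sketch of an alternative entry point (it would replace the `__main__` block above), under the assumption that the hang comes from running consecutive DDP trials inside one long-lived agent process: the sweep is created once in a driver, and each trial is started in a fresh Python process so that all DDP and W&B state is torn down between runs. `run_single_trial` and the argv-based hand-off are illustrative only, not part of my actual code.

import subprocess
import sys

def run_single_trial(sweep_id):
    # Pull exactly one config from the sweep, train, then let the process exit.
    wandb.agent(sweep_id, function=train, count=1)

if __name__ == "__main__":
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if len(sys.argv) > 1:
        # Child process started by the driver below, or a DDP worker that
        # Lightning re-launched with the same argv and LOCAL_RANK set.
        if local_rank == 0:
            run_single_trial(sys.argv[1])
        else:
            train()
    elif local_rank == 0:
        # Driver: create the sweep once, then run each trial in its own process.
        sweep_id = wandb.sweep(create_sweep_config(),
                               project="SpeedPL_iTransformer_Model_10_YEARS")
        for _ in range(10):
            subprocess.run([sys.executable, sys.argv[0], sweep_id], check=True)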

Environment:

Lightning version: 2.3.3
W&B version: 0.17.6
torch version: 2.3.1+cu121
Python version: 3.10.12

This just happened to me as well, yesterday and today, after 10-15 hours of training.

Hey @bourzakismail @mojtaba-bahrami, just wanted to check if you’re still seeing this? There was a temporary outage, so you may have been affected.

Thanks for the reply @luis_bergua1. I have put wandb into offline mode since then. I will try it again and let you know.
May I ask when the outage occurred? I faced a bunch of different issues and want to know whether they all fell within the outage window or were caused by something else.

Hey @luis_bergua1,
Thank you for your response. I tried running my code this morning, but unfortunately, I’m still encountering the same issue.

Thanks for confirming this @bourzakismail! Would you mind sharing a minimal repro so we can reproduce on our end? I tried using your config but was unable to reproduce the issue.

Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi, since we have not heard back from you, we are going to close this request. If you would like to reopen the conversation, please let us know! Unfortunately, at the moment, we do not receive notifications if a thread reopens on Discourse. So, please feel free to create a new ticket regarding your concern if you’d like to continue the conversation.