Sweeps ending in just 1 epoch

The problem is this: I am running a hyperparameter sweep using wandb. The first run trains for the set number of epochs, but the subsequent runs finish after just 1 epoch. As evidence, I have attached an image in which you can observe a drastic decrease in runtime as the sweep progresses.

Here is my code for performing the sweep:


import wandb
from torch.utils.data import DataLoader
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.accelerators import find_usable_cuda_devices

# sweep_config, tr_dataset, val_dataset, DenseNet, Classifier,
# early_stop_callback and rich_progress_bar are defined elsewhere in the script.

wandb.login()

NAME = (
    sweep_config['parameters']['model_name']['value']
    + f"__var-{sweep_config['parameters']['num_classes']['value']}"
    + f"__fold-{sweep_config['parameters']['fold']['value']}"
)

print('NAME:', NAME, '\n\n')
sweep_id = wandb.sweep(sweep_config, project=NAME)

def tune_hyperparams(config=None):
    with wandb.init(config=config):
        config = wandb.config
        print(config, '\n\n\n')
        num_workers = 8
        tr_loader = DataLoader(tr_dataset, batch_size=config['BATCH_SIZE'],
                               shuffle=True, num_workers=num_workers)
        val_loader = DataLoader(val_dataset, batch_size=config['BATCH_SIZE'],
                                shuffle=False, num_workers=num_workers)

        model_obj = DenseNet(densenet_variant=config['model_size'],
                             in_channels=config['in_channels'],
                             num_classes=config['num_classes'],
                             compression_factor=0.3, k=32, config=config)
        model = Classifier(model_obj)

        run_name = f"lr_{config['lr']} *** bs{config['BATCH_SIZE']} *** decay_{config['weight_decay']}"
        wandb_logger = WandbLogger(project=NAME, name=run_name)

        trainer = Trainer(callbacks=[early_stop_callback, rich_progress_bar],
                          accelerator='gpu', max_epochs=config['epochs'],
                          logger=[wandb_logger],
                          devices=find_usable_cuda_devices(1))

        trainer.fit(model, tr_loader, val_loader)
    wandb.finish()

wandb.agent(sweep_id, tune_hyperparams, count=30)
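For context, the snippet above assumes a sweep_config shaped roughly like the following. The parameter names match the keys the code reads; every value here is hypothetical, not taken from the actual runs:

```python
# Hypothetical sweep_config matching the keys the snippet above reads.
# All values are illustrative placeholders.
sweep_config = {
    'method': 'random',
    'metric': {'name': 'val_loss', 'goal': 'minimize'},
    'parameters': {
        'model_name': {'value': 'densenet'},
        'num_classes': {'value': 7},
        'fold': {'value': 0},
        'model_size': {'value': 121},
        'in_channels': {'value': 3},
        'epochs': {'value': 50},
        'BATCH_SIZE': {'values': [16, 32, 64]},
        'lr': {'values': [1e-3, 1e-4]},
        'weight_decay': {'values': [0.0, 1e-4]},
    },
}

# The project name is built from the fixed-value entries:
NAME = (
    sweep_config['parameters']['model_name']['value']
    + f"__var-{sweep_config['parameters']['num_classes']['value']}"
    + f"__fold-{sweep_config['parameters']['fold']['value']}"
)
```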

Please tell me how to tackle this problem.
Thanks in advance.

Hello,

Thank you for contacting support.

To help resolve your issue as efficiently as possible, could you please provide the following information:

  1. A link to the project.
  2. A copy of the code snippet for the callbacks=[early_stop_callback] portion of your code.

Thanks again and I look forward to hearing from you.

Best regards,
Jason

Thanks for reaching out. Here is the link to the wandb project →

Callbacks:

early_stop_callback = EarlyStopping(
    monitor='val_loss',
    min_delta=0.0001,
    patience=20,
    verbose=True,
    mode='min'
)
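(Side note, an observation rather than something confirmed in this thread: this callback is constructed once at module level, and the same instance is passed to every Trainer created inside tune_hyperparams. A stateful early-stopping object carries its best score and wait counter from one run into the next, which can make later runs stop almost immediately. A simplified stand-in, not Lightning's actual class, shows the mechanism:)

```python
# A toy early stopper, NOT Lightning's EarlyStopping -- just enough state
# (best score + wait counter) to show why reuse across runs is risky.
class ToyEarlyStopper:
    def __init__(self, patience):
        self.patience = patience
        self.best = float('inf')
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience

def run_training(stopper, losses):
    """Return how many 'epochs' ran before the stopper fired."""
    for epoch, loss in enumerate(losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(losses)

stopper = ToyEarlyStopper(patience=2)

# Run 1: the loss plateaus at 0.5, so the stopper fires after a few epochs.
epochs_run_1 = run_training(stopper, [1.0, 0.5, 0.5, 0.5, 0.5])

# Run 2 reuses the SAME instance: best is already 0.5 and the wait counter
# is already at the threshold, so even an improving run is cut off at once.
epochs_run_2 = run_training(stopper, [0.9, 0.8, 0.7, 0.6, 0.55])
```

If this matches what is happening, constructing the EarlyStopping callback inside tune_hyperparams, so each run gets a fresh instance, would be the natural fix.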

Isn’t it possible to share the project in private mode via a shareable link, as with MS Office packages?

Hello,

Thanks for your kind patience while I investigated the issue. Reviewing the logs I see the following:

29 Monitored metric val_loss did not improve in the last 55 records. Best score: 0.612. Signaling Trainer to stop.

Here is a direct link to the logs within your project page as well.

Wandb will only ever terminate a run early if early_terminate is explicitly set within your sweep config. More info can be found here
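For reference, early_terminate sits at the top level of the sweep config; a minimal fragment using the Hyperband method looks like this (values are illustrative only):

```python
# Illustrative sweep config fragment: early_terminate enables wandb's own
# early stopping of sweep runs via Hyperband. Values are examples only.
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val_loss', 'goal': 'minimize'},
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 3,   # first evaluation bracket after 3 iterations
    },
    'parameters': {
        'lr': {'values': [1e-3, 1e-4]},
    },
}
```

Absent such a block, wandb itself does not stop runs on its own.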

This leaves the PyTorch EarlyStopping callback as the main culprit. Looking into their docs, they describe patience in terms of “events”: “The number of events to wait if no improvement and then stop the training”.

I am not sure exactly what counts as an event, but the logs indicate the EarlyStopping callback is what is ending your runs early.
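(In Lightning's EarlyStopping, to my understanding, patience counts validation checks rather than epochs, and min_delta sets the minimum decrease that counts as an improvement; if validation runs more than once per epoch, the counter can advance faster than the epoch count, which may explain a record count larger than the configured patience. A simplified simulation of that counting, not Lightning's actual implementation, with patience shortened from 20 to 3 for readability:)

```python
# Simulate EarlyStopping-style counting: an "improvement" must beat the
# best score by MORE than min_delta, otherwise the wait counter advances.
# All numbers below are hypothetical.
def checks_until_stop(losses, patience, min_delta):
    best = float('inf')
    wait = 0
    for i, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return i  # index of the validation check that triggers stop
    return None  # never stopped

# A real improvement resets the counter; a flat tail then triggers the stop.
stop_at = checks_until_stop([0.70, 0.68, 0.68, 0.68, 0.68, 0.68],
                            patience=3, min_delta=0.0001)

# Decreases smaller than min_delta do NOT reset the counter.
stop_at2 = checks_until_stop([0.68, 0.679999, 0.679998, 0.679997],
                             patience=3, min_delta=0.0001)
```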

Let me know if you have any more questions or have more context you would like to share.

Best,
Jason

Hi, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!