W&B Sweeps w/ Self-Supervised Learning

I’m reaching out to get some thoughts on integrating W&B Sweeps with some of the code we’re interested in using. An example of the code we’re running is linked here. Note the two key sections, ‘Training Self-Supervised Learning Model’ and ‘Fine Tuning Model’, which contain the !python commands whose arguments (model, technique, learning rate, etc.) we’re interested in tuning.

Based on this documentation, I’ve set up a sweep_config, but I’m unsure how to incorporate the two !python commands in train() when running an agent. Do you have any input on how to integrate a wandb sweep with these two commands?

An additional point I wanted to discuss was the strategy for a Sweep. The SSL code we’re running requires training 2 sequential models (the SSL and the final classification Model) where the output SSL model is the input to the final classification model (see the linked code above). We’re interested in doing hyperparameter tuning for both of the models - should we set up 2 independent sweeps for each? Or should we run a sweep on the first SSL model, pick the best performing model and use that as the input into the second classification model where we run a second sweep?

Hi @anmolseth, thank you for writing in!

Have you looked into using a custom command in your sweep config? Here is some information on this. You can pass any of the sweep’s arguments in via the command line so that your train.py can pick them up there instead of through wandb.config.
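As a very rough sketch (the parameter names below are just placeholders; the ${env}, ${program}, and ${args} macros are filled in by the sweep agent), the config could look something like:

```yaml
program: train.py
method: random
parameters:
  learning_rate:
    values: [1e-3, 1e-4]
  batch_size:
    values: [32, 64]
command:
  - ${env}
  - python
  - ${program}
  - ${args}
```

The ${args} macro expands to `--param=value` flags for each entry under parameters, which is what your script’s argument parser can then pick up.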

As for the second part of your question, I can’t speak to which method would give better accuracy, but my intuition is that optimizing your input model first, and then using that model in the sweep for the second model, would give a better result. Sweeping over both independently may lead the second model to expect input data that looks different from what an optimized version of the first model will actually produce.

Let me know if you have any further questions on either of these and I’d be happy to help out.

Thank you,
Nate


Thanks @nathank! The custom command functionality seems well-suited to my use case. I wanted to follow up on how to set up the SSL train.py to be compatible with command-line Sweeps. The SSL train.py cli_main() method uses ArgumentParser() to pull the arguments out of the command line, but this parser doesn’t seem to be present in any of the W&B examples online.

Do you know if this parser is needed to be compatible with W&B command-line sweeps? Additionally, are there any other changes I should be making to the SSL code to get prepped for my first Sweep?

Thank you,
Anmol

Hi @anmolseth,
What parameters would you like to sweep over? It looks like the ArgumentParser() is already set up to take in a number of arguments. If you are sweeping over parameters that ArgumentParser() already handles, then it will simply grab them automatically from the sweep, provided you use the same names for them.

If not, you will need to add arguments to the parser for any additional parameters you would like to sweep over, and make sure your model is using the arguments passed in via the ArgumentParser().
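For example, a rough sketch (the argument names here are placeholders; yours should match the parameter names in your sweep config):

```python
from argparse import ArgumentParser

def cli_main():
    parser = ArgumentParser()
    # Placeholder arguments; add one per sweep parameter, using the same names.
    parser.add_argument("--learning_rate", type=float, default=1e-3)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--technique", type=str, default="SIMCLR")
    args = parser.parse_args()

    # ...build the model and trainer using args.learning_rate,
    # args.batch_size, args.technique...

if __name__ == "__main__":
    cli_main()
```

The agent will then invoke train.py with `--learning_rate=...` style flags matching whatever the sweep picks.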

Other than that, you should be ready to start your sweep. Let me know if you have any issues and I’d be happy to help!

Thank you,
Nate

Hi @anmolseth,
I wanted to follow up and see if you would like any additional help with this, or if you were able to get your sweep running.

Thank you,
Nate

Hi @nathank,

Thanks for following up. I’ve run into a few issues which I suspect are related. I’ve set up and tried running both a random sweep and a Bayesian sweep with asynchronous Hyperband, but both run into the same errors. Here’s the YAML file for my random sweep for reference:

> method: random
> metric:
>   goal: minimize
>   name: 'loss'
> parameters:
>   model:
>     distribution: categorical
>     values:
>       - imagenet_resnet18
>       - imagenet_resnet50
>       - resnet18
>       - resnet50
>       - minicnn32
>   technique:
>     distribution: categorical
>     values:
>       - SIMCLR
>       - SIMSIAM
>   log_name:
>     value: log_name
>   DATA_PATH:
>     value: /home/jovyan/efs/split_data_SIMCLR_rad_mini/train
>   VAL_PATH:
>     value: /home/jovyan/efs/split_data_SIMCLR_rad_mini/val
>   batch_size:
>     distribution: categorical
>     values:
>       - 8
>       - 16
>       - 32
>       - 64
>       - 128
>       - 256
>   learning_rate:
>     distribution: categorical
>     values:
>       - 1
>       - 1e-1
>       - 1e-2
>       - 1e-3
>       - 1e-4
>       - 1e-5
>       - 1e-6
>       - 1e-7
>   patience:
>     distribution: categorical
>     values:
>       - 1
>       - 5
>       - 10
>       - 20
>   CPU:
>     value: 4
>   GPU:
>     value: 1
> program: /home/jovyan/efs/SSL/train.py

I’m trying to log a relevant metric to evaluate model performance and have tried both ‘loss’ and ‘val_loss’, which are present in the SSL model training outputs. However, the following error message shows up whenever I queue up a sweep: RuntimeError: Early stopping conditioned on metric `val_loss` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: `loss`

Here’s the full error message for reference:


Traceback (most recent call last):
  File "/home/jovyan/efs/SSL/train.py", line 169, in <module>
    cli_main()
  File "/home/jovyan/efs/SSL/train.py", line 162, in cli_main
    trainer.fit(model)
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 57, in train
    return self.train_or_test()
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 625, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 669, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 101, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 926, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 162, in on_validation_end
    self._run_early_stopping_check(trainer, pl_module)
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 173, in _run_early_stopping_check
    or not self._validate_condition_metric(logs)  # short circuit if metric not present
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 132, in _validate_condition_metric
    raise RuntimeError(error_msg)
RuntimeError: Early stopping conditioned on metric `val_loss` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: `loss`
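From reading the PyTorch Lightning docs, my understanding is that EarlyStopping can only monitor metrics the LightningModule itself logs via self.log, so something along these lines would be needed in the SSL module (purely an illustrative sketch; the class, helper, and loss computation are placeholders, not the actual SSL code):

```python
import torch
import pytorch_lightning as pl

class SSLModule(pl.LightningModule):
    """Illustrative sketch only; the real SSL train.py presumably differs."""

    def _shared_step(self, batch):
        # Hypothetical stand-in for the actual SSL loss computation.
        return torch.tensor(0.0, requires_grad=True)

    def training_step(self, batch, batch_idx):
        loss = self._shared_step(batch)
        self.log("loss", loss)       # currently the only metric the callback can see
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self._shared_step(batch)
        self.log("val_loss", loss)   # would make `val_loss` available to EarlyStopping
        return loss
```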

This causes all runs to fail, and they often don’t log any relevant metrics for evaluation. For example, one of my recent sweeps trained multiple models but didn’t log loss for any of them.

My current sweep workflow entails logging in to wandb, creating a sweep from the YAML, and then executing the sweep using the wandb agent.
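Concretely, I’m running roughly these commands (sweep.yaml is just what I’m calling the config above, and the sweep ID is whatever wandb sweep prints out):

```bash
wandb login
wandb sweep sweep.yaml                     # prints the sweep ID
wandb agent <entity>/<project>/<sweep_id>
```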

Do you have any advice on how to resolve this logging issue? The Bayesian sweeps I’ve tried run into the same errors.
