Hi @nathank,
Thanks for following up. I’ve run into a few issues that I suspect are related. I’ve set up and tried running both a random sweep and a Bayesian sweep with asynchronous Hyperband, but both run into the same errors. Here’s the YAML file for my random sweep for reference:
```yaml
method: random
metric:
  goal: minimize
  name: 'loss'
parameters:
  model:
    distribution: categorical
    values:
      - imagenet_resnet18
      - imagenet_resnet50
      - resnet18
      - resnet50
      - minicnn32
  technique:
    distribution: categorical
    values:
      - SIMCLR
      - SIMSIAM
  log_name:
    value: log_name
  DATA_PATH:
    value: /home/jovyan/efs/split_data_SIMCLR_rad_mini/train
  VAL_PATH:
    value: /home/jovyan/efs/split_data_SIMCLR_rad_mini/val
  batch_size:
    distribution: categorical
    values:
      - 8
      - 16
      - 32
      - 64
      - 128
      - 256
  learning_rate:
    distribution: categorical
    values:
      - 1
      - 1e-1
      - 1e-2
      - 1e-3
      - 1e-4
      - 1e-5
      - 1e-6
      - 1e-7
  patience:
    distribution: categorical
    values:
      - 1
      - 5
      - 10
      - 20
  CPU:
    value: 4
  GPU:
    value: 1
program: /home/jovyan/efs/SSL/train.py
```
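As I understand it, the `metric.name` in this config (`loss`) has to match a key that the training script actually logs to W&B while it runs. Here’s a minimal stand-alone illustration of that pattern (this is not my actual train.py; the names and values are just placeholders):

```python
# Minimal stand-alone sketch (NOT the actual SSL/train.py): the sweep's
# metric.name ("loss") has to match a key that gets logged to W&B at runtime.
import wandb

run = wandb.init()                  # when launched by the sweep agent, the chosen parameters land in wandb.config
lr = wandb.config.learning_rate     # e.g. one of the learning_rate values above
batch_size = wandb.config.batch_size

for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)  # placeholder numbers, only to show the logging call
    val_loss = 1.2 / (epoch + 1)
    wandb.log({"loss": train_loss, "val_loss": val_loss})

run.finish()
```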
I’m trying to log a relevant metric to evaluate model performance and have tried both ‘loss’ and ‘val_loss’, which are present in the SSL model training outputs. However, the following error shows up whenever I queue up a sweep:

> RuntimeError: Early stopping conditioned on metric `val_loss` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: `loss`

Here’s the full traceback for reference:
```
Traceback (most recent call last):
  File "/home/jovyan/efs/SSL/train.py", line 169, in <module>
    cli_main()
  File "/home/jovyan/efs/SSL/train.py", line 162, in cli_main
    trainer.fit(model)
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 57, in train
    return self.train_or_test()
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 625, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 669, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 101, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 926, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 162, in on_validation_end
    self._run_early_stopping_check(trainer, pl_module)
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 173, in _run_early_stopping_check
    or not self._validate_condition_metric(logs)  # short circuit if metric not present
  File "/home/jovyan/ai4ls2/lib/python3.8/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 132, in _validate_condition_metric
    raise RuntimeError(error_msg)
RuntimeError: Early stopping conditioned on metric `val_loss` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: `loss`
```
This causes all of the runs to fail, and they often don’t log any relevant metrics for evaluation. For example, here’s one of the outputs from a recent sweep that trained multiple models but didn’t log loss for any of them:
My current sweep workflow entails logging in to wandb, creating a sweep from the YAML, and then executing the sweep using the wandb agent.
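For reference, my understanding is that PyTorch Lightning’s `EarlyStopping` callback can only monitor metrics that are logged from the validation loop via `self.log`. Here’s a minimal sketch of that pattern (a toy module, not my actual train.py):

```python
# Minimal sketch (NOT the actual SSL/train.py): EarlyStopping can only monitor
# keys that are logged during validation, so "val_loss" has to be logged in
# validation_step for monitor="val_loss" to be found.
import torch
from torch import nn
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.log("loss", loss)        # visible to callbacks (and W&B) as "loss"
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.log("val_loss", loss)    # needed for EarlyStopping(monitor="val_loss")

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

trainer = pl.Trainer(
    max_epochs=1,
    callbacks=[EarlyStopping(monitor="val_loss", mode="min")],
)
```

Based on the error above, it seems only `loss` is reaching the callback in my runs, even though `val_loss` shows up in the training output, which is the part I can’t explain.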
Do you have any advice on how to resolve this logging issue? The Bayesian sweeps I’ve tried run into the same problem.