Early Stopping

Hello All,

I’m configuring a hyperparameter sweep. I have training, validation, and test sets.

I’d like to use the test_loss as the final metric to optimize and val_loss for early stopping.

I don’t see a place to specify a metric for early stopping. Does it default to the same metric specified for overall optimization (of hyperparameters)? If so, how can I change this?

Thanks!

1 Like

It isn’t currently possible to use a different metric for Hyperband early stopping than for the search strategy. https://github.com/wandb/sweeps/blob/master/hyperband_stopping.py#L176

One workaround would be to use a search strategy that doesn’t require a metric like random or grid and then use val_loss as your metric for early stopping. You can then easily reconfigure the resulting parameter importance and parallel coordinate plots to show test_loss in your dashboard.
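As a rough sketch of that workaround, a sweep config along these lines should do it: random search (which doesn’t optimize against a metric) combined with Hyperband early termination driven by val_loss. The parameter names and ranges below are placeholders, not from your setup.

```python
# Hypothetical sweep config illustrating the workaround:
# random search + Hyperband early termination on val_loss.
sweep_config = {
    "method": "random",  # search strategy that needs no optimization metric
    "metric": {
        "name": "val_loss",  # consumed by early_terminate, not by the search
        "goal": "minimize",
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 3,  # earliest iteration at which a run can be stopped
    },
    "parameters": {
        # placeholder hyperparameters -- substitute your own
        "learning_rate": {"min": 0.0001, "max": 0.1},
        "batch_size": {"values": [32, 64, 128]},
    },
}

# Then register the sweep as usual, e.g.:
# sweep_id = wandb.sweep(sweep_config, project="my-project")
```

If each run logs both val_loss and test_loss, you can later switch the parameter importance and parallel coordinates panels over to test_loss in the dashboard.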

If you would like this feature, you can file a feature request on our client repo issues.

2 Likes

Welcome to the forum, @max_wasserman! Great first question.

While I can see why it might be good for us to add the ability to separate the early-stopping metric from the Bayesian optimization metric, I would strongly caution against using the test loss in any step of the process – whether it’s the optimization of parameters (obviously a no-no!) or the optimization of hyperparameters. The PyTorch Lightning docs even say that you should only call .test “[o]nly right before publishing your paper or pushing to production”.

The purpose of metrics measured on the test set is to reflect, as veridically as possible, the performance of the model on more data drawn from the same distribution, which we in turn hope reflects the performance of the model on data in production. Selecting hyperparameters based on the test set breaks the “information wall” (more technically, the conditional independence relation) between the test data and the model’s parameters that make the test set useful for getting unbiased estimates of true generalization performance.

There is at least some indication that the use of fixed validation and test sets has led the ML field as a whole to “overfit”, in the sense of over-estimation of true generalization performance:

3 Likes

Thanks so much for the responses.

In this case I am using synthetic data (I can generate a lot of it cheaply). I used the names val_loss/test_loss for simplicity instead of “validation set 1” (for early stopping) and “validation set 2” (for Bayesian optimization). I will generate more data after this (my true test set) for final unbiased estimation of generalization.

It seems the best solution at the moment is to simply do what @_scott recommended: use a random search and log ‘test_loss’ (actually validation set 2) for viz later.

PS is this the preferred location/forum where I should post technical questions of this kind? The GitHub page refers to a slack group that appears to be closed.

2 Likes

Yes, this is the place. Thanks for checking!

We really want to make sure our community enjoys the forums, so we’re quietly moving from Slack to Discourse. We’ll make the announcement soon, once we’re confident the forums are all set up :slight_smile:

1 Like

Ah okay, if you’ve got an actual unbiased test set, then you’re golden. I’d be interested to hear more about your project!

And yes, as @_scott points out, if you aren’t using Bayesian optimization, the choice of metric won’t impact the behavior of your search. random is actually a pretty good choice for HPO – competitive with bayes in my and others’ experience, and less prone to error/misconfiguration. Also, BTW: the early_terminate feature uses HyperBand, which is more aggressive than the usual early stopping folks learn about in an ML class (stop training once validation error starts increasing). That classic style of early stopping is best delegated to the ML framework you’re using.
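For reference, that classic patience-based early stopping looks roughly like this. This is a framework-agnostic sketch (the class and names are my own, not from any particular library); most frameworks ship an equivalent callback.

```python
class EarlyStopper:
    """Minimal patience-based early stopping: stop once the monitored
    validation loss has failed to improve for `patience` checks in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # required improvement to reset the counter
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation result; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy usage: validation loss improves, then plateaus.
stopper = EarlyStopper(patience=2)
for epoch, loss in enumerate([1.0, 0.8, 0.81, 0.82, 0.79]):
    if stopper.step(loss):
        break  # stops at epoch 3, before ever seeing the 0.79
```

HyperBand, by contrast, compares runs against each other at preset iteration brackets and kills the laggards, which is why it can be much more aggressive than per-run patience.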

Thanks for pointing out the issue with the Slack link. As @bhutanisanyam1 said, we are moving discussion to this forum, but that link should’ve still been in operation anyway. Will fix it shortly.

1 Like

I’m doing some graph learning work (inputs are graphs, labels are graphs). Submitting paper soon, so I’ll post it to one of these forums after!

3 Likes