At which epochs does the hyperband algorithm of wandb check for improvement?

What I am trying to do :-

I am trying to apply a Bayesian hyperband sweep.

Now, as mentioned in the docs, under early_terminate we have to mention 4 params (generally): min_iter, s, eta, and max_iter. It would look something like the following.

#______________________________________________________________________________________

My doubts summarized:-

In summary, what I want to know,
Given all 4 :- min_iter, s, eta, and max_iter

  1. at which epochs will the hyperband algorithm check for improvement?

  2. considering I am trying to do Bayesian hyperband, how many runs will be evaluated in the first bracket, and how many runs will be evaluated in the consecutive brackets?

  3. is there any way or rule(s) of thumb to decide what values are good to take for these 4 parameters (min_iter, s, eta, and max_iter)?

  4. please explain the parameters s and eta (especially eta) in a bit more detail, i.e. with a bit of the underlying maths (please keep it simple if possible).

#______________________________________________________________________________________

What is my doubt about? (explained in a bit more detail/context):-

In the docs, it is only somewhat explained at which epochs their implementation of the hyperband algorithm checks for improvement and decides whether to terminate a run or not.

  1. When only the minimum number of iterations for each run is our concern
early_terminate:  
     type: hyperband  
     min_iter: 3

The brackets for this example are: [3, 3*eta, 3*eta*eta, 3*eta*eta*eta], which equals [3, 9, 27, 81].
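Assuming the default eta of 3, that schedule can be sketched in a few lines of Python (an illustrative reconstruction of the formula above, not the actual W&B source; the function name and the number of brackets are my own assumptions):

```python
def brackets_from_min_iter(min_iter, eta=3, num_brackets=4):
    """Bands grow geometrically from min_iter: min_iter * eta**k."""
    return [min_iter * eta ** k for k in range(num_brackets)]

print(brackets_from_min_iter(3))  # [3, 9, 27, 81]
```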

  2. When only the maximum number of iterations for each run is our concern
early_terminate: 
    type: hyperband  
    max_iter: 27  
    s: 2

The brackets for this example are [27/eta, 27/eta/eta], which equals [9, 3].
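The max_iter case runs the same geometric progression in reverse: s successive divisions by eta. A hedged sketch of that formula (again illustrative, not the W&B source):

```python
def brackets_from_max_iter(max_iter, s, eta=3):
    """Bands shrink geometrically from max_iter: s successive divisions by eta."""
    return [max_iter // eta ** k for k in range(1, s + 1)]

print(brackets_from_max_iter(27, s=2))  # [9, 3]
```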

But what about a case when both the minimum and maximum number of iterations for each run are our concern?
Like the one as follows…

early_terminate:
  type: hyperband
  min_iter: 10
  s: 3
  eta: 4
  max_iter: 50

Hi @orphic-vis, thank you for reaching out to Weights & Biases with your very interesting question. I will investigate internally to get some more detailed information on the Bayesian hyperband Sweeps and will get back to you.

@fmamberti-wandb Thank you for your reply!! I will be eagerly waiting for your answer. :smile:

@fmamberti-wandb Any updates on this topic??

Hi @orphic-vis , apologies for the late reply on this - truly sorry.

Referencing the code implementing the hyperband algorithm (see here), the min_iter value is only used if max_iter is not set; therefore, setting both values is the same as setting only max_iter.

As mentioned in this section of the Docs, s determines the number of brackets, and eta is the factor (or divisor, if you are using max_iter) used to multiply or divide the previous bracket to get the next one.
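Putting the two cases together, the precedence rule described here (min_iter is ignored once max_iter is set) can be sketched as follows; this is an illustrative reconstruction of the described behaviour, not the actual W&B source:

```python
def hyperband_bands(min_iter=None, max_iter=None, s=2, eta=3):
    """max_iter takes precedence; min_iter is only used when max_iter is unset."""
    if max_iter is not None:
        return [max_iter // eta ** k for k in range(1, s + 1)]
    return [min_iter * eta ** k for k in range(s)]

# min_iter is silently ignored once max_iter is given:
print(hyperband_bands(min_iter=3, max_iter=27, s=2))  # [9, 3]
print(hyperband_bands(max_iter=27, s=2))              # [9, 3]
```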

how many runs will be evaluated in the first bracket, and how many runs will be evaluated in the consecutive brackets?

When you start a Run via an agent, whenever each Run reaches each step for each bracket (i.e. [3, 3*eta, 3*eta*eta, 3*eta*eta*eta]), it will be evaluated against previous Runs and terminated if the metric is too low (or too high), so the number of runs reaching each bracket will change from sweep to sweep.

Please let me know if you have any further questions and thank you for your patience.

Hi @orphic-vis , I wanted to follow up on this request. Please let us know if we can be of further assistance.

@fmamberti-wandb First of all, thank you for your reply!! And second, sorry for my late reply. :sweat_smile:

I have 2 more doubts.

  1. “it will be evaluated against previous Runs” – Say, so far 40 runs have been completed, and the 11th run has reached a flag/band at which the hyperband algorithm will now check its performance. So, how will the previous runs be used? Will it be the average of the tracked target metric of ALL the previous runs? Or something else? (like, out of all 40, only the most recent 15’s average?)

  2. One issue is that, unlike other libraries, in wandb, to use Bayes, we can only set the method to bayes at the moment, with no further info such as the acquisition function, any prior info on some configs, the number of steps to take totally at random before actually starting the Bayesian search, an exploration-exploitation trade-off parameter, etc. Or is there a way to achieve that? Also, if not, can you please tell me what the “default” settings of the Bayesian optimization used in wandb are?

Once again thanks a lot for your reply!!

@fmamberti-wandb Got somewhat of a lead for an answer to the 1st doubt here.
It mentions that the algorithm takes all the runs ever performed before the current run and takes the top (1/eta) percentile values of the target metric, sorted, as the threshold(s) corresponding to the current band.
But then how are they used further?

Hi @orphic-vis

Regarding your first question, as you mentioned, the algorithm takes the previous runs that reached each band, gets the value of the metric you are monitoring in the sweep, and checks what the threshold is for being in the top percentile of all these runs. If the current run is below/above it (depending on whether you are minimising or maximising), and therefore not in the top r percent, it will stop the run.
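That check can be sketched as below; the percentile logic is an illustrative reconstruction of the behaviour described here (the function name and the exact percentile interpolation are assumptions), not the actual W&B source:

```python
import numpy as np

def should_terminate(current_metric, band_metrics, eta=3, goal="minimize"):
    """Terminate unless the current run is within the top 1/eta fraction of the
    metric values that previous runs recorded at this band."""
    if len(band_metrics) == 0:
        return False  # nothing to compare against yet
    if goal == "minimize":
        threshold = np.percentile(band_metrics, 100.0 / eta)  # best = lowest
        return bool(current_metric > threshold)
    threshold = np.percentile(band_metrics, 100.0 - 100.0 / eta)  # best = highest
    return bool(current_metric < threshold)

losses_at_band = [0.2, 0.4, 0.6, 0.8, 1.0]
print(should_terminate(0.9, losses_at_band))  # True: well outside the top third
print(should_terminate(0.3, losses_at_band))  # False: inside the top third
```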

Regarding your second question, those settings are currently not customisable, but I’d be happy to raise a feature request for this. What parameters would you like to be able to control?
Regarding the default ones, the implementation is based on scikit-learn's GaussianProcessRegressor (see code here) with a Matérn kernel and an expected_improvement acquisition function.
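For intuition, here is a minimal sketch of that setup: a GP surrogate with a Matérn kernel, scored by expected improvement for minimisation. The kernel parameters (nu=2.5, normalize_y) and the candidate grid are my own assumptions, not W&B's exact defaults:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(X_cand, gp, y_best):
    """EI for minimisation: expected amount by which each candidate beats y_best."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # guard against zero predictive std
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy usage: three observed (hyperparameter, loss) pairs
X_obs = np.array([[0.1], [0.5], [0.9]])
y_obs = np.array([1.0, 0.2, 0.8])
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X_obs, y_obs)

X_cand = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
ei = expected_improvement(X_cand, gp, y_obs.min())
next_x = float(X_cand[np.argmax(ei)][0])  # most promising point to try next
```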

@fmamberti-wandb Please do raise the feature request(s) if possible. Some features that could be added (may or may not be related to Bayesian sweeps) are –

  1. Ability to somehow log “sub”-runs of a run (in the case of K-fold cross-validation) grouped as one with ease, especially when sweeping is being done; currently it is very hard to achieve this.

  2. About the early stopping methods - I like the hyperband implementation, but I think it is a bit too restrictive. It would be better to also have a method in which the user can specify at which epochs/bands the method compares the performance of the current run with the previous runs, just in the way the hyperband implementation of wandb does, giving users more flexibility, as otherwise the bands are determined in multiples of min_iter (or factors of max_iter).

  3. Exploration-exploitation trade-off parameter (let’s call it x_i) - Even better if, at a certain value, say 0, it is identical to random search. Why? Related to point 6.

  4. Other than classic Bayesian optimization (which uses Gaussian processes), there is SMAC, which is Bayesian optimization but with random forests; that might be a good addition to the set of optimization algorithms wandb has.

  5. Ability to specify the acquisition function (not necessary, but some libraries do have it) - PI, EI, or upper confidence bound.

  6. Ability to guide the optimization algo (not really necessary but would be nice) :- Bayesian search does require previous knowledge (either it explores itself, or the user provides it), so it would be nice if we could tell the algorithm (regardless of any given previous data) how many random steps to take before actually starting the Bayesian search, and also have the ability to change x_i for the algo (like first 15 random, next 15 explore with Bayes, next 15 exploit with Bayes).

@fmamberti-wandb

Also regarding sweeps, I have faced an issue with categorical hyperparameters. Say I have run a sweep with 8 hyperparameters, out of which 3 are categorical (which doesn’t necessarily mean non-numeric or discontinuous), for 50 evaluations.

If I change the categorical distribution of even one hyperparameter (say, one of them originally had 3 values and I removed one to reduce it down to only 2 values), I get an error forcing me to provide the original distribution.

I want to carry over the knowledge gained to further sweeps, as long as some of the hyperparameters used in the next sweep are common.

— This can be considered as a 7th feature, that may help a lot while performing sweeps in stages.

Hi @orphic-vis , I wanted to let you know that I have raised your feedback as feature request with our product team to review.

Regarding your first and last point:

  • You may be able to group different experiments of your sweep using tags with values from the sweep config parameters, as well as by using the group Runs feature
  • You should be able to set up a new Sweep in the UI, copying the config from the old one and amending the parameters you would like to update, and use Configure prior runs to add the Runs you want to carry over from previous sweeps, carrying their knowledge over to the new Sweep

@fmamberti-wandb
Thanks for answering my questions. My doubts are solved; you can close this question/thread as appropriate.