Average performance over seeds for Bayesian hyperparameter optimizer

Dear W&B Support Team,

In ML, performance from a single seed is not a reliable metric to feed to a Bayesian optimizer. This is especially pronounced in RL, where instability is so high that a SOTA method can perform worse than the baseline on an unlucky run, and vice versa [1]. This makes the W&B Bayesian optimizer currently unusable (or rather, unscientific to use) for tuning RL hyperparameters.

I believe the community has voiced support for this feature in numerous threads (listed below), but no updates or even workarounds have been provided. I hope this post serves as a +1 and a summary of previous requests, and finally moves the needle on this feature.

Personally, not being able to use the W&B Bayes optimizer means I have to either settle for the random search option or spend considerable time implementing a seed-aggregation wrapper (sketched below).
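
For reference, here is a minimal sketch of what such a wrapper could look like, assuming the standard W&B Python API (`wandb.sweep` / `wandb.agent`); `train_one_seed` and the `mean_return` metric name are placeholders for illustration, not part of any official feature. Each sweep trial runs the same hyperparameters over several seeds and logs only the aggregate metric that the Bayesian optimizer targets:

```python
# Sketch of a seed-aggregation wrapper (not an official W&B feature).
# Replace train_one_seed with your actual training loop.
import numpy as np
import wandb

SEEDS = [0, 1, 2]  # seeds averaged within a single sweep trial


def train_one_seed(config, seed):
    """Placeholder: run one full training job and return its final score."""
    rng = np.random.default_rng(seed)
    # ... real RL training would go here ...
    return float(config.learning_rate * 100 + rng.normal(scale=5.0))


def sweep_trial():
    run = wandb.init()  # receives the hyperparameters chosen by the sweep
    scores = [train_one_seed(run.config, seed) for seed in SEEDS]
    # Log the aggregate; the sweep's metric.name must point at this key so
    # the Bayesian optimizer ranks configs by mean performance across seeds.
    run.log({"mean_return": float(np.mean(scores)),
             "std_return": float(np.std(scores))})
    run.finish()


if __name__ == "__main__":
    sweep_config = {
        "method": "bayes",
        "metric": {"name": "mean_return", "goal": "maximize"},
        "parameters": {"learning_rate": {"min": 1e-5, "max": 1e-2}},
    }
    sweep_id = wandb.sweep(sweep_config, project="seed-aggregate-demo")
    wandb.agent(sweep_id, function=sweep_trial, count=20)
```

This treats the mean across seeds as a single observation per trial, at the cost of running the seeds sequentially inside one run; native support would avoid both the boilerplate and that serialization.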

Previous requests:

Reference:

[1] Eimer, Theresa, Marius Lindauer, and Roberta Raileanu. ‘Hyperparameters in Reinforcement Learning and How To Tune Them’. arXiv preprint arXiv:2306.01324, 2 June 2023. https://arxiv.org/abs/2306.01324

I’m sending this off to the support team to see what can be done.

We have a solution in the Kempner Institute handbook; feel free to submit an issue to our GitHub repository if you encounter any problems: https://github.com/KempnerInstitute/optimizing-ml-workflow/tree/main/workshop_exercises/wandb_aggregate