Hi @gurugecl, I ran some tests and was unable to reproduce a case where a wandb bayes sweep with hyperband early termination produces a failed/crashed run when early stopping occurs; see project here. Do you have a reproducible example I could test against?
Some high-level info on how bayes treats runs:
When a run in a Bayesian sweep executes to completion, by default we use the “best” metric value of that finished run when picking the next run. The user can change this default by setting an impute strategy:
- “worst”: use the worst value of the run’s target metric,
- “best”: use the best value of the run’s target metric,
- “latest”: use the latest value of the run’s target metric.
If a run fails, crashes, or is killed, wandb defaults to the “worst” metric value unless the user has set an impute strategy.
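To make the three strategies concrete, here is a minimal sketch in plain Python. It is not wandb's actual implementation; the function names, the run state strings, and the "minimize"/"maximize" goal argument are assumptions made for illustration only.

```python
from typing import Sequence


def impute_metric(history: Sequence[float], goal: str, strategy: str) -> float:
    """Pick one value from a run's logged metric history (illustrative only)."""
    if not history:
        raise ValueError("run logged no values for the target metric")
    if goal not in ("minimize", "maximize"):
        raise ValueError(f"unknown goal: {goal!r}")
    if strategy == "latest":
        # The most recently logged value, regardless of how good it is.
        return history[-1]
    lowest, highest = min(history), max(history)
    if strategy == "best":
        return lowest if goal == "minimize" else highest
    if strategy == "worst":
        return highest if goal == "minimize" else lowest
    raise ValueError(f"unknown impute strategy: {strategy!r}")


def default_strategy(run_state: str) -> str:
    """Defaults as described above: "best" for finished runs, "worst" otherwise.

    The state names are hypothetical labels for this sketch.
    """
    return "best" if run_state == "finished" else "worst"


# A run that logged three loss values and was then killed.
losses = [0.9, 0.4, 0.6]
print(impute_metric(losses, "minimize", default_strategy("killed")))
```

With a minimized loss, the killed run above is scored by its highest logged loss, while the same history from a finished run would be scored by its lowest.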
As for the difference between crashed and failed runs:
A failed run means a non-zero exit (control-c, an exception, or sys.exit(1)).
- Scenario: W&B kills the agent, and does not run any further sweep experiments.
- The sweep agent is designed to shut down if (by default) there are 3 failures in the first 60 seconds, or the first 5 runs started are all failures.
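The shutdown rule above can be sketched as a small check. This is only my reading of the rule, not the agent's real code; the function name, its parameters, and the way runs are represented are all hypothetical.

```python
from typing import Sequence, Tuple


def agent_should_stop(
    runs: Sequence[Tuple[float, bool]],
    max_early_failures: int = 3,
    early_window_s: float = 60.0,
    max_initial_failures: int = 5,
) -> bool:
    """Decide whether the agent gives up (illustrative sketch only).

    `runs` lists, in start order, (seconds since agent start when the run
    ended, whether the run failed).
    """
    # Rule 1: enough failures inside the early window.
    early_failures = sum(
        1 for ended_at, failed in runs if failed and ended_at <= early_window_s
    )
    if early_failures >= max_early_failures:
        return True
    # Rule 2: every one of the first N started runs failed.
    first = runs[:max_initial_failures]
    return len(first) == max_initial_failures and all(failed for _, failed in first)


# Three failures within the first minute trip the first rule.
print(agent_should_stop([(10.0, True), (20.0, True), (30.0, True)]))
```

A single successful run among the first five is enough to keep the agent alive under the second rule, as long as the early-window limit is not hit.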
A crashed run means the wandb backend never got a successful exit indication after 5-10 minutes of no heartbeat.