Early Terminate Failing with Exit Code 1

Hi @gurugecl, I ran some tests and was unable to reproduce a case where a wandb bayes sweep with hyperband early termination produces a failed/crashed run when early stopping occurs; see project here. Do you have a reproducible working example I could test against?

Some high-level info on how bayes treats runs

When a Bayesian sweep run executes to completion, we always use the “best” metric value of the last finished run for the next run. The user can change this default by setting an impute strategy:

  • “worst”: use the worst value of the run’s target metric,
  • “best”: use the best value of the run’s target metric,
  • “latest”: use the latest value of the run’s target metric.

Now, in the event a run fails, crashes, or is killed, wandb defaults to the “worst” metric value unless the user has set an impute strategy.
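To make the three strategies concrete, here is a minimal Python sketch of how a single value could be picked from a run's metric history. This is only an illustration, not wandb's actual implementation: the function name `impute_metric`, its parameters, and the list-of-floats history are all hypothetical.

```python
from typing import List


def impute_metric(history: List[float], goal: str = "minimize",
                  strategy: str = "worst") -> float:
    """Pick one value from a run's logged metric history.

    Hypothetical helper for illustration only; not part of the wandb API.
    `goal` is "minimize" or "maximize"; `strategy` is "worst", "best", or "latest".
    """
    if not history:
        raise ValueError("run logged no values for the target metric")
    if strategy == "latest":
        # Most recently logged value, regardless of the goal.
        return history[-1]
    lowest_is_best = goal == "minimize"
    if strategy == "best":
        return min(history) if lowest_is_best else max(history)
    if strategy == "worst":
        return max(history) if lowest_is_best else min(history)
    raise ValueError(f"unknown impute strategy: {strategy!r}")


# Example: a loss curve from a run that was stopped early.
loss_history = [0.9, 0.4, 0.6]
print(impute_metric(loss_history, strategy="worst"))
print(impute_metric(loss_history, strategy="best"))
print(impute_metric(loss_history, strategy="latest"))
```

Note that “worst” and “best” depend on whether the sweep minimizes or maximizes the metric, while “latest” does not.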

On the difference between crashed and failed runs

A failed run means a non-zero exit (control-c, an exception, or sys.exit(1)).

  • Scenario: W&B kills the agent, and does not run any further sweep experiments.
  • By default, the sweep agent is designed to shut down if there are 3 failures in the first 60 seconds, or if the first 5 runs started are all failures.
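The shutdown rule above can be sketched roughly as follows. This is my own illustration of the two default thresholds, not the agent's real code: `RunResult`, `should_stop_agent`, and the parameter names are hypothetical, and the list is assumed to be ordered by when each run was started.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    finished_at: float  # seconds since the agent started (assumed bookkeeping)
    exit_code: int      # non-zero means the run failed


def should_stop_agent(results: List[RunResult],
                      max_early_failures: int = 3,
                      window_seconds: float = 60.0,
                      max_initial_failures: int = 5) -> bool:
    """Return True if either default shutdown condition is met (sketch only)."""
    # Condition 1: too many failures within the first `window_seconds`.
    early_failures = sum(
        1 for r in results
        if r.exit_code != 0 and r.finished_at <= window_seconds
    )
    if early_failures >= max_early_failures:
        return True
    # Condition 2: the first `max_initial_failures` runs started all failed.
    first = results[:max_initial_failures]
    return (len(first) == max_initial_failures
            and all(r.exit_code != 0 for r in first))


# Three runs that each exit with code 1 within the first minute.
quick_failures = [RunResult(5.0, 1), RunResult(12.0, 1), RunResult(30.0, 1)]
print(should_stop_agent(quick_failures))
```

So a training script that exits with code 1 right at startup can stop the whole agent after only a handful of runs.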

A crashed run means the wandb backend never got a successful exit indication after 5-10 minutes of no heartbeat.
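Put together, the failed vs crashed distinction could be sketched like this. Again this is just an illustration and not how the backend is actually written: `classify_run` and its parameters are hypothetical, and the 600-second timeout is an assumed value taken from the upper end of the 5-10 minute range.

```python
from typing import Optional


def classify_run(exit_code: Optional[int],
                 seconds_since_heartbeat: float,
                 heartbeat_timeout: float = 600.0) -> str:
    """Classify a run from its exit code and heartbeat age (sketch only).

    `exit_code` is None when the backend never received an exit indication.
    """
    if exit_code == 0:
        return "finished"
    if exit_code is not None:
        # Non-zero exit: control-c, an uncaught exception, sys.exit(1), ...
        return "failed"
    if seconds_since_heartbeat > heartbeat_timeout:
        # No exit indication and no heartbeat for too long.
        return "crashed"
    return "running"


print(classify_run(1, 0.0))       # the process reported a non-zero exit
print(classify_run(None, 900.0))  # the process went silent
```

The key difference is that a failed run reported its own non-zero exit, while a crashed run is inferred from silence.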