Early Terminate Failing with Exit Code 1

mohammadbakir · September 25, 2023, 5:55pm

Hi @gurugecl, I ran some tests and was unable to reproduce a situation where a wandb Bayes sweep with Hyperband early termination produces a failed/crashed run when early stopping occurs; see project here. Do you have a reproducible working example I could test against?

Some high-level info on how Bayes treats runs:

When a Bayesian sweep run executes to completion, we always use the “best” metric value of the last finished run for the next run. The user can change this default with an impute strategy:

“worst”: use the worst value of the run’s target metric,
“best”: use the best value of the run’s target metric,
“latest”: use the latest value of the run’s target metric.

If a run fails, crashes, or is killed, wandb defaults to the “worst” metric value unless the user has set an impute strategy.
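To make the three strategies concrete, here is a minimal sketch in plain Python. It does not call wandb and is not wandb's implementation; the function name `impute_metric`, the `goal` argument, and the list-of-floats history are all assumptions made for illustration.

```python
def impute_metric(history, strategy="worst", goal="minimize"):
    """Pick one value from a run's logged metric history.

    Hypothetical helper for illustration only, not part of the wandb API.
    history:  metric values in the order they were logged
    strategy: "worst", "best", or "latest"
    goal:     "minimize" or "maximize" (what "best" means for this metric)
    """
    if not history:
        raise ValueError("run logged no values for the target metric")
    if strategy not in ("worst", "best", "latest"):
        raise ValueError(f"unknown impute strategy: {strategy!r}")
    if goal not in ("minimize", "maximize"):
        raise ValueError(f"unknown goal: {goal!r}")

    if strategy == "latest":
        return history[-1]

    # "best" while minimizing and "worst" while maximizing both take the minimum;
    # the other two combinations take the maximum.
    take_min = (strategy == "best") == (goal == "minimize")
    return min(history) if take_min else max(history)


# Example: a validation loss (lower is better) logged over three steps.
losses = [0.9, 0.4, 0.6]
print(impute_metric(losses, "worst"))   # 0.9
print(impute_metric(losses, "best"))    # 0.4
print(impute_metric(losses, "latest"))  # 0.6
```

With the “worst” default, a run that fails after a good start is reported to the optimizer by its poorest logged value.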

In terms of the difference between crashed vs failed runs:

A failed run means a non-zero exit (Control-C, an exception, or sys.exit(1)).
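A quick way to see what “non-zero exit” means is to launch small Python snippets as child processes and read their exit codes. This is a standalone sketch using only the standard library; the helper name `exit_code` is made up for this example, and it says nothing about how wandb itself inspects the process.

```python
import subprocess
import sys


def exit_code(snippet):
    """Run a Python snippet in a child interpreter and return its exit code."""
    result = subprocess.run([sys.executable, "-c", snippet], capture_output=True)
    return result.returncode


# A script that finishes normally exits with 0.
print(exit_code("print('training finished')"))

# An explicit sys.exit(1) and an uncaught exception both exit with a non-zero code,
# which is the condition described above for a failed run.
print(exit_code("import sys; sys.exit(1)"))
print(exit_code("raise RuntimeError('boom')"))
```

If your training script exits with code 1 when early termination kicks in, checking for an uncaught exception or an explicit sys.exit(1) in the shutdown path is a reasonable first step.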

Scenario: W&B kills the agent and does not run any further sweep experiments.
By default, the sweep agent is designed to shut down if there are 3 failures in the first 60 seconds, or if the first 5 runs started are all failures.
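The shutdown rule can be written as a small predicate. This is only a sketch of the rule as stated above, not the agent's actual code: the function name, the parameter names, and the assumption that each run is recorded as (seconds since the agent started, whether it failed) are all hypothetical.

```python
def agent_should_stop(runs, window_seconds=60.0, max_early_failures=3,
                      max_initial_failures=5):
    """Decide whether the agent would give up, per the rule described above.

    Hypothetical helper for illustration only.
    runs: list of (seconds_since_agent_start, failed) tuples, in start order.
    """
    # Rule 1: too many failures inside the first window (60 seconds by default).
    early_failures = sum(
        1 for elapsed, failed in runs if failed and elapsed <= window_seconds
    )
    if early_failures >= max_early_failures:
        return True

    # Rule 2: the first N runs started (5 by default) all failed.
    first_runs = runs[:max_initial_failures]
    return (len(first_runs) == max_initial_failures
            and all(failed for _, failed in first_runs))


# Three failures within the first minute: the agent stops.
print(agent_should_stop([(5, True), (20, True), (40, True)]))

# Failures spread out, with a success among the first five runs: it keeps going.
print(agent_should_stop([(30, True), (200, False), (400, True),
                         (600, True), (800, True)]))
```

This is why a script that reliably exits with code 1 can stop the whole sweep after only a handful of runs.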

A crashed run: the wandb backend never received a successful exit indication after 5-10 minutes without a heartbeat.
