Hello,
I have some questions about Bayesian search and crashes: When running my wandb sweeps, sometimes some of my runs crash for “external” reasons (e.g. someone restarts the server I’m running on, problems with accidentally detaching from shared memory, etc.). Since my runs take pretty long, I usually do not have the time to re-run the entire sweep, so I would like to continue it anyway. Since I am using Bayesian search as the method, this leads me to the question of how the search “deals” with the crashed runs when I continue the sweep, especially if I don’t delete them right away:
Does the search take a crashed run into account by interpreting its hyperparameter configuration as “very bad”, in the sense that a run with this configuration did not even finish, so it cannot be a good configuration, or does it just ignore the run? And in the second case, could this lead to the algorithm “choosing” the same hyperparameter configuration that crashed again later for another run?
For this behaviour, does it make a difference if I delete the run right away or leave it in the sweep while the agents start other new runs?
Is there a difference here for “crashed” vs. “failed” runs? And what is the difference between those in general?
Thank you in advance to anyone who has an answer for me!
When a run in a Bayesian sweep executes to completion, by default we use the “best” metric value of the last finished run when choosing the next run. The user can change this default behavior with an impute strategy:
“worst”: use the worst value of the run’s target metric,
“best”: use the best value of the run’s target metric,
“latest”: use the latest value of the run’s target metric.
If a run fails, crashes, or is killed, wandb defaults to the “worst” metric value unless the user has set an impute strategy.
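To make the three strategies concrete, here is a minimal sketch in plain Python of how one value could be picked from a run's logged metric history. This is an illustration only, not wandb's actual implementation: the function name `impute_metric`, its arguments, and the choice to return `None` when nothing was logged are all assumptions made for the example.

```python
from typing import List, Optional


def impute_metric(values: List[float], goal: str, strategy: str = "worst") -> Optional[float]:
    """Pick the single metric value that stands in for a run.

    Hypothetical helper: `values` is the run's logged target metric in
    logging order, `goal` is "minimize" or "maximize".
    """
    if not values:
        # Assumption for this sketch: with no logged values there is nothing to impute.
        return None
    if strategy == "latest":
        return values[-1]
    lowest, highest = min(values), max(values)
    if strategy == "best":
        # The best value is the lowest when minimizing, the highest when maximizing.
        return lowest if goal == "minimize" else highest
    if strategy == "worst":
        return highest if goal == "minimize" else lowest
    raise ValueError(f"unknown impute strategy: {strategy!r}")


# A run that logged a loss three times before it crashed.
loss_history = [0.9, 0.4, 0.6]
crashed_value = impute_metric(loss_history, goal="minimize", strategy="worst")
```

With a loss that is being minimized, "worst" picks the highest logged value, so a crashed run looks as bad to the search as it ever got rather than being ignored.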
In terms of the difference between crashed and failed runs:
A failed run means a non-zero exit (Ctrl-C, an exception, or sys.exit(1)).
Scenario: W&B kills the agent and does not run any further sweep experiments.
The sweep agent is designed to shut down if (by default) there are 3 failures in the first 60 seconds, or if the first 5 runs started are failures.
A crashed run means the wandb backend never received a successful exit indication after 5-10 minutes without a heartbeat.
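The agent shutdown rule described above can be sketched as a small check in Python. This is only a reading of the rule as stated in this reply, not the agent's real code: the function `agent_should_shut_down`, its parameter names, and the way failures are represented are assumptions for illustration.

```python
from typing import List


def agent_should_shut_down(
    failure_times: List[float],
    outcomes: List[bool],
    window_seconds: float = 60.0,
    max_early_failures: int = 3,
    max_initial_failures: int = 5,
) -> bool:
    """Hypothetical check for the default shutdown rule.

    failure_times: seconds since agent start at which each failure happened.
    outcomes: one entry per started run, in start order; True means it failed.
    """
    # Rule 1: too many failures within the first `window_seconds`.
    early_failures = sum(1 for t in failure_times if t <= window_seconds)
    if early_failures >= max_early_failures:
        return True
    # Rule 2: every one of the first `max_initial_failures` runs failed.
    first_runs = outcomes[:max_initial_failures]
    return len(first_runs) == max_initial_failures and all(first_runs)
```

For example, three failures at 5, 20 and 45 seconds would trigger the first rule, while five failed runs spread over several minutes would trigger the second.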
Hi @mohammadbakir,
Thank you for your answer, that was very helpful!
So the default is “worst” if I don’t actively set an impute strategy and don’t delete the crashed or failed run, and I suppose if I do delete the run, the sweep will just continue from then on as if a run for those hyperparameters had never existed, right?
Hi @danielle-schuman, I made an update to my initial response due to an error.
This is correct, the default is "worst" unless an impute strategy is specified by the user. If you pause a sweep and delete one of its runs, we will not count that set of hyperparameters as being covered by the sweep.