How does the bayesian method in sweeps treat crashed runs?

danielle-schuman · June 4, 2023, 10:02pm

Hello,
I have some questions about Bayesian search and crashes. When running my wandb sweeps, some of my runs occasionally crash for “external” reasons (e.g. someone restarts the server I’m running on, problems with accidentally detaching from shared memory, etc.). Since my runs take pretty long, I usually do not have the time to re-run the entire sweep, so I would like to continue it anyway. Since I am using Bayesian search as the method, this leads me to the question of how the search “deals” with the crashed runs when I continue the sweep, especially if I don’t delete them right away:

Does the search take a crashed run into account by interpreting its hyperparameter configuration as “very bad”, in the sense that a run with this configuration did not even finish, so it cannot be a good configuration, or does it just ignore the run? And in the second case, could this lead to the algorithm “choosing” the same hyperparameter configuration that crashed again later for another run?
For this behaviour, does it make a difference if I delete the run right away or leave it in the sweep while the agents start other new runs?
Is there a difference here for “crashed” vs. “failed” runs? And what is the difference between those in general?
Thank you in advance to anyone who has an answer for me!

mohammadbakir · June 7, 2023, 11:19pm

Hi @danielle-schuman,

When a Bayesian sweep run executes to completion, we always use the “best” metric value of the last finished run for the next run. The user can change this default behavior by setting an impute strategy:

“worst”: use the worst value of the run’s target metric,
“best”: use the best value of the run’s target metric,
“latest”: use the latest value of the run’s target metric.
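To make the three strategies concrete, here is a minimal sketch of what each one would pick from the values a run logged for its target metric. It is an illustration only, not W&B's implementation; the function name `impute_metric` and its arguments are hypothetical.

```python
from typing import Sequence


def impute_metric(values: Sequence[float], goal: str, strategy: str) -> float:
    """Pick one representative value from a run's logged metric history.

    Hypothetical helper for illustration; not part of the wandb API.
    `goal` is "minimize" or "maximize"; `strategy` is "worst", "best" or "latest".
    """
    if not values:
        raise ValueError("the run logged no values for the target metric")
    if goal not in ("minimize", "maximize"):
        raise ValueError(f"unknown goal: {goal!r}")
    if strategy == "latest":
        return values[-1]
    if strategy not in ("best", "worst"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    # For a minimized metric the best value is the smallest one and the worst
    # is the largest; for a maximized metric it is the other way round.
    pick_min = (strategy == "best") == (goal == "minimize")
    return min(values) if pick_min else max(values)


losses = [0.9, 0.4, 0.6]
print(impute_metric(losses, "minimize", "worst"))   # 0.9
print(impute_metric(losses, "minimize", "best"))    # 0.4
print(impute_metric(losses, "minimize", "latest"))  # 0.6
```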

Now, in the event that a run fails, crashes, or is killed, wandb defaults to the “worst” metric value unless the user has set an impute strategy.
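A sweep configuration that sets the strategy explicitly might look like the sketch below. The placement of `impute` under `metric` is an assumption based on the sweep configuration schema, so it is worth checking against the current W&B docs. The metric name `val_loss`, the parameter `learning_rate`, and the project name are placeholders.

```python
# Sketch of a Bayesian sweep config with an explicit impute strategy.
# Assumption: `impute` is a key of the `metric` section.
sweep_config = {
    "method": "bayes",
    "metric": {
        "name": "val_loss",   # placeholder metric name
        "goal": "minimize",
        "impute": "worst",    # one of "worst", "best", "latest"
    },
    "parameters": {
        # placeholder hyperparameter
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-1,
        },
    },
}

# Registering the sweep needs network access and a logged-in wandb client:
# import wandb
# sweep_id = wandb.sweep(sweep_config, project="my-project")
```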

As for the difference between crashed and failed runs:

A failed run is one that ended with a non-zero exit code (Ctrl-C, an uncaught exception, or sys.exit(1)).
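As a quick illustration of what “non-zero exit” means for a Python training script, the sketch below runs three tiny snippets in child processes and prints their exit codes. The helper name `exit_code` is made up for this example and has nothing to do with the wandb API.

```python
import subprocess
import sys


def exit_code(snippet: str) -> int:
    """Run a Python snippet in a child process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,  # hide the traceback of the deliberate exception
    )
    return result.returncode


print(exit_code("pass"))                         # 0: clean exit
print(exit_code("raise RuntimeError('boom')"))   # 1: uncaught exception
print(exit_code("import sys; sys.exit(1)"))      # 1: explicit non-zero exit
```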

Scenario: W&B kills the agent and does not run any further sweep experiments.
The sweep agent is designed to shut down if (by default) there are 3 failures in the first 60 seconds, or if the first 5 runs started are failures.
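If these defaults are too strict for a setup where early failures are expected, the agent can be relaxed through environment variables. The sketch below assumes the variable names `WANDB_AGENT_MAX_INITIAL_FAILURES` (for the “first 5 runs” check) and `WANDB_AGENT_DISABLE_FLAPPING` (for the “3 failures in 60 seconds” check), as printed in the agent's own shutdown messages; please verify them against your wandb version. The helper `agent_env` is hypothetical.

```python
import os
from typing import Dict, Optional


def agent_env(
    max_initial_failures: Optional[int] = None,
    disable_flapping: bool = False,
) -> Dict[str, str]:
    """Return a copy of the current environment with agent failure settings."""
    env = dict(os.environ)
    if max_initial_failures is not None:
        # Assumed to raise the limit on consecutive failures among the first runs.
        env["WANDB_AGENT_MAX_INITIAL_FAILURES"] = str(max_initial_failures)
    if disable_flapping:
        # Assumed to turn off the "3 failures in the first 60 seconds" check.
        env["WANDB_AGENT_DISABLE_FLAPPING"] = "true"
    return env


env = agent_env(max_initial_failures=20, disable_flapping=True)

# Starting the agent with this environment (sweep path is a placeholder):
# import subprocess
# subprocess.run(["wandb", "agent", "<entity>/<project>/<sweep_id>"], env=env)
```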

A crashed run is one for which the wandb backend never received a successful exit indication after 5-10 minutes without a heartbeat.

danielle-schuman · June 8, 2023, 12:23pm

Hi @mohammadbakir,
Thank you for your answer, that was very helpful!

So the default is “worst” if I don’t actively set an impute strategy and don’t delete the crashed or failed run, and I suppose that if I do delete the run, the sweep will just continue from then on as if a run with those hyperparameters had never existed, right?

mohammadbakir · June 8, 2023, 7:51pm

Hi @danielle-schuman, I have updated my initial response to correct an error.

This is correct: the default is “worst” unless an impute strategy is specified by the user. If you pause a sweep and delete one of its runs, we will not count that set of hyperparameters as having been covered by the sweep.
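For anyone who wants to script the clean-up, a minimal sketch follows. It filters run objects by their `state` and optionally deletes them; the helper `delete_unfinished_runs` is hypothetical and only relies on the `id`, `state` and `delete()` members of the run objects returned by the wandb public API. The sweep path in the usage comment is a placeholder, and the sweep is assumed to have been paused first.

```python
from typing import Iterable, List, Sequence


def delete_unfinished_runs(
    runs: Iterable,
    states: Sequence[str] = ("crashed", "failed"),
    dry_run: bool = True,
) -> List[str]:
    """Return the ids of runs whose state is in `states`.

    With dry_run=False the matching runs are also deleted.
    """
    matched = []
    for run in runs:
        if run.state in states:
            matched.append(run.id)
            if not dry_run:
                run.delete()
    return matched


# Usage (needs network access and an API key):
# import wandb
# sweep = wandb.Api().sweep("<entity>/<project>/<sweep_id>")
# print(delete_unfinished_runs(sweep.runs, dry_run=True))
```

Running it with dry_run=True first shows which runs would be removed before anything is actually deleted.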

danielle-schuman · June 9, 2023, 8:17am

Alright, thank you very much! Then everything is clear now as far as I’m concerned.

system · August 8, 2023, 8:17am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
