The early_terminate param for the sweep configuration does not work at all in my testing. Since the documentation is atrocious, I read the source code implementation (sweeps/src/sweeps/hyperband_stopping.py at 6e25b8bcf2adada723538e675c6de34f5faebb22 · wandb/sweeps · GitHub) and then tested early_terminate for myself. Here is a report of the test using a dummy sweep: early_terminate bug (I wanted to share the sweep directly, but I don't think there's a way, so I just pasted the sweep YAML config and all available charts).
Now, referencing both the report and the source code, you can see that the first bracket is 4. Therefore, the threshold (see source code for what this is) should be 4.049. Now, because the history of visionary-sweep-2 has a min loss value of 4.504 within the first bracket, which is above the threshold, it should be early terminated at the first bracket per the implementation. However, it is not. In fact, it should be early terminated at every successive bracket, but it continues to run until completion.
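The check described above can be sketched roughly as follows. This is a simplified illustration based on the description in this post, not the library's actual code: `should_stop` and the example history are hypothetical, and only the values 4, 4.049, and 4.504 come from the report.

```python
def should_stop(history, band, threshold):
    """Return True if a run (minimizing its metric) should stop at this bracket."""
    if len(history) < band:
        return False  # the run has not reached this bracket yet
    best_so_far = min(history[:band])  # best value seen within the bracket
    return best_so_far > threshold     # worse than the threshold: stop


# Hypothetical loss history whose minimum within the first 4 steps is 4.504.
history = [5.2, 4.9, 4.7, 4.504, 4.45]
print(should_stop(history, band=4, threshold=4.049))
```

Under these assumptions the check returns True for the run in question, which is why I expected it to be stopped at the first bracket.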
So either there is a bug, or something not present in the code or documentation is causing bizarre behavior (e.g. maybe the early terminate check is run async or at some very slow polling rate).
I just did some debugging directly in the source code and found that all my sweep runs have empty history (i.e. run.history = []), not just for the target goal metric but for all metrics logged. So this explains why early termination is not working at all. Why is no history saved for my sweep runs? I set up my sweep configuration by simply following your tutorials.
After some more digging, I think I found the bug. The early termination function hyperband_stop_runs expects the metric history to be stored in the .history attribute of the run, but when the controller queries for the runs (which are eventually passed to the early termination function), it runs the following GQL query:
"""
query SweepWithRuns($entity: String, $project: String, $sweep: String!, $specs: [JSONString!]!) {
    project(name: $project, entityName: $entity) {
        sweep(sweepName: $sweep) {
            id
            name
            method
            state
            description
            config
            createdAt
            heartbeatAt
            updatedAt
            earlyStopJobRunning
            bestLoss
            controller
            scheduler
            runs {
                edges {
                    node {
                        name
                        state
                        config
                        exitcode
                        heartbeatAt
                        shouldStop
                        failed
                        stopped
                        running
                        summaryMetrics
                        sampledHistory(specs: $specs)
                    }
                }
            }
        }
    }
}
"""
So it doesn't even query for the history, only for sampledHistory, so history will always be empty.
Oh, but I see that you are doing this: sweeps/src/sweeps/run.py at 6e25b8bcf2adada723538e675c6de34f5faebb22 · wandb/sweeps · GitHub, thus loading history from sampled history. I don't know what's going on anymore.
OK, I think I found the bug now. When you call _sweep_object_read_from_backend, you set self._sweep_metric after you run the query (wandb/wandb/wandb_controller.py at main · wandb/wandb · GitHub), but the code expects self._sweep_metric to be set before the query in order to create the specs for the sampled history (wandb/wandb/wandb_controller.py at main · wandb/wandb · GitHub).
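To make the ordering problem concrete, here is a toy sketch. Everything in it (`FakeBackend`, `Controller`, the method names, and the data) is hypothetical and only mimics the pattern described above; it is not the actual wandb controller code.

```python
class FakeBackend:
    """Stand-in for the GQL backend: returns sampled history only for requested keys."""

    def __init__(self, config, rows):
        self.config = config
        self.rows = rows

    def query(self, specs):
        keys = [key for spec in specs for key in spec["keys"]]
        sampled = [{k: row[k] for k in keys} for row in self.rows] if keys else []
        return {"config": self.config, "sampledHistory": sampled}


class Controller:
    def __init__(self, backend):
        self.backend = backend
        self._sweep_metric = None

    def _specs(self):
        # The specs depend on the metric name; with no metric, nothing is requested.
        return [{"keys": [self._sweep_metric]}] if self._sweep_metric else []

    def read_metric_set_after_query(self):
        result = self.backend.query(self._specs())  # metric is still None here
        self._sweep_metric = result["config"]["metric"]["name"]  # set too late
        return result["sampledHistory"]

    def read_metric_set_before_query(self, metric_name):
        self._sweep_metric = metric_name  # metric known up front
        return self.backend.query(self._specs())["sampledHistory"]


backend = FakeBackend({"metric": {"name": "loss"}}, [{"loss": 4.9}, {"loss": 4.504}])
print(Controller(backend).read_metric_set_after_query())         # []
print(Controller(backend).read_metric_set_before_query("loss"))  # [{'loss': 4.9}, {'loss': 4.504}]
```

In this toy version the first call comes back with an empty history because the specs are built before the metric name is known, which matches the empty run.history I observed.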
Oh, and another issue. The current pip version of sweeps has the following incorrect implementation: h[band] treats the band values as 1-indexed (sweeps/src/sweeps/hyperband_stopping.py at 95d3ccad7ef6ce7c7af77b5bf7efd933b50b9336 · wandb/sweeps · GitHub), but the latest commit on main has the correct implementation: h[band-1] treats the band values as 0-indexed (sweeps/src/sweeps/hyperband_stopping.py at master · wandb/sweeps · GitHub). This also explains why my specific example above didn't trigger early termination.
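A small illustration of the off-by-one, assuming a band of 4 means "after four logged values". The history values and the helper `value_at_band` are hypothetical and only show the plain Python list indexing involved, not the library code.

```python
def value_at_band(history, band):
    """Return the value logged at the band-th step (band counts from 1)."""
    return history[band - 1]


history = [5.2, 4.9, 4.7, 4.504]  # hypothetical: exactly four logged values
band = 4

print(value_at_band(history, band))  # the fourth value, at list index 3

try:
    history[band]  # index 4 would be a fifth value, which does not exist yet
    fifth_value_exists = True
except IndexError:
    fifth_value_exists = False
print(fifth_value_exists)
```

With h[band], a run that has logged exactly four values has nothing at that index, so under this assumption the comparison at the first bracket looks one step later than intended.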
I applied all these changes locally and everything works, including the specific example in the description. Please fix this ASAP because my ML workflow is heavily reliant on early termination working.
Hi @yanyiphei that’s amazing rubber duck debugging - thank you very much. Just to confirm, moving this line to 432 resolved the issue for you with early sweeps termination?
That won't work because that is still dependent on this line: wandb/wandb/wandb_controller.py at main · wandb/wandb · GitHub. You need to find a way to load the sweep config differently.
I resolved my issue simply by hardcoding my sweep metric. I don't know what the best generic solution to the problem is.
Hi @yanyiphei thank you for the additional information. I have logged this bug with our Sweeps team, and we will keep you updated on its progress here.
@thanos-wandb can you confirm that it's indeed a bug?
@yanyiphei thank you for following up on this. Can you please also try setting method: bayes and strict: true? Does the early_terminate algorithm still not work for you with those settings?
But I don't want to use the bayes method. I want to use the random and grid methods.
Hi @yanyiphei I am investigating this further and checking with our engineers on the Sweeps team. I will keep you updated.
Hi @yanyiphei thank you for following up on this. We have identified a bug and our engineers have now deployed a fix. Can you please try running your code again and let us know if the early_terminate algorithm now works for you?
OK, it works now. But it doesn't mark early terminated runs as “Stopped”. See the screenshot below of an early terminated run.
@thanos-wandb actually, early terminate doesn't work at all for a different sweep configuration that I tried. In this config, all the runs had a runtime of < 1 min. Could this be a problem? Does early terminate run asynchronously, perhaps with some latency? Nothing is documented, so it's hard to understand what's a bug and what's not.