Early_terminate param for sweep configuration does not seem to work at all

The early_terminate param for the sweep configuration does not work at all in my testing. The documentation is atrocious, so I read the source code (sweeps/src/sweeps/hyperband_stopping.py at 6e25b8bcf2adada723538e675c6de34f5faebb22 · wandb/sweeps · GitHub) and then tested early_terminate myself. Here is a report of the test using a dummy sweep: early_terminate bug (I wanted to share the sweep directly, but I don't think there's a way to do that, so I just pasted the sweep YAML config and all available charts).

Referencing both the report and the source code, you can see that the first bracket is 4. Therefore, the threshold (see the source code for what this is) should be 4.049. Because the history of visionary-sweep-2 has a minimum loss of 4.504 within the first bracket, which is above the threshold, it should be early terminated at the first bracket per the implementation. However, it is not. In fact, it should be early terminated at every successive bracket, but it continues to run until completion.
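
To make the expected behavior concrete, here is a minimal sketch of a Hyperband-style stopping check for a minimized metric. It is not the wandb/sweeps implementation: hyperband_bands and should_stop are hypothetical helpers, eta=3 is an assumed multiplier, and the loss history is made up. Only the first bracket (4), the threshold (4.049) and the minimum loss (4.504) come from the report.

```python
def hyperband_bands(min_iter, eta, count):
    """Return bracket positions min_iter, min_iter*eta, min_iter*eta**2, ..."""
    return [min_iter * eta**k for k in range(count)]


def should_stop(history, bands, thresholds):
    """Return True if the run's best (minimum) loss inside the latest
    bracket it has reached is worse than that bracket's threshold."""
    reached = [(b, t) for b, t in zip(bands, thresholds) if len(history) >= b]
    if not reached:
        return False  # the run has not reached the first bracket yet
    band, threshold = reached[-1]
    return min(history[:band]) > threshold


bands = hyperband_bands(min_iter=4, eta=3, count=3)  # [4, 12, 36]
thresholds = [4.049, 4.049, 4.049]  # hypothetical: one threshold reused for every bracket

# Made-up loss history whose minimum over the first 4 steps is 4.504.
history = [5.1, 4.9, 4.7, 4.504]
print(should_stop(history, bands, thresholds))  # True, because 4.504 > 4.049
```

Under these assumptions the run is flagged as soon as it reaches the first bracket, which is the behavior I expected from the sweep.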

So either there is a bug, or something not present in the code or documentation is causing this behavior (e.g. maybe the early terminate check runs asynchronously or at some very slow polling rate).

(This was also cross-posted as [CLI]: `early_terminate` param for sweep configuration does not work at all · Issue #7156 · wandb/wandb · GitHub.)

I just did some debugging directly in the source code and found that all my sweep runs have an empty history (i.e. run.history = []), not just for the target goal metric but for all logged metrics. That explains why early termination is not working at all. Why is no history saved for my sweep runs? I set up my sweep configuration by simply following your tutorials.

After some more digging, I think I found the bug. The early termination function hyperband_stop_runs expects the metric history to be stored in the .history attribute of the run, but when the controller queries for the runs (which are eventually passed to the early termination function), it runs the following GQL query:

        """
        query SweepWithRuns($entity: String, $project: String, $sweep: String!, $specs: [JSONString!]!) {
            project(name: $project, entityName: $entity) {
                sweep(sweepName: $sweep) {
                    id
                    name
                    method
                    state
                    description
                    config
                    createdAt
                    heartbeatAt
                    updatedAt
                    earlyStopJobRunning
                    bestLoss
                    controller
                    scheduler
                    runs {
                        edges {
                            node {
                                name
                                state
                                config
                                exitcode
                                heartbeatAt
                                shouldStop
                                failed
                                stopped
                                running
                                summaryMetrics
                                sampledHistory(specs: $specs)
                            }
                        }
                    }
                }
            }
        }
        """

So it doesn't even query for history, only for sampledHistory, which means history will always be empty.

Oh, but I see that you are doing this: sweeps/src/sweeps/run.py at 6e25b8bcf2adada723538e675c6de34f5faebb22 · wandb/sweeps · GitHub, which loads history from the sampled history. I don't know what's going on anymore.

OK, I think I found the bug now. When you call _sweep_object_read_from_backend, you set self._sweep_metric after you run the query (wandb/wandb/wandb_controller.py at main · wandb/wandb · GitHub), but the code expects self._sweep_metric to be set before the query in order to create the specs for the sampled history (wandb/wandb/wandb_controller.py at main · wandb/wandb · GitHub).
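
The ordering problem can be reproduced with a toy model. This is only a sketch of the pattern, not the wandb_controller.py code: ToyController, its methods, the spec format and the fake backend response are all hypothetical.

```python
class ToyController:
    """Hypothetical stand-in for a sweep controller; not the wandb class."""

    def __init__(self):
        self._sweep_metric = None  # only known once the sweep config is loaded

    def _history_specs(self):
        # Without a metric name there is nothing to request from sampledHistory.
        if self._sweep_metric is None:
            return []
        return [{"keys": ["_step", self._sweep_metric]}]

    def _query(self, specs):
        # Pretend backend: returns the sweep config plus whatever history was requested.
        history = [{"_step": 0, "loss": 5.1}] if specs else []
        return {"config": {"metric": {"name": "loss"}}, "history": history}

    def read_buggy(self):
        specs = self._history_specs()  # metric is still None, so specs are empty
        result = self._query(specs)
        self._sweep_metric = result["config"]["metric"]["name"]  # set too late
        return result["history"]

    def read_with_known_metric(self, metric_name):
        self._sweep_metric = metric_name  # metric is known before the query
        return self._query(self._history_specs())["history"]


print(ToyController().read_buggy())                    # []
print(ToyController().read_with_known_metric("loss"))  # [{'_step': 0, 'loss': 5.1}]
```

In the toy version, the history only comes back when the metric name is available before the specs are built.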

Oh, and another issue. The current pip version of sweeps has the following incorrect implementation: h[band] treats the band values as 1-indexed (sweeps/src/sweeps/hyperband_stopping.py at 95d3ccad7ef6ce7c7af77b5bf7efd933b50b9336 · wandb/sweeps · GitHub), but the latest commit on main has the correct implementation: h[band-1] treats the band values as 0-indexed (sweeps/src/sweeps/hyperband_stopping.py at master · wandb/sweeps · GitHub). This also explains why my specific example above didn't trigger early termination.
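
A small sketch of the off-by-one, assuming a band of 4 means "after the 4th logged value". The two helper functions and the loss values are made up for illustration and are not copied from hyperband_stopping.py.

```python
history = [5.1, 4.9, 4.7, 4.504]  # made-up losses for the first 4 logged steps
band = 4                          # bracket position, counted from 1


def value_at_band_direct(h, band):
    """Index with the band itself; needs len(h) > band, so it looks one step late."""
    return h[band] if len(h) > band else None


def value_at_band_shifted(h, band):
    """Index with band - 1, i.e. the value logged at the band-th step."""
    return h[band - 1] if len(h) >= band else None


print(value_at_band_direct(history, band))   # None: 4 logged steps are not enough
print(value_at_band_shifted(history, band))  # 4.504
```

With direct indexing, a run that has logged exactly 4 values is not evaluated at a bracket of 4, which would be consistent with what I saw.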

I applied all these changes locally and everything works, including the specific example in the description. Please fix this ASAP because my ML workflow relies heavily on early termination working.

Hi @yanyiphei that’s amazing rubber duck debugging, thank you very much. Just to confirm, did moving this line to 432 resolve the early termination issue with your sweeps?

That won't work because it still depends on this line: wandb/wandb/wandb_controller.py at main · wandb/wandb · GitHub. You need to find a way to load the sweep config differently.

I resolved my issue simply by hardcoding my sweep metric. I don't know what the best generic solution to the problem is.

Hi @yanyiphei thank you for the additional information. I have logged this bug with our Sweeps team, and we will keep you updated on its progress here.

@thanos-wandb can you confirm that it's indeed a bug?

@yanyiphei thank you for following up on this. Can you please also try setting method: bayes and strict: true? Does the early_terminate algorithm still not work for you with those settings?
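
For reference, a sweep configuration with those two settings might look like the sketch below. This is an assumption-laden example, not the configuration from the report: the metric name, the lr parameter range and min_iter are placeholders, and it assumes strict is set inside the early_terminate block.

```python
# Hypothetical sweep configuration; only method, strict and the
# early_terminate key itself come from this thread.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "loss", "goal": "minimize"},  # assumed metric
    "parameters": {
        "lr": {"min": 0.0001, "max": 0.1},  # placeholder search space
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 4,   # assumed first bracket
        "strict": True,  # assumed to belong under early_terminate
    },
}
```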

But I don't want to use the bayes method. I want to use the random and grid methods.

Hi @yanyiphei I am investigating this further and checking with our engineers on the Sweeps team. I will keep you updated.

@thanos-wandb updates?

Hi @yanyiphei thank you for following up on this. We have identified a bug and our engineers have now deployed a fix. Can you please try running your code again and let us know if the early_terminate algorithm now works for you?

OK, it works now. But it doesn't mark early terminated runs as “Stopped”. See the screenshot below of an early terminated run.

@thanos-wandb actually, early terminate doesn't work at all for a different sweep configuration that I tried. In this config, all the runs had a runtime of < 1 min. Could this be a problem? Does early terminate run asynchronously, perhaps with some latency? Nothing is documented, so it’s hard to tell what's a bug and what's not.