Early_terminate param for sweep configuration does not seem to work at all

yanyiphei · March 13, 2024, 4:53pm

The early_terminate param for the sweep configuration does not work at all in my testing. Since the documentation is atrocious, I read the source code implementation (sweeps/src/sweeps/hyperband_stopping.py at 6e25b8bcf2adada723538e675c6de34f5faebb22 · wandb/sweeps · GitHub) and then tested early_terminate myself. Here is a report of the test using a dummy sweep: early_terminate bug (I wanted to share the sweep directly, but I don't think there's a way to do that, so I just pasted the sweep YAML config and all available charts).

Now, referencing both the report and the source code, you can see that the first bracket is 4. Therefore, the threshold (see the source code for what this is) should be 4.049. Because the history of visionary-sweep-2 has a min loss value of 4.504 within the first bracket, which is above the threshold, it should be early terminated at the first bracket per the implementation. However, it is not. In fact, it should be early terminated at every successive bracket, but it continues to run until completion.
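To make the expected behavior concrete, here is a minimal sketch of the check described above. It assumes that a run minimizing its metric is stopped when its best value within the bracket is still worse than the threshold. The function `should_stop` and the sample history are hypothetical; only the bracket (4), the threshold (4.049), and the minimum loss (4.504) come from the post.

```python
def should_stop(history, bracket, threshold):
    """Return True if a run should be stopped at this bracket.

    Assumption: the metric is minimized, the run is judged by its best
    (lowest) value over its first `bracket` logged steps, and it is
    stopped when that value is above the threshold.
    """
    if len(history) < bracket:
        return False  # the run has not reached this bracket yet
    return min(history[:bracket]) > threshold


# Hypothetical loss history; only its minimum (4.504) is taken from the post.
losses = [5.1, 4.9, 4.504, 4.7, 4.6]
print(should_stop(losses, bracket=4, threshold=4.049))
```

Under these assumptions the run qualifies for termination at the first bracket, which matches the expectation stated in the post.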

So either there is a bug, or something not covered in the code or documentation is causing this bizarre behavior (e.g. maybe the early terminate check runs asynchronously or at some very slow polling rate).

yanyiphei · March 13, 2024, 5:29pm

(this was cross-posted in [CLI]: `early_terminate` param for sweep configuration does not work at all · Issue #7156 · wandb/wandb · GitHub too)

yanyiphei · March 14, 2024, 8:17pm

I just did some debugging directly in the source code and found that all my sweep runs have empty history (i.e. run.history = []), not just for the target goal metric but for all metrics logged. So this explains why early termination is not working at all. Why is no history saved for my sweep runs? I set up my sweep configuration by simply following your tutorials.

yanyiphei · March 14, 2024, 9:30pm

After some more digging, I think I found the bug. The early termination function hyperband_stop_runs expects the metric history to be stored in the .history attribute of the run, but when the controller queries for the runs (which are eventually passed to the early termination function), it runs the following GQL query:

            """
        query SweepWithRuns($entity: String, $project: String, $sweep: String!, $specs: [JSONString!]!) {
            project(name: $project, entityName: $entity) {
                sweep(sweepName: $sweep) {
                    id
                    name
                    method
                    state
                    description
                    config
                    createdAt
                    heartbeatAt
                    updatedAt
                    earlyStopJobRunning
                    bestLoss
                    controller
                    scheduler
                    runs {
                        edges {
                            node {
                                name
                                state
                                config
                                exitcode
                                heartbeatAt
                                shouldStop
                                failed
                                stopped
                                running
                                summaryMetrics
                                sampledHistory(specs: $specs)
                            }
                        }
                    }
                }
            }
        }
        """

So it doesn't even query for the history, only for sampledHistory, so history will always be empty.

yanyiphei · March 14, 2024, 9:44pm

Oh, but I see that you are doing this: sweeps/src/sweeps/run.py at 6e25b8bcf2adada723538e675c6de34f5faebb22 · wandb/sweeps · GitHub, thus loading history from the sampled history. I don't know what's going on anymore.

yanyiphei · March 14, 2024, 10:14pm

OK, I think I found the bug now. When you call _sweep_object_read_from_backend, you set self._sweep_metric after you run the query (wandb/wandb/wandb_controller.py at main · wandb/wandb · GitHub), but the code expects self._sweep_metric to be set before the query in order to create the specs for the sampled history (wandb/wandb/wandb_controller.py at main · wandb/wandb · GitHub).

yanyiphei · March 14, 2024, 10:32pm

Oh, and another issue. The current pip version of sweeps has the following incorrect implementation: h[band] treats the band values as 1-indexed (sweeps/src/sweeps/hyperband_stopping.py at 95d3ccad7ef6ce7c7af77b5bf7efd933b50b9336 · wandb/sweeps · GitHub), but the latest commit on main has the correct implementation: h[band-1] treats the band values as 0-indexed (sweeps/src/sweeps/hyperband_stopping.py at master · wandb/sweeps · GitHub). This also explains why my specific example above didn't trigger early termination.
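The difference between the two lookups can be illustrated with a small sketch. It is not the library code: both helper functions, the length guards, and the sample history are hypothetical and exist only to show how `h[band]` and `h[band-1]` read different positions of the same history.

```python
def value_with_band(history, band):
    """Lookup in the style attributed to the pip release: h[band]."""
    # Guard added for the sketch so a short history does not raise IndexError.
    return history[band] if len(history) > band else None


def value_with_band_minus_one(history, band):
    """Lookup in the style attributed to the latest commit: h[band - 1]."""
    return history[band - 1] if len(history) >= band else None


# Hypothetical run that has logged exactly four steps.
history = [5.1, 4.9, 4.504, 4.7]
band = 4

print(value_with_band(history, band))            # nothing available yet
print(value_with_band_minus_one(history, band))  # value logged at the 4th step
```

With `h[band]`, a run that has logged exactly `band` values has no entry to compare at that bracket, so the comparison is shifted by one step relative to `h[band-1]`.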

yanyiphei · March 14, 2024, 10:39pm

I applied all these changes locally and everything works, including the specific example in the description. Please fix this ASAP because my ML workflow is heavily reliant on early termination working.

thanos-wandb · March 15, 2024, 9:50am

Hi @yanyiphei that’s amazing rubber duck debugging - thank you very much. Just to confirm, moving this line to 432 resolved the issue for you with early sweeps termination?

yanyiphei · March 15, 2024, 5:22pm

That won't work because that is still dependent on this line: wandb/wandb/wandb_controller.py at main · wandb/wandb · GitHub. You need to find a way to load the sweep config differently.

yanyiphei · March 15, 2024, 5:23pm

I resolved my issue simply by hardcoding my sweep metric. I don't know what the best generic solution to the problem is.
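The ordering problem and the hardcoding workaround can be sketched as follows. This is a toy model, not the wandb controller: `build_specs`, `read_sweep`, the `known_metric` parameter, and the shape of the spec are all assumptions made for illustration. It only shows that a spec built before the metric name is known ends up empty, while supplying the name up front yields a usable spec.

```python
import json


def build_specs(metric_name):
    """Build the sampled-history spec; empty if no metric name is known."""
    if not metric_name:
        return json.dumps({})
    return json.dumps({"keys": ["_step", metric_name]})


def read_sweep(backend_config, known_metric=None):
    """Toy read: the spec is built first, the metric is learned afterwards."""
    specs = build_specs(known_metric)          # built before the "query"
    metric = backend_config["metric"]["name"]  # only known from the result
    return specs, metric


# Hypothetical sweep config as a backend might return it.
config = {"metric": {"name": "loss", "goal": "minimize"}}

print(read_sweep(config))                       # spec is empty
print(read_sweep(config, known_metric="loss"))  # spec names the metric
```

Passing `known_metric` plays the role of the hardcoded metric in the workaround above; a generic fix would need some other way to learn the metric name before building the spec.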

thanos-wandb · March 18, 2024, 10:49am

Hi @yanyiphei thank you for the additional information. I have logged this bug with our Sweeps team, and we will keep you updated on its progress here.

yanyiphei · March 18, 2024, 6:48pm

@thanos-wandb can you confirm that it's indeed a bug?

thanos-wandb · March 19, 2024, 1:07pm

@yanyiphei thank you for following up on this. Can you please also try setting method: bayes and strict: true? Does the early_terminate algorithm still not work for you with those settings?

yanyiphei · March 19, 2024, 5:15pm

But I don't want to use the bayes method. I want to use the random and grid methods.

thanos-wandb · March 20, 2024, 4:00pm

Hi @yanyiphei I am investigating this further and checking with our engineers on the Sweeps team. I will keep you updated.

yanyiphei · March 26, 2024, 7:36pm

@thanos-wandb any updates?

thanos-wandb · March 26, 2024, 11:05pm

Hi @yanyiphei thank you for following up on this. We have identified a bug and our engineers have now deployed a fix. Can you please run your code again and let us know if the early_terminate algorithm now works for you?

yanyiphei · March 27, 2024, 9:26pm

OK, it works now. But it doesn't mark early terminated runs as “Stopped”. See the screenshot below of an early terminated run.

yanyiphei · March 27, 2024, 10:47pm

@thanos-wandb actually, early terminate doesn’t work at all for a different sweep configuration that I tried. In this config, all the runs had a runtime of < 1 min. Could this be a problem? Does early terminate run asynchronously, perhaps with some latency? Nothing is documented, so it’s hard to understand what’s a bug and what’s not.
