Early_terminate param for sweep configuration does not seem to work at all

@thanos-wandb any updates?

Hi @yanyiphei this bug should be fixed now, and the early terminate algorithm should be working. Can you please try adding a call like time.sleep() to your code and see if that fixes it for you?

Hi @yanyiphei the cron task that checks for runs to terminate early only runs once a minute, which should explain why it fails in your case. I hope this helps!

@thanos-wandb please add the 1-minute cron task to the documentation.

But also, I’ve tested it on runs longer than 1 minute and it still didn’t work.

@yanyiphei can you please also try enabling strict mode, as in the example below?

early_terminate:
  type: hyperband
  min_iter: x
  eta: y
  strict: true

Does it still fail to terminate any runs? Can you please share your sweep config and a link to your project?

@thanos-wandb I already tried with strict set to both False and True. How do you want me to share the project or sweep config?

Could you copy and paste the sweep config dictionary or the YAML file you’re using here? You could also share a link to your generated sweep so that we can investigate further. If you’re not comfortable sharing this in a public forum, please email us at support@wandb.com and refer to ticket #62628, and we can take it from there.

@thanos-wandb here’s the sweep link: Weights & Biases. You can find the config and all the runs there (the ones that should have been early terminated but weren’t).

Hi @yanyiphei thank you for sending the link. Would it be possible to change the method to bayes and see if that makes any difference? I want to rule out that this is related to using the random method. Regarding the documentation feedback you provided earlier, we have added an info box here: Sweep configuration options | Weights & Biases Documentation

Hyperband checks which W&B runs to end once every few minutes. The end run timestamp might differ from the specified brackets if your run or iteration are short.

@thanos-wandb here’s another one with method=grid: Weights & Biases. I won’t do bayes because it’s useless for me.

Hi @yanyiphei that’s fine, the early terminate algorithm should work with the other search methods too. The reaper cronjob only runs every minute. With a min_iter of 350, and runs that have only 500 steps (lasting about 2 minutes), only ~35 seconds of each run are viable for killing. Since your runs are short-lived, early terminate might not provide much benefit for your use case. The algorithm is mostly beneficial for long-running experiments.
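The ~35-second figure can be reproduced with a little arithmetic. This is only a rough sketch: it assumes every step takes the same amount of time, and the function name `viable_window_seconds` is made up for illustration, not part of the W&B API.

```python
def viable_window_seconds(total_steps: int, min_iter: int, run_seconds: float) -> float:
    """Seconds left in a run after it passes min_iter (assumes uniform step time)."""
    if not 0 <= min_iter <= total_steps:
        raise ValueError("min_iter must be between 0 and total_steps")
    seconds_per_step = run_seconds / total_steps
    return (total_steps - min_iter) * seconds_per_step


# Numbers from the post above: 500 steps, about 2 minutes, min_iter of 350.
window = viable_window_seconds(total_steps=500, min_iter=350, run_seconds=120)
print(f"about {window:.0f} s of the run remain after min_iter")
```

That window is shorter than the one-minute check interval, so a single check can easily miss it.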

To improve the chances of runs being terminated early, can you please change your sweep config as follows to increase the number of bands:

early_terminate:
  eta: 1.5
  min_iter: 150
  strict: true
  type: hyperband
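To see how min_iter and eta affect the number of bands, here is a minimal sketch. It assumes the brackets fall at min_iter, min_iter * eta, min_iter * eta^2, and so on, which is how the W&B sweep docs describe the min_iter option. The helper `hyperband_brackets` is hypothetical, and the eta of 3 used for the original config is an assumption (it is the documented default, but the thread does not state which value was used).

```python
def hyperband_brackets(min_iter: float, eta: float, total_steps: int) -> list[float]:
    """Bracket iterations min_iter * eta**k that a run of total_steps can reach."""
    if min_iter <= 0 or eta <= 1:
        raise ValueError("need min_iter > 0 and eta > 1")
    brackets = []
    bracket = float(min_iter)
    while bracket <= total_steps:
        brackets.append(bracket)
        bracket *= eta
    return brackets


# Original setup from the thread: min_iter=350 on 500-step runs (eta=3 is assumed).
original = hyperband_brackets(min_iter=350, eta=3, total_steps=500)
# Suggested setup: min_iter=150, eta=1.5 on the same 500-step runs.
suggested = hyperband_brackets(min_iter=150, eta=1.5, total_steps=500)
print(len(original), "bracket(s):", original)
print(len(suggested), "bracket(s):", suggested)
```

Under these assumptions the original setup reaches one bracket and the suggested one reaches three. How W&B rounds a fractional bracket such as 337.5 is not covered by this sketch.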

@thanos-wandb I also have much longer runs (1 hour long) and early terminate still didn’t work. Those runs are more sensitive, so I can’t share them publicly here.

Also, I don’t have the resources to debug this for you. I’ve given you enough context that you can debug this on your end. Let me know if you need anything else.

Hi @yanyiphei would it be possible to send a link to those longer runs via email? That way you won’t share any information publicly, and it will help us troubleshoot further. We have tested Sweeps after fixing the bug you reported here, and the early terminate algorithm is working now. Please see the screenshot of a Sweep that was run after the fix.

Therefore, the issue you’re now seeing should relate either to the short duration of the runs, or the min_iter/eta configurations, since you have specified only one band in your current setup.

@thanos-wandb OK, I can share those via email. Does the project have to be public to share it?

@yanyiphei sounds good, feel free to send a link to the project to support@wandb.com and refer to ticket #62628. The project does not have to be public, so there is no need to change its privacy settings.

@thanos-wandb just shared via email! Also I appreciate the continued support!

Hi @yanyiphei thanks for sharing the project you were referring to. I have looked into it, and it seems you are setting min_iter to a very high value, which allows at most one band for the early terminate check. We recommend having at least 2 bands. Also, it seems you’re running many individual sweeps in the same project, each with a single run. The early terminate algorithm can’t check the conditions across different sweep IDs.

@thanos-wandb

“I have looked into it, and it seems you are setting min_iter to a very high value, which allows at most one band for the early terminate check. We recommend having at least 2 bands.”

Why? The runs I gave you were very long, so there was definitely more than 1 minute between min_iter and the run termination. They had around 12 minutes between min_iter and run termination, so that is plenty, according to the information you provided.

“Also, it seems you’re running many individual sweeps in the same project, each with a single run. The early terminate algorithm can’t check the conditions across different sweep IDs.”

That’s untrue. I gave you a sweep with 72 runs in it. Every sweep in that project has more than 1 run. Please check carefully.

Hi @yanyiphei thank you for the clarification. The sweep with the 72 runs has been set to min_iter=2400, and all the runs terminate at 3000 steps. Would it be possible to run the same sweep again with the following changes to your sweep config?

early_terminate:
  eta: 1.5
  min_iter: 1200
  strict: true
  type: hyperband
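A quick way to check whether a config leaves room for a second band: assuming brackets sit at min_iter * eta^k (an assumption based on the W&B sweep docs, not confirmed in this thread), the second bracket is at min_iter * eta, so it only fits inside the run if eta <= total_steps / min_iter. The helper name below is hypothetical.

```python
def max_eta_for_second_bracket(min_iter: int, total_steps: int) -> float:
    """Largest eta for which min_iter * eta still lands within total_steps."""
    if min_iter <= 0 or total_steps < min_iter:
        raise ValueError("need 0 < min_iter <= total_steps")
    return total_steps / min_iter


# Sweep with 72 runs: min_iter=2400, runs end at 3000 steps.
current_limit = max_eta_for_second_bracket(min_iter=2400, total_steps=3000)
# Proposed change: min_iter=1200 on the same 3000-step runs.
proposed_limit = max_eta_for_second_bracket(min_iter=1200, total_steps=3000)
print(current_limit, proposed_limit)
```

With min_iter=2400, any eta above 1.25 leaves a single bracket. With min_iter=1200, the proposed eta of 1.5 is well under the limit of 2.5, so more than one bracket fits.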

@thanos-wandb sorry, but currently I don’t have the resources to run these big sweeps to debug for you. I’ve provided plenty of evidence that early_terminate does not work (perhaps it works sometimes under very narrow and undocumented constraints, but clearly it doesn’t work for all possible sweep configs). If you need further evidence, I’m happy to provide it. But I can’t personally debug this for you.