Why is Hyperband not working on this example project?

Hello,
I’m just starting to experiment with WandB, and I’m most interested in its hyperparameter optimization features.
I’m trying out their HyperBand algorithm with this example repo.

I reduced the number of epochs to 3 so each full run takes approximately 1 minute.

I’ve tried the configuration given in the repo, with the hyperband parameters being:

early_terminate:
  type: hyperband
  s: 2
  eta: 2
  max_iter: 8

Following the documentation, I’m expecting the algorithm to check whether a run should be stopped at steps [8/2^2, 8/2] = [2, 4].
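
For reference, here is how I computed those bracket steps; a quick Python sketch based on my reading of the bracket formula (max_iter / eta^k) in the docs:

# My understanding of the docs: Hyperband evaluates runs for early
# stopping at steps max_iter / eta**k for k = s down to 1.
max_iter, eta, s = 8, 2, 2
brackets = [max_iter // eta**k for k in range(s, 0, -1)]
print(brackets)  # -> [2, 4]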

However, every run continues until the last epoch. I noticed two exceptions.

  • Keyboard interrupts: 2 of the runs were killed because of a keyboard interrupt that I definitely didn’t send. Is this how Hyperband stops runs?
  • One of the runs was stopped, i.e. the log says:
2024-04-15 22:42:02,644 - wandb.wandb_agent - INFO - Agent received command: stop
2024-04-15 22:42:02,645 - wandb.wandb_agent - INFO - Stop: csxr2e0f
2024-04-15 22:42:07,651 - wandb.wandb_agent - INFO - Cleaning up finished run: csxr2e0f
2024-04-15 22:42:07,847 - wandb.wandb_agent - INFO - Agent received command: run
...etc

It only happened once. Is this how Hyperband stops the runs? And why is the run still showing as running in the WandB UI? Is that a bug?

So, to sum up, my questions are:

  • How is Hyperband supposed to behave in this example use case?
  • How are runs stopped by Hyperband supposed to show up in the logs? What are the expected log messages I should look for?
  • What do my hyperparameters mean in my context, i.e. at which step/epoch should the runs be stopped, and how many runs? Because the results I see do not correlate with what the docs or the paper say…

I tried to share my dashboard, but I couldn’t find any way to do so. Please let me know how to share my project. FYI, my account is a student one, so I’m on the free usage plan.

Thanks a lot

OK, my understanding is that I was not running enough runs to see the effect of Hyperband. After trying a longer experiment with more runs (in parallel too), I could start to see the effects.
I have one question remaining though:

  • What are the states of the runs that are stopped by Hyperband? Are they killed/crashed/failed? I see so many of my runs in these different states at the bucket steps, so I’m a bit confused.

Hi @aa9380, a run in W&B (Weights & Biases) can take on several different states, which indicate the status of the run at any given time. The states are:

  • finished: The script has ended and fully synced data, or wandb.finish() was called.
  • failed: The script ended with a non-zero exit status, indicating that an error occurred.
  • crashed: The script stopped sending heartbeats to the W&B service, which can happen if the machine crashes or the process is otherwise unexpectedly terminated.
  • running: The script is still running and has recently sent a heartbeat, indicating that it is active.
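
If you want to check how your runs are distributed across these states programmatically, here is a minimal sketch using the public API (substitute your own entity/project path for the placeholder):

from collections import Counter
import wandb

api = wandb.Api()
runs = api.runs("my-entity/my-project")  # hypothetical entity/project path
print(Counter(run.state for run in runs))  # e.g. Counter({'finished': 36, 'crashed': 12, ...})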

References

Run Page | Weights & Biases Documentation
Frequently Asked Questions About Experiments
Runs | Weights & Biases Documentation

Thank you for the clarification (you didn’t mention the Killed state though?), but my question is: which of these states is expected when Hyperband early-terminates my runs?
I’m running my own experiments with 60+ runs, the maximum epoch is 40, and my hyperband parameters are:

early_terminate:
  eta: 2
  max_iter: 32
  s: 4
  type: hyperband

This should mean my buckets are at epochs [1, 3, 7, 15] (counting epochs from index 0).
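
Quick sketch of that conversion, assuming the same bracket formula as before: steps [2, 4, 8, 16] become epoch indices [1, 3, 7, 15] when epochs are counted from 0.

max_iter, eta, s = 32, 2, 4
steps = [max_iter // eta**k for k in range(s, 0, -1)]  # -> [2, 4, 8, 16]
print([st - 1 for st in steps])                        # -> [1, 3, 7, 15] (zero-indexed epochs)
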
Out of all my runs:

  • 36 ran to completion (epoch 40)
  • 12 Crashed (at epochs 10, 8, 7, 5, 3, and 7 of them at epoch 1), and the agent received a stop command (from the WandB logs)
  • 2 Failed (epochs 3 and 2), and the agent received a stop command (from the WandB logs)
  • 4 Killed (all 4 at epoch 1), and a KeyboardInterrupt was sent (definitely not by me)

All the terminated runs were stopped/crashed/whatever at epochs quite close to the bucket steps, so I’m guessing this was due to Hyperband, and the offsets might be due to network latency (my epochs can take 1 to 5 minutes depending on the hyperparameters chosen).

However, it’d be nice to have more readable logs for when a run is ended by Hyperband, for example a specific “Terminated” state, because it’s a bit confusing having all those different states at the moment. I’ll make a PR one day.

To improve the readability of logs and make it clearer when a run is terminated by Hyperband, you could consider adding custom logging or tracking within your experiment code. For example, you could log a specific message when a run is terminated early by Hyperband and differentiate it from other termination reasons.
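
A minimal sketch of that idea, assuming the stop arrives as a KeyboardInterrupt in your training process (as you observed; train_one_epoch and the project name are placeholders):

import wandb

def train_one_epoch(epoch):
    # placeholder for your actual training step
    return 1.0 / (epoch + 1)

run = wandb.init(project="hyperband-demo")  # hypothetical project name
try:
    for epoch in range(40):
        run.log({"val_loss": train_one_epoch(epoch), "epoch": epoch})
except KeyboardInterrupt:
    # tag the run so early-terminated runs are easy to filter in the UI later
    run.summary["early_terminated"] = True
    run.finish(exit_code=0)
else:
    run.finish()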

There is no explicit state defined today for when Hyperband early-terminates a run. A Stopped state does already exist in our DB and used to be visible in the UI. It should be restored, and possibly made visible by default when the user is using Hyperband. This will definitely help enhance the user experience and make it easier to interpret experiment results in the future.

To further distinguish between the Crashed, Failed, and Killed states:

  • The Killed state is seen less frequently than the other run states; it occurs when the run is stopped from the UI itself or, as in your case, by Hyperband.
  • When a process reaches a wandb.finish() call, we send a finalize signal to the server which indicates the run is over, and sets the status as Finished (zero exit code) or Failed (non-zero exit code).
  • When this finalize signal is not received and the heartbeat signal does not reach the server after a while, we assume that the process has stopped responding and mark the run as Crashed. In a nutshell: we never got a successful exit indication after 5-10 minutes without a heartbeat.
  • By default the heartbeat interval is 30 seconds.
  • As for why a run is marked as Failed, there are two ways that happens (see the sketch below):
    • either the run was finished with wandb.run.finish(exit_code=1), or
    • more likely… the script itself exited with a non-zero exit status, e.g. sys.exit(1).
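
A minimal sketch of both paths (train and the project name are placeholders):

import sys
import wandb

def train():
    # placeholder for your actual training code
    raise RuntimeError("something went wrong")

run = wandb.init(project="sweep-demo")  # hypothetical project name
try:
    train()
    run.finish()  # zero exit code -> run shows as Finished
except Exception:
    # either of these marks the run as Failed:
    run.finish(exit_code=1)  # explicit non-zero exit code
    sys.exit(1)              # non-zero process exit status

If you are able to share your sweep program, I can take a look.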