Hi, I have set up a sweep with hyperband early termination. For some reason, whenever the early termination is triggered, the run state is recorded as Failed instead of Finished, which I believe means the run won't be used by the Bayesian optimization. Below are the error logs and the relevant parts of my code, which include some custom early termination logic based on a threshold and a patience variable; that custom logic works correctly. Only the hyperband early termination fails, at epoch 16. I execute the run with python script_name.py
wandb, version 0.15.10
Any assistance would be greatly appreciated and please let me know if any other information would be helpful. Thank you!
import numpy as np  # needed for np.arange in the training loop
import wandb

wandb.login()

sweep_configuration = {
    'method': 'bayes',
    'name': 'sweep',
    'metric': {'goal': 'maximize', 'name': 'Best Val ROC'},
    'early_terminate': {'type': 'hyperband', 'min_iter': 17},
    'parameters': {
        'lr': {'distribution': 'log_uniform_values', 'min': 0.0001, 'max': 0.1},
        'weight_decay': {'distribution': 'log_uniform_values', 'min': 1e-5, 'max': 1e-1},
        'dropout': {'max': 0.7, 'min': 0.3},
        'heads': {'max': 8, 'min': 4},
        'num_conv': {'max': 3, 'min': 2},
        'num_lin': {'max': 2, 'min': 1},
        'num_neighbor_l1': {'max': 15, 'min': 10},
        'num_neighbor_l2': {'max': 5, 'min': 3},
        'hidden_channels': {'values': [32, 64, 128]},
        'output_channels': {'values': [16, 32, 64]},
        'decode_channels': {'values': [8, 16]},
        'aggr': {'values': ['mean', 'sum', 'mul']},
        'lmda': {'max': 0.99, 'min': 0.50},
    }
}

sweep_id = wandb.sweep(sweep=sweep_configuration, project='project-name')
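
For context on why epoch 16 stands out: as I understand the W&B docs, when only `min_iter` is given, hyperband checks runs at `min_iter * eta**k` logged steps, with `eta` defaulting to 3. Here is a minimal sketch of that schedule under those assumptions. The helper `hyperband_brackets` is my own illustration, not a wandb function:

```python
def hyperband_brackets(min_iter, max_steps, eta=3):
    """Steps at which runs would be checked, assuming brackets at min_iter * eta**k."""
    brackets = []
    step = min_iter
    while step <= max_steps:
        brackets.append(step)
        step *= eta
    return brackets


# With min_iter=17 and 140 epochs, only two checkpoints fall inside a run.
print(hyperband_brackets(min_iter=17, max_steps=140))  # [17, 51]
```

Since I log once per epoch starting from epoch 0, the 17th logged step would be epoch 16, which seems to match where the runs get stopped.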
def main():
    wandb.init(name='V3 local',
               project="project-name",
               notes='Some notes',
               entity="xxxxxx")
    epochs = 140
    batch_size = 32

    # Hyperparameters chosen by the sweep
    lr = wandb.config.lr
    dropout = wandb.config.dropout
    num_lin = wandb.config.num_lin
    num_conv = wandb.config.num_conv
    heads = wandb.config.heads
    decode_channels = wandb.config.decode_channels
    hidden_channels = wandb.config.hidden_channels
    output_channels = wandb.config.output_channels
    aggr = wandb.config.aggr
    nn_l1 = wandb.config.num_neighbor_l1
    nn_l2 = wandb.config.num_neighbor_l2
    lmda = wandb.config.lmda
    weight_decay = wandb.config.weight_decay

    # (model, optimizer, criterion, device, and data loader setup omitted)

    best_val_roc = 0
    patience = 15
    counter = 0  # epochs since the last improvement in validation ROC

    for epoch in np.arange(0, epochs):
        train_loss, train_roc = train(train_loader, model, optimizer, criterion, batch_size, device, epoch, lmda, batch_run=False)
        val_loss, val_roc = test(val_loader, model, criterion, batch_size, device, epoch, lmda, batch_run=False)

        if val_roc > best_val_roc:
            best_val_roc = val_roc
            counter = 0
        else:
            counter += 1

        wandb.log({"Epoch": epoch,
                   "Train Loss": train_loss,
                   "Train ROC": train_roc,
                   "Val Loss": val_loss,
                   "Val ROC": val_roc,
                   "Best Val ROC": best_val_roc})

        # Custom early stopping: no improvement for `patience` epochs,
        # or best validation ROC still at or below 0.55 at epoch 10
        if (counter >= patience) or (epoch == 10 and best_val_roc <= 0.55):
            print("Early stopping triggered.")
            wandb.log({"early_stopping": True})  # Log early stopping event
            wandb.finish()  # End the run here
            return  # Exiting the function


if __name__ == '__main__':
    wandb.agent(sweep_id, project="project-name", function=main, count=100)
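
To make clear what my custom stopping rule does independently of wandb, here is a minimal standalone sketch of it. The names `should_stop` and `simulate` are hypothetical helpers written for this post; they are not part of my actual script or of wandb:

```python
def should_stop(epoch, counter, best_val_roc,
                patience=15, check_epoch=10, min_roc=0.55):
    """Stop on exhausted patience, or on a weak best ROC at the check epoch."""
    return counter >= patience or (epoch == check_epoch and best_val_roc <= min_roc)


def simulate(val_rocs, patience=15):
    """Return the epoch at which the rule would stop, or None if it never does."""
    best_val_roc, counter = 0.0, 0
    for epoch, val_roc in enumerate(val_rocs):
        if val_roc > best_val_roc:
            best_val_roc, counter = val_roc, 0
        else:
            counter += 1  # no improvement this epoch
        if should_stop(epoch, counter, best_val_roc, patience=patience):
            return epoch
    return None


# A run stuck at 0.5 validation ROC is stopped by the threshold check at epoch 10.
print(simulate([0.5] * 20))
```

Runs stopped by this rule end up as Finished, so the problem appears to be specific to the stop triggered by hyperband.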
Log Errors:
2023-09-16 21:29:38,694 INFO Thread-10 (_run_job):1602087 [wandb_run.py:_on_init():2229] got version response
2023-09-16 21:29:38,694 INFO Thread-10 (_run_job):1602087 [wandb_init.py:init():799] starting run threads in backend
2023-09-16 21:29:43,236 INFO Thread-10 (_run_job):1602087 [wandb_run.py:_console_start():2199] atexit reg
2023-09-16 21:29:43,236 INFO Thread-10 (_run_job):1602087 [wandb_run.py:_redirect():2054] redirect: wrap_raw
2023-09-16 21:29:43,237 INFO Thread-10 (_run_job):1602087 [wandb_run.py:_redirect():2119] Wrapping output streams.
2023-09-16 21:29:43,237 INFO Thread-10 (_run_job):1602087 [wandb_run.py:_redirect():2144] Redirects installed.
2023-09-16 21:29:43,237 INFO Thread-10 (_run_job):1602087 [wandb_init.py:init():840] run started, returning control to user process
2023-09-17 01:50:03,405 INFO Thread-10 (_run_job):1602087 [wandb_run.py:_finish():1934] finishing run xxxxx/xxxxxxx/fsdfs50r
2023-09-17 01:50:03,405 INFO Thread-10 (_run_job):1602087 [wandb_run.py:_atexit_cleanup():2168] got exitcode: 1
2023-09-17 01:50:03,405 INFO Thread-10 (_run_job):1602087 [wandb_run.py:_restore():2151] restore
2023-09-17 01:50:03,405 INFO Thread-10 (_run_job):1602087 [wandb_run.py:_restore():2157] restore done
2023-09-17 01:50:07,085 INFO Thread-10 (_run_job):1602087 [wandb_run.py:_footer_history_summary_info():3557] rendering history
2023-09-17 01:50:07,086 INFO Thread-10 (_run_job):1602087 [wandb_run.py:_footer_history_summary_info():3589] rendering summary
2023-09-17 01:50:07,094 INFO Thread-10 (_run_job):1602087 [wandb_run.py:_footer_sync_info():3516] logging synced files
The initial failed run produced a slightly different log, shown below. All the later failed runs have only the log messages above. I also hadn't updated wandb yet when that first run failed:
2023-09-14 11:09:35,171 INFO Thread-3 (_run_job):1461321 [wandb_run.py:_redirect():1959] redirect: SettingsConsole.WRAP_RAW
2023-09-14 11:09:35,171 INFO Thread-3 (_run_job):1461321 [wandb_run.py:_redirect():2024] Wrapping output streams.
2023-09-14 11:09:35,171 INFO Thread-3 (_run_job):1461321 [wandb_run.py:_redirect():2046] Redirects installed.
2023-09-14 11:09:35,171 INFO Thread-3 (_run_job):1461321 [wandb_init.py:init():798] run started, returning control to user process
2023-09-14 14:06:20,706 ERROR Thread-2 (_heartbeat):1461321 [internal_api.py:execute():244] 502 response executing GraphQL.
2023-09-14 14:06:20,706 ERROR Thread-2 (_heartbeat):1461321 [internal_api.py:execute():245]
<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>502 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>
2023-09-14 14:10:45,331 ERROR Thread-2 (_heartbeat):1461321 [internal_api.py:execute():244] 502 response executing GraphQL.
2023-09-14 14:10:45,331 ERROR Thread-2 (_heartbeat):1461321 [internal_api.py:execute():245]
<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>502 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>
2023-09-14 15:14:04,762 INFO Thread-3 (_run_job):1461321 [wandb_run.py:_finish():1844] finishing run xxxxx/xxxxxxxx/df4532k4
2023-09-14 15:14:04,762 INFO Thread-3 (_run_job):1461321 [wandb_run.py:_atexit_cleanup():2070] got exitcode: 1
2023-09-14 15:14:04,762 INFO Thread-3 (_run_job):1461321 [wandb_run.py:_restore():2053] restore
2023-09-14 15:14:04,763 INFO Thread-3 (_run_job):1461321 [wandb_run.py:_restore():2059] restore done
2023-09-14 15:14:35,981 INFO Thread-3 (_run_job):1461321 [wandb_run.py:_footer_history_summary_info():3427] rendering history
2023-09-14 15:14:35,982 INFO Thread-3 (_run_job):1461321 [wandb_run.py:_footer_history_summary_info():3459] rendering summary
2023-09-14 15:14:35,985 INFO Thread-3 (_run_job):1461321 [wandb_run.py:_footer_sync_info():3383] logging synced files