Early Terminate Failing with Exit Code 1

Hi, I have set up a sweep with hyperband early termination, and whenever the early termination is triggered the run state is recorded as Failed instead of Finished, which I believe means the run won’t be used in the Bayesian optimization. Below are the error logs and the relevant parts of my code. The code includes some custom early termination logic based on a threshold and a patience variable, and that logic works correctly. Only the hyperband early termination fails, at epoch 16. I execute the run with python script_name.py

wandb, version 0.15.10

Any assistance would be greatly appreciated and please let me know if any other information would be helpful. Thank you!

import numpy as np  # needed for np.arange in the training loop
import wandb

wandb.login()

sweep_configuration = {
    'method': 'bayes', 
    'name': 'sweep',
    'metric': {'goal': 'maximize', 'name': 'Best Val ROC'},
    'early_terminate': {'type': 'hyperband', 'min_iter':17}, 
    'parameters': {
        'lr': {'distribution': 'log_uniform_values', 'min': 0.0001, 'max': 0.1},
        'weight_decay': {'distribution': 'log_uniform_values', 'min': 1e-5, 'max': 1e-1},
        'dropout': {'max': 0.7, 'min': 0.3},
        'heads': {'max': 8, 'min': 4},
        'num_conv': {'max': 3, 'min': 2},
        'num_lin': {'max': 2, 'min': 1},
        'num_neighbor_l1': {'max': 15, 'min': 10},
        'num_neighbor_l2': {'max': 5, 'min': 3},
        'hidden_channels': {'values': [32, 64, 128]},
        'output_channels': {'values': [16, 32, 64]},
        'decode_channels': {'values': [8, 16]},
        'aggr': {'values': ['mean', 'sum', 'mul']}, 
        'lmda': {'max': 0.99, 'min': 0.50},
     }
}

sweep_id = wandb.sweep(sweep=sweep_configuration, project='project-name')

def main():

    wandb.init(name='V3 local', 
           project="project-name",
           notes='Some notes',
           entity="xxxxxx")
      
   
    epochs = 140
    lr  =  wandb.config.lr
    batch_size = 32
    dropout = wandb.config.dropout
    num_lin = wandb.config.num_lin
    num_conv = wandb.config.num_conv
    heads = wandb.config.heads
    decode_channels = wandb.config.decode_channels
    hidden_channels = wandb.config.hidden_channels
    output_channels = wandb.config.output_channels
    aggr = wandb.config.aggr
    nn_l1 = wandb.config.num_neighbor_l1
    nn_l2 = wandb.config.num_neighbor_l2
    lmda = wandb.config.lmda
    weight_decay = wandb.config.weight_decay

    best_val_roc = 0
    patience = 15
    counter = 0

    for epoch in np.arange(0, epochs):
          
        train_loss, train_roc = train(train_loader, model, optimizer, criterion, batch_size, device, epoch, lmda, batch_run=False) 
        val_loss, val_roc = test(val_loader, model, criterion, batch_size, device, epoch, lmda, batch_run=False) 

        if val_roc > best_val_roc:
            best_val_roc = val_roc
            counter = 0
        else:
            counter += 1

        wandb.log({"Epoch": epoch,
        "Train Loss":train_loss,        
        "Train ROC": train_roc,        
        "Val Loss": val_loss,        
        "Val ROC": val_roc,
        "Best Val ROC": best_val_roc})

        if (counter >= patience) or (epoch == 10 and best_val_roc <= 0.55):
            print("Early stopping triggered.")
            wandb.log({"early_stopping": True})  # Log early stopping event
            wandb.finish()  # End the run here
            return  # Exiting the function

if __name__ == '__main__':

    wandb.agent(sweep_id, project="project-name", function=main, count=100)

Log Errors:

2023-09-16 21:29:38,694 INFO    Thread-10 (_run_job):1602087 [wandb_run.py:_on_init():2229] got version response 
2023-09-16 21:29:38,694 INFO    Thread-10 (_run_job):1602087 [wandb_init.py:init():799] starting run threads in backend
2023-09-16 21:29:43,236 INFO    Thread-10 (_run_job):1602087 [wandb_run.py:_console_start():2199] atexit reg
2023-09-16 21:29:43,236 INFO    Thread-10 (_run_job):1602087 [wandb_run.py:_redirect():2054] redirect: wrap_raw
2023-09-16 21:29:43,237 INFO    Thread-10 (_run_job):1602087 [wandb_run.py:_redirect():2119] Wrapping output streams.
2023-09-16 21:29:43,237 INFO    Thread-10 (_run_job):1602087 [wandb_run.py:_redirect():2144] Redirects installed.
2023-09-16 21:29:43,237 INFO    Thread-10 (_run_job):1602087 [wandb_init.py:init():840] run started, returning control to user process
2023-09-17 01:50:03,405 INFO    Thread-10 (_run_job):1602087 [wandb_run.py:_finish():1934] finishing run xxxxx/xxxxxxx/fsdfs50r
2023-09-17 01:50:03,405 INFO    Thread-10 (_run_job):1602087 [wandb_run.py:_atexit_cleanup():2168] got exitcode: 1
2023-09-17 01:50:03,405 INFO    Thread-10 (_run_job):1602087 [wandb_run.py:_restore():2151] restore
2023-09-17 01:50:03,405 INFO    Thread-10 (_run_job):1602087 [wandb_run.py:_restore():2157] restore done
2023-09-17 01:50:07,085 INFO    Thread-10 (_run_job):1602087 [wandb_run.py:_footer_history_summary_info():3557] rendering history
2023-09-17 01:50:07,086 INFO    Thread-10 (_run_job):1602087 [wandb_run.py:_footer_history_summary_info():3589] rendering summary
2023-09-17 01:50:07,094 INFO    Thread-10 (_run_job):1602087 [wandb_run.py:_footer_sync_info():3516] logging synced files

A slightly different log error occurred with the initial failed run; the rest just have the log message above. I also hadn’t updated wandb at the time of that first run:

2023-09-14 11:09:35,171 INFO    Thread-3 (_run_job):1461321 [wandb_run.py:_redirect():1959] redirect: SettingsConsole.WRAP_RAW
2023-09-14 11:09:35,171 INFO    Thread-3 (_run_job):1461321 [wandb_run.py:_redirect():2024] Wrapping output streams.
2023-09-14 11:09:35,171 INFO    Thread-3 (_run_job):1461321 [wandb_run.py:_redirect():2046] Redirects installed.
2023-09-14 11:09:35,171 INFO    Thread-3 (_run_job):1461321 [wandb_init.py:init():798] run started, returning control to user process
2023-09-14 14:06:20,706 ERROR   Thread-2 (_heartbeat):1461321 [internal_api.py:execute():244] 502 response executing GraphQL.
2023-09-14 14:06:20,706 ERROR   Thread-2 (_heartbeat):1461321 [internal_api.py:execute():245] 
<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>502 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>

2023-09-14 14:10:45,331 ERROR   Thread-2 (_heartbeat):1461321 [internal_api.py:execute():244] 502 response executing GraphQL.
2023-09-14 14:10:45,331 ERROR   Thread-2 (_heartbeat):1461321 [internal_api.py:execute():245] 
<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>502 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>

2023-09-14 15:14:04,762 INFO    Thread-3 (_run_job):1461321 [wandb_run.py:_finish():1844] finishing run xxxxx/xxxxxxxx/df4532k4
2023-09-14 15:14:04,762 INFO    Thread-3 (_run_job):1461321 [wandb_run.py:_atexit_cleanup():2070] got exitcode: 1
2023-09-14 15:14:04,762 INFO    Thread-3 (_run_job):1461321 [wandb_run.py:_restore():2053] restore
2023-09-14 15:14:04,763 INFO    Thread-3 (_run_job):1461321 [wandb_run.py:_restore():2059] restore done
2023-09-14 15:14:35,981 INFO    Thread-3 (_run_job):1461321 [wandb_run.py:_footer_history_summary_info():3427] rendering history
2023-09-14 15:14:35,982 INFO    Thread-3 (_run_job):1461321 [wandb_run.py:_footer_history_summary_info():3459] rendering summary
2023-09-14 15:14:35,985 INFO    Thread-3 (_run_job):1461321 [wandb_run.py:_footer_sync_info():3383] logging synced files

Hi @gurugecl , thank you for writing in, and happy to help. I will review the provided info and circle back. In the meantime, could you provide me a link to the sweep where you are seeing this behavior? Thanks

Hi @gurugecl , I ran through some tests and was unable to reproduce a situation where a wandb bayes sweep with hyperband early termination produces a failed/crashed run when early stopping occurs; see project here. Do you have a reproducible working example I could test against?

Some high-level info on how bayes treats runs

When a Bayesian sweep run executes to completion, we always use the “best” metric value of the last finished run for the next run. The user can change this default behavior with an impute strategy:

  • “worst”: use the worst value of the run’s target metric,
  • “best”: use the best value of the run’s target metric,
  • “latest”: use the latest value of the run’s target metric.

If a run fails, crashes, or is killed, wandb defaults to the “worst” metric value unless the user has set an impute strategy.
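To make the three strategies concrete, here is a minimal sketch of how each one would pick a value from a run's logged metric history. The function name impute_metric and its arguments are hypothetical, chosen for illustration only; this is not wandb's internal implementation, just the selection rule described above.

```python
from typing import Sequence


def impute_metric(history: Sequence[float], goal: str, strategy: str = "worst") -> float:
    """Pick one value from a run's metric history (illustrative only).

    goal is "maximize" or "minimize"; strategy is "worst", "best" or "latest".
    """
    if not history:
        raise ValueError("history must contain at least one logged value")
    if goal not in ("maximize", "minimize"):
        raise ValueError(f"unknown goal: {goal!r}")

    if strategy == "latest":
        return history[-1]

    # "best" depends on the direction of the optimization goal.
    best_fn, worst_fn = (max, min) if goal == "maximize" else (min, max)
    if strategy == "best":
        return best_fn(history)
    if strategy == "worst":
        return worst_fn(history)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Example: validation ROC values logged by a run that was stopped early.
val_roc = [0.61, 0.70, 0.66]
print(impute_metric(val_roc, "maximize", "worst"))   # 0.61
print(impute_metric(val_roc, "maximize", "best"))    # 0.7
print(impute_metric(val_roc, "maximize", "latest"))  # 0.66
```

With a maximize goal, a run that ends up Failed would contribute its lowest logged value under the default described above, which is why the run state matters for the optimizer.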

In terms of the difference between crashed and failed runs:

A failed run means a non-zero exit (control-c, an exception, or sys.exit(1)).

  • Scenario: W&B kills the agent, and does not run any further sweep experiments.
  • The sweep agent is designed to shut down if (by default) there are 3 failures in the first 60 seconds, or the first 5 runs started are failures.

A crashed run: the wandb backend never got a successful exit indication after 5-10 minutes of no heartbeat.
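Since a failed run corresponds to a non-zero exit, one way to find out what is raising inside the sweep function is to wrap it so that any exception is printed with its full traceback before it propagates. The sketch below is a generic Python decorator; the name log_exceptions and the demo function are hypothetical and not part of wandb.

```python
import functools
import traceback


def log_exceptions(fn):
    """Print the full traceback of any exception raised by fn, then re-raise it."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Make the cause visible in the console output before the
            # exception propagates and the run is marked as failed.
            traceback.print_exc()
            raise

    return wrapper


@log_exceptions
def demo_main():
    """Stand-in for a training function that fails partway through."""
    raise RuntimeError("simulated failure during training")


if __name__ == "__main__":
    try:
        demo_main()
    except RuntimeError as err:
        print(f"run would be marked failed: {err}")

    # Assumed usage with a sweep, with main and sweep_id defined as in the post above:
    # wandb.agent(sweep_id, function=log_exceptions(main), count=100)
```

The wrapper deliberately re-raises, so the exit status is unchanged; it only adds the traceback to the output. It catches Exception rather than BaseException, so a control-c passes through unlogged.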

Hi @mohammadbakir, sorry for the delay in responding but thank you for looking into that and for the information provided! Sure, here is a link to my sweep: Weights & Biases

As you can see, almost all the failed runs are happening at epoch 16 due to early termination.
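For context on why the stops cluster at epoch 16: assuming the documented hyperband behavior, where brackets fall at min_iter multiplied by successive powers of eta (default 3), and assuming one wandb.log call per epoch, the first bracket for min_iter = 17 lands on the 17th logged step, which is zero-based epoch 16. The helper name hyperband_brackets and the max_iter cap are illustrative and not part of wandb; this is a rough sketch, not the library's implementation.

```python
from typing import List


def hyperband_brackets(min_iter: int, eta: int = 3, max_iter: int = 140) -> List[int]:
    """Return the iterations (1-based logged steps) at which a run would be
    evaluated, assuming brackets at min_iter * eta**k up to max_iter."""
    if min_iter < 1 or eta < 2:
        raise ValueError("min_iter must be >= 1 and eta must be >= 2")
    brackets = []
    step = min_iter
    while step <= max_iter:
        brackets.append(step)
        step *= eta
    return brackets


# Values from the sweep configuration in the original post: min_iter=17, 140 epochs.
brackets = hyperband_brackets(min_iter=17, eta=3, max_iter=140)
print(brackets)                   # [17, 51]
print([b - 1 for b in brackets])  # [16, 50] as zero-based epoch indices
```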

Please let me know if anything else is needed. Thank you!

Hi @gurugecl , thanks for providing a link to your workspace. Nothing in the run workspace reveals the reason for the “Failed” status. I ran tests on my end and confirmed that early termination works as expected, with runs completing successfully.

  • Are you executing your sweeps within a containerized environment? If so, do you still see the same behavior when you run a test sweep outside that environment?
  • If you update the wandb SDK to the latest release, 0.15.11, do you still see the same behavior?
  • Could you provide us a set of debug logs for the failed runs? The debug.log and debug-internal.log files are located in the wandb working directory under those runs, at wandb/<run>/logs. Please send them to support@wandb.com and reference this post.

Thanks

Hi @gurugecl , apologies for the delay on this. We’ve reviewed your logs, and the following error repeats:

raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7effa305a8f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution

The error message indicates that the DNS (Domain Name System) resolution process, which is responsible for translating domain names into IP addresses, failed temporarily. This failure could be due to various reasons, such as issues with the network configuration, DNS server problems, or temporary connectivity issues.
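A quick way to check name resolution from the machine running the sweep is to ask the resolver directly. The sketch below uses only the standard library; the helper name can_resolve is hypothetical, and api.wandb.ai is assumed to be the host the client is trying to reach (adjust it for a self-hosted server).

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if the system resolver can translate host into an address."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Raised for resolver failures such as
        # "[Errno -3] Temporary failure in name resolution".
        return False


if __name__ == "__main__":
    for host in ("localhost", "api.wandb.ai"):
        status = "ok" if can_resolve(host) else "FAILED"
        print(f"{host}: {status}")
```

Running this periodically during a long sweep, for example from a separate shell, could help show whether resolution failures line up with the times the runs are marked as failed.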

Are you running this from a containerized environment with special network connections or behind a proxy?

Hi Deek, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

Hi @mohammadbakir, sorry for the delay in responding. No, I’m using neither a containerized environment nor a proxy. This has been happening continuously over several weeks, so I don’t think it’s due to temporary connectivity issues, but please let me know if there is any other information I can provide regarding my setup.
