Runs log stops at 50

Hello, I am running wandb locally on my computer. I start a sweep and it runs smoothly, but when it reaches 50 runs it stops. Although the kernel seems to be running, no new runs show up on the wandb site or in the local files on my computer. Does anyone know what the problem might be? I can provide logs if requested; I just don't know which ones would be helpful to post.

Hi John, can you give me more information on how you set up your sweep? For example, are you using grid search? If so, can you switch to random search and see if you still run into the same problem with it crashing at 50 runs? The crash might be caused by a specific parameter configuration in the sweep.

Hello, I have the same issue on another computer; this one stopped at 32 runs. The same code runs just fine in Google Colab. Could it be a hardware issue? It is a grid search.

import pprint

sweep_config = {
    'method': 'grid'
}

metric = {
    'name': 'loss',
    'goal': 'minimize'
}

sweep_config['metric'] = metric

parameters_dict = {
    'learning-rate': {
        'values': [0.001, 0.0001, 0.002]
    },
    'conv1': {
        'values': [32, 48, 64]
    },
    'conv2': {
        'values': [48, 64, 128]
    },
    'conv3': {
        'values': [64, 128, 256]
    },
    'dropout': {
        'values': [0.2, 0.3]
    },
    'batch_size': {
        'values': [64, 128, 256]
    },
}

sweep_config['parameters'] = parameters_dict

parameters_dict.update({
    'epochs': {
        'value': 10
    }
})

pprint.pprint(sweep_config)
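Incidentally, the grid is nowhere near exhausted when it stops; a quick count of the combinations (epochs is a fixed single value, so it does not multiply):

```python
import math

# values per swept parameter, taken from parameters_dict above:
# learning-rate, conv1, conv2, conv3, dropout, batch_size
value_counts = [3, 3, 3, 3, 2, 3]
total = math.prod(value_counts)
print(total)  # 486 total grid combinations, so 50 or 32 runs is far from the end
```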

Last log:
2022-07-08T17:17:49.759534 Sending commands to agent c6swh3i1: [{"run_id":"kbabazlh","program":"","type":"run","args":{"batch_size":{"value":64},"conv1":{"value":32},"conv2":{"value":64},"conv3":{"value":256},"dropout":{"value":0.2},"epochs":{"value":10},"learning-rate":{"value":0.001}},"runqueue_item_id":"UnVuUXVldWVJdGVtOjEyNTk3MDUzNQ==","logs":,"run_storage_id":"UnVuOnYxOmtiYWJhemxoOkVNR18xOnBvZGlrYWtvcw=="}]

2022-07-08T17:19:21.436651 Launched new run 113o3dnp (decent-sweep-32)

2022-07-08T17:19:21.465885 Sending commands to agent c6swh3i1: [{"run_id":"113o3dnp","program":"","type":"run","args":{"batch_size":{"value":64},"conv1":{"value":32},"conv2":{"value":64},"conv3":{"value":256},"dropout":{"value":0.2},"epochs":{"value":10},"learning-rate":{"value":0.0001}},"runqueue_item_id":"UnVuUXVldWVJdGVtOjEyNTk3MDUzNg==","logs":,"run_storage_id":"UnVuOnYxOjExM28zZG5wOkVNR18xOnBvZGlrYWtvcw=="}]

2022-07-08T17:35:19.290112 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T17:55:11.671631 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T18:15:13.199534 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T18:35:13.31471 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T18:55:15.914243 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T19:15:11.71383 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T19:35:17.42472 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T19:55:16.585164 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T20:15:17.665501 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T20:35:13.42555 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T20:55:16.863404 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T21:15:12.686781 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T21:35:11.491317 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T21:55:11.980649 Agent c6swh3i1 state changed from ERROR to RUNNING

Hi John, is there a reason you need to use grid search? Since this configuration is not working properly, it would be better to use random search in this case.
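For example, keeping everything else the same and only changing the method key would look like this (the parameter lists below are copied from your config):

```python
import pprint

# Identical to the grid config above, except 'method' is now 'random'.
# Random search samples combinations instead of enumerating all 486,
# so a single bad configuration is less likely to stall the whole sweep.
sweep_config = {
    'method': 'random',
    'metric': {'name': 'loss', 'goal': 'minimize'},
    'parameters': {
        'learning-rate': {'values': [0.001, 0.0001, 0.002]},
        'conv1': {'values': [32, 48, 64]},
        'conv2': {'values': [48, 64, 128]},
        'conv3': {'values': [64, 128, 256]},
        'dropout': {'values': [0.2, 0.3]},
        'batch_size': {'values': [64, 128, 256]},
        'epochs': {'value': 10},
    },
}

pprint.pprint(sweep_config)
```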

Hi John,

Do you still need help here?

Warmly,
Leslie

Hello,

this configuration runs properly in Google Colab. The reason I use grid search is that I need to explore all possible configurations. Why do I get this error? Any ideas? Is grid search problematic with any particular version of Python or TensorFlow?

That's interesting; if it works in one place, it should work in the other. Can you send me the debug logs from when you ran it in your terminal, please?

Do you still need help here, John?

Hi John, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!