Runs log stops at 50

Hello, I am running wandb locally on my computer. I start a sweep and it runs smoothly, but when it reaches 50 runs it stops. Although the kernel seems to be running, no new runs show up on the wandb site or in the local files on my computer. Does anyone know what the problem might be? I can provide logs if requested; I just don't know which ones would be helpful to post.

Hi John, can you give me more information on how you set up your sweep? For example, are you using grid search? If so, can you switch to random search and see if you still run into the same problem with it crashing at 50 runs? The crash might be caused by a specific parameter configuration in the sweep.

Hello, I have the same issue on another computer; this one stopped at 32 runs. The same code runs just fine in Google Colab. Could it be a hardware issue? It is a grid search.

import pprint

sweep_config = {
    'method': 'grid'
}

metric = {
    'name': 'loss',
    'goal': 'minimize'
}

sweep_config['metric'] = metric

parameters_dict = {
    'learning-rate': {
        'values': [0.001, 0.0001, 0.002]
    },
    'conv1': {
        'values': [32, 48, 64]
    },
    'conv2': {
        'values': [48, 64, 128]
    },
    'conv3': {
        'values': [64, 128, 256]
    },
    'dropout': {
        'values': [0.2, 0.3]
    },
    'batch_size': {
        'values': [64, 128, 256]
    },
}

sweep_config['parameters'] = parameters_dict

parameters_dict.update({
    'epochs': {
        'value': 10
    }
})

pprint.pprint(sweep_config)
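Incidentally, the grid is nowhere near exhausted when it stops; a quick count of the combinations (epochs is a fixed single value, so it does not multiply):

```python
import math

# values per swept parameter, taken from parameters_dict above:
# learning-rate, conv1, conv2, conv3, dropout, batch_size
value_counts = [3, 3, 3, 3, 2, 3]
total = math.prod(value_counts)
print(total)  # 486 total grid combinations, so 50 or 32 runs is far from the end
```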

Last log:
2022-07-08T17:17:49.759534 Sending commands to agent c6swh3i1: [{"run_id":"kbabazlh","program":"","type":"run","args":{"batch_size":{"value":64},"conv1":{"value":32},"conv2":{"value":64},"conv3":{"value":256},"dropout":{"value":0.2},"epochs":{"value":10},"learning-rate":{"value":0.001}},"runqueue_item_id":"UnVuUXVldWVJdGVtOjEyNTk3MDUzNQ==","logs":,"run_storage_id":"UnVuOnYxOmtiYWJhemxoOkVNR18xOnBvZGlrYWtvcw=="}]

2022-07-08T17:19:21.436651 Launched new run 113o3dnp (decent-sweep-32)

2022-07-08T17:19:21.465885 Sending commands to agent c6swh3i1: [{"run_id":"113o3dnp","program":"","type":"run","args":{"batch_size":{"value":64},"conv1":{"value":32},"conv2":{"value":64},"conv3":{"value":256},"dropout":{"value":0.2},"epochs":{"value":10},"learning-rate":{"value":0.0001}},"runqueue_item_id":"UnVuUXVldWVJdGVtOjEyNTk3MDUzNg==","logs":,"run_storage_id":"UnVuOnYxOjExM28zZG5wOkVNR18xOnBvZGlrYWtvcw=="}]

2022-07-08T17:35:19.290112 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T17:55:11.671631 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T18:15:13.199534 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T18:35:13.31471 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T18:55:15.914243 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T19:15:11.71383 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T19:35:17.42472 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T19:55:16.585164 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T20:15:17.665501 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T20:35:13.42555 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T20:55:16.863404 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T21:15:12.686781 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T21:35:11.491317 Agent c6swh3i1 state changed from ERROR to RUNNING

2022-07-08T21:55:11.980649 Agent c6swh3i1 state changed from ERROR to RUNNING

Hi John, is there a reason you need to use grid search? Since this configuration is not working properly, it would be better to use random search in this case.
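For example, keeping everything else the same and only changing the method key would look like this (the parameter lists below are copied from your config):

```python
import pprint

# Identical to the grid config above, except 'method' is now 'random'.
# Random search samples combinations instead of enumerating all 486,
# so a single bad configuration is less likely to stall the whole sweep.
sweep_config = {
    'method': 'random',
    'metric': {'name': 'loss', 'goal': 'minimize'},
    'parameters': {
        'learning-rate': {'values': [0.001, 0.0001, 0.002]},
        'conv1': {'values': [32, 48, 64]},
        'conv2': {'values': [48, 64, 128]},
        'conv3': {'values': [64, 128, 256]},
        'dropout': {'values': [0.2, 0.3]},
        'batch_size': {'values': [64, 128, 256]},
        'epochs': {'value': 10},
    },
}

pprint.pprint(sweep_config)
```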

Hi John,

Do you still need help here?

Warmly,
Leslie

Hello,

this configuration runs properly in Google Colab. The reason I use grid search is that I need to explore all possible configurations. Why do I get this error? Any ideas? Is grid search problematic with any particular version of Python or TensorFlow?

That's interesting; if it works in one place, it should work in the other. Can you send me the debug logs from when you ran it in your terminal, please?

Do you still need help here, John?

Hi John, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!