Issue with grid sweep

wadeschulzyale · April 1, 2024, 2:26pm

Running into an issue that started late last week when trying to run a grid sweep. The estimated number of runs is 5292, but when the sweep reaches run number 300, the API server quits issuing new runs. The sweep is still in the started state and looks active, with the agents reporting a heartbeat, but on the client/agent, it looks as if the run is paused:

2024-03-31 23:55:40,348 - wandb.wandb_agent - INFO - Running runs: []

Pausing/resuming the run has no effect on this. If I stop the agent and try to relaunch it, the heartbeat disconnects, but instead of reconnecting, the agent gives the following error:

wandb: Starting wandb agent 🕵️
wandb: Network error (ReadTimeout), entering retry loop.
client_loop: send disconnect: Broken pipe

The CLI logs show:

socket.timeout: The read operation timed out

urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Read timed out. (read timeout=20)

requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='api.wandb.ai', port=443): Read timed out. (read timeout=20)

The network connection itself is fine: I can create a new sweep from the same device and it will start a new set of runs, but it similarly stops when it hits run 300 (I have tried 3 separate times).

uma-wandb · April 4, 2024, 5:27pm

hey @wadeschulzyale - would it be possible to provide the debug.log and debug-internal.log files associated with a run from the error? They should be located in the wandb folder in the same directory as where the script was run. The wandb folder contains folders named run-DATETIME-ID, each associated with a single run.
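
For anyone gathering these files, the following is a minimal sketch of one way to collect them, assuming the layout described above (a wandb folder containing run-DATETIME-ID folders). It uses only the Python standard library and no wandb API. The function names `find_debug_logs` and `bundle_logs` and the archive name are hypothetical, and the recursive search is an assumption to cover log files that sit in a subfolder of each run folder.

```python
from pathlib import Path
import zipfile

# File names requested in the post above.
LOG_NAMES = {"debug.log", "debug-internal.log"}


def find_debug_logs(wandb_dir):
    """Return debug log paths found under run-* folders of wandb_dir.

    Hypothetical helper: searches each run folder recursively, since the
    exact location of the logs inside a run folder is an assumption here.
    """
    root = Path(wandb_dir)
    found = []
    for run_dir in sorted(root.glob("run-*")):
        if not run_dir.is_dir():
            continue
        for path in sorted(run_dir.rglob("*.log")):
            if path.name in LOG_NAMES:
                found.append(path)
    return found


def bundle_logs(wandb_dir, archive_path):
    """Zip the debug logs, keeping the run folder names; return the file count."""
    root = Path(wandb_dir)
    logs = find_debug_logs(root)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in logs:
            # Store paths relative to the wandb folder so runs stay distinguishable.
            zf.write(path, arcname=path.relative_to(root).as_posix())
    return len(logs)
```

Calling `bundle_logs("wandb", "wandb-debug-logs.zip")` from the directory where the script was run would produce a single archive to attach. Keeping the run-DATETIME-ID folder names in the archive lets each log be matched back to its run.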

wadeschulzyale · April 5, 2024, 4:24pm

@uma-wandb Would be happy to - is there an address I can send them to? The runs themselves completed fine, and I'm not noticing anything in the run-level logs that shows an error. The problem seems to be at the sweep level: the sweep does not send new runs to the client, and if the client is stopped and then tries to reconnect, the above error/logs are produced for the sweep ID.

Thanks,
Wade

wadeschulzyale · April 7, 2024, 10:49pm

Hi @uma-wandb - just wanted to check back on whether there was somewhere you’d like me to send the additional logs or any feedback on the underlying sweep scheduling.

Thanks,
Wade

fmamberti-wandb · April 10, 2024, 3:46pm

Hi @wadeschulzyale - truly sorry for the late reply here. Could you please send that to support@wandb.com and we can keep troubleshooting from there? Thanks!

uma-wandb · April 16, 2024, 10:16am

hey @wadeschulzyale - were you able to send over the debug logs to support@wandb.com? this would be super helpful in root causing this, thanks again!

uma-wandb · April 22, 2024, 9:00pm

Hi @wadeschulzyale, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!
