Issue with grid sweep

I'm running into an issue that started late last week when trying to run a grid sweep. The estimated number of runs is 5292, but when the sweep reaches run number 300, the API server stops issuing new runs. The sweep is still in the started state and looks active, with the agents reporting a heartbeat, but on the client/agent side it looks as if the run is paused:

2024-03-31 23:55:40,348 - wandb.wandb_agent - INFO - Running runs: []

Pausing/resuming the run has no effect on this. If I stop the agent and try to relaunch it, the heartbeat disconnects, but instead of reconnecting the agent gives the following error:

wandb: Starting wandb agent 🕵️
wandb: Network error (ReadTimeout), entering retry loop.
client_loop: send disconnect: Broken pipe

The cli logs show:

socket.timeout: The read operation timed out

urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Read timed out. (read timeout=20)

requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='api.wandb.ai', port=443): Read timed out. (read timeout=20)

The network connection itself is fine: I can create a new sweep from the same device and it will start a new set of runs, but it similarly stops when it hits run 300 (I have tried 3 separate times).
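For a sense of scale, the run count of a grid sweep is the product of the number of values listed for each parameter. Below is a minimal sketch of that arithmetic. The parameter names and values are hypothetical (the actual sweep config is not shown in this thread) and were chosen only so that the product comes out to 5292; `grid_size` is an illustrative helper, not part of wandb.

```python
import math

# Hypothetical grid sweep config, NOT the config from this thread.
# Only the shape (method + parameters with "values" lists) follows
# the usual sweep config layout; names and values are made up.
sweep_config = {
    "method": "grid",
    "parameters": {
        "learning_rate": {"values": [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2]},
        "batch_size": {"values": [8, 16, 32, 64, 128, 256, 512]},
        "num_layers": {"values": [2, 4, 6]},
        "dropout": {"values": [0.0, 0.1, 0.2]},
        "optimizer": {"values": ["sgd", "adam", "adamw"]},
        "use_scheduler": {"values": [True, False]},
        "augment": {"values": [True, False]},
    },
}


def grid_size(config):
    """Number of grid points: product of the value counts per parameter."""
    return math.prod(len(p["values"]) for p in config["parameters"].values())


print(grid_size(sweep_config))  # 7 * 7 * 3 * 3 * 3 * 2 * 2 = 5292
```

With a grid this large, a stall at run 300 leaves roughly 94% of the grid unexplored, which is why restarting the sweep from scratch is not a practical workaround.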

hey @wadeschulzyale - would it be possible to provide the debug.log and debug-internal.log files associated with a run from the error? They should be located in the wandb folder in the same directory as where the script was run. The wandb folder has folders formatted as run-DATETIME-ID associated with a single run.
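If it helps with gathering these, here is a small sketch for listing the two log files per run. It assumes only what is described above (a wandb folder containing run-DATETIME-ID folders) and searches each run folder recursively, since the exact subfolder holding the logs may vary. `find_debug_logs` is a hypothetical helper name, not a wandb function.

```python
from pathlib import Path

# File names requested above.
LOG_NAMES = {"debug.log", "debug-internal.log"}


def find_debug_logs(wandb_dir="wandb"):
    """Map each run-* folder name to the debug log files found inside it."""
    root = Path(wandb_dir)
    found = {}
    for run_dir in sorted(root.glob("run-*")):
        if not run_dir.is_dir():
            continue
        # Search recursively; the logs may sit in a subfolder of the run folder.
        logs = sorted(p for p in run_dir.rglob("debug*.log") if p.name in LOG_NAMES)
        if logs:
            found[run_dir.name] = logs
    return found


if __name__ == "__main__":
    for run_name, logs in find_debug_logs().items():
        print(run_name)
        for log in logs:
            print("   ", log)
```

Run it from the directory where the training script was launched, or pass the path to the wandb folder explicitly.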

@uma-wandb Would be happy to - is there an address I can send them to? The runs themselves completed fine, and I'm not noticing anything in the run-level logs that demonstrates an error. The problem seems to be at the sweep level: the sweep does not send new runs to the client, and if the client is stopped and tries to reconnect, the above error/logs are produced for the sweep ID.

Thanks,
Wade

Hi @uma-wandb - just wanted to check back on whether there was somewhere you’d like me to send the additional logs or any feedback on the underlying sweep scheduling.

Thanks,
Wade

Hi @wadeschulzyale - truly sorry for the late reply here. Could you please send that to support@wandb.com and we can keep troubleshooting from there? Thanks!

hey @wadeschulzyale - were you able to send over the debug logs to support@wandb.com? this would be super helpful in root causing this, thanks again!

Hi @wadeschulzyale , since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!