ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host


I’ve been using WANDB for hyperparameter tuning sweeps and successfully set up my code to run it from my local machine. I’m utilizing Spyder and a GPU for each sweep on my local machine. My goal was to conduct 50 sweeps, but after just two, I encountered the error below that caused all subsequent sweeps to crash. Initially, I suspected it might be a Windows firewall issue, but having ensured proper access and seeing the first two sweeps run successfully, I doubt it’s the cause. If it were a firewall issue, I wouldn’t have been able to run even two sweeps.

Could you advise on how to resolve this issue? Each sweep takes about one or two hours. Also, is there a limit on the time allowed for logging data to your database?

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

Can someone please help me with this? I am not able to run more than two sweeps; every sweep after that crashes immediately. This is frustrating, and I do not know how to fix it.

Hi @bestwayyy, Could you let us know what version of wandb you are using?

Also, could you search through the debug-internal.log file from the local run folder of a run that is hitting this and share any relevant errors?
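To make that search easier, here is a minimal sketch that scans the internal debug logs for lines that look like errors. It assumes the default local layout of `wandb/run-*/logs/debug-internal.log` under your working directory; adjust the path if your run folders live elsewhere. The keyword list and the function names `find_error_lines` and `scan_runs` are illustrative choices, not part of wandb.

```python
from pathlib import Path
from typing import List, Sequence, Tuple

# Illustrative keywords; extend as needed for your own error.
KEYWORDS: Sequence[str] = ("ERROR", "ConnectionResetError", "10054")


def find_error_lines(
    log_path: Path, keywords: Sequence[str] = KEYWORDS
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing any keyword."""
    hits: List[Tuple[int, str]] = []
    # errors="replace" keeps the scan going if the log has odd bytes.
    with log_path.open("r", encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            if any(keyword in line for keyword in keywords):
                hits.append((lineno, line.rstrip()))
    return hits


def scan_runs(root: Path = Path("wandb")) -> None:
    """Print matching lines from each run's debug-internal.log (assumed layout)."""
    for log_path in sorted(root.glob("run-*/logs/debug-internal.log")):
        for lineno, line in find_error_lines(log_path):
            print(f"{log_path}:{lineno}: {line}")


if __name__ == "__main__":
    scan_runs()
```

Run it from the directory that contains the `wandb` folder and paste the matching lines into your reply.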

There isn’t a time limit on API calls, so I don’t think this is coming from the server side. Do you go through a proxy or any other network infrastructure? I agree that a firewall would only make sense if you ran into this every time, but there might be other network infrastructure that is timing out and closing the connection.

Thank you,

Hi @bestwayyy, I just wanted to follow up and see if this was still an issue?

Hi @bestwayyy, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

Quick tip for anybody experiencing this: reduce your logging frequency. For example, if you’re running more than 10 iterations per second and logging on every iteration, log on every 10th iteration instead. This generally fixes the issue for me.
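A rough sketch of what that looks like in a training loop. The `train` function, the `log_every` parameter, and the placeholder loss are made up for illustration; `log_fn` stands in for whatever you log with, so in a real run you would pass `wandb.log` after calling `wandb.init()`. Averaging the loss over the skipped steps is optional, but it means you don't lose information by logging less often.

```python
from typing import Callable, Dict, List


def train(
    num_steps: int,
    log_every: int,
    log_fn: Callable[[Dict[str, float]], None],
) -> None:
    """Toy loop that logs once every `log_every` steps instead of every step."""
    running_loss = 0.0
    for step in range(1, num_steps + 1):
        loss = 1.0 / step  # placeholder for a real training step
        running_loss += loss
        if step % log_every == 0:
            # Log the average over the window, then reset the accumulator.
            log_fn({"step": step, "loss": running_loss / log_every})
            running_loss = 0.0


# Stand-in logger that just collects the payloads in a list.
logged: List[Dict[str, float]] = []
train(num_steps=100, log_every=10, log_fn=logged.append)
print(f"{len(logged)} log calls for 100 steps")
```

With wandb this would be `train(num_steps=..., log_every=10, log_fn=wandb.log)`, which cuts the number of log calls by a factor of ten.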

Note: I see this issue all the time, and it’s incredibly annoying because it breaks the entire training run, and the only fix is to restart the Python interpreter. (I wish there were something like wandb.reset() that would at least allow fixing this at runtime.)