I am running a hyperparameter grid search using sweeps. I launched 4 agents on the same machine but I noticed that after completing one run, one of the agents is struggling with the next run.
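For context, the setup is roughly this (the sweep config, project name, and `train` function are placeholders, not my exact code):

```python
# Rough sketch of how the grid-search sweep and the 4 agents are set up.
# The parameter names and "my-project" are placeholders for my actual configuration.
import wandb

sweep_config = {
    "method": "grid",
    "parameters": {
        "learning_rate": {"values": [1e-4, 3e-4, 1e-3]},
        "batch_size": {"values": [32, 64]},
    },
}

def train():
    run = wandb.init()
    # ... build the model from run.config and train for a fixed number of epochs ...
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="my-project")

# Each of the 4 agents is then started in its own process/terminal, either with
#   wandb agent <entity>/my-project/<sweep_id>
# on the command line, or with wandb.agent(sweep_id, function=train) from Python.
```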
This agent still seems to be communicating with wandb.ai: the “Updated” field on the active run keeps refreshing to the current time, and the agent itself has the same heartbeat as the others.
The agent should be on its 8th run by now, but it is only on its second and has only completed two epochs. All runs should take the same amount of time, since they have the same number of epochs and model architecture.
The “Logs” panel is also completely blank for this agent. And now that I’m writing this, it seems another agent is starting to slow down.
Also, I had the same issue in the previous sweep; this is the second time I’ve run it with the same configuration. Previously the issue appeared in the first few runs, so I quickly stopped it and ran everything again.
Is it most likely an issue with the CPU resources not being well distributed between the agents (all threads are at 100% usage), or could it be a network issue? How can I investigate what’s happening?
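The only thing I can think of so far is watching per-process CPU usage on the machine, e.g. with a quick psutil loop like this (the command-line filter is a placeholder for however the training processes show up):

```python
# Quick psutil loop to compare CPU usage of the agent/training processes over time.
# The "train.py" filter is a placeholder; adjust it to match your own processes.
# Note: the first pass prints 0.0% for everything; numbers become meaningful from
# the second loop iteration onward.
import time
import psutil

while True:
    for p in psutil.process_iter(["pid", "cmdline"]):
        try:
            cmd = " ".join(p.info["cmdline"] or [])
            if "wandb agent" in cmd or "train.py" in cmd:
                print(p.info["pid"], f"{p.cpu_percent(interval=None):6.1f}%", cmd[:80])
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    print("---")
    time.sleep(5)
```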
I will add that the process does seem to keep going while using very little CPU, and it continues to log epochs locally: it has reached epoch 64, but only up to epoch 4 has been logged on wandb.ai.
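To quantify the lag, I can compare what the public API reports for the run against the local progress, with something like this (the entity/project/run path is a placeholder, and “epoch” is just the metric key I log):

```python
# Check what wandb.ai has actually received for the run, versus the local epoch counter.
# "my-entity/my-project/run_id" and the "epoch" summary key are placeholders.
import wandb

api = wandb.Api()
run = api.run("my-entity/my-project/run_id")
print("state:", run.state)
print("last epoch seen by wandb.ai:", run.summary.get("epoch"))
```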
Perhaps this is a network issue that randomly targets some processes?
Hi Ben, thank you for writing in! This shouldn’t be happening. Can you double-check that you have a good internet connection? If your connection is fine, can you send me the link to your sweep page along with the debug logs found in your wandb run directory, please?
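The files I’m after are the `debug.log` and `debug-internal.log` files under the local `wandb/` directory of the machine running the agents (the exact layout can vary a bit between client versions, hence the recursive search below). A quick way to locate them from the directory you launched the agents in is something like:

```python
# List the wandb debug logs so they can be attached to the report.
# run-*/logs/ is the usual layout, but it can differ between wandb versions,
# so search recursively under wandb/.
from pathlib import Path

for log in sorted(Path("wandb").rglob("debug*.log")):
    print(log, f"({log.stat().st_size} bytes)")
```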
Sorry for not replying earlier; I don’t know why the notifications never reached me.
Anyway, I’m having the issue again, and this time it also appears in regular runs, not just sweeps.
See this project: Weights & Biases
glorious-armadillo-18 and dauntless-deluge-17 were launched at the same time and initially crunched through the iterations at the same speed, but then the second one slowed down significantly.
They don’t have exactly the same computational cost, but glorious-armadillo-18 is the one that should be slightly slower. Despite this, it is the one of the two that finished all 390 training iterations.
The internet connection seems to be fine, and this time it isn’t even the same compute server as in my first post.
By the way, the logs shown on wandb.ai are sometimes a bit mangled, even though everything looks fine in the log file on the server.