Sweep agents sometimes become extremely slow

Hello,

I am running a hyperparameter grid search using sweeps. I launched 4 agents on the same machine, but I noticed that after completing one run, one of the agents is struggling with the next run.

This agent still seems to be communicating with wandb.ai: the “updated” variable on the active run is regularly refreshed to the current time, and the agent itself has the same heartbeat as the others.

The agent should be on its 8th run by now, but it is only on its second and has only computed two epochs. All runs should take the same amount of time, as they have the same number of epochs and the same model architecture.

The “Logs” panel is also completely blank for this agent. And now that I’m writing this, it seems another agent is starting to slow down.

Also, I had the same issue in the previous sweep; this is the second time I have run it with the same configuration. Previously, the issue appeared in the first few runs, so I quickly stopped it and ran everything again.

Is this most likely an issue with CPU resources not being well distributed (all threads are at 100% usage), or could it be a network issue? How can I investigate what’s happening?

I will add that the process seems to keep running while using very little CPU, and it continues to log epochs. It has reached epoch 64, but only epochs up to 4 are logged on wandb.ai.
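
One way I could narrow this down is to record, for each epoch, both the wall-clock time and the CPU time used by the training process, and write that locally, independent of wandb. Below is a minimal sketch using only the Python standard library. The `EpochTimer` class and its `mark` method are hypothetical names I made up for illustration, not part of wandb. If the CPU-to-wall ratio is close to zero for the slow run, the process is mostly waiting (on I/O, the network, or a lock) rather than being starved of CPU.

```python
import time


class EpochTimer:
    """Hypothetical helper: records wall-clock and CPU time per epoch."""

    def __init__(self):
        self.records = []
        self._wall = time.perf_counter()
        # process_time() sums user + system CPU time over all threads of
        # this process; it does not include child processes.
        self._cpu = time.process_time()

    def mark(self, epoch):
        """Call at the end of each epoch; returns the timing record."""
        now_wall = time.perf_counter()
        now_cpu = time.process_time()
        wall = now_wall - self._wall
        cpu = now_cpu - self._cpu
        record = {
            "epoch": epoch,
            "wall_s": wall,
            "cpu_s": cpu,
            # Near 0: mostly waiting. Above 1: several threads were busy.
            "cpu_ratio": cpu / wall if wall > 0 else 0.0,
        }
        self.records.append(record)
        self._wall, self._cpu = now_wall, now_cpu
        return record


if __name__ == "__main__":
    timer = EpochTimer()
    for epoch in range(3):
        time.sleep(0.05)  # stand-in for an epoch that mostly waits
        rec = timer.mark(epoch)
        print(f"epoch={rec['epoch']} wall={rec['wall_s']:.3f}s "
              f"cpu={rec['cpu_s']:.3f}s ratio={rec['cpu_ratio']:.2f}",
              flush=True)
```

In the real training loop, `timer.mark(epoch)` would be called once at the end of each epoch and the result printed or appended to a local file, so the timings of the slow agent can be compared with those of the healthy ones.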

Perhaps this is a network issue that randomly targets some processes?