I am running a hyperparameter grid search using sweeps. I launched 4 agents on the same machine but I noticed that after completing one run, one of the agents is struggling with the next run.
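For context, the setup is roughly this (the sweep config, project name, and `train` function are placeholders, not my exact code):

```python
# Rough sketch of how the grid-search sweep and the 4 agents are set up.
# The parameter names and "my-project" are placeholders for my actual configuration.
import wandb

sweep_config = {
    "method": "grid",
    "parameters": {
        "learning_rate": {"values": [1e-4, 3e-4, 1e-3]},
        "batch_size": {"values": [32, 64]},
    },
}

def train():
    run = wandb.init()
    # ... build the model from run.config and train for a fixed number of epochs ...
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="my-project")

# Each of the 4 agents is then started in its own process/terminal, either with
#   wandb agent <entity>/my-project/<sweep_id>
# on the command line, or with wandb.agent(sweep_id, function=train) from Python.
```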
This agent still seems to be communicating with wandb.ai: the “Updated” field on the active run keeps refreshing to the current time, and the agent itself has the same heartbeat as the others.
The agent should be on its 8th run by now, but it is only on its second and has only completed two epochs. All runs should take the same amount of time, since they have the same number of epochs and model architecture.
The “Logs” panel is also completely blank for this agent. And now that I’m writing this, it seems another agent is starting to slow down.
Also, I had the same issue in the previous sweep; this is the second time I’ve run it with the same configuration. Previously the issue appeared in the first few runs, so I quickly stopped it and ran everything again.
Is it most likely an issue with the CPU resources not being well distributed between the agents (all threads are at 100% usage), or could it be a network issue? How can I investigate what’s happening?
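The only thing I can think of so far is watching per-process CPU usage on the machine, e.g. with a quick psutil loop like this (the command-line filter is a placeholder for however the training processes show up):

```python
# Quick psutil loop to compare CPU usage of the agent/training processes over time.
# The "train.py" filter is a placeholder; adjust it to match your own processes.
# Note: the first pass prints 0.0% for everything; numbers become meaningful from
# the second loop iteration onward.
import time
import psutil

while True:
    for p in psutil.process_iter(["pid", "cmdline"]):
        try:
            cmd = " ".join(p.info["cmdline"] or [])
            if "wandb agent" in cmd or "train.py" in cmd:
                print(p.info["pid"], f"{p.cpu_percent(interval=None):6.1f}%", cmd[:80])
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    print("---")
    time.sleep(5)
```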
I will add that the process does seem to keep going while using very little CPU, and it continues to log epochs locally: it has reached epoch 64, but only up to epoch 4 has been logged on wandb.ai.
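To quantify the lag, I can compare what the public API reports for the run against the local progress, with something like this (the entity/project/run path is a placeholder, and “epoch” is just the metric key I log):

```python
# Check what wandb.ai has actually received for the run, versus the local epoch counter.
# "my-entity/my-project/run_id" and the "epoch" summary key are placeholders.
import wandb

api = wandb.Api()
run = api.run("my-entity/my-project/run_id")
print("state:", run.state)
print("last epoch seen by wandb.ai:", run.summary.get("epoch"))
```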
Perhaps this is a network issue that randomly targets some processes?
Hi Ben, thank you for writing in! This shouldn’t be happening. Can you double-check that you have a good internet connection? If your connection is fine, can you send me the link to your sweep page along with the debug logs found in your wandb run directory, please?
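The files I’m after are the `debug.log` and `debug-internal.log` files under the local `wandb/` directory of the machine running the agents (the exact layout can vary a bit between client versions, hence the recursive search below). A quick way to locate them from the directory you launched the agents in is something like:

```python
# List the wandb debug logs so they can be attached to the report.
# run-*/logs/ is the usual layout, but it can differ between wandb versions,
# so search recursively under wandb/.
from pathlib import Path

for log in sorted(Path("wandb").rglob("debug*.log")):
    print(log, f"({log.stat().st_size} bytes)")
```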
Sorry for not replying earlier; I don’t know why the notifications never reached me.
Anyway, I’m having the issue again, and this time it also appears in regular runs, not just sweeps.
See this project: Weights & Biases
glorious-armadillo-18 and dauntless-deluge-17 were launched at the same time and initially crunched through the iterations at the same speed, but then the second one slowed down significantly.
They don’t have exactly the same computational cost, but glorious-armadillo-18 is the one that should be slightly slower. Despite this, it is the one of the two that finished all 390 training iterations.
The internet connection seems to be fine, and this time it isn’t even the same compute server as in my first post.
By the way, the logs shown on wandb.ai are sometimes a bit mangled, even though everything looks fine in the log file on the server.