My sweep seems to have stopped because I was getting OOM errors on sweep artifacts, which I have now cleaned up. When I try to resume the sweep, I get:
There is no documentation on flapping that I can find, but in the codebase it appears to be a state that a sweep enters if n agents fail within k seconds. I have tried to export the env variable:
export WANDB_AGENT_DISABLE_FLAPPING=false
and then resumed my sweep again, but I get the same error. How can I resume my sweep?
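For anyone else reading, here is a rough sketch of what "n agents fail within k seconds" could look like as a sliding-window check. This is only an illustration of the idea as described above, not W&B's actual implementation; the class name `FlappingDetector` and the threshold values are hypothetical.

```python
import time
from collections import deque


class FlappingDetector:
    """Hypothetical sliding-window failure counter (not W&B's real code)."""

    def __init__(self, max_failures=3, window_seconds=60.0):
        # Assumed thresholds: "n" failures within "k" seconds.
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of recent failures

    def record_failure(self, now=None):
        """Record one failure; return True if the flapping threshold is reached."""
        now = time.monotonic() if now is None else now
        self._failures.append(now)
        # Drop failures that are older than the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures
```

Under this reading, a burst of failures from a single cause (such as a full disk) would trip the check quickly, while the same number of failures spread over a long period would not.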
Hi @adamits, this seems to work for me if I just assign a new agent to the sweep. I intentionally got a run to hit the flapping threshold, fixed my script so it wouldn’t error, and then started an agent, which began pulling new runs. Are you trying to resume the sweep from the UI?
Sorry I missed this. No, it does not work for me. The flapping started because disk space filled up, so a bunch of jobs errored out. I cleared disk space and started a new agent (calling the same sweep) with a Python script:
Per your question, I checked the W&B interface for controlling the agent. It thought the sweep was still running. I paused and resumed it in the GUI, and I think it's working now?
Thanks for the help. This seems like perhaps a minor bug.