Cannot resume sweep "Exception: Sweep already flapping."

Hi,

My sweep seems to have stopped because I was getting OOM errors on sweep artifacts, which I have now cleaned up. When I try to resume the sweep, I get:

Cannot resume sweep "Exception: Sweep already flapping."

There is no documentation on flapping that I can find, but in the codebase it appears to be a state that a sweep enters if n agents fail within k seconds. I have tried to export the env variable:

export WANDB_AGENT_DISABLE_FLAPPING=false

and then resuming my sweep again but I get the same error. How can I resume my sweep?
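
For readers unfamiliar with the term, the "n agents fail within k seconds" idea can be sketched as a sliding-window failure counter. This is only an illustration of the general technique, not W&B's actual implementation: the class name `FlappingDetector`, its parameters, and the thresholds of 3 failures in 60 seconds are all hypothetical.

```python
from collections import deque


class FlappingDetector:
    """Hypothetical sliding-window failure counter (not W&B's real code)."""

    def __init__(self, max_failures: int = 3, window_seconds: float = 60.0):
        # Assumed thresholds, chosen only for illustration.
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failure_times = deque()

    def record_failure(self, now: float) -> bool:
        """Record a failure at time `now` (seconds).

        Returns True once `max_failures` failures fall inside the window.
        """
        self._failure_times.append(now)
        # Discard failures that are older than the window.
        while now - self._failure_times[0] > self.window_seconds:
            self._failure_times.popleft()
        return len(self._failure_times) >= self.max_failures


# Three failures in quick succession reach the threshold on the third one.
burst = FlappingDetector()
results = [burst.record_failure(t) for t in (0.0, 10.0, 20.0)]

# The same number of failures spread far apart never reaches it.
spread = FlappingDetector()
slow_results = [spread.record_failure(t) for t in (0.0, 100.0, 200.0)]
```

Under this model, a burst of errors from a shared cause (such as a full disk) would trip the threshold quickly, even though each run might succeed once the cause is fixed.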

Hi @adamits, this seems to work for me if I just assign a new agent to the sweep. I intentionally got a run to hit the flapping threshold, fixed my script so it wouldn’t error, and then started a new agent, which began pulling new runs. Are you trying to resume the sweep from the UI?

Thank you,
Nate

Hi @adamits, I just wanted to follow up and see whether starting a new agent worked for you.

Hi @nathank,

Sorry I missed this. No, it does not work for me. The flapping started because the disk filled up, so a bunch of jobs errored out. I cleared disk space and started a new agent (on the same sweep) from a Python script:

wandb.agent(SWEEP_ID, function=run_train, project=PROJECT, count=max_num_runs)

I get:

{"errors":[{"message":"Sweep My_Entity/My_Project/My_ID is not running","path":["createAgent"]}],"data":{"createAgent":null}}

So from the terminal I call:

wandb sweep --resume My_Entity/My_Project/My_ID

and get:

line 3300, in set_sweep_state
    raise Exception("Sweep already %s." % curr_state.lower())
Exception: Sweep already flapping.
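
Since the failures here came from a full disk, one way to avoid repeating the burst of errors is to check free space before starting an agent. The following is a minimal sketch using only the standard library. The helper name `has_free_space` and the 5 GB threshold are hypothetical, and the commented-out `wandb.agent` call simply repeats the one from the post above.

```python
import shutil


def has_free_space(path: str, min_free_bytes: int) -> bool:
    """Return True if the filesystem containing `path` has at least
    `min_free_bytes` bytes free."""
    usage = shutil.disk_usage(path)
    return usage.free >= min_free_bytes


# Hypothetical threshold: require 5 GB free before launching an agent.
MIN_FREE_BYTES = 5 * 1024**3

if has_free_space(".", MIN_FREE_BYTES):
    # Start the agent here, for example:
    # wandb.agent(SWEEP_ID, function=run_train, project=PROJECT, count=max_num_runs)
    pass
else:
    print("Not enough free disk space; not starting the agent.")
```

The same check could also go at the top of the training function, so that an agent stops picking up runs cleanly instead of failing several times in a row.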

Hi @nathank,

Per your question, I checked the W&B interface for controlling the sweep. It showed the sweep as still running. I paused and resumed it in the GUI, and I think it's working now.

Thanks for the help. This seems like perhaps a minor bug.
