Duplicate runs after 500 runs when using local controller

I deployed a wandb server on my local machine, and I am using grid search to sweep hyperparameters with 4 parallel agents.

In my case, the size of the search space exceeds 500, and each run takes about 2 minutes to finish.
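For illustration only (the actual sweep config was not posted in this thread), a grid sweep of this shape might look like the sketch below. The parameter names and values are hypothetical placeholders, chosen only so that the grid has more than 500 points; the `controller: {type: local}` entry is the sweep-config setting that, as far as I know, selects the local controller.

```python
import itertools
import math

# Hypothetical sweep config: parameter names and values are placeholders,
# not the config used in this thread.
sweep_config = {
    "method": "grid",
    "controller": {"type": "local"},  # assumed setting for the local controller
    "parameters": {
        "lr": {"values": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]},
        "batch_size": {"values": [16, 32, 64, 128]},
        "dropout": {"values": [0.0, 0.1, 0.2, 0.3, 0.5]},
        "hidden_dim": {"values": [64, 128, 256, 512, 1024, 2048]},
    },
}


def grid_size(config):
    """Number of points in the grid: product of the value-list lengths."""
    return math.prod(len(p["values"]) for p in config["parameters"].values())


def grid_points(config):
    """Yield every grid point as a dict, in a fixed, deterministic order."""
    names = list(config["parameters"])
    value_lists = [config["parameters"][n]["values"] for n in names]
    for combo in itertools.product(*value_lists):
        yield dict(zip(names, combo))


if __name__ == "__main__":
    print("grid size:", grid_size(sweep_config))
```

With these placeholder values the grid has 5 × 4 × 5 × 6 = 600 points, which is enough to go past the 500-run mark.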

I consistently find that once 500 runs have finished, the hyperparameter configurations generated for newly started runs start over from the beginning. That is, the configuration of the 501st run (or possibly the 502nd run) is the same as that of the first run, the configuration of the 502nd run (or possibly the 503rd run) is the same as that of the second run, and so on.

I also checked the log of the local controller, and the reported number of runs stays at 500, as shown below:

Sweep: t3muh8oq (grid) | Runs: 470 (Running: 2, Finished: 468)
Sweep: t3muh8oq (grid) | Runs: 471 (Running: 3, Finished: 468)
Sweep: t3muh8oq (grid) | Runs: 472 (Running: 4, Finished: 468)
Sweep: t3muh8oq (grid) | Runs: 472 (Running: 3, Finished: 469)
Sweep: t3muh8oq (grid) | Runs: 473 (Running: 4, Finished: 469)
Sweep: t3muh8oq (grid) | Runs: 473 (Running: 3, Finished: 470)
Sweep: t3muh8oq (grid) | Runs: 474 (Running: 4, Finished: 470)
Sweep: t3muh8oq (grid) | Runs: 474 (Running: 3, Finished: 471)
Sweep: t3muh8oq (grid) | Runs: 475 (Running: 3, Finished: 472)
Sweep: t3muh8oq (grid) | Runs: 476 (Running: 4, Finished: 472)
Sweep: t3muh8oq (grid) | Runs: 476 (Running: 2, Finished: 474)
Sweep: t3muh8oq (grid) | Runs: 477 (Running: 3, Finished: 474)
Sweep: t3muh8oq (grid) | Runs: 478 (Running: 4, Finished: 474)
Sweep: t3muh8oq (grid) | Runs: 478 (Running: 3, Finished: 475)
Sweep: t3muh8oq (grid) | Runs: 478 (Running: 2, Finished: 476)
Sweep: t3muh8oq (grid) | Runs: 479 (Running: 3, Finished: 476)
Sweep: t3muh8oq (grid) | Runs: 480 (Running: 4, Finished: 476)
Sweep: t3muh8oq (grid) | Runs: 480 (Running: 3, Finished: 477)
Sweep: t3muh8oq (grid) | Runs: 481 (Running: 3, Finished: 478)
Sweep: t3muh8oq (grid) | Runs: 482 (Running: 4, Finished: 478)
Sweep: t3muh8oq (grid) | Runs: 482 (Running: 3, Finished: 479)
Sweep: t3muh8oq (grid) | Runs: 483 (Running: 4, Finished: 479)
Sweep: t3muh8oq (grid) | Runs: 483 (Running: 3, Finished: 480)
Sweep: t3muh8oq (grid) | Runs: 484 (Running: 4, Finished: 480)
Sweep: t3muh8oq (grid) | Runs: 484 (Running: 2, Finished: 482)
Sweep: t3muh8oq (grid) | Runs: 485 (Running: 3, Finished: 482)
Sweep: t3muh8oq (grid) | Runs: 486 (Running: 3, Finished: 483)
Sweep: t3muh8oq (grid) | Runs: 487 (Running: 3, Finished: 484)
Sweep: t3muh8oq (grid) | Runs: 488 (Running: 4, Finished: 484)
Sweep: t3muh8oq (grid) | Runs: 488 (Running: 3, Finished: 485)
Sweep: t3muh8oq (grid) | Runs: 489 (Running: 3, Finished: 486)
Sweep: t3muh8oq (grid) | Runs: 490 (Running: 4, Finished: 486)
Sweep: t3muh8oq (grid) | Runs: 490 (Running: 3, Finished: 487)
Sweep: t3muh8oq (grid) | Runs: 491 (Running: 3, Finished: 488)
Sweep: t3muh8oq (grid) | Runs: 492 (Running: 4, Finished: 488)
Sweep: t3muh8oq (grid) | Runs: 492 (Running: 3, Finished: 489)
Sweep: t3muh8oq (grid) | Runs: 493 (Running: 3, Finished: 490)
Sweep: t3muh8oq (grid) | Runs: 494 (Running: 2, Finished: 492)
Sweep: t3muh8oq (grid) | Runs: 495 (Running: 3, Finished: 492)
Sweep: t3muh8oq (grid) | Runs: 496 (Running: 4, Finished: 492)
Sweep: t3muh8oq (grid) | Runs: 496 (Running: 3, Finished: 493)
Sweep: t3muh8oq (grid) | Runs: 497 (Running: 3, Finished: 494)
Sweep: t3muh8oq (grid) | Runs: 498 (Running: 3, Finished: 495)
Sweep: t3muh8oq (grid) | Runs: 499 (Running: 3, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 2, Finished: 498)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 2, Finished: 498)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 2, Finished: 498)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 2, Finished: 498)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 2, Finished: 498)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 2, Finished: 498)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 2, Finished: 498)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 4, Finished: 496)
Sweep: t3muh8oq (grid) | Runs: 500 (Running: 3, Finished: 497)

Hi @lanlin, apologies for the delay here. We were able to reproduce the issue on our end and confirmed that the upgrade to 0.30.0 introduced a regression. Would rolling back to 0.29.0 be an option for you? Note, however, that this regression affects runs numbered #1, #2, and so on, rather than only runs beyond the 500th.

Also, you can definitely see a sweep converge to a single value in some cases where the parameter space is small. Could you please share your sweep config so that we can confirm whether that is what is happening here?

Some more context: a Bayesian sweep tries to balance exploring the parameter space with returning values that maximize expected improvement. When the parameter space is small, it finishes exploring reasonably fast, so when it finds an extremum it can indeed converge there. Once it reaches the optimum, it is therefore possible that it will keep suggesting the same params. This is actually the expected behavior for bayes (only) with categorical parameters. It would be really helpful for troubleshooting and reproducing this behavior on our end if you could share your sweep config and the code snippet you’re executing.

Hi @lanlin, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi,

I read the code of the wandb local controller and found that the controller requests the existing runs from the remote server whenever it needs to generate a new run.

After debugging, I found that the number of runs the remote server returns is capped at 500. I therefore modified the wandb local controller to maintain a list of the existing runs locally, and now it works normally.
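A toy simulation of this behavior, as a rough sketch only: this is not the wandb controller code. It assumes the server returns only the most recent `run_limit` runs (the thread only establishes that the result is capped at 500, not which runs are kept), and every name in it (`simulate`, `next_config`, `run_limit`) is hypothetical.

```python
import itertools


def grid(params):
    """Yield grid points as tuples, in a fixed order."""
    names = list(params)
    for combo in itertools.product(*(params[n] for n in names)):
        yield tuple(combo)


def next_config(params, seen):
    """Return the first grid point not in `seen`, or None if the grid is exhausted."""
    for point in grid(params):
        if point not in seen:
            return point
    return None


def simulate(params, n_steps, run_limit, track_locally):
    """Issue up to n_steps configs and return them in order.

    track_locally=False: the controller only sees the last `run_limit`
    runs (assumed truncation behavior of the server query).
    track_locally=True: the controller keeps its own complete record.
    """
    server_runs = []    # everything the server has stored, oldest first
    local_seen = set()  # complete local record of issued configs
    issued = []
    for _ in range(n_steps):
        if track_locally:
            seen = local_seen
        else:
            seen = set(server_runs[-run_limit:])  # truncated view
        cfg = next_config(params, seen)
        if cfg is None:
            break
        server_runs.append(cfg)
        local_seen.add(cfg)
        issued.append(cfg)
    return issued


if __name__ == "__main__":
    params = {"a": [0, 1], "b": [0, 1], "c": [0, 1]}  # 8 grid points
    print(simulate(params, 8, run_limit=5, track_locally=False))
    print(simulate(params, 8, run_limit=5, track_locally=True))
```

Under that assumption, with a limit of 5 and an 8-point grid, the truncated variant issues the first config again on the 7th run, which mirrors the "501st or 502nd run equals the first run" pattern, while the locally tracked variant visits each grid point exactly once.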

Sent from my phone

From: Anmol Mann via W&B Community notifications@wandb.discoursemail.com
Date: Tuesday, April 11, 2023, 22:03
To: llan.xjtu@foxmail.com
Subject: [W&B Community] [W&B Help] Duplicate runs after 500 runs when using local controller


Thanks for the update, @lanlin! I’m glad that you were able to find the root cause of this issue.

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.