When using launch jobs with sweep for hyperparameter tuning, the scheduler adds only one launch

Hello there, I’m trying to use Launch and Sweeps for hyperparameter tuning.

I followed the documentation (Sweeps on Launch | Weights & Biases Documentation) step by step.

But after creating a sweep with Launch configured, the wandb sweep scheduler never enqueues more than one run at a time, regardless of the “num_workers” parameter.

I expected the scheduler to create “num_workers” runs at a time and enqueue them so that our agents could handle multiple runs in parallel.

What actually happens is that it starts a new run only after the previous one has finished.
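To make the expected behavior concrete, here is a toy model of a scheduler that keeps up to num_workers runs in flight. It is only a sketch of the expectation described above, not W&B's scheduler code. The function name `simulate` and the "ticks" unit are made up for illustration.

```python
from collections import deque


def simulate(num_workers, run_durations):
    """Toy model (hypothetical, not wandb code): return the peak number of
    runs in flight when the scheduler fills every free worker slot on each
    polling tick. Durations are measured in polling ticks."""
    pending = deque(run_durations)
    in_flight = []  # remaining ticks for each active run
    peak = 0
    while pending or in_flight:
        # Fill free slots up to num_workers.
        while pending and len(in_flight) < num_workers:
            in_flight.append(pending.popleft())
        peak = max(peak, len(in_flight))
        # Advance one polling tick and drop the runs that finished.
        in_flight = [t - 1 for t in in_flight if t - 1 > 0]
    return peak


print(simulate(4, [5] * 10))  # expected behavior: 4 runs in flight
print(simulate(1, [5] * 10))  # observed behavior: 1 run in flight
```

With num_workers set to 4 the model peaks at four concurrent runs, while the behavior I observe matches the num_workers = 1 case.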

Here’s my sweep config
job: hojung-shin/job-creation-demo/fashion-mnist-train:latest
method: bayes
metric:
  goal: minimize
  name: Step
parameters:
  epochs:
    distribution: int_uniform
    max: 30
    min: 3
  learning_rate:
    distribution: uniform
    max: 0.002
    min: 0.0005
  lr:
    distribution: uniform
    max: 0.1583121666044808
    min: 0.00327423841667506
  steps_per_epoch:
    distribution: int_uniform
    max: 20
    min: 5
program: train.py
scheduler:
  job: wandb/sweep-jobs/job-wandb-sweep-scheduler:latest
  num_workers: 4
  settings:
    method: bayes

and my scheduler logs

wandb: sched: Scheduler starting.
wandb: 2 of 2 files downloaded.
wandb: sched: Successfully loaded job (hojung-shin/job-creation-demo/fashion-mnist-train:latest) in scheduler
wandb: sched: Scheduler running
wandb: sched: Polling for new runs to launch
wandb: WARNING Launch spec contains both resource_args and template_variables, only one can be set. Using template_variables.
wandb: launch: Launching run into hojung-shin/job-creation-demo
wandb: WARNING Launch spec contains both resource_args and template_variables, only one can be set. Using template_variables.
wandb: launch: Added run to queue tutorial-run-queue.
wandb: launch: Launch spec:
wandb: {'author': 'hojung-shin',
wandb:  'docker': {},
wandb:  'entity': 'hojung-shin',
wandb:  'git': {},
wandb:  'job': 'hojung-shin/job-creation-demo/fashion-mnist-train:latest',
wandb:  'overrides': {'run_config': {'epochs': 14,
wandb:                               'learning_rate': 0.0015804867401632372,
wandb:                               'lr': 0.0032919708513930697,
wandb:                               'steps_per_epoch': 9}},
wandb:  'priority': 2,
wandb:  'project': 'job-creation-demo',
wandb:  'queue': 'tutorial-run-queue',
wandb:  'queue_entity': 'hojung-shin',
wandb:  'resource': 'local-container',
wandb:  'resource_args': {'local-container': {'builder': {'accelerator': {'base_image': 'tensorflow/tensorflow:latest-gpu'}},
wandb:                                        'gpus': 'all'}},
wandb:  'run_id': '5dqhntkp',
wandb:  'sweep_id': 'yg2xehuj'}
wandb:
wandb: sched: Added run (5dqhntkp) to queue (tutorial-run-queue)
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Cleaning up finished run (5dqhntkp)
wandb: WARNING Launch spec contains both resource_args and template_variables, only one can be set. Using template_variables.
wandb: launch: Launching run into hojung-shin/job-creation-demo
wandb: WARNING Launch spec contains both resource_args and template_variables, only one can be set. Using template_variables.
wandb: launch: Added run to queue tutorial-run-queue.
wandb: launch: Launch spec:
wandb: {'author': 'hojung-shin',
wandb:  'docker': {},
wandb:  'entity': 'hojung-shin',
wandb:  'git': {},
wandb:  'job': 'hojung-shin/job-creation-demo/fashion-mnist-train:latest',
wandb:  'overrides': {'run_config': {'epochs': 15,
wandb:                               'learning_rate': 0.001696871195012598,
wandb:                               'lr': 0.10938424924464782,
wandb:                               'steps_per_epoch': 12}},
wandb:  'priority': 2,
wandb:  'project': 'job-creation-demo',
wandb:  'queue': 'tutorial-run-queue',
wandb:  'queue_entity': 'hojung-shin',
wandb:  'resource': 'local-container',
wandb:  'resource_args': {'local-container': {'builder': {'accelerator': {'base_image': 'tensorflow/tensorflow:latest-gpu'}},
wandb:                                        'gpus': 'all'}},
wandb:  'run_id': 'nod9ozh5',
wandb:  'sweep_id': 'yg2xehuj'}
wandb:
wandb: sched: Added run (nod9ozh5) to queue (tutorial-run-queue)
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Cleaning up finished run (nod9ozh5)
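A small script can confirm from these logs how many runs the scheduler had queued at once. This is a minimal sketch that assumes the "Added run (...) to queue" and "Cleaning up finished run (...)" message formats seen in the logs above. The helper name `max_concurrent_runs` is made up for illustration and is not part of wandb.

```python
import re

# Patterns copied from the scheduler messages shown in the logs above.
ADDED = re.compile(r"sched: Added run \((\w+)\) to queue")
FINISHED = re.compile(r"sched: Cleaning up finished run \((\w+)\)")


def max_concurrent_runs(log_text):
    """Return the largest number of runs that were queued or running at once."""
    active = set()
    peak = 0
    for line in log_text.splitlines():
        added = ADDED.search(line)
        finished = FINISHED.search(line)
        if added:
            active.add(added.group(1))
            peak = max(peak, len(active))
        elif finished:
            active.discard(finished.group(1))
    return peak


# Abbreviated excerpt of the scheduler log above.
log = """\
wandb: sched: Added run (5dqhntkp) to queue (tutorial-run-queue)
wandb: sched: Polling for new runs to launch
wandb: sched: Cleaning up finished run (5dqhntkp)
wandb: sched: Added run (nod9ozh5) to queue (tutorial-run-queue)
wandb: sched: Polling for new runs to launch
wandb: sched: Cleaning up finished run (nod9ozh5)
"""

print(max_concurrent_runs(log))  # prints 1
```

For the log above the peak is 1, which matches what I see: the second run is only added after the first one has been cleaned up.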

and here’s my scheduler’s config
{
  "_wandb": {
    "desc": null,
    "value": {
      "t": {
        "1": [
          55
        ],
        "2": [
          55
        ],
        "3": [
          13,
          16,
          20,
          23,
          24
        ],
        "4": "3.9.18",
        "5": "0.16.4.dev1",
        "8": [
          5
        ],
        "13": "linux-x86_64"
      },
      "start_time": 1718762757,
      "cli_version": "0.16.4.dev1",
      "is_jupyter_run": false,
      "python_version": "3.9.18",
      "launch_trace_id": "UnVuUXVldWVJdGVtOjU3MTM1NTM4NQ==",
      "is_kaggle_kernel": false,
      "launch_queue_name": "tutorial-run-queue",
      "launch_queue_entity": "hojung-shin"
    }
  },
  "launch": {
    "desc": null,
    "value": {
      "job": "hojung-shin/job-creation-demo/fashion-mnist-train:latest",
      "queue": "tutorial-run-queue",
      "entity": "hojung-shin",
      "project": "job-creation-demo",
      "priority": 2,
      "queue_entity": "hojung-shin",
      "resource_args": {
        "local-container": {
          "gpus": "all",
          "builder": {
            "accelerator": {
              "base_image": "tensorflow/tensorflow:latest-gpu"
            }
          }
        }
      },
      "template_variables": {}
    }
  },
  "settings": {
    "desc": null,
    "value": {
      "method": "bayes"
    }
  },
  "scheduler": {
    "desc": null,
    "value": {
      "job": "wandb/sweep-jobs/job-wandb-sweep-scheduler:latest",
      "num_workers": 4
    }
  },
  "sweep_args": {
    "desc": null,
    "value": {
      "job": "hojung-shin/job-creation-demo/fashion-mnist-train:latest",
      "queue": "tutorial-run-queue",
      "author": "hojung-shin",
      "project": "job-creation-demo",
      "sweep_id": "yg2xehuj"
    }
  }
}
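To double-check that the scheduler run actually received the setting, the value can be read straight out of this config. This is a minimal sketch using only the "scheduler" and "settings" fragments of the JSON above. The helper name `get_num_workers` is hypothetical.

```python
import json
from typing import Optional

# Fragment of the scheduler run config shown above.
config_fragment = """
{
  "settings": {"desc": null, "value": {"method": "bayes"}},
  "scheduler": {
    "desc": null,
    "value": {
      "job": "wandb/sweep-jobs/job-wandb-sweep-scheduler:latest",
      "num_workers": 4
    }
  }
}
"""


def get_num_workers(config_json: str) -> Optional[int]:
    """Return scheduler.value.num_workers from a run config, or None if absent."""
    config = json.loads(config_json)
    return config.get("scheduler", {}).get("value", {}).get("num_workers")


print(get_num_workers(config_fragment))  # prints 4
```

So the scheduler's own run config does carry num_workers, even though only one run is queued at a time.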

Hey @hojung-shin, thanks for flagging this! Would you have any problems with sharing a link to your sweep so I can take a look?

Hello @luis_bergua, thanks for the reply.

Here’s my sweep link

I’ve changed my project to public.

If you cannot access my sweep, please let me know.

I’ve attached more than 3 agents to the queue, but the scheduler never seems to add more than one run at a time to the launch queue, so the other agents registered on that queue have nothing to execute.

As you can see in my sweep config, I set num_workers to 8.

Hey @hojung-shin, thanks for sharing this! I have been doing some tests on my end but was unable to reproduce this behavior. Would you have any problems with sharing a minimal reproduction example of the issue? Do you see it every time you try?

Yes, I’ve tried many times.

Whether I use the CLI or the W&B web UI, the result doesn’t change.

I wrote reproduction code in a CLI environment.

It still gives me the same results.

The scheduler says it has 8 workers (num_workers), but it utilizes only one agent.

It just keeps polling.