When using launch jobs with sweep for hyperparameter tuning, the scheduler adds only one launch

hojung-shin · June 19, 2024, 2:25am

Hello there, I’m trying to use Launch and Sweeps for hyperparameter tuning.

I followed the documentation (Sweeps on Launch | Weights & Biases Documentation) step by step.

But after creating a sweep with launch configured, the wandb sweep scheduler doesn’t create more than one launch at a time, despite the “num_workers” parameter.

I expected the scheduler to create “num_workers” launches at a time and enqueue them, so that our agents could handle multiple runs in parallel.

What actually happens is that it starts a new launch only after the previous one has finished.
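To make the difference concrete, here is a toy model of the two behaviours. It is only a sketch of the scheduling pattern described above, not wandb's scheduler code; the function name `simulate_scheduler` and the `run_ticks` parameter are made up for illustration.

```python
def simulate_scheduler(num_workers: int, total_runs: int, run_ticks: int = 3) -> list[int]:
    """Toy model: return how many runs are in flight at each polling tick."""
    remaining: list[int] = []  # ticks left for each run currently in flight
    launched = 0
    in_flight_per_tick: list[int] = []
    while launched < total_runs or remaining:
        # Top up the queue until num_workers runs are in flight.
        while len(remaining) < num_workers and launched < total_runs:
            remaining.append(run_ticks)
            launched += 1
        in_flight_per_tick.append(len(remaining))
        # Advance one tick and drop the runs that have just finished.
        remaining = [t - 1 for t in remaining if t > 1]
    return in_flight_per_tick


# Expected: four runs are queued together, so 8 runs take 6 ticks.
expected = simulate_scheduler(num_workers=4, total_runs=8)
# Observed: behaves as if num_workers were 1, so 8 runs take 24 ticks.
observed = simulate_scheduler(num_workers=1, total_runs=8)
print(max(expected), len(expected))
print(max(observed), len(observed))
```

With `num_workers=4` the model keeps four runs in flight at every tick, whereas the observed behaviour matches the `num_workers=1` case.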

Here’s my sweep config
job: hojung-shin/job-creation-demo/fashion-mnist-train:latest
method: bayes
metric:
  goal: minimize
  name: Step
parameters:
  epochs:
    distribution: int_uniform
    max: 30
    min: 3
  learning_rate:
    distribution: uniform
    max: 0.002
    min: 0.0005
  lr:
    distribution: uniform
    max: 0.1583121666044808
    min: 0.00327423841667506
  steps_per_epoch:
    distribution: int_uniform
    max: 20
    min: 5
program: train.py
scheduler:
  job: wandb/sweep-jobs/job-wandb-sweep-scheduler:latest
  num_workers: 4
  settings:
    method: bayes

And here are my scheduler logs:

wandb: sched: Scheduler starting.
wandb: 2 of 2 files downloaded.
wandb: sched: Successfully loaded job (hojung-shin/job-creation-demo/fashion-mnist-train:latest) in scheduler
wandb: sched: Scheduler running
wandb: sched: Polling for new runs to launch
wandb: WARNING Launch spec contains both resource_args and template_variables, only one can be set. Using template_variables.
wandb: launch: Launching run into hojung-shin/job-creation-demo
wandb: WARNING Launch spec contains both resource_args and template_variables, only one can be set. Using template_variables.
wandb: launch: Added run to queue tutorial-run-queue.
wandb: launch: Launch spec:
wandb: {'author': 'hojung-shin',
wandb:  'docker': {},
wandb:  'entity': 'hojung-shin',
wandb:  'git': {},
wandb:  'job': 'hojung-shin/job-creation-demo/fashion-mnist-train:latest',
wandb:  'overrides': {'run_config': {'epochs': 14,
wandb:                               'learning_rate': 0.0015804867401632372,
wandb:                               'lr': 0.0032919708513930697,
wandb:                               'steps_per_epoch': 9}},
wandb:  'priority': 2,
wandb:  'project': 'job-creation-demo',
wandb:  'queue': 'tutorial-run-queue',
wandb:  'queue_entity': 'hojung-shin',
wandb:  'resource': 'local-container',
wandb:  'resource_args': {'local-container': {'builder': {'accelerator': {'base_image': 'tensorflow/tensorflow:latest-gpu'}},
wandb:                                        'gpus': 'all'}},
wandb:  'run_id': '5dqhntkp',
wandb:  'sweep_id': 'yg2xehuj'}
wandb:
wandb: sched: Added run (5dqhntkp) to queue (tutorial-run-queue)
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Cleaning up finished run (5dqhntkp)
wandb: WARNING Launch spec contains both resource_args and template_variables, only one can be set. Using template_variables.
wandb: launch: Launching run into hojung-shin/job-creation-demo
wandb: WARNING Launch spec contains both resource_args and template_variables, only one can be set. Using template_variables.
wandb: launch: Added run to queue tutorial-run-queue.
wandb: launch: Launch spec:
wandb: {'author': 'hojung-shin',
wandb:  'docker': {},
wandb:  'entity': 'hojung-shin',
wandb:  'git': {},
wandb:  'job': 'hojung-shin/job-creation-demo/fashion-mnist-train:latest',
wandb:  'overrides': {'run_config': {'epochs': 15,
wandb:                               'learning_rate': 0.001696871195012598,
wandb:                               'lr': 0.10938424924464782,
wandb:                               'steps_per_epoch': 12}},
wandb:  'priority': 2,
wandb:  'project': 'job-creation-demo',
wandb:  'queue': 'tutorial-run-queue',
wandb:  'queue_entity': 'hojung-shin',
wandb:  'resource': 'local-container',
wandb:  'resource_args': {'local-container': {'builder': {'accelerator': {'base_image': 'tensorflow/tensorflow:latest-gpu'}},
wandb:                                        'gpus': 'all'}},
wandb:  'run_id': 'nod9ozh5',
wandb:  'sweep_id': 'yg2xehuj'}
wandb:
wandb: sched: Added run (nod9ozh5) to queue (tutorial-run-queue)
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Cleaning up finished run (nod9ozh5)
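The logs can be checked mechanically for how many runs were ever queued at the same time. The sketch below relies only on the two message formats visible in the logs ("Added run (...) to queue" and "Cleaning up finished run (...)"); the helper name `max_concurrent_runs` is hypothetical, and the sample is a shortened excerpt of the output.

```python
import re

# Patterns taken from the scheduler messages in the logs.
ADDED = re.compile(r"sched: Added run \((\w+)\) to queue")
CLEANED = re.compile(r"sched: Cleaning up finished run \((\w+)\)")


def max_concurrent_runs(log_text: str) -> int:
    """Return the peak number of runs that were queued but not yet cleaned up."""
    active: set[str] = set()
    peak = 0
    for line in log_text.splitlines():
        if (m := ADDED.search(line)):
            active.add(m.group(1))
            peak = max(peak, len(active))
        elif (m := CLEANED.search(line)):
            active.discard(m.group(1))
    return peak


sample_log = """\
wandb: sched: Added run (5dqhntkp) to queue (tutorial-run-queue)
wandb: sched: Polling for new runs to launch
wandb: sched: Cleaning up finished run (5dqhntkp)
wandb: sched: Added run (nod9ozh5) to queue (tutorial-run-queue)
wandb: sched: Polling for new runs to launch
wandb: sched: Cleaning up finished run (nod9ozh5)
"""
peak = max_concurrent_runs(sample_log)
print(peak)
```

For this excerpt the peak is 1: the second run is added only after the first has been cleaned up, which matches the behaviour described in the post.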

And here’s my scheduler’s config:
{
  "_wandb": {
    "desc": null,
    "value": {
      "t": {
        "1": [
          55
        ],
        "2": [
          55
        ],
        "3": [
          13,
          16,
          20,
          23,
          24
        ],
        "4": "3.9.18",
        "5": "0.16.4.dev1",
        "8": [
          5
        ],
        "13": "linux-x86_64"
      },
      "start_time": 1718762757,
      "cli_version": "0.16.4.dev1",
      "is_jupyter_run": false,
      "python_version": "3.9.18",
      "launch_trace_id": "UnVuUXVldWVJdGVtOjU3MTM1NTM4NQ==",
      "is_kaggle_kernel": false,
      "launch_queue_name": "tutorial-run-queue",
      "launch_queue_entity": "hojung-shin"
    }
  },
  "launch": {
    "desc": null,
    "value": {
      "job": "hojung-shin/job-creation-demo/fashion-mnist-train:latest",
      "queue": "tutorial-run-queue",
      "entity": "hojung-shin",
      "project": "job-creation-demo",
      "priority": 2,
      "queue_entity": "hojung-shin",
      "resource_args": {
        "local-container": {
          "gpus": "all",
          "builder": {
            "accelerator": {
              "base_image": "tensorflow/tensorflow:latest-gpu"
            }
          }
        }
      },
      "template_variables": {}
    }
  },
  "settings": {
    "desc": null,
    "value": {
      "method": "bayes"
    }
  },
  "scheduler": {
    "desc": null,
    "value": {
      "job": "wandb/sweep-jobs/job-wandb-sweep-scheduler:latest",
      "num_workers": 4
    }
  },
  "sweep_args": {
    "desc": null,
    "value": {
      "job": "hojung-shin/job-creation-demo/fashion-mnist-train:latest",
      "queue": "tutorial-run-queue",
      "author": "hojung-shin",
      "project": "job-creation-demo",
      "sweep_id": "yg2xehuj"
    }
  }
}
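To confirm which `num_workers` value the scheduler run actually received, the config can be unwrapped programmatically. This is a minimal sketch that assumes the export keeps the `{"desc": ..., "value": ...}` wrapper shown in the config; the `unwrap` helper is hypothetical, and the JSON is a shortened excerpt written with straight quotes so that it parses.

```python
import json

# Shortened excerpt of the scheduler run config.
raw = """
{
  "scheduler": {
    "desc": null,
    "value": {
      "job": "wandb/sweep-jobs/job-wandb-sweep-scheduler:latest",
      "num_workers": 4
    }
  },
  "settings": {"desc": null, "value": {"method": "bayes"}}
}
"""


def unwrap(config: dict) -> dict:
    """Strip the {"desc": ..., "value": ...} wrapper from each top-level key."""
    return {key: entry["value"] for key, entry in config.items()}


values = unwrap(json.loads(raw))
num_workers = values["scheduler"]["num_workers"]
print(num_workers)
```

For the excerpt this reads back `num_workers` as 4, so the setting did reach the scheduler run's config even though only one run is queued at a time.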

luis_bergua · June 24, 2024, 2:16pm

Hey @hojung-shin, thanks for flagging this! Would you have any problems with sharing a link to your sweep so I can take a look?

hojung-shin · June 25, 2024, 1:13am

Hello @luis_bergua, thanks for the reply.

Here’s my sweep link

I’ve changed my project to public.

If you cannot access my sweep, please let me know.

hojung-shin · June 25, 2024, 1:15am

I’ve created more than 3 agents for the queue, but it seems that the scheduler never adds more than one new run to the launch queue at a time, so the other agents registered on that queue have nothing to execute.

As you can see in my sweep config, I set num_workers to 8.

luis_bergua · July 2, 2024, 9:29am

Hey @hojung-shin, thanks for sharing this! I have been doing some tests on my end but was unable to reproduce this behavior. Would you have any problems with sharing a minimal reproduction example of the issue? Do you see it every time you try?

hojung-shin · July 3, 2024, 1:55am

Yes, I’ve tried many times.

Whether I go through the CLI or the W&B web UI, the result doesn’t change.

I wrote reproduction code in a CLI environment.

It still gives me the same results.

The scheduler says we have 8 num_workers, but it utilizes only one agent.

hojung-shin · July 3, 2024, 1:57am

It just keeps polling.
