Hello there, I'm trying to use Launch together with Sweeps for hyperparameter tuning.
I followed the documentation (Sweeps on Launch | Weights & Biases Documentation) step by step.
But after creating a sweep with Launch configured, the wandb sweep scheduler never has more than one launch in flight, regardless of the "num_workers" parameter.
I expected the scheduler to launch "num_workers" runs at a time and enqueue them so that our agents can work on multiple runs in parallel.
What actually happens is that the scheduler only starts the next launch after the previous one has finished.
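For context, the agent side looks roughly like this (a sketch of the standard launch-agent invocation, not my exact command; the --max-jobs value is only my assumption about what should allow parallel runs):

# sketch: a launch agent polling the queue, allowed to run up to 4 jobs in parallel
wandb launch-agent --queue tutorial-run-queue --entity hojung-shin --max-jobs 4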
Here’s my sweep config
job: hojung-shin/job-creation-demo/fashion-mnist-train:latest
method: bayes
metric:
  goal: minimize
  name: Step
parameters:
  epochs:
    distribution: int_uniform
    max: 30
    min: 3
  learning_rate:
    distribution: uniform
    max: 0.002
    min: 0.0005
  lr:
    distribution: uniform
    max: 0.1583121666044808
    min: 0.00327423841667506
  steps_per_epoch:
    distribution: int_uniform
    max: 20
    min: 5
program: train.py
scheduler:
  job: wandb/sweep-jobs/job-wandb-sweep-scheduler:latest
  num_workers: 4
settings:
  method: bayes
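I create the sweep from this config with the launch-sweep CLI, roughly like this (a sketch; the config file name is just a placeholder):

# sketch: create the sweep on Launch from the config above
wandb launch-sweep sweep-config.yaml --queue tutorial-run-queue --entity hojung-shin --project job-creation-demo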
Here are my scheduler logs:
wandb: sched: Scheduler starting.
wandb: 2 of 2 files downloaded.
wandb: sched: Successfully loaded job (hojung-shin/job-creation-demo/fashion-mnist-train:latest) in scheduler
wandb: sched: Scheduler running
wandb: sched: Polling for new runs to launch
wandb: WARNING Launch spec contains both resource_args and template_variables, only one can be set. Using template_variables.
wandb: launch: Launching run into hojung-shin/job-creation-demo
wandb: WARNING Launch spec contains both resource_args and template_variables, only one can be set. Using template_variables.
wandb: launch: Added run to queue tutorial-run-queue.
wandb: launch: Launch spec:
wandb: {'author': 'hojung-shin',
wandb: 'docker': {},
wandb: 'entity': 'hojung-shin',
wandb: 'git': {},
wandb: 'job': 'hojung-shin/job-creation-demo/fashion-mnist-train:latest',
wandb: 'overrides': {'run_config': {'epochs': 14,
wandb: 'learning_rate': 0.0015804867401632372,
wandb: 'lr': 0.0032919708513930697,
wandb: 'steps_per_epoch': 9}},
wandb: 'priority': 2,
wandb: 'project': 'job-creation-demo',
wandb: 'queue': 'tutorial-run-queue',
wandb: 'queue_entity': 'hojung-shin',
wandb: 'resource': 'local-container',
wandb: 'resource_args': {'local-container': {'builder': {'accelerator': {'base_image': 'tensorflow/tensorflow:latest-gpu'}},
wandb: 'gpus': 'all'}},
wandb: 'run_id': '5dqhntkp',
wandb: 'sweep_id': 'yg2xehuj'}
wandb:
wandb: sched: Added run (5dqhntkp) to queue (tutorial-run-queue)
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Cleaning up finished run (5dqhntkp)
wandb: WARNING Launch spec contains both resource_args and template_variables, only one can be set. Using template_variables.
wandb: launch: Launching run into hojung-shin/job-creation-demo
wandb: WARNING Launch spec contains both resource_args and template_variables, only one can be set. Using template_variables.
wandb: launch: Added run to queue tutorial-run-queue.
wandb: launch: Launch spec:
wandb: {'author': 'hojung-shin',
wandb: 'docker': {},
wandb: 'entity': 'hojung-shin',
wandb: 'git': {},
wandb: 'job': 'hojung-shin/job-creation-demo/fashion-mnist-train:latest',
wandb: 'overrides': {'run_config': {'epochs': 15,
wandb: 'learning_rate': 0.001696871195012598,
wandb: 'lr': 0.10938424924464782,
wandb: 'steps_per_epoch': 12}},
wandb: 'priority': 2,
wandb: 'project': 'job-creation-demo',
wandb: 'queue': 'tutorial-run-queue',
wandb: 'queue_entity': 'hojung-shin',
wandb: 'resource': 'local-container',
wandb: 'resource_args': {'local-container': {'builder': {'accelerator': {'base_image': 'tensorflow/tensorflow:latest-gpu'}},
wandb: 'gpus': 'all'}},
wandb: 'run_id': 'nod9ozh5',
wandb: 'sweep_id': 'yg2xehuj'}
wandb:
wandb: sched: Added run (nod9ozh5) to queue (tutorial-run-queue)
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Polling for new runs to launch
wandb: sched: Cleaning up finished run (nod9ozh5)
and here’s my scheduler’s config
{
  "_wandb": {
    "desc": null,
    "value": {
      "t": {
        "1": [55],
        "2": [55],
        "3": [13, 16, 20, 23, 24],
        "4": "3.9.18",
        "5": "0.16.4.dev1",
        "8": [5],
        "13": "linux-x86_64"
      },
      "start_time": 1718762757,
      "cli_version": "0.16.4.dev1",
      "is_jupyter_run": false,
      "python_version": "3.9.18",
      "launch_trace_id": "UnVuUXVldWVJdGVtOjU3MTM1NTM4NQ==",
      "is_kaggle_kernel": false,
      "launch_queue_name": "tutorial-run-queue",
      "launch_queue_entity": "hojung-shin"
    }
  },
  "launch": {
    "desc": null,
    "value": {
      "job": "hojung-shin/job-creation-demo/fashion-mnist-train:latest",
      "queue": "tutorial-run-queue",
      "entity": "hojung-shin",
      "project": "job-creation-demo",
      "priority": 2,
      "queue_entity": "hojung-shin",
      "resource_args": {
        "local-container": {
          "gpus": "all",
          "builder": {
            "accelerator": {
              "base_image": "tensorflow/tensorflow:latest-gpu"
            }
          }
        }
      },
      "template_variables": {}
    }
  },
  "settings": {
    "desc": null,
    "value": {
      "method": "bayes"
    }
  },
  "scheduler": {
    "desc": null,
    "value": {
      "job": "wandb/sweep-jobs/job-wandb-sweep-scheduler:latest",
      "num_workers": 4
    }
  },
  "sweep_args": {
    "desc": null,
    "value": {
      "job": "hojung-shin/job-creation-demo/fashion-mnist-train:latest",
      "queue": "tutorial-run-queue",
      "author": "hojung-shin",
      "project": "job-creation-demo",
      "sweep_id": "yg2xehuj"
    }
  }
}