Hi, using WandB to train my models seems promising, but what I need most at the moment is hyperparameter tuning. I tried to use Sweeps, but it does not work and unfortunately I can't figure out why. Model training without HP tuning works normally, so the issue must be in how I set up Sweeps.
Here is the code related to the model training:
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments


def wandb_hp_space(trial):
    # W&B sweep configuration: the hyperparameters I want to optimize
    return {
        'method': 'random',
        'metric': {'name': 'f1', 'goal': 'maximize'},
        'parameters': {
            'learning_rate': {'values': [2e-5, 5e-5, 2e-4, 5e-4, 2e-3]},
            'per_device_train_batch_size': {'values': [8, 16, 32]},
            'per_device_eval_batch_size': {'values': [8, 16, 32]},
            'weight_decay': {'values': [0.01, 0.001]}
        }
    }

def model_init(trial):
    # a fresh model is created for every trial
    return AutoModelForTokenClassification.from_pretrained(
        model_checkpoint,
        num_labels=len(label_list),
        id2label=id2label,
        label2id=label2id).to('cuda:0')

training_args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    # push_to_hub=True,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    report_to='wandb'
)

# define the trainer and its parameters
trainer = Trainer(
    model=None,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    model_init=model_init
)

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="wandb",
    hp_space=wandb_hp_space,
    n_trials=20,
    # compute_objective=compute_objective,
)
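As a sanity check on the search space in wandb_hp_space, here is a small standard-library sketch that counts the possible combinations and draws a few random configurations. The helper names grid_size and sample_trials are made up for illustration, and the uniform sampling is only my assumption of what 'method': 'random' roughly does for lists of values; it is not W&B's actual sampler.

```python
import random

# Same 'parameters' block as in wandb_hp_space above.
PARAMETERS = {
    "learning_rate": {"values": [2e-5, 5e-5, 2e-4, 5e-4, 2e-3]},
    "per_device_train_batch_size": {"values": [8, 16, 32]},
    "per_device_eval_batch_size": {"values": [8, 16, 32]},
    "weight_decay": {"values": [0.01, 0.001]},
}


def grid_size(parameters):
    """Number of distinct combinations of the discrete 'values' lists."""
    size = 1
    for spec in parameters.values():
        size *= len(spec["values"])
    return size


def sample_trials(parameters, n_trials, seed=0):
    """Draw n_trials configs uniformly at random (hypothetical stand-in for the sweep)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(spec["values"]) for name, spec in parameters.items()}
        for _ in range(n_trials)
    ]


if __name__ == "__main__":
    print("combinations:", grid_size(PARAMETERS))
    for cfg in sample_trials(PARAMETERS, n_trials=3):
        print(cfg)
```

With n_trials=20 the sweep would cover only part of this grid, which is what I intended.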
Here is the Python error:
wandb: WARNING Calling wandb.login() after wandb.init() has no effect.
wandb: Waiting for W&B process to finish... (success).
wandb: 🚀 View run kind-vortex-4 at: https://wandb.ai/surechembl/uncategorized/runs/ppass2nv
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20231024_143149-ppass2nv/logs
wandb: Agent Starting Run: np5vpbxf with config:
wandb: learning_rate: 0.002
wandb: per_device_eval_batch_size: 8
wandb: per_device_train_batch_size: 8
wandb: weight_decay: 0.01
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/agents/pyagent.py", line 298, in _run_job
    self._function()
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/transformers/integrations/integration_utils.py", line 497, in _objective
    run.config.update({"assignments": {}, "metric": metric})
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/wandb_config.py", line 186, in update
    self._callback(data=sanitized)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 370, in wrapper_fn
    return func(self, *args, **kwargs)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1326, in _config_callback
    self._backend.interface.publish_config(key=key, val=val, data=data)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 168, in publish_config
    self._publish_config(cfg)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 358, in _publish_config
    self._publish(rec)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
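My reading of the traceback is that the failing call is a plain socket send to a peer that is no longer listening. I can reproduce the same kind of error outside of wandb with the sketch below. It assumes a Unix-like system, uses only the standard library, and the function name send_after_peer_closed is made up; it says nothing about why the wandb side would be closed in my case.

```python
import socket


def send_after_peer_closed():
    """Send on a local socket pair after the other end has closed; return the error raised."""
    client, server = socket.socketpair()
    server.close()  # stands in for the receiving process having gone away
    try:
        client.send(b"config update")
    except OSError as err:  # BrokenPipeError is a subclass of OSError
        return err
    finally:
        client.close()
    return None


if __name__ == "__main__":
    err = send_after_peer_closed()
    print(type(err).__name__, err)
```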
Here is debug-internal.log (bottom lines):
2023-10-24 14:32:23,118 INFO SenderThread:2030398 [sender.py:send_request_defer():609] handle sender defer: 12
2023-10-24 14:32:23,118 INFO SenderThread:2030398 [file_stream.py:finish():595] file stream finish called
2023-10-24 14:32:23,950 INFO SenderThread:2030398 [file_stream.py:finish():599] file stream finish is done
2023-10-24 14:32:23,950 INFO SenderThread:2030398 [sender.py:transition_state():613] send defer: 13
2023-10-24 14:32:23,951 DEBUG HandlerThread:2030398 [handler.py:handle_request():146] handle_request: defer
2023-10-24 14:32:23,951 INFO HandlerThread:2030398 [handler.py:handle_request_defer():172] handle defer: 13
2023-10-24 14:32:23,951 DEBUG SenderThread:2030398 [sender.py:send_request():407] send_request: defer
2023-10-24 14:32:23,952 INFO SenderThread:2030398 [sender.py:send_request_defer():609] handle sender defer: 13
2023-10-24 14:32:23,952 INFO SenderThread:2030398 [sender.py:transition_state():613] send defer: 14
2023-10-24 14:32:23,952 DEBUG HandlerThread:2030398 [handler.py:handle_request():146] handle_request: defer
2023-10-24 14:32:23,952 INFO HandlerThread:2030398 [handler.py:handle_request_defer():172] handle defer: 14
2023-10-24 14:32:23,953 DEBUG SenderThread:2030398 [sender.py:send():380] send: final
2023-10-24 14:32:23,953 DEBUG SenderThread:2030398 [sender.py:send():380] send: footer
2023-10-24 14:32:23,954 DEBUG SenderThread:2030398 [sender.py:send_request():407] send_request: defer
2023-10-24 14:32:23,954 INFO SenderThread:2030398 [sender.py:send_request_defer():609] handle sender defer: 14
2023-10-24 14:32:23,955 DEBUG HandlerThread:2030398 [handler.py:handle_request():146] handle_request: poll_exit
2023-10-24 14:32:23,955 DEBUG SenderThread:2030398 [sender.py:send_request():407] send_request: poll_exit
2023-10-24 14:32:23,956 DEBUG HandlerThread:2030398 [handler.py:handle_request():146] handle_request: server_info
2023-10-24 14:32:23,957 DEBUG SenderThread:2030398 [sender.py:send_request():407] send_request: server_info
2023-10-24 14:32:23,963 DEBUG HandlerThread:2030398 [handler.py:handle_request():146] handle_request: get_summary
2023-10-24 14:32:23,964 DEBUG HandlerThread:2030398 [handler.py:handle_request():146] handle_request: sampled_history
2023-10-24 14:32:23,965 DEBUG HandlerThread:2030398 [handler.py:handle_request():146] handle_request: internal_messages
2023-10-24 14:32:23,965 DEBUG HandlerThread:2030398 [handler.py:handle_request():146] handle_request: job_info
2023-10-24 14:32:24,583 DEBUG SenderThread:2030398 [sender.py:send_request():407] send_request: job_info
2023-10-24 14:32:24,583 INFO MainThread:2030398 [wandb_run.py:_footer_history_summary_info():3599] rendering history
2023-10-24 14:32:24,584 INFO MainThread:2030398 [wandb_run.py:_footer_history_summary_info():3631] rendering summary
2023-10-24 14:32:24,584 INFO MainThread:2030398 [wandb_run.py:_footer_sync_info():3558] logging synced files
2023-10-24 14:32:24,585 DEBUG HandlerThread:2030398 [handler.py:handle_request():146] handle_request: shutdown
2023-10-24 14:32:24,585 INFO HandlerThread:2030398 [handler.py:finish():866] shutting down handler
2023-10-24 14:32:24,966 INFO WriterThread:2030398 [datastore.py:close():294] close: /nfs/production/surechembl/model_training/tuning/wandb/run-20231024_143149-ppass2nv/run-ppass2nv.wandb
2023-10-24 14:32:25,583 INFO SenderThread:2030398 [sender.py:finish():1534] shutting down sender
2023-10-24 14:32:25,584 INFO SenderThread:2030398 [file_pusher.py:finish():175] shutting down file pusher
2023-10-24 14:32:25,584 INFO SenderThread:2030398 [file_pusher.py:join():181] waiting for file pusher
debug.log:
2023-10-24 14:31:49,768 INFO MainThread:2030259 [wandb_setup.py:_flush():76] Current SDK version is 0.15.12
2023-10-24 14:31:49,769 INFO MainThread:2030259 [wandb_setup.py:_flush():76] Configure stats pid to 2030259
2023-10-24 14:31:49,769 INFO MainThread:2030259 [wandb_setup.py:_flush():76] Loading settings from /homes/nbosc/.config/wandb/settings
2023-10-24 14:31:49,769 INFO MainThread:2030259 [wandb_setup.py:_flush():76] Loading settings from /nfs/production/arl/chembl/nbosc/surechembl/model_training/tuning/wandb/settings
2023-10-24 14:31:49,769 INFO MainThread:2030259 [wandb_setup.py:_flush():76] Loading settings from environment variables: {}
2023-10-24 14:31:49,769 INFO MainThread:2030259 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'bioformer8L_wandb_tuning.py', 'program_abspath': '/nfs/production/sureche$
2023-10-24 14:31:49,769 INFO MainThread:2030259 [wandb_init.py:_log_setup():528] Logging user logs to /nfs/production/surechembl/model_training/tuning/wandb/run-20231024_143149-ppass2nv/logs/debug.log
2023-10-24 14:31:49,769 INFO MainThread:2030259 [wandb_init.py:_log_setup():529] Logging internal logs to /nfs/production/surechembl/model_training/tuning/wandb/run-20231024_143149-ppass2nv/logs/debug-internal.log
2023-10-24 14:31:49,769 INFO MainThread:2030259 [wandb_init.py:init():568] calling init triggers
2023-10-24 14:31:49,769 INFO MainThread:2030259 [wandb_init.py:init():575] wandb.init called with sweep_config: {}
config: {}
2023-10-24 14:31:49,769 INFO MainThread:2030259 [wandb_init.py:init():618] starting backend
2023-10-24 14:31:49,769 INFO MainThread:2030259 [wandb_init.py:init():622] setting up manager
2023-10-24 14:31:49,772 INFO MainThread:2030259 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2023-10-24 14:31:49,776 INFO MainThread:2030259 [wandb_init.py:init():628] backend started and connected
2023-10-24 14:31:49,782 INFO MainThread:2030259 [wandb_init.py:init():720] updated telemetry
2023-10-24 14:31:49,783 INFO MainThread:2030259 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
2023-10-24 14:31:50,208 INFO MainThread:2030259 [wandb_run.py:_on_init():2220] communicating current version
2023-10-24 14:31:50,278 INFO MainThread:2030259 [wandb_run.py:_on_init():2229] got version response
2023-10-24 14:31:50,279 INFO MainThread:2030259 [wandb_init.py:init():804] starting run threads in backend
2023-10-24 14:31:50,380 INFO MainThread:2030259 [wandb_run.py:_console_start():2199] atexit reg
2023-10-24 14:31:50,380 INFO MainThread:2030259 [wandb_run.py:_redirect():2054] redirect: wrap_raw
2023-10-24 14:31:50,380 INFO MainThread:2030259 [wandb_run.py:_redirect():2119] Wrapping output streams.
2023-10-24 14:31:50,380 INFO MainThread:2030259 [wandb_run.py:_redirect():2144] Redirects installed.
2023-10-24 14:31:50,381 INFO MainThread:2030259 [wandb_init.py:init():845] run started, returning control to user process
2023-10-24 14:32:12,563 INFO MainThread:2030259 [pyagent.py:run():314] Starting sweep agent: entity=None, project=None, count=20
2023-10-24 14:32:25,644 WARNING MsgRouterThr:2030259 [router.py:message_loop():77] message_loop has been closed
2023-10-24 14:32:26,484 INFO Thread-5 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:27,095 INFO Thread-6 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:37,130 INFO Thread-7 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:37,702 INFO Thread-8 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:47,740 INFO Thread-9 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:48,517 INFO Thread-10 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:58,555 INFO Thread-11 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:59,187 INFO Thread-12 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:09,219 INFO Thread-13 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:12,426 INFO Thread-14 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:22,461 INFO Thread-15 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:23,164 INFO Thread-16 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:33,200 INFO Thread-17 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:35,141 INFO Thread-18 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:45,176 INFO Thread-19 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:45,885 INFO Thread-20 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:55,921 INFO Thread-21 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:56,667 INFO Thread-22 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:34:06,704 INFO Thread-23 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:34:08,876 INFO Thread-24 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
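Two things I notice in debug.log: every config_cb entry reports 'eval/loss' as the metric rather than the 'f1' I put in the sweep config, and all of them are logged after "message_loop has been closed". Whether either is related to the error is only a guess on my part. The sketch below pulls both facts out of a few lines copied from the log above; the helper names are made up, and it assumes the line format stays as shown.

```python
import ast
import re
from datetime import datetime

# A few lines copied verbatim from debug.log above.
LOG_LINES = [
    "2023-10-24 14:32:12,563 INFO MainThread:2030259 [pyagent.py:run():314] Starting sweep agent: entity=None, project=None, count=20",
    "2023-10-24 14:32:25,644 WARNING MsgRouterThr:2030259 [router.py:message_loop():77] message_loop has been closed",
    "2023-10-24 14:32:26,484 INFO Thread-5 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}",
    "2023-10-24 14:32:27,095 INFO Thread-6 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}",
]

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")
CONFIG_CB = re.compile(r"config_cb None None (\{.*\})$")


def parse_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    match = TIMESTAMP.match(line)
    base = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(match.group(2)) * 1000)


def summarize(lines):
    """Return the set of metrics seen and how many config updates came after the close."""
    closed_at = None
    metrics = set()
    late_updates = 0
    for line in lines:
        when = parse_time(line)
        if "message_loop has been closed" in line:
            closed_at = when
        match = CONFIG_CB.search(line)
        if match:
            payload = ast.literal_eval(match.group(1))
            metrics.add(payload["metric"])
            if closed_at is not None and when > closed_at:
                late_updates += 1
    return metrics, late_updates


if __name__ == "__main__":
    metrics, late_updates = summarize(LOG_LINES)
    print("metrics seen:", metrics)
    print("config updates after message_loop closed:", late_updates)
```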