Hugging Face Trainer with W&B Sweeps causes a broken pipe error

Hi, using WandB to train my models seems promising, but what I need most at the moment is hyperparameter tuning. I try to use Sweeps but it does not work, and unfortunately I can't figure out why. I should say that model training without hyperparameter tuning works normally, so the issue must be in how I set up Sweeps.

Here is the code related to the model training:

def wandb_hp_space(trial):
    # hyperparameters I want to optimize
    return {
        'method': 'random',
        'metric': {'name': 'f1', 'goal': 'maximize'},
        'parameters': {
        'learning_rate': {'values': [2e-5, 5e-5, 2e-4, 5e-4, 2e-3]},
        'per_device_train_batch_size': {'values': [8, 16, 32]},
        'per_device_eval_batch_size': {'values': [8, 16, 32]},
            'weight_decay': {'values': [0.01, 0.001]}
        }
    }

def model_init(trial):
    return AutoModelForTokenClassification.from_pretrained(
        model_checkpoint,
        num_labels=len(label_list),
        id2label=id2label,
        label2id=label2id).to('cuda:0')

training_args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    # push_to_hub=True,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    report_to='wandb'
)

# define the trainer and its parameters
trainer = Trainer(
    model=None,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    model_init=model_init
)

best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="wandb",
    hp_space=wandb_hp_space,
    n_trials=20,
    # compute_objective=compute_objective,
)

Here is the Python error:

wandb: WARNING Calling wandb.login() after wandb.init() has no effect.
wandb: Waiting for W&B process to finish... (success).
wandb: 🚀 View run kind-vortex-4 at: https://wandb.ai/surechembl/uncategorized/runs/ppass2nv
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20231024_143149-ppass2nv/logs
wandb: Agent Starting Run: np5vpbxf with config:
wandb:  learning_rate: 0.002
wandb:  per_device_eval_batch_size: 8
wandb:  per_device_train_batch_size: 8
wandb:  weight_decay: 0.01
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/agents/pyagent.py", line 298, in _run_job
    self._function()
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/transformers/integrations/integration_utils.py", line 497, in _objective
    run.config.update({"assignments": {}, "metric": metric})
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/wandb_config.py", line 186, in update
    self._callback(data=sanitized)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 370, in wrapper_fn
    return func(self, *args, **kwargs)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1326, in _config_callback
    self._backend.interface.publish_config(key=key, val=val, data=data)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 168, in publish_config
    self._publish_config(cfg)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 358, in _publish_config
    self._publish(rec)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/nfs/production/surechembl/model_training/pubmedbert/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

Here are the last lines of debug-internal.log:

2023-10-24 14:32:23,118 INFO    SenderThread:2030398 [sender.py:send_request_defer():609] handle sender defer: 12
2023-10-24 14:32:23,118 INFO    SenderThread:2030398 [file_stream.py:finish():595] file stream finish called
2023-10-24 14:32:23,950 INFO    SenderThread:2030398 [file_stream.py:finish():599] file stream finish is done
2023-10-24 14:32:23,950 INFO    SenderThread:2030398 [sender.py:transition_state():613] send defer: 13
2023-10-24 14:32:23,951 DEBUG   HandlerThread:2030398 [handler.py:handle_request():146] handle_request: defer
2023-10-24 14:32:23,951 INFO    HandlerThread:2030398 [handler.py:handle_request_defer():172] handle defer: 13
2023-10-24 14:32:23,951 DEBUG   SenderThread:2030398 [sender.py:send_request():407] send_request: defer
2023-10-24 14:32:23,952 INFO    SenderThread:2030398 [sender.py:send_request_defer():609] handle sender defer: 13
2023-10-24 14:32:23,952 INFO    SenderThread:2030398 [sender.py:transition_state():613] send defer: 14
2023-10-24 14:32:23,952 DEBUG   HandlerThread:2030398 [handler.py:handle_request():146] handle_request: defer
2023-10-24 14:32:23,952 INFO    HandlerThread:2030398 [handler.py:handle_request_defer():172] handle defer: 14
2023-10-24 14:32:23,953 DEBUG   SenderThread:2030398 [sender.py:send():380] send: final
2023-10-24 14:32:23,953 DEBUG   SenderThread:2030398 [sender.py:send():380] send: footer
2023-10-24 14:32:23,954 DEBUG   SenderThread:2030398 [sender.py:send_request():407] send_request: defer
2023-10-24 14:32:23,954 INFO    SenderThread:2030398 [sender.py:send_request_defer():609] handle sender defer: 14
2023-10-24 14:32:23,955 DEBUG   HandlerThread:2030398 [handler.py:handle_request():146] handle_request: poll_exit
2023-10-24 14:32:23,955 DEBUG   SenderThread:2030398 [sender.py:send_request():407] send_request: poll_exit
2023-10-24 14:32:23,956 DEBUG   HandlerThread:2030398 [handler.py:handle_request():146] handle_request: server_info
2023-10-24 14:32:23,957 DEBUG   SenderThread:2030398 [sender.py:send_request():407] send_request: server_info
2023-10-24 14:32:23,963 DEBUG   HandlerThread:2030398 [handler.py:handle_request():146] handle_request: get_summary
2023-10-24 14:32:23,964 DEBUG   HandlerThread:2030398 [handler.py:handle_request():146] handle_request: sampled_history
2023-10-24 14:32:23,965 DEBUG   HandlerThread:2030398 [handler.py:handle_request():146] handle_request: internal_messages
2023-10-24 14:32:23,965 DEBUG   HandlerThread:2030398 [handler.py:handle_request():146] handle_request: job_info
2023-10-24 14:32:24,583 DEBUG   SenderThread:2030398 [sender.py:send_request():407] send_request: job_info
2023-10-24 14:32:24,583 INFO    MainThread:2030398 [wandb_run.py:_footer_history_summary_info():3599] rendering history
2023-10-24 14:32:24,584 INFO    MainThread:2030398 [wandb_run.py:_footer_history_summary_info():3631] rendering summary
2023-10-24 14:32:24,584 INFO    MainThread:2030398 [wandb_run.py:_footer_sync_info():3558] logging synced files
2023-10-24 14:32:24,585 DEBUG   HandlerThread:2030398 [handler.py:handle_request():146] handle_request: shutdown
2023-10-24 14:32:24,585 INFO    HandlerThread:2030398 [handler.py:finish():866] shutting down handler
2023-10-24 14:32:24,966 INFO    WriterThread:2030398 [datastore.py:close():294] close: /nfs/production/surechembl/model_training/tuning/wandb/run-20231024_143149-ppass2nv/run-ppass2nv.wandb
2023-10-24 14:32:25,583 INFO    SenderThread:2030398 [sender.py:finish():1534] shutting down sender
2023-10-24 14:32:25,584 INFO    SenderThread:2030398 [file_pusher.py:finish():175] shutting down file pusher
2023-10-24 14:32:25,584 INFO    SenderThread:2030398 [file_pusher.py:join():181] waiting for file pusher

debug.log:

2023-10-24 14:31:49,768 INFO    MainThread:2030259 [wandb_setup.py:_flush():76] Current SDK version is 0.15.12
2023-10-24 14:31:49,769 INFO    MainThread:2030259 [wandb_setup.py:_flush():76] Configure stats pid to 2030259
2023-10-24 14:31:49,769 INFO    MainThread:2030259 [wandb_setup.py:_flush():76] Loading settings from /homes/nbosc/.config/wandb/settings
2023-10-24 14:31:49,769 INFO    MainThread:2030259 [wandb_setup.py:_flush():76] Loading settings from /nfs/production/arl/chembl/nbosc/surechembl/model_training/tuning/wandb/settings
2023-10-24 14:31:49,769 INFO    MainThread:2030259 [wandb_setup.py:_flush():76] Loading settings from environment variables: {}
2023-10-24 14:31:49,769 INFO    MainThread:2030259 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'bioformer8L_wandb_tuning.py', 'program_abspath': '/nfs/production/sureche$
2023-10-24 14:31:49,769 INFO    MainThread:2030259 [wandb_init.py:_log_setup():528] Logging user logs to /nfs/production/surechembl/model_training/tuning/wandb/run-20231024_143149-ppass2nv/logs/debug.log
2023-10-24 14:31:49,769 INFO    MainThread:2030259 [wandb_init.py:_log_setup():529] Logging internal logs to /nfs/production/surechembl/model_training/tuning/wandb/run-20231024_143149-ppass2nv/logs/debug-internal.log
2023-10-24 14:31:49,769 INFO    MainThread:2030259 [wandb_init.py:init():568] calling init triggers
2023-10-24 14:31:49,769 INFO    MainThread:2030259 [wandb_init.py:init():575] wandb.init called with sweep_config: {}
config: {}
2023-10-24 14:31:49,769 INFO    MainThread:2030259 [wandb_init.py:init():618] starting backend
2023-10-24 14:31:49,769 INFO    MainThread:2030259 [wandb_init.py:init():622] setting up manager
2023-10-24 14:31:49,772 INFO    MainThread:2030259 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2023-10-24 14:31:49,776 INFO    MainThread:2030259 [wandb_init.py:init():628] backend started and connected
2023-10-24 14:31:49,782 INFO    MainThread:2030259 [wandb_init.py:init():720] updated telemetry
2023-10-24 14:31:49,783 INFO    MainThread:2030259 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
2023-10-24 14:31:50,208 INFO    MainThread:2030259 [wandb_run.py:_on_init():2220] communicating current version
2023-10-24 14:31:50,278 INFO    MainThread:2030259 [wandb_run.py:_on_init():2229] got version response
2023-10-24 14:31:50,279 INFO    MainThread:2030259 [wandb_init.py:init():804] starting run threads in backend
2023-10-24 14:31:50,380 INFO    MainThread:2030259 [wandb_run.py:_console_start():2199] atexit reg
2023-10-24 14:31:50,380 INFO    MainThread:2030259 [wandb_run.py:_redirect():2054] redirect: wrap_raw
2023-10-24 14:31:50,380 INFO    MainThread:2030259 [wandb_run.py:_redirect():2119] Wrapping output streams.
2023-10-24 14:31:50,380 INFO    MainThread:2030259 [wandb_run.py:_redirect():2144] Redirects installed.
2023-10-24 14:31:50,381 INFO    MainThread:2030259 [wandb_init.py:init():845] run started, returning control to user process
2023-10-24 14:32:12,563 INFO    MainThread:2030259 [pyagent.py:run():314] Starting sweep agent: entity=None, project=None, count=20
2023-10-24 14:32:25,644 WARNING MsgRouterThr:2030259 [router.py:message_loop():77] message_loop has been closed
2023-10-24 14:32:26,484 INFO    Thread-5  :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:27,095 INFO    Thread-6  :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:37,130 INFO    Thread-7  :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:37,702 INFO    Thread-8  :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:47,740 INFO    Thread-9  :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:48,517 INFO    Thread-10 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:58,555 INFO    Thread-11 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:32:59,187 INFO    Thread-12 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:09,219 INFO    Thread-13 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:12,426 INFO    Thread-14 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:22,461 INFO    Thread-15 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:23,164 INFO    Thread-16 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:33,200 INFO    Thread-17 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:35,141 INFO    Thread-18 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:45,176 INFO    Thread-19 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:45,885 INFO    Thread-20 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:55,921 INFO    Thread-21 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:33:56,667 INFO    Thread-22 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:34:06,704 INFO    Thread-23 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}
2023-10-24 14:34:08,876 INFO    Thread-24 :2030259 [wandb_run.py:_config_callback():1324] config_cb None None {'assignments': {}, 'metric': 'eval/loss'}

In the end I changed my code to follow this tutorial instead, and now it works.
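
For anyone who hits the same error: the pattern most W&B + Trainer tutorials use drives the sweep with wandb.sweep/wandb.agent directly instead of trainer.hyperparameter_search. Below is a minimal sketch of that approach (not necessarily identical to the tutorial I followed), assuming the same model_init, tokenized_datasets, data_collator, tokenizer, compute_metrics, model_name and task as above; the project name is just a placeholder.

import wandb
from transformers import Trainer, TrainingArguments

sweep_config = {
    'method': 'random',
    'metric': {'name': 'f1', 'goal': 'maximize'},
    'parameters': {
        'learning_rate': {'values': [2e-5, 5e-5, 2e-4, 5e-4, 2e-3]},
        'per_device_train_batch_size': {'values': [8, 16, 32]},
        'per_device_eval_batch_size': {'values': [8, 16, 32]},
        'weight_decay': {'values': [0.01, 0.001]}
    }
}

def train(config=None):
    # the sweep agent injects the hyperparameters it picked into wandb.config
    with wandb.init(config=config):
        config = wandb.config
        training_args = TrainingArguments(
            output_dir=f"{model_name}-finetuned-{task}",
            evaluation_strategy="epoch",
            save_strategy="epoch",
            learning_rate=config.learning_rate,
            per_device_train_batch_size=config.per_device_train_batch_size,
            per_device_eval_batch_size=config.per_device_eval_batch_size,
            weight_decay=config.weight_decay,
            num_train_epochs=5,
            load_best_model_at_end=True,
            metric_for_best_model='f1',
            report_to='wandb'
        )
        trainer = Trainer(
            model_init=model_init,  # fresh model for every sweep run
            args=training_args,
            train_dataset=tokenized_datasets["train"],
            eval_dataset=tokenized_datasets["test"],
            data_collator=data_collator,
            tokenizer=tokenizer,
            compute_metrics=compute_metrics
        )
        trainer.train()

sweep_id = wandb.sweep(sweep_config, project="ner-tuning")  # placeholder project name
wandb.agent(sweep_id, train, count=20)

The main difference is that every trial runs inside its own wandb.init context started by the agent, rather than reusing a run that was created before the sweep started.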
