How to correctly use wandb hyperparameter tuning with Huggingface?

Hi everyone :wave:, I am using wandb with Huggingface in an AWS SageMaker notebook, and I am referring to these tutorials: Define sweep configuration for hyperparameter tuning and Hyperparameter Search using Trainer API.

My code works well without hyperparameter search, but every run fails once I enable hyperparameter search.

This is the error message from one of the failed runs:

wandb: ERROR Run 0ilv70r3 errored: ValueError("boxes1 must be in [x0, y0, x1, y1] (corner) format, but got tensor([[nan, nan, nan, nan],\n [nan, nan, nan, nan],\n [nan, nan, nan, nan],\n ...,\n [nan, nan, nan, nan],\n [nan, nan, nan, nan],\n [nan, nan, nan, nan]], device='cuda:0', dtype=torch.float16)")

My model is an object detection model; it looks like the predicted boxes turn into NaNs during training. How can I solve this issue?

Here are some useful snippets of my code:

def wandb_hp_space(trial):
    return {
        "method": "bayes",
        "metric": {"name": "loss", "goal": "minimize"},
        "parameters": {
            "learning_rate": {"distribution": "log_uniform", "min": 1e-6, "max": 1e-4},
            "per_device_train_batch_size": {"values": [8, 16]},
        },
    }

training_args = TrainingArguments(
    output_dir=args.output_dir,
    overwrite_output_dir=True,
    per_device_train_batch_size=args.per_device_train_batch_size,
    weight_decay=args.weight_decay,
    warmup_steps=args.warmup_steps,
    save_total_limit=args.save_total_limit,
    learning_rate=args.learning_rate,
    fp16=True,
    save_strategy="epoch",
    logging_strategy="epoch",
    remove_unused_columns=False,
    push_to_hub=True,
    hub_model_id=args.hub_model_id,
    hub_token=args.hub_token,
    hub_strategy="every_save",
    report_to="wandb",
)

def model_init(trial):
    return AutoModelForObjectDetection.from_pretrained(
        args.pretrained_model,
        id2label=CLASS_ID_TO_NAME,
        label2id=CLASS_NAME_TO_ID,
        ignore_mismatched_sizes=True,
    )

trainer = Trainer(
    model=None,
    model_init=model_init,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=collate_fn,
    tokenizer=image_processor,
)

best_trial = trainer.hyperparameter_search(
    hp_space=wandb_hp_space,
    n_trials=5,
    direction="minimize",
    backend="wandb",
)

I would greatly appreciate any guidance or advice on how to resolve this issue. Thank you very much in advance for your help! :pray: :pray:

Hi @oschan77, could you print all of the args that your script is receiving in your model_init? From what I can tell, the search space config looks good, but I wanted to check that all the arg values are valid, since this should be the only difference between a hyperparameter search and a standard run.
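Something like this minimal sketch would do, assuming the parsed argparse namespace is available as args inside your training script:

def model_init(trial):
    # trial may be None on the first call, depending on the backend.
    print("trial:", trial)
    # Dump every CLI arg so a sweep run can be diffed against a normal run.
    for name, value in sorted(vars(args).items()):
        print(f"{name} = {value!r}")
    return AutoModelForObjectDetection.from_pretrained(
        args.pretrained_model,
        id2label=CLASS_ID_TO_NAME,
        label2id=CLASS_NAME_TO_ID,
        ignore_mismatched_sizes=True,
    )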

Thank you,
Nate

Hi @nathank! Thank you for your help! These are all the args that the script receives:

  parser.add_argument("--per_device_train_batch_size", type=int, default=4)
  parser.add_argument("--warmup_steps", type=int, default=100)
  parser.add_argument("--save_total_limit", type=int, default=2)
  parser.add_argument("--pretrained_model", type=str, default="facebook/detr-resnet-50")
  parser.add_argument("--learning_rate", type=float, default=1e-5)
  parser.add_argument("--weight_decay", type=float, default=1e-4)
  parser.add_argument("--image_resize_ratio", type=float, default=0.25)
  parser.add_argument("--hub_model_id", type=str, default=None)
  parser.add_argument("--hub_token", type=str, default=None)
  parser.add_argument("--wandb_token", type=str, default=None)
  parser.add_argument("--wandb_project_name", type=str, default="detr-algae-v0")
  parser.add_argument("--wandb_run_name", type=str, default=None)
  parser.add_argument("--output_dir", type=str, default=os.environ["SM_MODEL_DIR"])
  parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
  parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
  parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])

Hi @nathank, any ideas? I am still facing the same issue. Thanks!

Hi @oschan77, can you print out the shape and values of your boxes1 during training to confirm these are NaNs?
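Just a rough sketch of what I mean; check_boxes is a hypothetical helper you could drop in wherever the predicted boxes are produced, e.g. right before the loss computation:

import torch

def check_boxes(boxes, name="boxes1"):
    # Hypothetical probe: report shape/dtype and count NaNs in the boxes.
    print(name, "shape:", tuple(boxes.shape), "dtype:", boxes.dtype)
    n_nan = torch.isnan(boxes).sum().item()
    if n_nan > 0:
        print(f"{name} contains {n_nan} NaN values; first rows:", boxes[:3])

If the NaNs already show up on the very first training step, that points at the sampled hyperparameters rather than the data.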

Also, it's possible that some of the parameters being suggested by the sweep are not within bounds that work for your model. Could you share your sweep config?
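One thing worth double-checking: in W&B sweep configs, the log_uniform distribution interprets min and max as natural-log exponents, while log_uniform_values takes the raw values directly. So if the intent is a learning rate between 1e-6 and 1e-4, a config would look like this (just a sketch, not necessarily your exact setup):

def wandb_hp_space(trial):
    return {
        "method": "bayes",
        "metric": {"name": "loss", "goal": "minimize"},
        "parameters": {
            # log_uniform_values treats min/max as the actual values;
            # plain log_uniform would read them as log-space exponents,
            # which can push the sampled learning rate far out of range.
            "learning_rate": {
                "distribution": "log_uniform_values",
                "min": 1e-6,
                "max": 1e-4,
            },
            "per_device_train_batch_size": {"values": [8, 16]},
        },
    }

With log_uniform and min=1e-6, max=1e-4, the sampled values would fall between exp(1e-6) and exp(1e-4), i.e. learning rates of roughly 1.0, which would very plausibly blow the loss up to NaN, especially under fp16.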

Hi @oschan77, I wanted to follow up and see if this is still an issue?
