How to correctly use wandb hyperparameter tuning with Huggingface?

Hi everyone :wave:, I am using wandb with Huggingface in an AWS SageMaker notebook, and I am referring to these tutorials: Define sweep configuration for hyperparameter tuning and Hyperparameter Search using Trainer API.

My code works well without hyperparameter search, but every run fails once I enable hyperparameter search.

This is the error message from one of the failed runs:

wandb: ERROR Run 0ilv70r3 errored: ValueError("boxes1 must be in [x0, y0, x1, y1] (corner) format, but got tensor([[nan, nan, nan, nan],\n [nan, nan, nan, nan],\n [nan, nan, nan, nan],\n ...,\n [nan, nan, nan, nan],\n [nan, nan, nan, nan],\n [nan, nan, nan, nan]], device='cuda:0', dtype=torch.float16)")

My model is an object detection model; it looks like the predicted boxes turn into NaNs during training. How can I solve this issue?

Here are some useful snippets of my code:

def wandb_hp_space(trial):
    return {
        "method": "bayes",
        "metric": {"name": "loss", "goal": "minimize"},
        "parameters": {
            "learning_rate": {"distribution": "log_uniform", "min": 1e-6, "max": 1e-4},
            "per_device_train_batch_size": {"values": [8, 16]},
        },
    }

training_args = TrainingArguments(
    output_dir=args.output_dir,
    overwrite_output_dir=True,
    per_device_train_batch_size=args.per_device_train_batch_size,
    weight_decay=args.weight_decay,
    warmup_steps=args.warmup_steps,
    save_total_limit=args.save_total_limit,
    learning_rate=args.learning_rate,
    fp16=True,
    save_strategy="epoch",
    logging_strategy="epoch",
    remove_unused_columns=False,
    push_to_hub=True,
    hub_model_id=args.hub_model_id,
    hub_token=args.hub_token,
    hub_strategy="every_save",
    report_to="wandb",
)

def model_init(trial):
    return AutoModelForObjectDetection.from_pretrained(
        args.pretrained_model,
        id2label=CLASS_ID_TO_NAME,
        label2id=CLASS_NAME_TO_ID,
        ignore_mismatched_sizes=True,
    )

trainer = Trainer(
    model=None,
    model_init=model_init,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=collate_fn,
    tokenizer=image_processor,
)

best_trial = trainer.hyperparameter_search(
    hp_space=wandb_hp_space,
    n_trials=5,
    direction="minimize",
    backend="wandb",
)

I would greatly appreciate any guidance or advice on how to resolve this issue. Thank you very much in advance for your help! :pray: :pray:

Hi @oschan77, could you print all of the args that your script is receiving in your model_init? From what I can tell, the search space config looks good, but I wanted to check that all the arg values are valid, since this should be the only difference between a hyperparameter search and a standard run.
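Something like this minimal sketch would do, assuming the parsed argparse namespace is available as args inside your training script:

def model_init(trial):
    # trial may be None on the first call, depending on the backend.
    print("trial:", trial)
    # Dump every CLI arg so a sweep run can be diffed against a normal run.
    for name, value in sorted(vars(args).items()):
        print(f"{name} = {value!r}")
    return AutoModelForObjectDetection.from_pretrained(
        args.pretrained_model,
        id2label=CLASS_ID_TO_NAME,
        label2id=CLASS_NAME_TO_ID,
        ignore_mismatched_sizes=True,
    )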

Thank you,
Nate

Hi @nathank! Thank you for your help! These are all the args that the script receives:

  parser.add_argument("--per_device_train_batch_size", type=int, default=4)
  parser.add_argument("--warmup_steps", type=int, default=100)
  parser.add_argument("--save_total_limit", type=int, default=2)
  parser.add_argument("--pretrained_model", type=str, default="facebook/detr-resnet-50")
  parser.add_argument("--learning_rate", type=float, default=1e-5)
  parser.add_argument("--weight_decay", type=float, default=1e-4)
  parser.add_argument("--image_resize_ratio", type=float, default=0.25)
  parser.add_argument("--hub_model_id", type=str, default=None)
  parser.add_argument("--hub_token", type=str, default=None)
  parser.add_argument("--wandb_token", type=str, default=None)
  parser.add_argument("--wandb_project_name", type=str, default="detr-algae-v0")
  parser.add_argument("--wandb_run_name", type=str, default=None)
  parser.add_argument("--output_dir", type=str, default=os.environ["SM_MODEL_DIR"])
  parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
  parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
  parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])

Hi @nathank, any ideas? I am still facing the same issue. Thanks!

Hi @oschan77, can you print out the shape and values of your boxes1 during training to confirm these are NaNs?
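Just a rough sketch of what I mean; check_boxes is a hypothetical helper you could drop in wherever the predicted boxes are produced, e.g. right before the loss computation:

import torch

def check_boxes(boxes, name="boxes1"):
    # Hypothetical probe: report shape/dtype and count NaNs in the boxes.
    print(name, "shape:", tuple(boxes.shape), "dtype:", boxes.dtype)
    n_nan = torch.isnan(boxes).sum().item()
    if n_nan > 0:
        print(f"{name} contains {n_nan} NaN values; first rows:", boxes[:3])

If the NaNs already show up on the very first training step, that points at the sampled hyperparameters rather than the data.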

Also, it's possible that some of the parameters being suggested by the sweep are not within bounds that work for your model. Could you share your sweep config?
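One thing worth double-checking: in W&B sweep configs, the log_uniform distribution interprets min and max as natural-log exponents, while log_uniform_values takes the raw values directly. So if the intent is a learning rate between 1e-6 and 1e-4, a config would look like this (just a sketch, not necessarily your exact setup):

def wandb_hp_space(trial):
    return {
        "method": "bayes",
        "metric": {"name": "loss", "goal": "minimize"},
        "parameters": {
            # log_uniform_values treats min/max as the actual values;
            # plain log_uniform would read them as log-space exponents,
            # which can push the sampled learning rate far out of range.
            "learning_rate": {
                "distribution": "log_uniform_values",
                "min": 1e-6,
                "max": 1e-4,
            },
            "per_device_train_batch_size": {"values": [8, 16]},
        },
    }

With log_uniform and min=1e-6, max=1e-4, the sampled values would fall between exp(1e-6) and exp(1e-4), i.e. learning rates of roughly 1.0, which would very plausibly blow the loss up to NaN, especially under fp16.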

Hi @oschan77, I wanted to follow up and see if this is still an issue?
