Hi everyone, I am using wandb with Hugging Face in an AWS SageMaker notebook, and I am following the tutorials "Define sweep configuration for hyperparameter tuning" and "Hyperparameter Search using Trainer API".
My code works well without hyperparameter search, but all runs fail once I enable hyperparameter search.
This is the error message from one of the failed runs:
Run 0ilv70r3 errored: ValueError("boxes1 must be in [x0, y0, x1, y1] (corner) format, but got tensor([[nan, nan, nan, nan],\n [nan, nan, nan, nan],\n [nan, nan, nan, nan],\n ...,\n [nan, nan, nan, nan],\n [nan, nan, nan, nan],\n [nan, nan, nan, nan]], device='cuda:0', dtype=torch.float16)")
wandb: ERROR Run 0ilv70r3 errored: ValueError("boxes1 must be in [x0, y0, x1, y1] (corner) format, but got tensor([[nan, nan, nan, nan],\n [nan, nan, nan, nan],\n [nan, nan, nan, nan],\n ...,\n [nan, nan, nan, nan],\n [nan, nan, nan, nan],\n [nan, nan, nan, nan]], device='cuda:0', dtype=torch.float16)")
My model is an object detection model. The predicted boxes are all NaN, so the model outputs appear to be invalid during the search runs. How can I solve this issue?
Here are the relevant snippets of my code:
def wandb_hp_space(trial):
    return {
        "method": "bayes",
        "metric": {"name": "loss", "goal": "minimize"},
        "parameters": {
            "learning_rate": {"distribution": "log_uniform", "min": 1e-6, "max": 1e-4},
            "per_device_train_batch_size": {"values": [8, 16]},
        },
    }
training_args = TrainingArguments(
    output_dir=args.output_dir,
    overwrite_output_dir=True,
    per_device_train_batch_size=args.per_device_train_batch_size,
    weight_decay=args.weight_decay,
    warmup_steps=args.warmup_steps,
    save_total_limit=args.save_total_limit,
    learning_rate=args.learning_rate,
    fp16=True,
    save_strategy="epoch",
    logging_strategy="epoch",
    remove_unused_columns=False,
    push_to_hub=True,
    hub_model_id=args.hub_model_id,
    hub_token=args.hub_token,
    hub_strategy="every_save",
    report_to="wandb",
)
def model_init(trial):
    return AutoModelForObjectDetection.from_pretrained(
        args.pretrained_model,
        id2label=CLASS_ID_TO_NAME,
        label2id=CLASS_NAME_TO_ID,
        ignore_mismatched_sizes=True,
    )
trainer = Trainer(
    model=None,
    model_init=model_init,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=collate_fn,
    tokenizer=image_processor,
)
trainer.hyperparameter_search(
    hp_space=wandb_hp_space,
    n_trials=5,
    direction="minimize",
    backend="wandb",
)
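A possible cause worth ruling out in the `wandb_hp_space` snippet above: per the W&B sweep configuration docs, the `log_uniform` distribution treats `min` and `max` as exponents and returns a value between exp(min) and exp(max), while `log_uniform_values` treats them as the actual bounds. If that applies here, `min=1e-6, max=1e-4` would yield learning rates very close to 1.0, which could plausibly produce NaN boxes. The sketch below only imitates the two sampling rules with the standard library to show the resulting ranges; the helper names and `learning_rate_space` are hypothetical, and this has not been confirmed as the cause of the error.

```python
import math
import random


def sample_log_uniform(lo, hi, rng):
    """Imitates W&B `log_uniform`: lo/hi are exponents, result is exp(U(lo, hi))."""
    return math.exp(rng.uniform(lo, hi))


def sample_log_uniform_values(lo, hi, rng):
    """Imitates W&B `log_uniform_values`: lo/hi are the actual value bounds."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))


rng = random.Random(0)
as_exponents = [sample_log_uniform(1e-6, 1e-4, rng) for _ in range(1000)]
as_values = [sample_log_uniform_values(1e-6, 1e-4, rng) for _ in range(1000)]

# Compare the learning-rate ranges produced by the same min/max settings.
print("log_uniform range:       ", min(as_exponents), max(as_exponents))
print("log_uniform_values range:", min(as_values), max(as_values))

# Assumed replacement for the "learning_rate" entry (hypothetical name).
learning_rate_space = {
    "distribution": "log_uniform_values",
    "min": 1e-6,
    "max": 1e-4,
}
```

If the first range turns out to match the learning rates logged for the failed runs in the W&B dashboard, switching the `learning_rate` entry to `log_uniform_values` would be the first thing to try.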
I would greatly appreciate any guidance or advice on how to resolve this issue. Thank you very much in advance for your help!