Checkpoint path error using wandb.agent

Hello, I’m running sweeps from my Jupyter notebook with wandb.agent, but most of my sweep runs fail with the following error:

wandb: Adding directory to artifact (/home/phdomingues/masters/results/ViT/UNIFESP/masked/SN-UNIFESP/checkpoint-77)... Done. 2.4s
wandb: Adding directory to artifact (/home/phdomingues/masters/results/ViT/UNIFESP/masked/SN-UNIFESP/checkpoint-154)... Done. 3.2s
Traceback (most recent call last):
  File "/tmp/ipykernel_4543/1672163409.py", line 32, in train
    trainer.train()
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/transformers/trainer.py", line 2311, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/transformers/trainer.py", line 2733, in _maybe_log_save_evaluate
    self.control = self.callback_handler.on_save(self.args, self.state, self.control)
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/transformers/trainer_callback.py", line 487, in on_save
    return self.call_event("on_save", args, state, control)
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/transformers/trainer_callback.py", line 498, in call_event
    result = getattr(callback, event)(
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/transformers/integrations/integration_utils.py", line 847, in on_save
    artifact.add_dir(artifact_path)
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/wandb/sdk/artifacts/artifact.py", line 1226, in add_dir
    raise ValueError("Path is not a directory: {}".format(local_path))
ValueError: Path is not a directory: /home/phdomingues/masters/results/ViT/UNIFESP/masked/SN-UNIFESP/checkpoint-231

A snippet of my code:

# (model, processor, collate_fn, compute_metrics, ds, METRICS,
#  output_dir and sweep_id are defined earlier in the notebook)
import wandb
from functools import partial
from datasets import load_metric
from transformers import Trainer, TrainingArguments

def train(config=None):
    with wandb.init(config=config):
        config = wandb.config

        training_args = TrainingArguments(
            output_dir=output_dir,
            report_to='wandb',
            save_strategy='epoch',
            evaluation_strategy='epoch',
            logging_strategy='epoch',
            learning_rate=config.learning_rate,
            weight_decay=config.weight_decay,
            num_train_epochs=config.epochs,
            per_device_train_batch_size=config.batch_size,
            per_device_eval_batch_size=2,
            save_total_limit=2,
            remove_unused_columns=False,
            push_to_hub=False,
            fp16=True,
            load_best_model_at_end=True,
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            data_collator=collate_fn,
            compute_metrics=partial(compute_metrics, metrics=[load_metric(m, trust_remote_code=True) for m in METRICS]),
            train_dataset=ds['train'],
            eval_dataset=ds['test'],
            tokenizer=processor,
        )
        trainer.train()

wandb.agent(sweep_id, train, count=20)
wandb.finish()

It seems to me that the library fails to create a checkpoint directory and breaks when trying to access it, or maybe it removes it too soon…

I would appreciate it if someone could help me figure out what is happening and how to solve it.

Hi @pdomingues, good day and thank you for reaching out to us! Happy to help you with this.

It seems that the error is triggered because the path being added to the artifact is not recognized as a directory:

ValueError: Path is not a directory: /home/phdomingues/masters/results/ViT/UNIFESP/masked/SN-UNIFESP/checkpoint-231

As a first step of troubleshooting, have you checked whether this directory exists?
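A minimal way to check this from a cell in the same notebook right after a run fails (the path below is the one from the traceback, and checkpoint_exists is just an illustrative helper, not a wandb API):

```python
import os

def checkpoint_exists(path):
    # True only if the path exists and is a directory,
    # which is what Artifact.add_dir() requires.
    return os.path.isdir(path)

# Path taken from the traceback above:
print(checkpoint_exists(
    "/home/phdomingues/masters/results/ViT/UNIFESP/masked/SN-UNIFESP/checkpoint-231"
))
```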

Hello @paulo-sabile and thanks for the response.
I can confirm that the directory does not exist when the error is thrown.

Thank you for clarifying this @pdomingues

To further troubleshoot this error, could you please provide us with a copy of your debug-internal.log and debug.log for the affected run? These files are under your local folder wandb/run-_-/logs in the same directory where you’re running your code, and they will give us more details about this error.

Can you also please share your current SDK version? You can get this by running wandb --version

Thank you in advance!

Hello @paulo-sabile, the SDK version is 0.17.2; I also tried downgrading to 0.15, but the error persists.

Unfortunately, I don’t have the exact logs of the error I posted anymore, so I reproduced the experiment again.

Here you can see the error is still the same:

Run fahuw2h2 errored:
Traceback (most recent call last):
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 307, in _run_job
    self._function()
  File "/tmp/ipykernel_476/1672163409.py", line 32, in train
    trainer.train()
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/transformers/trainer.py", line 2311, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/transformers/trainer.py", line 2733, in _maybe_log_save_evaluate
    self.control = self.callback_handler.on_save(self.args, self.state, self.control)
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/transformers/trainer_callback.py", line 487, in on_save
    return self.call_event("on_save", args, state, control)
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/transformers/trainer_callback.py", line 498, in call_event
    result = getattr(callback, event)(
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/transformers/integrations/integration_utils.py", line 847, in on_save
    artifact.add_dir(artifact_path)
  File "/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/wandb/sdk/artifacts/artifact.py", line 1226, in add_dir
    raise ValueError("Path is not a directory: {}".format(local_path))
ValueError: Path is not a directory: /home/phdomingues/masters/results/ViT/UNIFESP/masked/FP-UNIFESP/checkpoint-60

The forum won’t let me upload the files, so here is a link to both of them on my Google Drive.

For some reason I couldn’t find any errors in the logs, but I made sure they are the correct ones, and you can even check the run ID (fahuw2h2) in debug-internal.log.

Thanks again for the response.

Hi @pdomingues Good day and thank you for patiently waiting for our update.

We reviewed this, and we think the local directory doesn’t exist yet when the artifact is added. This might be a concurrency issue.

To further investigate this, we would like to ask the following:

  • What is the size of that directory from your checkpoints?
  • Could you please try to add a call like time.sleep(100) just to check if that would resolve the issue?
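For example, since the checkpoints are saved inside trainer.train(), the sleep could be placed around the training call, along these lines (a sketch only: train_fn is a placeholder for your train() function, and 100 seconds is just the diagnostic value suggested above):

```python
import time

def train_with_delay(train_fn, delay_s=100.0):
    # Run training, then pause so any pending artifact
    # uploads can settle before the process moves on.
    result = train_fn()
    time.sleep(delay_s)
    return result
```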

Thank you!

Hello @paulo-sabile,

I also suspect concurrency problems, but since it’s not coming directly from my code, I’m not sure how to proceed.

As for what you asked:

  • I regularly recreate the directory so it doesn’t grow too large from the models saved after every test I run, but the parent directory is 13 GB.
  • I’ve tried adding multiple sleep calls to my code. Still, I can only control execution before and after training (since I’m using the wandb–Hugging Face integration), and the problem occurs mid-training, so this didn’t solve anything. I’ve also tracked down the exact file and line where the error occurs (/home/phdomingues/.miniconda3/envs/wandb/lib/python3.8/site-packages/wandb/sdk/artifacts/artifact.py) and added sleep and retry logic there, together with some print statements to make sure everything was executed, but no success either.
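The retry logic I patched in was roughly like this (a simplified sketch of what I tried, not the library’s code; the artifact argument is whatever object reaches add_dir):

```python
import os
import time

def add_dir_with_retry(artifact, local_path, retries=5, wait_s=2.0):
    # Poll for the checkpoint directory before handing it to
    # Artifact.add_dir(), instead of failing on the first miss.
    for _ in range(retries):
        if os.path.isdir(local_path):
            return artifact.add_dir(local_path)
        time.sleep(wait_s)
    raise ValueError("Path is not a directory: {}".format(local_path))
```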

Unrelated to what you asked, but another test I ran recently was switching to the CPU build of PyTorch; that changed nothing either.

Again, thanks for the responses, and no worries about the waiting time. I’m already glad anyone is giving me a hand with this.

Thank you for confirming this, @pdomingues. We’ll review this further and get back to you with an update.