How to re-enable automatic synchronisation

I made a change to my script and now I have to synchronise my runs manually. My script contains:

if args.dry_run:
    os.environ['WANDB_MODE'] = 'dryrun'

wandb.init(project=args.project_name, notes=args.notes)

# log all experimental args to wandb

The change I made was the first line, setting WANDB_MODE=dryrun. From that point on I cannot re-enable automatic synchronisation.

I’ve run wandb online and run my script with dry_run=False. I also realised that this doesn’t unset WANDB_MODE, so I tried setting it to ‘online’ when dry_run is False. But it always ends up logging to wandb/offline-run-* and I have to sync it manually.
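
A minimal sketch of setting the mode explicitly in both branches, so that a stale WANDB_MODE value cannot keep the run offline. The helper name resolve_wandb_mode is hypothetical, and the sketch assumes that "offline" is the current name for the legacy "dryrun" mode and that wandb.init accepts a mode argument. It is an illustration, not the original script:

```python
import os


def resolve_wandb_mode(dry_run: bool) -> str:
    """Return the W&B mode for this run and export it via WANDB_MODE."""
    # Assumption: "offline" is the current name for the legacy "dryrun" mode.
    mode = "offline" if dry_run else "online"
    # Assign in both branches so a stale value inherited from the shell
    # (or set earlier in the process) cannot leave the run offline.
    os.environ["WANDB_MODE"] = mode
    return mode


# Simulate a stale setting left over from an earlier dry run.
os.environ["WANDB_MODE"] = "dryrun"
mode = resolve_wandb_mode(dry_run=False)

# The resolved mode could then also be passed on explicitly, e.g.
# wandb.init(project=args.project_name, notes=args.notes, mode=mode)
```

Note that this only helps if the flag itself has the value you expect; if dry_run is always true, the run stays offline regardless.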

Is there another step to re-enable syncing?

I fixed the original problem: I had a bug in my argument parser that meant that dry_run was always true.
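
The post does not say what the parser bug was. Purely as an illustration, one common way for a boolean flag to end up always true with argparse is declaring it with type=bool, because any non-empty string (including "False") converts to True. The argument name --dry_run_buggy below is hypothetical and exists only to show the contrast:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Pitfall: bool("False") is True, so any non-empty value enables the flag.
    parser.add_argument("--dry_run_buggy", type=bool, default=False)
    # Safer: a store_true flag stays False unless --dry_run is present.
    parser.add_argument("--dry_run", action="store_true")
    return parser


# Pass an explicit argument list so the example does not read sys.argv.
args = build_parser().parse_args(["--dry_run_buggy", "False"])
print(args.dry_run_buggy, args.dry_run)
```

Printing the parsed flags right after parsing is a cheap way to catch this kind of problem before wandb.init is called.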

However, I still cannot get all my metrics synchronised. I’m using the Huggingface trainer: it first trains the models and logs the training metrics, then performs a separate evaluation step.

When I look at the local wandb-summary.json, it contains all the metrics, including eval/loss, but none of the evaluation metrics are available on the cloud.

Also, if I look at the local files/output.log, it’s different to the cloud version: the local file is complete, whereas the cloud version is truncated partway through the evaluation.

But if I run wandb sync --sync-all it says “nothing to sync”, despite there being a clear difference. It seems to think the run ended earlier than it actually did. Perhaps there’s a bug in the Huggingface integration that causes it to mark the run complete after training but before evaluation?

For example, this is the log on the cloud:

09/05/2022 14:01:29 - INFO - __main__ -   *** Evaluate ***
***** Running Evaluation *****
  Num examples = 3820356
  Batch size = 1024
██████████████████████████████████████████████████████████████████████████████████████████▎ | 3665/3731 [11:10<00:13,  4.92it/s]

Compared to the local version:

100%|███████████████████████████████████████████████████████████████████████████████████████████▉| 3728/3731 [11:23<00:00,  5.47it/s]
***** eval metrics *****
  epoch                   =        1.0
  eval_loss               =     4.4664
  eval_runtime            = 0:11:23.99
  eval_samples            =    3820356
  eval_samples_per_second =   5585.376
  eval_steps_per_second   =      5.455
100%|████████████████████████████████████████████████████████████████████████████████████████████| 3731/3731 [11:23<00:00,  5.46it/s]
09/05/2022 14:12:54 - INFO - __main__ -   *** Train Finished ***


$ wandb sync --sync-all
wandb: ERROR Nothing to sync.

I’ve found a way around this - I’m not really sure why it’s happening but I noticed that the huggingface trainer logs the metrics at the end of training as follows:

                if not args.load_best_model_at_end
                else {
                    f"eval/{args.metric_for_best_model}": state.best_metric,
                    "train/total_floss": state.total_flos,

This means it logs the validation loss, but only if you train with load_best_model_at_end=True, set save_strategy equal to evaluation_strategy (epoch or steps), and set save_steps equal to eval_steps.

Doing this means I didn’t need to perform the separate eval step since it’s already logged the evaluation loss from the best model during training.
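
For anyone wanting to reproduce the workaround, the settings described above can be sketched as a set of keyword arguments for transformers.TrainingArguments. The output directory and the step count are hypothetical placeholders, and the argument names are those of the transformers release current at the time of this thread (I believe newer releases rename evaluation_strategy to eval_strategy, so check your installed version):

```python
# Keyword arguments matching the workaround; all concrete values are placeholders.
training_kwargs = {
    "output_dir": "outputs",             # hypothetical path
    "report_to": "wandb",                # send metrics to Weights & Biases
    "load_best_model_at_end": True,      # needed for the eval metric to be logged at the end
    "metric_for_best_model": "loss",     # logged under eval/<metric_for_best_model>
    "evaluation_strategy": "steps",      # must match save_strategy
    "save_strategy": "steps",
    "eval_steps": 500,                   # hypothetical value
    "save_steps": 500,                   # kept equal to eval_steps
}

# Sanity checks mirroring the conditions described in the post.
strategies_match = (
    training_kwargs["save_strategy"] == training_kwargs["evaluation_strategy"]
)
steps_match = training_kwargs["save_steps"] == training_kwargs["eval_steps"]

# With transformers installed, these would be used as:
# from transformers import TrainingArguments
# training_args = TrainingArguments(**training_kwargs)
```

Keeping the values in a plain dictionary makes it easy to log the same settings to wandb alongside the other experimental arguments.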

Hi @david-waterworth, thank you for reporting this. Has this now been resolved for you with the arguments you mentioned in your last message? Regarding the initial post, it seems that the dryrun mode was due to your argument parser. Does this mean you are now running in online mode? That would explain why syncing with --sync-all outputs the message “Nothing to sync”.

Hi, yes, it’s working now, thanks.

Hi @david-waterworth, thank you for confirming this, and glad the issue is now resolved for you! I will close the ticket for now, but please feel free to re-open it by posting any further questions related to this issue here, and we will be happy to keep investigating.
