I made a change to my script and now I have to manually synchronise my runs. My script contains:
os.environ['WANDB_MODE'] = 'dryrun'
# log all experimental args to wandb
The change I made was the first line, setting
WANDB_MODE=dryrun. From that point on I have not been able to re-enable automatic synchronisation.
I have since run wandb online and re-run my script with
dryrun=False. I also realised that this doesn't unset WANDB_MODE, so I tried setting it to 'online' when
dryrun==False. But the run always ends up logging to
wandb/offline-run-* and I have to manually sync it.
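For context, this is roughly how I'm toggling the mode (a simplified sketch; `dry_run` stands in for my actual argument):

```python
import os

def configure_wandb_mode(dry_run: bool) -> None:
    # WANDB_MODE must be set before `import wandb` / `wandb.init()`
    # runs for it to take effect.
    if dry_run:
        os.environ["WANDB_MODE"] = "dryrun"  # legacy spelling of "offline"
    else:
        # Overwrite explicitly rather than assuming the variable is unset;
        # os.environ.pop("WANDB_MODE", None) would also work.
        os.environ["WANDB_MODE"] = "online"

configure_wandb_mode(dry_run=False)
```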
Is there another step to re-enable syncing?
I fixed the original problem: I had a bug in my argument parser that meant dry_run was always true.
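(For anyone else who hits this: my bug was the classic argparse boolean pitfall. The below is a reconstruction, not my exact code.)

```python
import argparse

# Buggy: type=bool applies bool() to the raw string, and any
# non-empty string -- including "False" -- is truthy.
buggy = argparse.ArgumentParser()
buggy.add_argument("--dry_run", type=bool, default=False)
args = buggy.parse_args(["--dry_run", "False"])
print(args.dry_run)  # True, so the script always ran in dryrun mode

# Fixed: make it a flag instead of parsing a string.
fixed = argparse.ArgumentParser()
fixed.add_argument("--dry_run", action="store_true")
print(fixed.parse_args([]).dry_run)  # False unless --dry_run is passed
```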
However, I still cannot get all my metrics synchronised. I'm using the Hugging Face trainer; it first trains the model and logs the training metrics, then performs a separate evaluation step.
When I look at the local
wandb-summary.json it contains all the metrics, including
eval/loss, but none of the evaluation metrics are available in the cloud.
Also, if I look at
files/output.log it's different from the cloud version - it's complete, whereas the cloud version is truncated part way through the evaluation.
But if I run
wandb sync --sync-all it says "Nothing to sync", despite there being a clear difference - it seems to think the run has already ended. Perhaps there's a bug in the Hugging Face integration that causes it to mark the run complete after training but before evaluation?
For example, this is the log on the cloud:
09/05/2022 14:01:29 - INFO - __main__ - *** Evaluate ***
***** Running Evaluation *****
Num examples = 3820356
Batch size = 1024
██████████████████████████████████████████████████████████████████████████████████████████▎ | 3665/3731 [11:10<00:13, 4.92it/s]
Compared to the local version:
100%|███████████████████████████████████████████████████████████████████████████████████████████▉| 3728/3731 [11:23<00:00, 5.47it/s]
***** eval metrics *****
epoch = 1.0
eval_loss = 4.4664
eval_runtime = 0:11:23.99
eval_samples = 3820356
eval_samples_per_second = 5585.376
eval_steps_per_second = 5.455
100%|████████████████████████████████████████████████████████████████████████████████████████████| 3731/3731 [11:23<00:00, 5.46it/s]
09/05/2022 14:12:54 - INFO - __main__ - *** Train Finished ***
$ wandb sync --sync-all
wandb: ERROR Nothing to sync.
I've found a way around this. I'm not really sure why it's happening, but I noticed that the Hugging Face trainer's metric logging at the end of training is guarded by a check along the lines of:
if not args.load_best_model_at_end
Meaning it logs the validation loss, but only if you train with
load_best_model_at_end=True and set
save_strategy equal to evaluation_strategy (either epoch or steps).
Doing this means I didn't need to perform the separate eval step, since the evaluation loss from the best model is already logged during training.
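Concretely, the training arguments that worked for me look something like this (a sketch with other arguments omitted; `evaluation_strategy` was the option name in the transformers version I was using):

```python
from transformers import TrainingArguments

# With load_best_model_at_end=True, save_strategy and
# evaluation_strategy must match ("epoch" here; "steps" also works).
training_args = TrainingArguments(
    output_dir="out",
    load_best_model_at_end=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)
```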
Hi @david-waterworth, thank you for reporting this. Has this now been resolved for you with the arguments you mentioned in your last message? Regarding the initial post, it seems that the
dryrun mode was due to your argument parser. Does this mean you are now running in
online mode? That would explain why syncing with
--sync-all would output the message
"nothing to sync".
Hi, yes it's working now, thanks.
Hi @david-waterworth, thank you for confirming this - glad the issue is now resolved for you! I will close the ticket for now, but please feel free to re-open it by posting any further questions related to this issue here, and we will be happy to keep investigating.