I am running a lot of runs per day (sometimes ~2K) and have been encountering some strange errors in a handful of them. I am doing this on a large computation cluster, so to avoid putting too much strain on the network, for every run I:
- Set wandb to run offline (`export WANDB_MODE="offline"`)
- Set `WANDB_DIR` to be a tmp directory (`export WANDB_DIR=$(mktemp -d)`)
- Run my run as normal (runs are relatively short, often taking ~2-20 minutes)
- Sync my wandb runs (`wandb sync $WANDB_DIR/wandb/offline*`)
- Clean up my tmpdir (`rm -rf $WANDB_DIR`)
The full script is below:
    my_config= # some config unique to this run
    export WANDB_MODE="offline"
    export WANDB_DIR=$(mktemp -d)
    python train.py --config $my_config
    wandb sync $WANDB_DIR/wandb/offline*
    rm -rf $WANDB_DIR
In 99% of runs this works totally fine; in a handful, however, I get messages like:
    Syncing: https://wandb.ai/some_run ... wandb: WARNING .wandb file is incomplete (invalid padding), be sure to sync this run again once it's finished done.
If I actually look at `some_run`, it seems totally normal and I don't see any missing data. Furthermore, the `wandb sync` command returns exit code 0, so I would assume all is well despite the warning. But the existence of the warning is concerning, and I am not sure of the best way to deal with it, or whether it needs to be dealt with at all. I am grateful for any advice people have!
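One possible stopgap, sketched under assumptions: since the exit code is 0 either way, the wrapper script could scan the sync output for the warning text and keep the temporary directory whenever it appears, instead of deleting it unconditionally. `sync_checked` is a hypothetical helper name, and the matched string is copied from the warning quoted above. Whether re-syncing later actually clears the warning is an assumption, not something I have confirmed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a sync command, echo its output, and return
# non-zero if the command fails OR its output contains the
# "file is incomplete" warning (which otherwise comes with exit code 0).
sync_checked() {
    local out status
    out=$("$@" 2>&1)      # capture stdout and stderr of the sync command
    status=$?
    printf '%s\n' "$out"  # still show the output in the job log
    if [ "$status" -ne 0 ]; then
        return "$status"
    fi
    if printf '%s\n' "$out" | grep -q "file is incomplete"; then
        return 2          # arbitrary code chosen here to mean "warning seen"
    fi
    return 0
}

# Intended use in the script above (left commented out in this sketch):
# if sync_checked wandb sync "$WANDB_DIR"/wandb/offline*; then
#     rm -rf "$WANDB_DIR"
# else
#     echo "sync looked incomplete, keeping $WANDB_DIR" >&2
# fi
```

With this, the tmpdir is only removed after a sync that both exits with 0 and prints no warning, so a flagged run can still be re-synced by hand.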