Hello all!
I am launching a lot of runs per day (~2K sometimes) and have been encountering some strange errors in a handful of them. I am doing this on a large computation cluster, so to avoid putting too much strain on the network, for every run I:
- Set wandb to run offline (export WANDB_MODE="offline")
- Set WANDB_DIR to be a tmp directory (WANDB_DIR=$(mktemp -d))
- Run my run as normal (runs are relatively short, often taking ~2-20 minutes)
- Sync my wandb runs (wandb sync $WANDB_DIR/wandb/offline*)
- Clean up my tmpdir (rm -rf $WANDB_DIR)
The full script is below:
my_config= # some config unique to this run
export WANDB_MODE="offline"
export WANDB_DIR="$(mktemp -d)"
python train.py --config "$my_config"
wandb sync "$WANDB_DIR"/wandb/offline*
rm -rf "$WANDB_DIR"
In 99% of runs this works totally fine; however, in a handful I get messages like:
Syncing: https://wandb.ai/some_run ... wandb: WARNING .wandb file is incomplete (invalid padding), be sure to sync this run again once it's finished
done.
If I actually look at some_run, it seems totally normal and I don’t see any missing data. Furthermore, the wandb sync command returns exit code 0, so I would assume all is well despite the warning message. But the existence of the message is concerning, and I am not sure of the best way to deal with it, or whether it needs to be dealt with at all. I am grateful for any advice people have!
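
For reference, here is a minimal sketch of how the warning could at least be detected, given that the exit code alone does not reveal it. It assumes the warning text always contains "file is incomplete" (taken from the message above) and that re-running wandb sync on the same directory is safe. The helper name sync_checked is hypothetical and not part of wandb. It does not explain the cause; it only keeps the tmp directory around when the warning persists.

```shell
# Hypothetical helper (not part of wandb): run a command up to N times and
# treat the "file is incomplete" warning as a failure even if it exits 0.
# The body runs in a subshell, so its variables do not leak to the caller.
sync_checked() (
    attempts="$1"   # maximum number of tries
    delay="$2"      # seconds to wait between tries
    shift 2         # the remaining arguments are the command to run
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Capture stdout and stderr together; the warning may be on either.
        if out="$("$@" 2>&1)"; then
            printf '%s\n' "$out"
            if ! printf '%s\n' "$out" | grep -q "file is incomplete"; then
                exit 0   # clean sync: the function returns 0
            fi
        else
            printf '%s\n' "$out"
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    exit 1   # still warning (or failing) after all tries
)

# Possible use in the script above (the retry count and delay are guesses):
#   if sync_checked 3 5 wandb sync "$WANDB_DIR"/wandb/offline*; then
#       rm -rf "$WANDB_DIR"
#   else
#       echo "sync still incomplete, keeping $WANDB_DIR" >&2
#   fi
```

With this, the tmp directory is only deleted after a sync that printed no such warning, so an affected run can still be inspected or synced again later.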