Sync problems

I run wandb on a cluster. Job nodes don’t have internet access, so I need to run wandb sync --sync-all from the login nodes.
If I kill the job, the command keeps syncing the killed runs over and over again: I run the command, the run is synced, I re-run the command (to sync new runs), and the old runs are synced again.
I suspect it’s because of how the job was killed, and that maybe wandb.finish() was not called. Could that be it?

I then tried to selectively sync only the runs belonging to one project with wandb sync -p myproj, but it just says:

wandb: Number of runs to be synced: 46
wandb: Showing 5 runs to be synced:
wandb:   wandb/offline-run-20230523_123938
... [list of the 5 runs]
wandb: NOTE: use wandb sync --clean to delete 81 synced runs from local directory.
wandb: NOTE: use wandb sync --sync-all to sync 46 unsynced runs from local directory.

But the 5 runs are not actually synced.
I also tried cleaning with wandb sync --clean, but nothing happens, even though it says there are 81 runs that can be deleted.

Hi @parisi, thanks for reporting this! I’ll try to reproduce this on my end to see exactly what’s going on. Just to make sure, are these the steps you’re following?

  1. Execute some offline runs (let’s say 10 for example)
  2. Run wandb sync --sync-all
  3. Kill that before all runs are synced
  4. Re-run wandb sync --sync-all
  5. Runs synced initially are synced again (so they are synced twice)

In general, I would recommend both assigning the run to a variable, like run = wandb.init(), and calling run.finish() at the end to ensure no processes are left running in the background.

Thanks. I have a try / except that calls finish(), and it works if I kill my run with CTRL+C or if it crashes. However, if I close the terminal, my run still appears as “running” in the wandb UI.
I guess this is also what happens if I kill SLURM jobs with scancel.
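
A possible explanation is that CTRL+C arrives as KeyboardInterrupt, which a try / except can see, while closing the terminal usually delivers SIGHUP and a termination request usually delivers SIGTERM, both of which end the Python process by default without unwinding try blocks. Below is a minimal sketch of one way to route those signals through the same cleanup path. It assumes the signal actually reaches the Python process (SIGKILL can never be caught), and the names install_termination_handlers, run_with_cleanup, work and cleanup are hypothetical helpers, not part of wandb.

```python
import signal
import sys


def _raise_system_exit(signum, frame):
    # Turn the signal into SystemExit so try/finally blocks unwind normally.
    sys.exit(128 + signum)


def install_termination_handlers():
    """Route SIGTERM (and SIGHUP where it exists) through SystemExit."""
    signal.signal(signal.SIGTERM, _raise_system_exit)
    if hasattr(signal, "SIGHUP"):  # SIGHUP is not defined on Windows
        signal.signal(signal.SIGHUP, _raise_system_exit)


def run_with_cleanup(work, cleanup):
    """Run work() and always call cleanup(), even on SIGTERM or SIGHUP.

    Hypothetical usage in a training script: pass the training loop as
    `work` and `run.finish` as `cleanup`, where run = wandb.init(...).
    """
    install_termination_handlers()
    try:
        work()
    finally:
        cleanup()
```

Handlers must be installed from the main thread, and the cleanup has to complete before any follow-up SIGKILL from the scheduler, so this is only a partial safeguard rather than a guarantee that the run is marked finished.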