I run wandb on a cluster. The job nodes don’t have internet access, so I need to run wandb sync --sync-all
from the login nodes.
If I kill a job, the command keeps syncing the killed runs over and over again: I run the command, the run is synced, I re-run the command later (to sync new runs), and the old runs are synced again.
I suspect it’s because of how the job was killed, and that maybe wandb.finish()
was never called. Could that be it?
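To make that suspicion concrete, here is a minimal sketch of what I imagine adding to the training script so that wandb.finish() still runs when the job is killed. It assumes the scheduler stops the job with SIGTERM (or SIGINT); that is an assumption about my cluster, and a SIGKILL cannot be caught at all. finish_on_signal is a hypothetical helper name of mine, not part of wandb.

```python
import signal
import sys


def finish_on_signal(finish_fn, signums=(signal.SIGTERM, signal.SIGINT)):
    """Register a handler that calls finish_fn and then exits the process.

    finish_fn is any zero-argument callable; in the training script it
    would be wandb.finish. Must be called from the main thread, because
    Python only allows signal handlers to be installed there.
    """
    def handler(signum, frame):
        # Close the run cleanly before the process goes away.
        finish_fn()
        # Exit with the conventional 128 + signal number status.
        sys.exit(128 + signum)

    for signum in signums:
        signal.signal(signum, handler)
    return handler


# Intended use in the training script (assumes wandb is installed):
#   import wandb
#   wandb.init(project="myproj", mode="offline")
#   finish_on_signal(wandb.finish)
```

I have not confirmed that a missing wandb.finish() is what causes the repeated syncing; this only shows the kind of cleanup I have in mind.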
I then tried to sync only the runs belonging to one project with wandb sync -p myproj
but it just says
wandb: Number of runs to be synced: 46
wandb: Showing 5 runs to be synced:
wandb: wandb/offline-run-20230523_123938
... [list of the 5 runs]
wandb: NOTE: use wandb sync --clean to delete 81 synced runs from local directory.
wandb: NOTE: use wandb sync --sync-all to sync 46 unsynced runs from local directory.
But the 5 runs are not actually synced.
I also tried cleaning with wandb sync --clean
but nothing happens, even though it says there are 81 synced runs that can be deleted.