I run wandb on a cluster. Job nodes don’t have internet so I need to run wandb sync --sync-all from login nodes.
If I kill the job, the command keeps syncing the killed runs over and over again: I run the command, the run is synced, I re-run the command (to sync new runs) and the old runs are synced again.
I suspect it’s because of how the job was killed, and that maybe wandb.finish() was not called. Could that be it?
I then tried to selectively sync only the runs belonging to a project with wandb sync -p myproj, but it just says
wandb: Number of runs to be synced: 46
wandb: Showing 5 runs to be synced:
wandb: wandb/offline-run-20230523_123938
... [list of the 5 runs]
wandb: NOTE: use wandb sync --clean to delete 81 synced runs from local directory.
wandb: NOTE: use wandb sync --sync-all to sync 46 unsynced runs from local directory.
But the 5 runs are not actually synced.
I also tried cleaning with wandb sync --clean, but nothing happens, even though it says that there are 81 runs that can be deleted.
Hi @parisi, thanks for reporting this! I’ll try to reproduce this on my end to see exactly what’s going on here. Just to make sure, the steps you’re following are:
Execute some offline runs (let’s say 10 for example)
Run wandb sync --sync-all
Kill that before all runs are synced
Re-run wandb sync --sync-all
Runs synced initially are synced again (so they are synced twice)
In general, I would recommend both assigning the run to a variable, like run = wandb.init(), and calling run.finish() at the end to ensure no processes are left running in the background.
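A minimal sketch of that recommendation, in case it helps. The helper name run_training and the logged metric are made up for illustration; only wandb.init(), run.log() and run.finish() come from wandb itself. The try / finally makes sure finish() is called whether the loop completes, raises, or is interrupted with CTRL+C.

```python
def run_training(run, steps):
    """Log one value per step and always finish the run afterwards.

    `run` is the object returned by wandb.init(); the loop body is a
    placeholder for real training code.
    """
    try:
        for step in range(steps):
            run.log({"step": step})
    finally:
        # Executed on normal exit, on exceptions, and on KeyboardInterrupt.
        run.finish()


def main():
    # Imported here so the helper above can be reused without wandb installed.
    import wandb

    # mode="offline" matches the no-internet setup described in this thread.
    run = wandb.init(project="myproj", mode="offline")
    run_training(run, steps=10)
```

Calling main() from the job script starts an offline run that is marked as finished when the loop ends or fails. It does not help if the process is killed by a signal Python does not turn into an exception.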
Thanks. I have a try / except to call finish(), and it works if I kill my run with CTRL+C or if it crashes. However, if I close the terminal, my run still appears as “running” in the wandb UI.
I guess this is also what happens if I kill SLURM jobs with scancel.
Thanks for sharing this @parisi. This is probably because when you close the terminal, finish() hasn’t been called yet, so the run will keep running. Would it work for you to use try / except and kill the run with CTRL+C if needed?
My code already has try/except, and I think they work fine with CTRL+C but maybe not when I use SLURM scancel. I will investigate this more.
Isn’t there a way to detect “dead” runs? Like passing a timeout argument to init such that if the run is not synced after X time it will be considered finished?
EDIT: Yes, my code can catch CTRL+C without problems, but scancel maybe sends a different signal.
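One possible workaround, sketched under assumptions: CTRL+C arrives as SIGINT, which Python turns into KeyboardInterrupt, while closing a terminal usually sends SIGHUP, and scancel is assumed here to send SIGTERM before escalating to SIGKILL (worth checking against your cluster's SLURM configuration). The sketch below re-raises those signals as KeyboardInterrupt so the existing try / except path runs. The function names are hypothetical, and SIGKILL cannot be caught by any handler.

```python
import signal


def _raise_keyboard_interrupt(signum, frame):
    # Reuse the CTRL+C code path for other termination signals.
    raise KeyboardInterrupt(f"received {signal.Signals(signum).name}")


def install_termination_handlers():
    """Route SIGTERM (and SIGHUP where available) to KeyboardInterrupt."""
    signals = [signal.SIGTERM]
    if hasattr(signal, "SIGHUP"):  # not defined on Windows
        signals.append(signal.SIGHUP)
    for sig in signals:
        signal.signal(sig, _raise_keyboard_interrupt)


def run_with_cleanup(work, cleanup):
    """Call work(), then cleanup(), even if a termination signal arrives."""
    install_termination_handlers()
    try:
        work()
    except KeyboardInterrupt:
        print("interrupted, cleaning up")
    finally:
        cleanup()
```

In a training script, work would be the training function and cleanup would be run.finish. Handlers must be installed from the main thread, and the cleanup has to complete before the scheduler's grace period ends.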
Hi @parisi, thanks for your answer! Currently this isn’t possible but I can create a feature request to add that argument to the wandb.init() function, would you mind giving me some details about your use case so I can share that with our Product Team?
Sure, thanks!
Having this argument would fix a potential problem when the machine where the code is running crashes and finish() is never called. When finish() is not called, the server sees the run as still “running”. Also, if one restarts the machine and runs wandb sync --sync-all, it keeps resyncing the dead run, which is a waste of time.