I run wandb on a cluster. Job nodes don’t have internet so I need to run
wandb sync --sync-all from login nodes.
If I kill the job, the command keeps syncing the killed runs over and over again: I run the command, the run is synced, I re-run the command (to sync new runs), and the old runs are synced again.
I suspect it’s because of how the job was killed, and that maybe
wandb.finish() was not called. Could that be it?
I then tried to selectively sync only the runs belonging to one project with
wandb sync -p myproj but it just says
wandb: Number of runs to be synced: 46
wandb: Showing 5 runs to be synced:
... [list of the 5 runs]
wandb: NOTE: use wandb sync --clean to delete 81 synced runs from local directory.
wandb: NOTE: use wandb sync --sync-all to sync 46 unsynced runs from local directory.
But the 5 runs are not actually synced.
I also tried cleaning with
wandb sync --clean but nothing happens, even though it says that there are 81 runs that can be deleted.
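One way to check which local runs the CLI still considers unsynced is to look at the run directories directly. The sketch below is only an illustration under two assumptions that are worth verifying against your installed wandb version: that offline runs are stored in `wandb/offline-run-*` directories, and that a synced run has a `*.wandb.synced` marker file next to its `.wandb` file. The function name `unsynced_offline_runs` is made up for this example.

```python
from pathlib import Path


def unsynced_offline_runs(wandb_dir="wandb"):
    """Return offline run directories that have no '*.wandb.synced' marker.

    ASSUMPTION: offline runs live in '<wandb_dir>/offline-run-*' and a synced
    run has a '<name>.wandb.synced' file next to its '<name>.wandb' file.
    """
    root = Path(wandb_dir)
    pending = []
    for run_dir in sorted(root.glob("offline-run-*")):
        if not run_dir.is_dir():
            continue  # skip stray files that happen to match the pattern
        if not any(run_dir.glob("*.wandb.synced")):
            pending.append(run_dir)
    return pending
```

If a killed run never gets a marker after syncing, that would be consistent with it being picked up again on every `wandb sync --sync-all`. The returned paths can also be passed individually to `wandb sync <path>` instead of using `--sync-all`.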
Hi @parisi, thanks for reporting this! I’ll try to reproduce this on my end to see exactly what’s going on here. To make sure, the steps you’re following are:
- Execute some offline runs (let’s say 10 for example)
- wandb sync --sync-all
- Kill that before all runs are synced
- wandb sync --sync-all
- Runs synced initially are synced again (so they are synced twice)
In general, I would recommend both assigning the run to a variable, like
run = wandb.init(), and also calling
run.finish() at the end to ensure no processes are left running in the background.
Thanks. I have
try / except to call
finish(), and it works if I kill my run with CTRL+C or if it crashes. However, if I close the terminal, my run still appears as “running” in the wandb UI.
I guess this is also what happens if I kill SLURM jobs with
Thanks for sharing this @parisi. This is probably because when you close the terminal,
finish() hasn’t been called yet so the run will keep running. Would it work for you to use
try / except and kill the run with CTRL+C if needed?
My code already has try/except, and I think it works fine with CTRL+C, but maybe not when I use SLURM scancel. I will investigate this more.
Isn’t there a way to detect “dead” runs? Like passing a timeout argument to init such that if the run is not synced after X time it will be considered finished?
EDIT: Yes, my code can catch CTRL+C without problems, but scancel maybe sends a different signal.
Hi @parisi, thanks for your answer! Currently this isn’t possible, but I can create a feature request to add that argument to the wandb.init() function. Would you mind giving me some details about your use case so I can share them with our Product Team?
Having this argument would fix a potential problem when the machine where the code is running crashes and the
finish command is never called. If finish is never called, the server sees the run as still “running”. Also, if one restarts the machine and runs
wandb sync --sync-all it keeps re-syncing the dead run. This is a waste of time.
Thanks @parisi! I just shared this feedback with our Product Team.