Sync problems

parisi · May 23, 2023, 6:54pm

I run wandb on a cluster. The job nodes don’t have internet access, so I need to run wandb sync --sync-all from the login nodes.
If I kill a job, the command keeps syncing the killed runs over and over again: I run the command, the run is synced, I re-run the command (to sync new runs), and the old runs are synced again.
I suspect it’s because of how the job was killed, and that maybe wandb.finish() was not called. Could that be it?

I then tried to selectively sync only the runs belonging to one project with wandb sync -p myproj, but it just says

wandb: Number of runs to be synced: 46
wandb: Showing 5 runs to be synced:
wandb:   wandb/offline-run-20230523_123938
... [list of the 5 runs]
wandb: NOTE: use wandb sync --clean to delete 81 synced runs from local directory.
wandb: NOTE: use wandb sync --sync-all to sync 46 unsynced runs from local directory.

But the 5 runs are not actually synced.
I also tried cleaning with wandb sync --clean, but nothing happens, even though it says that there are 81 runs that can be deleted.

luis_bergua · May 26, 2023, 3:16pm

Hi @parisi, thanks for reporting this! I’ll try to reproduce this on my end to see exactly what’s going on here. To make sure, are these the steps you’re following?

1. Execute some offline runs (let’s say 10, for example)
2. Run wandb sync --sync-all
3. Kill that before all runs are synced
4. Re-run wandb sync --sync-all
5. Runs synced initially are synced again (so they are synced twice)

In general, I would recommend both assigning the run to a variable, like run = wandb.init(), and calling run.finish() at the end to ensure no processes are left running in the background.

parisi · May 26, 2023, 4:28pm

Thanks. I have a try / except that calls finish(), and it works if I kill my run with CTRL+C or if it crashes. However, if I close the terminal, my run still appears as “running” in the wandb UI.
I guess this is also what happens if I kill SLURM jobs with scancel.

luis_bergua · June 2, 2023, 1:00pm

Thanks for sharing this, @parisi. This is probably because when you close the terminal, finish() hasn’t been called yet, so the run will keep running. Would it work for you to use try / except and kill the run with CTRL+C if needed?

parisi · June 2, 2023, 10:14pm

My code already has try/except, and I think they work fine with CTRL+C but maybe not when I use SLURM scancel. I will investigate this more.
Isn’t there a way to detect “dead” runs? Like passing a timeout argument to init such that if the run is not synced after X time it will be considered finished?

EDIT: Yes, my code can catch CTRL+C without problems, but scancel may send a different signal.
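
For anyone hitting the same thing: CTRL+C reaches Python as KeyboardInterrupt, which a try / except can catch, whereas closing a terminal normally delivers SIGHUP, and neither that nor SIGTERM raises a Python exception by default. Below is a minimal sketch of one way to handle those signals, assuming scancel delivers SIGTERM before any SIGKILL (worth checking against your cluster’s Slurm configuration; SIGKILL itself can never be caught). The helper name install_finish_handlers is hypothetical and not part of wandb.

```python
import signal
import sys


def install_finish_handlers(finish, signal_names=("SIGTERM", "SIGHUP")):
    """Hypothetical helper: run `finish` and exit when one of the signals arrives.

    `finish` is any zero-argument callable, for example run.finish for a
    run object returned by wandb.init().
    """
    def handler(signum, frame):
        finish()
        # Exit with the conventional "128 + signal number" status.
        sys.exit(128 + signum)

    installed = []
    for name in signal_names:
        sig = getattr(signal, name, None)  # SIGHUP does not exist on Windows
        if sig is not None:
            signal.signal(sig, handler)
            installed.append(sig)
    return installed


# Usage (assumes wandb is installed; call from the main thread):
#   run = wandb.init()
#   install_finish_handlers(run.finish)
```

Whether this also stops wandb sync --sync-all from re-syncing killed runs is not confirmed anywhere in this thread; the sketch only addresses finish() not being called on SIGTERM or SIGHUP.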

luis_bergua · June 7, 2023, 12:58pm

Hi @parisi, thanks for your answer! Currently this isn’t possible, but I can create a feature request to add that argument to the wandb.init() function. Would you mind giving me some details about your use case so I can share them with our Product Team?

parisi · June 7, 2023, 7:59pm

Sure, thanks!
This argument would fix a potential problem when the machine where the code is running crashes and finish() is never called. In that case, the server sees the run as still “running”. Also, if one restarts the machine and runs wandb sync --sync-all, it keeps resyncing the dead run. This is a waste of time.

luis_bergua · June 9, 2023, 12:05pm

Thanks @parisi! I just shared this feedback with our Product Team.

system · August 8, 2023, 12:06pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
