What happens if the code crashes in the middle and there was no time to do a .finish?

brando · September 10, 2021, 7:47pm

I use DDP a lot and was worried something bad might happen with wandb if my code crashes in the middle.

What happens if the code crashes in the middle? Is there further processing I need to do to make sure my computer, experiment, resources, account, etc. are OK?

related: DDP example is not calling .finish in either log_all nor log with lead worker (rank0) · Issue #88 · wandb/examples · GitHub

harveenchadha · September 11, 2021, 5:40pm

As per my understanding, wandb runs in a separate process altogether from training, so even if your training crashes for some reason, the wandb process will record the crash on the dashboard as well.

A wandb run can be in any one of these states: running, finished, crashed.

sauravmaheshkar · September 13, 2021, 5:50am

In my experience, it’s best to use a context manager (a with statement), or just define your run variable within a function and call run.finish() within that function as well.

For example,

def train_model(...):
    run = wandb.init(...)  # create the run inside the function
    ...                    # training and logging
    run.finish()           # close the run before returning

Also, after a particular run crashes, wandb keeps everything that was logged before the crash anyway, and you can always resume the run.
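To make the function-scoped pattern above a bit more crash-tolerant, here is a minimal sketch that wraps the training loop in try/finally so run.finish() is called even when an exception propagates. The config dict, the project name, the placeholder loss, and the injectable init_run argument are all assumptions added for illustration; init_run only exists so the sketch can be exercised with a stand-in object instead of a real W&B run.

```python
def train_model(config, init_run=None):
    """Train for config["steps"] steps and always close the run."""
    if init_run is None:
        import wandb  # imported lazily so a stand-in can be used in tests
        init_run = wandb.init

    # "ddp-crash-demo" is a hypothetical project name.
    run = init_run(project="ddp-crash-demo", config=config)
    try:
        for step in range(config["steps"]):
            loss = 1.0 / (step + 1)  # placeholder for a real training step
            run.log({"loss": loss, "step": step})
    finally:
        # Executed on normal return and when an exception propagates.
        run.finish()
```

This only covers Python-level exceptions. If the process is killed outright (e.g. SIGKILL from the OS), the finally block never runs. The with form (with wandb.init(...) as run:) should behave similarly for exceptions, as far as I know.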

brando · September 13, 2021, 4:51pm

What are the pros and cons of using wandb.run and wandb.finish instead of creating a run object?

charlesfrye · September 13, 2021, 5:44pm

wandb.run and wandb.finish refer to global state – the "current wandb.Run". This can be confusing and can introduce long-range dependencies in your code, especially when you’re also doing multiprocessing of your own, e.g. DDP.

So I prefer to be more explicit and to use actual wandb.Run objects, just as I prefer to use actual Figures and Axes in matplotlib, as opposed to relying on the .gcf/.gca magic.
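As a rough illustration of passing an explicit run object around in a multi-process setup, the sketch below creates a run only on rank 0 and hands it to a logging helper, rather than reading the global wandb.run. This is one possible arrangement, not the recommended W&B DDP recipe; init_run_for_rank, log_metrics, and the injectable init_run argument are hypothetical names introduced here.

```python
def init_run_for_rank(rank, init_run=None, **init_kwargs):
    """Return a run object on rank 0 and None on every other rank."""
    if rank != 0:
        return None
    if init_run is None:
        import wandb  # imported lazily so a stand-in can be used in tests
        init_run = wandb.init
    return init_run(**init_kwargs)


def log_metrics(run, metrics):
    """Log through the explicit run object; do nothing on ranks without one."""
    if run is not None:
        run.log(metrics)


def close_run(run):
    """Finish the run if this process owns one."""
    if run is not None:
        run.finish()
```

Because every function receives the run as an argument, it is obvious which process owns it, and nothing depends on whichever run happens to be "current" in the global state.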

charlesfrye · September 13, 2021, 5:48pm

Yep, resuming the run is the solution if the right way to resolve the crash is to restart the experiment and keep going. But note that you’ll need to be able to restore the state of your model + training setup, which can be tricky.

You can also just sync the log information, without restarting a run, with wandb sync. This is useful when the wandb backend process doesn’t finish syncing before it is killed, e.g. by the OS or by a second Ctrl+C.
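For the restart-and-keep-going case, a minimal sketch is to store the run id on disk and pass it back to wandb.init with resume="allow" after a crash. The id file, the helper names, and the injectable init_run argument are assumptions made for illustration. Restoring the model and optimizer state is a separate problem that this sketch does not address.

```python
from pathlib import Path


def load_or_create_run_id(id_file, new_id):
    """Reuse the run id stored in id_file, or store and return new_id."""
    path = Path(id_file)
    if path.exists():
        return path.read_text().strip()
    path.write_text(new_id)
    return new_id


def start_or_resume(id_file, new_id, init_run=None, **init_kwargs):
    """Start a run on first launch; reattach to the same run id afterwards."""
    if init_run is None:
        import wandb  # imported lazily so a stand-in can be used in tests
        init_run = wandb.init
    run_id = load_or_create_run_id(id_file, new_id)
    # resume="allow": continue the run with this id if it exists, else start it.
    return init_run(id=run_id, resume="allow", **init_kwargs)
```

On the first launch the id is written to the file. After a crash, relaunching the same script picks up the stored id, so new metrics are appended to the original run instead of creating a new one.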

