What happens if the code crashes in the middle and there was no time to fo a .finish?

I use DDP a lot and was worried something bad might happen with wandb if my code crashes in the middle.

What happens if the code crashes in the middle? Would there be further processing I need to do to make sure my computer, experiment, resources, account etc are ok?

related: DDP example is not calling .finish in either log_all nor log with lead worker (rank0) · Issue #88 · wandb/examples · GitHub

As per my understanding, wandb runs in a seperate process altogether from training, so even if you training gets crashed due to some reason your wandb process will log this this as well on the dashboard.

A wandb run can be in any one of the stages: running, finished, crashed.

2 Likes

In my experience, it’s best to use a context manager such as with() or just define your run variable within a function and have the run.finish() within the function as well.

For example,

def train_model(...):
    run = wandb.init(....)
    ....
    run.finish()

Also, after a particular run crashes, wandb logs all the prior information anyways and you can always resume the run.

2 Likes

what is the pro vs cons of doing wandb.run and wandb.finish instead of creating a run object?

1 Like

wandb.run and wandb.finish refer to global state – the "current wandb.Run". This can be confusing and can introduce long-range dependencies in your code, especially when you’re also doing multiprocessing of your own, e.g. DDP.

So I prefer to be more explicit and to use actual wandb.Run objects, just as I prefer to use actual Figures and Axes in matplotlib, as opposed to relying on the .gcf/.gca magic.

2 Likes

Yep, this is the solution if the right way to resolve the crash is to restart the experiment and keep going. But note that you’ll need to be able to restore the state of your model + training setup, which can be tricky.

You can also just sync the log information, without restarting a run, with wandb sync. This is useful in the case that the wandb backend process doesn’t finish syncing before it is killed, e.g. by the OS, by another Ctrl+C.

2 Likes