What is the correct way to resume a paused or crashed run?

Hi, I am new to using WandB. I have my project set up with TensorFlow and am logging to WandB by syncing my TensorBoard: wandb.init(project='my-project', sync_tensorboard=True).

Sometimes the run crashes, or I have to pause it to retrieve certain artifacts. When the run restarts, how do I ensure that it is logged as a continuation of the previous run rather than as a new run in WandB? The step counters also seem to reset when this happens, even though they are accurate in TensorBoard.

Hi @amnikhil, thanks for writing in! Here you can have a look at our docs about resuming runs, but basically you need to pass the id and resume arguments when calling the init function: wandb.init(id=run_id, resume="must"), where run_id is the ID of the run you want to continue. Please let me know if this is useful for you!
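
For anyone landing here later, here is a minimal sketch of one way to keep the same run ID across restarts so that wandb.init can resume the run. The helper names (load_or_create_run_id, start_or_resume_run) and the wandb_run_id.txt file are hypothetical and only for illustration; the wandb.init arguments (project, id, resume, sync_tensorboard) are the ones mentioned in this thread.

```python
import uuid
from pathlib import Path


def load_or_create_run_id(path="wandb_run_id.txt"):
    """Return the run ID stored at `path`, creating and saving one on first use.

    The file name is an assumption for this sketch; any location that
    survives a crash or restart would do.
    """
    id_file = Path(path)
    if id_file.exists():
        return id_file.read_text().strip()
    run_id = uuid.uuid4().hex[:8]  # short alphanumeric ID
    id_file.write_text(run_id)
    return run_id


def start_or_resume_run(project, id_path="wandb_run_id.txt"):
    """Start a run on the first launch and resume the same run afterwards."""
    import wandb  # imported here so the helper above works without wandb installed

    run_id = load_or_create_run_id(id_path)
    # resume="allow" resumes the run if this ID already exists and creates it otherwise.
    # resume="must" (as in the reply above) errors out if the run does not exist.
    return wandb.init(
        project=project,
        id=run_id,
        resume="allow",
        sync_tensorboard=True,
    )
```

Calling start_or_resume_run('my-project') at the top of the training script should then attach to the same run after a crash or pause, as long as the ID file is still there. Delete the file when you want a fresh run.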


Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi there, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!
